We scan semi-structured documents with our OCR engine. It works and its accuracy is fairly good, but we want to extend the engine so that it can recognize certain paragraphs. This will help us improve the subsequent processing steps.
All documents we scan are machine-written and scanned at 300 dpi. Our engine is currently written in Python, and we would prefer to stay with Python, but Java is also acceptable.
The goal of the overall project is to develop a web-based API that receives documents, analyzes them, and sends back an output file.
The scope of this subproject is the extraction of text from machine-written documents that have already been scanned.
We have the following requirements:
• Recognize text in the following file types: PDF, JPEG/JPG, PNG, BMP, TIFF
• Accuracy: at least 97 %
• Processing time: less than 3 seconds per document
• Multithreading support to parallelize the text extraction
• Scalability to more than 1000 requests per minute
• Detect paragraphs and classify them
• Transparent and fast image preprocessing
• Save the result in a variable and in a .txt file
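To make the paragraph requirement more concrete, here is a minimal sketch of what we mean by detecting and classifying paragraphs. It assumes the OCR output keeps blank lines between paragraphs, and the categories and keywords in `KEYWORD_RULES` are purely hypothetical placeholders; the real classes still have to be agreed on. A layout-based approach working on the image itself may turn out to be more suitable.

```python
import re

# Hypothetical categories and keywords, for illustration only.
# Rules are checked in this order; the first match wins.
KEYWORD_RULES = (
    ("address", ("street", "zip", "p.o. box")),
    ("payment", ("invoice", "amount due", "iban")),
)


def split_paragraphs(text):
    """Split OCR output into paragraphs at one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    # Collapse line breaks and repeated spaces inside each paragraph.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def classify_paragraph(paragraph):
    """Return the first category with a matching keyword, else 'other'."""
    lowered = paragraph.lower()
    for label, keywords in KEYWORD_RULES:
        if any(keyword in lowered for keyword in keywords):
            return label
    return "other"


def detect_and_classify(text):
    """Return (label, paragraph) pairs in reading order."""
    return [(classify_paragraph(p), p) for p in split_paragraphs(text)]
```

Calling `detect_and_classify(ocr_text)` on the recognized text of one document would give a list of labeled paragraphs that the subsequent processing steps can work with.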
Target environment: Windows Server 2012, max. 1 GB RAM, Java 7 or Python 3.6
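As a rough illustration of the multithreading and output requirements on this environment, the sketch below processes several documents with a bounded thread pool and keeps each result both in a variable and in a .txt file. It uses only the Python 3.6 standard library. `extract_text` is a hypothetical stand-in for the actual OCR call, and `max_workers=4` is an assumed value that would need tuning against the 1 GB memory limit.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_text(path):
    """Hypothetical stand-in for the real OCR call; returns dummy text."""
    return "text extracted from {}".format(Path(path).name)


def process_document(path, out_dir):
    """Extract one document and store the result twice:
    in a variable (returned to the caller) and in a .txt file."""
    text = extract_text(path)
    out_path = Path(out_dir) / (Path(path).stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    return text, out_path


def process_batch(paths, out_dir, max_workers=4):
    """Process documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: process_document(p, out_dir), paths))
```

Threads only speed things up if the OCR step releases Python's global interpreter lock (for example inside a native library) or spends its time waiting on I/O. If the recognition runs as pure Python code, a process pool would be the alternative to consider, at the cost of more memory per worker.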