Arabic OCR
From the Arabeyes Wiki
Optical Character Recognition
OCR is the process of taking a scanned document (or a PDF file) and running a recognition program on it to generate an editable text file, based on optical recognition and approximation. For an introduction to OCR, see http://www.students.cs.uu.nl/people/mjkammer/Work/intro_2_OCR.html
Current Status of Open Source Arabic OCR software
The only FOSS OCR system with Arabic support is Tesseract. Help is needed in testing and training it.
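For anyone who wants to help with testing, one common way to score OCR output is the character error rate (CER): the edit distance between the recognized text and a hand-checked reference, divided by the length of the reference. The sketch below is only an illustration, not an official Arabeyes or Tesseract tool. The function names are made up for this example, and it compares raw Unicode code points, so it assumes both strings have already been normalized the same way (for instance, diacritics handled consistently).

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length (hypothetical helper)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A lower value is better: 0.0 means the OCR output matches the reference exactly, and values can exceed 1.0 when the output is much longer than the reference.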
Resources
Arabic OCR Links
Papers
- Automatic Recognition Using Zernike Moments As A Feature Extractor (Paper)
- Graph Based Segmentation .. (Paper)
- Structural Features Of Cursive Arabic Scripts (Paper)
- Multilingual Machine Printed OCR (Paper)
- Test of two Arabic OCR programs
- Performance Evaluation of two Arabic OCR products
Software (FOSS)
- Tesseract is an open source OCR engine, initially developed by HP and released under the Apache License. The 3.x versions have Arabic support.
- GOCR - included in Debian and other distributions. No Arabic support.
- GNU Ocrad "is an OCR [...] program based on a feature extraction method". No Arabic support.
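As a rough sketch of trying Tesseract on an Arabic page, the snippet below wraps its command-line interface, which takes an input image, an output base name, and a language selected with `-l`. It assumes the `tesseract` binary and its Arabic language data (`ara`) are installed; the function names and file names are hypothetical and chosen only for this example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, output_base: str, lang: str = "ara") -> list:
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_arabic_ocr(image_path: str, output_base: str) -> str:
    """Run Tesseract on one image and return the recognized text.

    Assumes Tesseract writes its result to <output_base>.txt, which is
    its default plain-text output.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, `run_arabic_ocr("page.png", "page")` would run the recognizer on a hypothetical `page.png` and read the text back from `page.txt`.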
Other Links
- How to encode image produced by a recognition system (mailing thread) http://lists.arabeyes.org/archives/general/2002/March/msg00001.html
- Rapidly Retargetable Translingual Detection http://tides.umiacs.umd.edu/description.html