Improving OCR for an under-resourced script using unsupervised word-spotting

Adi Silberpfennig, Lior Wolf, Nachum Dershowitz, Seraogi Bhagesh, Bidyut B. Chaudhuri

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

Optical character recognition (OCR) quality, especially for under-resourced scripts like Bangla, as well as for documents printed in old typefaces, is a major concern. An efficient and effective pipeline for OCR betterment is proposed here. The method is unsupervised. It employs a baseline OCR engine as a black box plus a dataset of unlabeled document images. That engine is applied to the images, followed by a visual encoding designed to support efficient word spotting. Given a new document to be analyzed, the black-box recognition engine is first applied. Then, for each result, word spotting is carried out within the dataset. The unreliable OCR outputs of the retrieved word spotting results are then considered. The word that is the centroid of the set of OCR words, measured by edit distance, is deemed a candidate reading.
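The final selection step described in the abstract can be illustrated with a minimal sketch. It assumes that "centroid" means the member of the retrieved set with the smallest total edit distance to the other members (a medoid), and that edit distance is plain unit-cost Levenshtein distance; the abstract specifies neither detail. The names `edit_distance` and `centroid_word` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def centroid_word(ocr_words: Sequence[str]) -> str:
    """Pick the OCR reading with the smallest summed edit distance to all readings.

    Assumption: "centroid" is interpreted as the set medoid. Ties go to the
    earliest reading in the input order.
    """
    if not ocr_words:
        raise ValueError("need at least one OCR reading")
    return min(
        ocr_words,
        key=lambda w: sum(edit_distance(w, other) for other in ocr_words),
    )


# Hypothetical OCR readings attached to word images retrieved by word spotting.
readings = ["recognition", "recogniton", "recognition", "rccognition"]
candidate = centroid_word(readings)
```

In this toy input the two erroneous readings each differ from the correct one by a single character, so the correct reading has the lowest total distance and is selected. The sketch compares Python code points; a script such as Bangla, where one visual character may span several code points, would probably need a different unit of comparison.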

Original language: English
Title of host publication: 13th IAPR International Conference on Document Analysis and Recognition, ICDAR 2015 - Conference Proceedings
Publisher: IEEE Computer Society
Pages: 706-710
Number of pages: 5
ISBN (Electronic): 9781479918058
DOIs
State: Published - 20 Nov 2015
Event: 13th International Conference on Document Analysis and Recognition, ICDAR 2015 - Nancy, France
Duration: 23 Aug 2015 → 26 Aug 2015

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Volume: 2015-November
ISSN (Print): 1520-5363

Conference

Conference: 13th International Conference on Document Analysis and Recognition, ICDAR 2015
Country/Territory: France
City: Nancy
Period: 23/08/15 → 26/08/15
