Relating Articles Textually and Visually

Nachum Dershowitz, Daniel Labenski, Adi Silberpfennig, Lior Wolf, Yaron Tsur

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Historical documents have been undergoing large-scale digitization in recent years, placing massive image collections online. Optical character recognition (OCR) often performs poorly on such material, which makes searching these resources problematic and textual analysis of the documents difficult. We present two approaches to overcome this obstacle, one textual and one visual. We show that, for tasks like finding newspaper articles related by topic, poor-quality OCR text suffices. An ordinary vector-space model is used to represent articles. Additional improvements are obtained by adding words with similar distributional representations. As an alternative to OCR-based methods, one can perform image-based search using word spotting. Synthetic images are generated for every word in a lexicon, and word spotting is used to compile vectors of their occurrences. Retrieval is by means of an ordinary nearest-neighbor search. The results of this visual approach are comparable to those obtained using noisy OCR. We report on experiments applying both methods, separately and together, to historical Hebrew newspapers, with their added problem of rich morphology.
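The textual pipeline summarized above (a vector-space representation of articles followed by nearest-neighbor retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the TF-IDF weighting with smoothed IDF, the cosine similarity measure, the function names, and the toy tokens (including the mock OCR error "harb0r") are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized article to a sparse TF-IDF vector (term -> weight)."""
    n = len(docs)
    # Document frequency: number of articles containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed weighting) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_articles(vectors, query, k=5):
    """Return indices of the k articles most similar to vectors[query]."""
    scored = [
        (cosine(vectors[query], vec), i)
        for i, vec in enumerate(vectors)
        if i != query
    ]
    # Highest similarity first; ties broken by article index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Hypothetical toy articles; "harb0r" stands in for a noisy OCR token.
articles = [
    ["harbor", "strike", "workers", "harb0r"],
    ["strike", "workers", "union", "wage"],
    ["theatre", "premiere", "actors"],
]
vectors = tfidf_vectors(articles)
print(related_articles(vectors, 0, k=2))
```

In this sketch a misrecognized token simply becomes an extra dimension that matches nothing, so articles sharing enough correctly recognized words still rank as neighbors, which is the intuition behind the claim that poor-quality OCR text can suffice for topical relatedness.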

Original language: English
Title of host publication: Proceedings - 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017
Publisher: IEEE Computer Society
Pages: 274-280
Number of pages: 7
ISBN (Electronic): 9781538635865
State: Published - 2 Jul 2017
Event: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017 - Kyoto, Japan
Duration: 9 Nov 2017 to 15 Nov 2017

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Volume: 1
ISSN (Print): 1520-5363

Conference

Conference: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017
Country/Territory: Japan
City: Kyoto
Period: 9/11/17 to 15/11/17
