TY - GEN
T1 - Relating Articles Textually and Visually
AU - Dershowitz, Nachum
AU - Labenski, Daniel
AU - Silberpfennig, Adi
AU - Wolf, Lior
AU - Tsur, Yaron
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
N2 - Historical documents have undergone large-scale digitization in recent years, placing massive image collections online. Optical character recognition (OCR) often performs poorly on such material, which makes searching within these resources problematic and textual analysis of such documents difficult. We present two approaches to overcome this obstacle, one textual and one visual. We show that, for tasks like finding newspaper articles related by topic, poor-quality OCR text suffices. An ordinary vector-space model is used to represent articles. Additional improvements are obtained by adding words with similar distributional representations. As an alternative to OCR-based methods, one can perform image-based search using word spotting. Synthetic images are generated for every word in a lexicon, and word spotting is used to compile vectors of their occurrences. Retrieval is by means of a standard nearest-neighbor search. The results of this visual approach are comparable to those obtained using noisy OCR. We report on experiments applying both methods, separately and together, to historical Hebrew newspapers, with their added problem of rich morphology.
UR - http://www.scopus.com/inward/record.url?scp=85045204156&partnerID=8YFLogxK
U2 - 10.1109/ICDAR.2017.53
DO - 10.1109/ICDAR.2017.53
M3 - Conference contribution
AN - SCOPUS:85045204156
T3 - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
SP - 274
EP - 280
BT - Proceedings - 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017
PB - IEEE Computer Society
T2 - 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017
Y2 - 9 November 2017 through 15 November 2017
ER -