TY - JOUR
T1 - Algebras for querying text regions
T2 - Expressive power and optimization
AU - Consens, Mariano P.
AU - Milo, Tova
N1 - Funding Information:
The authors thank Stephane Grumbach, Pekka Kilpelainen, Gonzalo Navarro, and Frank Tompa for discussions and comments related to the contents of this paper. The first author was financially supported in part by the Department of Computer Science at the University of Waterloo and by Grant CRD 147259 from the Natural Sciences and Engineering Research Council of Canada. This work was done while the second author was at the University of Toronto and supported by the Institute for Robotics and Intelligent Systems.
PY - 1998/12
Y1 - 1998/12
AB - There is a significant amount of interest in combining and extending database and information retrieval technologies to manage textual data. The challenge is becoming more relevant due to increased availability of documents in digital form. Document data has a natural hierarchical structure, which may be made explicit due to the use of markup conventions (as with SGML). An important aspect of managing structured and semistructured textual data consists of supporting the efficient retrieval of text components based both on their content and on their structure. In this paper we study issues related to the expressive power and optimization of a class of algebras that support combining string (or pattern) searches with queries on the hierarchical structure of the text. The region algebra studied is a set-at-a-time algebra for manipulating text regions (substrings of the text) that supports finding out nesting and ordering properties of the text regions. This algebra is part of the language in use in commercial text retrieval systems and can form the basis for supporting SQL-like access to textual data. By presenting a close relationship between the region algebra and the monadic first order theory of finite binary trees, we show that queries in the algebra can be optimized, in the sense that equivalence to less expensive expressions can be tested. This optimization can be difficult (co-NP-hard in the general case), but there is an important class of queries that can be optimized in polynomial time. On the negative side, we show that the language is incapable of capturing some important properties of the text structure, related to the nesting and ordering of text regions. We conclude by suggesting possible extensions to increase the expressive power of the language and consider one such example.
UR - http://www.scopus.com/inward/record.url?scp=0032299213&partnerID=8YFLogxK
U2 - 10.1006/jcss.1998.1564
DO - 10.1006/jcss.1998.1564
M3 - Article
AN - SCOPUS:0032299213
SN - 0022-0000
VL - 57
SP - 272
EP - 288
JO - Journal of Computer and System Sciences
JF - Journal of Computer and System Sciences
IS - 3
ER -