TY - JOUR
T1 - Measuring the validity of peer-to-peer data for information retrieval applications
AU - Koenigstein, Noam
AU - Shavitt, Yuval
AU - Weinsberg, Ela
AU - Weinsberg, Udi
PY - 2012/2/23
Y1 - 2012/2/23
N2 - Peer-to-peer (p2p) networks are being increasingly adopted as an invaluable resource for various information retrieval (IR) applications, including similarity estimation, content recommendation and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts regarding the ability to actually extract sufficiently accurate information. This paper quantifies the measurement effort required to obtain and optimize information from p2p networks for the purpose of IR applications. We identify and measure inherent difficulties in collecting p2p data, namely, partial crawling, user-generated noise, sparseness, and popularity and localization of content and search queries. These aspects are quantified using music files shared in the Gnutella p2p network. We show that the power-law nature of the network makes it relatively easy to capture an accurate view of the popular content using relatively little effort. However, some applications, like trend prediction, mandate collection of the data from the "long tail", hence a much more exhaustive crawl is needed. Furthermore, we show that content and search queries are highly localized, indicating that location-crossing conclusions require a widespread spatial crawl. Finally, we present techniques for overcoming noise originating from user-generated content and for filtering non-informative data, while minimizing information loss.
AB - Peer-to-peer (p2p) networks are being increasingly adopted as an invaluable resource for various information retrieval (IR) applications, including similarity estimation, content recommendation and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts regarding the ability to actually extract sufficiently accurate information. This paper quantifies the measurement effort required to obtain and optimize information from p2p networks for the purpose of IR applications. We identify and measure inherent difficulties in collecting p2p data, namely, partial crawling, user-generated noise, sparseness, and popularity and localization of content and search queries. These aspects are quantified using music files shared in the Gnutella p2p network. We show that the power-law nature of the network makes it relatively easy to capture an accurate view of the popular content using relatively little effort. However, some applications, like trend prediction, mandate collection of the data from the "long tail", hence a much more exhaustive crawl is needed. Furthermore, we show that content and search queries are highly localized, indicating that location-crossing conclusions require a widespread spatial crawl. Finally, we present techniques for overcoming noise originating from user-generated content and for filtering non-informative data, while minimizing information loss.
KW - Information retrieval
KW - Measurement
KW - Peer-to-peer
UR - http://www.scopus.com/inward/record.url?scp=84859085799&partnerID=8YFLogxK
U2 - 10.1016/j.comnet.2011.10.026
DO - 10.1016/j.comnet.2011.10.026
M3 - Article
AN - SCOPUS:84859085799
SN - 1389-1286
VL - 56
SP - 1092
EP - 1102
JO - Computer Networks
JF - Computer Networks
IS - 3
ER -