TY - GEN
T1 - Boosting SimRank with Semantics
AU - Milo, Tova
AU - Somech, Amit
AU - Youngmann, Brit
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s).
PY - 2019
Y1 - 2019
N2 - The problem of estimating the similarity of a pair of nodes in an information network draws extensive interest in numerous fields, e.g., social networks and recommender systems. In this work we revisit SimRank, a popular and well studied similarity measure for information networks, that quantifies the similarity of two nodes based on the similarity of their neighbors. SimRank’s popularity stems from its simple, declarative definition and its efficient, scalable computation. However, despite its wide adaptation, it has been observed that for many applications SimRank may yield inaccurate similarity estimations, due to the fact that it focuses on the network structure and ignores the semantics conveyed in the node/edge labels. Therefore, the question that we ask is can SimRank be enriched with semantics while preserving its advantages? We answer the question positively and present SemSim, a modular variant of SimRank that allows to inject into the computation any semantic similarly measure, which satisfies three natural conditions. The probabilistic framework that we develop for SemSim is anchored in a careful modification of SimRank’s underlying random surfer model. It employs Importance Sampling along with a novel pruning technique, based on unique properties of SemSim. Our framework yields execution times essentially on par with the (semantic-less) SimRank, while maintaining negligible error rate, and facilitates direct adaptation of existing SimRank optimizations. Our experiments demonstrate the robustness of SemSim, even compared to task-dedicated measures.
AB - The problem of estimating the similarity of a pair of nodes in an information network draws extensive interest in numerous fields, e.g., social networks and recommender systems. In this work we revisit SimRank, a popular and well studied similarity measure for information networks, that quantifies the similarity of two nodes based on the similarity of their neighbors. SimRank’s popularity stems from its simple, declarative definition and its efficient, scalable computation. However, despite its wide adaptation, it has been observed that for many applications SimRank may yield inaccurate similarity estimations, due to the fact that it focuses on the network structure and ignores the semantics conveyed in the node/edge labels. Therefore, the question that we ask is can SimRank be enriched with semantics while preserving its advantages? We answer the question positively and present SemSim, a modular variant of SimRank that allows to inject into the computation any semantic similarly measure, which satisfies three natural conditions. The probabilistic framework that we develop for SemSim is anchored in a careful modification of SimRank’s underlying random surfer model. It employs Importance Sampling along with a novel pruning technique, based on unique properties of SemSim. Our framework yields execution times essentially on par with the (semantic-less) SimRank, while maintaining negligible error rate, and facilitates direct adaptation of existing SimRank optimizations. Our experiments demonstrate the robustness of SemSim, even compared to task-dedicated measures.
UR - https://www.scopus.com/pages/publications/85064903262
U2 - 10.5441/002/edbt.2019.05
DO - 10.5441/002/edbt.2019.05
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:85064903262
T3 - Advances in Database Technology - EDBT
SP - 37
EP - 48
BT - Advances in Database Technology - EDBT 2019
A2 - Herschel, Melanie
A2 - Binnig, Carsten
A2 - Reinwald, Berthold
A2 - Kaoudi, Zoi
A2 - Galhardas, Helena
A2 - Fundulaki, Irini
PB - OpenProceedings.org
T2 - 22nd International Conference on Extending Database Technology, EDBT 2019
Y2 - 26 March 2019 through 29 March 2019
ER -