TY - JOUR
T1 - code2vec
T2 - Learning distributed representations of code
AU - Alon, Uri
AU - Zilberstein, Meital
AU - Levy, Omer
AU - Yahav, Eran
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s).
PY - 2019/1
Y1 - 2019/1
N2 - We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet. To this end, code is first decomposed into a collection of paths in its abstract syntax tree. Then, the network learns the atomic representation of each path while simultaneously learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We show that code vectors trained on this dataset can predict method names from files that were unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. A comparison of our approach to previous techniques over the same dataset shows an improvement of more than 75%, making it the first to successfully predict method names based on a large, cross-project corpus. Our trained model, visualizations, and vector similarities are available as an interactive online demo at http://code2vec.org. The code, data, and trained models are available at https://github.com/tech-srl/code2vec.
AB - We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet. To this end, code is first decomposed into a collection of paths in its abstract syntax tree. Then, the network learns the atomic representation of each path while simultaneously learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We show that code vectors trained on this dataset can predict method names from files that were unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. A comparison of our approach to previous techniques over the same dataset shows an improvement of more than 75%, making it the first to successfully predict method names based on a large, cross-project corpus. Our trained model, visualizations, and vector similarities are available as an interactive online demo at http://code2vec.org. The code, data, and trained models are available at https://github.com/tech-srl/code2vec.
KW - Big Code
KW - Distributed Representations
KW - Machine Learning
UR - http://www.scopus.com/inward/record.url?scp=85120101200&partnerID=8YFLogxK
U2 - 10.1145/3290353
DO - 10.1145/3290353
M3 - Article
AN - SCOPUS:85120101200
SN - 2475-1421
VL - 3
JO - Proceedings of the ACM on Programming Languages
JF - Proceedings of the ACM on Programming Languages
IS - POPL
M1 - 40
ER -