TY - GEN
T1 - Boosting automatic commit classification into maintenance activities by utilizing source code changes
AU - Levin, Stanislav
AU - Yehudai, Amiram
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery. All rights reserved.
PY - 2017/11/8
Y1 - 2017/11/8
N2 - Background: Understanding maintenance activities performed in a source code repository could help practitioners reduce uncertainty and improve cost-effectiveness by planning ahead and pre-allocating resources towards source code maintenance. Theresearch community uses 3 main classification categories for maintenance activities: Corrective: fault fixing; Perfective: system improvements; Adaptive: new feature introduction. Previous work in this area has mostly concentrated on evaluating commit classification (into maintenance activities) models in the scope of a single software project. Aims: In this work we seek to design a commit classification model capable of providing high accuracy and Kappa across different projects. In addition, we wish to compare the accuracy and kappa characteristics of classification models that utilize word frequency analysis, source code changes, and combination thereof. Method: We suggest a novel method for automatically classifying commits into maintenance activities by utilizing source code changes (e.g, statement added, method removed, etc.). The results we report are based on studying 11 popular open source projects from various professional domains from which we had manually classified 1151 commits, over 100 from each of the studied projects. Our models were trained using 85% of the dataset, while the remaining 15% were used as a test set. Results: Our method shows a promising accuracy of 76% and Cohen's kappa of 63% (considered "Good" in this context) for the test dataset, an improvement of over 20 percentage points, and a relative boost of ~40% in the context of cross-project classification. Conclusions: We show that by using source code changes in combination with commit message word frequency analysis we are able to considerably boost classification quality in a project agnostic manner.
AB - Background: Understanding maintenance activities performed in a source code repository could help practitioners reduce uncertainty and improve cost-effectiveness by planning ahead and pre-allocating resources towards source code maintenance. Theresearch community uses 3 main classification categories for maintenance activities: Corrective: fault fixing; Perfective: system improvements; Adaptive: new feature introduction. Previous work in this area has mostly concentrated on evaluating commit classification (into maintenance activities) models in the scope of a single software project. Aims: In this work we seek to design a commit classification model capable of providing high accuracy and Kappa across different projects. In addition, we wish to compare the accuracy and kappa characteristics of classification models that utilize word frequency analysis, source code changes, and combination thereof. Method: We suggest a novel method for automatically classifying commits into maintenance activities by utilizing source code changes (e.g, statement added, method removed, etc.). The results we report are based on studying 11 popular open source projects from various professional domains from which we had manually classified 1151 commits, over 100 from each of the studied projects. Our models were trained using 85% of the dataset, while the remaining 15% were used as a test set. Results: Our method shows a promising accuracy of 76% and Cohen's kappa of 63% (considered "Good" in this context) for the test dataset, an improvement of over 20 percentage points, and a relative boost of ~40% in the context of cross-project classification. Conclusions: We show that by using source code changes in combination with commit message word frequency analysis we are able to considerably boost classification quality in a project agnostic manner.
KW - Human factors
KW - Mining software repositories
KW - Predictive models
KW - Software maintenance
UR - http://www.scopus.com/inward/record.url?scp=85049036799&partnerID=8YFLogxK
U2 - 10.1145/3127005.3127016
DO - 10.1145/3127005.3127016
M3 - ???researchoutput.researchoutputtypes.contributiontobookanthology.conference???
AN - SCOPUS:85049036799
SN - 9781450353052
T3 - ACM International Conference Proceeding Series
SP - 97
EP - 106
BT - PROMISE 2017 - 13th International Conference on Predictive Models and Data Analytics in Software Engineering
PB - Association for Computing Machinery
T2 - 13th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2017
Y2 - 8 November 2017
ER -