Luigi Libero Lucio Starace, Ph.D.

Assistant Professor @ Università degli Studi di Napoli Federico II, Italy.

AI-based Fault-proneness Metrics for Source Code Changes

Authors: Francesco Altiero, Anna Corazza, Sergio Di Martino, Adriano Peron, and Luigi Libero Lucio Starace.
Conference: IWSM-MENSURA 2023 - Joint Conference of the 32nd International Workshop on Software Measurement (IWSM) and the 17th International Conference on Software Process and Product Measurement (MENSURA).

Abstract

In software evolution, some types of changes to the codebase (e.g., renaming a local variable during refactoring) are less likely to introduce faults than others (e.g., changes involving control-flow statements). Effectively estimating the fault-proneness of codebase changes can provide several advantages in the software development process. For example, expensive and time-consuming regression testing, code review, or fault localization activities could be driven by fault-proneness, prioritizing the most critical changes to detect issues more rapidly. A number of works in the literature have focused on predicting the fault-proneness of software systems. Less work, however, has focused on the fault-proneness of evolutionary changes to a codebase, and existing approaches typically require project-specific historical data to be used effectively.

This paper presents a set of AI-based metrics designed to estimate the fault-proneness of source code changes. The proposed metrics are based on Tree Kernel functions and Transformer models, which have been widely and effectively used in the Natural Language Processing domain. Moreover, the proposed metrics can be applied to any software project and do not require fine-tuning with project-specific historical data. The effectiveness of the proposed metrics is assessed by applying them to a dataset of real-world source code evolution scenarios, and by comparing them against fault-proneness scores provided by a Software Engineering practitioner.
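To give an intuition of how a tree-based change metric can distinguish low-risk edits (like renamings) from riskier control-flow changes, here is a minimal, hypothetical sketch in Python. It is not the kernel used in the paper: it simply compares the multisets of AST node signatures of two code versions, so that edits preserving tree structure score as more similar (and thus, plausibly, less fault-prone) than edits that alter it.

```python
import ast
from collections import Counter

def node_signatures(tree):
    """Collect a multiset of (node-type, child-types) signatures over the AST.

    Identifiers are deliberately ignored: only the tree structure matters,
    so a pure renaming leaves the signature multiset unchanged.
    """
    sigs = Counter()
    for node in ast.walk(tree):
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        sigs[(type(node).__name__, children)] += 1
    return sigs

def tree_similarity(src_before, src_after):
    """Normalized overlap of AST signatures between two versions of the code.

    Returns 1.0 when the trees are structurally identical, and lower
    values the more the change alters the tree's shape.
    """
    a = node_signatures(ast.parse(src_before))
    b = node_signatures(ast.parse(src_after))
    shared = sum((a & b).values())          # multiset intersection size
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 1.0

before = "def f(x):\n    y = x + 1\n    return y\n"
renamed = "def f(x):\n    z = x + 1\n    return z\n"        # variable renaming
control = "def f(x):\n    if x > 0:\n        return x + 1\n    return 0\n"  # control-flow change

print(tree_similarity(before, renamed))   # 1.0: renaming preserves the tree
print(tree_similarity(before, control))   # < 1.0: control flow reshapes the tree
```

A fault-proneness proxy could then be defined as, e.g., `1 - tree_similarity(...)`, so that structure-preserving edits get low scores; the actual metrics in the paper additionally exploit Tree Kernels and Transformer-based representations.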

Results are promising and show that the proposed metrics are strongly correlated with human-defined fault-proneness scores, and could thus serve as a good proxy for costly human evaluations. The results also motivate further research on applying these metrics to concrete scenarios such as regression testing.

Replication Package

The considered dataset, including the computed metrics and the manually defined fault-proneness scores, is publicly available in the replication package at DOI. The replication package also includes all the code necessary to compute the proposed metrics, as well as the data-analysis scripts we used to process the raw data and produce the plots and results discussed in the paper.