Development and Evaluation of a Deep Learning-Based Model for Source Code Quality Classification Using Industrial Data
JSEP Cover

Keywords

Source Code Quality
Deep Learning
Code Quality Classification
Domain Adaptation
Industrial Software Development

Abstract

Interest in automatic source code classification as a means of improving software productivity has grown in recent years. However, many organizations struggle to adopt machine learning solutions because security constraints restrict the use of online tools. This study develops and validates a deep learning-based model that classifies source code quality while operating entirely within a secure corporate environment. The model, referred to as the Source Code Quality Classification Model (SCQC model), was trained and evaluated using both open-source software (OSS) and internal source code. First, a training dataset was constructed from OSS repositories, and the resulting model achieved an accuracy of up to 82.1%. To examine its generalizability, the OSS-trained model was then applied to internal source code. Its accuracy declined significantly due to differences in code structure and development practices, which highlights the critical importance of domain alignment. Further experiments with internal data demonstrated that restricting the target scope by programming language and product category could improve prediction accuracy. These findings suggest that practical classification models are feasible when the training data is tailored to the specific characteristics of the development environment, and they point to a promising direction for deploying such models in real-world settings. Challenges remain, however, including the preparation of high-quality labeled training data and the adaptation of models to specific domains. Future work will address these issues and explore integrating the SCQC model into actual code review and quality assurance workflows.
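
The scope restriction described in the abstract can be illustrated with a minimal sketch that compares overall accuracy against accuracy within each (programming language, product category) group. This is not the SCQC model or the authors' evaluation code. The record fields (`language`, `category`, `label`, `predicted`), the function names, and the toy records are all hypothetical and carry no results from the study.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records whose predicted label matches the true label."""
    records = list(records)
    hits = sum(1 for r in records if r["label"] == r["predicted"])
    return hits / len(records)


def accuracy_by_scope(records):
    """Accuracy computed separately per (language, category) group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["language"], r["category"])
        totals[key] += 1
        if r["label"] == r["predicted"]:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Toy records for illustration only; these are NOT data from the study.
toy_records = [
    {"language": "java", "category": "embedded", "label": "good", "predicted": "good"},
    {"language": "java", "category": "embedded", "label": "bad", "predicted": "bad"},
    {"language": "python", "category": "web", "label": "good", "predicted": "bad"},
    {"language": "python", "category": "web", "label": "bad", "predicted": "bad"},
]

if __name__ == "__main__":
    print("overall:", overall_accuracy(toy_records))
    for scope, acc in sorted(accuracy_by_scope(toy_records).items()):
        print(scope, acc)
```

Reporting accuracy per scope in this way makes it visible when a single aggregate figure hides large differences between languages or product categories, which is the kind of gap the abstract associates with domain mismatch.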