High-quality vulnerability patch data is essential for understanding vulnerabilities in software systems. Accurate patch data sheds light on the nature of vulnerabilities, their origins, and effective remediation strategies. However, current data collection efforts prioritize rapid release over quality, leading to patches that are incomplete or contain extraneous changes. In addition to supporting vulnerability analysis, high-quality patch data improves automatic vulnerability prediction models, which require reliable inputs to predict issues in new or existing code. In this paper, we explore using large language models (LLMs) to filter vulnerability data by identifying and removing low-quality instances. Trained on large textual corpora including source code, LLMs offer new opportunities to improve data accuracy. Our goal is to leverage LLMs for reasoning-based assessments of whether a code hunk fixes a described vulnerability. We evaluate several prompting strategies and find that Generated Knowledge Prompting, where the model first explains a hunk’s effect, then assesses whether it fixes the bug, is most effective across three LLMs. Applying this filtering to the BigVul dataset, we show a 7%–9% improvement in prediction precision for three popular vulnerability prediction models. Recall declines slightly, 2%–8%, across models, likely reflecting the impact of reduced dataset size.
@article{dil2025,
  title    = {Towards higher quality software vulnerability data using LLM-based patch filtering},
  author   = {Dil, Charlie and Chen, Hui and Damevski, Kostadin},
  journal  = {Journal of Systems and Software},
  pages    = {112581},
  year     = {2025},
  issn     = {0164-1212},
  doi      = {10.1016/j.jss.2025.112581},
  url      = {https://www.sciencedirect.com/science/article/pii/S016412122500250X},
  keywords = {Vulnerability patch quality, Automatic vulnerability prediction, Large language models},
  preprint = {preprint/llmcleanjss.pdf},
  note     = {In press}
}
2024
Improving Data Curation of Software Vulnerability Patches through Uncertainty Quantification
@article{uq2024,
  title      = {Improving Data Curation of Software Vulnerability Patches through Uncertainty Quantification},
  author     = {Chen, Hui and Zhao, Yunhua and Damevski, Kostadin},
  year       = {2024},
  journal    = {CoRR},
  volume     = {abs/2411.11659},
  url        = {https://arxiv.org/abs/2411.11659},
  doi        = {10.48550/arXiv.2411.11659},
  eprinttype = {arXiv},
  eprint     = {2411.11659}
}