I am working on my Master's thesis on applying Code Large Language Models (Code LLMs) to Predictive Mutation Testing (PMT). PMT was originally proposed for Java: it trains machine learning classifiers to predict which mutants a test suite will detect ("kill"), without executing the tests against every mutant.
My Research Focus: This project aims to replicate PMT for C# and to evaluate how well several Code LLMs (CodeLlama, CodeBERT, CodeT5) predict whether a mutant is killed or survives.
Data Considerations:
Inputs: The model will be trained on features collected for each mutant, including:
Tokenized mutated code
Number of tests covering the mutant
Static analysis metrics (e.g., McCabe Complexity)
Desired Output: The model will predict whether a mutant is "killed" (detected by at least one test) or survives the test suite.
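To make the data layout concrete, here is a minimal Python sketch of how one training sample might be represented and split into a text part (for a Code LLM encoder) and a numeric part (for the remaining features). The field names, the `to_model_input` helper, and the `max_tokens` cutoff are illustrative assumptions of mine, not part of Stryker.NET, Roslyn, or any PMT implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MutantSample:
    """One mutant with the three feature groups listed above (names are hypothetical)."""
    tokens: List[str]            # tokenized mutated code
    covering_tests: int          # number of tests covering the mutant
    cyclomatic_complexity: int   # McCabe complexity of the enclosing method
    killed: bool                 # label: True = killed, False = survived


def to_model_input(sample: MutantSample, max_tokens: int = 256) -> Tuple[str, List[float], int]:
    """Split a sample into (code text, numeric features, binary label).

    The code text would go to the Code LLM's own tokenizer; the numeric
    features could be concatenated with the model's embedding before a
    classification layer. max_tokens is an assumed truncation limit.
    """
    code_text = " ".join(sample.tokens[:max_tokens])
    numeric = [float(sample.covering_tests), float(sample.cyclomatic_complexity)]
    label = 1 if sample.killed else 0
    return code_text, numeric, label


# Example: a mutant where `>` was replaced by `>=`
sample = MutantSample(
    tokens=["if", "(", "a", ">=", "b", ")", "return", "a", ";"],
    covering_tests=3,
    cyclomatic_complexity=2,
    killed=True,
)
code_text, numeric, label = to_model_input(sample)
```

Keeping the code text separate from the numeric features leaves open whether the final classifier is a fine-tuned LLM head or a conventional model trained on embeddings plus metrics, which is part of what I am asking about.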
Technical Expertise: I am proficient in C# and use the Roslyn API for feature extraction. I use Stryker.NET to generate the mutants.
I am seeking valuable insights from the community in the following areas:
Algorithm Selection: Which machine learning algorithms or models are best suited for this specific PMT task with C# and Code LLMs?
Feature Engineering: Which additional features are likely to improve model performance?
Examples and Tutorials: Are there any recommended resources or tutorials for implementing and evaluating machine learning models in this context?
Collaboration and Contribution: I am eager to learn from experienced researchers and welcome any suggestions or insights that could contribute to the success of this project.
Thank you for your time and consideration.