Date Thesis Awarded
5-2024
Access Type
Honors Thesis -- Open Access
Degree Name
Bachelor of Arts (BA)
Department
Computer Science
Advisor
Denys Poshyvanyk
Committee Members
Andreas Stathopoulos
Ron Smith
Abstract
In recent years, automated software engineering tasks have been tackled using Large Language Models trained on source code, such as Seq2Seq, LSTM, GPT, T5, BART, and BERT. Because source code is inherently textual, it can be represented as a sequence of sub-words (or tokens), which draws a direct parallel to prior work in NLP. Although these models have shown promising results on established metrics (e.g., BLEU, CodeBLEU), a deeper question remains: how much syntax knowledge do they actually acquire when trained and fine-tuned for specific tasks?
To address this question, this thesis introduces a taxonomy of syntax errors and a labeled set of LLM-generated code containing syntax errors. The taxonomy is organized into Simple and Complex errors, according to the level of structural degradation each syntax error causes. We explored these errors over three different NLP datasets: Mostly Basic Python Problems (MBPP), the Code/Natural Language Challenge (CoNaLa), and HumanEval. CoNaLa and MBPP pose the task of code generation from natural language, while HumanEval poses the task of code completion.
We ran a total of 4,941 prompts through the Mistral-7B-Instruct-v0.2 model and encountered 130 syntax errors, a 2.6% error rate. When the samples are restricted to Python code only, the error rate increases to 2.9%. The most common simple error was an extra token, a space, added to the result. The most common complex error broke the assign relationship.
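As a rough illustration of the figures above, the following minimal Python sketch flags generated snippets that fail to parse and recomputes the overall error rate from the reported counts (130 errors in 4,941 prompts). It is not the thesis's pipeline, which is available at the repository listed under Comments. The function names and the three sample snippets are hypothetical, and using the standard-library `ast` parser as the syntax check is an assumption made only for this example.

```python
import ast


def has_syntax_error(source: str) -> bool:
    """Return True if `source` does not parse as Python (assumed check)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError also covers IndentationError; ValueError covers
        # inputs such as null bytes on older Python versions.
        return True
    return False


def error_rate(n_errors: int, n_samples: int) -> float:
    """Percentage of samples that contain a syntax error."""
    return 100.0 * n_errors / n_samples


# Hypothetical snippets, for illustration only (not from the thesis data).
samples = [
    "x = 1 + 2",          # parses
    "x = = 1",            # broken assignment
    "def f(:\n    pass",  # malformed parameter list
]
flags = [has_syntax_error(s) for s in samples]

# Counts reported in the abstract: 130 errors out of 4,941 prompts.
rate_reported = round(error_rate(130, 4941), 1)
print(flags, rate_reported)
```

A parser-based check like this only says whether a snippet is syntactically valid. Assigning each failure to a Simple or Complex category, as the taxonomy does, would need additional labeling.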
Recommended Citation
Granger, Cole, "Code Syntax Understanding in Large Language Models" (2024). Undergraduate Honors Theses. William & Mary. Paper 2181.
https://scholarworks.wm.edu/honorstheses/2181
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Included in
Artificial Intelligence and Robotics Commons, Data Science Commons, Programming Languages and Compilers Commons
Comments
Code can be found at:
https://github.com/WM-SEMERU/syntax-error-cole