Date Thesis Awarded

5-2024

Access Type

Honors Thesis -- Open Access

Degree Name

Bachelor of Arts (BA)

Department

Computer Science

Advisor

Denys Poshyvanyk

Committee Members

Andreas Stathopoulos

Ron Smith

Abstract

In recent years, automated software engineering tasks have been tackled with neural language models trained on source code, including Seq2Seq, LSTM, GPT, T5, BART, and BERT architectures. The inherently textual nature of source code allows it to be represented as a sequence of sub-words (or tokens), drawing parallels to prior work in NLP. Although these models have shown promising results according to established metrics (e.g., BLEU, CodeBLEU), a deeper question remains about how much syntactic knowledge they truly grasp when trained and fine-tuned for specific tasks.

To address this question, this thesis introduces a taxonomy of syntax errors and a labeled set of LLM-generated code containing such errors. The taxonomy is organized into Simple and Complex errors, describing the level of structural degradation the syntax errors cause. We explored these errors across three NLP datasets: Mostly Basic Python Problems (MBPP), the Code/Natural Language Challenge (CoNaLa), and HumanEval; CoNaLa and MBPP pose the task of code generation from natural language, while HumanEval poses the task of code completion.

We ran a total of 4,941 prompts through the Mistral-7B-Instruct-v0.2 model and encountered 130 syntax errors, a 2.6% error rate. When we restrict the samples to Python code only, the error rate increases to 2.9%. The most common simple error was an extra token, a space, added to the result. The most common complex error broke the assign relationship.
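As a minimal sketch of how generated snippets can be flagged as syntactically invalid (not the thesis's actual pipeline, and the sample snippets below are hypothetical), Python's standard-library ast module can be used as a parse check:

import ast

def has_syntax_error(code: str) -> bool:
    """Return True if the snippet fails to parse into a Python AST."""
    try:
        ast.parse(code)
        return False
    except SyntaxError:
        return True

# Hypothetical generated snippets: the first parses, the second contains an extra token.
samples = [
    "def add(a, b):\n    return a + b",
    "def add(a, b) :)\n    return a + b",
]
error_rate = sum(has_syntax_error(s) for s in samples) / len(samples)
print(f"syntax error rate: {error_rate:.1%}")

Classifying an error as Simple or Complex would additionally require inspecting which structural relationships of the intended program are broken, which this check alone does not do.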

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Comments

Code can be found at:

https://github.com/WM-SEMERU/syntax-error-cole
