Date Thesis Awarded

5-2024

Access Type

Honors Thesis -- Open Access

Degree Name

Bachelor of Science (BS)

Department

Computer Science

Advisor

Denys Poshyvanyk

Committee Members

Adwait Nadkarni

Cristiano Fanelli

Martin White

Abstract

Large Language Models (LLMs) can model long-term dependencies in sequences of tokens and are consequently often used to generate text through language modeling. These capabilities are increasingly applied to code generation; however, LLM-powered code generation tools such as GitHub's Copilot have been generating insecure code and thus pose a cybersecurity risk. To generate secure code, we must first understand why LLMs generate insecure code. This non-trivial task can be approached through interpretability methods, which investigate the hidden state of a neural network to explain model outputs. One recent interpretability method is rationales: the minimal subset of input tokens that leads to the model's output. By obtaining rationales for insecure code, we can investigate the relationship between model inputs and LLM-generated insecure code tokens, furthering efforts to mitigate the cybersecurity risks currently posed by LLM-generated code.
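
As an illustration of the rationale concept, the following is a minimal sketch of greedy rationalization; it is not the thesis's implementation. The helpers prob(subset, token) and top_prediction(subset) are hypothetical and would, in practice, be backed by a GPT-2 model able to score partial contexts.

def greedy_rationale(num_context_tokens, target, prob, top_prediction):
    """Greedily grow a subset of context positions until the model's top
    prediction given only that subset is the target token."""
    rationale = []                                  # selected context positions
    remaining = set(range(num_context_tokens))
    while remaining and top_prediction(rationale) != target:
        # Add the position that most increases P(target | selected subset).
        best = max(remaining, key=lambda p: prob(rationale + [p], target))
        rationale.append(best)
        remaining.remove(best)
    return sorted(rationale)

The returned positions form the (approximately) minimal subset of input tokens responsible for the target prediction.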

We conduct a case study on two common, pervasive, and severe real-world weaknesses: cross-site scripting (XSS, CWE-79) and SQL injection (CWE-89). We first collected data, then obtained rationales for our weak Python code samples via the greedy rationalization algorithm and a GPT-2 model, allowing us to identify the specific context tokens that lead to insecure token generation. We also explored an aggregation function for code rationales, a structural code taxonomy, which allowed us to investigate rationales at both the local and global levels. Our prototype study found that rationales for CWE-79 and CWE-89 code samples map to different structural code taxonomy categories. This implies that each LLM-generated weakness arises from different aspects of the code context, and thus efforts to mitigate insecure LLM-generated code must be targeted precisely to each weakness.
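
As a rough illustration of the aggregation step, the sketch below tallies how often rationale tokens fall into each structural code taxonomy category; the data layout and helper name are assumptions for illustration, not the thesis's exact taxonomy or code.

from collections import Counter

def taxonomy_distribution(rationales):
    """rationales: one list of (token, category) pairs per weak code sample.
    Returns the relative frequency of each taxonomy category across all samples."""
    counts = Counter(category for rationale in rationales
                     for _token, category in rationale)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

Comparing the resulting distributions for the CWE-79 and CWE-89 samples is what reveals whether the two weaknesses draw on different aspects of the code context.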

Comments

This honors thesis was written and accepted for departmental honors in Computer Science, also serving as partial fulfillment of the author's computer science minor. A second thesis (not honors) was written and accepted for partial fulfillment of the author's Bachelor of Science in Physics. The second thesis, titled Data Analysis and Machine Learning on DIRC Hit Patterns for PID, explores a different application of neural networks: enhancement of subatomic particle classification methods through a novel neural network architecture, the Swin vision transformer.
