Date of Award
11-2024
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Dr. Victor Winter
Abstract
This thesis investigates the performance of large language models (LLMs) from OpenAI, Anthropic, and Google on the HumanEval benchmark. This freely available benchmark automates the model evaluation process, ensuring consistent and objective evaluation. The HumanEval dataset consists of 164 programming problems. Our analysis examines correlations between problem features (cyclomatic complexity, Halstead metrics, problem emphasis, and a self-reflection metric we developed) and the models' problem-solving capabilities on this dataset. XGBoost was used to identify the significance of various features in predicting pass@1 outcomes, with inconclusive findings suggesting areas for further investigation, including potential model overfitting and data bias. The study identifies potential limitations related to API errors, similarities between some problems in the dataset and a model's training set, and resource constraints, emphasizing the need for future research with larger datasets and refined methodologies. In an additional investigation, problems in the dataset were placed into one of the following categories based on their emphasis: Language Comprehension, Algorithms, and Math. A self-reflection metric we developed was then used to compute the relative strength of a model for each problem in the dataset. Our findings indicate that all models were strongest on problems with a primary emphasis on Algorithms, second strongest on problems with a primary emphasis on Language Comprehension, and weakest on problems with a primary emphasis on Math. In most cases, a model's Language Comprehension and Math strengths were similar. Overall, Claude 3.5 Sonnet had the strongest performance while GPT-3.5 Turbo had the weakest.
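The pass@1 outcomes studied in the abstract are the k=1 case of the pass@k metric introduced with HumanEval. A minimal sketch of the standard unbiased estimator from the original HumanEval paper (n generated samples per problem, c of which pass the unit tests, evaluated at budget k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations passes,
    given that c of the n generations pass the tests."""
    if n - c < k:
        # Fewer failing samples than the budget: some draw must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k == 1 (one generation per problem), pass@1 is simply
# 1.0 if the single generation passes and 0.0 otherwise.
```

For example, if 3 of 10 generations pass, pass@1 estimates the chance a single randomly chosen generation passes (0.3), which is the per-problem quantity the thesis correlates with problem features.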
Recommended Citation
Vornhagen, Austin, "EXPLORING THE RELATIONSHIPS BETWEEN PROBLEM FEATURES IN THE HUMANEVAL DATASET AND PASS@K" (2024). Computer Science Theses, Dissertations, and Student Creative Activity. 5.
https://digitalcommons.unomaha.edu/compscistudent/5
Comments
The author holds the copyright to this work and any reuse or permission must be obtained from them directly.