Date of Award
11-2024
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Dr. Victor Winter
Abstract
This thesis investigates the performance of large language models (LLMs) from OpenAI, Anthropic, and Google on the HumanEval benchmark. This freely available benchmark automates the model evaluation process, ensuring consistent and objective evaluation. The HumanEval dataset consists of 164 programming problems. Our analysis examines correlations between problem features (cyclomatic complexity, Halstead metrics, problem emphasis, and a self-reflection metric we developed) and the models' problem-solving capabilities on this dataset. XGBoost was used to identify the significance of various features in predicting pass@1 outcomes, with inconclusive findings suggesting areas for further investigation, including potential model overfitting and data bias. The study identifies potential limitations related to API errors, similarities between some problems in the dataset and a model's training set, and resource constraints, emphasizing the need for future research with larger datasets and refined methodologies. In an additional investigation, problems in the dataset were placed into one of the following categories based on their emphasis: Language Comprehension, Algorithms, and Math. A self-reflection metric we developed was then used to compute the relative strength of a model for each problem in the dataset. Our findings indicate that all models were strongest on problems with a primary emphasis on Algorithms, second strongest on problems with a primary emphasis on Language Comprehension, and weakest on problems with a primary emphasis on Math. In most cases, a model's Language Comprehension and Math strengths were similar. Overall, Claude 3.5 Sonnet had the strongest performance while GPT-3.5 Turbo had the weakest.
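The pass@1 outcomes studied in the abstract are the k=1 case of the pass@k metric introduced with HumanEval. A minimal sketch of the standard unbiased estimator from the original HumanEval paper (n generated samples per problem, c of which pass the unit tests, evaluated at budget k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations passes,
    given that c of the n generations pass the tests."""
    if n - c < k:
        # Fewer failing samples than the budget: some draw must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k == 1 (one generation per problem), pass@1 is simply
# 1.0 if the single generation passes and 0.0 otherwise.
```

For example, if 3 of 10 generations pass, pass@1 estimates the chance a single randomly chosen generation passes (0.3), which is the per-problem quantity the thesis correlates with problem features.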
Recommended Citation
Vornhagen, Austin, "EXPLORING THE RELATIONSHIPS BETWEEN PROBLEM FEATURES IN THE HUMANEVAL DATASET AND PASS@K" (2024). Computer Science Theses, Dissertations, and Student Creative Activity. 5.
https://digitalcommons.unomaha.edu/compscistudent/5
Comments
The author holds the copyright to this work and any reuse or permission must be obtained from them directly.