Comparative Evaluation of Large Language Models Using Key Metrics and
Emerging Tools
Abstract
This research involved designing and building an interactive generative
AI application to conduct a comparative analysis of two advanced Large
Language Models (LLMs), GPT-4 and Claude 2, using LangSmith evaluation
tools. The project was developed to explore the potential of LLMs in
facilitating postgraduate course recommendations within a simulated
environment at Munster Technological University (MTU). The application
enables side-by-side testing of GPT-4 and Claude 2 and can be hosted
flexibly on either Amazon Web Services (AWS) or Azure. It utilizes
advanced natural language processing and
retrieval-augmented generation (RAG) techniques to process proprietary
data tailored to postgraduate needs. A key component of this research
was the rigorous assessment of the LLMs using the LangSmith evaluation
tool against both customized and standard benchmarks. The evaluation
focused on metrics such as bias, safety, accuracy, cost, robustness, and
latency. Additionally, adaptability, covering critical features such as
language translation and internet access, was researched independently
because the LangSmith tool does not evaluate this metric. This ensures a
holistic assessment of each LLM's capabilities.