What are LLM benchmarks? Key metrics and limitations

With dozens of large language model (LLM) families and hundreds of versions available, choosing the right one can be a daunting task. That’s where LLM evaluation benchmarks step in. Read on to learn about the most popular examples of LLM benchmarks, how they work, and, most importantly, how you can use them to find the language model of your dreams.

1/20/2025

15 min read

What are LLM benchmarks?

LLM benchmarks are standardized tests and frameworks used to evaluate and compare the performance of large language models across various domains. They consist of specific datasets, tasks (such as language understanding, question answering, math problem solving, and coding), and scoring systems that measure whether models can produce correct responses to given inputs. Thanks to defined metrics, benchmarks help measure a model's strengths, limitations, and reliability.

Moreover, LLM benchmarks serve as essential tools for model development and selection by providing consistent, uniform evaluation methods that enable fair comparisons across different models. They guide the fine-tuning process by offering quantitative measures that highlight where models excel and where they need improvement, helping researchers advance the field.

Additionally, benchmarks help software developers and organizations make informed decisions when choosing models for their specific needs, as they provide objective model performance data across standardized tests rather than relying on subjective assessments.

How do LLM benchmarks work?

LLM benchmarks work in three steps: 

  1. Dataset preparation and task presentation
  2. Performance evaluation and scoring mechanisms
  3. Ranking and leaderboard systems

Let's break down each stage to understand how this works in practice.

1. Dataset preparation and task presentation

LLM benchmarks require datasets containing diverse challenges tailored to specific skills. These might include coding problems, mathematical equations, reading comprehension passages, scientific questions, or real-world conversation scenarios. The tasks themselves span multiple categories such as commonsense reasoning, problem-solving, question answering, text summarization, and language translation.

When administering these tests, benchmarks typically use one of three methodologies:

Zero-shot testing presents tasks to the model without giving any examples. This shows the LLM's ability to understand new concepts and adapt to unfamiliar scenarios on the fly.

Few-shot testing provides a handful of examples before asking the model to complete similar tasks. Such testing is suitable for demonstrating how well the LLM can learn from limited data.

Fine-tuned evaluation involves training the model on datasets similar to the benchmark's content. This way, one can optimize the LLM's performance for a specific task type.
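
To make the difference concrete, here is a rough sketch of how the same (invented) question could be framed under zero-shot and few-shot testing. The prompt wording and examples are purely illustrative, not taken from any real benchmark:

```python
# Rough sketch: the same (invented) question framed for zero-shot and
# few-shot testing. Prompt wording and examples are illustrative only.

QUESTION = "What is 17 * 24?"

# Zero-shot: the model gets the task with no worked examples.
zero_shot_prompt = f"Answer with a single number.\nQ: {QUESTION}\nA:"

# Few-shot: a handful of solved examples precede the real question.
examples = [
    ("What is 3 * 4?", "12"),
    ("What is 10 * 25?", "250"),
]
few_shot_prompt = "Answer with a single number.\n"
for question, answer in examples:
    few_shot_prompt += f"Q: {question}\nA: {answer}\n"
few_shot_prompt += f"Q: {QUESTION}\nA:"

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```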

The number of test cases varies significantly by benchmark, ranging from dozens to thousands of examples. Each input requires the model to process information and generate appropriate responses.

2. Performance evaluation and scoring mechanisms

After models complete their assigned tasks, benchmarks use various evaluation methods and scoring mechanisms depending on the nature of the challenge.

These range from simple accuracy metrics, where the LLM has to answer multiple-choice questions, to scenarios where a second LLM acts as a judge and assesses multiple criteria. In some cases, human evaluation is also used, especially for chatbots.

Finally, evaluation and scoring mechanisms can be combined to provide a more all-around assessment.

3. Ranking and leaderboard systems

After multiple models complete the same benchmark, their scores enable direct performance comparisons through ranking systems. Individual benchmarks often maintain their own leaderboards, typically published alongside the original research introducing the evaluation framework.

Additionally, LLM comparison benchmark leaderboards, such as those gathered in HuggingFace’s Big Benchmarks Collection, aggregate results from multiple evaluation sources, providing broader performance perspectives.

These ranking systems usually provide scores from 0 to 100, creating standardized performance snapshots that help researchers, developers, and organizations make informed decisions about model selection and development priorities.
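
To tie the three steps together, here is a minimal sketch of a benchmark harness: it runs several models over the same tiny dataset, scores them by exact match, and ranks them on a 0-100 scale. The dataset is invented, and `ask_model` is a placeholder for whatever API call your models actually use:

```python
# Minimal benchmark harness sketch: one shared dataset, several models,
# exact-match scoring, and a 0-100 leaderboard. The dataset is invented,
# and ask_model is a placeholder for a real API call.

dataset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]

def ask_model(model_name: str, question: str) -> str:
    """Placeholder: send the question to the named model and return its answer."""
    raise NotImplementedError("Wire this up to your model provider.")

def run_benchmark(model_name: str) -> float:
    """Score one model by exact match, on a 0-100 scale."""
    correct = sum(
        ask_model(model_name, item["question"]).strip().lower() == item["answer"].lower()
        for item in dataset
    )
    return 100 * correct / len(dataset)

def leaderboard(model_names: list[str]) -> list[tuple[str, float]]:
    """Rank models from best to worst on this benchmark."""
    scores = [(name, run_benchmark(name)) for name in model_names]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```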

Why is LLM benchmarking important?

LLM benchmarking is important because it helps you navigate an ever-growing selection of models. Without standardized evaluation, finding your way in a landscape full of unverified claims and inconsistent testing would be difficult, to say the least.

Objective model comparison

The main benefit of LLM benchmarking is objective model comparison. Instead of relying on user reviews and marketing copy, one can see how each model performs under uniform testing conditions. This is especially important because each LLM has its strengths and weaknesses, and using one for all your tasks can be counterproductive.

Informed decision-making

LLM benchmarks can help organizations decide which model to choose. A software development company might want the best option for coding, while an e-shop might be looking for the right customer support chatbot. Benchmarking also lets you see if you will need more than one AI model for your projects. In such cases, using an all-in-one AI platform for business can be beneficial. With it, you can also compare the output of different LLMs and choose the best option.

Driving innovation and improvement

Benchmarks serve as progress indicators, providing quantitative data on where different models excel or struggle. So if a certain LLM scores high in coding tasks but lags behind in creative writing, its developers can act accordingly. It’s also beneficial to see how your competitors are doing and what techniques or solutions bring the best results.

Establishing industry standards

Model benchmarks also set the bar for every player. They provide transparency and encourage developers to share performance data, which fuels competition and benefits the end user. As soon as one model gets the highest score, that calls for a new benchmark, pushing the evolution further. While benchmarks provide standardized testing, comprehensive LLM evaluation goes beyond scores, especially when you’re measuring reliability, safety, and real-world behavior.

What are the most common LLM benchmarks?

There are already plenty of LLM benchmarks available, and more seem to be popping up as AI technology advances. At the same time, others get retired when they can no longer evaluate the latest models’ responses properly.

Here are some of the most popular and latest LLM benchmarks that are still active in 2025, in alphabetical order.

AgentHarm

The AgentHarm benchmark examines AI model misuse. It evaluates how LLMs identify and prevent potentially harmful actions by testing 44 behaviors across 8 categories, such as fraud and harassment. To score high, models must refuse malevolent requests and maintain their capabilities after the attack.

As of now, Meta’s Llama 3.1 has the top position in the AgentHarm benchmark.

American Invitational Mathematics Examination (AIME)

AIME is an invite-only math competition for high school students. It includes 15 questions covering algebra, geometry, and number theory, where each answer is a single number from 0 to 999. Historically, students answer just one-third of those correctly.

This benchmark used questions from the 2024 and 2025 competitions. ChatGPT o3 Mini was the leader, with a solid 86.5% accuracy.

ARC-AGI-2

The Abstraction and Reasoning Challenge (ARC-AGI-2) got its latest update in March 2025. It tests using the few-shot method and aims to accelerate progress toward Artificial General Intelligence, or AGI. This benchmark involves symbolic interpretation and requires recognizing the meaning of symbols beyond their visual patterns.

ARC-AGI-2 also tests compositional reasoning, where LLMs need to apply multiple rules simultaneously or rules that interact with each other.

Finally, it benchmarks contextual rule application, where rules must be applied differently depending on the situation.

As of June 2025, the leading model in the ARC-AGI-2 was Claude Opus 4.

Berkeley Function-Calling Leaderboard (BFCL)

BFCL is a benchmark that tests LLMs’ ability to detect function relevance and call functions. The dataset consists of more than 2,000 questions in Python, Java, and other programming languages. BFCL also measures AI hallucinations.

At the time of writing this blog post, xLAM-2-70b by Salesforce had the best score among models with native support for function calling. In the prompt category, GPT-4o turned out to be the leader, taking 6th place overall.

CaseLaw

The CaseLaw benchmark evaluates how LLMs can be used for litigation in public court systems. It uses data from Canadian court cases and two question types: extractive questions are verified against the ground truth, while summative questions must include the most relevant semantic points.

When checking the latest leaderboard, updated on May 30, 2025, we saw Grok 3 beta as the best-performing model, which also offered the best speed.

FinBen

The FinBen benchmark was created to evaluate LLMs in real-world financial scenarios. It has 24 tasks covering domains such as risk management, forecasting, and decision-making. FinBen also uses two open-source datasets focused on stock trading and financial QA.

According to the leaderboard, GPT-4 and Llama 3.1 did the best job.

Graduate-Level Google-Proof Q&A Benchmark (GPQA)

GPQA is one of the hardest tests for measuring reasoning and general question-answering performance. The models have to answer complex science questions from STEM disciplines that are Google-proof, meaning they cannot be answered with simple recall or a web search.

At the time of writing this article, ChatGPT's o3 was at the top of the leaderboard on the vals.ai website. In general, all “reasoning” models performed well in GPQA.

GSM8K-Platinum

GSM8K has long been the key benchmark designed to test mathematical reasoning. In March 2025, the Platinum version replaced the original dataset with the goal of reducing label noise and improving the test’s reliability. The more precise version showed drastically different results, putting Claude Sonnet 3.7 in the leader’s position.

Humanity’s Last Exam

A benchmark with the most dramatic name, Humanity’s Last Exam, surely pushes models to the limit in terms of technical knowledge and reasoning, and is considered one of the best LLM benchmarks. The dataset includes more than 2,500 challenging questions from over 100 subjects prepared by professors, researchers, and other experts. 

According to the latest data (April 2025), Gemini 2.5 Pro is at the top of the leaderboard (21.6%), followed by ChatGPT o3 (20.3%) and o4-mini (18.1%). While none of these models shows high accuracy, the creators of the benchmark predict that scores might reach 50% by the end of 2025.

LiveCodeBench

As the name implies, LiveCodeBench is tailored to test how LLMs solve real coding cases from LeetCode, AtCoder, and Codeforces. The latest benchmark consists of over 1,000 questions and evaluates syntax generation, algorithm design, and code efficiency, among other factors.

According to the vals.ai leaderboard updated on June 16, 2025, ChatGPT’s o4 Mini is the ultimate leader in all three categories: performance, budget, and speed.

LMArena

Previously known as Chatbot Arena, LMArena differs from most benchmarks because the evaluation is done by humans. Two anonymous models answer a prompt, and you get to decide which performed better. Afterwards, the identities of the LLMs are revealed.

Currently, LMArena tests models in six areas. Here’s how the leaderboards look:

  • Text – Gemini 2.5 Pro Preview
  • WebDev – Gemini 2.5 Pro Preview
  • Vision – Gemini 2.5 Pro Preview
  • Search – Gemini 2.5 Pro Grounding
  • Copilot – DeepSeek V2.5 (FIM)
  • Text-to-image – GPT-Image-1

Massive Multimodal Multidiscipline Understanding (MMMU)

MMMU uses complex, college-level tasks from six disciplines and 183 subfields to measure logical reasoning and perception. It also includes images, such as diagrams, charts, and maps. MMMU is considered to be one of the most difficult tests in terms of depth and breadth.

Gemini 2.5 Pro and ChatGPT o3 are the two leaders, surpassing even a medium-level human expert. However, the harder MMMU-Pro version puts all three human experts back at the top, with Seed 1.5-VL Thinking as the closest opponent.

Measuring Massive Multitask Language Understanding (MMLU)

The Pro version of MMLU uses more challenging questions to test the complex reasoning and language understanding of LLMs. It also provides ten possible answers instead of four, reducing the chance of guessing the right one. MMLU-Pro includes more than 12,000 questions across 14 domains, such as biology, computer science, history, and law, requiring extensive world knowledge.

According to the vals.ai LLM leaderboard, which was updated on May 30, 2025, Claude Opus 4 is at the top, followed by ChatGPT o3.

SWE-bench

SWE-bench tests a model’s ability to solve real-world coding problems. It includes over 2,000 tasks taken from GitHub issues, where the LLM has to modify the code to resolve the issue. SWE-bench then runs fail-to-pass tests for evaluation.

As of June 2025, Refact.ai Agent was in the leading position on the Lite version of the benchmark, closely followed by a combo of SWE-agent and Claude 4 Sonnet.

What are the key metrics for benchmarking?

There are plenty of different approaches when it comes to benchmarking AI systems. Here are some of the key metrics:

Accuracy-based metrics work best for tasks with definitive correct answers, such as multiple-choice questions. Benchmarks like MMLU (multitask accuracy) calculate the share of correct responses and are typically used to evaluate general model capabilities.

The recall metric measures how many of the relevant items (true positives) the model has found out of the full set of correct options.

The F1 score takes both precision and recall into account. A high F1 score means that the model performs well on both, while a low score indicates that it falls short on either precision or recall. This metric is used in benchmarks such as SQuAD that deal with natural language processing (NLP).

Exact match is a simple metric often used in NLP tasks, such as question answering. As the name implies, it only allows one correct answer. However, factors such as lowercase/uppercase, articles (a, an, the), and punctuation are usually not taken into account.
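
As a rough illustration of these metrics, here is how precision, recall, F1, and a SQuAD-style normalized exact match are commonly computed. This is a simplified sketch, not the exact code any particular benchmark ships with:

```python
import re
import string

def precision_recall_f1(predicted: set, relevant: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 over a set of predicted vs. relevant items."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and articles."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after normalization, so case, articles, and punctuation are ignored."""
    return normalize(prediction) == normalize(reference)

print(precision_recall_f1({"x", "y", "z"}, {"x", "y"}))  # ~(0.67, 1.0, 0.8)
print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
```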

Perplexity is one of the key metrics used in OpenWebText and similar benchmarks. It shows how well a model can predict a sequence of words. A low score means that the LLM is confident and makes accurate predictions. If the score is high, it means that the model chooses from a number of equally likely answers. 
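
For intuition, perplexity is typically the exponential of the average negative log-probability the model assigns to each token. A small sketch under that standard definition, with made-up token probabilities:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from the natural-log probabilities a model assigned to each token."""
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# A confident model puts high probability on each token -> low perplexity.
print(perplexity([math.log(0.9)] * 10))  # ~1.11
# An uncertain model spreads probability thinly -> high perplexity.
print(perplexity([math.log(0.2)] * 10))  # 5.0
```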

Overlap-based metrics are used when there can be multiple valid responses. These metrics compare shared words and phrases between the LLM’s output and reference answers. Two common examples are BLEU for machine translation and ROUGE for text summarization.

Functional evaluation process applies to coding benchmarks, such as HumanEval. This benchmark, developed by OpenAI, includes Python coding tasks and evaluates the chance that at least one of the model’s generated code samples passes the unit tests.
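
This "at least one sample passes" idea is usually reported as pass@k. Below is a sketch of the unbiased estimator popularized by the HumanEval paper, where n code samples are generated per problem and c of them pass the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n code samples generated per problem, c of them pass the tests."""
    if n - c < k:
        return 1.0  # every possible group of k samples contains a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for a problem, 54 pass the unit tests.
print(pass_at_k(200, 54, 1))   # equals 54/200 = 0.27
print(pass_at_k(200, 54, 10))  # chance at least one of 10 random samples passes
```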

AI-powered evaluation represents an approach where we use an LLM as a judge to assess response quality based on criteria like truthfulness, helpfulness, or human preference alignment.
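
A minimal sketch of the LLM-as-a-judge pattern follows. The rubric, the 1-5 scale, and the `complete` helper are assumptions made for illustration; real judge setups define far more detailed criteria:

```python
# LLM-as-a-judge sketch. The rubric, 1-5 scale, and `complete` helper are
# assumptions for illustration, not a standard interface.

JUDGE_PROMPT = (
    "Rate the answer below from 1 (poor) to 5 (excellent) for truthfulness "
    "and helpfulness. Reply with a single integer.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def complete(prompt: str) -> str:
    """Placeholder: send the prompt to the judge model and return its reply."""
    raise NotImplementedError("Wire this up to your model provider.")

def judge_score(question: str, answer: str) -> int:
    """Ask the judge model to grade a candidate answer."""
    reply = complete(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())
```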

Human evaluation offers a more qualitative approach, taking into account metrics such as semantic meaning or relevance. One of the most well-known benchmarks of this type is LMArena, where users do a blind test comparing two large language models.

To sum up, each benchmark usually has its own evaluation metric or metrics based on the methodology used. 

How do you use LLM benchmarks?

Using LLM benchmarks is simple, as most of them have explanations for each element. Let’s take this LLM leaderboard from LMArena as an example.

LLM leaderboard from LMArena

The first column lists all models that have been tested. The second column shows the overall best LLM, which, at the time of writing this article, was Gemini 2.5 Pro.

Finally, columns 3-9 let you see how models compare to each other in particular areas, such as coding or creative writing. Click each column to change the sorting accordingly.

Currently, Gemini 2.5 Pro is number one in all categories. However, LMArena allows other models to share the same place. Therefore, ChatGPT’s o3 ties it in math, and the same can be said about 4o and 4.5 Preview for multi-turn tasks that require maintaining context and coherence over a longer conversation.

What are the limitations of LLM benchmarking?

Even though LLM benchmarks provide valuable information about most models and their capabilities, they do have some limitations. Knowing these will help you avoid over-relying on benchmarking and make an informed decision about the right model for you.

Restricted scope and focus on known capabilities

Most benchmarks test models in areas where they are already known to be capable, which leaves little chance of discovering new capabilities. As a result, traditional benchmarking cannot capture the full potential of LLMs.

Short lifespan

Benchmarks become outdated quickly and lose relevance as soon as any LLM achieves a near-perfect or human-level score. This calls for new benchmarks, but the rapid improvement of the models forces us to play a never-ending catch-up game.

Data contamination

Data contamination happens when benchmark data unintentionally appears in the LLM’s training data because of data crawls or other reasons. In this case, the model might simply “remember” the correct answer instead of actually solving the problem.
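
One rough way to probe for contamination is to check whether long word n-grams from a benchmark item appear verbatim in training documents. The sketch below is a simplified illustration; real decontamination pipelines are considerably more involved:

```python
# Simplified contamination probe: flag a benchmark item if it shares a long
# word n-gram verbatim with a training document. Real decontamination
# pipelines are much more thorough than this.

def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams; 13-word windows are in the range some decontamination efforts use."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str, n: int = 13) -> bool:
    """True if any long n-gram from the benchmark item appears verbatim in the document."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))
```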

Overfitting

Overfitting occurs when the models are trained on data similar to that of the benchmark. It can also happen when the LLM is being trained for too long on a dataset that’s too small. In such a case, the model might memorize the answers and show a high score but perform poorly in real-world scenarios.

Limited real-world applicability

Even the best benchmarks struggle to simulate all possible scenarios of LLM usage, even when testing in specific areas like coding. As a result, a high-scoring model might still struggle when applied in practice.

Not fit for edge cases

While there are plenty of benchmarks for different tasks, they don’t cover all niche areas or highly specialized fields. And even if such tests are created, they often become outdated quickly, as even more generic benchmarks struggle to keep up with LLM advancements.

Inadequate for LLM applications

Benchmarks can evaluate the model itself well, but the same cannot be said for LLM applications, which often involve custom datasets and rules. Therefore, if a particular model scores highest in LMArena, that doesn’t mean it’s the best choice for your company’s customer support. In such cases, it’s best to build your own benchmark for evaluating chat assistants and other AI systems.
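
A custom benchmark for an LLM application can start as a simple list of domain-specific test cases with pass criteria. The sketch below uses invented customer-support examples and a placeholder `assistant_reply` function standing in for your actual assistant:

```python
# Sketch of an application-level benchmark: invented customer-support test
# cases with simple keyword pass criteria. assistant_reply is a placeholder
# for your actual chat assistant.

test_cases = [
    {"prompt": "How do I reset my password?", "must_mention": ["reset", "email"]},
    {"prompt": "Can I get a refund after 30 days?", "must_mention": ["refund"]},
]

def assistant_reply(prompt: str) -> str:
    """Placeholder: call your chat assistant and return its reply."""
    raise NotImplementedError("Wire this up to your application.")

def run_app_benchmark() -> float:
    """Share of test cases (0-100) where the reply mentions all required keywords."""
    passed = 0
    for case in test_cases:
        reply = assistant_reply(case["prompt"]).lower()
        if all(keyword in reply for keyword in case["must_mention"]):
            passed += 1
    return 100 * passed / len(test_cases)
```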

Conclusion

There are multiple benchmarks for testing large language models, and LLM researchers make sure that their numbers continue to grow. While existing benchmarks can show the strengths and limitations of LLMs, the results do not always translate well to the real world. Therefore, we encourage you to look at the LLM leaderboards with a healthy dose of skepticism because, ultimately, the best model is not the number one overall but the number one for you.

Karolis Pilypas Liutkevičius

Karolis Pilypas Liutkevičius is a journalist and editor exploring topics in the AI industry.
