
FrugalGPT: What is it, and how does it cut costs while maintaining quality?

FrugalGPT is a framework focused on cost-efficient large language model (LLM) usage. It enables AI orchestration of multiple models to deliver cost-optimized responses through query routing. The framework is a solution for organizations that need to balance latency, accuracy, and spend in their AI usage. In this article, we discuss what the FrugalGPT framework is, how it can be implemented to reduce costs while maintaining quality, and which techniques it uses.


10/10/2025

9 min read

What is FrugalGPT?

FrugalGPT is an algorithmic framework that selects which generative LLM APIs to use for each query, reducing cost and improving performance. The key idea is that it routes queries to the smallest, cheapest model that can still deliver the needed quality; when higher accuracy is required, it escalates to more powerful models. The framework matches the right request to the right model. As a result, users get reduced cost with no loss of accuracy where it counts, and avoid possible AI security risks.

FrugalGPT was created by Lingjiao Chen, Matei Zaharia, and James Zou from Stanford University. The authors observed that large language model APIs vary widely in pricing, as well as in performance and output quality. Their paper introduces FrugalGPT, a proposed framework that aims to reduce costs while improving accuracy and performance for users.

Based on their work, FrugalGPT can match the performance of the best large language models while reducing spending at the same time. The findings presented in early benchmarks show that smaller models, guided by FrugalGPT, can deliver results that rival those of bigger models at a fraction of the inference cost.

The framework utilizes a stack of optimization techniques that work at different stages of the LLM request lifecycle. Each technique addresses a specific type of inefficiency: repetitive model calls, poor routing decisions, and token overspend. FrugalGPT cuts cost while improving performance across every model call, which isn’t too different from AI orchestration platforms that support multiple LLMs. 

How does FrugalGPT work? 

FrugalGPT works through layered decision logic. The framework uses the following techniques:

  • Classifies requests by complexity. Simple requests go to cheaper models, while complex ones go to advanced models.
  • Selects a model tier based on past performance and cost trade-offs. The framework uses historical data to decide which model delivers good quality for similar queries at a reduced cost.
  • Caches outputs for repeated queries. If the exact or a near-identical request has been made before, FrugalGPT returns the cached response instantly.
  • Escalates to larger language models when smaller ones fall below quality thresholds. If a cheaper model doesn't provide the required quality, FrugalGPT escalates to stronger models automatically until the response meets the bar.

Your teams don't change workflows. FrugalGPT runs behind the scenes, routing, caching, and escalating so that every token spent delivers measurable value. Smart routing can lower cost and improve accuracy at scale, and inference cost falls as FrugalGPT reuses prior results and learns which less expensive model can deliver similar output quality.
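To make this routing logic concrete, here is a minimal Python sketch of complexity-based routing with a simple cache. It is not the authors' reference implementation: the model names, the call_model placeholder, and the complexity heuristic are illustrative assumptions.

MODEL_TIERS = ["small-model", "medium-model", "large-model"]  # cheapest to priciest

def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real provider API call.
    return f"[{model}] answer to: {prompt[:40]}"

def classify_complexity(prompt: str) -> int:
    # Toy heuristic: longer or multi-step prompts count as more complex.
    score = 0
    if len(prompt.split()) > 200:
        score += 1
    if any(w in prompt.lower() for w in ("analyze", "compare", "explain why")):
        score += 1
    return min(score, len(MODEL_TIERS) - 1)

cache: dict[str, str] = {}

def route(prompt: str) -> str:
    if prompt in cache:                      # repeated query: reuse the cached answer
        return cache[prompt]
    tier = MODEL_TIERS[classify_complexity(prompt)]
    response = call_model(tier, prompt)      # cheapest tier that fits the complexity
    cache[prompt] = response                 # store for future identical requests
    return response

In a real deployment, the heuristic would be replaced by a learned classifier or historical quality data, but the control flow stays the same: check the cache, estimate complexity, and pick the cheapest adequate tier.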

The following table explains the optimization techniques used by FrugalGPT. These methods are employed throughout the entire process to reduce cost and improve accuracy.


Technique | Function
LLM approximation | Uses the smallest capable model
Prompt optimization | Reduces input tokens
Caching | Eliminates repeated calls
LLM Cascade | Smart escalation by request complexity

Connecting all the above functions makes it possible to significantly reduce token usage and spending.

How does FrugalGPT reduce costs and maintain quality?

Without a solution like FrugalGPT, most organizations overpay for their artificial intelligence needs. A $0.12-per-1K-token model answers even trivial requests, and when you scale that to millions of calls, LLM costs grow tremendously. With FrugalGPT, most of those calls are routed to lightweight models, while only mission-critical queries reach top tiers, ensuring high accuracy. This way, the framework employs LLMs sustainably.

How does FrugalGPT ensure quality using cheaper models? It combines prompt adaptation, LLM approximation, and LLM Cascade, rounded out by model fine-tuning. Cost optimization happens before, during, and after a large language model request. FrugalGPT also leverages examples from your own data to continuously improve accuracy and predict which model tier best suits different queries.

First, the framework optimizes prompts before sending them to the model. Writing shorter prompts optimizes the process, because short prompts require fewer tokens. Then, it escalates the requests to a higher tier only if needed (LLM Cascade technique). LLM Cascade means the request doesn’t use the most advanced (and expensive) model right away. Finally, models are improved over time by fine-tuning on your organization’s data.

FrugalGPT offers a stack of strategies applied at every stage: pre-request prompt adaptation cuts token usage, LLM Cascade and approximation route queries to cheaper models at run time, and post-request fine-tuning reduces future dependence on high-cost large language models. In short, FrugalGPT helps enterprises operate LLMs sustainably, lowering cost and improving performance.

Prompt adaptation and reduction with FrugalGPT

The key element for cost savings in FrugalGPT is smarter prompts: reducing duplicates and adjusting queries so cheaper models can handle them. As a result, context is retained and answers come from cheaper models, all while keeping cost low and improving the performance of each individual LLM call.

FrugalGPT prefers shorter prompts

The framework trims and optimizes prompts to reduce tokens while keeping the context intact. Long prompts mean high token usage, so FrugalGPT pulls only the relevant information into the query. Let's look at a practical example:


Unoptimized prompt (1100 tokens):
You are a financial analyst. Read the following 5,000-word quarterly report for Corporation X. Identify the company's main growth drivers, mention specific product segments that increased revenue year-over-year, summarize challenges, and suggest what management might focus on in the next quarter.
[Full 5,000-word report]
Include a summary of no more than 200 words.

Optimized prompt (250 tokens):
You are a financial analyst. Based on the extracted sections below from Corporation X's Q2 report (revenue summary, product breakdown, and management commentary), identify key growth drivers, main challenges, and priorities for next quarter. [Retrieved snippets: relevant information only] Output a 200-word summary.

Prompt optimization lets you remove irrelevant parts and analyze only what needs processing. This way, the document is stripped of information that is not useful for the analysis. The goal stays the same, but the number of tokens used decreases significantly.
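As a rough illustration of how such trimming could work, the sketch below selects only the report sections most relevant to the question before building the prompt. It uses simple keyword overlap as the relevance score, which is an assumption made for brevity; production systems typically use embedding-based retrieval instead.

def select_relevant_sections(document: str, question: str, top_k: int = 3) -> list[str]:
    # Split the document into sections and keep the ones that share the most
    # words with the question (a stand-in for real retrieval).
    question_terms = set(question.lower().split())
    sections = [s.strip() for s in document.split("\n\n") if s.strip()]
    ranked = sorted(
        sections,
        key=lambda s: len(question_terms & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_compact_prompt(document: str, question: str) -> str:
    snippets = "\n\n".join(select_relevant_sections(document, question))
    return (
        "You are a financial analyst. Using only the excerpts below, "
        f"{question}\n\n{snippets}\n\nOutput a 200-word summary."
    )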

Combine the shorter, similar requests

As often happens, users enter the same or very similar requests multiple times. FrugalGPT uses caching to avoid generating a new answer for each of these similar prompts: it merges overlapping queries into a single, cached request. The outcome is reduced cost, because related queries are served from already-generated replies.
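A minimal sketch of this idea, assuming a simple normalization step, might cache responses under a cleaned-up version of the prompt so that trivially different phrasings hit the same entry. The normalization rules and call_model placeholder below are illustrative assumptions.

import re

def call_model(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"   # placeholder for a real API call

def normalize(prompt: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace so near-identical
    # requests map to the same cache key.
    cleaned = re.sub(r"[^a-z0-9\s]", "", prompt.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

response_cache: dict[str, str] = {}

def answer(prompt: str) -> str:
    key = normalize(prompt)
    if key in response_cache:
        return response_cache[key]                 # cache hit: no new tokens spent
    result = call_model("small-model", prompt)
    response_cache[key] = result
    return result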

FrugalGPT is not the only tool that uses cost optimization techniques. At nexos.ai, the platform allows building a prompt library, reducing the need to enter the same queries over and over again while improving performance. This is especially helpful when employees send closely related requests. Another feature that cuts down on re-creating similar prompts is Projects, which keeps prompts together, organized, and easily available for everyone.


Optimize your prompt to better utilize a smaller model

How can prompts be optimized? By adapting the wording for smaller models. FrugalGPT handles this challenge by, for example, rewriting a vague prompt into a direct, structured query that a cheap model can process. A vague request like "Tell me about our sales" might become "List Q2 revenue by region in five bullet points, using the figures below", which a lightweight model can answer reliably.

nexos.ai, as an AI workspace, also uses a similar technique of prompt optimization by routing queries to the most cost-effective individual LLM. As a result, prompts aren’t analyzed by high-tier and expensive LLM providers. The platform also supports prompt engineering optimization to reduce computational complexity, resulting in lower token usage. All this is safe and secure thanks to custom AI guardrails.

LLM approximation as a key strategy

LLM approximation enables the use of a smaller and less expensive model to approximate what more advanced large language models would produce. FrugalGPT won't call GPT-4 or another advanced LLM for every query. The framework learns when a lighter model or a cached output is acceptable to deliver similar value at a lower price.

The process works as follows:

  • Output comparison: the framework benchmarks smaller models against large ones for specific tasks. If the output is of good quality, the system uses the smaller model for future calls.
  • Cache and reuse: when identical or related inputs have already been processed, FrugalGPT uses the cached response instead of generating a new one.

With LLM approximation, you can use the smallest capable model and minimize the high cost of calling larger ones.
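The output-comparison step could be prototyped along the lines below. The agreement metric, the 0.8 threshold, and the call_model placeholder are illustrative assumptions; a real setup would score answers against labeled examples or a stronger similarity measure.

def call_model(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"   # placeholder for a real API call

def agreement(answer_a: str, answer_b: str) -> float:
    # Toy similarity: share of overlapping words between the two answers.
    a, b = set(answer_a.lower().split()), set(answer_b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def pick_default_model(sample_queries: list[str], threshold: float = 0.8) -> str:
    scores = [
        agreement(call_model("small-model", q), call_model("large-model", q))
        for q in sample_queries
    ]
    average = sum(scores) / max(len(scores), 1)
    # Prefer the cheaper model for this task if it tracks the large one closely.
    return "small-model" if average >= threshold else "large-model"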

Read on to learn about the details of the approximation technique.

Utilize caching of LLM requests

Imagine a team working on the same task and doing research. They would probably search for the same info and enter very similar prompts. Each prompt uses tokens. But with caching, the usage can be significantly reduced. By optimizing model selection and prompt length, FrugalGPT helps lower cost and improve accuracy.

LLM request caching produces quality outputs by reusing already generated answers. FrugalGPT caches related requests and serves the outcome instantly, so recurring queries get low- or zero-cost, low-latency results. As a result, FrugalGPT drives LLM sustainability by cutting unnecessary computation, reducing energy use, and ensuring that every token spent delivers measurable business value.

Apart from the prompt library and Projects mentioned under prompt adaptation, nexos.ai also implements intelligent caching as a core function of the platform. It prevents duplicate API calls for identical or similar prompts, cutting down on token usage. This function combines exact matching and semantic caching to serve cached responses when it detects a repeated query.

For example, when a team member enters a prompt, such as “What are the techniques for optimizing prompts?”, different variations of the query would use cached results. This means that requests like “prompt optimization techniques” or “how to optimize prompts” wouldn’t call for a new output, but use an existing one.
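Semantic caching of this kind can be sketched as follows. The embed() function below is a bag-of-words stand-in for a real embedding model, and the 0.9 similarity threshold is an assumption chosen purely for illustration.

import math

def embed(text: str) -> dict[str, float]:
    # Stand-in embedding: a bag-of-words vector. A real system would call an
    # embedding model here.
    vector: dict[str, float] = {}
    for word in text.lower().split():
        vector[word] = vector.get(word, 0.0) + 1.0
    return vector

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

semantic_cache: list[tuple[dict[str, float], str]] = []

def lookup(prompt: str, threshold: float = 0.9) -> str | None:
    query_vector = embed(prompt)
    for stored_vector, stored_answer in semantic_cache:
        if cosine(query_vector, stored_vector) >= threshold:
            return stored_answer   # close enough in meaning: reuse the answer
    return None                    # miss: generate a new answer and store it

def store(prompt: str, answer: str) -> None:
    semantic_cache.append((embed(prompt), answer))

With this scheme, "prompt optimization techniques" and "how to optimize prompts" would likely land above the similarity threshold and return the same stored response instead of triggering a new model call.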

Start using LLM Cascade

Not every request needs to call a top-tier model. The cascade technique escalates to more advanced large language models only if needed. First, the framework tries to obtain a satisfactory answer from a cheap model. If the answer is not good enough, the prompt goes to a higher level until the output quality is met. This way, the quality is ensured, and costs are reduced.

The LLM Cascade is a decision engine that routes tasks across multiple models. The step-by-step process involves:

  1. Starting from a small model: the engine routes the request to a lightweight model (e.g., GPT-3.5).
  2. Output evaluation: if the result meets a specific threshold for confidence and relevance, the process stops here. If not, it moves to the next step.
  3. Escalating (when necessary): the request is escalated to a stronger model when a smaller one misses context or falls short.

For example, a small model can handle a routine task, such as a summary of last month’s expenses for the financial team. When the prompt is more complex, such as analyzing a summary across subsidiaries, the query is automatically escalated to a high-accuracy model. FrugalGPT reduces cost while improving performance, routing each query to the most efficient model tier.
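A minimal sketch of this cascade loop is shown below. The score_response heuristic stands in for the learned generation scoring function the FrugalGPT paper describes, and the model names, 0.7 quality bar, and call_model placeholder are assumptions for illustration.

CASCADE = ["small-model", "medium-model", "large-model"]   # cheapest to strongest

def call_model(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"            # placeholder API call

def score_response(prompt: str, response: str) -> float:
    # Toy quality score in [0, 1]; the real framework trains a scorer instead.
    if not response.strip():
        return 0.0
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return min(overlap / 10, 1.0)

def cascade_answer(prompt: str, quality_bar: float = 0.7) -> str:
    response = ""
    for model in CASCADE:
        response = call_model(model, prompt)
        if score_response(prompt, response) >= quality_bar:
            return response        # good enough: stop before the expensive tiers
    return response                # otherwise return the strongest model's answer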

Combining approximation, caching, and cascade forms a closed optimization loop. LLM approximation finds the cheapest model, caching removes redundant requests, and LLM Cascade helps when bigger large language models are required. With FrugalGPT, you can run large-scale deployments at the same cost while improving performance through selective routing and adaptive caching.

FrugalGPT: Conclusion

FrugalGPT changes the way of thinking about AI efficiency. With all its features, it enables intelligent spending of every token. With prompt adaptation, LLM approximation, caching, and LLM Cascade, the framework delivers cost-effective token use without sacrificing quality. Instead of one expensive model for everything, FrugalGPT routes different queries to different tiers, optimizing LLM costs while improving performance across workloads.

FrugalGPT matches the right model to the right task, minimizing extra spending on routine requests. It eliminates redundant calls and delivers zero-cost responses thanks to caching. LLM Cascade guarantees quality with smart escalation. Together, they balance cost, quality, and accuracy. This approach allows enterprises to use large language models more strategically by deploying less expensive models for standard tasks while reserving premium models for high-stakes reasoning.

nexos.ai uses similar functions to decrease spending. Caching and prompt optimization ensure no unnecessary calls are made, and the Projects feature provides proper context for replies. Moreover, the AI Workspace with multiple models simplifies cost tracking with token spend monitoring, while the AI Gateway helps optimize the use of large language models, letting teams work with multiple LLM APIs and tailor model instances to specific use cases.

All these features significantly reduce spending when using multiple large language models, while also improving performance and maintaining quality. FrugalGPT proposes a smarter way to use large language models by matching each task to the smallest model that can handle it, achieving cost efficiency without sacrificing performance.

Karolis Pilypas Liutkevičius

Karolis Pilypas Liutkevičius is a journalist and editor exploring topics in the AI industry.

Run all your enterprise AI in one AI platform.

Be one of the first to see nexos.ai in action — request a demo below.