The hype around AI agents has been immense in the past six months. We are witnessing not only new developments and techniques that enhance the performance of AI agents but also several benchmarks that evaluate them on accuracy, reasoning, precision, and more.
But what about the cost?
AI agents call the underlying Large Language Models (LLMs) several times, which increases the overall cost considerably. And if we compare the cost of a single-agent system to that of a multi-agent system, the latter can easily be 10 times more expensive, because a multi-agent system calls the LLM many more times, or runs multiple rounds, to execute a single action.
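To make that concrete, here is a back-of-the-envelope sketch in Python. Every number in it (price per token, tokens per call, number of agents and rounds) is a hypothetical assumption chosen purely for illustration, not a measured figure from any provider.

```python
# Illustrative back-of-the-envelope comparison. All numbers are hypothetical
# assumptions, not measured prices or token counts.
PRICE_PER_1K_TOKENS = 0.01   # assumed blended input/output price in USD
TOKENS_PER_CALL = 2_000      # assumed average tokens per LLM call

def run_cost(llm_calls: int) -> float:
    """Cost of completing one task, given how many LLM calls it takes."""
    return llm_calls * TOKENS_PER_CALL / 1_000 * PRICE_PER_1K_TOKENS

single_agent = run_cost(llm_calls=1)       # one call per task
multi_agent = run_cost(llm_calls=2 * 5)    # e.g. 2 agents x 5 rounds of debate

print(f"single-agent: ${single_agent:.3f} per task")
print(f"multi-agent:  ${multi_agent:.3f} per task "
      f"({multi_agent / single_agent:.0f}x the cost)")
```

Under these assumptions the multi-agent run costs ten times as much per task simply because it makes ten LLM calls where the single agent makes one.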
Focusing only on achieving higher accuracy results in needlessly costly AI agents. It also deviates from the real goal: developing AI agents for business use, not for topping a leaderboard.
There is a need for benchmarks that consider both accuracy and cost. This article delves into the shortcomings of current AI agent benchmarks and potential solutions to enhance their effectiveness in real-world applications.
If we look at the top LLMs for building AI agents, we will come across names like:
· GPT-4
· GPT-4o
· Claude-3
· PaLM 2
· Llama 3.1
These LLMs are in a constant race to improve their benchmark scores and rank higher on the leaderboard. But at what cost? No one asks that.
Claude 3.5 Sonnet outperforms GPT-4 on several benchmarks but still cannot match its accuracy. Claude 3.5 Sonnet scores 88.7% on MMLU, 92% on HumanEval, and 91.6% on multilingual math (MGSM). However, its accuracy is 44%, far below GPT-4's 77%; its F1 score is 70.83% and its precision is 85%.
Speaking of precision, GPT-4o has the highest precision (86.21%) among these LLMs. It also posts outstanding scores on the top benchmarks, including MMLU (85.7%), HumanEval (90.2%), and MGSM (90.3%), along with an F1 score of 81.5%.
Evaluated on the same benchmarks, Llama 3.1 405B also delivers very respectable results: 88.6% on MMLU, 91.6% on MGSM, 89% on HumanEval, and an F1 score of 84.8%.
These are impressive numbers, and it is easy to be taken aback by the performance levels LLMs have managed to achieve. But when we talk about real-world implementation of these LLMs, or of AI agents built on them, the story takes a darker turn.
From a business perspective, a major reason for shifting to AI agents is cost. You can run a one- or two-person sales team and integrate AI agents into your processes to automate and streamline everything, removing the need for large in-house teams. So the operational cost of AI agents must be affordable, so that businesses actually save money rather than pouring huge sums into the pursuit of efficiency and accuracy.
The main challenges in current AI agent benchmarks that hinder their effectiveness and applicability in real-world scenarios are as follows:
1. Narrow Focus on Accuracy: Many benchmarks prioritize accuracy as the sole metric, neglecting other important factors such as cost and robustness. This narrow focus can lead to the development of unnecessarily complex and expensive agents, while also fostering misconceptions about the true sources of accuracy improvements.
2. Conflation of Benchmarking Needs: There is a lack of clarity regarding the distinct needs of model developers versus downstream users. This conflation makes it difficult to determine which agents are best suited for specific applications, leading to potential mismatches between agent capabilities and user requirements.
3. Inadequate Holdout Sets: Many benchmarks do not include sufficient holdout sets for evaluation. This inadequacy can result in agents that overfit the benchmark data, taking shortcuts that compromise their generalizability and robustness in real-world applications.
4. Shortcuts and Overfitting: The design of certain benchmarks enables agents to take shortcuts, leading to overfitting. This means that agents may perform well on benchmark tests but fail to generalize effectively to new, unseen tasks or environments.
5. Lack of Standardization: There is a pervasive lack of standardization in evaluation practices across different benchmarks, which contributes to reproducibility issues. This inconsistency can inflate accuracy estimates and create an overly optimistic view of agent capabilities.
6. Cost Control Issues: The stochastic nature of language models means that simply calling the underlying model multiple times can artificially boost accuracy without reflecting true performance (see the toy illustration below). This can lead to evaluations that do not account for the operational costs associated with running AI agents.
By addressing these challenges, one can improve the design and evaluation of AI agents, ensuring they are not only accurate on benchmarks but also effective and practical for real-world applications.
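To see why the cost-control point matters, consider a toy simulation (not taken from any paper; the 40% single-call success rate is an arbitrary assumption). Wrapping a stochastic model in a naive retry loop steadily lifts measured accuracy, while the average number of LLM calls per task, and hence the bill, climbs just as fast.

```python
import random

random.seed(0)

# Toy simulation: a stochastic "model" that solves any task 40% of the time --
# an arbitrary assumed success rate. A naive retry wrapper lifts measured
# accuracy with every extra call, and the cost rises with it.
def stochastic_model(task: str) -> bool:
    return random.random() < 0.40

def best_of_k(task: str, k: int) -> tuple[bool, int]:
    """Retry up to k times; return (solved, number of LLM calls used)."""
    for attempt in range(1, k + 1):
        if stochastic_model(task):
            return True, attempt
    return False, k

tasks = [f"task-{i}" for i in range(1_000)]
for k in (1, 3, 10):
    results = [best_of_k(task, k) for task in tasks]
    accuracy = sum(solved for solved, _ in results) / len(tasks)
    avg_calls = sum(calls for _, calls in results) / len(tasks)
    print(f"k={k:2d}  accuracy={accuracy:5.1%}  avg LLM calls per task={avg_calls:.2f}")
```

A benchmark that reports only the accuracy column would rank the k=10 configuration as the best "agent", even though it is several times more expensive to run for the same underlying model.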
The research paper “AI Agents That Matter” proposes a framework for jointly optimizing accuracy and cost in AI agents.
Here’s a detailed breakdown of the approach suggested:
Joint Optimization Strategy
1. Pareto Frontier Visualization: The authors advocate for visualizing the trade-offs between accuracy and cost using a Pareto frontier. This visualization helps in identifying agent designs that achieve better performance while minimizing costs. An agent is on the Pareto frontier if no other agent performs better on both dimensions simultaneously (a minimal code sketch follows this list).
2. Modification of Existing Frameworks: The paper discusses modifications made to the DSPy framework to facilitate this joint optimization. By adjusting parameters and optimizing hyperparameters, the authors demonstrate that it is possible to lower operational costs while maintaining a comparable level of accuracy.
3. Cost Components: The total cost of running an AI agent is categorized into fixed and variable costs:
· Fixed Costs: These are one-time expenses associated with optimizing the agent's design, such as tuning hyperparameters.
· Variable Costs: These are incurred each time the agent is executed and depend on factors like the number of tokens processed. The paper emphasizes that as usage increases, variable costs become more significant.
4. Trade-offs Between Costs: By investing more in the initial design and optimization phase (fixed costs), developers can reduce ongoing operational costs (variable costs). This approach encourages finding efficient prompts and examples that maintain accuracy without incurring excessive costs during execution.
5. Empirical Demonstration: The authors provide empirical evidence by applying their optimized framework to benchmarks like HotPotQA, showing that their approach can yield agents that are both cost-effective and accurate.
6. Avoiding Overfitting: The framework also addresses issues related to overfitting to benchmarks by ensuring that evaluations consider diverse holdout sets and realistic scenarios, which helps in developing agents that generalize better beyond specific test cases.
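Below is a minimal Python sketch that ties points 1, 3, and 4 together. The agent configurations, accuracies, and dollar figures are invented for illustration; only the logic (total cost = fixed cost + variable cost × number of runs, then keep the non-dominated designs) reflects the idea described above.

```python
from dataclasses import dataclass

# Hypothetical agent designs: names, accuracies, and dollar figures are made
# up for illustration. Total cost = one-time fixed cost + per-run variable cost.
@dataclass
class AgentDesign:
    name: str
    accuracy: float       # benchmark accuracy, 0-1
    fixed_cost: float     # one-time optimisation cost (USD)
    cost_per_run: float   # variable cost per task (USD)

    def total_cost(self, runs: int) -> float:
        return self.fixed_cost + self.cost_per_run * runs

designs = [
    AgentDesign("few-shot, long prompt",    0.61, fixed_cost=0,   cost_per_run=0.050),
    AgentDesign("optimised short prompt",   0.60, fixed_cost=200, cost_per_run=0.008),
    AgentDesign("multi-agent debate",       0.63, fixed_cost=0,   cost_per_run=0.400),
    AgentDesign("verbose chain-of-thought", 0.58, fixed_cost=0,   cost_per_run=0.120),
]

def pareto_frontier(designs: list[AgentDesign], runs: int) -> list[AgentDesign]:
    """Keep only designs that no other design beats on both accuracy and cost."""
    frontier = []
    for d in designs:
        dominated = any(
            o.accuracy >= d.accuracy
            and o.total_cost(runs) <= d.total_cost(runs)
            and (o.accuracy > d.accuracy or o.total_cost(runs) < d.total_cost(runs))
            for o in designs
        )
        if not dominated:
            frontier.append(d)
    return frontier

for d in pareto_frontier(designs, runs=100_000):
    print(f"{d.name}: accuracy {d.accuracy:.0%}, total cost ${d.total_cost(100_000):,.0f}")
```

At 100,000 runs, the "verbose chain-of-thought" design drops off the frontier because another design is at least as accurate and cheaper; the designs that remain represent genuine accuracy-versus-cost trade-offs a downstream user can choose between.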
By implementing these strategies, we can foster the development of AI agents that are not only effective in achieving high accuracy but are also economical to operate, ultimately making them more viable for real-world applications.
Here are some recommendations we can consider:
1. Incorporate Cost Metrics Alongside Accuracy
Joint Optimization: There is a need for a framework that jointly optimizes accuracy and cost, allowing for a more balanced evaluation of agent performance. This approach encourages the development of agents that are not only accurate but also cost-effective in real-world usage scenarios.
2. Design Improved Benchmarks
Adequate Holdout Sets: Benchmarks should include diverse and sufficient holdout sets to prevent overfitting and ensure that agents can generalize well to unseen tasks. This helps in assessing how agents perform in varied conditions similar to real-world applications.
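A minimal sketch of what this can look like in practice (the task IDs and the 40% holdout fraction are placeholder assumptions): split the benchmark once with a fixed seed, tune the agent only against the development split, and report the score on tasks it has never been tuned against.

```python
import random

# Minimal sketch of a benchmark with a proper holdout. Task IDs and the 40%
# holdout fraction are placeholder assumptions; the point is that tuning only
# ever sees the dev split, and the headline score comes from the holdout split.
def split_benchmark(task_ids: list[str], holdout_fraction: float = 0.4, seed: int = 42):
    rng = random.Random(seed)
    shuffled = task_ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]   # (dev set, holdout set)

tasks = [f"task-{i:03d}" for i in range(500)]
dev_set, holdout_set = split_benchmark(tasks)

def evaluate(agent, task_set: list[str]) -> float:
    """Fraction of tasks the agent solves; `agent` is any callable task -> bool."""
    return sum(agent(task) for task in task_set) / len(task_set)

# Iterate on prompts and hyperparameters against dev_set only ...
# ... then report evaluate(agent, holdout_set) as the headline accuracy.
```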
3. Distinguish Benchmarking Needs
Tailored Evaluation for Developers and Users: Evaluations should clearly differentiate between the benchmarking requirements of model developers and those of downstream users. For downstream users, evaluations should focus on practical metrics such as operational costs, rather than relying solely on proxies like model parameters.
4. Standardization of Evaluation Practices
Establish Consistent Protocols: There is a call for standardized evaluation protocols to enhance reproducibility and reliability across different benchmarks. This standardization would help ensure that benchmarks accurately measure what they intend to assess.
5. Avoid Shortcuts in Benchmarking
Prevent Overfitting: To stop agents from taking shortcuts that poorly designed benchmarks invite, benchmark authors should adopt a principled framework that encourages robust evaluation practices. This includes defining clear levels of generality for agents and using the appropriate type of holdout sample for the desired outcome.
6. Empirical Demonstration and Validation
Use Empirical Analysis: Empirical analyses should be conducted to validate claims about agent performance on benchmarks, ensuring that results reflect true capabilities rather than inflated accuracy due to methodological flaws.
By implementing these strategies, we can build AI agents that are not only effective on benchmarks but also practical and useful in real-world applications, ultimately bridging the gap between theoretical evaluations and real-world performance.
Despite the establishment of a fundamental technical architecture for AI agents, benchmarking practices remain in their infancy, with best practices yet to be defined. This situation complicates the ability to distinguish between authentic advancements and inflated claims regarding AI capabilities.
AI agents introduce considerable complexity to model-based implementations, necessitating a novel approach to evaluation.
To tackle these challenges, initial recommendations include:
· Implementing cost-aware comparisons,
· Distinguishing between model evaluation and downstream task performance,
· Utilizing appropriate holdout sets to mitigate shortcuts,
· Standardizing evaluation methodologies.
These measures aim to enhance the rigor of agent benchmarking and lay a robust foundation for future developments.
Have a groundbreaking AI business idea?
Struggling to find the right tech partner to unlock AI's benefits for your business?
I’m here to help. With decades of experience in data science, machine learning, and AI, I have led my team to build top-notch tech solutions for reputed businesses worldwide.
Let’s discuss how to propel your business in my DM!
If you are into AI, LLMs, Digital Transformation, and the Tech world – do follow me on LinkedIn.
Explore the frontiers of innovation in Artificial Intelligence, breaking barriers and forging new paths that redefine possibilities and transform the way we perceive and engage with the world.
At Ampcome, we engineer smart solutions that redefine industries, shaping a future where innovations and possibilities have no bounds.