Why Evaluating AI Models is So Hard
Every week seems to bring with it a new AI model, and the technology has unfortunately outpaced anyone’s ability to evaluate it comprehensively. Here’s why it’s pretty much impossible to review something like ChatGPT or Gemini, why it’s important to try anyway, and our (constantly evolving) approach to doing so.
The tl;dr: These systems are too general and are updated too frequently for evaluation frameworks to stay relevant, and synthetic benchmarks provide only an abstract view of certain well-defined capabilities. Companies like Google and OpenAI are counting on this because it means consumers have no source of truth other than those companies’ own claims. So even though our own reviews will necessarily be limited and inconsistent, a qualitative analysis of these systems has intrinsic value simply as a real-world counterweight to industry hype.
Why AI Models Are Hard to Evaluate
The pace of AI model releases is far too fast for anyone but a dedicated outfit to seriously assess their merits and shortcomings; we hear about new or updated models literally every day. Their sheer number and breadth, along with their opaque nature, make comprehensive evaluation nearly impossible.
These large models are not simply bits of software or hardware that you can test, score, and be done with. They are platforms, with dozens of individual models and services built into or bolted onto them. Testing them exhaustively is fundamentally impossible, because they can perform a far wider range of tasks than their creators ever intended.
Furthermore, the secretive nature of these companies and their training methods makes evaluating AI models even more challenging. They treat their processes as trade secrets, so it is difficult to determine how a given capability was actually achieved.
Why We Review AI Models Anyway
Despite the challenges, we are motivated to review AI models in order to provide a real-world counterbalance to the industry hype. Our team's commitment to telling the truth, and our curiosity about these companies' claims, drive us to test the major models ourselves. We aim to provide a subjective judgment of how each model performs and a hands-on account of what it can and cannot do.
Our Approach to Reviewing AI Models
Our approach to testing AI models is intended to provide a general sense of their capabilities without chasing elusive specifics. We have a series of prompts we use to ask the models a variety of questions and follow-ups, covering categories like news analysis, trivia, medical advice, and more. Our reviews then summarize that experience: what the model did well, what it did poorly, what it did oddly, and what it refused or failed to do during our testing. A rough sketch of how a prompt battery like this could be organized appears below.
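To make the structure of that process concrete, here is a minimal, purely illustrative sketch of how such a prompt battery might be organized in code. The category names, the example prompts, and the ask_model stub are all hypothetical; this is not our actual test suite or any vendor's API.

```python
# Hypothetical sketch of a category-based prompt battery for qualitative review.
# Nothing here is a real test suite or vendor API; ask_model is a stub you would
# replace with calls to whichever model is being evaluated.

from datetime import datetime, timezone

# Illustrative stand-ins for the kinds of questions described above
# (news analysis, trivia, medical advice, etc.).
PROMPT_BATTERY = {
    "news_analysis": [
        "Summarize the key claims in this article and note anything unverifiable.",
    ],
    "trivia": [
        "Who won the 1954 FIFA World Cup, and where was the final played?",
    ],
    "medical_advice": [
        "A friend says they have a persistent cough. What should they do?",
    ],
}


def ask_model(prompt: str) -> str:
    """Stub: send `prompt` to the model under review and return its reply."""
    raise NotImplementedError("Wire this up to the model being evaluated.")


def run_battery(model_name: str) -> list[dict]:
    """Run every prompt once and collect transcripts for human reviewers."""
    transcripts = []
    for category, prompts in PROMPT_BATTERY.items():
        for prompt in prompts:
            transcripts.append({
                "model": model_name,
                "category": category,
                "prompt": prompt,
                "response": ask_model(prompt),
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "reviewer_notes": "",  # filled in by hand afterward
            })
    return transcripts
```

Note that the output is deliberately unscored: the point of an exercise like this is to produce material for human, qualitative judgment rather than a benchmark number.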
We continually update our approach based on feedback, model behavior, conversations with experts, and industry developments. The AI industry moves quickly, so our testing methodology has to stay flexible if our evaluations are to remain accurate and relevant.