Breaking the GPT Benchmark: Unveiling Limitations & Achieving 89.0% on MMLU

Table of Contents

  • Introduction
  • The Limitations of GPT Models
  • The Smart GPT Framework
  • The Power of Prompts and Self-Reflection
  • Benchmarking with Smart GPT
  • The Massive Multitask Language Understanding (MMLU) Benchmark
  • Uncovering Issues in the MMLU
  • Improving Performance with Smart GPT
  • Challenges with Benchmarking
  • The Need for Independent Benchmarking
  • Practical Applications of Smart GPT
  • Conclusion
  • Resources

Breaking the Benchmark: Unveiling the Limitations of GPT Models

Can we truly rely on the impressive capabilities of GPT models? In this article, we dive deep into the world of language models, exploring the boundaries and exposing the limitations that even giants like OpenAI and Google seem to be unaware of. Join us on a journey through our experiments and discoveries as we showcase how you can benefit from our findings in unexpected domains like medicine. Get ready to break the benchmark with Smart GPT!


Since late April, machine learning engineer Josh Stapleton and I have been tirelessly evaluating over 120,000 answers from GPT models. In my original Smart GPT video, I demonstrated that the popular belief that GPT4 is unintelligent was misleading: with the right prompting, it can answer complex questions correctly. Little did we know that our experiments with GPT4 would uncover a multitude of mistakes in an official benchmark, the Massive Multitask Language Understanding (MMLU) test. These revelations have raised concerns even among industry leaders like OpenAI and Google. By the end of this article, I will show you how you can benefit from our experiments and harness the power of GPT models in various domains, including medicine.

The Limitations of GPT Models

In the world of artificial intelligence and machine learning, language models like GPT4 are often hailed as the pinnacle of human-like understanding. Our extensive evaluation paints a more nuanced picture: GPT4 is far from unintelligent, yet the existing benchmarks, including those run by OpenAI and Google, fail to fully gauge what these models can do. In the following sections, we explore the Smart GPT framework and its ability to enhance the performance of GPT models.

The Smart GPT Framework

Smart GPT is a framework that leverages recent prompt engineering research to unlock better performance from language models like GPT4. By incorporating progressive thinking and optimized prompts, we can push the boundaries of what these models achieve. In my previous video, I discussed the importance of getting the model to think and reflect before committing to a final answer. This step, combined with prompt optimization and self-dialogue, has been shown to significantly boost performance across a variety of tasks.
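The core loop described above can be sketched in a few lines. This is a minimal illustration, not the exact Smart GPT implementation: `ask_model` is a placeholder for whatever chat-completion API you use, and the prompt wording is illustrative.

```python
# Minimal sketch of a Smart GPT-style answer pipeline (illustrative only).
# `ask_model` is a stub standing in for a real chat-completion API call.

def ask_model(prompt: str) -> str:
    # Stub: replace with a call to your LLM API of choice.
    return "Final answer: B"

STEP_PROMPT = "Let's work this out step by step to be sure we have the right answer."
REFLECT_PROMPT = ("You are a researcher. List the flaws in each draft answer below, "
                  "then state the single best final answer.")

def smart_answer(question: str, n_drafts: int = 3) -> str:
    # 1. Sample several independent step-by-step drafts.
    drafts = [ask_model(f"{question}\n\n{STEP_PROMPT}") for _ in range(n_drafts)]
    # 2. Have the model reflect on the drafts and resolve them into one answer.
    joined = "\n".join(f"Draft {i + 1}: {d}" for i, d in enumerate(drafts))
    return ask_model(f"{question}\n\n{joined}\n\n{REFLECT_PROMPT}")
```

The two-stage shape (draft, then reflect and resolve) is the part that matters; the number of drafts and the exact prompt texts are tuning knobs.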

The Power of Prompts and Self-Reflection

Through our manual experiments, we have discovered that the strategic use of prompts and self-reflection can greatly enhance the capabilities of GPT models. By optimizing prompts and facilitating self-dialogue within the model, we observed a remarkable improvement in tasks involving formal logic and college mathematics. These findings highlight the potential of prompt engineering in elevating the performance of language models. However, incorporating these techniques in a systematic and benchmarkable manner poses unique challenges.
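One concrete, benchmarkable way to turn self-dialogue into a systematic procedure is self-consistency voting: sample several chains of thought and take the majority answer. Here is a sketch; the answer-extraction regex assumes each completion ends with a bare A-D letter, which is an assumption about output format, not a guarantee.

```python
# Self-consistency voting over several sampled chains of thought.
# Assumes each completion contains a standalone uppercase letter A-D.
import re
from collections import Counter

def extract_choice(text: str):
    # Pull the last standalone A-D letter from a worked answer.
    matches = re.findall(r"\b([A-D])\b", text)
    return matches[-1] if matches else None

def majority_vote(completions):
    votes = [extract_choice(c) for c in completions]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

# Three sampled chains of thought disagreeing 2-1:
print(majority_vote(["...so the answer is C", "The answer is B", "Therefore C"]))  # C
```

Voting like this is easy to benchmark because it is deterministic given the sampled completions, unlike free-form self-dialogue.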

Benchmarking with Smart GPT

Benchmarks play a crucial role in evaluating the performance of language models like GPT4. To accurately assess the potential of GPT models using the Smart GPT framework, we needed a reliable benchmark that aligned with our approach. That's where the Massive Multitask Language Understanding (MMLU) benchmark came into play.

The Massive Multitask Language Understanding (MMLU) Benchmark

The MMLU benchmark is widely recognized as one of the best measures of language model performance. It comprises over 14,000 questions across 57 subjects, providing a broad evaluation of a model's understanding. As a testament to its importance, MMLU features prominently in the GPT4 technical report, and a high score on it is widely treated as a milestone on the road to artificial general intelligence (AGI).
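For orientation, an MMLU-style evaluation reduces to formatting four-choice questions and comparing the model's letter against the answer key. Below is a toy harness; the row format and `answer_fn` are my own assumptions for illustration, not the official MMLU loader.

```python
# Toy MMLU-style scoring harness. Each row is (question, options, gold_letter);
# `answer_fn` stands in for a model call that returns a single letter A-D.

def format_question(question, options):
    lines = [question] + [f"{letter}. {opt}" for letter, opt in zip("ABCD", options)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(rows, answer_fn):
    correct = sum(answer_fn(format_question(q, opts)) == gold for q, opts, gold in rows)
    return correct / len(rows)

rows = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is largest?", ["Mars", "Venus", "Jupiter", "Mercury"], "C"),
]
print(accuracy(rows, lambda prompt: "B"))  # a constant "B" guesser scores 0.5 here
```

The simplicity of this loop is exactly why errors in the answer key matter so much: a wrong gold letter silently penalizes correct model answers.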

Uncovering Issues in the MMLU

Our extensive evaluation of GPT models against the MMLU benchmark exposed numerous mistakes and inaccuracies, and not only in GPT4's answers: the benchmark itself is flawed. Questions drawn from otherwise reputable sources were missing key statements or paired with incorrect reference answers. In total, we found over 80 discrepancies in the test, enough to meaningfully shift the final results. These findings cast doubt on the reliability of both the MMLU benchmark and the scores of the models tested on it.

Improving Performance with Smart GPT

Despite the challenges and limitations we encountered during benchmarking, we were still able to achieve remarkable results. Using the Smart GPT framework, we reached an unofficial record-breaking score of 88.4% on the MMLU benchmark, surpassing OpenAI's reported score of 86.4%. We are confident there are further ways to boost performance with existing models, and with continued research and innovation we anticipate breaking the widely coveted 95% threshold.
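To put that gain in perspective, a two-point improvement on a test of roughly 14,000 questions corresponds to on the order of 280 additional correct answers (a back-of-envelope figure assuming the full question set is scored):

```python
# Back-of-envelope: extra questions answered correctly when moving
# from 86.4% to 88.4% on a ~14,000-question test.
n_questions = 14_000
extra_correct = (0.884 - 0.864) * n_questions
print(round(extra_correct))  # 280
```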

Challenges with Benchmarking

Our journey through benchmarking GPT models revealed several challenges that hinder a comprehensive evaluation of their capabilities. The existing benchmarks, including the MMLU, suffer from ambiguities, factual errors, and unclear answer choices. These shortcomings weaken the integrity of the benchmarks and limit our understanding of the true potential of GPT models. To overcome these challenges, we propose the establishment of an independent benchmarking organization to ensure unbiased and accurate assessments.

The Need for Independent Benchmarking

To truly measure the capabilities of GPT models and future AI systems, we advocate for the creation of an independent professional benchmarking organization. This organization, ideally funded by top AGI labs and supported by education companies like Pearson, can design comprehensive tests and benchmarks that cover a broad range of subjects with meticulous vetting and blind human grading processes. By standardizing benchmarking procedures, we can better understand the limits and capabilities of language models and foster more reliable assessments of their performance.

Practical Applications of Smart GPT

The Smart GPT framework has practical applications beyond benchmarking. Our experiments have demonstrated how optimizing prompts, leveraging self-reflection, and utilizing exemplars can significantly improve performance in various domains. While we do not suggest relying solely on GPT models for critical tasks like medical diagnosis, the incorporation of prompt engineering techniques can prove valuable in enhancing decision-making processes in diverse fields. From business ethics to security studies, GPT models can provide nuanced insights and aid in complex problem-solving.


Conclusion

Our extensive evaluation of GPT models and the MMLU benchmark has shed light on the limitations and potential of language models like GPT4. By leveraging the power of the Smart GPT framework, we have achieved unprecedented results and exposed flaws in widely used benchmarks. As AI continues to evolve, it is crucial to develop robust benchmarking standards and strive for independent evaluation to ensure an accurate understanding of AI capabilities. With further research and innovation, we can harness the full potential of GPT models and push the boundaries of artificial intelligence.


Resources

  1. GPT4 Technical Report - link
  2. Massive Multitask Language Understanding (MMLU) Benchmark - link
  3. Smart GPT GitHub Repository - link
  4. Oxford University Press - link
  5. Google Minerva Paper - link
  6. Aqua Rat Benchmark - link
  7. AGI Eval Benchmark - link
  8. Legal Bench - link
  9. Side Bench - link
  10. Helm Benchmark - link
