Large Language Models

| Model Name | Company | Year | Parameters (Est.) | Context Length | Performance (MT-Bench / MMLU / HumanEval) | Multimodal |
|---|---|---|---|---|---|---|
| Claude 3 Opus | Anthropic | 2024 | Undisclosed | 200k | MT-Bench: 9.9 / MMLU: 86.8% / HumanEval: 88%+ | Yes |
| GPT-4.5 / GPT-4-turbo | OpenAI | 2023 | ~1.8T (rumored MoE) | 128k | MT-Bench: ~9.9 / MMLU: ~87% / HumanEval: ~83% | Yes |
| Gemini 1.5 Pro | Google DeepMind | 2024 | Undisclosed | 1M | MT-Bench: ~9.7 / MMLU: ~86% / HumanEval: ~80% | Yes |
| LLaMA 3 70B | Meta | 2024 | 70B | 8k | MT-Bench: ~8.9 / MMLU: 83.2% / HumanEval: ~74% | No |
| Grok-1.5 | xAI | 2024 | ~314B (MoE) | 128k | MT-Bench: ~8.7 / MMLU: ~80% / HumanEval: ~72% | No |
| Mistral Large | Mistral | 2024 | Undisclosed | 32k | MT-Bench: 8.6 / MMLU: ~81% / HumanEval: ~69% | No |
| Mixtral 8x7B | Mistral | 2023 | ~46.7B total (MoE, ~12.9B active) | 32k | MT-Bench: ~8.5 / MMLU: ~78% / HumanEval: ~65% | No |
| Command R+ | Cohere | 2024 | 104B | 128k | MT-Bench: ~8.4 / MMLU: ~79% / HumanEval: ~66% | No |
| Phi-3-mini | Microsoft | 2024 | 3.8B | 128k | MMLU: ~71% (no MT-Bench) | No |
| LLaMA 3 400B | Meta (internal) | 2024 | ~400B | 128k (rumored) | Not benchmarked | Yes |

Benchmark Standards

MT-Bench (Multi-Turn Benchmark)

  • Purpose: Measures multi-turn chat quality (helpfulness, reasoning, correctness).
  • How it works: Models answer a fixed set of multi-turn questions, and GPT-4 acts as an automated judge that grades each answer (see the sketch after this list).
  • Scoring: Scale from 1 to 10 (higher is better).
  • Focus: Human-like conversational ability, contextual understanding, and coherence across turns.
  • Used for: Ranking models on chat quality (e.g., Claude 3 Opus scored ~9.9/10).
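To make the judging loop concrete, here is a minimal Python sketch of the idea rather than the official MT-Bench harness: a hypothetical call_judge() stands in for the GPT-4 judge call, and the reported score is simply the mean rating over all conversations.

```python
# Minimal sketch of MT-Bench-style "LLM-as-a-judge" scoring.
from statistics import mean

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answers in the "
    "conversation below on a scale of 1 to 10. Reply with only the number.\n\n"
    "{conversation}"
)

def call_judge(prompt: str) -> str:
    # Hypothetical placeholder: a real harness would send `prompt` to the judge model (e.g. GPT-4).
    return "9"

def format_conversation(turns: list[dict]) -> str:
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)

def score_conversation(turns: list[dict]) -> float:
    reply = call_judge(JUDGE_PROMPT.format(conversation=format_conversation(turns)))
    return float(reply.strip())

def mt_bench_score(conversations: list[list[dict]]) -> float:
    # The reported score is the mean judge rating across all multi-turn questions.
    return mean(score_conversation(turns) for turns in conversations)

if __name__ == "__main__":
    demo = [[
        {"role": "user", "content": "Explain mixture-of-experts models in one sentence."},
        {"role": "assistant", "content": "An MoE model routes each token to a small subset of expert subnetworks."},
    ]]
    print(mt_bench_score(demo))  # -> 9.0 with the placeholder judge
```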

MMLU (Massive Multitask Language Understanding)

  • Purpose: Evaluates general knowledge and reasoning across 57 academic and professional subjects.
  • Format: Multiple-choice questions (like exams in history, math, law, biology, etc.).
  • Scoring: Reported as % accuracy (a scoring sketch follows this list).
  • Focus: Tests factual knowledge and problem-solving ability.
  • Used for: Benchmarking raw intelligence and breadth of understanding.
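The scoring itself is plain accuracy. The sketch below assumes a hypothetical ask_model() that returns the model's chosen letter; everything else is just comparing predictions against gold answers.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
CHOICES = "ABCD"

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: a real harness would return the model's chosen letter.
    return "B"

def format_question(q: dict) -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in zip(CHOICES, q["options"]))
    return f"{q['question']}\n{options}\nAnswer with a single letter."

def mmlu_accuracy(questions: list[dict]) -> float:
    # Score = fraction of questions where the predicted letter matches the gold answer.
    correct = sum(
        ask_model(format_question(q)).strip().upper().startswith(q["answer"])
        for q in questions
    )
    return correct / len(questions)

if __name__ == "__main__":
    sample = [{
        "question": "Which organelle produces most of a cell's ATP?",
        "options": ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"],
        "answer": "B",
    }]
    print(f"{mmlu_accuracy(sample):.0%}")  # -> 100% with the placeholder model
```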

HumanEval

  • Purpose: Evaluates code generation ability.
  • How it works: The model is given a prompt to write a function; the generated code is then run against the problem's unit tests (see the sketch after this list).
  • Scoring: % of problems solved correctly on first attempt (pass@1).
  • Focus: Programming accuracy, logic, and syntax in Python.
  • Used for: Measuring coding skills, often important for developer-focused models.
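A minimal sketch of the pass@1 computation is below. It is not the official OpenAI evaluation harness (which sandboxes execution and samples many completions to estimate pass@k); here one hypothetical completion per problem is executed directly with exec, so it should only be run on trusted code.

```python
# Minimal sketch of HumanEval-style pass@1 scoring.
# Each problem pairs a model-written completion with hidden unit tests (asserts);
# a problem counts as solved only if the tests raise no exception.
def passes_tests(completion: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # run the unit tests against it
        return True
    except Exception:
        return False

def pass_at_1(problems: list[dict]) -> float:
    # pass@1 = fraction of problems whose first sampled completion passes all tests.
    solved = sum(passes_tests(p["completion"], p["tests"]) for p in problems)
    return solved / len(problems)

if __name__ == "__main__":
    demo = [{
        "completion": "def add(a, b):\n    return a + b\n",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    }]
    print(f"pass@1 = {pass_at_1(demo):.0%}")  # -> 100%
```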

Summary Table

| Benchmark | Measures | Format | Score Type |
|---|---|---|---|
| MT-Bench | Multi-turn chat quality, coherence | Conversational Q&A | 1–10 scale |
| MMLU | General knowledge & reasoning | Multiple choice (57 subjects) | % correct |
| HumanEval | Code generation in Python | Function writing + test cases | % pass@1 |