Large Language Models
| Model Name | Company | Year | Parameters (Est.) | Context Length | Performance (MT-Bench / MMLU / HumanEval) | Multimodal |
|---|---|---|---|---|---|---|
| Claude 3 Opus | Anthropic | 2024 | Undisclosed | 200k | MT-Bench: 9.9 / MMLU: 86.8% / HumanEval: 84.9% | Yes |
| GPT-4 Turbo | OpenAI | 2023 | ~1.8T (MoE, rumored) | 128k | MT-Bench: ~9.3 / MMLU: ~87% / HumanEval: ~83% | Yes |
| Gemini 1.5 Pro | Google DeepMind | 2024 | Undisclosed | 1M | MT-Bench: ~9.7 / MMLU: ~86% / HumanEval: ~80% | Yes |
| LLaMA 3 70B | Meta | 2024 | 70B | 8k | MT-Bench: ~8.9 / MMLU: 83.2% / HumanEval: ~74% | No |
| Grok-1.5 | xAI | 2024 | ~314B (MoE) | 128k | MT-Bench: ~8.7 / MMLU: ~80% / HumanEval: ~72% | No |
| Mistral Large | Mistral | 2024 | Undisclosed | 32k | MT-Bench: 8.6 / MMLU: ~81% / HumanEval: ~69% | No |
| Mixtral 8x7B | Mistral | 2023 | 46.7B total, ~12.9B active (MoE) | 32k | MT-Bench: ~8.5 / MMLU: ~78% / HumanEval: ~65% | No |
| Command R+ | Cohere | 2024 | 104B | 128k | MT-Bench: ~8.4 / MMLU: ~79% / HumanEval: ~66% | No |
| Phi-3-mini (3.8B) | Microsoft | 2024 | 3.8B | 128k | MMLU: ~71% (no MT-Bench) | No |
| LLaMA 3 400B | Meta (internal) | 2024 | 400B | 128k (rumored) | Not benchmarked | Yes |
Benchmark Standards
MT-Bench (Multi-Turn Benchmark)
- Purpose: Measures multi-turn chat quality (helpfulness, reasoning, correctness).
- How it works: The model answers a fixed set of 80 two-turn questions; GPT-4 acts as the judge and grades each response (single-answer grading, with a pairwise-comparison mode also available).
- Scoring: Judge-assigned ratings on a scale of 1 to 10, averaged across questions and turns (see the sketch after this list).
- Focus: Human-like conversational ability, contextual understanding, and coherence across turns.
- Used for: Ranking models on chat quality (top frontier models score above 9/10).
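
MT-Bench-style scoring can be reproduced in spirit with a short LLM-as-judge loop. The sketch below is a minimal illustration rather than the official harness (which lives in LMSYS's FastChat repository); `JUDGE_PROMPT`, the `call_judge` callable, and the conversation format are all assumptions made here for illustration.

```python
import re
import statistics

# Hypothetical judge prompt in the spirit of MT-Bench's single-answer grading.
JUDGE_PROMPT = (
    "Please act as an impartial judge and rate the assistant's answer to the "
    "user's question on a scale of 1 to 10 for helpfulness, relevance, "
    "accuracy, and depth. Reply in the form: Rating: [[X]]\n\n"
    "Question: {question}\n\nAssistant's answer: {answer}"
)

def extract_rating(judge_reply: str):
    """Pull the numeric score out of a judge reply such as 'Rating: [[8]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_reply)
    return float(match.group(1)) if match else None

def mt_bench_style_score(conversations, call_judge) -> float:
    """Average the judge's 1-10 ratings over every turn of every conversation.

    `conversations` is a list of conversations, each a list of
    (question, answer) turns; `call_judge` is a hypothetical callable that
    sends a prompt to the judge model and returns its text reply.
    """
    ratings = []
    for turns in conversations:
        for question, answer in turns:
            reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
            rating = extract_rating(reply)
            if rating is not None:
                ratings.append(rating)
    return statistics.mean(ratings)
```

The final benchmark number is simply the mean of these per-turn ratings, which is why MT-Bench results are reported on the same 1-10 scale the judge uses.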
MMLU (Massive Multitask Language Understanding)
- Purpose: Evaluates general knowledge and reasoning across 57 academic and professional subjects.
- Format: Four-option multiple-choice questions, styled like exam questions in history, math, law, biology, and other subjects.
- Scoring: Reported as % accuracy over all questions (a minimal computation is sketched after this list).
- Focus: Tests factual knowledge and problem-solving ability.
- Used for: Benchmarking raw intelligence and breadth of understanding.
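
Scoring MMLU reduces to plain multiple-choice accuracy. Below is a minimal sketch, assuming each item carries a question, four lettered choices, and a gold label; the `MCQItem` shape and the `answer_fn` callable are hypothetical, and production harnesses (e.g., lm-evaluation-harness) typically compare per-choice log-likelihoods rather than parsing generated text.

```python
from dataclasses import dataclass

CHOICE_LABELS = "ABCD"

@dataclass
class MCQItem:
    question: str
    choices: list[str]   # four options, labelled A-D
    answer: str          # gold label, e.g. "C"

def format_prompt(item: MCQItem) -> str:
    """Render one item in the usual question-plus-lettered-choices style."""
    lines = [item.question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, item.choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def mmlu_style_accuracy(items: list[MCQItem], answer_fn) -> float:
    """Fraction of items whose predicted letter matches the gold label.

    `answer_fn` is a hypothetical callable: prompt in, answer text out.
    Either way of scoring (parsing a letter or ranking choice likelihoods)
    reduces to the same %-correct number.
    """
    correct = sum(
        answer_fn(format_prompt(item)).strip().upper().startswith(item.answer)
        for item in items
    )
    return correct / len(items)
```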
HumanEval
- Purpose: Evaluates code generation ability.
- How it works: The model is given a function signature and docstring and must complete the function body; the completion is run against the problem's unit tests (164 hand-written Python problems in total; see the sketch after this list).
- Scoring: % of problems solved correctly on first attempt (pass@1).
- Focus: Programming accuracy, logic, and syntax in Python.
- Used for: Measuring coding skills, often important for developer-focused models.
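
Conceptually, pass@1 is just "run each completed function against its unit tests and count the fraction that pass." The sketch below assumes HumanEval-style problem records with `prompt`, `test`, and `entry_point` fields and a hypothetical `generate` callable; the official human-eval harness additionally sandboxes execution with timeouts and uses an unbiased estimator when sampling multiple completions per problem.

```python
def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if the model's completion passes the problem's unit tests.

    NOTE: exec'ing untrusted model output directly, as done here for brevity,
    is only safe in a throwaway environment; real harnesses sandbox this step.
    """
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    scope: dict = {}
    try:
        exec(program, scope)   # defines the function, then runs the asserts in check()
        return True
    except Exception:
        return False

def pass_at_1(problems, generate) -> float:
    """pass@1: fraction of problems solved by a single sampled completion.

    `problems` is an iterable of dicts with 'prompt', 'test', and
    'entry_point' fields; `generate` is a hypothetical callable mapping a
    prompt to one code completion.
    """
    problems = list(problems)
    solved = sum(
        passes_tests(p["prompt"], generate(p["prompt"]), p["test"], p["entry_point"])
        for p in problems
    )
    return solved / len(problems)
```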
Summary Table
| Benchmark | Measures | Format | Score Type |
|---|---|---|---|
| MT-Bench | Multi-turn chat quality, coherence | Conversational Q&A | 1–10 scale |
| MMLU | General knowledge & reasoning | Multiple choice (57 subjects) | % correct |
| HumanEval | Code generation in Python | Function writing + test cases | % pass@1 |
