Large Language Models

| Model Name | Company | Year | Parameters (Est.) | Context Length | Performance (MT-Bench / MMLU / HumanEval) | Multimodal |
|---|---|---|---|---|---|---|
| Claude 3 Opus | Anthropic | 2024 | Undisclosed | 200k | MT-Bench: 9.9 / MMLU: 86.8% / HumanEval: 88%+ | Yes |
| GPT-4.5 / GPT-4-turbo | OpenAI | 2023 | ~1.8T (rumored MoE) | 128k | MT-Bench: ~9.9 / MMLU: ~87% / HumanEval: ~83% | Yes |
| Gemini 1.5 Pro | Google DeepMind | 2024 | Undisclosed | 1M | MT-Bench: ~9.7 / MMLU: ~86% / HumanEval: ~80% | Yes |
| LLaMA 3 70B | Meta | 2024 | 70B | 8k | MT-Bench: ~8.9 / MMLU: 83.2% / HumanEval: ~74% | No |
| Grok-1.5 | xAI | 2024 | ~314B (MoE) | 128k | MT-Bench: ~8.7 / MMLU: ~80% / HumanEval: ~72% | No |
| Mistral Large | Mistral | 2024 | Undisclosed | 32k | MT-Bench: 8.6 / MMLU: ~81% / HumanEval: ~69% | No |
| Mixtral 8x7B | Mistral | 2023 | ~46.7B total (MoE, ~12.9B active) | 32k | MT-Bench: ~8.5 / MMLU: ~78% / HumanEval: ~65% | No |
| Command R+ | Cohere | 2024 | 104B | 128k | MT-Bench: ~8.4 / MMLU: ~79% / HumanEval: ~66% | No |
| Phi-3-mini | Microsoft | 2024 | 3.8B | 128k | MMLU: ~71% (no MT-Bench) | No |
| LLaMA 3 400B | Meta (internal) | 2024 | ~400B | 128k (rumored) | Not benchmarked | Yes |

Benchmark Standards

MT-Bench (Multi-Turn Benchmark)

  • Purpose: Measures multi-turn chat quality (helpfulness, reasoning, correctness).
  • How it works: Models answer a fixed set of multi-turn questions, and GPT-4 acts as an automated judge that grades each answer (see the sketch after this list).
  • Scoring: Scale from 1 to 10 (higher is better).
  • Focus: Human-like conversational ability, contextual understanding, and coherence across turns.
  • Used for: Ranking models on chat quality (e.g., Claude 3 Opus scored ~9.9/10).
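To make the judging loop concrete, here is a minimal Python sketch of the idea rather than the official MT-Bench harness: a hypothetical call_judge() stands in for the GPT-4 judge call, and the reported score is simply the mean rating over all conversations.

```python
# Minimal sketch of MT-Bench-style "LLM-as-a-judge" scoring.
from statistics import mean

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answers in the "
    "conversation below on a scale of 1 to 10. Reply with only the number.\n\n"
    "{conversation}"
)

def call_judge(prompt: str) -> str:
    # Hypothetical placeholder: a real harness would send `prompt` to the judge model (e.g. GPT-4).
    return "9"

def format_conversation(turns: list[dict]) -> str:
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)

def score_conversation(turns: list[dict]) -> float:
    reply = call_judge(JUDGE_PROMPT.format(conversation=format_conversation(turns)))
    return float(reply.strip())

def mt_bench_score(conversations: list[list[dict]]) -> float:
    # The reported score is the mean judge rating across all multi-turn questions.
    return mean(score_conversation(turns) for turns in conversations)

if __name__ == "__main__":
    demo = [[
        {"role": "user", "content": "Explain mixture-of-experts models in one sentence."},
        {"role": "assistant", "content": "An MoE model routes each token to a small subset of expert subnetworks."},
    ]]
    print(mt_bench_score(demo))  # -> 9.0 with the placeholder judge
```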

MMLU (Massive Multitask Language Understanding)

  • Purpose: Evaluates general knowledge and reasoning across 57 academic and professional subjects.
  • Format: Multiple-choice questions (like exams in history, math, law, biology, etc.).
  • Scoring: Reported as % accuracy (a scoring sketch follows this list).
  • Focus: Tests factual knowledge and problem-solving ability.
  • Used for: Benchmarking raw intelligence and breadth of understanding.
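The scoring itself is plain accuracy. The sketch below assumes a hypothetical ask_model() that returns the model's chosen letter; everything else is just comparing predictions against gold answers.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
CHOICES = "ABCD"

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: a real harness would return the model's chosen letter.
    return "B"

def format_question(q: dict) -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in zip(CHOICES, q["options"]))
    return f"{q['question']}\n{options}\nAnswer with a single letter."

def mmlu_accuracy(questions: list[dict]) -> float:
    # Score = fraction of questions where the predicted letter matches the gold answer.
    correct = sum(
        ask_model(format_question(q)).strip().upper().startswith(q["answer"])
        for q in questions
    )
    return correct / len(questions)

if __name__ == "__main__":
    sample = [{
        "question": "Which organelle produces most of a cell's ATP?",
        "options": ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"],
        "answer": "B",
    }]
    print(f"{mmlu_accuracy(sample):.0%}")  # -> 100% with the placeholder model
```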

HumanEval

  • Purpose: Evaluates code generation ability.
  • How it works: The model is given a prompt to write a function; the generated code is then run against the problem's unit tests (see the sketch after this list).
  • Scoring: % of problems solved correctly on first attempt (pass@1).
  • Focus: Programming accuracy, logic, and syntax in Python.
  • Used for: Measuring coding skills, often important for developer-focused models.
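A minimal sketch of the pass@1 computation is below. It is not the official OpenAI evaluation harness (which sandboxes execution and samples many completions to estimate pass@k); here one hypothetical completion per problem is executed directly with exec, so it should only be run on trusted code.

```python
# Minimal sketch of HumanEval-style pass@1 scoring.
# Each problem pairs a model-written completion with hidden unit tests (asserts);
# a problem counts as solved only if the tests raise no exception.
def passes_tests(completion: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # run the unit tests against it
        return True
    except Exception:
        return False

def pass_at_1(problems: list[dict]) -> float:
    # pass@1 = fraction of problems whose first sampled completion passes all tests.
    solved = sum(passes_tests(p["completion"], p["tests"]) for p in problems)
    return solved / len(problems)

if __name__ == "__main__":
    demo = [{
        "completion": "def add(a, b):\n    return a + b\n",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    }]
    print(f"pass@1 = {pass_at_1(demo):.0%}")  # -> 100%
```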

Summary Table

| Benchmark | Measures | Format | Score Type |
|---|---|---|---|
| MT-Bench | Multi-turn chat quality, coherence | Conversational Q&A | 1–10 scale |
| MMLU | General knowledge & reasoning | Multiple choice (57 subjects) | % correct |
| HumanEval | Code generation in Python | Function writing + test cases | % pass@1 |