Large Language Model

LLMs are large neural networks (usually transformer-based) trained to take human-like language as input and produce human-like language as output; their apparent reasoning ability emerges from probabilistic next-token prediction.
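
The "probabilistic next-token prediction" above can be sketched in miniature: at each step the model scores every vocabulary token (a logit), a softmax turns those scores into a probability distribution, and the next token is sampled or greedily chosen. The toy vocabulary and logits below are made up for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token distribution for the context "The cat sat on the"
vocab  = ["mat", "dog", "moon", "chair"]
logits = [4.0, 1.0, 0.5, 2.0]

probs = softmax(logits)
greedy = vocab[probs.index(max(probs))]   # greedy decoding picks the argmax
print(greedy)                             # "mat"
```

Sampling from `probs` instead of taking the argmax is what makes LLM outputs non-deterministic from run to run.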

| Model Name | Company | Year | Parameters (Est.) | Context Length | Performance (MT-Bench / MMLU / HumanEval) | Multimodal |
| --- | --- | --- | --- | --- | --- | --- |
| Claude 3 Opus | Anthropic | 2024 | Undisclosed | 200k | MT-Bench: 9.9 / MMLU: 86.8% / HumanEval: 88%+ | Yes |
| GPT-4.5 / GPT-4-turbo | OpenAI | 2023 | ~1.8T (MoE) | 128k | MT-Bench: ~9.9 / MMLU: ~87% / HumanEval: ~83% | Yes |
| Gemini 1.5 Pro | Google DeepMind | 2024 | Undisclosed | 1M | MT-Bench: ~9.7 / MMLU: ~86% / HumanEval: ~80% | Yes |
| LLaMA 3 70B | Meta | 2024 | 70B | 8k | MT-Bench: ~8.9 / MMLU: 83.2% / HumanEval: ~74% | No |
| Grok-1.5 | xAI | 2024 | ~314B (MoE) | 128k | MT-Bench: ~8.7 / MMLU: ~80% / HumanEval: ~72% | No |
| Mistral Large | Mistral | 2024 | Undisclosed | 32k | MT-Bench: 8.6 / MMLU: ~81% / HumanEval: ~69% | No |
| Mixtral 8x7B | Mistral | 2023 | 46.7B total, ~12.9B active (MoE) | 32k | MT-Bench: ~8.5 / MMLU: ~78% / HumanEval: ~65% | No |
| Command R+ | Cohere | 2024 | 104B | 128k | MT-Bench: ~8.4 / MMLU: ~79% / HumanEval: ~66% | No |
| Phi-3-mini (3.8B) | Microsoft | 2024 | 3.8B | 128k | MMLU: ~71% (no MT-Bench) | No |
| LLaMA 3 400B | Meta (internal) | 2024 | 400B | 128k (rumored) | Not benchmarked | Yes |

What LLMs do better than existing software

  1. Understanding human intent from natural-language prompts. Traditional software is limited to structured inputs: specifically defined APIs that must be called with exact parameters. LLMs can take vague inputs such as “build me a dashboard showing customer churn trends”. This greatly disrupts the UX of existing software.
  2. Handling unstructured data. LLMs can ingest PDFs, emails, images, etc.
  3. Reasoning / decomposing tasks. This is still an emerging ability. LLMs can break problems into steps, suggest what data is needed, and choose tools (APIs, databases, search).
  4. Generating artifacts: (a) code and (b) documents such as Word files or PowerPoint decks.
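
Point 3 (decomposing a vague request into tool steps) can be sketched as below. The planner here is a keyword-matching stand-in for the LLM; in a real system the model itself emits the plan, and the tool names are hypothetical:

```python
# Available tools the planner can choose from (names are illustrative).
TOOLS = {
    "sql":    "query the customer database",
    "search": "look up external information",
    "chart":  "render a visualization",
}

def plan(request: str) -> list[str]:
    """Return an ordered list of tool steps for a natural-language request."""
    steps = []
    text = request.lower()
    if "churn" in text or "customer" in text:
        steps.append("sql")        # fetch the underlying data first
    if "dashboard" in text or "trend" in text:
        steps.append("chart")      # then visualize it
    return steps

print(plan("build me a dashboard showing customer churn trends"))
# ['sql', 'chart']
```

The point is architectural, not the keyword matching: the vague request gets mapped to an ordered sequence of concrete, structured tool calls.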

Where LLMs are actually weak

  • Accuracy / hallucination
  • Determinism
  • Real-time data access (without tools)
  • Precise computation
  • Long multi-step workflows without orchestration
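
The last weakness is why production systems wrap the model in an external orchestration loop: the caller, not the model, tracks state, validates each step's output, and retries on failure. A minimal sketch, with a deterministic stub standing in for the real LLM call:

```python
def fake_llm(step: str) -> str:
    """Stub for an LLM call; a real system would call a model API here."""
    return {"extract": "churn=5%", "summarize": "Churn is 5%."}.get(step, "")

def run_workflow(steps):
    """Drive a multi-step workflow from outside the model."""
    results = {}
    for step in steps:
        out = fake_llm(step)
        if not out:                 # validate each step; could retry instead
            raise RuntimeError(f"step {step!r} failed")
        results[step] = out
    return results

print(run_workflow(["extract", "summarize"]))
```

Because the loop lives outside the model, each step can be checked, logged, and retried independently, which the model alone cannot guarantee.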

LLMs in different verticals


General-purpose chat assistants

| Company | Product | Underlying Model | Primary Use | Differentiation |
| --- | --- | --- | --- | --- |
| OpenAI | ChatGPT | GPT-4.1 / GPT-4o | General chat, reasoning | Best ecosystem + tool use |
| Anthropic | Claude | Claude 3.x | Long-context chat | Strong safety + context window |
| Google | Gemini | Gemini 1.5 | Multimodal assistant | Native search + YouTube |
| Microsoft | Copilot | GPT-4 (Azure) | Enterprise chat | Deep M365 integration |
| Meta | Meta AI | LLaMA 3 | Consumer chat | Distribution via WhatsApp |

Coding assistants

| Company | Product | Model | Target User | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | Codex | Codex / GPT-4 | Developers | Code-native reasoning |
| GitHub | Copilot | GPT-4 | Developers | IDE-embedded, massive adoption |
| Anthropic | Claude for Code | Claude 3 | Backend / infra | Strong refactoring |
| Google | Gemini Code Assist | Gemini | Enterprise devs | GCP + IDEs |
| Replit | Replit AI | GPT-4 / Claude | Solo builders | End-to-end app build |

Productivity / office suites

| Company | Product | Vertical | Differentiation |
| --- | --- | --- | --- |
| Microsoft | Copilot for M365 | Docs, Excel, Email | Deep workflow lock-in |
| Google | Gemini for Workspace | Docs, Sheets | Search + data leverage |
| OpenAI | GPTs (Custom) | Internal tools | Low-code AI apps |
| Notion | Notion AI | Knowledge mgmt | Context-aware writing |

Healthcare

| Company | Product | Model | Use Case | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 Med | GPT-4 | Clinical reasoning | Research / pilots |
| Google | Med-PaLM | PaLM | Diagnosis support | Strong research pedigree |
| Microsoft | Azure Health AI | GPT-4 | Clinical workflows | HIPAA + compliance |
| Epic Systems | AI Clinical Tools | GPT-4 | EHR-embedded | Distribution moat |

Legal

| Company | Product | Model | Target User | Advantage |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 Legal | GPT-4 | Law firms | Strong reasoning |
| Harvey | Harvey AI | GPT-4 | Big Law | Workflow-specific |
| Thomson Reuters | CoCounsel | Proprietary + GPT | Legal research | Data moat |

Finance

| Company | Product | Model | Use Case | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 Finance | GPT-4 | Analysis / modeling | General reasoning |
| Bloomberg | BloombergGPT | BloombergGPT | Financial NLP | Proprietary data edge |
| Microsoft | Copilot for Finance | GPT-4 | Excel, modeling | CFO-focused |
| Databricks | DBRX | DBRX | Internal analytics | Open, fine-tunable |

Creative / media generation

| Company | Product | Model | Modality |
| --- | --- | --- | --- |
| OpenAI | DALL·E | DALL·E 3 | Image |
| OpenAI | Sora | Sora | Video |
| Adobe | Firefly | Proprietary | Image / Video |
| Meta | Emu | Emu | Image |

Agents & automation

| Company | Product | Concept | Why It Matters |
| --- | --- | --- | --- |
| OpenAI | Assistants API | Tool-using agents | Infra-level control |
| Microsoft | Copilot Studio | Enterprise agents | Workflow automation |
| Salesforce | Einstein Copilot | CRM agents | Sales + service lock-in |
| UiPath | Autopilot | Agentic automation | RPA + LLM convergence |

Benchmark Standards

MT-Bench (Multi-Turn Benchmark)

  • Purpose: Measures multi-turn chat quality (helpfulness, reasoning, correctness).
  • How it works: Model responses to multi-turn questions are graded by GPT-4 acting as a judge (single-answer grading, or pairwise comparison of two models).
  • Scoring: Scale from 1 to 10.
  • Focus: Human-like conversational ability, contextual understanding, and coherence across turns.
  • Used for: Ranking models on chat quality (e.g., Claude 3 Opus scored ~9.9/10).
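
The scoring mechanics above reduce to averaging the judge's per-turn grades. A minimal sketch with made-up judge scores:

```python
def mt_bench_score(turn_scores):
    """Average the judge's per-turn grades (each on a 1-10 scale)."""
    return sum(turn_scores) / len(turn_scores)

judge_scores = [9, 10, 8, 9]          # hypothetical grades from the GPT-4 judge
print(mt_bench_score(judge_scores))   # 9.0
```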

MMLU (Massive Multitask Language Understanding)

  • Purpose: Evaluates general knowledge and reasoning across 57 academic and professional subjects.
  • Format: Multiple-choice questions (like exams in history, math, law, biology, etc.).
  • Scoring: Reported as % accuracy.
  • Focus: Tests factual knowledge and problem-solving ability.
  • Used for: Benchmarking raw intelligence and breadth of understanding.
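
MMLU scoring is plain multiple-choice accuracy: the fraction of questions where the model's chosen letter matches the answer key. The predictions and key below are made up:

```python
def mmlu_accuracy(predictions, answer_key):
    """Fraction of multiple-choice answers matching the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

preds = ["A", "C", "B", "D", "C"]     # model's chosen letters
key   = ["A", "C", "B", "A", "C"]     # ground-truth answers
print(f"{mmlu_accuracy(preds, key):.0%}")  # 80%
```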

HumanEval

  • Purpose: Evaluates code generation ability.
  • How it works: Model is given a prompt to write a function; its output is tested against unit tests.
  • Scoring: % of problems solved correctly on first attempt (pass@1).
  • Focus: Programming accuracy, logic, and syntax in Python.
  • Used for: Measuring coding skills, often important for developer-focused models.
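
The pass@1 mechanism above can be sketched as a tiny harness: execute the model's generated function, run it against hidden unit tests, and count first-attempt passes. The problem ("write `add`"), the completion, and the tests are illustrative stand-ins, not real HumanEval data:

```python
def run_candidate(code: str, tests) -> bool:
    """Exec a generated function and check it against unit tests."""
    ns = {}
    try:
        exec(code, ns)                      # load the generated function
        for args, expected in tests:
            if ns["add"](*args) != expected:
                return False
        return True
    except Exception:
        return False                        # crashes count as failures

completion = "def add(a, b):\n    return a + b\n"   # the model's output
tests = [((1, 2), 3), ((-1, 1), 0)]                 # hidden unit tests

results = [run_candidate(completion, tests)]  # one first-attempt sample per problem
pass_at_1 = sum(results) / len(results)
print(pass_at_1)  # 1.0
```

The real benchmark runs 164 such problems in a sandbox; pass@1 is simply the share of problems whose first generated solution passes all tests.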

Summary Table

| Benchmark | Measures | Format | Score Type |
| --- | --- | --- | --- |
| MT-Bench | Multi-turn chat quality, coherence | Conversational Q&A | 1–10 scale |
| MMLU | General knowledge & reasoning | Multiple choice (57 subjects) | % correct |
| HumanEval | Code generation in Python | Function writing + test cases | % pass@1 |