Large Language Model

LLMs are large neural networks (usually transformer-based) trained to take human-like language as input and produce human-like language as output; their apparent reasoning ability emerges from probabilistic next-token prediction.
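
The "probabilistic next-token prediction" above can be sketched in miniature: at each step the model scores every vocabulary token (a logit), a softmax turns those scores into a probability distribution, and the next token is sampled or greedily chosen. The toy vocabulary and logits below are made up for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token distribution for the context "The cat sat on the"
vocab  = ["mat", "dog", "moon", "chair"]
logits = [4.0, 1.0, 0.5, 2.0]

probs = softmax(logits)
greedy = vocab[probs.index(max(probs))]   # greedy decoding picks the argmax
print(greedy)                             # "mat"
```

Sampling from `probs` instead of taking the argmax is what makes LLM outputs non-deterministic from run to run.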

| Model Name | Company | Year | Parameters (Est.) | Context Length | Performance (MT-Bench / MMLU / HumanEval) | Multimodal |
| --- | --- | --- | --- | --- | --- | --- |
| Claude 3 Opus | Anthropic | 2024 | Undisclosed | 200k | MT-Bench: 9.9 / MMLU: 86.8% / HumanEval: 88%+ | Yes |
| GPT-4.5 / GPT-4-turbo | OpenAI | 2023 | ~1.8T (MoE) | 128k | MT-Bench: ~9.9 / MMLU: ~87% / HumanEval: ~83% | Yes |
| Gemini 1.5 Pro | Google DeepMind | 2024 | Undisclosed | 1M | MT-Bench: ~9.7 / MMLU: ~86% / HumanEval: ~80% | Yes |
| LLaMA 3 70B | Meta | 2024 | 70B | 8k | MT-Bench: ~8.9 / MMLU: 83.2% / HumanEval: ~74% | No |
| Grok-1.5 | xAI | 2024 | ~314B (MoE) | 128k | MT-Bench: ~8.7 / MMLU: ~80% / HumanEval: ~72% | No |
| Mistral Large | Mistral | 2024 | Undisclosed | 32k | MT-Bench: 8.6 / MMLU: ~81% / HumanEval: ~69% | No |
| Mixtral 8x7B | Mistral | 2023 | 46.7B total, ~12.9B active (MoE) | 32k | MT-Bench: ~8.5 / MMLU: ~78% / HumanEval: ~65% | No |
| Command R+ | Cohere | 2024 | 104B | 128k | MT-Bench: ~8.4 / MMLU: ~79% / HumanEval: ~66% | No |
| Phi-3-mini (3.8B) | Microsoft | 2024 | 3.8B | 128k | MMLU: ~71% (no MT-Bench) | No |
| LLaMA 3 400B | Meta (internal) | 2024 | 400B | 128k (rumored) | Not benchmarked | Yes |

What LLMs do better than existing software

  1. Understanding human intent from natural-language prompts. Traditional software is limited to structured inputs: specifically defined APIs that must be called with exact parameters. LLMs can take vague inputs such as “build me a dashboard showing customer churn trends”. This greatly disrupts the UX of existing software.
  2. Handling unstructured data. LLMs can ingest PDFs, emails, images, etc.
  3. Reasoning / decomposing tasks. This is still an emerging ability. LLMs can break problems into steps, suggest what data is needed, and choose tools (APIs, databases, search).
  4. Generating artifacts: (a) code and (b) documents such as Word files or PowerPoint decks.
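
Point 3 (decomposing a vague request into tool steps) can be sketched as below. The planner here is a keyword-matching stand-in for the LLM; in a real system the model itself emits the plan, and the tool names are hypothetical:

```python
# Available tools the planner can choose from (names are illustrative).
TOOLS = {
    "sql":    "query the customer database",
    "search": "look up external information",
    "chart":  "render a visualization",
}

def plan(request: str) -> list[str]:
    """Return an ordered list of tool steps for a natural-language request."""
    steps = []
    text = request.lower()
    if "churn" in text or "customer" in text:
        steps.append("sql")        # fetch the underlying data first
    if "dashboard" in text or "trend" in text:
        steps.append("chart")      # then visualize it
    return steps

print(plan("build me a dashboard showing customer churn trends"))
# ['sql', 'chart']
```

The point is architectural, not the keyword matching: the vague request gets mapped to an ordered sequence of concrete, structured tool calls.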

Where LLMs are actually weak

  • Accuracy / hallucination
  • Determinism
  • Real-time data access (without tools)
  • Precise computation
  • Long multi-step workflows without orchestration
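
The last weakness is why production systems wrap the model in an external orchestration loop: the caller, not the model, tracks state, validates each step's output, and retries on failure. A minimal sketch, with a deterministic stub standing in for the real LLM call:

```python
def fake_llm(step: str) -> str:
    """Stub for an LLM call; a real system would call a model API here."""
    return {"extract": "churn=5%", "summarize": "Churn is 5%."}.get(step, "")

def run_workflow(steps):
    """Drive a multi-step workflow from outside the model."""
    results = {}
    for step in steps:
        out = fake_llm(step)
        if not out:                 # validate each step; could retry instead
            raise RuntimeError(f"step {step!r} failed")
        results[step] = out
    return results

print(run_workflow(["extract", "summarize"]))
```

Because the loop lives outside the model, each step can be checked, logged, and retried independently, which the model alone cannot guarantee.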

LLMs in different verticals


General-purpose chat assistants

| Company | Product | Underlying Model | Primary Use | Differentiation |
| --- | --- | --- | --- | --- |
| OpenAI | ChatGPT | GPT-4.1 / GPT-4o | General chat, reasoning | Best ecosystem + tool use |
| Anthropic | Claude | Claude 3.x | Long-context chat | Strong safety + context window |
| Google | Gemini | Gemini 1.5 | Multimodal assistant | Native search + YouTube |
| Microsoft | Copilot | GPT-4 (Azure) | Enterprise chat | Deep M365 integration |
| Meta | Meta AI | LLaMA 3 | Consumer chat | Distribution via WhatsApp |

Coding assistants

| Company | Product | Model | Target User | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | Codex | Codex / GPT-4 | Developers | Code-native reasoning |
| GitHub | Copilot | GPT-4 | Developers | IDE-embedded, massive adoption |
| Anthropic | Claude for Code | Claude 3 | Backend / infra | Strong refactoring |
| Google | Gemini Code Assist | Gemini | Enterprise devs | GCP + IDEs |
| Replit | Replit AI | GPT-4 / Claude | Solo builders | End-to-end app build |

Productivity / office suites

| Company | Product | Vertical | Differentiation |
| --- | --- | --- | --- |
| Microsoft | Copilot for M365 | Docs, Excel, Email | Deep workflow lock-in |
| Google | Gemini for Workspace | Docs, Sheets | Search + data leverage |
| OpenAI | GPTs (Custom) | Internal tools | Low-code AI apps |
| Notion | Notion AI | Knowledge mgmt | Context-aware writing |

Healthcare

| Company | Product | Model | Use Case | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 Med | GPT-4 | Clinical reasoning | Research / pilots |
| Google | Med-PaLM | PaLM | Diagnosis support | Strong research pedigree |
| Microsoft | Azure Health AI | GPT-4 | Clinical workflows | HIPAA + compliance |
| Epic Systems | AI Clinical Tools | GPT-4 | EHR-embedded | Distribution moat |

Legal

| Company | Product | Model | Target User | Advantage |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 Legal | GPT-4 | Law firms | Strong reasoning |
| Harvey | Harvey AI | GPT-4 | Big Law | Workflow-specific |
| Thomson Reuters | CoCounsel | Proprietary + GPT | Legal research | Data moat |

Finance

| Company | Product | Model | Use Case | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4 Finance | GPT-4 | Analysis / modeling | General reasoning |
| Bloomberg | BloombergGPT | BloombergGPT | Financial NLP | Proprietary data edge |
| Microsoft | Copilot for Finance | GPT-4 | Excel, modeling | CFO-focused |
| Databricks | DBRX | DBRX | Internal analytics | Open, fine-tunable |

Creative / media generation

| Company | Product | Model | Modality |
| --- | --- | --- | --- |
| OpenAI | DALL·E | DALL·E 3 | Image |
| OpenAI | Sora | Sora | Video |
| Adobe | Firefly | Proprietary | Image / Video |
| Meta | Emu | Emu | Image |

Agents & automation

| Company | Product | Concept | Why It Matters |
| --- | --- | --- | --- |
| OpenAI | Assistants API | Tool-using agents | Infra-level control |
| Microsoft | Copilot Studio | Enterprise agents | Workflow automation |
| Salesforce | Einstein Copilot | CRM agents | Sales + service lock-in |
| UiPath | Autopilot | Agentic automation | RPA + LLM convergence |

Benchmark Standards

MT-Bench (Multi-Turn Benchmark)

  • Purpose: Measures multi-turn chat quality (helpfulness, reasoning, correctness).
  • How it works: Model responses to multi-turn questions are graded by GPT-4 acting as a judge (single-answer grading, or pairwise comparison of two models).
  • Scoring: Scale from 1 to 10.
  • Focus: Human-like conversational ability, contextual understanding, and coherence across turns.
  • Used for: Ranking models on chat quality (e.g., Claude 3 Opus scored ~9.9/10).
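
The scoring mechanics above reduce to averaging the judge's per-turn grades. A minimal sketch with made-up judge scores:

```python
def mt_bench_score(turn_scores):
    """Average the judge's per-turn grades (each on a 1-10 scale)."""
    return sum(turn_scores) / len(turn_scores)

judge_scores = [9, 10, 8, 9]          # hypothetical grades from the GPT-4 judge
print(mt_bench_score(judge_scores))   # 9.0
```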

MMLU (Massive Multitask Language Understanding)

  • Purpose: Evaluates general knowledge and reasoning across 57 academic and professional subjects.
  • Format: Multiple-choice questions (like exams in history, math, law, biology, etc.).
  • Scoring: Reported as % accuracy.
  • Focus: Tests factual knowledge and problem-solving ability.
  • Used for: Benchmarking raw intelligence and breadth of understanding.
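
MMLU scoring is plain multiple-choice accuracy: the fraction of questions where the model's chosen letter matches the answer key. The predictions and key below are made up:

```python
def mmlu_accuracy(predictions, answer_key):
    """Fraction of multiple-choice answers matching the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

preds = ["A", "C", "B", "D", "C"]     # model's chosen letters
key   = ["A", "C", "B", "A", "C"]     # ground-truth answers
print(f"{mmlu_accuracy(preds, key):.0%}")  # 80%
```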

HumanEval

  • Purpose: Evaluates code generation ability.
  • How it works: Model is given a prompt to write a function; its output is tested against unit tests.
  • Scoring: % of problems solved correctly on first attempt (pass@1).
  • Focus: Programming accuracy, logic, and syntax in Python.
  • Used for: Measuring coding skills, often important for developer-focused models.
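
The pass@1 mechanism above can be sketched as a tiny harness: execute the model's generated function, run it against hidden unit tests, and count first-attempt passes. The problem ("write `add`"), the completion, and the tests are illustrative stand-ins, not real HumanEval data:

```python
def run_candidate(code: str, tests) -> bool:
    """Exec a generated function and check it against unit tests."""
    ns = {}
    try:
        exec(code, ns)                      # load the generated function
        for args, expected in tests:
            if ns["add"](*args) != expected:
                return False
        return True
    except Exception:
        return False                        # crashes count as failures

completion = "def add(a, b):\n    return a + b\n"   # the model's output
tests = [((1, 2), 3), ((-1, 1), 0)]                 # hidden unit tests

results = [run_candidate(completion, tests)]  # one first-attempt sample per problem
pass_at_1 = sum(results) / len(results)
print(pass_at_1)  # 1.0
```

The real benchmark runs 164 such problems in a sandbox; pass@1 is simply the share of problems whose first generated solution passes all tests.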

Summary Table

| Benchmark | Measures | Format | Score Type |
| --- | --- | --- | --- |
| MT-Bench | Multi-turn chat quality, coherence | Conversational Q&A | 1–10 scale |
| MMLU | General knowledge & reasoning | Multiple choice (57 subjects) | % correct |
| HumanEval | Code generation in Python | Function writing + test cases | % pass@1 |