Language Models Can't Tell What's Missing

AbsenceBench is a benchmark for evaluating how well language models can identify and reason about missing information in text—revealing fundamental gaps in model comprehension.

4,300+ Test Examples
14 Models Evaluated
3 Task Categories

Model Performance

Top models by average score

  1. 🥇 Gemini-2.5-flash (thinking): 71.2
  2. 🥈 Claude-3.7-Sonnet (thinking): 69.6
  3. 🥉 Claude-3.7-Sonnet: 66.9
  4. Gemini-2.5-flash: 63.6

About the Benchmark

The Challenge

While language models excel at processing explicit information, they struggle with a fundamental aspect of comprehension: identifying what's missing. AbsenceBench tests models' ability to detect gaps, omissions, and absent context in text.

Three Task Categories

  • Poetry: Find the missing lines in a recitation of a poem
  • Numerical Sequences: Detect when a number in a sequence is absent
  • GitHub Pull Requests: Identify missing lines within a PR's diff
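
All three categories share the same shape: take an original document, remove some of its lines, and ask which lines are gone. The sketch below illustrates that setup on a numerical sequence. It is a minimal illustration, not the benchmark's actual code: the function names `make_example` and `score`, the `omit_fraction` parameter, and the use of exact-match precision, recall, and F1 are all assumptions made for this example.

```python
import random


def make_example(lines, omit_fraction=0.1, seed=0):
    """Drop a random subset of lines (hypothetical helper).

    Returns (modified_lines, omitted_lines), both in original order.
    """
    rng = random.Random(seed)
    n_omit = max(1, round(len(lines) * omit_fraction))
    omitted_idx = set(rng.sample(range(len(lines)), n_omit))
    modified = [line for i, line in enumerate(lines) if i not in omitted_idx]
    omitted = [line for i, line in enumerate(lines) if i in omitted_idx]
    return modified, omitted


def score(predicted, omitted):
    """Set-based precision, recall, and F1 over exact line matches.

    Assumed metric for illustration; duplicate lines count once.
    """
    pred, gold = set(predicted), set(omitted)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Build a toy "numerical sequence" example: the numbers 1..20, one per line.
sequence = [str(n) for n in range(1, 21)]
modified, omitted = make_example(sequence, omit_fraction=0.1, seed=0)

# A model would be shown `sequence` and `modified` and asked for the missing
# lines; here a stand-in prediction gets one line right and one wrong.
prediction = [omitted[0], "999"]
print(score(prediction, omitted))
```

The same two helpers would apply unchanged to the lines of a poem or of a pull request diff, since both treat the document as a plain list of strings.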

Why It Matters

The ability to recognize missing information is crucial for real-world applications like code review, fact-checking, instruction following, and critical reasoning. Current models' limitations in this area reveal important gaps in their cognitive capabilities.

Example Tasks

Full Model Leaderboard

Rank | Model | Poetry | Numerical Sequences | GitHub PRs | Average