Language Models Can't Tell What's Missing
AbsenceBench is a benchmark for evaluating how well language models can identify and reason about missing information in text—revealing fundamental gaps in model comprehension.
4,300+ Test Examples
14 Models Evaluated
3 Task Categories
Model Performance
Top models by average score
- 🥇 Gemini-2.5-flash (thinking): 71.2
- 🥈 Claude-3.7-Sonnet (thinking): 69.6
- 🥉 Claude-3.7-Sonnet: 66.9
- 4. Gemini-2.5-flash: 63.6
About the Benchmark
The Challenge
While language models excel at processing explicit information, they struggle with a fundamental aspect of comprehension: identifying what's missing. AbsenceBench tests models' ability to detect gaps, omissions, and absent context in text.
Three Task Categories
- Poetry: Find the missing lines in a recitation of a poem
- Numerical Sequences: Detect when a number in a sequence is absent
- GitHub Pull Requests: Identify missing lines within a PR's diff
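To make the task format concrete, here is a minimal sketch of how an omission example of this kind could be built and scored. It assumes a simple setup that this page does not specify: lines are removed at random, and a model's answer is scored by exact-match F1 against the removed lines. The names `make_absence_example` and `f1_score` are hypothetical and are not part of AbsenceBench's own code.

```python
import random


def make_absence_example(lines, n_omit, seed=0):
    """Remove n_omit randomly chosen lines (assumed construction).

    Returns (modified_lines, omitted_lines), both in original order.
    """
    rng = random.Random(seed)
    omit_idx = set(rng.sample(range(len(lines)), n_omit))
    modified = [ln for i, ln in enumerate(lines) if i not in omit_idx]
    omitted = [ln for i, ln in enumerate(lines) if i in omit_idx]
    return modified, omitted


def f1_score(predicted, omitted):
    """Exact-match F1 between predicted and truly omitted lines.

    Assumed metric for illustration. Lines are compared as a set,
    so duplicate lines are counted once.
    """
    pred, gold = set(predicted), set(omitted)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Works the same way for poem lines, numbers, or diff lines.
    sequence = [str(n) for n in range(1, 11)]
    modified, omitted = make_absence_example(sequence, n_omit=3, seed=42)
    print("shown to model:", modified)
    print("expected answer:", omitted)
    print("score for a perfect answer:", f1_score(omitted, omitted))
```

In this sketch the model would receive both the original and the modified text and be asked to list what was removed. A real harness would also need to parse the model's free-text response into lines before scoring.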
Why It Matters
The ability to recognize missing information is crucial for real-world applications like code review, fact-checking, instruction following, and critical reasoning. Current models' limitations in this area reveal important gaps in their cognitive capabilities.
Full Model Leaderboard
| Rank | Model | Poetry | Numerical Sequences | GitHub PRs | Average |
|---|---|---|---|---|---|