KyroBench
Context correctness under changing knowledge.
KyroDB Research / Published June 12, 2026
Agents do not fail only because they cannot reason. They also fail because the context in front of them is stale, polluted, outside the user boundary, or impossible to verify. KyroBench measures that layer.
The benchmark asks a simple production question: would it be safe for an agent to act on this context right now? If the answer depends on old facts, weak authority, hidden tenant leakage, or missing proof, the system should not pass.
Why this benchmark exists
Most retrieval tests reward semantic proximity. That is useful, but it is not enough for enterprise context. The closest text can still be the wrong text after a policy update, account switch, support escalation, chart correction, or revoked memory.
BEIR
Broad information-retrieval generalization across static corpora.
It does not test mutable enterprise state, deletion, proof, or tenant/session boundaries.
VectorDBBench
Vector database performance, recall, throughput, and latency.
It benchmarks database behavior, not whether returned context is safe for an agent to act on.
CRAG and RAGBench
RAG retrieval and QA quality.
They are adjacent to KyroBench, but KyroBench is stricter about freshness, scope, pollution, and proof.
LongMemEval and agent-memory suites
Long-term memory, history use, and multi-session agent recall.
KyroBench focuses on governed context correctness when memory changes, conflicts, or becomes unsafe.
What the tasks look like
The suite is built around changing business knowledge, not static passages. A case may begin with a valid document, add a later amendment, delete an old memory, inject a near-duplicate from another tenant, and then ask for the context an agent should use before acting.
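The case shape described above can be pictured as an ordered event log that is replayed to find what is still valid for one tenant. The following is a minimal sketch under stated assumptions, not the benchmark's actual fixture format: the `Event` type, the operation names, and the tenant and item identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real fixture format is not published.
    op: str        # "add", "amend", or "delete"
    tenant: str
    item_id: str
    text: str = ""


def current_state(events, tenant):
    """Replay events in order and return {item_id: text} valid for one tenant."""
    state = {}
    for ev in events:
        if ev.tenant != tenant:
            continue  # another tenant's items never enter this tenant's state
        if ev.op in ("add", "amend"):
            state[ev.item_id] = ev.text  # later text supersedes earlier text
        elif ev.op == "delete":
            state.pop(ev.item_id, None)  # deleted items are no longer valid
    return state


events = [
    Event("add", "acme", "refund-policy", "Refunds within 30 days."),
    Event("amend", "acme", "refund-policy", "Refunds within 14 days."),
    Event("add", "acme", "pref-1", "Customer prefers phone contact."),
    Event("delete", "acme", "pref-1"),
    # Near-duplicate that belongs to a different tenant.
    Event("add", "globex", "refund-policy", "Refunds within 30 days."),
]

valid = current_state(events, "acme")
print(valid)
```

In this toy case only the amended policy survives for the `acme` tenant. The superseded text, the deleted preference, and the other tenant's near-duplicate are all things a similarity-only retriever could still return.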
KyroBench evaluates six tracks together:
Core Context Correctness
Basic answerability, relevance, and current evidence under ordinary retrieval pressure.
Runtime Context Correctness
Mutation, delete, cache-window, generation, and update-ordering behavior.
End-to-End Context Quality
Claim support and action-readiness chains that require multiple pieces of valid evidence.
Agent Memory Quality
Session memory, user memory, long-running memory, and memory bleed across similar histories.
Pollution Resistance
Prompt injection, tool-result pollution, long-context noise, near duplicates, and contradictions.
Enterprise Boundary Correctness
Tenant, namespace, entitlement, canary, user, session, and thread boundaries.
The domains are intentionally ordinary: support tickets, billing overrides, legal amendments, SRE runbooks, CRM memory, synthetic healthcare records, code-agent traces, and public-policy changes. The difficulty comes from state changes, boundaries, and proof, not from obscure trivia.
Support and billing
Tests whether a system keeps account-specific exceptions, live case state, and current policy separate from stale macros.
Code agent
Measures whether local implementation evidence survives tool noise, stale notes, and similar-but-wrong repository facts.
Legal contracts
Tests whether controlling language, amendments, and authority levels beat older or lower-authority text.
SRE incidents
Focuses on fast-changing operational context where old mitigation steps can be actively harmful.
CRM account memory
Tests user, account, and entitlement boundaries under realistic sales and customer-memory handoffs.
Synthetic healthcare records
Measures whether sensitive context remains scoped to the right subject and reflects the latest accepted state.
How the evaluation works
Each system is connected through the same public adapter protocol. The evaluator loads scoped knowledge, applies updates and deletes, injects realistic distractors, asks constrained retrieval questions, and scores the returned context against the current source of truth.
Ingest
Load scoped documents, memories, session facts, and authority metadata.
Change
Apply updates, deletes, revocations, and near-duplicate distractors.
Query
Ask under tenant, session, namespace, freshness, and context-budget constraints.
Return
Provide ranked context plus proof metadata for each returned item.
Score
Check currentness, scope, pollution, proof, and evidence packing.
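To make the five stages concrete, here is a rough sketch of what an adapter might look like. The actual public adapter protocol is not reproduced on this page, so every name below (`Adapter`, `ContextItem`, `DictAdapter`, the method signatures, and the proof fields) is an assumption made for illustration. Tokens are approximated by whitespace-separated words, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ContextItem:
    item_id: str
    text: str
    proof: dict = field(default_factory=dict)  # hypothetical proof metadata


class Adapter(Protocol):
    # Hypothetical interface; not the official KyroBench protocol.
    def ingest(self, tenant: str, item_id: str, text: str) -> None: ...
    def delete(self, tenant: str, item_id: str) -> None: ...
    def query(self, tenant: str, question: str, budget_tokens: int) -> list[ContextItem]: ...


class DictAdapter:
    """Toy in-memory adapter: keyword-overlap ranking, scoped by tenant."""

    def __init__(self):
        self._store = {}  # (tenant, item_id) -> (text, version)

    def ingest(self, tenant, item_id, text):
        _, version = self._store.get((tenant, item_id), ("", 0))
        self._store[(tenant, item_id)] = (text, version + 1)  # update replaces old text

    def delete(self, tenant, item_id):
        self._store.pop((tenant, item_id), None)

    def query(self, tenant, question, budget_tokens):
        words = set(question.lower().split())
        scored = []
        for (t, item_id), (text, version) in self._store.items():
            if t != tenant:
                continue  # enforce the tenant boundary before ranking
            overlap = len(words & set(text.lower().split()))
            if overlap:
                scored.append((overlap, item_id, text, version))
        scored.sort(key=lambda s: (-s[0], s[1]))
        out, used = [], 0
        for _, item_id, text, version in scored:
            cost = len(text.split())  # crude token estimate
            if used + cost > budget_tokens:
                continue  # skip items that would exceed the context budget
            used += cost
            out.append(ContextItem(item_id, text, {"source": item_id, "version": version}))
        return out


adapter = DictAdapter()
adapter.ingest("acme", "policy", "refunds allowed within 30 days")
adapter.ingest("acme", "policy", "refunds allowed within 14 days")   # update
adapter.ingest("globex", "policy", "refunds allowed within 30 days")  # other tenant
result = adapter.query("acme", "how many days for refunds", budget_tokens=1200)
```

An evaluator would then compare `result` against the current source of truth for the queried tenant, checking both the returned text and the attached proof fields.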
The current official run evaluates 36,864 scored checks across 6 tracks, with 12,288 retrievals per run and a 1,200-token context budget. Public examples show the task shape; held-out labels and full task material are not released.
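As a quick consistency check on the run figures, the scored checks divide evenly by the retrievals. The short calculation below uses only the published numbers. The per-track figure assumes tracks are equally sized, which the page does not state.

```python
scored_checks = 36_864
retrievals_per_run = 12_288
tracks = 6

# Exact integer division: how many scored checks correspond to one retrieval.
checks_per_retrieval, remainder = divmod(scored_checks, retrievals_per_run)

# Only meaningful if the six tracks are equally sized (an assumption).
per_track_if_even = scored_checks // tracks

print(checks_per_retrieval, remainder, per_track_if_even)
```

The ratio is exactly 3 with no remainder. That is consistent with three scored checks per retrieval, but it would also fit other breakdowns, so it should be read as a sanity check rather than a description of the evaluator.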
Updated policy beats old FAQ
Freshness / Documents
A query matches two sources. Only the newer authoritative policy should support the answer.
Tenant boundary blocks leakage
Scope / Memory
Two customers share a similar incident. Evidence from the wrong account must not enter context.
Deleted memory stays deleted
Updates / Memory
A useful preference remains semantically close after deletion. A correct system still excludes it.
Distractor quote is rejected
Pollution / Documents
A near duplicate has the wrong date, source, or authority level. Similarity alone should not win.
Scoring
The official score is a certification gate. A system must return current, scoped, clean, supported, and provable context inside the budget. Diagnostic metrics are reported separately so readers can understand the failure mode without confusing partial signal for certification.
Context quality
30%: Returned context contains the evidence needed to answer, not loosely related text.
Freshness
20%: Updated, superseded, revoked, and deleted knowledge is handled correctly.
Scope safety
20%: Tenant, namespace, user, session, and authority boundaries are respected.
Pollution resistance
15%: Distractors, stale facts, and misleading near matches are kept out of context.
Proof completeness
10%: Returned context can be checked against current source state.
Evidence fit
5%: The system stays inside the context budget while preserving required evidence.
This makes the leaderboard harsher than a retrieval benchmark. A system can retrieve plausible evidence and still fail if that evidence is stale, unsupported, out of scope, or missing proof.
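The difference between a blended score and a gate can be illustrated with the published weights. This is a sketch under assumptions: the page does not specify how the weights and the gate are combined, so the `certified` rule below (every dimension must reach a threshold) and the threshold value itself are hypothetical.

```python
# Weights as published in the scoring section.
WEIGHTS = {
    "context_quality": 0.30,
    "freshness": 0.20,
    "scope_safety": 0.20,
    "pollution_resistance": 0.15,
    "proof_completeness": 0.10,
    "evidence_fit": 0.05,
}


def weighted_diagnostic(scores):
    """Weighted average of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def certified(scores, threshold=1.0):
    """Hypothetical gate: every dimension must reach the threshold."""
    return all(scores[name] >= threshold for name in WEIGHTS)


# A system that is perfect everywhere except freshness.
example = dict.fromkeys(WEIGHTS, 1.0)
example["freshness"] = 0.0

blended = weighted_diagnostic(example)
passed = certified(example)
print(round(blended, 2), passed)
```

Under these assumptions the blended figure stays high, at roughly 0.80, while the gate fails. That is the behavior the paragraph above describes: plausible but stale evidence does not certify.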
Results
No system is certified in the current official evaluation. That is the important result. The diagnostic rows show different kinds of progress, but none of the systems clears the full context-correctness bar.
Leaderboard
Official score is a certification gate. Retrieval and semantic signals are diagnostics.
Retrieval signal
Relevant evidence surfaced before certification checks.
Retrieval signal shows whether the system can surface plausible evidence. It is useful for diagnosis, but it does not certify the context because similar text can still be stale, polluted, or unverifiable.
KyroDB: Strong retrieval coverage and complete proof metadata, but not certified because the held-out semantic and freshness gates did not clear.
Graphiti/Zep: High retrieval signal and fast responses, but not certified without proof metadata and complete context-support behavior.
Qdrant: Useful as a raw vector-store baseline; it does not expose native freshness, authority, or proof semantics for certification.
Mem0: Strict requests were often blocked as stale, leaving very little completed retrieval signal in the official run.
| System | Official | Retrieval | Semantic | Proof | Freshness | Pollution | p95 latency |
|---|---|---|---|---|---|---|---|
| KyroDB | 0.0 | 79.1 | 0.0 | 100.0 | 0.0 | 100.0 | 520 ms |
| Graphiti/Zep | 0.0 | 71.6 | 0.0 | 0.0 | 69.4 | 100.0 | 168 ms |
| Qdrant | 0.0 | 53.3 | 0.0 | 0.0 | 0.0 | 100.0 | 231 ms |
| Mem0 | 0.0 | 0.06 | 0.04 | 0.0 | 0.0 | 2.5 | 401 ms |
Diagnostics explain system behavior. They are not blended into the certified score.
KyroDB shows the strongest retrieval signal and complete proof metadata, but still fails the official semantic and freshness gates. Graphiti/Zep retrieves strongly and responds quickly, but does not expose the proof surface required for certification. Qdrant is useful as a raw vector-store baseline, while Mem0 blocks many strict requests as stale and leaves little completed retrieval signal.
Failure analysis
The observed failures are the kinds that matter in production. They are not formatting misses. They are cases where an agent could act on the wrong account, an old policy, a deleted memory, or an unsupported claim because the context layer treated similarity as trust.
Stale accepted
Old facts beat newer valid evidence.
Scope leak
Out-of-scope content enters the answer context.
Deleted memory
Invalidated facts reappear after deletion or revocation.
Low authority
Weak sources are treated as sufficient support.
Missing proof
The answer lacks complete or valid provenance.
Packing failure
Key evidence is dropped under the context budget.
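The six failure modes above can be read as checks on a returned context set. The sketch below is one possible way to label them, not the evaluator's logic: the `Returned` and `Truth` types, the integer authority scale, and the rule that an absent item counts as deleted are all assumptions. Packing failure is approximated as "required evidence is missing from the in-scope results", regardless of why it was dropped.

```python
from dataclasses import dataclass


@dataclass
class Returned:
    item_id: str
    tenant: str
    version: int
    authority: int   # higher means more authoritative (hypothetical scale)
    has_proof: bool


@dataclass
class Truth:
    tenant: str
    current_versions: dict  # item_id -> current version; deleted ids are absent
    min_authority: int
    required_ids: set


def failure_modes(returned, truth):
    """Return the set of failure labels triggered by a returned context set."""
    modes = set()
    for item in returned:
        if item.tenant != truth.tenant:
            modes.add("scope_leak")
            continue  # out-of-scope items are not checked further
        current = truth.current_versions.get(item.item_id)
        if current is None:
            modes.add("deleted_memory")   # assumption: absent means deleted or revoked
        elif item.version < current:
            modes.add("stale_accepted")
        if item.authority < truth.min_authority:
            modes.add("low_authority")
        if not item.has_proof:
            modes.add("missing_proof")
    in_scope_ids = {i.item_id for i in returned if i.tenant == truth.tenant}
    if truth.required_ids - in_scope_ids:
        modes.add("packing_failure")      # proxy: required evidence not returned
    return modes


truth = Truth("acme", {"policy": 2}, min_authority=2, required_ids={"policy"})
returned = [
    Returned("policy", "acme", 1, 3, True),     # older version of a live item
    Returned("pref-1", "acme", 1, 3, True),     # item no longer in current state
    Returned("policy", "globex", 2, 3, False),  # wrong tenant
]
modes = failure_modes(returned, truth)
print(sorted(modes))
```

A single response can trigger several labels at once, so the labels are collected as a set rather than stopping at the first failure.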
The most common pattern is not that systems retrieve nothing. It is that they retrieve something plausible and then fail one of the harder obligations: the source is old, the account boundary is wrong, the proof is not auditable, or the returned set omits the one fragment that makes the action safe.
Limitations and next steps
KyroBench evaluates retrieval-grounded memory and context systems. It does not score free-form answer style, general web search, or long-context reasoning without a retrieval layer.
The public surface is intentionally limited. It gives teams enough to build honest adapters and understand results, while keeping the official judgment held out so the benchmark does not become a tuning set.
The Data page gives a deeper breakdown of the suite shape, fixture boundaries, tracks, domains, and common failure modes without publishing the hidden answer keys.