KyroBench

Context correctness under changing knowledge.

KyroDB Research / Published June 12, 2026

Agents fail not only because they cannot reason. They also fail because the context in front of them is stale, polluted, outside the user boundary, or impossible to verify. KyroBench measures that layer.

The benchmark asks a simple production question: would it be safe for an agent to act on this context right now? If the answer depends on old facts, weak authority, hidden tenant leakage, or missing proof, the system should not pass.

Why this benchmark exists

Most retrieval tests reward semantic proximity. That is useful, but it is not enough for enterprise context. The closest text can still be the wrong text after a policy update, account switch, support escalation, chart correction, or revoked memory.

BEIR

Broad information-retrieval generalization across static corpora.

It does not test mutable enterprise state, deletion, proof, or tenant/session boundaries.

VectorDBBench

Vector database performance, recall, throughput, and latency.

It benchmarks database behavior, not whether returned context is safe for an agent to act on.

CRAG and RAGBench

RAG retrieval and QA quality.

They are adjacent to KyroBench, but KyroBench is stricter about freshness, scope, pollution, and proof.

LongMemEval and agent-memory suites

Long-term memory, history use, and multi-session agent recall.

KyroBench focuses on governed context correctness when memory changes, conflicts, or becomes unsafe.

KyroBench scores the context packet before the model acts. The unit is not the final answer. The unit is whether the packet contains current, scoped, non-polluted, auditable evidence for the requested action.

What the tasks look like

The suite is built around changing business knowledge, not static passages. A case may begin with a valid document, add a later amendment, delete an old memory, inject a near-duplicate from another tenant, and then ask for the context an agent should use before acting.
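A case of this shape can be pictured as an ordered event log that is replayed to obtain the current source of truth for one tenant. The sketch below is illustrative only: the `Event` type, the operation names, and the tenant and item identifiers are assumptions made for this example, not part of the KyroBench task format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One step in a hypothetical case timeline (names are illustrative)."""
    op: str        # "add", "amend", or "delete"
    tenant: str
    item_id: str
    text: str = ""


def current_state(events, tenant):
    """Replay events in order and return {item_id: text} valid for one tenant."""
    state = {}
    for ev in events:
        if ev.tenant != tenant:
            continue  # another tenant's items never enter this tenant's truth
        if ev.op in ("add", "amend"):
            state[ev.item_id] = ev.text  # a later amendment replaces earlier text
        elif ev.op == "delete":
            state.pop(ev.item_id, None)  # deleted items must not reappear
    return state


events = [
    Event("add", "acme", "refund-policy", "Refunds within 30 days."),
    Event("amend", "acme", "refund-policy", "Refunds within 14 days."),
    Event("add", "acme", "pref-1", "Customer prefers phone calls."),
    Event("delete", "acme", "pref-1"),
    # Near-duplicate from another tenant: similar text, wrong scope.
    Event("add", "globex", "refund-policy", "Refunds within 30 days."),
]

truth = current_state(events, "acme")
```

After the replay, only the amended policy remains valid for `acme`. The original wording, the deleted preference, and the other tenant's near-duplicate are all semantically close to a refund question, and none of them belongs in the context.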

KyroBench evaluates six tracks together:

Core Context Correctness

Basic answerability, relevance, and current evidence under ordinary retrieval pressure.

Runtime Context Correctness

Mutation, delete, cache-window, generation, and update-ordering behavior.

End-to-End Context Quality

Claim support and action-readiness chains that require multiple pieces of valid evidence.

Agent Memory Quality

Session memory, user memory, long-running memory, and memory bleed across similar histories.

Pollution Resistance

Prompt injection, tool-result pollution, long-context noise, near duplicates, and contradictions.

Enterprise Boundary Correctness

Tenant, namespace, entitlement, canary, user, session, and thread boundaries.

The domains are intentionally ordinary: support tickets, billing overrides, legal amendments, SRE runbooks, CRM memory, synthetic healthcare records, code-agent traces, and public-policy changes. The difficulty comes from state changes, boundaries, and proof, not from obscure trivia.

Support and billing

Tests whether a system keeps account-specific exceptions, live case state, and current policy separate from stale macros.

Code agent

Measures whether local implementation evidence survives tool noise, stale notes, and similar-but-wrong repository facts.

Legal contracts

Tests whether controlling language, amendments, and authority levels beat older or lower-authority text.

SRE incidents

Focuses on fast-changing operational context where old mitigation steps can be actively harmful.

CRM account memory

Tests user, account, and entitlement boundaries under realistic sales and customer-memory handoffs.

Synthetic healthcare records

Measures whether sensitive context remains scoped to the right subject and reflects the latest accepted state.

How the evaluation works

Each system is connected through the same public adapter protocol. The evaluator loads scoped knowledge, applies updates and deletes, injects realistic distractors, asks constrained retrieval questions, and scores the returned context against the current source of truth.

Ingest

Load scoped documents, memories, session facts, and authority metadata.

Change

Apply updates, deletes, revocations, and near-duplicate distractors.

Query

Ask under tenant, session, namespace, freshness, and context-budget constraints.

Return

Provide ranked context plus proof metadata for each returned item.

Score

Check currentness, scope, pollution, proof, and evidence packing.
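The five stages above could be exercised against a system through an interface along these lines. This is a minimal sketch under stated assumptions: the `Adapter` protocol, its method names, the `ContextItem` fields, and the whitespace token count are all hypothetical, and the real adapter contract is the one in the public repository.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ContextItem:
    item_id: str
    text: str
    proof: dict = field(default_factory=dict)  # hypothetical proof metadata


class Adapter(Protocol):
    """Hypothetical adapter surface; not the official KyroBench contract."""
    def ingest(self, tenant: str, item_id: str, text: str) -> None: ...
    def delete(self, tenant: str, item_id: str) -> None: ...
    def query(self, tenant: str, question: str,
              budget_tokens: int) -> list[ContextItem]: ...


class InMemoryAdapter:
    """Toy system under test: keyword-overlap ranking, whitespace tokens."""

    def __init__(self):
        self._store = {}  # (tenant, item_id) -> (text, version)

    def ingest(self, tenant, item_id, text):
        _, version = self._store.get((tenant, item_id), ("", 0))
        self._store[(tenant, item_id)] = (text, version + 1)  # update bumps version

    def delete(self, tenant, item_id):
        self._store.pop((tenant, item_id), None)

    def query(self, tenant, question, budget_tokens):
        words = set(question.lower().split())
        scored = []
        for (item_tenant, item_id), (text, version) in self._store.items():
            if item_tenant != tenant:
                continue  # enforce the tenant boundary before ranking
            overlap = len(words & set(text.lower().split()))
            if overlap:
                scored.append((overlap, item_id, text, version))
        scored.sort(key=lambda row: (-row[0], row[1]))
        packet, used = [], 0
        for _, item_id, text, version in scored:
            cost = len(text.split())
            if used + cost > budget_tokens:
                continue  # skip items that would exceed the context budget
            used += cost
            packet.append(ContextItem(item_id, text,
                                      {"source": item_id, "version": version}))
        return packet


adapter = InMemoryAdapter()
adapter.ingest("acme", "faq", "refunds allowed within 30 days")      # Ingest
adapter.ingest("acme", "faq", "refunds allowed within 14 days")      # Change: update
adapter.ingest("globex", "faq", "refunds allowed within 30 days")    # Change: distractor
adapter.ingest("acme", "note", "customer asked about refunds once")
adapter.delete("acme", "note")                                       # Change: delete
packet = adapter.query("acme", "how many days for refunds", 1800)    # Query / Return
```

The scoring stage would then compare `packet` against the current source of truth. Here the only item returned is the updated `faq` entry at version 2, with the deleted note and the other tenant's copy excluded.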

The current official run evaluates 36,864 scored checks across 6 tracks, with 12,288 retrievals per run and a 1,800-token context budget. Public examples show the task shape; held-out labels and full task material are not released.

Updated policy beats old FAQ

Freshness / Documents

A query matches two sources. Only the newer authoritative policy should support the answer.

Tenant boundary blocks leakage

Scope / Memory

Two customers share a similar incident. Evidence from the wrong account must not enter context.

Deleted memory stays deleted

Updates / Memory

A useful preference remains semantically close after deletion. A correct system still excludes it.

Distractor quote is rejected

Pollution / Documents

A near duplicate has the wrong date, source, or authority level. Similarity alone should not win.

Public

Protocol, adapter contract, setup checks, scoring axes, and limited diagnostics.

Held back

Official labels, complete task material, and reviewer-run evaluation traces.

Published

Headline frontier score, diagnostics, run time, and public interpretation.

Scoring

The official-private frontier headline score is an all-gates incident-success rate. The component weights below are the official-private scoring axes recorded in the manifest. Diagnostic metrics are reported separately so readers can understand the failure mode without confusing partial signal for certification.

Context quality

35%

Required support, answerable evidence, precision, recall, ranking, action readiness, and content support.

Freshness

20%

Correct generation under updates, stale cache windows, strict freshness, and delete/update ordering.

Scope safety

18%

Tenant, namespace, entitlement, user, session, thread, canary, and memory boundaries are isolated.

Pollution resistance

15%

Tool-result pollution, prompt injection, stale contradictions, and budget-filling distractors are excluded.

Proof completeness

10%

Returned context carries auditable support evidence rather than metadata that merely looks like proof.

Efficiency

Latency and context-token behavior are measured without letting speed mask correctness failures.

This makes the leaderboard harsher than a retrieval benchmark. A system can retrieve plausible evidence and still fail if that evidence is stale, unsupported, out of scope, or missing proof.
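The difference between an all-gates rate and a weighted average can be made concrete with a short sketch. Several things here are assumptions: the gate names are shorthand for the axes above, the source does not say how the component weights combine with the headline score, and the five listed weights sum to 98% with no weight shown for Efficiency, so the weighted figure below is normalized by the listed total purely for illustration.

```python
# Component weights as listed above (Efficiency has no listed weight).
WEIGHTS = {
    "context_quality": 0.35,
    "freshness": 0.20,
    "scope_safety": 0.18,
    "pollution_resistance": 0.15,
    "proof_completeness": 0.10,
}


def incident_passes(gates):
    """An incident succeeds only if every gate is True."""
    return all(gates.values())


def all_gates_success_rate(incidents):
    """Fraction of incidents that clear all gates (0.0 for an empty run)."""
    if not incidents:
        return 0.0
    return sum(incident_passes(g) for g in incidents) / len(incidents)


def weighted_component_score(scores):
    """Illustrative weighted mean over the listed axes, normalized by their sum."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS) / total_weight


all_pass = {axis: True for axis in WEIGHTS}
incidents = [
    all_pass,
    {**all_pass, "freshness": False},           # stale evidence accepted
    {**all_pass, "proof_completeness": False},  # proof missing
]
rate = all_gates_success_rate(incidents)

# A system that is perfect everywhere except freshness.
weighted = weighted_component_score(
    {**{axis: 1.0 for axis in WEIGHTS}, "freshness": 0.0}
)
```

In this toy run only one incident in three clears every gate, even though each failing incident misses just one axis. The weighted figure for the freshness-only failure still looks respectable, which is the kind of partial signal the headline score is designed not to reward.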

Results

No system is certified in the current official evaluation. That is the important result. The diagnostic rows show different kinds of progress, but none of the systems clears the full context-correctness bar.

Leaderboard

Headline frontier score reports all-gates incident success. Retrieval and semantic signals are diagnostics.

Updated June 12, 2026 / 4 systems / 36,864 scored checks / 12,288 retrievals per run

Retrieval signal (0-100): relevant evidence surfaced before certification checks.

KyroDB: 79.1
Graphiti/Zep: 71.6
Qdrant: 53.3
Mem0: 0.06

Certified: 0/4

Top retrieval: KyroDB

Fastest p95: Graphiti/Zep

Retrieval signal shows whether the system can surface plausible evidence. It is useful for diagnosis, but it does not certify the context because similar text can still be stale, polluted, or unverifiable.

KyroDB: Strong retrieval coverage and complete proof metadata, but not certified because the held-out semantic and freshness gates did not clear.

Graphiti/Zep: High retrieval signal and fast responses, but not certified without proof metadata and complete context-support behavior.

Qdrant: Useful as a raw vector-store baseline; it does not expose native freshness, authority, or proof semantics for certification.

Mem0: Strict requests were often blocked as stale, leaving very little completed retrieval signal in the official run.

System	Retrieval	Semantic	Proof	Freshness	Pollution	p95 latency
KyroDB	79.1	0.0	100.0	0.0	100.0	520 ms
Graphiti/Zep	71.6	0.0	0.0	69.4	100.0	168 ms
Qdrant	53.3	0.0	0.0	0.0	100.0	231 ms
Mem0	0.06	0.04	0.0	0.0	2.5	401 ms

Diagnostics explain system behavior. They are not part of the headline frontier score.

KyroDB shows the strongest retrieval signal and complete proof metadata, but still fails the official semantic and freshness gates. Graphiti/Zep retrieves strongly and responds quickly, but does not expose the proof surface required for certification. Qdrant is useful as a raw vector-store baseline, while Mem0 blocks many strict requests as stale and leaves little completed retrieval signal.

Failure analysis

The observed failures are the kinds that matter in production. They are not formatting misses. They are cases where an agent could act on the wrong account, an old policy, a deleted memory, or an unsupported claim because the context layer treated similarity as trust.

Stale accepted

Old facts beat newer valid evidence.

Scope leak

Out-of-scope content enters the answer context.

Deleted memory

Invalidated facts reappear after deletion or revocation.

Low authority

Weak sources are treated as sufficient support.

Missing proof

The answer lacks complete or valid provenance.

Packing failure

Key evidence is dropped under the context budget.

The most common pattern is not that systems retrieve nothing. It is that they retrieve something plausible and then fail one of the harder obligations: the source is old, the account boundary is wrong, the proof is not auditable, or the returned set omits the one fragment that makes the action safe.
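The failure labels above can be read as checks on a returned packet against the current source of truth. The sketch below is a simplified illustration, not the official scorer: the `Returned` and `Truth` types, the integer authority levels, and the rule that missing required evidence counts as a packing failure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Returned:
    item_id: str
    tenant: str
    version: int
    has_proof: bool
    authority: int  # hypothetical scale: higher means more authoritative


@dataclass
class Truth:
    tenant: str
    current_versions: dict  # item_id -> latest version; deleted items absent
    required: set           # item_ids that must appear for the action to be safe
    min_authority: int = 1


def failure_modes(packet, truth):
    """Return the set of failure labels triggered by a returned packet."""
    failures = set()
    valid_ids = set()
    for item in packet:
        if item.tenant != truth.tenant:
            failures.add("scope leak")
            continue
        latest = truth.current_versions.get(item.item_id)
        if latest is None:
            failures.add("deleted memory")
        elif item.version < latest:
            failures.add("stale accepted")
        else:
            valid_ids.add(item.item_id)
        if item.authority < truth.min_authority:
            failures.add("low authority")
        if not item.has_proof:
            failures.add("missing proof")
    if not truth.required <= valid_ids:
        failures.add("packing failure")  # key evidence absent from the packet
    return failures


truth = Truth("acme", {"policy": 2, "ticket": 1}, {"policy", "ticket"},
              min_authority=2)
bad_packet = [
    Returned("policy", "acme", 1, True, 3),    # older version of the policy
    Returned("ticket", "globex", 1, True, 3),  # similar item, wrong tenant
    Returned("pref", "acme", 1, False, 3),     # deleted item, no proof
]
good_packet = [
    Returned("policy", "acme", 2, True, 3),
    Returned("ticket", "acme", 1, True, 3),
]
bad = failure_modes(bad_packet, truth)
good = failure_modes(good_packet, truth)
```

Every item in `bad_packet` is plausible on similarity alone, yet the packet triggers five of the six labels. `good_packet` triggers none.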

Limitations and next steps

KyroBench evaluates retrieval-grounded memory and context systems. It does not score free-form answer style, general web search, or long-context reasoning without a retrieval layer.

The public surface is intentionally limited. It gives teams enough to build honest adapters and understand results, while keeping the official judgment held out so the benchmark does not become a tuning set.

The Data page gives a deeper breakdown of the suite shape, fixture boundaries, tracks, domains, and common failure modes without publishing the hidden answer keys.

Reference

The public repository contains the adapter contract, setup checks, examples, and documentation for implementing against KyroBench.
