KyroBench data
The official data measures whether retrieved agent context remains correct after knowledge changes.
Updated June 12, 2026. This page explains suite shape and public boundaries; it does not publish the held-out answer keys.
Current official run
Systems in the current run: KyroDB, Graphiti/Zep, Mem0, and Qdrant.
Atomic judgments across context quality, freshness, scope, pollution, proof, and budget fit.
Strict, scoped retrieval calls issued after ingestion and mutation events.
Ingest, update, delete, revoke, query, and score operations in each official run.
Returned context is evaluated under a top-10 retrieval budget.
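As an illustration of the operations listed above, here is a minimal sketch of a toy in-memory system that supports ingest, update, delete, revoke, and scoped query under the top-10 budget. The class name `InMemoryAdapter`, its method signatures, and the substring matching are assumptions made for this example; they are not the KyroBench adapter contract.

```python
from dataclasses import dataclass, field

TOP_K_BUDGET = 10  # the page states a top-10 retrieval budget


@dataclass
class InMemoryAdapter:
    """Toy stand-in for a system under test (hypothetical, not a real adapter)."""
    docs: dict = field(default_factory=dict)   # doc_id -> (scope, text)
    revoked: set = field(default_factory=set)  # doc_ids that must never be returned

    def ingest(self, doc_id, scope, text):
        self.docs[doc_id] = (scope, text)

    def update(self, doc_id, text):
        # Replace the text so later queries only see the newest version.
        scope, _ = self.docs[doc_id]
        self.docs[doc_id] = (scope, text)

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def revoke(self, doc_id):
        self.revoked.add(doc_id)

    def query(self, scope, term, k=TOP_K_BUDGET):
        # Strict scoping: only documents in the requested scope are eligible.
        hits = [
            (doc_id, text)
            for doc_id, (doc_scope, text) in self.docs.items()
            if doc_scope == scope and doc_id not in self.revoked and term in text
        ]
        return hits[:k]


adapter = InMemoryAdapter()
adapter.ingest("p1", "tenant-a", "refund window is 30 days")
adapter.ingest("p2", "tenant-b", "refund window is 14 days")
adapter.update("p1", "refund window is 45 days")
adapter.ingest("p3", "tenant-a", "refund exception for account 7")
adapter.revoke("p3")
print(adapter.query("tenant-a", "refund"))
```

The query is issued after the mutation events, so the only acceptable result for `tenant-a` is the updated text of `p1`: the old 30-day text, the revoked exception, and the `tenant-b` policy must all stay out.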
Fixture boundary
KyroBench exposes enough data for implementation and inspection without turning the official suite into a tuning set. Public development fixtures are useful for adapter work; they are not leaderboard substitutes.
Runner
System-visible execution inputs used by adapters during a run.
Evaluator public
Public labels and fixtures for local development checks. These are not the official leaderboard set.
Evaluator
Reviewer-held labels, hidden split, answer keys, and fixture locks used for official scoring.
Published
Aggregate scores, diagnostic metrics, setup contract, examples, and interpretation.
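The boundary above can be sketched as a simple guard. This is a hedged illustration: the `Tier` enum, the `HELD_OUT` set, the function `check_adapter_inputs`, and the file names are all hypothetical, and treating only the Evaluator tier as hidden from adapters is one reading of this page, not a published rule.

```python
from enum import Enum


class Tier(Enum):
    RUNNER = "runner"
    EVALUATOR_PUBLIC = "evaluator_public"
    EVALUATOR = "evaluator"
    PUBLISHED = "published"


# Assumption: only the reviewer-held Evaluator tier is hidden from adapter authors.
HELD_OUT = frozenset({Tier.EVALUATOR})


def check_adapter_inputs(artifacts):
    """Raise if any (name, tier) pair handed to an adapter belongs to a held-out tier."""
    leaked = [name for name, tier in artifacts if tier in HELD_OUT]
    if leaked:
        raise PermissionError(f"held-out artifacts exposed to adapter: {leaked}")
    return True


# Hypothetical artifact names, for illustration only.
check_adapter_inputs([
    ("run_inputs.jsonl", Tier.RUNNER),
    ("dev_labels.jsonl", Tier.EVALUATOR_PUBLIC),
])
```

A guard of this kind makes the fixture boundary explicit in tooling: development runs can read public fixtures freely, while anything tagged as reviewer-held fails loudly instead of silently becoming tuning data.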
Tracks
The benchmark is organized around six context-correctness tracks. To be certified, a system must satisfy every applicable obligation; partial retrieval signal is reported only as a diagnostic.
Core Context Correctness
Basic answerability, relevance, and current evidence under ordinary retrieval pressure.
Runtime Context Correctness
Mutation, delete, cache-window, generation, and update-ordering behavior.
End-to-End Context Quality
Claim support and action-readiness chains that require multiple pieces of valid evidence.
Agent Memory Quality
Session memory, user memory, long-running memory, and memory bleed across similar histories.
Pollution Resistance
Prompt injection, tool-result pollution, long-context noise, near duplicates, and contradictions.
Enterprise Boundary Correctness
Tenant, namespace, entitlement, canary, user, session, and thread boundaries.
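The all-or-nothing character of certification across these tracks can be sketched as follows. The track identifiers, the `certify` function, and the use of `None` to mark a track that does not apply are assumptions for illustration; the page does not specify how applicability is encoded.

```python
from typing import Dict, List, Optional, Tuple

# Identifiers derived from the six track names above (naming is an assumption).
TRACKS = (
    "core_context_correctness",
    "runtime_context_correctness",
    "end_to_end_context_quality",
    "agent_memory_quality",
    "pollution_resistance",
    "enterprise_boundary_correctness",
)


def certify(results: Dict[str, Optional[bool]]) -> Tuple[bool, List[str]]:
    """Return (certified, failed_tracks).

    results maps each track to True (passed), False (failed),
    or None (not applicable to this system; a hypothetical convention).
    """
    failed = [track for track in TRACKS if results[track] is False]
    return (len(failed) == 0, failed)


example = {track: True for track in TRACKS}
example["runtime_context_correctness"] = False
print(certify(example))
```

One failed applicable track is enough to withhold certification, even when the other five pass; the failed track names are returned so the miss stays visible as a diagnostic.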
Domains
The domains are ordinary enterprise situations: the kind of context an agent sees before answering a customer, changing a workflow, triaging an incident, or editing code.
Support and billing
Tests whether a system keeps account-specific exceptions, live case state, and current policy separate from stale macros.
Code agent
Measures whether local implementation evidence survives tool noise, stale notes, and similar-but-wrong repository facts.
Legal contracts
Tests whether controlling language, amendments, and authority levels beat older or lower-authority text.
SRE incidents
Focuses on fast-changing operational context where old mitigation steps can be actively harmful.
CRM account memory
Tests user, account, and entitlement boundaries under realistic sales and customer-memory handoffs.
Synthetic healthcare records
Measures whether sensitive context remains scoped to the right subject and reflects the latest accepted state.
Support escalation trace
Catches agents that answer from a nearby account, stale callback, or globally valid text that is wrong for the case.
Public policy change trace
Scores whether the current effective rule beats old public guidance that still looks semantically relevant.
Scenario variants
Cases are generated in variants so a system cannot pass by optimizing for one easy pattern. Some variants concentrate on freshness, some on boundaries, some on pollution, and some on sparse business lookup.
Scoring surfaces
The official score is a certification gate; retrieval and semantic-context metrics are diagnostics. A system can retrieve plausible evidence and still fail if the evidence is stale, out of scope, polluted, unsupported, or not auditable.
Context quality / 30%
Returned context contains the evidence needed to answer, not loosely related text.
Freshness / 20%
Updated, superseded, revoked, and deleted knowledge is handled correctly.
Scope safety / 20%
Tenant, namespace, user, session, and authority boundaries are respected.
Pollution resistance / 15%
Distractors, stale facts, and misleading near matches are kept out of context.
Proof completeness / 10%
Returned context can be checked against current source state.
Evidence fit / 5%
The system stays inside the context budget while preserving required evidence.
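The six weights above sum to 100%. A minimal sketch of how they might combine is shown below, assuming each surface is scored in the range 0 to 1 and that the surfaces are aggregated as a linear weighted sum. The page gives the weights but not the aggregation formula, so the formula, the dictionary keys, and the function name `weighted_score` are assumptions.

```python
# Weights in percent, taken from the scoring surfaces listed above.
WEIGHTS = {
    "context_quality": 30,
    "freshness": 20,
    "scope_safety": 20,
    "pollution_resistance": 15,
    "proof_completeness": 10,
    "evidence_fit": 5,
}
assert sum(WEIGHTS.values()) == 100


def weighted_score(surface_scores):
    """Combine per-surface scores (each in [0, 1]) into one number in [0, 1].

    Assumption: a linear weighted sum; the official aggregation may differ.
    """
    missing = WEIGHTS.keys() - surface_scores.keys()
    if missing:
        raise ValueError(f"missing surfaces: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= surface_scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(WEIGHTS[name] * surface_scores[name] for name in WEIGHTS) / 100


scores = {name: 1.0 for name in WEIGHTS}
scores["freshness"] = 0.5  # half of the freshness judgments failed
print(weighted_score(scores))
```

Under this assumed formula, losing half of the freshness judgments costs 10 points out of 100. A weighted number like this would only be a summary: because the official score is described as a certification gate, a high sum would not by itself imply certification.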
Failure modes
These are the misses KyroBench is designed to expose. They matter because they change whether an agent can safely act.
Stale accepted
Old facts beat newer valid evidence.
Scope leak
Out-of-scope content enters the answer context.
Deleted memory
Invalidated facts reappear after deletion or revocation.
Low authority
Weak sources are treated as sufficient support.
Missing proof
The answer lacks complete or valid provenance.
Packing failure
Key evidence is dropped under the context budget.
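The failure modes above can be read as checks on a packed context. The sketch below is a simplified illustration: the `ContextItem` fields, the label strings, and the function `failure_modes` are hypothetical, and each flag is assumed to be supplied by an evaluator that already knows the current source state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    item_id: str
    superseded: bool = False          # a newer valid fact exists
    in_scope: bool = True             # matches tenant/user/session of the query
    deleted_or_revoked: bool = False  # source was deleted or revoked
    authoritative: bool = True        # source has sufficient authority
    has_provenance: bool = True       # can be checked against source state


def failure_modes(candidates, required_ids, budget=10):
    """Label the failure modes present after packing ranked candidates into the budget."""
    kept = candidates[:budget]
    modes = set()
    for item in kept:
        if item.superseded:
            modes.add("stale_accepted")
        if not item.in_scope:
            modes.add("scope_leak")
        if item.deleted_or_revoked:
            modes.add("deleted_memory")
        if not item.authoritative:
            modes.add("low_authority")
        if not item.has_provenance:
            modes.add("missing_proof")
    # Packing failure: required evidence was retrieved but fell outside the budget.
    kept_ids = {item.item_id for item in kept}
    dropped_ids = {item.item_id for item in candidates[budget:]}
    if (set(required_ids) - kept_ids) & dropped_ids:
        modes.add("packing_failure")
    return modes


ranked = [ContextItem(f"filler-{i}") for i in range(9)]
ranked.append(ContextItem("old-policy", superseded=True))
ranked.append(ContextItem("current-policy"))  # rank 11, outside a top-10 budget
print(sorted(failure_modes(ranked, required_ids=["current-policy"])))
```

In this example the stale policy occupies the last slot in the budget and pushes the current policy out, so a single ranking mistake produces two distinct failures: stale accepted and packing failure.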