KyroBench data

The official data measures whether retrieved agent context remains correct after knowledge changes.

Updated June 12, 2026. This page explains suite shape and public boundaries; it does not publish the held-out answer keys.

Current official run

4
Systems

KyroDB, Graphiti/Zep, Mem0, and Qdrant.

36,864
Scored checks

Atomic judgments across context quality, freshness, scope, pollution, proof, and budget fit.

12,288
Retrievals per run

Strict, scoped retrieval calls issued after ingestion and mutation events.

66,560
Protocol steps

Ingest, update, delete, revoke, query, and score operations in each official run.

1,200 tokens
Context budget

Returned context is evaluated under a 1,200-token budget with a top-10 retrieval limit.
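
As a rough illustration of how mutation events and the context budget interact, the following minimal sketch runs a toy ingest, update, delete, and query sequence, then packs the result under the top-10 and 1,200-token limits stated above. Everything here is an assumption for illustration: `ToyStore`, `pack`, and `count_tokens` are hypothetical names, not part of any KyroBench adapter interface; the whitespace word count stands in for a real tokenizer; and the revoke and score operations are omitted.

```python
from dataclasses import dataclass, field

MAX_ITEMS = 10      # top-10 retrieval limit (from the page)
MAX_TOKENS = 1200   # context budget in tokens (from the page)


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


@dataclass
class ToyStore:
    """Hypothetical in-memory store; not a KyroBench adapter API."""
    records: dict = field(default_factory=dict)  # key -> (version, text)

    def ingest(self, key: str, text: str) -> None:
        self.records[key] = (1, text)

    def update(self, key: str, text: str) -> None:
        # The new text supersedes the old one; only the latest version is kept.
        version, _ = self.records[key]
        self.records[key] = (version + 1, text)

    def delete(self, key: str) -> None:
        self.records.pop(key, None)

    def query(self, term: str) -> list:
        # Naive substring match over current state only.
        return [text for _, text in self.records.values() if term in text]


def pack(candidates: list) -> list:
    """Keep ranked candidates in order until either budget is exhausted."""
    packed, used = [], 0
    for text in candidates[:MAX_ITEMS]:
        cost = count_tokens(text)
        if used + cost > MAX_TOKENS:
            break  # stop at the first candidate that would exceed the budget
        packed.append(text)
        used += cost
    return packed


store = ToyStore()
store.ingest("refund", "refund window is 30 days")
store.update("refund", "refund window is 14 days")   # supersedes the 30-day fact
store.ingest("legacy", "refund macro: always approve")
store.delete("legacy")                               # must not reappear
context = pack(store.query("refund"))
print(context)  # ['refund window is 14 days']
```

A retrieval layer that still returned the 30-day text or the deleted macro after these steps would be showing the kind of post-mutation error the suite is described as measuring.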

Fixture boundary

KyroBench exposes enough data for implementation and inspection without turning the official suite into a tuning set. Public development fixtures are useful for adapter work; they are not leaderboard substitutes.

Runner

System-visible execution inputs used by adapters during a run.

Evaluator public

Public labels and fixtures for local development checks. These are not the official leaderboard set.

Evaluator

Reviewer-held labels, hidden split, answer keys, and fixture locks used for official scoring.

Published

Aggregate scores, diagnostic metrics, setup contract, examples, and interpretation.

Tracks

The benchmark is organized around six context-correctness tracks. A system must satisfy all applicable obligations for certification; partial retrieval signal is reported only as a diagnostic.

Core Context Correctness

Basic answerability, relevance, and current evidence under ordinary retrieval pressure.

Runtime Context Correctness

Mutation, delete, cache-window, generation, and update-ordering behavior.

End-to-End Context Quality

Claim support and action-readiness chains that require multiple pieces of valid evidence.

Agent Memory Quality

Session memory, user memory, long-running memory, and memory bleed across similar histories.

Pollution Resistance

Prompt injection, tool-result pollution, long-context noise, near duplicates, and contradictions.

Enterprise Boundary Correctness

Tenant, namespace, entitlement, canary, user, session, and thread boundaries.
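
The boundary track above can be pictured as a strict filter applied before any context is returned. The sketch below is a minimal illustration under stated assumptions: the `Scope` and `Item` types are hypothetical, only three of the listed boundaries (tenant, namespace, user) are modeled, and exact equality is used in place of whatever entitlement logic a real system applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    # Hypothetical subset of the boundaries named in the track description.
    tenant: str
    namespace: str
    user: str


@dataclass(frozen=True)
class Item:
    text: str
    scope: Scope


def in_scope(item: Item, request: Scope) -> bool:
    # Strict equality on every boundary field; no wildcards or inheritance.
    return item.scope == request


def scoped_context(items: list, request: Scope) -> list:
    """Return only the text of items whose scope matches the request exactly."""
    return [item.text for item in items if in_scope(item, request)]


acme = Scope(tenant="acme", namespace="support", user="u1")
globex = Scope(tenant="globex", namespace="support", user="u1")

items = [
    Item("Acme refund override: 60 days", acme),
    Item("Globex refund override: 7 days", globex),  # similar text, wrong tenant
]

result = scoped_context(items, acme)
print(result)  # ['Acme refund override: 60 days']
```

The second item is semantically close to the first, which is why a similarity-only retriever could surface it; the filter drops it on the tenant field alone.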

Domains

The domains are ordinary enterprise situations: the kind of context an agent sees before answering a customer, changing a workflow, triaging an incident, or editing code.

Support and billing

Tests whether a system keeps account-specific exceptions, live case state, and current policy separate from stale macros.

tickets, invoices, refund policy, entitlement overrides

Code agent

Measures whether local implementation evidence survives tool noise, stale notes, and similar-but-wrong repository facts.

repo issues, branch contracts, CI observations, scratchpad pollution

Legal contracts

Tests whether controlling language, amendments, and authority levels beat older or lower-authority text.

agreements, amendments, effective dates, restricted addenda

SRE incidents

Focuses on fast-changing operational context where old mitigation steps can be actively harmful.

alerts, deployments, runbook revisions, break-glass ACLs

CRM account memory

Tests user, account, and entitlement boundaries under realistic sales and customer-memory handoffs.

account handoffs, opportunity state, stakeholder memory, tenant traps

Synthetic healthcare records

Measures whether sensitive context remains scoped to the right subject and reflects the latest accepted state.

consent, care-plan updates, corrected notes, privacy leaks

Support escalation trace

Catches agents that answer from a nearby account, stale callback, or globally valid text that is wrong for the case.

case policy, account override, live tool state, callback memory

Public policy change trace

Scores whether the current effective rule beats old public guidance that still looks semantically relevant.

current rules, amendments, effective windows, stale guidance

Scenario variants

Cases are generated in variants so a system cannot pass by optimizing for one easy pattern. Some variants concentrate on freshness, some on boundaries, some on pollution, and some on sparse business lookup.

Full adversarial
Freshness-heavy
Scope-heavy
Session-memory-heavy
Tool-pollution-heavy
Authority-conflict-heavy
Deleted-context-heavy
Sparse business lookup

Scoring surfaces

The official score is a certification gate; retrieval and semantic-context metrics are diagnostics. A system can retrieve plausible evidence and still fail if the evidence is stale, out of scope, polluted, unsupported, or not auditable.

Context quality / 30%

Returned context contains the evidence needed to answer, not loosely related text.

Freshness / 20%

Updated, superseded, revoked, and deleted knowledge is handled correctly.

Scope safety / 20%

Tenant, namespace, user, session, and authority boundaries are respected.

Pollution resistance / 15%

Distractors, stale facts, and misleading near matches are kept out of context.

Proof completeness / 10%

Returned context can be checked against current source state.

Evidence fit / 5%

The system stays inside the context budget while preserving required evidence.
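
To make the weights above concrete, here is a minimal sketch of one way they could combine into a 0-100 diagnostic number alongside a pass/fail gate. The page does not publish the official aggregation, so this is an assumption throughout: the weighted sum, the per-surface pass rates in [0, 1], the rule that every surface must pass fully, and the names `weighted_score` and `certified` are all hypothetical.

```python
# Surface weights in percent, as listed on the page.
WEIGHTS = {
    "context_quality": 30,
    "freshness": 20,
    "scope_safety": 20,
    "pollution_resistance": 15,
    "proof_completeness": 10,
    "evidence_fit": 5,
}
assert sum(WEIGHTS.values()) == 100  # the listed weights cover the whole score


def weighted_score(pass_rates: dict) -> float:
    """Weighted sum of per-surface pass rates in [0, 1], on a 0-100 scale."""
    return sum(WEIGHTS[name] * pass_rates[name] for name in WEIGHTS)


def certified(pass_rates: dict) -> bool:
    # Assumption: the gate requires every surface to pass fully.
    return all(pass_rates[name] == 1.0 for name in WEIGHTS)


rates = {name: 1.0 for name in WEIGHTS}
rates["freshness"] = 0.5  # half of the freshness checks fail

score = weighted_score(rates)
print(score)             # 90.0
print(certified(rates))  # False
```

Under these assumptions a system can post a high weighted number and still miss certification, which matches the page's distinction between the gate and the diagnostics.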

Failure modes

These are the misses KyroBench is designed to expose. They matter because they change whether an agent can safely act.

Stale accepted

Old facts beat newer valid evidence.

Scope leak

Out-of-scope content enters the answer context.

Deleted memory

Invalidated facts reappear after deletion or revocation.

Low authority

Weak sources are treated as sufficient support.

Missing proof

The answer lacks complete or valid provenance.

Packing failure

Key evidence is dropped under the context budget.
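
The six failure modes above can be read as independent checks over a returned context. The sketch below is illustrative only and assumes the evaluator already knows the ground-truth flags for each returned item; the `ReturnedItem` fields, the `required_present` flag, and `classify` are hypothetical names, not KyroBench's scoring code.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    STALE_ACCEPTED = "stale accepted"
    SCOPE_LEAK = "scope leak"
    DELETED_MEMORY = "deleted memory"
    LOW_AUTHORITY = "low authority"
    MISSING_PROOF = "missing proof"
    PACKING_FAILURE = "packing failure"


@dataclass
class ReturnedItem:
    # Hypothetical ground-truth flags for one item in the returned context.
    is_current: bool      # not superseded by newer valid evidence
    in_scope: bool        # inside the requesting tenant/user/session scope
    is_deleted: bool      # source was deleted or revoked
    authority_ok: bool    # source authority is sufficient for the claim
    has_provenance: bool  # item carries complete, valid provenance


def classify(items: list, required_present: bool) -> set:
    """Collect every failure mode exhibited by a returned context."""
    failures = set()
    for item in items:
        if item.is_deleted:
            failures.add(Failure.DELETED_MEMORY)
        elif not item.is_current:
            failures.add(Failure.STALE_ACCEPTED)
        if not item.in_scope:
            failures.add(Failure.SCOPE_LEAK)
        if not item.authority_ok:
            failures.add(Failure.LOW_AUTHORITY)
        if not item.has_provenance:
            failures.add(Failure.MISSING_PROOF)
    if not required_present:
        # Required evidence did not survive packing under the context budget.
        failures.add(Failure.PACKING_FAILURE)
    return failures


stale_neighbor = ReturnedItem(
    is_current=False, in_scope=False, is_deleted=False,
    authority_ok=True, has_provenance=True,
)
found = classify([stale_neighbor], required_present=True)
print(sorted(f.value for f in found))  # ['scope leak', 'stale accepted']
```

A single returned item can trigger more than one mode, as in the example, where a stale fact from a neighboring scope counts as both a freshness and a boundary miss.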