KyroBench data

The official data measures whether retrieved agent context remains correct after knowledge changes.

Updated June 12, 2026. This page explains suite shape and public boundaries; it does not publish the held-out answer keys.

Current official run

Systems

KyroDB, Graphiti/Zep, Mem0, and Qdrant.

36,864

Scored checks

Atomic judgments across context quality, freshness, scope, pollution, proof, and efficiency.

12,288

Retrievals per run

Strict, scoped retrieval calls issued after ingestion and mutation events.

66,560

Protocol steps

Ingest, update, delete, revoke, query, and score operations in each official run.

1,800 tokens

Context budget

Returned context is evaluated under a top-16 retrieval budget.
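
A minimal sketch of how an adapter author might check a returned context against these limits locally. It assumes the 1,800-token budget and the top-16 item limit both apply to the same returned context, and it uses a whitespace word count as a placeholder tokenizer; `ContextItem`, `count_tokens`, and `within_budget` are hypothetical names, not part of KyroBench.

```python
from dataclasses import dataclass

MAX_ITEMS = 16     # top-16 retrieval budget stated on this page
MAX_TOKENS = 1800  # context budget stated on this page


@dataclass
class ContextItem:
    item_id: str
    text: str


def count_tokens(text: str) -> int:
    # Placeholder tokenizer (assumption): whitespace-separated words.
    # The official tokenizer is not specified on this page.
    return len(text.split())


def within_budget(items: list[ContextItem]) -> bool:
    """Return True when the context respects both the item and token limits."""
    total_tokens = sum(count_tokens(item.text) for item in items)
    return len(items) <= MAX_ITEMS and total_tokens <= MAX_TOKENS
```

A check like this is only useful for development fixtures; official scoring may count tokens differently.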

Fixture boundary

KyroBench exposes enough data for implementation and inspection without turning the official suite into a tuning set. Public development fixtures are useful for adapter work; they are not leaderboard substitutes.

Runner

System-visible execution inputs used by adapters during a run.

Evaluator public

Public labels and fixtures for local development checks. These are not the official leaderboard set.

Evaluator

Reviewer-held labels, hidden split, answer keys, and fixture locks used for official scoring.

Published

Aggregate scores, diagnostic metrics, setup contract, examples, and interpretation.

Tracks

The benchmark is organized around six context-correctness tracks. To be certified, a system must satisfy every applicable obligation; partial retrieval signal is reported only as a diagnostic.
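
The certification rule can be sketched as an all-or-nothing check over the applicable tracks. This is an illustration under stated assumptions: the track identifiers, the use of `None` to mark a track as not applicable, and the `certify` function are all hypothetical, and the official certification logic is not published here.

```python
from typing import Optional

# Track names taken from this page; the identifiers are hypothetical.
TRACKS = [
    "core_context_correctness",
    "runtime_context_correctness",
    "end_to_end_context_quality",
    "agent_memory_quality",
    "pollution_resistance",
    "enterprise_boundary_correctness",
]


def certify(results: dict[str, Optional[bool]]) -> bool:
    """Certify only if every applicable track passed.

    A value of None (or a missing key) marks a track as not applicable.
    """
    applicable = [results[t] for t in TRACKS if results.get(t) is not None]
    # With no applicable tracks there is nothing to certify.
    return bool(applicable) and all(applicable)
```

Under this sketch, a single failed track blocks certification no matter how strong the other tracks are.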

Core Context Correctness

Basic answerability, relevance, and current evidence under ordinary retrieval pressure.

Runtime Context Correctness

Mutation, delete, cache-window, generation, and update-ordering behavior.

End-to-End Context Quality

Claim support and action-readiness chains that require multiple pieces of valid evidence.

Agent Memory Quality

Session memory, user memory, long-running memory, and memory bleed across similar histories.

Pollution Resistance

Prompt injection, tool-result pollution, long-context noise, near duplicates, and contradictions.

Enterprise Boundary Correctness

Tenant, namespace, entitlement, canary, user, session, and thread boundaries.

Domains

The domains are ordinary enterprise situations: the kind of context an agent sees before answering a customer, changing a workflow, triaging an incident, or editing code.

Support and billing

Tests whether a system keeps account-specific exceptions, live case state, and current policy separate from stale macros.

tickets, invoices, refund policy, entitlement overrides

Code agent

Measures whether local implementation evidence survives tool noise, stale notes, and similar-but-wrong repository facts.

repo issues, branch contracts, CI observations, scratchpad pollution

Legal contracts

Tests whether controlling language, amendments, and authority levels beat older or lower-authority text.

agreements, amendments, effective dates, restricted addenda

SRE incidents

Focuses on fast-changing operational context where old mitigation steps can be actively harmful.

alerts, deployments, runbook revisions, break-glass ACLs

CRM account memory

Tests user, account, and entitlement boundaries under realistic sales and customer-memory handoffs.

account handoffs, opportunity state, stakeholder memory, tenant traps

Synthetic healthcare records

Measures whether sensitive context remains scoped to the right subject and reflects the latest accepted state.

consent, care-plan updates, corrected notes, privacy leaks

Support escalation trace

Catches agents that answer from a nearby account, stale callback, or globally valid text that is wrong for the case.

case policy, account override, live tool state, callback memory

Public policy change trace

Scores whether the current effective rule beats old public guidance that still looks semantically relevant.

current rules, amendments, effective windows, stale guidance

Scenario variants

Cases are generated in variants so a system cannot pass by optimizing for one easy pattern. Some variants concentrate on freshness, some on boundaries, some on pollution, and some on sparse business lookup.

Full adversarial

Freshness-heavy

Scope-heavy

Session-memory-heavy

Tool-pollution-heavy

Authority-conflict-heavy

Deleted-context-heavy

Sparse business lookup

Scoring surfaces

The official-private frontier score reports all-gates incident success. The rows below show the manifest-recorded component weights; retrieval and semantic context remain diagnostics. A system can retrieve plausible evidence and still fail if the evidence is stale, out of scope, polluted, unsupported, or not auditable.

Context quality / 35%

Required support, answerable evidence, precision, recall, ranking, action readiness, and content support.

Freshness / 20%

Correct generation under updates, stale cache windows, strict freshness, and delete/update ordering.

Scope safety / 18%

Isolation of tenant, namespace, entitlement, user, session, thread, canary, and memory boundaries.

Pollution resistance / 15%

Tool-result pollution, prompt injection, stale contradictions, and budget-filling distractors are excluded.

Proof completeness / 10%

Returned context carries auditable support evidence rather than proof-shaped metadata.

Efficiency / 2%

Latency and context-token behavior are measured without letting speed mask correctness failures.
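
The component weights above sum to 100%. As a rough sketch, they could be combined into a weighted composite, and all-gates incident success could be computed as the share of incidents in which every gate passes. Both formulas are assumptions for illustration: this page does not state how the official score combines components, and the function names and the 0-to-1 score range are hypothetical.

```python
# Component weights as listed on this page, expressed as fractions.
WEIGHTS = {
    "context_quality": 0.35,
    "freshness": 0.20,
    "scope_safety": 0.18,
    "pollution_resistance": 0.15,
    "proof_completeness": 0.10,
    "efficiency": 0.02,
}


def weighted_composite(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def all_gates_success_rate(incidents: list[dict[str, bool]]) -> float:
    """Fraction of incidents in which every gate passed (assumed definition)."""
    if not incidents:
        return 0.0
    passed = sum(1 for gates in incidents if all(gates.values()))
    return passed / len(incidents)
```

The two numbers can diverge: a system can earn a high composite from strong retrieval while still failing most incidents on a single gate such as freshness or scope.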

Failure modes

These are the misses KyroBench is designed to expose. They matter because they change whether an agent can safely act.

Stale accepted

Old facts beat newer valid evidence.

Scope leak

Out-of-scope content enters the answer context.

Deleted memory

Invalidated facts reappear after deletion or revocation.

Low authority

Weak sources are treated as sufficient support.

Missing proof

The answer lacks complete or valid provenance.

Packing failure

Key evidence is dropped under the context budget.
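
The failure modes above can be illustrated with a small local checker that labels a returned context. This is a sketch under assumptions, not the official evaluator: the `Evidence` fields, the label strings, and the `classify` function are hypothetical, and the packing check only detects that required evidence is absent, not why it was dropped.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    evidence_id: str
    scope: str              # hypothetical scope key, e.g. a tenant or namespace
    generation: int         # version of the fact this item reflects
    latest_generation: int  # newest accepted version of that fact
    deleted: bool           # True if the fact was deleted or revoked
    authority: int          # higher means more authoritative
    has_provenance: bool    # True if auditable support is attached


def classify(
    returned: list[Evidence],
    required_ids: set[str],
    query_scope: str,
    min_authority: int,
) -> set[str]:
    """Return the set of failure-mode labels triggered by a returned context."""
    failures: set[str] = set()
    for item in returned:
        if item.generation < item.latest_generation:
            failures.add("stale_accepted")
        if item.scope != query_scope:
            failures.add("scope_leak")
        if item.deleted:
            failures.add("deleted_memory")
        if item.authority < min_authority:
            failures.add("low_authority")
        if not item.has_provenance:
            failures.add("missing_proof")
    # Proxy for packing failure: required evidence is absent from the context.
    if not required_ids <= {item.evidence_id for item in returned}:
        failures.add("packing_failure")
    return failures
```

An empty result means none of the six modes fired for that context; any non-empty result names the modes that would make the context unsafe to act on.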