KyroBench

Measuring whether context systems stay correct when knowledge changes and defend against pollution and staleness.

Leaderboard

Official score is a certification gate. Retrieval and semantic signals are diagnostics.

Updated June 12, 2026

4 systems / 36,864 scored checks / 12,288 retrievals per run

Retrieval signal

Relevant evidence surfaced before certification checks.

Scale: 0-100

KyroDB: 79.1
Graphiti/Zep: 71.6
Qdrant: 53.3
Mem0: 0.06

Certified: 0/4
Top retrieval: KyroDB
Fastest p95: Graphiti/Zep

Retrieval signal shows whether the system can surface plausible evidence. It is useful for diagnosis, but it does not certify the context because similar text can still be stale, polluted, or unverifiable.

KyroDB: Strong retrieval coverage and complete proof metadata, but not certified because the held-out semantic and freshness gates did not clear.

Graphiti/Zep: High retrieval signal and fast responses, but not certified without proof metadata and complete context-support behavior.

Qdrant: Useful as a raw vector-store baseline; it does not expose native freshness, authority, or proof semantics for certification.

Mem0: Strict requests were often blocked as stale, leaving very little completed retrieval signal in the official run.

System       | Official | Retrieval | Semantic | Proof | Freshness | Pollution | p95 latency
KyroDB       | 0.0      | 79.1      | 0.0      | 100.0 | 0.0       | 100.0     | 520 ms
Graphiti/Zep | 0.0      | 71.6      | 0.0      | 0.0   | 69.4      | 100.0     | 168 ms
Qdrant       | 0.0      | 53.3      | 0.0      | 0.0   | 0.0       | 100.0     | 231 ms
Mem0         | 0.0      | 0.06      | 0.0      | 40.0  | 0.0       | 2.5       | 401 ms
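
The pattern in the table, where strong diagnostic columns sit beside an official score of 0.0, is what an all-or-nothing gate would produce. The sketch below is illustrative only. The gate set and the 90.0 threshold are assumptions, since the page publishes neither, and `Scores`, `failed_gates`, and `is_certified` are hypothetical names rather than KyroBench code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    retrieval: float   # diagnostic only; it never gates certification here
    semantic: float
    proof: float
    freshness: float
    pollution: float


# ASSUMPTION: which columns act as gates, and the passing threshold.
GATES = ("semantic", "proof", "freshness", "pollution")
THRESHOLD = 90.0


def failed_gates(scores: Scores) -> list[str]:
    """Return the gate columns that fall below the assumed threshold."""
    return [name for name in GATES if getattr(scores, name) < THRESHOLD]


def is_certified(scores: Scores) -> bool:
    """Certified only when every gate clears; retrieval is ignored."""
    return not failed_gates(scores)


# Values copied from the KyroDB row of the table.
kyrodb = Scores(retrieval=79.1, semantic=0.0, proof=100.0,
                freshness=0.0, pollution=100.0)
print(failed_gates(kyrodb))   # ['semantic', 'freshness']
print(is_certified(kyrodb))   # False
```

Under these assumed settings the KyroDB row fails on the semantic and freshness columns, which matches the note that those two gates did not clear.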

What KyroBench measures

KyroBench tests the context layer an agent receives before it acts: whether evidence is current, in scope, clean, supported, and small enough to use.

Freshness

Newer valid evidence must beat old text that still matches the query.
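
A check of this kind might look like the following sketch. The `Evidence` fields and `passes_freshness` are hypothetical names chosen for illustration, and the sample facts are invented; the benchmark's actual harness is not shown on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    doc_id: str
    text: str
    valid_from: date
    superseded: bool   # True once a newer fact replaces this one


def passes_freshness(ranked: list[Evidence]) -> bool:
    """Pass only if the top-ranked item is the newest non-superseded one."""
    current = [e for e in ranked if not e.superseded]
    if not ranked or not current:
        return False
    newest = max(current, key=lambda e: e.valid_from)
    return ranked[0].doc_id == newest.doc_id


# Both texts match a query about the rate limit; only one is still valid.
old = Evidence("v1", "Rate limit is 100 requests/min", date(2025, 1, 10), True)
new = Evidence("v2", "Rate limit is 250 requests/min", date(2026, 3, 2), False)

print(passes_freshness([new, old]))  # True
print(passes_freshness([old, new]))  # False: stale text outranked the update
```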

Scope

Similar evidence from another scope is treated as incorrect context.
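
As a rough illustration, a scope check can be as simple as comparing each returned item's scope label with the one requested. The record layout, the tenant names, and `scope_violations` are assumptions made for this sketch.

```python
def scope_violations(results: list[dict], requested_scope: str) -> list[str]:
    """Return ids of returned items whose scope differs from the request."""
    return [r["id"] for r in results if r["scope"] != requested_scope]


# Two near-identical policies that belong to different (hypothetical) tenants.
results = [
    {"id": "a-1", "scope": "tenant-acme", "text": "Refund window is 30 days"},
    {"id": "g-7", "scope": "tenant-globex", "text": "Refund window is 14 days"},
]

violations = scope_violations(results, "tenant-acme")
print(violations)  # ['g-7']
```

Any non-empty result counts as incorrect context in this sketch, however similar the out-of-scope text is to the query.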

Pollution

Close but invalid content must not survive ranking or packing.
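
One way to picture this check is to look for known-invalid items at both stages. The sketch below assumes the harness knows which ids are invalid; `surviving_pollution` and the sample ids are hypothetical.

```python
def surviving_pollution(
    ranked_ids: list[str],
    packed_ids: list[str],
    invalid_ids: set[str],
) -> dict[str, list[str]]:
    """Report invalid items still present after each stage."""
    return {
        "ranking": [i for i in ranked_ids if i in invalid_ids],
        "packing": [i for i in packed_ids if i in invalid_ids],
    }


invalid = {"draft-3"}                    # close to a valid doc, but retracted
ranked = ["doc-1", "draft-3", "doc-2"]   # the ranker let it through
packed = ["doc-1", "doc-2"]              # the packer dropped it

report = surviving_pollution(ranked, packed, invalid)
passed = not any(report.values())
print(report)  # {'ranking': ['draft-3'], 'packing': []}
print(passed)  # False: the invalid item survived ranking
```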

Proof

Returned context needs enough metadata for independent verification.
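
A minimal sketch of independent verification, assuming each returned item carries a source id, a retrieval timestamp, and a SHA-256 hash of its text. These field names and `is_verifiable` are assumptions for illustration; the page does not list the metadata KyroBench actually requires.

```python
import hashlib

# ASSUMPTION: the metadata fields a verifier would need.
REQUIRED_FIELDS = ("source_id", "retrieved_at", "content_sha256")


def is_verifiable(item: dict) -> bool:
    """True if all metadata is present and the hash matches the text."""
    if any(not item.get(field) for field in REQUIRED_FIELDS):
        return False
    digest = hashlib.sha256(item.get("text", "").encode("utf-8")).hexdigest()
    return digest == item["content_sha256"]


text = "Refund window is 30 days"
item = {
    "text": text,
    "source_id": "policy/refunds",
    "retrieved_at": "2026-06-12T00:00:00Z",
    "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
}
tampered = {**item, "text": "Refund window is 90 days"}
bare = {"text": text}   # what a raw store might return: text, no metadata

print(is_verifiable(item))      # True
print(is_verifiable(tampered))  # False
print(is_verifiable(bare))      # False
```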

Multi-hop

The system must combine related evidence without losing constraints.
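
To make "without losing constraints" concrete, the sketch below follows a two-hop chain while applying the same scope constraint at every hop. The fact tuples, the service and team names, and `follow` are all invented for illustration.

```python
# (subject, relation, object, scope)
Fact = tuple[str, str, str, str]

FACTS: list[Fact] = [
    ("checkout-api", "owned_by", "payments-team", "prod"),
    ("checkout-api", "owned_by", "sandbox-team", "staging"),
    ("payments-team", "on_call", "alice", "prod"),
    ("payments-team", "on_call", "carol", "staging"),
    ("sandbox-team", "on_call", "bob", "staging"),
]


def follow(subject: str, relations: list[str], scope: str) -> set[str]:
    """Walk the relations in order, keeping the scope constraint on each hop."""
    frontier = {subject}
    for relation in relations:
        frontier = {
            obj
            for (subj, rel, obj, fact_scope) in FACTS
            if subj in frontier and rel == relation and fact_scope == scope
        }
    return frontier


# Who is on call for checkout-api in prod?
print(sorted(follow("checkout-api", ["owned_by", "on_call"], "prod")))
# ['alice']
```

If the scope filter were applied only on the first hop, the second hop would also return `carol`, who is on call for the same team in a different scope.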