KyroBench
Measuring whether context systems stay correct when knowledge changes and whether they resist pollution and staleness.
Leaderboard
Official score is a certification gate. Retrieval and semantic signals are diagnostics.
Retrieval signal
Relevant evidence surfaced before certification checks.
Retrieval signal shows whether the system can surface plausible evidence. It is useful for diagnosis, but it does not certify the context because similar text can still be stale, polluted, or unverifiable.
KyroDB: Strong retrieval coverage and complete proof metadata, but not certified because the held-out semantic and freshness gates did not clear.
Graphiti/Zep: High retrieval signal and fast responses, but not certified without proof metadata and complete context-support behavior.
Qdrant: Useful as a raw vector-store baseline; it does not expose native freshness, authority, or proof semantics for certification.
Mem0: Strict requests were often blocked as stale, leaving very little completed retrieval signal in the official run.
| System | Official | Retrieval | Semantic | Proof | Freshness | Pollution | p95 latency |
|---|---|---|---|---|---|---|---|
| KyroDB | 0.0 | 79.1 | 0.0 | 100.0 | 0.0 | 100.0 | 520 ms |
| Graphiti/Zep | 0.0 | 71.6 | 0.0 | 0.0 | 69.4 | 100.0 | 168 ms |
| Qdrant | 0.0 | 53.3 | 0.0 | 0.0 | 0.0 | 100.0 | 231 ms |
| Mem0 | 0.0 | 0.06 | 0.04 | 0.0 | 0.0 | 2.5 | 401 ms |
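
One way to read "certification gate" is as an all-or-nothing check: a system is certified only when every gated score clears a cutoff, and a high retrieval score cannot compensate for a failed gate. The sketch below illustrates that reading only. The `GateScores` type, the choice of which columns act as gates (inferred from the KyroDB note above), and the `GATE_THRESHOLD` value are all assumptions, not the benchmark's published scoring rule.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GateScores:
    """Gated columns from the leaderboard (hypothetical grouping)."""
    semantic: float
    proof: float
    freshness: float
    pollution: float


# Hypothetical cutoff; the real thresholds are not stated in this document.
GATE_THRESHOLD = 90.0


def failed_gates(scores: GateScores, threshold: float = GATE_THRESHOLD) -> list[str]:
    """Return the names of gates whose score falls below the threshold."""
    return [name for name, value in asdict(scores).items() if value < threshold]


def is_certified(scores: GateScores) -> bool:
    """Certified only if no gate fails; retrieval is deliberately not consulted."""
    return not failed_gates(scores)


# KyroDB row from the table above (retrieval 79.1 is diagnostic and ignored here).
kyrodb = GateScores(semantic=0.0, proof=100.0, freshness=0.0, pollution=100.0)
print(failed_gates(kyrodb))   # ['semantic', 'freshness']
print(is_certified(kyrodb))   # False
```

Under these assumptions the KyroDB row fails exactly the two gates named in its note, which is why strong retrieval coverage and complete proof metadata still leave it uncertified.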
What KyroBench measures
KyroBench tests the context layer an agent receives before it acts: whether evidence is current, in scope, clean, supported, and small enough to use.
Freshness
Newer valid evidence must beat old text that still matches the query.
Scope
Similar evidence from another scope is treated as incorrect context.
Pollution
Close but invalid content must not survive ranking or packing.
Proof
Returned context needs enough metadata for independent verification.
Multi-hop
The system must combine related evidence without losing constraints.
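
The freshness, scope, and pollution rules above can be sketched as a filter applied to retrieved candidates before they are packed into context. This is a minimal illustration, not KyroBench's harness or any listed system's API. The `Evidence` fields (`key`, `scope`, `version`, `valid`) and the `select_context` function are hypothetical names introduced here. Proof and multi-hop checks are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    key: str      # which fact this item states (hypothetical field)
    text: str
    scope: str    # tenant, project, or similar boundary (hypothetical field)
    version: int  # larger means newer
    valid: bool   # False for retracted or polluted content


def select_context(candidates: list[Evidence], scope: str) -> list[Evidence]:
    """Keep only in-scope, valid evidence, and only the newest item per key."""
    newest: dict[str, Evidence] = {}
    for item in candidates:
        # Scope and pollution: out-of-scope or invalid items never survive.
        if item.scope != scope or not item.valid:
            continue
        # Freshness: a newer valid item replaces an older one for the same fact.
        current = newest.get(item.key)
        if current is None or item.version > current.version:
            newest[item.key] = item
    return list(newest.values())


candidates = [
    Evidence("rate_limit", "limit is 100", "team-a", version=1, valid=True),
    Evidence("rate_limit", "limit is 200", "team-a", version=2, valid=True),
    Evidence("rate_limit", "limit is 500", "team-b", version=3, valid=True),
    Evidence("rate_limit", "limit is 999", "team-a", version=4, valid=False),
]
selected = select_context(candidates, scope="team-a")
print([item.text for item in selected])  # ['limit is 200']
```

All four candidates would match a query about the rate limit, so a purely similarity-based ranker could return any of them. Only the version 2 item is simultaneously current, in scope, and valid.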