Ask any detection engineer what they do when they push a new rule live. The honest answer is "I watch the alert queue for a few hours and hope". They will not phrase it that way. They will say "we monitor for false positive rate", or "we have a feedback loop", or "we tune iteratively". These are all kind paraphrases of the same thing: the rule is being tested in production, on live data, with live consequences.
The defensible alternative is retrospective replay. Before the rule ships, you run it against the last 90 days of hot data and a sampled cross-section of the last 7 years of cold data. You know, before live traffic ever touches the new logic, what the rule fires on, how often, and whether it would have caught the last incident. The rule is no longer a hypothesis; it is a measured artefact.
This is not a novel idea. Detection engineers have been arguing for it for a decade. The reason it remains rare in practice — and we mean genuinely rare; informal surveys put real adoption in the 10-15% range — is not that teams don't want to do it. It is that their platforms can't.
Why most platforms structurally can't replay
Replay sounds simple. Stream historical events through the rule engine. Count the hits. But the architectural requirements are unforgiving, and the standard log-lake-plus-SIEM stack fails at least one of them in almost every deployment.
You need an addressable cold tier
If your historical data lives in compressed archives, glacier-class storage, or, worst of all, an offline tape rotation, replay is a 6-week IT project. The cold tier has to be queryable in place (random-access scans, predicate pushdown, schema-on-read) without rehydration. Most legacy SIEMs treat anything older than 90 days as "archive" and effectively store it as opaque blobs. Replay is impossible against opaque blobs.
You need the rule engine to be portable
If your rule engine is a black box embedded inside the production SIEM, you cannot point it at a replay corpus without affecting production. Some platforms solve this by spinning up a parallel environment; the cost is operational complexity and usually price. Some platforms do not solve it at all. The closer the rule engine is to a library that can be invoked from a script against an arbitrary event stream, the easier replay becomes.
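To make "a library that can be invoked from a script" concrete, here is a minimal Python sketch. It is not any platform's actual API: the `Rule` type, the `replay` function, and the event fields are all hypothetical, chosen only to show a rule as a pure predicate that runs identically against a live tail or a historical corpus.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

Event = dict  # hypothetical: one normalised event per dict


@dataclass(frozen=True)
class Rule:
    """A detection expressed as a pure predicate over a single event."""
    name: str
    matches: Callable[[Event], bool]


def replay(rule: Rule, events: Iterable[Event]) -> Iterator[Event]:
    """Yield every event the rule fires on; no side effects, no alerting."""
    return (event for event in events if rule.matches(event))


# Hypothetical rule: a service-account token issued from an unexpected source.
rule = Rule(
    name="anomalous-token-issuance",
    matches=lambda e: e.get("action") == "token.issue"
    and e.get("actor_type") == "service"
    and e.get("source") not in {"ci", "scheduler"},
)

# The same call works on a live stream or on a historical corpus.
corpus = [
    {"action": "token.issue", "actor_type": "service", "source": "ci"},
    {"action": "token.issue", "actor_type": "service", "source": "laptop-417"},
    {"action": "login", "actor_type": "user", "source": "vpn"},
]
hits = list(replay(rule, corpus))
print(f"{rule.name}: {len(hits)} hit(s)")
```

Because `replay` has no side effects, pointing it at a replay corpus cannot touch production alerting, which is the property the paragraph above asks for.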
You need the historical data to carry the same schema as today
This is the silent killer. A rule written in February 2026 against the current event schema may not match events from August 2023 because the field names changed, the values were reformatted, or the parser improved. Replay against historical data with schema drift produces nonsense. The fix is forward-compatible schemas — additive evolution only, never breaking — and a versioned parser library that can re-interpret old data through the current schema lens. This is a multi-year discipline that most platforms have not invested in.
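One way to picture a versioned parser library is as a chain of per-version upgrade steps applied to old events before the rule sees them. The sketch below assumes drift is limited to field renames; the `UPGRADES` table, the field names, and the `schema_version` key are hypothetical, and real drift (reformatted values, improved parsing) would need richer transforms.

```python
# Hypothetical rename history: schema version -> {old field: new field}.
# Each step only describes how to reach the next version.
UPGRADES = {
    1: {"src_ip": "source.ip", "user": "actor.name"},
    2: {"actor.name": "actor.id"},
}
CURRENT_VERSION = 3


def to_current(event: dict) -> dict:
    """Re-interpret an old event through the current schema, one version at a time."""
    version = event.get("schema_version", 1)
    fields = {k: v for k, v in event.items() if k != "schema_version"}
    while version < CURRENT_VERSION:
        renames = UPGRADES.get(version, {})
        fields = {renames.get(k, k): v for k, v in fields.items()}
        version += 1
    return {**fields, "schema_version": CURRENT_VERSION}


old = {"schema_version": 1, "src_ip": "10.0.0.7", "user": "svc-recon"}
print(to_current(old))
```

Keeping each step relative to the next version means a new schema release adds one entry to the table rather than rewriting every historical mapping.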
You need to replay traversals, not just events
For graph-pattern detections, replay isn't just streaming events past a filter. It is rehydrating the graph state at a point in time and traversing it. That requires a temporal graph — every edge carries an interval of validity — which is a substantially more demanding storage model than "events with timestamps". A platform that only retains events, not relationship state, can replay simple signature detections but cannot replay multi-hop pattern detections. The interesting detections are mostly the second kind.
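A toy illustration of what "every edge carries an interval of validity" buys you: the same traversal gives different answers depending on the point in time it is asked about. The `Edge` type, the node names, and the dates below are invented for the example, and a real temporal graph store would index intervals rather than scan a list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    valid_from: datetime
    valid_to: Optional[datetime]  # None means the relationship is still live

    def active_at(self, t: datetime) -> bool:
        return self.valid_from <= t and (self.valid_to is None or t < self.valid_to)


def reachable(edges: list[Edge], start: str, at: datetime) -> set[str]:
    """Nodes reachable from `start` using only edges valid at time `at`."""
    adjacency: dict[str, list[str]] = {}
    for e in edges:
        if e.active_at(at):
            adjacency.setdefault(e.src, []).append(e.dst)
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                stack.append(nxt)
    return seen


edges = [
    Edge("alice", "role:admin", datetime(2023, 1, 1), datetime(2023, 9, 1)),
    Edge("role:admin", "db:payments", datetime(2022, 1, 1), None),
    Edge("alice", "role:viewer", datetime(2023, 9, 1), None),
]
august = reachable(edges, "alice", datetime(2023, 8, 15))
january = reachable(edges, "alice", datetime(2024, 1, 1))
print(sorted(august), sorted(january))
```

An event-only store could tell you that alice touched the payments database in August 2023; only the interval-carrying edges can tell you she had a two-hop path to it at that moment and no longer does.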
What replay catches that production tuning doesn't
The argument is not "replay is academically better". The argument is that replay catches three classes of problem that production tuning either cannot catch at all or catches far too late. We'll walk through each, with examples we've watched land.
Seasonal noise
A rule that looks clean for ten days in production can light up at month-end batch processing, quarter-close, the financial year-end, or the annual security audit. Production tuning has, at most, a 30-day window before the rule is considered "stable" and forgotten. Replay against 90 days of hot data catches the monthly seasonal patterns; replay against multi-year cold data catches the annual ones.
A concrete case from a BFSI deployment: a rule designed to catch privileged-account abuse looked perfect in November-December. On 31 March — the financial year-end — it fired on every reconciliation job and generated 4,800 alerts in 90 minutes. The same rule, replayed against the prior year's March-end data before shipping, would have shown 5,200 historical firings. The team would have known to add the exclusion.
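A replay report can surface this kind of burst mechanically. The sketch below buckets replay hits by day and flags any day far above the median; the `spike_days` helper, the 10x threshold, and the data are all synthetic illustrations, not figures from the deployment described above.

```python
from collections import Counter
from datetime import date
from statistics import median


def spike_days(hit_dates: list[date], factor: float = 10.0) -> list[tuple[date, int]]:
    """Return (day, count) pairs whose count exceeds `factor` x the median active day."""
    per_day = Counter(hit_dates)
    if not per_day:
        return []
    # Days with zero hits are not in the counter, so this is the median
    # over days on which the rule fired at least once.
    baseline = median(per_day.values())
    return sorted((d, n) for d, n in per_day.items() if n > factor * baseline)


# Synthetic data: 3 hits a day for 30 days, then a year-end burst.
hit_dates = [date(2025, 3, d) for d in range(1, 31) for _ in range(3)]
hit_dates += [date(2025, 3, 31)] * 500
print(spike_days(hit_dates))
```

A median baseline is used here because a single extreme day would drag a mean upward and partly hide itself.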
"We've seen this before"
This is the headline value of retrospective detection. A new rule fires on yesterday's traffic. On replay against last August, it also fires on six instances that nobody noticed at the time. If those six hold up under investigation, they are real incidents that went undetected. The rule does not have to wait for a live attacker to prove itself; it has already caught activity that got through the old detection logic and was never investigated.
This is uncomfortable. It means historical breaches were unreported. It means the regulator may have grounds for retrospective enquiry. It also means the SOC's institutional understanding of "what we got hit with last year" gets revised, and the next budget cycle becomes a different conversation. Several teams have, frankly, avoided retrospective detection precisely because they did not want to find what they would find. We do not have a clever rebuttal to this; we think it is the wrong choice, but we understand why people make it.
Rule conflicts and overlap
New rules rarely live alone. They are added to a corpus of 800-2,000 existing rules. Some of those existing rules already fire on a superset of what the new rule catches. Some fire on a near-overlap that produces duplicate alerts. Replay reveals overlap before production does. Without replay, the SOC discovers overlap through analyst complaint, usually three weeks after the rule went live.
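Given the set of event IDs each rule fired on during the same replay window, overlap reduces to set arithmetic. The following is a rough sketch under that assumption; `overlap_report`, the rule names, and the event IDs are hypothetical.

```python
def overlap_report(
    new_hits: set[str], existing: dict[str, set[str]]
) -> list[tuple[str, float, bool]]:
    """For each existing rule: (name, Jaccard similarity, already covers the new rule)."""
    report = []
    for name, hits in existing.items():
        union = new_hits | hits
        jaccard = len(new_hits & hits) / len(union) if union else 0.0
        # Superset check: every hit of the new rule is already raised by this rule.
        covers = bool(new_hits) and new_hits <= hits
        report.append((name, round(jaccard, 2), covers))
    return sorted(report, key=lambda row: row[1], reverse=True)


new = {"e1", "e2", "e3"}
existing = {
    "priv-abuse-broad": {"e1", "e2", "e3", "e4", "e5", "e6"},
    "token-reuse": {"e3", "e9"},
    "dns-tunnel": {"e7"},
}
report = overlap_report(new, existing)
for row in report:
    print(row)
```

A `True` in the last column is the "existing rule fires on a superset" case; a high Jaccard score without it is the near-overlap that tends to produce duplicate alerts.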
| Class of problem | Detectable via production tuning? | Detectable via replay? |
|---|---|---|
| Daily false-positive baseline | Yes, in ~7 days | Yes, in minutes |
| Weekly seasonality | Yes, in ~30 days | Yes, in minutes |
| Monthly/quarterly batch events | Maybe, in ~90 days | Yes, in minutes |
| Annual events (year-end, audit) | No | Yes, in minutes |
| Historical undetected incidents | No | Yes, conditional on data |
| Overlap with existing corpus | Eventually, via analyst complaint | Yes, immediately |
| Schema drift in upstream parsers | No | Yes, surfaces as parse errors |
What a workable replay budget looks like
Replay is not free, and pretending otherwise is the second-fastest way to lose credibility (the fastest being "graphs are magic"). Realistic replay budgets, from our deployments:
- 14-day hot replay: 30 to 90 seconds wall-clock for a moderately complex graph-pattern detection. Run on every PR as a blocking gate.
- 90-day warm replay: 5 to 12 minutes wall-clock. Run on every PR for new rules; skipped for trivial exclusion changes.
- 1-year sampled cold replay: 20 to 60 minutes wall-clock against a 10% sample of cold data. Run nightly across the whole rule corpus; surfaces drift and overlap.
- Full 7-year cold replay: 2 to 6 hours wall-clock against full cold data. Run on demand — typically when a new threat-intel hit lands and the question is "were we touched at any point in the last seven years?"
The on-demand 7-year replay has the highest practical value during an incident or a regulatory enquiry. When the question is "did this novel TTP touch our environment in the last 24 months?", the team that can answer in three hours has a categorically different conversation with the regulator than the team that needs to scope a four-week project.
```shell
# typical replay invocation, from the platform CLI
ng replay \
  --rule detections/identity/anomalous-token-issuance.ngd \
  --since 2019-02-01 --until 2026-02-19 \
  --sample 0.10 \
  --emit json,markdown \
  --report-to ./replay-reports/2026-02-19-anomalous-token-issuance.md
```
The output is the prosaic part of retrospective detection: a markdown report with hit counts, distribution over time, distinct entities involved, and a sample evidence pack for each cluster of hits. The detection engineer reads it. The lead reviews it. The merge is conditional on the report. Production traffic is the last test, not the first.
The cultural shift, again
The first time a team sees retrospective detection working, the reaction is usually a long pause followed by "wait, how did we not do this before?" The answer is that the platform never let you. Once the platform does, the practice becomes table-stakes inside a quarter. The teams that have made the shift will tell you it's the single biggest change in their detection lifecycle in the last five years. The teams that haven't tend to over-rotate on real-time tuning because it is what they have.
One related practice that surfaces naturally from replay: tagging rules by "evidence vintage". A rule that has been replayed against 5+ years of data with stable behaviour is more trustworthy than a rule that has only seen the last 30 days. Treat vintage as a metadata field. Surface it to L1 analysts when they triage. An L1 escalating a rule with low vintage knows to ask harder questions; an L1 escalating a rule with high vintage knows the rule has been stress-tested.
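As a sketch of what "treat vintage as a metadata field" might look like, the function below derives a coarse label from a rule's replay history. The `ReplayRecord` type and the `stable` flag are hypothetical. The five-year threshold follows the "5+ years" figure above; the 90-day boundary for "medium" is our own assumption for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReplayRecord:
    window_start: date  # earliest data the rule was replayed against
    window_end: date
    stable: bool  # hypothetical flag: a reviewer judged behaviour stable


def evidence_vintage(records: list[ReplayRecord]) -> str:
    """Map the longest stable replay window to a coarse label for triage."""
    days = max(
        ((r.window_end - r.window_start).days for r in records if r.stable),
        default=0,
    )
    if days >= 5 * 365:
        return "high"
    if days >= 90:  # assumed boundary, not from any standard
        return "medium"
    return "low"


seasoned = [ReplayRecord(date(2019, 2, 1), date(2026, 2, 19), stable=True)]
fresh = [ReplayRecord(date(2026, 1, 20), date(2026, 2, 19), stable=True)]
print(evidence_vintage(seasoned), evidence_vintage(fresh))
```

Unstable replays are deliberately ignored: a rule replayed against seven years of data that misbehaved has not earned a high vintage.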
Where to start tomorrow
If your platform supports replay, the engineering work is largely already done; the cultural work is to make it a gate, not an option. Add a CI requirement: no rule merges without a replay report. Period. The detection engineers will grumble for a week and then never go back.
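The gate itself can be very small. The sketch below assumes reports are named `<date>-<rule stem>.md`, mirroring the report path in the CLI example earlier; the `missing_reports` helper and the second rule name are hypothetical, and how your CI obtains the list of changed rule files is left out.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming convention: YYYY-MM-DD-<rule stem>.md
DATED = re.compile(r"^\d{4}-\d{2}-\d{2}-(?P<rule>.+)$")


def reported_rules(report_dir: Path) -> set[str]:
    """Rule stems that have at least one dated replay report."""
    names = set()
    for report in report_dir.glob("*.md"):
        match = DATED.match(report.stem)
        if match:
            names.add(match.group("rule"))
    return names


def missing_reports(changed_rules: list[str], report_dir: Path) -> list[str]:
    """Changed rule files with no replay report; non-empty means block the merge."""
    covered = reported_rules(report_dir)
    return [rule for rule in changed_rules if Path(rule).stem not in covered]


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    reports = Path(tmp)
    (reports / "2026-02-19-anomalous-token-issuance.md").write_text("# replay report\n")
    changed = [
        "detections/identity/anomalous-token-issuance.ngd",
        "detections/identity/dormant-admin-login.ngd",  # hypothetical rule
    ]
    missing = missing_reports(changed, reports)
print(missing)
```

This only checks that a report exists. A stricter gate would also compare the report date against the rule's last modification, so a stale report cannot satisfy a changed rule.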
If your platform doesn't support replay, you have two options. Option one: replay against a sample. Even 5% of hot data, replayed manually before each rule push, is better than nothing. Option two: instrument the gap. Track how many rules ship without replay coverage, and how often those rules subsequently produce noise spikes or miss known incidents. The data will make the case for a platform change more compellingly than any blog post.
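For option one, it helps if the sample is deterministic, so that two replays of the same rule see the same events and their results are comparable. A minimal sketch, assuming each event has a stable identifier (the `evt-` IDs here are made up):

```python
import hashlib


def in_sample(event_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of events, keyed on a stable event id."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over [0, 2**64)
    return bucket < int(rate * 2**64)


events = [f"evt-{i}" for i in range(100_000)]
sample = [e for e in events if in_sample(e, 0.05)]
print(f"kept {len(sample) / len(events):.3f} of events")
```

Hashing the ID rather than calling a random number generator means the same event is always in or always out of the sample. The trade-off is the usual one for sampling: hit rates extrapolate reasonably well, but a single rare incident can fall outside the sample, so a clean sampled replay is weaker evidence than a clean full one.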
We are biased, obviously. Netgraph was built with a temporal graph and a unified hot-cold query layer specifically so replay is a single command. We also know that not everyone reading this is in a position to migrate. The argument stands regardless of the vendor: retrospective replay is not a luxury, and the platforms that make it impossible are leaving real detection capability on the table. If you cannot adopt it, at least know what you are giving up.
Key takeaways
- Retrospective detection means replaying a new or changed rule against historical data before it goes live, not after.
- Most platforms structurally cannot do this: opaque cold storage, embedded rule engines, schema drift, and event-only retention all block replay.
- Replay catches three classes of problem that production tuning misses or finds too late: annual seasonality, previously-undetected historical incidents, and rule overlap.
- Workable replay budgets: 14-day hot under 90s, 90-day warm under 12 minutes, sampled 1-year cold under an hour, full 7-year cold on demand.
- Make it a CI gate, not an option. Tag rules by "evidence vintage" and surface that to L1 triage.
That closes the first run of blog posts. The next series will move from architecture and discipline into the harder operational realities — agentic adversaries, multi-cloud identity blast radius, and the new economics of full-fidelity retention. If you want a preview of where that goes, the FAQ has the short version of most of it.