1. Background: why correlation keeps breaking
A working analyst spends most of their day chasing the same question in slightly different forms. Did this token, used from this host, on behalf of this principal, touch anything that matters? The question has four named entities, three relationships, and an implicit time window. It is, definitionally, a graph query. Yet the systems we hand the analyst — log lakes, behaviour analytics, posture scanners, ticketing — store the answer as four disconnected projections of the same underlying reality.
The cost shows up in three places. First, in mean time to context: the analyst becomes the join engine, pivoting across consoles by hand. Second, in detection brittleness: rules written against one projection cannot reference attributes that live in another, so they overfit to surface artefacts. Third, in compounding silence: exposures that span domains — an over-privileged identity attached to a workload that mounts a secret used by a public-facing service — are invisible to any single tool and visible to the graph only if someone, somewhere, has walked it end to end.
The pattern is not new. What is new is that the entity count has crossed the threshold where ad-hoc reconstruction has stopped working. A mid-size enterprise today routinely has more than a hundred thousand identities, half a million workload instances per quarter, several million policy attachments, and a code surface that turns over weekly. The number of meaningful relationships among those entities is in the high tens of millions. No analyst, however senior, can hold that in working memory; no tabular schema, however wide, can express it without combinatorial pain.
2. The architectural case
2.1 Relational joins versus typed traversal
The fundamental cost asymmetry is well known to anyone who has profiled a many-way join. A relational query plan for "find principals two hops from a public service with write capability on a secret store" expands combinatorially: each intermediate table is materialised, filtered, then hashed against the next. The optimiser does its best, but the underlying primitive is set-of-tuples, and the algebra has no first-class notion of an edge.
A typed traversal does not pay this cost. Once a node is in hand, its neighbours are an O(1) pointer dereference, so a depth-k expansion costs on the order of d^k edge visits for average degree d. That work is proportional to the neighbourhood actually explored and independent of the total entity count n, whereas a k-way join plan can degrade towards O(n^k) when intermediate results are not selective. For the sparse, high-cardinality graphs that describe security state (most nodes have a handful of immediate neighbours, a few have thousands), the practical speedup can reach two to three orders of magnitude on multi-hop operational queries. More importantly, the query reads like the question: pattern in, pattern out, with edge types as first-class predicates.
// Conceptual: "principals reaching a public service within 2 hops, with write to a secret"
// The matched path is bound to a variable so that it can be returned.
MATCH path = (p:Principal)-[:CAN_ASSUME*1..2]->(:Role)-[:ATTACHED_TO]->(w:Workload)
WHERE w.exposure = 'public'
  AND EXISTS { (p)-[:HAS_PERMISSION {verb: 'write'}]->(:SecretStore) }
RETURN p, w, path
The equivalent in a wide-table store is a screen of CTEs, three self-joins, and a window function, and it still cannot express "within 2 hops" without unrolling the hops by hand or resorting to a recursive CTE.
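The cost argument can be made concrete with a small sketch. The following is a minimal, illustrative k-hop expansion over adjacency lists, not Netgraph's engine; the toy graph `ADJ` and the function name `k_hop` are assumptions made for the example. It counts the edges it examines, to show that the work depends on the neighbourhood of the start node and not on the size of the rest of the graph.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (edge_type, neighbour).
ADJ = {
    "alice": [("CAN_ASSUME", "role-deploy")],
    "role-deploy": [("ATTACHED_TO", "web-1")],
    "web-1": [("READS", "secrets-prod")],
    "secrets-prod": [],
    "bob": [("CAN_ASSUME", "role-audit")],
    "role-audit": [],
}


def k_hop(adj, start, k, edge_types=None):
    """Return ({node: hop distance}, edges examined) for nodes within k hops."""
    dist = {start: 0}
    queue = deque([start])
    edges_examined = 0
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand past the hop limit
        for edge_type, nxt in adj.get(node, []):
            edges_examined += 1
            if edge_types is not None and edge_type not in edge_types:
                continue  # edge-type filter acts as a traversal predicate
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist, edges_examined


reached, work = k_hop(ADJ, "alice", 2)
print(reached)  # nodes within two hops of alice, with their distances
print(work)     # edges examined; bob's side of the graph is never touched
```

Adding more unrelated principals to `ADJ` leaves `work` unchanged for this query, which is the locality property the section relies on.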
2.2 Schema-on-write versus schema-on-read
Detection content is, in the end, a contract between the writer of a rule and the shape of the world it expects. Schema-on-read shifts that contract to query time: every rule re-asserts its own view of the data, in the syntax of the engine it happens to run on. The result is the well-documented detection-content tax: a rule written against one parser breaks when the parser changes, and the same logical detection has to exist in three dialects to cover three data sources.
A typed property graph forces schema-on-write at the boundary of ingest. The projector resolves every event into nodes and edges of declared types, with declared properties, before anything else sees it. Downstream consumers — rules, hunts, posture queries, agent reasoning — bind to the graph schema, not to the source format. When a new source arrives, the projector grows; nothing else has to.
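As a rough illustration of schema-on-write at the ingest boundary, the sketch below rejects anything that does not match a declared type before it can reach the graph. The schema slice, the property names, and the helpers `make_node` and `make_edge` are hypothetical; the real catalogue and its validation rules are not described in this paper.

```python
from dataclasses import dataclass

# Hypothetical slice of a declared schema; a real catalogue is far larger.
NODE_TYPES = {
    "Principal": {"principal_id", "tenant_id"},
    "Workload": {"workload_arn", "tenant_id"},
}
# Edge type -> (required source node type, required target node type).
EDGE_TYPES = {
    "OBSERVED_FROM": ("Principal", "Workload"),
}


class SchemaError(ValueError):
    """Raised when an event does not fit the declared graph schema."""


@dataclass(frozen=True)
class Node:
    type: str
    props: tuple  # sorted (key, value) pairs, so nodes are hashable


def make_node(node_type, props):
    declared = NODE_TYPES.get(node_type)
    if declared is None:
        raise SchemaError(f"undeclared node type: {node_type}")
    unknown = set(props) - declared
    missing = declared - set(props)
    if unknown or missing:
        raise SchemaError(
            f"{node_type}: unknown={sorted(unknown)} missing={sorted(missing)}"
        )
    return Node(node_type, tuple(sorted(props.items())))


def make_edge(edge_type, src, dst):
    expected = EDGE_TYPES.get(edge_type)
    if expected is None:
        raise SchemaError(f"undeclared edge type: {edge_type}")
    if (src.type, dst.type) != expected:
        raise SchemaError(f"{edge_type} expects {expected}, got {(src.type, dst.type)}")
    return (src, edge_type, dst)
```

Under this arrangement a new source only needs a mapping into `make_node` and `make_edge`; rules written against the declared types do not change.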
2.3 OLAP analytics versus OLTP graph patterns
Security workloads are sometimes analytical (compute a rolling baseline over a billion events) and sometimes transactional (resolve this token to its owning principal in 8 milliseconds). The industry default is to pick a columnar store optimised for the first and accept that the second runs poorly. A graph-native substrate inverts the default: it optimises for point lookups and short traversals, and pushes analytical scans to a sidecar projection that reads from the same canonical entities.
| Workload | Tabular columnar store | Graph-native substrate |
|---|---|---|
| Aggregate over a billion events (last 90 days) | Excellent | Delegated to columnar sidecar |
| "Who can reach what, with which privilege" (point-in-time) | Multi-minute, brittle | Sub-second, native |
| "Did this principal touch any of these 12 assets in the last 30 minutes" | Acceptable if pre-indexed | Single traversal |
| "All paths from a public surface to crown-jewel data" | Effectively impossible | Routine |
| "Same query, but bounded by code-change deltas in the last 7 days" | Requires a separate system | Same query, additional edge predicate |
3. The Netgraph instantiation
Netgraph treats the property graph as the system of record for security state, with everything else as a derived projection. The pipeline has four stages, in this order, with no shortcuts and no out-of-band channels between them.
3.1 Projector
The projector is responsible for one job: turn every inbound event, posture finding, code change, and identity assertion into a typed addition to the graph, idempotently. It owns the canonical entity catalogue (Principal, Role, Workload, SecretStore, CodeArtifact, NetworkEdge, DataAsset, and roughly sixty more), the canonical edge taxonomy (CAN_ASSUME, ATTACHED_TO, OBSERVED_FROM, MODIFIES, READS, EXPOSES, GRANTS, and so on), and the resolution rules that collapse aliases of the same entity across sources.
Two properties matter here. First, the projector is deterministic: the same input event produces the same nodes and edges every time, so reprocessing is safe and replay is exact. Second, it is versioned at the edge level: every edge carries a valid-from and valid-to timestamp, so the graph is queryable at any point in the past without separate snapshot storage.
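The two properties can be sketched together. In the minimal example below, node identifiers are derived from the type and natural key so that replays land on the same node, and each edge carries a validity interval so that the log can be queried as of any timestamp. The class `EdgeLog`, the function `node_id`, and the use of a truncated SHA-256 digest are assumptions for illustration, not the projector's actual design.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def node_id(node_type: str, natural_key: str) -> str:
    """Derive a stable id, so the same input always maps to the same node."""
    return hashlib.sha256(f"{node_type}|{natural_key}".encode()).hexdigest()[:16]


@dataclass
class Edge:
    src: str
    dst: str
    type: str
    valid_from: int                 # epoch seconds
    valid_to: Optional[int] = None  # None means the edge is still valid


class EdgeLog:
    def __init__(self):
        self.edges = {}  # (src, dst, type, valid_from) -> Edge

    def upsert(self, src, dst, edge_type, valid_from):
        key = (src, dst, edge_type, valid_from)
        # setdefault makes reprocessing the same event a no-op.
        return self.edges.setdefault(key, Edge(src, dst, edge_type, valid_from))

    def close(self, src, dst, edge_type, valid_to):
        """Mark the currently open edge between src and dst as ended."""
        for edge in self.edges.values():
            if (edge.src, edge.dst, edge.type) == (src, dst, edge_type) and edge.valid_to is None:
                edge.valid_to = valid_to

    def as_of(self, ts):
        """Edges valid at time ts, with no separate snapshot required."""
        return [
            e for e in self.edges.values()
            if e.valid_from <= ts and (e.valid_to is None or ts < e.valid_to)
        ]


log = EdgeLog()
alice = node_id("Principal", "t1/alice")
deploy = node_id("Role", "t1/deploy")
log.upsert(alice, deploy, "CAN_ASSUME", 100)
log.upsert(alice, deploy, "CAN_ASSUME", 100)  # replayed event, no duplicate
log.close(alice, deploy, "CAN_ASSUME", 200)
```

After these calls the log holds one edge, visible to `as_of(150)` and absent from `as_of(250)`.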
3.2 Graph store
The store implements a typed property graph with secondary indices on hot properties, in-memory caches for high-degree nodes, and an append-only edge log. It is operated as a clustered service with synchronous replication for the hot tier and asynchronous tiering to durable storage for cold edges. Path queries are evaluated by a planner that prefers bidirectional BFS for short paths and constrained DFS with pruning for longer ones; the same planner is used by detection content and by the response engine.
# Edge insert (conceptual, projector-side)
# The node handles are captured so that the edge can reference them.
principal_node = upsert_node(type="Principal", natural_key=tenant_id + "/" + principal_id, props={...})
workload_node = upsert_node(type="Workload", natural_key=tenant_id + "/" + workload_arn, props={...})
upsert_edge(
    src=principal_node,  # written as src/dst because `from` is a reserved word in Python
    dst=workload_node,
    type="OBSERVED_FROM",
    valid_from=event_ts,
    props={"method": "sts.assume_role", "ip": event_ip, "user_agent": event_ua},
)
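The planner's preference for bidirectional BFS on short paths can be illustrated with a textbook version of the technique. This is a generic sketch, not the Netgraph planner: the function names `shortest_hops` and `reverse`, and the toy adjacency map `FWD`, are assumptions. The search expands whichever frontier is smaller, one full level at a time, and stops at the first level in which the two searches meet.

```python
def reverse(fwd):
    """Build reverse adjacency so the search can also walk edges backwards."""
    rev = {}
    for src, dsts in fwd.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def shortest_hops(fwd, rev, source, target):
    """Bidirectional BFS: hop count of a shortest directed path, or None."""
    if source == target:
        return 0
    dist_f, dist_b = {source: 0}, {target: 0}
    frontier_f, frontier_b = [source], [target]
    while frontier_f and frontier_b:
        forward = len(frontier_f) <= len(frontier_b)
        if forward:
            frontier, adj, mine, other = frontier_f, fwd, dist_f, dist_b
        else:
            frontier, adj, mine, other = frontier_b, rev, dist_b, dist_f
        best = None
        next_frontier = []
        for node in frontier:
            for nxt in adj.get(node, ()):
                if nxt in other:  # the two searches meet across this edge
                    total = mine[node] + 1 + other[nxt]
                    best = total if best is None else min(best, total)
                if nxt not in mine:
                    mine[nxt] = mine[node] + 1
                    next_frontier.append(nxt)
        if best is not None:
            return best  # minimum over the whole level is the shortest path
        if forward:
            frontier_f = next_frontier
        else:
            frontier_b = next_frontier
    return None


# Hypothetical toy graph: node -> list of successor nodes.
FWD = {
    "internet": ["lb", "cdn"],
    "lb": ["web-1"],
    "web-1": ["role-app"],
    "role-app": ["secrets-prod"],
    "batch": ["secrets-prod"],
}
REV = reverse(FWD)
print(shortest_hops(FWD, REV, "internet", "secrets-prod"))
```

Searching from both ends keeps each frontier small on graphs where a few nodes have very high degree. Returning the path itself, rather than its length, would additionally require parent pointers on each side.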
3.3 Graph-grounded retrieval
Retrieval is the surface the rest of the platform — and any language-model reasoning — uses to ask the graph questions. It exposes three primitives: resolve (identifier to canonical node), neighbourhood (k-hop expansion with edge-type filters and property predicates), and path (constrained shortest-path or all-paths between two nodes or sets). Every retrieval call is logged with the exact query, the time bound, and the result-set hash, so that any downstream finding can be re-derived later by replaying the same call.
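A minimal sketch of how logged, replayable retrieval might look is given below. The class `Retrieval`, its method signatures, and the log entry fields are hypothetical. Only `resolve` and a one-hop `neighbourhood` are shown, and the toy graph is not versioned, so the `as_of` time bound is recorded but not applied.

```python
import hashlib
import json


class Retrieval:
    def __init__(self, adj, aliases):
        self.adj = adj          # node -> list of (edge_type, neighbour)
        self.aliases = aliases  # raw identifier -> canonical node
        self.audit_log = []

    def _record(self, call, args, as_of, result):
        # Hash a canonical JSON form so identical results hash identically.
        canonical = json.dumps(result, sort_keys=True)
        self.audit_log.append({
            "call": call,
            "args": args,
            "as_of": as_of,
            "result_hash": hashlib.sha256(canonical.encode()).hexdigest(),
        })
        return result

    def resolve(self, identifier, as_of):
        """Map a raw identifier to its canonical node (None if unknown)."""
        result = self.aliases.get(identifier)
        return self._record("resolve", {"identifier": identifier}, as_of, result)

    def neighbourhood(self, node, edge_types, as_of):
        """One-hop expansion filtered by edge type, in a stable order."""
        result = sorted(n for t, n in self.adj.get(node, []) if t in edge_types)
        args = {"node": node, "edge_types": sorted(edge_types)}
        return self._record("neighbourhood", args, as_of, result)
```

Replaying a logged call against the same graph state should reproduce the stored `result_hash`; a mismatch indicates that the finding can no longer be re-derived as recorded.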
3.4 Agent layer
The agent layer is the only component permitted to mutate state outside the graph itself, and it does so only by emitting actions to the response engine. Crucially, the agent layer does not talk to raw logs or vendor APIs; it talks to the retrieval primitives above. This is what makes its reasoning auditable: every step it took is a graph query with a stored result, and every action it proposed is grounded in that result.
4. Graph-native versus graph-flavoured
Several incumbent platforms describe themselves as graph-based. In practice they fall into one of three patterns, and all three diverge from a graph-native substrate in ways that matter operationally.
4.1 The closed schema
The first pattern ships a fixed, vendor-owned schema, usually narrowly focused on identity and cloud posture. The graph is real, but it is read-only from the customer's perspective: new entity types cannot be added without a release, and edges between the vendor's domain and the customer's own (custom applications, internal data classifications, business-criticality tiers) are simply not expressible. The graph helps for the questions the vendor anticipated, and disappears for the rest.
4.2 The dashboard graph
The second pattern is a node-link visualisation rendered on top of a tabular store. The picture is a graph; the data underneath is rows. Path queries are simulated by joining tables at view time, which produces correct-looking answers for one or two hops and then collapses — either in performance, or in correctness, once edges have to be derived rather than stored. The give-away is the query language: if the platform's detection or hunting syntax does not have edges as first-class citizens, the graph is decoration.
4.3 The opaque relationship
The third pattern stores relationships but does not expose them. The platform may use a graph internally for risk scoring, but customers cannot write a query that uses an edge directly; they can only consume the scores. This is the worst of the three for an analyst: the system can show that two things are related but cannot be asked why, and the answer cannot be appealed.
| Property | Graph-native (Netgraph) | Closed schema | Dashboard graph | Opaque relationship |
|---|---|---|---|---|
| Custom entity / edge types | First-class | Vendor-owned | Limited to view layer | Hidden |
| Edges as query predicates | Yes | Partial | No (joins under the hood) | No |
| Same language for detection, hunt, response | Yes | No | No | No |
| Point-in-time queries | Native (edge-versioned) | Snapshot-based | Best-effort | Not exposed |
| Agent grounding | Graph retrieval | API surface | Tabular API | Score only |
| Replayable investigation | Yes | Partial | No | No |
5. What this unlocks in practice
5.1 Detection that reads like the threat model
Because the graph carries identity, workload, code, and network as one substrate, a detection can express a real adversary objective rather than a surface artefact. "Service principal acquired a session via an unattended workflow, then accessed a data asset it has never accessed before, from a workload whose code changed in the last 24 hours" is a single pattern across four edge types. In a tabular world it is four detections in three systems, glued together by hope.
5.2 Hunting as graph editing
An analyst who finds a new technique can encode it directly as a graph pattern, save it, and have it run against history without translation. The pattern is the rule; the rule is the hunt. When the hunt fires, the resulting incident is, again, a sub-graph — not a row in an alert table.
5.3 Exposure as reachability
Posture scanning, the workload most often siloed away in its own product category, becomes a path query: "from any public surface, is there a sequence of edges, with the privileges required, that ends at a crown-jewel asset?" Findings stop being lists of misconfigurations and start being end-to-end paths an adversary would actually traverse.
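The reachability framing can be sketched as a constrained path enumeration. The example below is illustrative only: the function `exposure_paths`, the toy graph `EXPOSURE_ADJ`, and the choice of which edge types count as attacker-walkable are assumptions. It yields every simple path from a public node to a crown-jewel node that uses only allowed edge types, up to a hop limit.

```python
# Hypothetical toy graph: node -> list of (edge_type, neighbour),
# with edges oriented in the direction an attacker could walk them.
EXPOSURE_ADJ = {
    "internet-lb": [("EXPOSES", "web-1")],
    "web-1": [("CAN_ASSUME", "role-app"), ("READS", "cache")],
    "role-app": [("READS", "customer-db")],
}


def exposure_paths(adj, public, crown_jewels, allowed_edge_types, max_hops=6):
    """Yield simple paths (lists of nodes) from public nodes to crown jewels."""

    def walk(node, path):
        if node in crown_jewels:
            yield list(path)  # copy, because path is mutated during the walk
            return
        if len(path) - 1 >= max_hops:
            return  # hop budget exhausted
        for edge_type, nxt in adj.get(node, []):
            if edge_type in allowed_edge_types and nxt not in path:
                path.append(nxt)
                yield from walk(nxt, path)
                path.pop()

    for start in sorted(public):
        yield from walk(start, [start])


findings = list(
    exposure_paths(
        EXPOSURE_ADJ,
        public={"internet-lb"},
        crown_jewels={"customer-db"},
        allowed_edge_types={"EXPOSES", "CAN_ASSUME", "READS"},
    )
)
print(findings)
```

Each finding is an end-to-end path rather than a single misconfiguration, so removing any one edge on the path is a candidate remediation.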
5.4 Response as graph mutation
Containment actions are, by construction, edits to the graph: revoke this edge, isolate this node, suspend this identity. The same query that scoped the incident scopes the containment, so blast radius is calculable in advance and reversible after the fact.
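One way to picture containment as a reversible graph edit is the sketch below. It is a simplified assumption, not the response engine: the class `Graph` and the functions `blast_radius` and `revoke` are hypothetical names. The blast radius is computed on a copy before anything changes, and the revocation returns an undo handle.

```python
class Graph:
    def __init__(self, edges):
        self.edges = set(edges)  # (src, edge_type, dst) triples

    def reachable(self, start):
        """All nodes reachable from start by following directed edges."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for src, _, dst in self.edges:
                if src == node and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen - {start}


def blast_radius(graph, start, edge):
    """Nodes that start would lose access to if edge were revoked."""
    before = graph.reachable(start)
    after = Graph(graph.edges - {edge}).reachable(start)  # computed on a copy
    return before - after


def revoke(graph, edge):
    """Remove the edge and return a callable that restores it."""
    graph.edges.discard(edge)
    return lambda: graph.edges.add(edge)
```

Calling `blast_radius` before `revoke` gives the scope of the action in advance, and calling the returned handle restores the prior state. In an edge-versioned store the revocation would more plausibly close the edge's validity interval than delete it.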
6. Cost, scale, and the question of overhead
The most common objection to graph-native storage is cost: do you not pay a steep penalty in write amplification, in memory, in operator skill? The honest answer is that you trade one cost for another. Write amplification is real — projection produces more state per event than naive append — but the read-side savings on the queries that matter dominate within weeks. Memory pressure is managed by keeping cold edges in tiered storage and hot subgraphs in cache, which is well-understood operational practice. Operator skill is a one-time learning cost; pattern-matching syntax is, in our experience, easier for SOC analysts than multi-CTE SQL because it maps directly onto how they already think about incidents.
// Cost ratio sketch (illustrative; varies by tenancy and source mix)
ingest_overhead_factor ≈ 1.6× // vs. flat-append
storage_overhead_factor ≈ 1.2× // edges versioned; cold tier amortises
median_query_speedup ≈ 30× // typical 2–3 hop operational queries
detection_authoring_speedup ≈ 4× // one language, one schema
investigation_time_reduction ≈ 60–75% // measured across rollouts
7. Anti-patterns we have learnt to refuse
- Bolting a graph viewer onto a log lake. If the edges are computed at view time, they are not edges; they are joins with a nicer icon.
- Treating the graph as an enrichment cache. A read-only graph that mirrors the SIEM is a duplication, not a substrate. The graph has to own writes for the surrounding workflow to converge.
- Letting the agent reason over raw logs. Once the model can pull from outside the graph, the auditability story collapses. The discipline is: every model call gets a retrieval handle, never a query string against an arbitrary store.
- Closed entity vocabularies. Customers will, and should, add their own entity types — business-criticality tiers, regulator-defined data classes, internal blast-radius zones. A platform that refuses these turns into a posture scanner with extra steps.
8. What we are not claiming
We are not claiming that graphs replace columnar storage. Aggregations over billions of events still belong in a columnar sidecar, and Netgraph operates one. We are not claiming that every query is faster on a graph; bulk scans without locality are slower, and we route them accordingly. And we are not claiming that the graph removes the need for skilled analysts; it removes the need for them to be join engines, which is not the same thing.
What we are claiming is narrower and, we think, decisive: the operational questions a SOC actually asks are graph-shaped, and a platform whose substrate is not a graph will answer them by approximation. Approximation is the source of the silence that defines modern breaches.
Key takeaways
- The SOC's working questions (reachability, privilege, blast radius, lateral movement) are graph-shaped by definition; tabular stores approximate them at combinatorial join cost.
- A typed property graph forces schema-on-write at ingest, which collapses the detection-content tax: one language across detection, hunting, posture, and response.
- Graph-native is distinct from graph-flavoured: edges must be first-class predicates in the query language, custom entity types must be additive, and relationships must be inspectable.
- Netgraph's pipeline — projector, graph store, graph-grounded retrieval, agent layer — keeps every step auditable because every model call is a logged retrieval against the same substrate.
- Investigations become reproducible, containment becomes reversible, and posture becomes reachability. None of these are possible when the graph is a viewer on top of rows.
About this paper
Authored by Autocops Desk — the Netgraph practitioner team. First published 12 February 2026. Confidential to recipient.