SOAR without tears: code-first playbooks that survive an audit

Drag-and-drop playbooks look great in the demo and rot inside a year. Here is the discipline that produces SOAR content that an auditor will sign off on and an analyst will still trust at 2 a.m.

Every SOC we have inherited comes with a graveyard of SOAR playbooks. There is a Phishing Triage v3.2, a Phishing Triage Final, a Phishing Triage Final FINAL, and a Phishing Triage (DO NOT USE - broken). Half are disabled. A quarter were modified by analysts who have since left. None of them has a clean test trail. When the auditor asks how the bank quarantines an inbox, the answer is a Slack thread.

The problem is rarely the SOAR product. It is the operating model. Drag-and-drop is a great UI for first-time authoring and a terrible UI for change management. Playbooks are software; they need to be versioned, reviewed, tested, gated and rolled back like software. This post is the pattern set we use to keep playbook estates healthy through DPDP, ISO 27001 and SOC 2 audits — and, more importantly, through analyst turnover.

Why drag-and-drop estates rot

Three forces conspire. First, visual editors hide diffs. Two playbooks that look identical on screen can differ in a hidden retry count or a stringly-typed parameter, and you only find out at 3 a.m. when one of them silently drops an action. Second, visual editors discourage abstraction. Common logic ("fetch the asset record", "look up the user's manager", "post to the IM channel for the data owner") gets copy-pasted into every playbook, and now you have twelve copies to maintain. Third, visual editors have no natural place for unit tests, so the test discipline never starts.

The audit reality. An auditor asking "show me the change history for the playbook that revoked the partner's session on 14 February" is not asking for a screenshot. They are asking for a commit, an approver, a test record and a rollback plan. If your SOAR estate cannot produce those four artefacts on demand, you have a finding waiting to happen.

The four artefacts an auditor wants

Reduce the audit conversation to four questions and the right operating model becomes obvious:

Provenance. Who wrote this playbook, when, and against which version of the runtime?
Review. Who else reviewed it, against which acceptance criteria?
Verification. What tests confirm it behaves as documented, and when did they last run?
Reversibility. What is the rollback plan, and is the playbook idempotent so a partial failure does not double-execute?

None of those are easy to answer from a visual editor. All of them are routine in a code-first workflow.
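The four questions can be turned into a completeness check on each change record. The sketch below is illustrative only: `ChangeEvidence` and `missing_artefacts` are names we made up, and the fields simply mirror the commit, approver, test record and rollback plan described above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ChangeEvidence:
    """Hypothetical evidence bundle for one playbook change."""
    commit: Optional[str] = None         # provenance
    approver: Optional[str] = None       # review
    test_record: Optional[str] = None    # verification
    rollback_plan: Optional[str] = None  # reversibility


def missing_artefacts(evidence: ChangeEvidence) -> list[str]:
    """List the artefacts an auditor would find absent for this change."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]
```

A change that reports anything other than an empty list is a finding waiting to happen, so a check like this could sit in the merge pipeline.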

The patterns that work

Across the SOCs we have helped untangle, six patterns produce playbook estates that survive audits and analyst turnover. Adopt them in roughly this order; each builds on the previous.

1. Playbooks-as-code, in a real repository

Move every playbook into a version-controlled repository. The visual editor becomes a renderer, not an author. Every change goes through a pull request with at least one reviewer who did not write the change. The repository is the source of truth — if it is not in the repo, it cannot run in production. This single move closes more than half of the typical audit findings on its own.

# Repository layout we have seen work
playbooks/
  phishing/
    triage.yaml          # the playbook
    triage.test.yaml     # input fixtures and expected outcomes
    README.md            # rationale, blast radius, owner
  identity/
    revoke-session.yaml
    revoke-session.test.yaml
    README.md
lib/
  asset-lookup.yaml      # shared reusable steps
  notify-data-owner.yaml
.changelog/
  2026-Q1.md             # human-readable summary of merged PRs
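One way to keep this layout honest is a CI check that fails when a playbook lacks its test file or README. The sketch below assumes the naming convention shown above (`name.yaml` beside `name.test.yaml`, one `README.md` per directory); `find_layout_violations` is a hypothetical helper, not part of any SOAR product.

```python
from pathlib import Path


def find_layout_violations(root: Path) -> list[str]:
    """Return one message per playbook missing its test file or README."""
    violations = []
    for playbook in sorted(root.glob("playbooks/**/*.yaml")):
        if playbook.name.endswith(".test.yaml"):
            continue  # fixtures are checked via their playbook, not on their own
        rel = playbook.relative_to(root).as_posix()
        test_file = playbook.with_name(playbook.stem + ".test.yaml")
        if not test_file.exists():
            violations.append(f"{rel}: missing {test_file.name}")
        if not (playbook.parent / "README.md").exists():
            violations.append(f"{rel}: missing README.md")
    return violations
```

Running this on every pull request and failing the build on a non-empty result keeps untested playbooks out of the main branch.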

2. Blast-radius gating, expressed as policy

Every destructive action — quarantine, revoke, isolate, block — should carry a declared blast radius and pass a policy check before executing. The policy lives in the repo, the playbook references it, and the runtime refuses to execute if the action's scope exceeds what the policy allows. The point is not to prevent action; it is to make scope-creep visible. An analyst who needs to quarantine 500 mailboxes when the policy allows 50 has to make a deliberate, logged decision to override.
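A minimal sketch of such a gate, assuming the policy reduces to a per-action limit on affected principals. All names here (`BlastRadiusPolicy`, `Override`, `check_blast_radius`) are hypothetical; a real runtime would load the policy from the repo and write to its own audit store rather than a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlastRadiusPolicy:
    action: str
    max_affected_principals: int


@dataclass(frozen=True)
class Override:
    approver: str
    reason: str


class ScopeExceeded(Exception):
    """Raised when an action targets more principals than policy allows."""


def check_blast_radius(
    policy: BlastRadiusPolicy,
    targets: list[str],
    audit_log: list[dict],
    override: Optional[Override] = None,
) -> None:
    """Refuse over-scope actions unless a logged override is supplied."""
    affected = len(set(targets))
    if affected <= policy.max_affected_principals:
        return
    if override is None:
        raise ScopeExceeded(
            f"{policy.action}: {affected} targets exceeds limit of "
            f"{policy.max_affected_principals}"
        )
    # The override does not bypass the check silently; it leaves a record.
    audit_log.append({
        "event": "blast_radius_override",
        "action": policy.action,
        "affected": affected,
        "limit": policy.max_affected_principals,
        "approver": override.approver,
        "reason": override.reason,
    })
```

The design choice worth noting is that the override path is the only one that writes to the log: in-scope actions pass quietly, and scope-creep is what becomes visible.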

3. Approval ladders, scoped to impact

Tier playbook actions by their potential impact and define an approval ladder. A read-only enrichment runs unattended. A reversible action — say, dropping a session — requires a single L1 approval. A potentially-customer-impacting action — say, blocking a payment domain — requires a senior approval. A potentially-regulator-reportable action — say, isolating a production database — requires named on-call leadership. The ladder is data, not code, and it is the first thing an auditor will ask to see.
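"Data, not code" can be as simple as a lookup table that the runtime consults before acting. The tier and role names below are placeholders we chose to mirror the four rungs just described; your own ladder will use different labels.

```python
from typing import Optional

# Hypothetical tier -> required approver role. None means "runs unattended".
APPROVAL_LADDER: dict[str, Optional[str]] = {
    "read-only": None,
    "reversible": "analyst-l1",
    "customer-impacting": "senior-analyst",
    "regulator-reportable": "on-call-leadership",
}


def may_execute(tier: str, approver_roles: set[str]) -> bool:
    """Return True when the approvals collected satisfy the tier's requirement."""
    if tier not in APPROVAL_LADDER:
        return False  # fail closed: an unknown tier never runs
    required = APPROVAL_LADDER[tier]
    return required is None or required in approver_roles
```

Because the ladder is a plain mapping, it can be printed, diffed and handed to an auditor without anyone reading the runtime's source.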

4. Idempotency by default

Treat every external action as something the runtime might retry. Quarantining the same mailbox twice should be a no-op. Creating a ticket should look up by an idempotency key before insert. Posting to a channel should de-duplicate by incident-id. Idempotency turns partial failures from incidents into ignorable noise, and removes the temptation to suppress retries during outages.
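The core of the idea fits in a few lines. This is a sketch under a loud assumption: the in-memory dict stands in for whatever durable store a real runtime would use, and `IdempotentExecutor` is a name we invented for illustration.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each action at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: a real runtime would persist this, not hold it in memory.
        self._results: dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # retry: return the recorded result
        result = action()              # if this raises, nothing is recorded,
        self._results[key] = result    # so a later retry attempts the action again
        return result
```

A retry with the same key returns the earlier result without touching the external system, which is what makes it safe to leave retries switched on during an outage.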

5. Contract tests at the integration boundary

Every connector (to the email gateway, the identity provider, the ticketing system) should have a contract test that runs nightly against a sandbox. The contract test asserts that the request and response shapes the playbook depends on have not drifted. When the vendor silently renames a field in a quarterly release, the test catches it before the playbook trips over it in production. We have seen one large SOC reduce its weekend incidents by 60% on the back of contract tests alone.
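A contract check can be sketched as a comparison between a sandbox response and the fields the playbook relies on. The field names in `SESSION_CONTRACT` are invented for illustration and do not describe any real identity provider's API.

```python
def contract_violations(response: dict, contract: dict[str, type]) -> list[str]:
    """Compare a sandbox response against the fields and types a playbook uses."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems


# Hypothetical contract for an identity-provider session lookup.
SESSION_CONTRACT = {"session_id": str, "principal_id": str, "active": bool}
```

The check deliberately ignores extra fields in the response: a vendor adding data is not drift, whereas a renamed or retyped field the playbook reads is.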

6. Replay and dry-run as first-class

Every playbook should be runnable in three modes: live, dry-run (no side effects, full logging), and replay (against a recorded incident). Analysts use dry-run to debug; engineers use replay to validate changes against historical incidents before merging. Both modes produce the same audit log lines as live, marked with the mode flag, so the auditor sees exactly which executions actually moved bits.
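The sketch below shows one way the mode flag could work: every mode writes the same log line, and only live mode performs the action. `Mode` and `run_step` are hypothetical names, and a real replay mode would also feed recorded incident data into the step, which is omitted here.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    LIVE = "live"
    DRY_RUN = "dry-run"
    REPLAY = "replay"


def run_step(
    step_id: str,
    action: Callable[[], None],
    mode: Mode,
    audit_log: list[dict],
) -> None:
    """Log every step identically; perform side effects only in live mode."""
    executed = mode is Mode.LIVE
    if executed:
        action()
    audit_log.append(
        {"step": step_id, "mode": mode.value, "side_effects": executed}
    )
```

Filtering the log on `side_effects` then gives the auditor the list of executions that changed something.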

What the playbook itself looks like

The on-disk format is less important than the discipline, but here is a stripped-down example of the shape we use. It carries declared blast radius, an approval ladder, and explicit idempotency keys. The runtime refuses to execute if any of these are missing.

id: revoke-suspicious-session
version: 1.4.0
owner: identity-soc@example.com
blast_radius:
  scope: single-identity
  max_affected_principals: 1
  reversibility: reversible-within-15m
approval_ladder:
  - tier: read
    role: analyst-l1
  - tier: act
    role: analyst-l2
    required_for: [revoke_session]
inputs:
  incident_id: { type: string, required: true }
  principal:   { type: identity, required: true }
steps:
  - id: enrich
    use: lib/identity-context
    with: { principal: $.principal }
  - id: revoke_session
    when: $.enrich.risk_score > 80
    action: idp.revoke_active_sessions
    idempotency_key: "revoke:$.incident_id:$.principal.id"
    rollback: idp.restore_session_window
  - id: notify
    action: comms.post_channel
    # step-level key, consistent with revoke_session
    idempotency_key: "notify:$.incident_id"
    with:
      channel: "#soc-actions"
      template: revoke-session.md
tests:
  - file: revoke-session.test.yaml

Notice what is not in the file: vendor-specific API calls, secrets, environment URLs. Those live in the runtime, behind named connectors. The playbook stays portable, which is the next nice thing about code-first: you can move it between environments without rewriting it.
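The "refuses to execute" rule can be prototyped as a lint pass over the parsed playbook. The sketch assumes the file has already been loaded into a dict by a YAML parser of your choice, and that idempotency keys sit at step level as in the `revoke_session` step; `lint_playbook` is our name, not a runtime API.

```python
def lint_playbook(playbook: dict) -> list[str]:
    """Return reasons to refuse this playbook; an empty list means it passes."""
    errors = []
    for required in ("blast_radius", "approval_ladder"):
        if not playbook.get(required):
            errors.append(f"missing {required}")
    for step in playbook.get("steps", []):
        # Steps that call an external action must declare an idempotency key.
        # Steps that only reuse a library step (`use:`) are skipped in this sketch.
        if "action" in step and "idempotency_key" not in step:
            errors.append(f"step {step.get('id', '?')}: missing idempotency_key")
    return errors
```

Running the same function in CI and at load time in the runtime means a playbook that fails review can never reach production by another route.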

Mapping the patterns to audit clauses

The patterns above are not just good engineering; they map cleanly onto the clauses auditors actually check. The table below is what we hand to compliance teams when they ask how SOAR feeds into the certification scope.

Pattern	DPDP touchpoint	ISO 27001 / SOC 2 touchpoint
Playbooks-as-code	Demonstrates documented processes for data-fiduciary actions	A.5.37 documented operating procedures; CC8.1 change management
Blast-radius gating	Limits the scope of automated decisions affecting principals	A.5.15 access control; CC6.3 logical access
Approval ladders	Records the human-in-the-loop for high-impact actions	A.5.18 access rights; CC6.2 access management
Idempotency	Avoids unintended secondary effects on principal data	A.8.32 change management; CC8.1
Contract tests	Ongoing verification of integrations handling personal data	A.8.29 security testing in development and acceptance; CC7.1 monitoring
Replay / dry-run	Demonstrable rehearsal of breach-response procedures	A.5.24 incident planning; CC7.4 incident response

"We stopped counting playbooks and started counting commits. The day my team started shipping a weekly changelog instead of a list of new playbooks, our audit cycle dropped from eight weeks to three." — SOC engineering manager, regulated healthcare network.

The migration path

You will not move an estate of seventy visual playbooks to code-first in a sprint. The path we have seen succeed:

Week 1-2: Inventory. Tag every existing playbook with owner, last-modified, last-fired, and a colour status (for example red, amber or green). Most estates have a long tail of dead playbooks that should simply be retired before migration.
Week 3-6: Build the repo skeleton, the linter, the test harness, and the CI pipeline. Migrate three playbooks end-to-end as proofs.
Week 7-12: Migrate the top-fired 20% of playbooks. These cover 80% of incident volume; finishing them gives the biggest visible win.
Quarter 2: Migrate the long tail or retire it. Add contract tests and replay coverage.
Quarter 3: The visual editor becomes a renderer for the on-disk content. New playbooks always start in the repo.
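The inventory from the first two weeks can drive the ordering directly. The sketch below makes two assumptions of ours: a playbook that has not fired in 180 days is a retirement candidate, and the 20% is taken from what survives retirement. The record fields (`name`, `last_fired`, `fire_count`) are also hypothetical.

```python
import math
from datetime import date, timedelta


def plan_migration(
    inventory: list[dict],
    today: date,
    stale_after_days: int = 180,  # assumed threshold, tune to your estate
    top_fraction: float = 0.2,
) -> dict[str, list[str]]:
    """Split an inventory into retire / migrate-first / long-tail buckets."""
    cutoff = today - timedelta(days=stale_after_days)
    retire = [
        p["name"] for p in inventory
        if p["last_fired"] is None or p["last_fired"] < cutoff
    ]
    live = sorted(
        (p for p in inventory if p["name"] not in retire),
        key=lambda p: p["fire_count"],
        reverse=True,
    )
    first_n = math.ceil(len(live) * top_fraction)
    return {
        "retire": retire,
        "migrate_first": [p["name"] for p in live[:first_n]],
        "long_tail": [p["name"] for p in live[first_n:]],
    }
```

Retirement is decided before ranking, so a playbook that fired heavily two years ago does not take a slot in the first migration wave.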

Key takeaways

The SOAR problem is not the product, it is the operating model: drag-and-drop hides diffs, discourages abstraction, and has nowhere to put tests.
Auditors want four artefacts: provenance, review, verification, reversibility. Code-first produces all four as a side effect.
Bake blast-radius declarations and approval ladders into the playbook file itself so the runtime can refuse to execute mis-scoped actions.
Treat every external action as retryable. Idempotency keys turn most outages into noise.
Contract tests at integration boundaries catch vendor drift before it becomes a 3 a.m. incident.
Migrate in priority order — top-fired 20% first — and retire the dead long tail before you migrate it.

For more on the underlying engineering discipline, see our whitepaper on closed-loop detection engineering. The same versioning, testing, and rollback discipline applies to detection content; treating playbooks and detections as the same engineering surface compounds the win.