Agent-Ready Infrastructure

Four patterns for running AI agents in production. Agents are moving out of chat windows and into IT operations: triaging incidents, provisioning environments, resizing capacity. The infrastructure underneath still assumes a human reads the runbook, files the ticket, and clicks approve. This article describes the patterns that close that gap, and is explicit about where the approach fails.

Services

Platform engineering, AI agent integration, Infrastructure automation

Industry

Enterprise IT, IT operations

The scaling gap

The short version: agent-ready infrastructure exposes repeatable operations as governed APIs, maintains trustworthy operational data, bounds agent autonomy with policy, and tracks every deployed agent through a registry. The rest of this article explains why each part exists.

Almost every IT organization is experimenting with AI agents. Very few are running them at scale. McKinsey, citing its 2025 State of AI survey, reports that 62 percent of organizations are piloting agents, yet in any given business function no more than 10 percent report scaling them. At the same time, infrastructure costs are projected to grow two- to threefold by 2030 while budgets stay flat (McKinsey, "Reimagining tech infrastructure for (and with) agentic AI").

The gap between piloting and scaling is rarely a model problem. The models are good enough for a large class of operational work. The constraint is the infrastructure underneath: ticket queues designed for human throughput, configuration data nobody fully trusts, dependencies that live in the heads of two senior engineers, and permission models that assume a person is clicking the button.

An agent dropped into that environment does not fix it. It inherits it. A pilot can succeed in a sandboxed corner of the estate and still be impossible to scale, because scaling requires the estate itself to be legible to machines.

The first mistake is to treat this as an LLM rollout. It is not. It is an infrastructure readiness problem. The model can suggest an action, but the platform decides whether that action exists, whether it is safe, whether the data behind it is trustworthy, and whether the outcome can be audited.

We have been here before

This is not the first time infrastructure had to be redesigned for a new kind of client. Cloud adoption forced teams to stop treating servers as named pets. Infrastructure as code forced environment definitions out of wiki pages and into version control. GitOps made the repository the source of truth and turned the running system into something derived from it.

Each shift followed the same logic: take knowledge that lived in people and procedures, and encode it in a form that software can execute and verify. Martin Fowler's infrastructure-as-code definition captures it: manage environments through executable, tested, version-controlled definitions rather than manual steps.

Agentic AI is the next client in this sequence, and it is the most demanding one. A CI pipeline executes a fixed script. An agent makes decisions: it chooses which action to take based on what it can observe. That raises the bar on how actions are exposed, how trustworthy the observable data is, and how authority is bounded. The four patterns below address those three problems, plus the operational question of managing the agents themselves.

Four patterns

1. Runbooks as Governed APIs

Every repeatable infrastructure operation is exposed as an API with policy checks embedded in the call path. A warning in the runbook enforces nothing.

If restarting a service, rotating a credential, or resizing a node pool requires a human to follow a runbook, an agent cannot do it safely. The operation must exist as code: parameterized, idempotent where possible, and wrapped in checks that validate preconditions before execution. Use this pattern first for the operations your team performs most often through tickets. The work is mostly conventional platform engineering, which is why teams that invested in internal developer platforms are ahead here without having planned for agents at all.
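As a minimal sketch of what "the operation exists as code" could look like, the example below wraps a node pool resize in precondition checks and makes a repeated request a no-op. Everything named here (`NodePool`, `ResizeAction`, the 20 percent bound, the staging-only rule) is a hypothetical stand-in for illustration, not a real platform API.

```python
from dataclasses import dataclass, field


class PreconditionFailed(Exception):
    """Raised when a policy check in the call path rejects the request."""


@dataclass
class NodePool:
    # Hypothetical in-memory stand-in for a real cluster API.
    name: str
    environment: str
    size: int


@dataclass
class ResizeAction:
    """A runbook step expressed as code: parameterized, checked, idempotent."""
    max_delta_ratio: float = 0.20                       # assumed policy bound
    allowed_environments: frozenset = frozenset({"staging"})
    audit_log: list = field(default_factory=list)

    def execute(self, pool: NodePool, target_size: int, caller: str) -> str:
        # Preconditions run inside the call path, so they cannot be skipped.
        if pool.environment not in self.allowed_environments:
            raise PreconditionFailed(f"{pool.environment} is not permitted")
        if target_size < 1:
            raise PreconditionFailed("target size must be at least 1")
        if abs(target_size - pool.size) > pool.size * self.max_delta_ratio:
            raise PreconditionFailed("requested change exceeds max delta")

        # Idempotency: repeating the same request is a no-op, not an error.
        if target_size == pool.size:
            outcome = "unchanged"
        else:
            pool.size = target_size
            outcome = "resized"

        self.audit_log.append((caller, pool.name, target_size, outcome))
        return outcome
```

The design choice worth noting is that the caller, human or agent, never sees an unchecked path: a rejected request raises before any state changes, and only accepted requests reach the audit log.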

2. Operational Ground Truth

One queryable, machine-readable source of truth for assets, dependencies, ownership, and recent changes, accurate enough that an agent acting on it does not cause harm.

Agents reason over what they can read: CMDB records, service catalogs, dependency graphs, change histories. Where that data is wrong, the agent is confidently wrong. The discipline that prevents this is the one Martin Kleppmann lays out in his writing on logs and derived data: keep a clear source of truth, derive the views that systems consume from it, and keep the record of changes replayable. Applied to operations, it means an agent queries a maintained view of the estate instead of scraping dashboards and tribal knowledge. The practical starting point is narrow. Pick the domain you want to automate and make its records trustworthy: consistent naming, explicit ownership, a defined source of truth for each fact. You do not need perfect data across the whole estate, and waiting for it is a common way to stall. You do need consistency inside the domain you automate.
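The source-of-truth and derived-view idea can be illustrated with a small sketch: an append-only change log is the record, and the view an agent queries is rebuilt by replaying it. The event shape and the asset names below are assumptions made for the example, not a schema from any particular CMDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    # Hypothetical event shape; a real estate would carry more fields.
    sequence: int
    asset: str
    attribute: str      # e.g. "owner" or "depends_on"
    value: str


def derive_view(events, up_to=None):
    """Rebuild the queryable view by replaying the change log in order."""
    view = {}
    for event in sorted(events, key=lambda e: e.sequence):
        if up_to is not None and event.sequence > up_to:
            break
        view.setdefault(event.asset, {})[event.attribute] = event.value
    return view


CHANGE_LOG = [
    ChangeEvent(1, "checkout-api", "owner", "payments-team"),
    ChangeEvent(2, "checkout-api", "depends_on", "orders-db"),
    ChangeEvent(3, "checkout-api", "owner", "commerce-team"),
]
```

Because the view is derived, `derive_view(CHANGE_LOG, up_to=2)` answers "who owned this before the last change?" without a separate history table, which is the kind of question an agent investigating an incident tends to ask.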

3. Governed Autonomy

Each agent operates inside an explicit permission envelope: which actions it may take autonomously, which require approval, and who is accountable for it.

This is the pattern that separates production-grade agent deployments from demos. Every agent has an identity, a named owner, and a declared scope. Every action is logged and attributable. Low-risk actions execute autonomously; high-impact ones, such as production rollbacks or customer-facing changes, wait for human approval. The envelope is declared as policy, versioned like any other code:

agent: capacity-rightsizer
owner: platform-team
identity: svc-agent-rightsizer        # auditable, distinct from human users
permissions:
  - action: resize-node-pool
    autonomy: autonomous
    bounds: { max_delta: "20%", environments: [staging, prod] }
    window: outside-business-hours
  - action: delete-volume
    autonomy: requires-approval
    approvers: [sre-oncall]
audit: all actions logged with input context and decision trace
escalation: page sre-oncall on repeated failure or bound violation

The important part is not the syntax. The important part is that autonomy becomes reviewable.

The example above is close to what we actually run: our rightsizing agent acts only outside business hours, within declared bounds, and pages on-call when it hits one. The time window is part of the envelope, because off-hours is when impact is lowest and also when nobody is watching, so the bounds do the supervising. At the other end of the risk scale, our solution architects run log triage the same way: an agent correlates errors and opens tickets autonomously, because the worst a wrong ticket can do is clutter a queue.
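To show how an envelope like this might be enforced at call time, here is a hedged sketch of a policy evaluator. The envelope is rewritten as a plain Python dict to avoid assuming a YAML parser, and the `Decision` values, the `evaluate` function, and the choice to route a bound violation to a human rather than reject it outright are all assumptions of this sketch.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs-approval"
    DENY = "deny"


# The envelope from the policy above, written as a plain dict.
# Percentages are converted to ratios.
ENVELOPE = {
    "resize-node-pool": {
        "autonomy": "autonomous",
        "max_delta": 0.20,
        "environments": {"staging", "prod"},
    },
    "delete-volume": {
        "autonomy": "requires-approval",
        "approvers": ["sre-oncall"],
    },
}


def evaluate(action, environment=None, delta=0.0, in_window=True):
    """Map a requested action to a decision; unknown actions are denied."""
    rule = ENVELOPE.get(action)
    if rule is None:
        return Decision.DENY
    if rule["autonomy"] == "requires-approval":
        return Decision.NEEDS_APPROVAL
    # Autonomous actions must still sit inside every declared bound.
    # Sending a bound violation to a human is an assumption of this sketch.
    within_bounds = (
        environment in rule["environments"]
        and abs(delta) <= rule["max_delta"]
        and in_window
    )
    return Decision.EXECUTE if within_bounds else Decision.NEEDS_APPROVAL
```

Denying anything not listed is the deliberate default here: an agent gains a capability only when someone adds it to the reviewed policy.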

4. The Agent Registry

A formal inventory of every deployed agent covering purpose, scope, owner, performance, and cost, with a lifecycle so redundant agents get retired.

Agents proliferate the way cron jobs and Excel macros did, and ungoverned proliferation is how organizations end up with a second shadow estate. The registry makes consumption visible too: inference cost is granular and nonlinear, and an agent that polls an LLM in a tight retry loop can produce the kind of bill that used to require a misconfigured autoscaler. Use this pattern from the second agent onward. One agent does not need a registry; ten unregistered ones are already a problem.
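A registry does not need to be elaborate to be useful. The sketch below, with illustrative field names and an assumed 30-day idle threshold, records owner, scope, and spend per agent and answers the two questions the pattern exists for: which agents are over budget, and which look ready to retire.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AgentRecord:
    # Field names are illustrative, mirroring the inventory described above.
    name: str
    purpose: str
    owner: str
    scope: tuple
    monthly_budget_usd: float
    spend_usd: float = 0.0
    last_active: Optional[date] = None
    retired: bool = False


class AgentRegistry:
    def __init__(self):
        self._agents = {}

    def register(self, record):
        if record.name in self._agents:
            raise ValueError(f"{record.name} is already registered")
        self._agents[record.name] = record

    def record_spend(self, name, amount_usd, on):
        agent = self._agents[name]
        agent.spend_usd += amount_usd
        agent.last_active = on

    def over_budget(self):
        return [a.name for a in self._agents.values()
                if not a.retired and a.spend_usd > a.monthly_budget_usd]

    def retirement_candidates(self, today, idle_days=30):
        """Agents with no recorded activity for idle_days or longer."""
        return [a.name for a in self._agents.values()
                if not a.retired
                and (a.last_active is None
                     or (today - a.last_active).days >= idle_days)]
```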

What the patterns add up to

Together, the patterns create a simple separation: agents decide, APIs execute, policies constrain, and operational data explains the world. That separation matters. Without it, an agent becomes just another automation script with a larger blast radius.

[Diagram: AI agents read context from operational ground truth and execute actions through governed APIs; governed autonomy gates execution, and an agent registry manages owner, scope, cost, and lifecycle.]

None of it requires replacing the platforms you already run. ServiceNow, your observability stack, and your cloud tooling become the endpoints these patterns govern. The closest established pattern is the anti-corruption layer from Eric Evans' Domain-Driven Design: a mediation layer that lets a new model work with legacy systems without absorbing their quirks, while those systems keep running unchanged. The four patterns together form an orchestration layer that plays that role for agents. They reason in terms of governed actions and trustworthy state, and the layer translates that into whatever each platform underneath actually speaks. The layer is permanent, and it can be built one workflow at a time.

| Pattern | What it does | Start with |
| --- | --- | --- |
| Runbooks as Governed APIs | Repeatable operations become APIs with policy checks in the call path | The operations your team performs most often through tickets |
| Operational Ground Truth | One trustworthy, queryable record of assets, dependencies, ownership, and changes | The domain you plan to automate first |
| Governed Autonomy | Each agent acts inside a declared permission envelope with audit and approvals | Low-risk actions, bounded by time window and blast radius |
| The Agent Registry | Inventory of every agent: owner, scope, performance, cost, lifecycle | The second agent you deploy |

What we would build first

We would not start with a general-purpose operations agent.

We would start with one boring, high-volume workflow: access provisioning, password reset, service restart, capacity resize, or known-error triage.

Then we would make three things production-grade: the action API, the data source behind the decision, and the rollback path. Only after that would we let an agent choose the action.

This is slower than a demo. It is also the difference between automation that impresses in a meeting and automation that survives production.

A concrete workflow: SEV1 triage

Consider how the patterns combine when a severity-1 alarm fires.

An incident-manager agent opens a structured triage workflow. Domain agents for network, infrastructure, application, and change history investigate in parallel, each querying its own tools and correlating logs, configuration records, and recent changes. None of this works unless Operational Ground Truth holds. Each domain agent forms and tests hypotheses inside its declared scope. An orchestrator agent synthesizes the findings into a probable root cause and a remediation plan with ordered steps and validation checks.
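The fan-out and synthesis step can be sketched in a few lines. This is only a shape, under stated assumptions: `investigate` stands in for a real domain agent, the evidence records and their `score` field are invented for the example, and "pick the highest-scoring finding" is a placeholder for whatever synthesis an orchestrator actually performs.

```python
import asyncio

DOMAINS = ("network", "infrastructure", "application", "change")


async def investigate(domain, evidence):
    # Stand-in for a domain agent; a real one would query tools and logs.
    await asyncio.sleep(0)
    hits = [e for e in evidence if e["domain"] == domain]
    return {"domain": domain, "findings": hits}


def synthesize(reports):
    """Pick the finding with the highest score as the probable root cause."""
    findings = [f for r in reports for f in r["findings"]]
    if not findings:
        return None
    return max(findings, key=lambda f: f["score"])


async def triage(evidence, domains=DOMAINS):
    # Each domain agent runs concurrently and stays inside its own scope.
    reports = await asyncio.gather(*(investigate(d, evidence) for d in domains))
    return synthesize(reports)
```

Calling `asyncio.run(triage(evidence))` with a list of evidence dicts returns the single finding the orchestrator would put forward, or `None` when no domain reports anything.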

[Diagram: SEV1 triage workflow. An alarm triggers an incident manager agent; four domain agents investigate in parallel within declared scopes; an orchestrator agent synthesizes a remediation plan; low-risk actions execute automatically while high-risk actions wait for human approval; every step lands in an audit log with MTTR reporting.]

Then Governed Autonomy decides what happens next. Restarting a stateless service inside declared bounds: executed autonomously through a governed API. A production rollback or customer communication: queued for human approval with full context attached. Every step lands in the audit log, and the incident manager produces the timeline and MTTR reporting as a side effect rather than as after-the-fact archaeology.

The engineer's role changes shape: less time gathering context across five dashboards, more time judging the one decision that actually needs judgment. McKinsey's deployment experience puts the automation potential at 60 to 80 percent of routine infrastructure work over time, with 20 to 40 percent run-rate cost reduction in initial deployments. In a recent enterprise AI automation project, we saw the same pattern at smaller scale: the hard part was not the model. It was making the knowledge base and historical records reliable enough to retrieve from, which is Operational Ground Truth under another name.

Trade-offs and failure modes

This approach has real costs and clear boundaries. Ignoring them is how pilots become incidents.

Bad data makes agents worse than humans

A human operator distrusts a stale CMDB entry from experience. An agent does not, unless you have explicitly engineered that distrust. The dangerous version is not the agent that fails. It is the agent that succeeds against the wrong data: executing wrong decisions faster, and with more confidence, than a person would.
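One way to engineer that distrust is to make freshness an explicit precondition. The sketch below assumes each record carries a `verified_at` timestamp and that seven days is an acceptable age; both are illustrative choices, as are the names `ConfigRecord`, `is_trustworthy`, and `plan_action`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConfigRecord:
    asset: str
    owner: str
    verified_at: datetime   # last time a trusted source confirmed the record


def is_trustworthy(record, now, max_age=timedelta(days=7)):
    """Treat a record as usable only if it was verified recently enough."""
    return now - record.verified_at <= max_age


def plan_action(record, now):
    # Refusing to act is the safe default when the data cannot be trusted.
    if not is_trustworthy(record, now):
        return f"escalate: {record.asset} record is stale, ask {record.owner}"
    return f"proceed: {record.asset}"
```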

Non-determinism collides with change management

Agents are probabilistic. The same incident can produce different remediation plans on different days. Deterministic validation layers are what make probabilistic planning compatible with production change control: preconditions, postcondition checks, and a bounded blast radius. They are the load-bearing wall of the whole approach, and they cost about as much engineering effort as the agents themselves.
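A minimal sketch of such a validation layer, assuming a plan arrives as a list of steps: the plan is checked against an allowlist and a blast-radius limit before anything runs, and a failed postcondition rolls back the completed steps in reverse order. The `Step` shape and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    targets: int                        # how many hosts the step touches
    apply: Callable[[], None]
    rollback: Callable[[], None]
    postcondition: Callable[[], bool]


def validate_plan(steps, allowed, max_targets):
    """Return a list of violations; an empty list means the plan may run."""
    violations = []
    for step in steps:
        if step.name not in allowed:
            violations.append(f"{step.name}: not an approved action")
        if step.targets > max_targets:
            violations.append(f"{step.name}: blast radius exceeds {max_targets}")
    return violations


def run_plan(steps, allowed, max_targets):
    violations = validate_plan(steps, allowed, max_targets)
    if violations:
        return ("rejected", violations)
    done = []
    for step in steps:
        step.apply()
        done.append(step)
        if not step.postcondition():
            # Undo in reverse order so the system returns to its prior state.
            for finished in reversed(done):
                finished.rollback()
            return ("rolled-back", [step.name])
    return ("succeeded", [s.name for s in steps])
```

The planner can vary from day to day; the validator gives the same answer for the same plan, which is the property change control depends on.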

Inference costs are a new failure domain

Token spend scales with incident volume and agent chattiness rather than with provisioned capacity, so it evades classic capacity planning. Without per-agent cost visibility (the registry again), savings on labor can quietly leak back out as API spend.
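Visibility after the fact is one control; a hard cap inside the retry loop is another. The following sketch charges an assumed per-call token estimate against a budget before every attempt, so a failing call stops when the allowance runs out instead of retrying indefinitely. `TokenBudget`, `call_with_retries`, and the use of `RuntimeError` as the transient failure are illustrative assumptions.

```python
class BudgetExceeded(Exception):
    """Raised when an agent would spend past its token allowance."""


class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, tokens):
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"{self.used + tokens} > {self.limit} tokens")
        self.used += tokens


def call_with_retries(call, budget, est_tokens, max_attempts=3):
    """Retry a failing call, charging the budget before every attempt."""
    last_error = None
    for _ in range(max_attempts):
        budget.charge(est_tokens)       # may raise BudgetExceeded
        try:
            return call()
        except RuntimeError as error:   # stand-in for a transient failure
            last_error = error
    raise last_error
```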

When not to do this

A small operations team with modest ticket volume will not recover the investment. The patterns pay off where volume is high and workflows are repeatable, which is why service desks, typically 20 to 30 percent of infrastructure labor spend, are usually the right first target. Likewise, if your internal APIs are unstable or your processes are genuinely broken, agents will automate the mess. Process redesign comes first, because an agent will faithfully execute whatever process it is given, broken or fixed.

The operating model is the slow part

The technical patterns can be built in months. Shifting engineers from resolving tickets to supervising autonomous execution, renegotiating vendor contracts so productivity gains become financial outcomes rather than absorbed slack, and evolving SRE toward designing and constraining agents will take quarters. Plan for it explicitly or the technology outruns the organization.

Where this leads

Infrastructure has spent decades as a support function: it existed so that applications, and the people operating them, could do the real work. Running agents in production changes its position in the system. The orchestration layer of governed actions, trustworthy state, and bounded autonomy becomes the platform through which operational work is executed, and the humans move up a level to supervise, design, and improve it.

Bolting agents onto unprepared infrastructure produces a few automated workflows and stops there; the survey data showing that no more than 10 percent of organizations report scaling agents in any given function is what that looks like in aggregate. Building the four foundations is slower to show results and compounds for years, the same way infrastructure as code did. The organizations that did IaC properly a decade ago are the ones finding this transition cheap. The pattern repeats: the best time to make your estate legible to machines was then. The second-best time is before your first agent reaches production.

The agent is the visible part. The useful work is underneath: APIs, ownership, permissions, logs, tests, and rollback paths.

FAQ: Agent-ready infrastructure

What is agent-ready infrastructure?

Infrastructure that AI agents can operate safely: repeatable actions exposed as policy-checked APIs, reliable machine-readable operational data, explicit permission and audit models for agents, and lifecycle management for the agents themselves.

Do we need to replace our ITSM or observability platforms?

No. Existing platforms stay and become endpoints the orchestration layer mediates, in the spirit of DDD's anti-corruption layer. The work is integration, and it can happen incrementally.

Where should a team start?

Pick one high-volume, repeatable domain, typically the service desk or incident management. Redesign the workflow so routine steps are automatable within bounds, and make that domain's operational data trustworthy before deploying agents into it.

What savings are realistic?

Published deployment experience suggests 20 to 40 percent run-rate cost reduction in initial deployments and 60 to 80 percent automation of routine work over time, offset by new inference costs and the engineering investment in validation and governance layers.

Further reading

McKinsey: Reimagining tech infrastructure for (and with) agentic AI

Martin Fowler: Infrastructure as Code

Martin Fowler: Emerging Patterns in Building GenAI Products

Martin Kleppmann: Designing Data-Intensive Applications
