engineering · 5 min read

What happens when you let an AI agent loose on your production infrastructure

A developer lost 2.5 years of data in seconds after Claude Code ran a Terraform destroy on live infrastructure. Here is what actually happened, and why the same mistake is easier to make than you think.

Last week, Alexey Grigorev, the founder of DataTalks.Club, a course platform used by over 100,000 students, lost 2.5 years of production data in about thirty seconds. Homework submissions, leaderboard entries, project records, everything. The automated backups were gone too.

The tool that did it was Claude Code. And honestly, reading through his postmortem, it is hard to blame the AI.

What actually happened

Grigorev was migrating a side project, AI Shipping Labs, from GitHub Pages to AWS. To save a few dollars a month, he decided to share the existing DataTalks.Club infrastructure rather than spin up a separate environment. Claude Code itself flagged this as a bad idea. He overrode it.

The problem started when he switched computers without moving the Terraform state file, the record of which real resources Terraform currently manages. Without it, Claude created duplicate resources. When Grigorev later uploaded the correct state file, the agent treated it as the source of truth and ran terraform destroy to clean up the duplicates.

Because the state file described both sites, everything went down: the VPC, the RDS database, the ECS cluster, the load balancers, and the database snapshots he was counting on as a fallback. AWS Business Support eventually found a hidden snapshot and restored the data, but the recovery took about 24 hours and cost him a permanent upgrade to a higher AWS support tier, priced at roughly 10% of his monthly bill.

He wrote about it publicly, accepted full responsibility, and documented six specific changes he made afterward, including deletion protection at both the Terraform and AWS levels, S3 state storage with versioning, automated restore testing via Lambda, and a hard rule that Claude Code cannot run destructive commands without manual review.
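The state-storage fix is a few lines of configuration. A minimal sketch, with a hypothetical bucket name and lock table (the real names are not in the postmortem):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical; enable versioning on this bucket
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"               # placeholder region
    encrypt        = true
    dynamodb_table = "terraform-locks"         # hypothetical DynamoDB table for state locking
  }
}
```

With versioning enabled on the bucket, a bad state push can be rolled back to an earlier revision, and no single laptop ever holds the only copy of the state.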

The part worth sitting with

What gets me about this incident is that Claude actually gave the right advice before things went wrong. It warned against combining the two setups. Grigorev acknowledged this in his own postmortem. The agent was not rogue; it was obedient. It did exactly what the infrastructure description told it to do.

AI coding agents are very good at following instructions. They are not good at knowing when following instructions will destroy six figures' worth of accumulated data. They have no concept of blast radius.

A separate GitHub issue filed the same week by a different developer tells a similar story. A Claude Code agent running in a background terminal session executed drizzle-kit push --force against a production PostgreSQL database on Railway, wiped 60+ tables, and the data was unrecoverable. Railway did not have automatic backups. Months of trading data, AI research results, user records, all gone.

Two incidents in one week is not a coincidence. It is a pattern.

Why this keeps happening

The vibe coding workflow has made it genuinely easy to hand infrastructure tasks to an AI agent and walk away. That is the appeal. You describe what you want in plain language, the agent figures out the commands, and things get done faster than doing them yourself.

The standard safeguards most senior engineers apply automatically - deletion protection, remote state management, separate dev and prod accounts, backup verification - do not exist by default in an AI-assisted workflow. The AI makes it feel like they are someone else’s problem. They are not.

At WebArt Design, we treat AI agents the same way we would treat a junior developer on day one: no direct access to production. Production work stays with experienced engineers, with review and approval required for anything that could be destructive.

What Grigorev changed, and what we would add

His six post-incident safeguards are solid and worth copying directly:

  • Deletion protection enabled at the Terraform and AWS level on all production resources
  • Terraform state moved to S3 with versioning, not stored locally
  • Automated restore testing on a schedule, not just automated backups
  • S3 backup buckets requiring explicit content removal before deletion
  • Separate dev and prod AWS accounts
  • Claude Code’s automatic command execution disabled; all destructive actions require manual approval
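The first bullet maps to two settings, one enforced by Terraform and one by the AWS API. A hedged sketch with hypothetical resource names (required engine and sizing arguments omitted for brevity):

```hcl
resource "aws_db_instance" "prod" {
  identifier          = "prod-db" # hypothetical name; engine/instance arguments omitted
  deletion_protection = true      # AWS-level guard: the DeleteDBInstance API call is refused

  lifecycle {
    prevent_destroy = true        # Terraform-level guard: any plan that would destroy this errors out
  }
}
```

The two guards fail independently, which is the point: prevent_destroy only helps while the resource is still in the configuration Terraform sees, whereas deletion_protection holds even when the state file is wrong.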

Review any destructive command before it runs, regardless of whether you think you know what it will do. terraform destroy on a partial state file and terraform destroy on a full state file look identical in a chat window.

Also, test your backups. Not once when you set them up. On a schedule. Grigorev’s backups were running and the events showed up in the AWS console, but when he clicked through to the actual snapshots, they were inaccessible. He found that out at 11 PM when everything was already gone.
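Scheduling that restore test is itself infrastructure. A sketch of the EventBridge wiring, assuming a hypothetical restore-test Lambda (the function body, which would restore the latest snapshot and query the result, is not shown):

```hcl
resource "aws_cloudwatch_event_rule" "restore_test" {
  name                = "weekly-restore-test"
  schedule_expression = "rate(7 days)"
}

resource "aws_cloudwatch_event_target" "restore_test" {
  rule = aws_cloudwatch_event_rule.restore_test.name
  arn  = aws_lambda_function.restore_test.arn # hypothetical Lambda: restores a snapshot
                                              # and verifies it actually contains data
}

resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restore_test.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.restore_test.arn
}
```

A restore test that runs weekly and alerts on failure would have surfaced the inaccessible snapshots long before 11 PM on the night they were needed.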

Where this leaves AI tooling

None of this means Claude Code or agentic tools are a bad idea. They are genuinely fast for infrastructure work. But speed is not useful if you do not understand what the commands are doing.

Both incidents share the same root cause: the developer trusted the agent to understand context that only a human with full situational awareness could have. That is not a fair expectation to have of any tool.

Nobody has fully figured out the right guardrails for agentic coding in production yet. The tooling is moving faster than the conventions around it. Grigorev’s postmortem is one of the more honest accounts I have read on the subject. His summary does not blame the tool and does not pretend the workflow was fine except for one unlucky mistake. He simply wrote that he was “overly reliant on the AI agent to run Terraform commands.”

It’s one sentence and hard to argue with.
