ai and automation · 14 min read
Expert in the loop: rethinking AI oversight
Human in the loop made sense when AI errors were obvious. The failures we see now look like working software, and only an expert can tell the difference.

Something has shifted in the last twelve months, and the language we use to describe AI oversight hasn’t caught up.
For years, “human in the loop” was the safety phrase that put everyone at ease. The AI does the work, a person checks it before anything ships, and the worst outcomes get caught by someone with a brain and a conscience. Fine. That made sense when AI was bad enough that errors were obvious, like a hallucinated citation or a piece of code that wouldn’t compile. The human’s job was to catch what the model couldn’t possibly get right.
That world is gone. The AI now produces output that looks finished. It compiles, it runs, it passes the demo. The bug isn’t visible until someone with the right kind of expertise actually goes looking, and most of the time, nobody does.
When I hear someone say their workflow has a human in the loop, the questions I want to ask are: which human, and what are they qualified to catch?
This is more than a semantic complaint. The gap between AI as an accelerator and AI as a liability generator often comes down to who, exactly, is in that loop, and at WebArt Design we are seeing both outcomes, sometimes inside the same project.
Two worlds, one tool
The same AI tooling that builds a polished prototype in a weekend can also ship a security incident into production. What changes between those two outcomes is rarely the model. It’s the context the model is operating inside.
In a startup or solo-builder context, vibe coding works fine. You move fast, the AI maintains its own mess, and if the codebase becomes incoherent, you can regenerate large parts of it without anyone losing sleep. Technical debt accrues against future-you, and future-you might not even need to pay it off. There’s a real argument that traditional code-quality concerns (readability, naming conventions, modularity) only matter because humans maintain code. If only the AI ever reads it, who exactly are you optimising for?
I’m sympathetic to that argument right up until the moment the project has real users and real data. Then everything changes.
In an enterprise or regulated context, you have pre-existing data structures, compliance obligations, integrations with systems no one fully understands, and an audit trail that someone is going to subpoena one day. The AI’s output is the easy part. The guard rails around the AI are the entire job. Quality in this world is bounded not by what the model can produce, but by how strictly the organisation can enforce its rules on the output.
The dangerous middle is the growing band of mid-market companies who are trying to apply startup workflows to enterprise problems. They saw the demo, they liked the speed, and they are about to find out that “it works” and “it’s safe to run on customer data” are different sentences.
The economics that nobody priced in
For two decades, the SaaS narrative was that software costs trend toward zero. Once you’ve written it, the marginal cost of one more user is basically nothing. That assumption shaped how every modern company budgets technology.
AI breaks that assumption.
Every meaningful AI interaction has an inference cost. Add a few third-party services on top (a vector database, a voice provider, an embedding model) and every feature you ship now carries a bill that compounds with use. Per-token prices are falling, yet enterprise AI bills are rising sharply. The FinOps Foundation’s 2026 State of FinOps Report identifies AI and data platforms as the fastest-growing new category of enterprise spend. Average enterprise AI budgets have climbed from around $1.2 million in 2024 to $7 million in 2026, with some Fortune 500 companies reporting monthly inference bills in the tens of millions.
Bills go up because successful AI features get used. Agentic workflows trigger ten to twenty model calls per user task. RAG architectures inflate context windows. Always-on monitoring agents burn compute around the clock. The cost curves are reshaping software economics in a way that looks more like CapEx than SaaS.
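To make that concrete, here is a back-of-envelope sketch of how per-task costs compound. Every number below is an illustrative assumption, not a quote from any provider; plug in your own call counts and rates.

```ts
// Rough inference cost per user task for an agentic workflow.
// All figures are assumptions for illustration; substitute your own.
const callsPerTask = 15;           // agentic workflows often chain 10-20 model calls
const inputTokensPerCall = 8_000;  // RAG-inflated context
const outputTokensPerCall = 800;

// Hypothetical prices in USD per million tokens.
const inputPricePerMillion = 3.0;
const outputPricePerMillion = 15.0;

const costPerTask =
  callsPerTask *
  ((inputTokensPerCall / 1_000_000) * inputPricePerMillion +
    (outputTokensPerCall / 1_000_000) * outputPricePerMillion);

const tasksPerMonth = 50_000;
console.log(
  `~$${costPerTask.toFixed(2)} per task, ~$${Math.round(costPerTask * tasksPerMonth).toLocaleString()} per month`
);
// Roughly $0.54 per task and $27,000 per month under these assumptions;
// invisible in a demo, very visible on an invoice.
```

The exact figures don’t matter. The shape of the curve does: cost now scales with usage, which is exactly the assumption SaaS budgeting never had to make.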
For our clients, this means the infrastructure decisions they punted on during the prototype phase have a habit of arriving as a tax later. A vibe-coded stack will pick the most convenient backend, not the most appropriate one. The kid building a weekend prototype on Supabase isn’t thinking about per-row read costs at 10 million records. The expert who eventually has to scale that system is.
This is one of the quieter arguments for expert oversight: an expert priced in the cost of scale before the prototype was built. A non-expert is hoping the bill stays small.
What “expert in the loop” actually means
“Human in the loop” became the default safeguard in AI workflows back when models couldn’t reliably produce final output and a person was needed to catch obvious errors. The role assumed by that human was supervisory but not specialised. Anyone reasonably attentive could do the job.
That’s no longer true. AI now produces output that looks correct, and a non-expert reviewer cannot reliably spot what’s missing. Pressing approve on something you don’t understand is governance theatre dressed up as oversight.
Expert in the loop is a different proposition. It says the person evaluating AI output needs the domain knowledge to spot what isn’t there, not just what is.
In our work, that judgement lives across a few areas the model cannot self-supply. There’s product and UX judgement, which covers what to build, what to leave out, and what users actually need versus what they say they need. AI is happy to build whatever you ask for, and it will not push back when you ask for the wrong thing.
There’s architecture and data judgement, which covers cost, scale, security, and maintainability trade-offs that compound silently. The choice of database, the data model, the auth strategy, the way state is managed across services. These decisions can quietly poison a system for years if they’re made wrong, and an LLM has no skin in that long-term game.
And then there’s the translation between business reality and technical feasibility. Compliance constraints, vendor relationships, internal politics, the fact that the marketing team owns one critical system and won’t let anyone touch it. None of this is in the model’s training data for your specific company. An expert maps it. The AI cannot.
Once you frame oversight this way, the expert’s day-to-day work changes. Setting constraints rather than typing code by hand. Codifying organisational standards into instructions the AI will follow. Reviewing the shape of the decisions the AI is making, beyond just the line-by-line output. Knowing when to override, and having the authority to do so.
What the data says about getting this wrong
I want to be specific here, because vague warnings about AI risk are a dime a dozen. The numbers are bad, and they aren’t getting better.
Veracode’s 2025 GenAI Code Security Report tested over 100 large language models across 80 real-world coding tasks. Forty-five percent of AI-generated code samples contained vulnerabilities aligned with the OWASP Top 10. When the models were given an explicit choice between a secure and an insecure implementation, they chose the insecure one 45 percent of the time. Newer and larger models did not perform meaningfully better. Veracode’s CTO Jens Wessling described it as a systemic issue, not a model-scaling problem.
The Lovable security incident from early 2026 made the abstract concrete. A vulnerability in Lovable’s API allowed any free-tier account holder to access another user’s source code and database credentials in five API calls. The flaw was reported in March, patched only for new projects, and left open for existing ones for 48 days. One of the affected projects belonged to a Danish nonprofit and exposed real user records linked to staff at major employers. A separate scan by security firm Wiz found that around one in ten Lovable apps audited were leaking user data through the same class of flaw. The Common Vulnerabilities and Exposures database now includes CVE-2025-48757 specifically for this category of misconfiguration.
The Moltbook breach from the same period exposed 1.5 million API authentication tokens and 30,000 email addresses. The platform’s founder publicly stated he hadn’t written a single line of code. The vulnerability wasn’t sophisticated; it was a missing Row Level Security policy on a Supabase database, with the public API key embedded in the client-side bundle. Properly configured, that key is safe. Without RLS, it grants full unauthenticated read and write access to the entire database.
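For readers who haven’t touched Supabase, here is a minimal sketch of why that combination is dangerous. The project URL, key, and table name are placeholders; the client calls are the standard supabase-js API.

```ts
import { createClient } from "@supabase/supabase-js";

// The anon (public) key is designed to ship in the browser bundle. It is only
// safe because Row Level Security is supposed to decide what each request can
// touch. URL and key below are hypothetical.
const supabase = createClient(
  "https://example-project.supabase.co",
  "public-anon-key"
);

// With RLS disabled on the table, this unauthenticated query reads every row.
// With RLS enabled and a policy like `using (auth.uid() = user_id)` in place,
// the same query returns only what the caller is entitled to see.
const { data, error } = await supabase.from("api_tokens").select("*");
```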
A wider scan by Escape.tech of 5,600 vibe-coded applications found over 2,000 critical vulnerabilities, 400 exposed secrets, and 175 instances of personally identifiable information leaking through public endpoints. Georgia Tech’s Vibe Security Radar tracked 35 CVEs directly attributed to AI-generated code in March 2026 alone, up from 6 in January. The researchers estimate the actual rate is five to ten times their detected count.
The story here is what happens when AI generates code that works on the happy path, and no expert verifies the unhappy path before it ships.
The specific things experts catch
Here’s the practical version of all that. These are the failure modes we’ve found in client work, and each one is solvable, but only by someone who knows to look.
Secrets in the wrong place. API keys hardcoded in source, ending up in client bundles, ending up in commits. The AI doesn’t always know which keys are safe to expose and which aren’t. A developer who has worked with Supabase or Firebase knows that the public key being “designed to be public” is not a complete defence if your access policies have gaps.
Dependency hygiene. When you run npm install on a freshly generated project, the install log will often surface known vulnerabilities and deprecated packages. A developer notices. The AI generally doesn’t flag it, and neither does the non-technical user staring at a green checkmark.
Third-party data flow. Sensitive user data routed through vendor APIs without anyone tracing where it lands or what’s retained. The legal and reputational exposure is real, and it’s not the AI’s problem to manage.
Architectural drift and bloat. Importing entire UI libraries to use two components. Mixing tools that the documentation explicitly warns against combining. Duplicating logic across files because the AI lost track of where it had already implemented something. Each individual instance is small. The cumulative cost shows up as bandwidth bills and maintenance pain.
The testing void. AI is generally good at not breaking the thing it’s currently editing. It is not equipped to know that the feature it shipped on Tuesday silently broke the feature shipped six weeks ago. Without tests, every release is a coin flip. The bigger the codebase grows, the worse the odds get. A senior engineer instinctively builds tests as a forcing function for correctness. A vibe-coded project usually has none, and the absence of them only becomes a problem when something subtle breaks in production.
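A regression test doesn’t have to be elaborate to do that job. A minimal sketch, assuming a hypothetical billing helper and a Vitest-style runner:

```ts
import { describe, it, expect } from "vitest";
import { calculateInvoiceTotal } from "./billing"; // hypothetical module

describe("calculateInvoiceTotal", () => {
  it("applies the discount before tax, not after", () => {
    const total = calculateInvoiceTotal({ subtotal: 100, discount: 0.1, taxRate: 0.2 });
    expect(total).toBeCloseTo(108); // (100 * 0.9) * 1.2
  });
});
```

The value isn’t the single assertion. It’s that a change made months later can no longer break this behaviour without a failing build telling someone.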
Inverted access control. This one’s the subtlest. The AI implements authentication that looks right but is logically inverted. Authenticated users get blocked and unauthenticated visitors get full access, because the model produced something that pattern-matched to “access control” without understanding which side of the gate the locked door was on. Researchers have documented this exact bug across more than 170 production applications.
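Sketched as a hypothetical Express-style middleware (assuming some upstream step attaches `req.user` for authenticated requests), the inverted and correct versions differ by a single negation, and both read like access control at a glance:

```ts
import type { Request, Response, NextFunction } from "express";

type AuthedRequest = Request & { user?: unknown };

// An inverted gate of the kind described above (hypothetical example):
// logged-in users get rejected, anonymous visitors sail straight through.
function requireAuthInverted(req: AuthedRequest, res: Response, next: NextFunction) {
  if (req.user) {
    return res.status(403).json({ error: "Forbidden" });
  }
  next();
}

// What the gate was supposed to do.
function requireAuth(req: AuthedRequest, res: Response, next: NextFunction) {
  if (!req.user) {
    return res.status(401).json({ error: "Authentication required" });
  }
  next();
}
```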
The pattern across all of these: AI prioritises making the feature work. Security and operational soundness are non-functional requirements, and the model treats them as secondary. If you don’t have someone on your team who treats them as primary, they don’t get treated as primary.
The signal-distortion problem
There’s a softer concern that runs underneath all of this, and it’s worth naming because it doesn’t show up in security reports.
AI is helpful in a particular way. It’s pleasant to interact with, draws people out, smooths over rough edges, and generates tidy summaries that feel authoritative. That’s a feature when you’re collecting information from a user. It’s a problem when the AI’s output is taken as the source of truth.
There’s a growing body of research on LLM sycophancy, the tendency of models to agree with users and align their responses with stated preferences even when accuracy suffers. A 2026 study from Stony Brook found that complimentary, agreement-aligned model behaviour can reduce perceived authenticity of the output. A separate study found that when users mention a wrong answer in their prompt, model accuracy can drop by up to 15 percentage points. Newer “better” models are not necessarily more resilient to this. GPT-4.1 showed larger sycophancy effects than older GPT-4o on the same tests.
The implication for AI-built products is that the output bends toward the user’s expectations. An AI summarising customer interviews is partly shaped by what the user implicitly wanted to hear. An AI screening job candidates produces a polished overview that may flatten the things a hiring manager most needs to see. The output sounds confident regardless of whether it should.
I’m not arguing this makes AI unusable in these contexts. I’m arguing that someone with the relevant expertise has to decide which layer of the output to trust for which purpose. That’s an expert judgement, and it can’t be automated away.
What this looks like at our scale
Most of the projects WebArt Design takes on now have AI in the build process somewhere. Sometimes it’s our team using it as an accelerator. Sometimes it’s a client coming to us with a Replit or Lovable prototype and asking us to get it production-ready. Sometimes it’s a legacy system being modernised and the question is whether AI-assisted refactoring can carry part of the load.
In all of these, our job has shifted. We write less code by hand than we used to. We spend more time setting constraints, codifying organisational rules into AGENTS.md files and Claude skills, reviewing the architecture the AI is gravitating towards, and pushing back when the easy answer is the wrong one.
The clients who get the most out of AI tooling are the ones who recognise this. They use the model to multiply expertise that already exists, rather than trying to replace it. The clients who struggle are the ones who saw a demo, decided they didn’t need us, and came back six months later with an app that works for ten users and falls over at a hundred.
I don’t have a moralistic point to make about that. It’s just what we keep observing.
What to actually do about it
If you’re a solo founder or solo builder, ship the prototype. That’s what the tools are good at, and the speed advantage is real. Just be clear-eyed about which decisions you’re punting to future-you and which you’re betting the company on. Bring an expert in before the rebuild becomes inevitable, not after. The cost of fixing a vibe-coded codebase six months in is almost always higher than the cost of getting expert input early.
If you run a mid-market company that’s started adopting AI tooling internally, the biggest risk is rarely the AI itself. It’s adopting startup workflows without acknowledging that you’re not a startup. Codify the constraints. Make the AI follow your rules instead of relying on humans to spot the violations. Standards that live in someone’s head don’t get applied at AI generation speed. Standards in a config file or an AGENTS.md do.
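What “codify the constraints” can look like in practice is unglamorous: a short rules file that the AI tooling reads on every run. A hypothetical sketch of the kind of entries we mean, not a template to copy verbatim:

```md
# Engineering rules for AI-assisted changes

- Secrets never appear in source or client bundles; all keys come from environment variables.
- Every new database table ships with Row Level Security policies, reviewed by a human.
- New dependencies need a one-line justification in the pull request description.
- Every bug fix includes a regression test that fails without the fix.
- Changes to auth, billing, or data export require sign-off from a senior engineer.
```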
If you run an enterprise, the guard rails are the strategy, and the AI is the easy part. Expert oversight is the difference between AI as accelerator and AI as a liability machine. The companies that win the next few years won’t be the ones that adopted AI fastest. They’ll be the ones that built the most rigorous expert oversight around it.
The closing thought
Expertise becomes more valuable when AI is in the picture, not less. The teams that recognise this and build their workflows around expert oversight will get more out of these tools over the next few years than the ones still treating “approved by a human” as a sufficient safeguard.
“Human in the loop” was the right phrase when AI failures were obvious. The failures we see now look like working software, and the only people who can reliably tell the difference are the ones with the relevant expertise. That’s the actual condition of the field right now, marketing language aside, and the workflows that ignore it are going to keep producing the kinds of incidents we keep reading about.
If you’re sitting on a vibe-coded prototype and wondering what it would take to get it into production safely, or if you’re a leadership team trying to figure out how AI tooling fits into your organisation without creating a new class of risk, that’s the conversation we have most weeks. Happy to have it with you too.


