March 26, 2026
The Goldilocks Problem: What We Can All Learn from Physicians Using AI
Jim Wagner
The debate over AI tends to swing between two simplistic positions. In one, AI is overhyped and professionals should trust their own expertise. In the other, AI is so capable that ignoring it is irrational.
Two recent studies suggest the truth is more demanding than either — and the lesson reaches well beyond medicine.
The first failure mode: too little
A Stanford-led randomized trial gave 50 physicians either GPT-4 alongside conventional resources or conventional resources alone. The result was striking in its non-result: physicians with GPT-4 scored just 1.6 percentage points higher than those without. Not statistically significant. But GPT-4 alone, run by the study team with deliberate prompting, scored nearly 16 percentage points above the unassisted group.
The AI was demonstrably capable. The physicians had access to that capability. They didn’t unlock it.
The study authors point toward the explanation: prompt quality, workflow design, and the broader challenge of clinician-AI interaction. The physicians were expert clinicians. They were not yet expert collaborators with AI. Access is not the same as effective collaboration.
This is the first failure mode. You have a capable tool. You don’t engage with it seriously. Performance stays near baseline, the AI’s advantage evaporates, and value is left on the table.
The second failure mode: too much
A scoping review published in 2026, “Artificial intelligence in medicine: a scoping review of the risk of deskilling and loss of expertise among physicians,” examined the opposite risk. What happens when professionals rely on AI not too little, but too much?
The finding is sobering. Over-reliance on AI creates conditions for deskilling: the gradual erosion of the clinical reasoning skills that practitioners spent years developing. This is not the story of a single bad call. It is a story about trajectory. A professional who consistently defers to AI output — even when that output is correct — may stop exercising the cognitive processes that built their expertise. Output stays acceptable. The underlying capability quietly weakens.
This failure mode is harder to detect precisely because it doesn’t announce itself. Nothing goes visibly wrong. The atrophy is cumulative and often invisible until the moment when independent judgment is required and the muscle has weakened from disuse.
Two timescales, one problem
Read together, these studies don’t contradict each other. They describe the same problem operating on two different timescales.
The Stanford study captures the immediate performance cost of under-engagement: a physician who doesn’t use AI well scores no better than one without it, while the AI alone outperforms both.
The scoping review captures the longitudinal cost of over-engagement: a professional who defers too much to AI may produce adequate work today while slowly eroding the judgment that would allow them to work independently, critically evaluate AI output, or recognize when the system is wrong.
This is the Goldilocks Problem of human-AI collaboration: too little trust and you leave capability on the table. Too much trust and you risk becoming dependent on a system you can no longer effectively audit.
Three positions on the spectrum
The failure modes aren’t binary. They define a spectrum with three recognizable positions.
The first is underuse. Human judgment dominates, AI signals are dismissed or superficially consulted, and performance stays at baseline. The cost is immediate and measurable: you are paying for a capability you are not using.
The second is calibrated collaboration. AI proposes, humans interrogate, verify, and decide. Prompting is treated as a skill worth developing. Workflows are built around verification rather than acceptance. This is where the real promise of AI lives — not AI replacing human judgment, but AI expanding the field of possibilities while humans test, refine, and decide.
The third is overreliance. AI output begins to substitute for human reasoning rather than support it. In the short run, work may look acceptable. Over time, the cognitive habits that generate expertise weaken. Professionals become dependent on a system they are less and less equipped to question.
What this means beyond medicine
These studies use physicians and clinical vignettes, but the mechanism is not specific to healthcare. Anywhere a skilled professional uses AI to assist with judgment-intensive work, the same three positions exist.
In legal review, a contract analyst who ignores AI-flagged issues will miss things the AI caught. One who accepts every AI recommendation without independent evaluation will import whatever errors or gaps the system carries. One who lets AI do all first-pass reasoning, year after year, may find their own reading and reasoning skills have quietly weakened.
The same dynamic applies in research operations, negotiation strategy, protocol analysis, and financial modeling. The domain changes. The failure modes are the same.
What calibrated collaboration actually requires
Deploying AI is not the same as building effective human-AI collaboration. The Stanford study makes that plain. The scoping review adds the longer-term dimension: even effective deployment, if it tips toward over-reliance, carries its own risks.
Calibrated collaboration requires three things that most organizations are not yet systematically providing.
The first is training — not just in how AI works, but in when to trust it and when to push back. Prompt engineering matters. So does knowing the failure modes of the systems you use, recognizing when output is overconfident, and maintaining the habit of independent verification.
The second is workflow design. Systems that deliver AI recommendations as verdicts invite rubber-stamping. Systems designed so that AI surfaces information and humans make decisions — with verification checkpoints built in — keep human judgment genuinely in the loop.
The third is a culture that treats expertise as worth preserving. If AI handles more and more of the cognitive work, the reasoning skills that allow professionals to evaluate, override, and improve AI outputs will weaken. The goal is not to protect humans from AI. It is to ensure that humans remain capable enough to be genuine partners with it.
The Goldilocks zone is not a default state. It requires active cultivation. But the two studies together make the case clearly: both failure modes are real, both carry measurable costs, and neither resolves itself without deliberate attention.
That is the lesson from the physicians. It applies to all of us.
If you’re ready to move beyond both failure modes and build calibrated collaboration into your workflows, book a demo to see how we’ve designed AI capabilities that keep human judgment genuinely in the loop.
Jim Wagner is CEO of The Contract Network, where AI is used responsibly under enterprise-grade security controls to help research sites and sponsors optimize clinical trial agreements and budgets. TCN’s AI implementation follows CHAI principles, maintains SOC 2 Type II compliance, and prohibits model training on customer data.