Blog
February 9, 2026
Jim Wagner
Before the market declared SaaS dead, it should have tested Anthropic’s new tools first. We did.
The stock market wiped hundreds of billions in value before anyone ran an actual test. We ran three.
Key Takeaways
- The market reaction assumed horizontal AI can replace vertical software. Nobody tested whether that’s true.
- We tested it. On a clinical trial agreement, using the identical playbook, a purpose-built system produced 163 substantive tracked changes. Claude produced 11.
- The gap is architectural, not cosmetic — and it widens as document complexity increases. If AI is a threat to vertical software, that threat should have been priced in months or years ago, not the week Anthropic shipped a plugin.
- Claude is an extraordinary model. We use it daily. But a plugin is not a platform — and nothing Anthropic announced addresses the multi-party orchestration, shared data models, and institutional workflows that define complex vertical software.
Last week, I got a call from one of our investors. Then from one of our partners. Then a message from one of the world’s largest PE firms. All asking some version of the same question: Does the Anthropic announcement change everything?
If you’ve been anywhere near enterprise software in the past ten days, you know the context. Anthropic launched a series of plugins for its Claude platform, including tools that can review contracts, triage NDAs, and run compliance checks. The stock market responded by wiping hundreds of billions of dollars from the market capitalizations of companies including Thomson Reuters, RELX, Wolters Kluwer, DocuSign, and LegalZoom. The headlines called it a “SaaSpocalypse.” Investors started asking whether every vertical software company was now obsolete.
The commentary that followed was voluminous. Some of it was informed. Much of it was not. What was almost entirely absent was something simple: an actual test. Someone running the tool against a purpose-built platform on the same agreement, with the same playbook, and comparing the results.
What was almost entirely absent was something simple: an actual test.
So we did exactly that.
A note on what this is and isn’t. TCN is our platform, and we’re obviously not a disinterested party. But the purpose of this study isn’t to argue that TCN is “better” than Claude — that framing misses the point entirely. We used our platform as an illustrative comparator because we had the playbooks, the agreements, and the infrastructure to run a controlled test. The real questions are broader: Did anyone test these tools before declaring the death of SaaS? Is a general-purpose model with a vertical plugin actually coming for the intricate, multi-document, multi-party workflows that define enterprise vertical software? And if AI is genuinely a threat to this category, why would the market reprice now — rather than months or years ago, when these models were already capable?
We also want to be transparent about methodology. We used Anthropic’s own Claude Opus 4.6 — the most capable publicly available model — to evaluate the results of each test, including the section-by-section determinations and the analysis of why performance varied so significantly. It did an excellent job of explaining the gaps, and we concur with its assessments. The first draft of every benchmark determination in our study was generated by Claude itself.
The Contract Network (TCN) is an AI-powered platform that helps research sites, sponsors, and CROs accelerate study startup, including the negotiation of clinical research agreements. TCN operates under the mission “faster contracts, faster cures.”
Why the trust threshold matters
Before I walk through what we found, I need to frame why it matters in a way that’s specific to legal work.
Lawyers are not early adopters by temperament, and they don’t grade on a curve. A tool that reviews a contract and misses a material protection doesn’t get classified as “promising but incomplete.” It risks being shelved. Permanently. The standard is binary: either the tool is reliable enough that I can build a workflow around it, or it isn’t. There is no middle ground where a legal team says “it caught seven out of ten critical issues, so let’s use it for now.”
This is especially true in regulated environments — clinical trials, financial services, healthcare — where a missed clause isn’t an aesthetic problem. It’s a liability exposure, a regulatory finding, or a damaged institutional relationship. The question isn’t whether AI can review contracts. It can. The question is whether it can do so at the threshold required for a professional to rely on it.
That’s the standard we tested against.
The standard is binary: either the tool is reliable enough that I can build a workflow around it, or it isn’t.
Test one: Claude's built-in NDA skill vs. TCN
We started with a straightforward M&A confidentiality agreement — a sell-side NDA for a private equity acquisition. This is a well-understood document type. Any experienced PE lawyer has seen hundreds of them.
We ran it through Claude’s built-in legal skill and through TCN’s platform. Claude produced roughly six substantive text changes and left eight of fifteen sections completely untouched. Where it did engage, it mostly added comments — explanatory notes flagging issues and suggesting negotiation strategy — rather than executing actual redlines. TCN produced over fifty substantive text changes across virtually every section, added six entirely new protective provisions, and delivered a markup that reflected deep PE buyer-side practice: portfolio company protections, non-solicit narrowing with carve-outs, a residuals clause, VDR click-through overrides, and assignment restrictions.
The result was decisive. But there was an obvious objection: TCN’s playbook is substantially more sophisticated than Claude’s default instructions. We were comparing a purpose-built analytical engine running a refined playbook against a general-purpose model running generic guidance (albeit guidance that Anthropic chose to ship as a product feature). Still, not apples to apples.
Fair enough. So we leveled the playing field.
Test two: Give Claude our playbook
We took TCN’s own playbook — the same structured positions, the same guidance, the same negotiating logic — and gave it to Claude to use in its specialized contract review environment. Same agreement. Same playbook. Claude’s turf.
The results improved. Claude made roughly thirty substantive changes, added five new sections, and covered most of the core buyer protections. But TCN still made 50% more required changes, and the qualitative differences were revealing. TCN wove protections into existing contract language — inserting a carveout for backups into the return-and-destroy section, integrating the definitive agreement concept into the no-obligation-to-proceed provision. Claude tended to append new standalone sections at the end of the agreement — a cleaner approach visually, but one that experienced practitioners recognize as easier for counterparty counsel to reject as “new asks.” TCN included PE-specific provisions that Claude missed entirely, despite sharing the identical playbook: a three-business-day conflicts check window, a financing carve-out, portfolio company non-imputation language, and an anti-assignment clause.
A fair-minded evaluator would say Claude’s output is useful — and that it reads like the work of an inexperienced associate, while TCN’s reads like the work of a senior practitioner whose output reflects deep expertise in both the subject matter and the AI implementation behind it.
Claude appended new standalone sections at the end of the agreement — easier for counterparty counsel to reject as “new asks.” TCN wove protections into existing contract language.
But the NDA is a relatively short, well-understood document. We wanted to see what happened when we raised the stakes.
Test three: The clinical trial agreement
A clinical trial agreement is a different animal. It’s longer, more technically complex, and touches regulatory frameworks — HIPAA, FDA reporting obligations, IRB oversight, 21 CFR Part 54 financial disclosure — that require genuine domain expertise. The provisions interact with each other in ways that matter: a change to monitoring visit procedures can impact confidentiality obligations; a publication review period needs to account for patent deferral timelines; a subject injury provision needs to include a safe harbor for protocol deviations made to protect patient safety.
Once again, we gave Claude the identical playbook TCN uses — one specifically structured for AI consumption, with clear logic and well-defined positions — and ran both systems against the same clinical trial agreement.
The gap didn’t narrow. It widened.
TCN made 101 insertions of required protective language and 62 targeted deletions — 163 substantive changes in total. Claude made 7 insertions and 4 deletions. Tellingly, Claude’s changes were largely find-and-replace-level revisions: substituting “immediately” with “promptly,” replacing “sole” with “reasonable,” increasing an insurance figure, and adding pandemic language to a force majeure clause. These are real edits. They are also the edits a first-year associate would make in the first twenty minutes of review.
Claude’s changes were the edits a first-year associate would make in the first twenty minutes of review. Entire categories of institutional protections went unaddressed.
What Claude left untouched — using the identical playbook — included HIPAA restrictions on sponsor use of protected health information, safety reporting obligation rewrites, publication timeline restructuring that reduced maximum delay from ninety to forty-five days, indemnification scope expansion to cover IP infringement and PHI misuse, a safe harbor for protocol deviations made to protect patient safety, financial disclosure obligations under 21 CFR Part 54, subject injury cost provisions, monitoring visit confidentiality and access controls, record retention mechanics, and warranty disclaimers.
These aren’t edge cases. They are core institutional protections that our playbook and any experienced clinical trial agreement negotiator would address.
Section by section, TCN produced the stronger edit in twenty-two of thirty sections. Claude was stronger in one — a useful flag on an IP license provision that TCN missed. Seven were comparable or mixed. The detailed section-by-section comparison and the actual Word markups are available here.
Why the gap exists
These results are not a reflection of Claude’s quality as a language model. Claude is an extraordinarily capable general-purpose AI, and we use it daily in our own work. The gap is a reflection of architecture and ambition.
Claude’s legal plugin reads an entire agreement and an entire playbook, then attempts to produce all of its analysis and redlines in a single pass. This is analogous to asking a lawyer to read a thirty-page contract and a fifty-topic playbook simultaneously, then dictate every markup from memory in one sitting. Issues inevitably get lost — not because the lawyer lacks ability, but because the task exceeds what any single-pass process can reliably accomplish.
A purpose-built system works differently. Each playbook position is matched against the agreement independently and analyzed in a dedicated step with only the relevant clause text and guidance in front of it. Nothing competes for attention. Every position in the playbook is programmatically guaranteed to be evaluated. The system doesn’t need to “remember” to check a provision — it cannot skip one.
This also explains why the gap widened on the longer, more complex clinical trial agreement. The more provisions, the more playbook positions, and the more regulatory context a single-pass system must hold in working memory simultaneously, the more it drops. A purpose-built pipeline scales linearly. A single-pass approach degrades.
The system doesn’t need to “remember” to check a provision — it cannot skip one.
For a deeper technical discussion of how these architectures differ, including decomposed analysis, structured playbook enforcement, and auditability constraints, see our companion piece here.
What this means — and for whom
The ramifications of Anthropic’s move are real. But they differ substantially depending on where you sit.
For general-purpose AI providers like Anthropic, the results should be encouraging. Claude produced a passable NDA markup with the right playbook — proof that the foundation is powerful. Our own platform uses Claude as one of the models in our pipeline, which means we know firsthand that the model is capable of this work when it operates within the right architecture. The limitation isn’t Claude. It’s the plugin. A single-pass skill layered on top of a foundation model is not the same thing as a purpose-built system that decomposes the work, enforces every playbook position programmatically, and has been refined over thousands of agreements. Anthropic could, in theory, build that. But that would make them a vertical software company — and that’s not their business or their ambition. They build foundation models and make them as broadly useful as possible. That’s a different, and complementary, enterprise.
For in-house teams considering a DIY approach — using Claude or another foundation model directly with their own playbooks — the results are a caution. It will work for simple, well-understood agreements where the cost of a missed issue is low. It will not work, yet, for complex agreements where comprehensiveness matters. And the dangerous middle ground is where teams believe they have adequate coverage because the output looks polished and professional, while material protections are silently absent.
For SaaS providers whose products are thin wrappers around foundation models — adding a user interface and a clause library but little analytical depth — the threat is existential and immediate. If Claude can produce a comparable output to your product using the same model you’re built on, your value proposition has evaporated. The market commentary on this point is correct: the wrappers are in trouble.
For purpose-built platforms with genuine domain architecture, proprietary analytical frameworks, and deep subject-matter expertise, the Anthropic announcement is a competitive event, not an existential one. The data shows that purpose-built systems outperform general-purpose models on complex legal work by an order of magnitude — not because the models are weak, but because the problem demands more than a model and plugin alone can deliver.
And then there is a constituency that has received less attention in the current discourse but deserves more: traditional service providers whose business is document-intensive analysis. Law firms, contract management consultancies, outsourced legal operations teams. For these providers, the threat is not Claude alone — it’s the convergence of foundation models and purpose-built platforms approaching a point where end-to-end outcome delivery is within reach. The question is no longer whether AI can assist with contract review. It’s whether AI-powered platforms can deliver a finished, reliable, comprehensive markup — the actual outcome, not just a tool to help produce one. That threshold is closer than most service providers appreciate, and the competitive pressure will come from both directions: general-purpose AI making basic work free, and purpose-built platforms making complex work dramatically faster and more consistent.
What the market missed
The stock market’s reaction treated Anthropic’s announcement as if a general-purpose model with a vertical plugin is architecturally equivalent to purpose-built vertical software. It isn’t — and the evidence is now available for anyone willing to run an actual test.
But there’s a more fundamental point. Nothing Anthropic announced addresses multi-document congruence, multi-party collaboration, or institutional workflow orchestration. A Claude user reviewing a clinical trial agreement operates in a single chat window with a single document. The protocol, consent form, budget, and coverage analysis — all of which must be internally consistent with the contract — exist nowhere in that workflow. Imagine five users with five separate skills in five disconnected chat windows, each trying to keep their work coordinated, cross-checked, and accurate. There is no shared data model. No audit trail. No collaboration layer. No mechanism to ensure that a change to the protocol ripples correctly through the budget, the consent form, and the contract.
The natural counterargument is that agentic AI frameworks — autonomous agents that chain tasks, manage state, and coordinate across documents — will close this gap. They will have an impact; we use them ourselves, and we take that seriously. But agentic frameworks don’t arrive pre-built with plug-and-play domain solutions. They are tools, not answers. An agent orchestrating clinical trial study startup still needs deep context understanding of the subject matter, the stakeholder requirements, and the interconnectedness of every document and every party involved. It needs to know that a change to a protocol’s schedule of events must ripple through the budget, the consent form, and the coverage analysis — and it needs to know how. That’s not something you install. It’s something you build — substantial work that relies on deep expertise with respect to the subject matter and AI implementation, refined across thousands of agreements. The same architectural principles that separate a plugin from a platform will separate a generic agent from a team of purpose-built ones.
Deep context understanding of the subject matter, the stakeholder requirements, and the interconnectedness of all things — that’s not something you install. It’s something you build.
If the argument is that AI broadly threatens vertical software, that argument has been valid for two years. Foundation models have been capable of impressive single-task performance since GPT-4. The market should have repriced then — or better yet, asked the question we asked: capable at what level of reliability, on what complexity of task, in what kind of workflow?
The answer, as our data shows, is that horizontal AI handles most individual tasks very effectively, struggles with complex tasks that require systematic coverage, and has nothing to say about the multi-party orchestration that defines most enterprise vertical software. That’s not a knock on Claude. It’s a description of what plugins are and aren’t.
Where this leaves us
We use Claude in our own pipeline — which is precisely why we understand both its strengths and its architectural constraints. We use it daily. We build with it. This isn’t dismissal — it’s differentiation.
The past week has been a useful stress test for a question that matters to every stakeholder in technology: can general-purpose AI, tailored with some vertical capabilities, replace purpose-built systems for complex, highly orchestrated work?
The answer, today, is no. And the reasons are architectural, not temporary. Closing the gap on even a single task — contract review — requires decomposed pipelines, structured enforcement, and substantial domain-specific engineering grounded in deep subject-matter and AI implementation expertise. Closing the gap on the full workflow — orchestration, collaboration, congruence across every document and every party in a study startup — isn’t even what Anthropic is attempting. They build foundation models. They build connectors and plugins to make these models as useful as possible. But that’s a different business.
The real story of this moment isn’t that foundation models are coming for vertical software. It’s that the market is finally being forced to distinguish between software that uses AI and software that is AI infrastructure for a specific domain. That distinction will determine which companies thrive, which adapt, and which discover that their product was a feature all along.
The market correction that followed Anthropic’s announcement was a reaction to a press release, not to a product evaluation. If investors had tested the tools before repricing the sector, they would have found what we found: a capable model, a useful plugin, and an enormous distance between that and the operational software that enterprises actually rely on. Better AI is a tailwind for the companies that absorb it into purpose-built platforms. It is a threat to thin wrappers and labor-intensive services. The market would benefit from knowing the difference.
The market correction was a reaction to a press release, not to a product evaluation.
The full results of our study, including interactive section-by-section comparisons and the actual Word markups, are available here.
Key Takeaways
- Single-pass vs. decomposed pipeline is the fundamental architectural difference. A single-pass system’s reliability decays as input complexity increases; a pipeline scales linearly.
- Structured playbook enforcement means every position is evaluated programmatically — the system cannot skip one or decide it “probably doesn’t apply.”
- Purpose-built prompts encode domain expertise into structured analytical frameworks, not general instructions to “review carefully.”
- Consistency and auditability are requirements, not features. For regulated industries, unmanaged variability across runs can be a disqualifier.
If you read our analysis of Anthropic’s legal plugin against TCN’s platform, you saw the headline numbers. On a clinical trial agreement, using the identical playbook, TCN produced 163 substantive changes. Claude produced 11. The section-by-section comparison was 22-1.
The natural question is: why?
This piece is for the people asking that question — particularly those evaluating whether to build contract review workflows internally using a general-purpose AI platform. If you’re a CTO, a VP of research operations, or a legal ops leader at a health system, and you’re looking at your Claude or ChatGPT enterprise license and thinking “we could build this ourselves,” what follows is a frank explanation of why the gap exists and what it would take to close it.
If you’re looking at your enterprise AI license and thinking “we could build this ourselves,” what follows is a frank explanation of why the gap exists and what it would take to close it.
What Claude's legal plugin actually is
Anthropic’s tool is a set of markdown instructions layered on top of Claude’s Cowork and Code environments. A user uploads a contract and a playbook — any text file describing negotiating positions in prose — and the model reads both documents and attempts to flag issues and suggest redlines, all within a single conversational window.
This is important to understand precisely. It is not a purpose-built contract analysis system. It is a very capable general-purpose language model given a role description for a specific task. The playbook is input text that the model interprets on a best-effort basis. There is no mechanism ensuring that every playbook position has been evaluated. There is no structured logic governing how positions are applied.
For short, simple agreements, this works surprisingly well. For a thirty-section clinical trial agreement with fifty-plus playbook positions touching regulatory, financial, operational, and liability frameworks, it does not. And the reasons are structural, not cosmetic.
The single-pass problem
The plugin reads the entire agreement and the entire playbook, then produces all of its analysis and redlines in one chain of thought. Picture what you’re asking the system to do: hold a thirty-page contract in working memory, simultaneously hold a fifty-topic playbook with distinct positions on confidentiality, HIPAA, monitoring, publication rights, indemnification, insurance, termination, and dozens of other subjects, then systematically evaluate every contract provision against every applicable playbook position, and generate precise redline language for each identified gap — all in a single generation pass.
This is analogous to handing a lawyer a contract and a playbook, then asking them to dictate every markup from memory without referring back to either document. Issues get lost. Not because the lawyer is incompetent, but because the task exceeds what a single-pass cognitive process can reliably accomplish.
This is analogous to handing a lawyer a contract and a playbook, then asking them to dictate every markup from memory without referring back to either document. The degradation is predictable and consistent — and it scales with document complexity.
The degradation is predictable and consistent. Simple, prominent issues get caught: “best efforts” becomes “commercially reasonable efforts,” an insurance figure gets increased, pandemic language gets added to force majeure. Complex, interacting issues get missed: the relationship between a publication review period and a patent deferral timeline, the need for a safety-deviation safe harbor in a subject injury provision, the requirement for financial disclosure under 21 CFR Part 54.
And critically, the degradation scales with document complexity. On the NDA — a fifteen-section document — the plugin’s coverage gaps were meaningful but bounded. On the clinical trial agreement — thirty sections, deeper regulatory context, more interacting provisions — the gap widened dramatically. This is predictable behavior given the architecture. A single-pass system’s reliability decays as the input complexity increases.
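To make the shape of the problem concrete, here is a deliberately minimal sketch of what a single-pass review reduces to. It is illustrative only — the call_model stub stands in for any chat-completion client, and the prompt text is ours for the example, not Anthropic’s API or the plugin’s actual instructions.

```python
def call_model(prompt: str) -> str:
    """Placeholder for any chat-completion client; not Anthropic's actual API."""
    raise NotImplementedError("wire up a model client here")


def single_pass_review(contract_text: str, playbook_text: str) -> str:
    # Everything -- thirty sections, fifty-plus positions -- competes for
    # attention inside one generation, and nothing guarantees that every
    # playbook position is actually reached before the output completes.
    prompt = (
        "You are reviewing a contract against a negotiation playbook.\n"
        "Flag every issue and propose redlines.\n\n"
        f"PLAYBOOK:\n{playbook_text}\n\n"
        f"CONTRACT:\n{contract_text}"
    )
    return call_model(prompt)  # one pass, one chain of thought, no backstop
```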
How a purpose-built pipeline works differently
TCN’s system is decomposed into discrete analytical stages. The differences fall into four categories, each of which addresses a specific failure mode of the single-pass approach.
1. Decomposed analysis vs. single-pass generation.
TCN breaks the work apart. Each playbook position is matched against the agreement independently and analyzed by a dedicated AI call with only the relevant clause text and the applicable guidance in front of it. When the system evaluates whether the agreement’s monitoring provisions comply with the playbook’s monitoring position, it is not simultaneously trying to hold the indemnification analysis, the publication rights analysis, and forty-eight other analyses in working memory. Nothing competes for attention.
You can’t solve an attention allocation problem by writing better instructions. You solve it by not asking the system to allocate attention across fifty competing tasks simultaneously.
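For contrast, here is the same sketch reshaped into the decomposed form described above. Again, this is an illustration under assumptions — the Position and Finding types, the topic-keyed matching, and the prompt wording are ours for the example, not TCN’s internals.

```python
from dataclasses import dataclass


def call_model(prompt: str) -> str:
    """Placeholder for a model call, as in the single-pass sketch above."""
    raise NotImplementedError("wire up a model client here")


@dataclass
class Position:
    topic: str      # e.g. "Publication rights"
    guidance: str   # the playbook's stated position for this topic


@dataclass
class Finding:
    topic: str
    analysis: str


def match_clause(sections: dict[str, str], position: Position) -> str:
    """Placeholder: return only the contract text relevant to this position."""
    return sections.get(position.topic, "")


def analyze_position(clause: str, position: Position) -> Finding:
    # A dedicated, scoped call: only this clause and this guidance are in
    # front of the model, so nothing else competes for attention.
    prompt = (
        "Evaluate one clause against one playbook position.\n\n"
        f"POSITION ({position.topic}): {position.guidance}\n\n"
        f"CLAUSE:\n{clause}"
    )
    return Finding(position.topic, call_model(prompt))


def review(sections: dict[str, str], playbook: list[Position]) -> list[Finding]:
    # The loop, not the model, guarantees coverage: every position, every run.
    return [analyze_position(match_clause(sections, p), p) for p in playbook]
```

The work scales with the number of positions, but each individual call stays small, scoped, and checkable.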
2. Structured playbook enforcement.
The plugin treats the playbook as text for the model to interpret on a best-effort basis. If the model decides a particular playbook position isn’t relevant, or simply doesn’t get to it before its generation completes, there is no backstop.
TCN’s playbook positions are structured data objects. Each position has a classification — Required, Preferred, Permitted, or Prohibited — with distinct evaluation logic for each type. A Required position that isn’t reflected in the agreement will always be flagged. A Permitted position that’s present in an acceptable form will always be cleared. The pipeline iterates through every position programmatically. It cannot skip one. It cannot decide that a position “probably doesn’t apply here.” Every position is evaluated, every time, on every agreement.
For a hospital system negotiating clinical trial agreements, this distinction is the difference between “the AI checked most of the important things” and “every institutional position was evaluated and documented.” When your Office of Research needs to certify that an agreement complies with institutional policy, “mostly” is not an acceptable answer.
The pipeline cannot skip a position. It cannot decide that a position “probably doesn’t apply here.” Every position is evaluated, every time, on every agreement.
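As a simplified illustration of what “structured data objects” with distinct evaluation logic can look like, here is a sketch using the four classifications named above. The evaluation rules and field names are assumptions made for the example, not TCN’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    REQUIRED = "required"
    PREFERRED = "preferred"
    PERMITTED = "permitted"
    PROHIBITED = "prohibited"


@dataclass
class PlaybookPosition:
    topic: str
    classification: Classification
    guidance: str


def evaluate(position: PlaybookPosition, present: bool, acceptable: bool) -> str:
    # Explicit logic per classification. A Required position that is absent or
    # unacceptable is always flagged; nothing is left to the model's judgment
    # about whether it "probably applies."
    c = position.classification
    if c is Classification.REQUIRED:
        return "clear" if present and acceptable else "flag: required protection missing"
    if c is Classification.PROHIBITED:
        return "flag: prohibited language present" if present else "clear"
    if c is Classification.PREFERRED:
        return "clear" if present and acceptable else "flag: preferred position not met"
    # PERMITTED: acceptable if present in an acceptable form, or simply absent.
    return "clear" if (not present) or acceptable else "flag: permitted form needs review"


def enforce(playbook: list[PlaybookPosition],
            matches: dict[str, tuple[bool, bool]]) -> dict[str, str]:
    # Programmatic iteration over every position -- the backstop that a
    # free-text, best-effort reading of the playbook lacks.
    return {p.topic: evaluate(p, *matches.get(p.topic, (False, False)))
            for p in playbook}
```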
3. Purpose-built analytical prompts.
Each stage of TCN’s pipeline uses prompts that encode contract negotiation expertise into structured analytical frameworks. The alignment analysis prompt contains a complete taxonomy engine that classifies every playbook position by type and sub-type, with explicit evaluation logic and fallback rules for each. The revision prompt requires the system to inventory every defined term in the agreement, map synonymous terms between playbook and contract, preserve existing clause numbering and headers, and produce surgical inline edits rather than wholesale replacements.
These prompts are specialized instruments refined through thousands of iterations on real agreements. They are not general instructions telling the model to “review this contract carefully.” They are precise specifications for a particular analytical task, designed to interact with a particular data structure, producing a particular output format. The difference in specificity is comparable to the difference between telling someone “analyze this data” and giving them a detailed statistical analysis protocol.
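The difference in specificity is easier to see side by side. The contrast below is a toy example — the numbered steps and output fields are illustrative, not the production prompts of either system.

```python
# Illustrative contrast only: neither string is a real prompt from either system.

GENERIC_INSTRUCTION = "Review this contract carefully and suggest redlines."

REVISION_SPEC = """\
You are revising ONE clause against ONE playbook position.

Complete every step, in order:
1. Inventory every defined term used in the clause.
2. Map synonymous terms between the playbook and the agreement
   (e.g. "Institution" vs. "Site") and use the agreement's term.
3. Preserve the existing clause number and header exactly.
4. Produce surgical inline edits -- insertions and deletions against the
   existing text -- never a wholesale rewrite of the clause.
5. Return JSON: {"clause_id": ..., "insertions": [...],
                 "deletions": [...], "rationale": ...}
"""
```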
4. Consistency and auditability.
As is typical of generative AI systems, the output of a single-pass plugin can vary meaningfully with each execution. Run the same agreement through the plugin twice and you may get different issues flagged, different severity assessments, and different redlines. For a research institution that needs to demonstrate consistent application of institutional policy across hundreds of agreements — to sponsors, to IRBs, to federal auditors — this variability is not a minor inconvenience. It can be a disqualifier.
Purpose-built systems are not immune to variability — any system that uses generative AI has some range of variation. But a staged architecture constrains it: deterministic matching, scoped inputs, and model parameters configured for consistency reduce the range of variation to a level that institutions can validate and rely on. The output for a given agreement and playbook combination is reproducible, auditable, and explainable — you can trace exactly which playbook position generated which finding and which redline.
A staged architecture doesn’t eliminate variability — it constrains it to a level that institutions can validate and rely on.
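Mechanically, auditability means every finding and redline carries a pointer back to the playbook position, the clause, and the model parameters that produced it. The record structure below is a simplified sketch with illustrative field names, not our production schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class AuditRecord:
    position_id: str      # which playbook position produced this finding
    clause_id: str        # which contract section it was evaluated against
    determination: str    # e.g. "flag: required protection missing"
    redline: str          # the exact edit generated, if any
    model_params: dict    # model version, temperature, etc., for reproducibility


def audit_log(records: list[AuditRecord]) -> str:
    # Every redline traces back to a position and a clause -- the property an
    # Office of Research or a federal auditor can actually verify.
    return json.dumps([asdict(r) for r in records], indent=2)
```

Run the same agreement and playbook twice and the log should tell the same story, position by position — that is what makes the output something an institution can validate rather than spot-check.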
What this looks like in practice
To make this concrete, here is what each system did with three provisions from the clinical trial agreement, using the identical playbook.
Safety reporting (§1.6). The playbook requires specific timelines, expanded triggering events, and reporting in accordance with the IRB-approved informed consent form. TCN rewrote the provision: changed “at least two years” to “no more than two years,” replaced the vague “promptly, or in a timely manner” with “within thirty days,” expanded triggering events from two categories to four, and specified the informed consent form as the communication standard. The plugin made minor deletions and left the substantive framework untouched.
HIPAA restrictions (§5.1). The playbook requires restrictions on sponsor use of PHI and bars third-party disclosure without institution consent. TCN inserted both provisions. The plugin made no changes to this section.
Publication rights (§9.1). The playbook requires an extended review period but elimination of the stacking patent deferral. TCN changed the review period from thirty to forty-five days and capped total delay at forty-five days — eliminating the separate sixty-day deferral that would otherwise allow ninety days total. Net result: a longer initial review window but a dramatically shorter maximum delay. The plugin made no changes to this section.
In each case, the playbook clearly specified the institutional position. In each case, TCN’s pipeline identified the relevant contract language, evaluated it against the position, and generated a targeted revision. In each case, the plugin’s single-pass approach either didn’t reach the provision or didn’t generate a redline.
In each case, the playbook clearly specified the institutional position. In each case, the plugin’s single-pass approach either didn’t reach the provision or didn’t generate a redline.
The DIY question
If you’re evaluating whether to build contract review workflows on a general-purpose AI platform, the question isn’t whether the underlying model is capable. It is. Claude and GPT-4 are extraordinary language models. The question is whether a model alone — even with a good playbook, even in a specialized mode — is sufficient for the level of reliability your institution requires.
For low-stakes, well-understood agreements where a missed issue is correctable, a general-purpose model with a solid playbook may be adequate. Material transfer agreements, straightforward service contracts, simple NDAs — these are reasonable candidates for a lighter-touch approach.
Not because the model is weak, but because the plugin architecture is not designed for this level of complexity.
For clinical trial agreements, sponsored research agreements, data use agreements, and other documents where the stakes include regulatory compliance, institutional liability, patient safety, and federal audit exposure, the evidence says a single-pass approach is not sufficient. Not because the model is weak, but because the plugin architecture is not designed for this level of complexity.
Building a system that closes the gap requires decomposed pipeline architecture, structured playbook data models, purpose-built analytical prompts refined over thousands of agreements, consistency and auditability mechanisms, and deep domain expertise in the specific agreement types you’re negotiating. That is not a weekend project for your innovation team. It’s a substantial engineering effort requiring deep expertise in both the subject matter and the AI implementation — which is exactly what purpose-built platforms represent.
The emergence of agentic AI frameworks — systems that chain tasks, manage state, and coordinate across documents — will make it easier to build decomposed pipelines. We use them ourselves. But an agentic framework is infrastructure, not a solution. The agent still needs to know what to decompose, how to evaluate each component, and what domain-specific logic governs the interactions between them. The same deep context understanding that separates a plugin from a platform will separate a generic agent from a team of purpose-built ones.
A note on what Claude does well
Fairness requires saying this: Claude’s user experience is genuinely excellent. The ability to upload a contract, describe what you need in natural language, and get a working first pass in minutes is valuable. The conversational iteration — “be more aggressive on the indemnification clause,” “what’s the risk if we accept this term?” — is something purpose-built platforms should learn from. And for education and triage — helping a non-specialist understand what’s in an agreement before routing it to the right expert — it’s a powerful tool.
The risk is assuming that capability extends to comprehensive, reliable, auditable contract negotiation at institutional scale. It doesn’t, yet. And the reasons are architectural, which means the path to getting there runs through the kind of engineering we’ve described here, not through better prompting or larger context windows.
The full results
The detailed section-by-section comparisons, including interactive analysis of both the NDA and clinical trial agreement benchmarks and the actual Word markups, are available here.
Full Benchmark Results
Three head-to-head tests — same agreements, same playbooks — comparing TCN’s platform against Claude’s legal AI. Section-by-section data for every provision.
This benchmark compares the Claude legal plugin against a purpose-built vertical application. TCN is used as the comparator, but the takeaways apply to any sophisticated domain-specific tool.
The first draft of every section-by-section determination and summary on this page was generated by Anthropic’s own Claude Opus 4.6 — the most capable publicly available model. We reviewed the output for directional accuracy and agree with the conclusions.
Clause-by-Clause Comparisons
Below are example excerpts from the larger clause-by-clause comparisons, shown side by side. Claude’s version appears on the left, TCN’s on the right. Insertions appear in blue underline, deletions in red strikethrough.
NDA — Claude’s Built-in Skill vs. TCN with Playbook
Test 1
Claude reviewed this M&A NDA using its default built-in legal skill — no custom playbook. TCN reviewed the same NDA using its structured playbook profile.
Section 7 — Non-Solicitation
Comparison 1 of 3
7. In consideration of the Evaluation Material being furnished to you, you hereby agree that, for a period of three (3) years from the date hereof, neither you nor any of your Representatives will employ or otherwise retain in any capacity including as a consultant or independent contractor, nor will you solicit for employment or retention in any capacity, nor identify for solicitation for employment or retention in any capacity any of our or the Company’s directors, officers, employees, contractors, consultants, or other representatives without obtaining our prior written consent.
7. In consideration of the Evaluation Material being furnished to you, you hereby agree that, for a period of threeeighteen (318) yearsmonths from the date hereof, neither you nor any of your Representatives who actually receive Evaluation Material will employ or otherwise retain in any capacity, including as a consultant or independent contractor, nor will you solicit for employment or retention in any capacity, nor identify for solicitation for employment or retention in any capacity any of our or the Company’s directors,executive or management-level employees (including officers, and/or key employees,contractors, consultants,) who were first introduced or other representativesinvolved in the Transaction without obtaining our prior written consent; provided, however, that the foregoing shall not prohibit general solicitations or advertisements that are not specifically targeted at our or the Company’s executive or management-level employees (including officers and/or key employees) who were first introduced or involved in the Transaction; provided, further, that this restriction shall not apply to any individual who is no longer employed by us or the Company at the time of such solicitation, employment, or retention; provided, further, that this restriction shall not apply to any individual who contacts you or your Representatives regarding employment or retention without any solicitation by you or your Representatives; and provided, further, that this restriction shall not apply to any individual with whom you or your Representatives were already in discussions regarding potential employment or retention prior to the date of this Agreement.
Section 4 — Use and Disclosure
Comparison 2 of 3
4. You agree that you will use the Evaluation Material strictly for internal purposes related to evaluating a Transaction only, and for no other purpose. You agree not to disclose or allow disclosure to others of any Evaluation Material, except that you may disclose Evaluation Material to your directors, officers, employees, partners, affiliates, agents, advisors, or representatives, including legal counsel, accountants, and financial advisors, (i) who clearly need to know such information for the purpose of assisting you with your evaluation of a Transaction, and (ii) who have a legal, fiduciary, or professional obligation of confidentiality with respect to the Evaluation Material (hereinafter, “Representatives”), provided, however, that you shall cause your Representatives to comply with the obligations and restrictions set forth herein.
4. You agree that you will use the Evaluation Material strictly for internal purposes related tothe purpose of evaluating, negotiating, or consummating a Transaction only, and for no other purpose. You agree not to disclose or allow disclosure to others of any Evaluation Material, except that you may disclose Evaluation Material to your directors, officers, employees, partners, affiliates, portfolio companies, agents, advisors, or representatives, including legal counsel, accountants, and financial advisors, or directors of your portfolio companies, (i) who clearly need to know or reasonably require access to such information for the purpose of assisting you with your evaluation ofevaluating, negotiating, or consummating a Transaction, and (ii) who have a legal, fiduciary, or professional obligation of confidentiality with respect to the Evaluation Material (hereinafter, “Representatives”), provided, however, that you shall causeadvise and direct your Representatives to comply with the obligations and restrictions set forth herein. For the avoidance of doubt, your Representatives may serve as representatives, directors, officers, employees, or agents of your affiliates or portfolio companies, and the receipt of Evaluation Material by such Representatives in such capacities shall not be deemed to impute receipt of Evaluation Material to such affiliates or portfolio companies unless such Evaluation Material is actually provided to them. Nothing in this Agreement shall prevent you or your Representatives from conducting no-names market research related to the Transaction or the industry in which the Company operates, provided that no Evaluation Material is used or disclosed in connection with such research. For the avoidance of doubt, any portfolio company that does not receive Evaluation Material shall not be bound by the terms of this Agreement. You shall not be liable for any Representative who has entered into a separate confidentiality agreement or joinder with the Company.
Section 2 — Confidentiality Exceptions
Comparison 3 of 3
2. Notwithstanding the foregoing, you shall not be required to maintain the confidentiality of those portions of the Evaluation Material that you can show (a) became generally available to the public other than as a result of a disclosure by you, (b) was available to you on a non-confidential basis before the disclosure of such Evaluation Material to you pursuant to this Agreement, or became available to you on a non-confidential basis from a source other than ourselves, the Company, provided in each case that the source of such information was not known by you, after reasonable investigation, to be bound by a confidentiality agreement with or other contractual, legal, or fiduciary obligation of confidentiality to us, to the Company with respect to such material, or (c) was independently developed by you without use of or reference to any Evaluation Material.
2. Notwithstanding the foregoing, you shall not be required to maintain the confidentiality of those portions of the Evaluation Material that you can show (a) became generally available to the public other than as a result of a disclosure by you, (b) was available to you or your Representatives on a non-confidential basis before the disclosure of such Evaluation Material to you or your Representatives pursuant to this Agreement, or became available to you or your Representatives on a non-confidential basis from a source other than ourselves; or the Company, provided in each case that the source of such information was not known by you, after reasonable investigation, or your Representatives to be bound by a confidentiality agreement with or other contractual, legal, or fiduciary obligation of confidentiality to us, or the Company with respect to such material, or (c) was independently developed by you or your Representatives without use of or reference to any Evaluation Material.