Blog

March 17, 2026

Jim Wagner

We’re Building Our Own AI Report Card. Here’s Why We’re Doing It and Why We Want You on the Journey.

Clinical research moves carefully. That’s not an accident; it’s a feature. When the stakes are human health and scientific integrity, you don’t cut corners on quality control. You build systems. You test. You verify. And then you verify again.

So why should AI be any different?

At The Contract Network, we build AI systems that help research sites, sponsors, and CROs negotiate contracts, analyze budgets, and digitize protocols. These are mission-critical workflows where a missed clause or a misread budget line has real consequences for institutions, for studies, and ultimately for patients.

That reality forced a question we couldn’t duck: how do we know our AI is actually good at this?

Not “good” in the abstract sense of scoring well on some leaderboard. Good at this — at contract redlining for clinical trial agreements, at catching misalignments between a protocol and a budget, at extracting the right data points from a dense sponsor document.

Generic benchmarks don’t answer that question. A model can ace a math reasoning test and still fumble a budget mapping task. So, we’re building our own benchmark. We call it CTAdminVals: Clinical Trial Administration Evaluations.

What It Is

CTAdminVals is an evaluation framework purpose-built for clinical trial administrative tasks. It tests AI systems against the specific tasks we and our customers actually care about: research agreement redline accuracy, protocol data extraction, budget-to-protocol mapping, informed consent form congruence, and conversational quality.

[Slideshow: Clinical Trial Chat Quality screenshots]
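For readers who like to see the shape of a thing, here is a rough sketch of what a task-specific evaluation can look like in code. Everything in it (the EvalCase structure, run_model, score) is hypothetical and simplified, not our production harness; it only illustrates the core idea of running a model over cases that each carry a gold standard and a task-specific scorer.

    # Illustrative sketch only: hypothetical names, not the CTAdminVals codebase.
    from dataclasses import dataclass

    @dataclass
    class EvalCase:
        task: str            # e.g. "redline_accuracy" or "budget_protocol_mapping"
        document: str        # the input contract, protocol, or budget text
        gold_standard: dict  # what a careful human reviewer would produce

    def evaluate(cases: list[EvalCase], run_model, score) -> float:
        """Run a candidate model over every case and average the task score."""
        scores = []
        for case in cases:
            prediction = run_model(case.task, case.document)
            scores.append(score(prediction, case.gold_standard))
        return sum(scores) / len(scores)

The point of the structure is that each task gets its own cases and its own scorer, so a new model can be dropped in and measured on exactly the work it will be asked to do.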

Why We’re Building It

The AI landscape moves fast. New models ship every few weeks, if not weekly. Each one claims to be better. Most are, at some things, but “better at coding” and “better at clinical research contracting” are not the same claim. Without a systematic way to measure performance on our specific tasks, we’d be flying blind: relying on informal testing, subjective judgment, or, worse, waiting for client feedback after deployment.

That’s not good enough. Not for this industry.

CTAdminVals is also a commitment to responsible AI deployment. We’re not here to push the newest model into production because the benchmark PDF looks impressive. We’re here to make sure the AI we deploy has been tested and validated against the work it’s actually going to do. This is what responsible AI looks like in practice: not a policy document, but a testing regime.

We Want You on This Journey

CTAdminVals is still evolving. We’re refining scoring methods, expanding our gold standard libraries, and working through hard questions about how to handle client data in evaluations responsibly. We don’t have it all figured out, and we’re not pretending otherwise.
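To make “gold standard” concrete, here is one hypothetical scoring sketch: compare a model’s extracted protocol fields against a human-verified answer key and report the fraction that match exactly. Our actual scoring methods are more involved and still being refined; the function name and example values below are invented purely for illustration.

    # Hypothetical scoring sketch: field-level agreement between a model's
    # protocol extraction and a human-verified gold-standard answer key.
    def field_match_score(predicted: dict, gold: dict) -> float:
        """Fraction of gold-standard fields the model reproduced exactly."""
        if not gold:
            return 0.0
        matched = sum(1 for field, value in gold.items()
                      if predicted.get(field) == value)
        return matched / len(gold)

    # Example: two of three fields match, so the score is roughly 0.67.
    print(field_match_score(
        {"phase": "II", "visits": 12, "sponsor": "Acme"},
        {"phase": "II", "visits": 12, "sponsor": "Acme Pharma"},
    ))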

What we do believe is that research institutions deserve transparency about how AI systems get built and validated.

So we’re inviting you in. If you’re a research site or sponsor interested in how we evaluate AI for your specific workflows, let’s talk. If you’re a fellow technologist working on evaluation in regulated industries, we’d love to compare notes.

Faster contracts. Faster cures. But only if the AI is actually good at its job.