Output Evaluation Framework
AAIDD 2026 Annual Meeting — Doug Kerwin, VillageMetrics
This page is take-home material from a poster presentation at the AAIDD 2026 Annual Meeting in Chicago. AAIDD — the American Association on Intellectual and Developmental Disabilities — is the field's primary professional society for clinicians, researchers, and educators. The resources below extend that conference work to anyone applying AI in IDD practice. — Doug Kerwin, Founder, VillageMetrics.
Most concerns about AI-generated clinical content come from one of three sources: the model wasn't capable enough for the task, the prompt didn't constrain the output, or the reviewer didn't sanity-check what came back. This framework addresses all three in order.
Step 1 — Use a frontier model
Most "AI got it wrong" stories trace back to using a free-tier or fast-tier model on a task that needed reasoning. For any clinical-adjacent extraction or analysis task, use one of these:
- Anthropic Claude Opus 4.7 — strongest default for grounded extraction; available on claude.ai paid plans or via API
- OpenAI GPT-5.5 thinking mode — strong alternative; available on ChatGPT paid plans
Avoid: anything labeled "fast," "mini," "flash," or "free tier"; older Sonnet/Haiku/GPT-4o-mini-class models. The capability gap on extraction tasks is large enough to matter clinically.
Step 2 — Use a structured prompt
A frontier model with a sloppy prompt still produces sloppy output. The structural elements that matter most:
- An evidence requirement. Force the model to ground every claim in source text ("include an `evidence` array of near-quoted statements"). This single instruction eliminates the majority of fabrication risk.
- A null option. Tell the model to return `null` when source evidence is insufficient, rather than guess. This prevents confident-sounding gap-filling.
- A schema. A JSON schema with explicit field types and ranges constrains output far more than free-form prose.
- A do-not-invent rule. Stated bluntly: "Do not invent details. If the source is silent, the output must be silent."
The three templates in Handout 1 are built on these. If you're writing your own, start there.
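The four elements above can be sketched as a single prompt-builder. This is a minimal illustration, not one of the Handout 1 templates: the schema field names (`findings`, `claim`, `severity`, `evidence`) and the function name are placeholders you would adapt to your own task.

```python
import json

# Illustrative schema fragment (field names are examples, not from Handout 1).
# Every finding must carry an "evidence" array; fields may be null when the
# source is silent.
SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "claim": {"type": ["string", "null"]},
                    "severity": {"type": ["integer", "null"],
                                 "minimum": 1, "maximum": 5},
                    "evidence": {"type": "array",
                                 "items": {"type": "string"}},
                },
                "required": ["claim", "evidence"],
            },
        }
    },
    "required": ["findings"],
}

def build_prompt(source_text: str) -> str:
    """Assemble a prompt containing all four structural elements from Step 2."""
    return (
        "Extract findings from the source below.\n"
        "Rules:\n"
        "1. Ground every claim in the source: include an 'evidence' array "
        "of near-quoted statements.\n"
        "2. Return null for any field the source does not support. "
        "Do not guess.\n"
        "3. Respond ONLY with JSON matching this schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n"
        "4. Do not invent details. If the source is silent, the output "
        "must be silent.\n\n"
        f"SOURCE:\n{source_text}"
    )
```

The value of putting the schema inside the prompt, rather than describing it in prose, is that the model has an unambiguous target to conform to and a reviewer has an unambiguous target to validate against.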
Step 3 — Five checks worth running on the output
These are quick — none should take more than a minute on a well-constrained output. They become more important when prompts are looser, when the model is older or weaker, or when stakes are high.
1. Source Grounding
Pick three claims from the output at random. Find the source text that supports each. If you can't, the output has a grounding problem — check whether the prompt forced evidence citation. With frontier models + evidence-array prompts, this rarely fails. With weaker models or unstructured prompts, it fails often.
✅ Pass · ⚠️ Some claims unsupported · ❌ Frequently unsupported (discard)
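The spot-check can be partly automated when the output includes an evidence array. A sketch, with one stated simplification: it matches normalized evidence strings as exact substrings of the source, but "near-quoted" evidence may paraphrase slightly and still be grounded, so treat a non-empty result as a cue for manual review rather than a verdict.

```python
import random
import re

def _normalize(text: str) -> str:
    """Lowercase, collapse whitespace, drop punctuation for forgiving matching."""
    return re.sub(r"[^a-z0-9 ]+", "", re.sub(r"\s+", " ", text.lower())).strip()

def spot_check_grounding(output: dict, source_text: str, k: int = 3, seed=None):
    """Sample up to k findings and verify each evidence string appears in the source.

    Returns a list of (claim, evidence) pairs that could not be located;
    an empty list means the sampled claims passed the check.
    """
    rng = random.Random(seed)
    findings = output.get("findings", [])
    sample = rng.sample(findings, min(k, len(findings)))
    src = _normalize(source_text)
    failures = []
    for finding in sample:
        for quote in finding.get("evidence", []):
            if _normalize(quote) not in src:
                failures.append((finding.get("claim"), quote))
    return failures
```

A reviewer still reads the matched passages in context; the script only answers "does this quote exist in the source at all," which is the fastest way to catch outright fabrication.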
2. Clinical Validity
Does the output align with current evidence-based practice in your field? Watch for outdated terminology, discredited interventions, or recommendations that contradict current standards. The frontier models are trained on broad clinical literature, but your discipline-specific knowledge still trumps theirs.
✅ Pass · ⚠️ Some validity issues to fix · ❌ Material conflicts with established practice
3. Bias / Over-pathologizing
Does the output describe the individual more negatively than the source supports? Frontier models trained on clinical literature sometimes interpret typical neurodivergent behaviors as deficits, frame strengths as compensations, or escalate severity beyond the source. Read the output through a strengths-based lens — does it feel proportionate?
✅ Pass · ⚠️ Some over-pathologizing · ❌ Substantially deficit-framed
4. Specificity vs. Generic
Could this output apply to "any child with autism" rather than to this individual? AI defaults to averages. A response that strips out the individual-specific detail has lost the value of the source. Mentally swap a different but similar narrative in as the input: would the output change meaningfully? If not, the prompt needs sharper "respond to this specific scenario" framing.
✅ Pass · ⚠️ Partly generic · ❌ Effectively generic
5. Actionability
Can you do something useful with this output? Vague recommendations ("continue to monitor," "a multidisciplinary approach is recommended") are filler. Strong outputs surface specific patterns, name specific candidate interventions, or flag specific gaps in the source.
✅ Pass · ⚠️ Partly actionable · ❌ Vague filler
Quick Decision Rubric
After running the five checks:
| Result | Action |
|---|---|
| All ✅ | Use the output, with clinical judgment |
| Mostly ✅ with one or two ⚠️ | Use the output, fix or annotate flagged items |
| Multiple ⚠️ or any ❌ | Discard. Re-prompt with sharper instructions, or fall back to manual analysis |
| ❌ on Source Grounding | Always discard — grounding failure contaminates everything else |
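The rubric rows above can be expressed as a small decision function. Check names and return strings are illustrative; one judgment call is made explicit in a comment, since the table itself says "one or two ⚠️" is usable but "multiple ⚠️" is not.

```python
def rubric_decision(results: dict) -> str:
    """Map the five check results ('pass', 'warn', 'fail') to a rubric action.

    The special-casing of source grounding mirrors the last rubric row:
    a grounding failure always discards, regardless of other checks.
    """
    if results.get("source_grounding") == "fail":
        return "discard: grounding failure contaminates everything else"
    values = list(results.values())
    # Interpreting "one or two warnings" as usable and "multiple" as three+.
    if values.count("fail") > 0 or values.count("warn") > 2:
        return "discard: re-prompt with sharper instructions or analyze manually"
    if values.count("warn") > 0:
        return "use: fix or annotate flagged items"
    return "use: with clinical judgment"
```

The point is not to automate the judgment, only to make the precedence explicit: grounding failures short-circuit everything else, exactly as the table states.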
One Last Principle
This framework evaluates AI output. It does not transfer responsibility from the clinician to the tool. If you put your name on the documentation, you own its accuracy — including the parts the AI helped you write.
— Doug Kerwin · doug@villagemetrics.com · villagemetrics.com
This material is educational and not clinical guidance. AI tools should supplement, not replace, professional clinical judgment. © 2026 VillageMetrics.