Output Evaluation Framework
AAIDD 2026 Annual Meeting — Doug Kerwin, VillageMetrics
This page is take-home material from a poster presentation at the AAIDD 2026 Annual Meeting in Chicago. AAIDD — the American Association on Intellectual and Developmental Disabilities — is the field's primary professional society for clinicians, researchers, and educators. The resources below extend that conference work to anyone applying AI in IDD practice. — Doug Kerwin, Founder, VillageMetrics.
Most concerns about AI-generated clinical content come from one of three sources: the model wasn't capable enough for the task, the prompt didn't constrain the output, or the reviewer didn't sanity-check what came back. This framework addresses all three in order.
Step 1 — Use a frontier model
Most "AI got it wrong" stories trace back to using a free-tier or fast-tier model on a task that needed reasoning. For any clinical-adjacent extraction or analysis task, use one of these:
- Anthropic Claude Opus 4.7 — strongest default for grounded extraction; available on claude.ai paid plans or via API
- OpenAI GPT-5.5 thinking mode — strong alternative; available on ChatGPT paid plans
Avoid: anything labeled "fast," "mini," "flash," or "free tier"; older Sonnet/Haiku/GPT-4o-mini-class models. The capability gap on extraction tasks is large enough to matter clinically.
Step 2 — Use a structured prompt
A frontier model with a sloppy prompt still produces sloppy output. The structural elements that matter most:
- An evidence requirement. Force the model to ground every claim in source text ("include an `evidence` array of near-quoted statements"). This single instruction eliminates the majority of fabrication risk.
- A null option. Tell the model to return `null` when source evidence is insufficient, rather than guess. This prevents confident-sounding gap-filling.
- A schema. A JSON schema with explicit field types and ranges constrains output far more than free-form prose.
- A do-not-invent rule. Stated bluntly: "Do not invent details. If the source is silent, the output must be silent."
The three templates in Handout 1 are built on these. If you're writing your own, start there.
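The four elements above can be sketched as a single prompt-builder. This is a minimal illustration, not one of the Handout 1 templates: the schema field names (`findings`, `claim`, `severity`, `evidence`) and the function name are placeholders you would adapt to your own task.

```python
import json

# Illustrative schema fragment (field names are examples, not from Handout 1).
# Every finding must carry an "evidence" array; fields may be null when the
# source is silent.
SCHEMA = {
    "type": "object",
    "properties": {
        "findings": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "claim": {"type": ["string", "null"]},
                    "severity": {"type": ["integer", "null"],
                                 "minimum": 1, "maximum": 5},
                    "evidence": {"type": "array",
                                 "items": {"type": "string"}},
                },
                "required": ["claim", "evidence"],
            },
        }
    },
    "required": ["findings"],
}

def build_prompt(source_text: str) -> str:
    """Assemble a prompt containing all four structural elements from Step 2."""
    return (
        "Extract findings from the source below.\n"
        "Rules:\n"
        "1. Ground every claim in the source: include an 'evidence' array "
        "of near-quoted statements.\n"
        "2. Return null for any field the source does not support. "
        "Do not guess.\n"
        "3. Respond ONLY with JSON matching this schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n"
        "4. Do not invent details. If the source is silent, the output "
        "must be silent.\n\n"
        f"SOURCE:\n{source_text}"
    )
```

The value of putting the schema inside the prompt, rather than describing it in prose, is that the model has an unambiguous target to conform to and a reviewer has an unambiguous target to validate against.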
Step 3 — Five checks worth running on the output
These are quick — none should take more than a minute on a well-constrained output. They become more important when prompts are looser, when the model is older or weaker, or when stakes are high.
1. Source Grounding
Pick three claims from the output at random. Find the source text that supports each. If you can't, the output has a grounding problem — check whether the prompt forced evidence citation. With frontier models + evidence-array prompts, this rarely fails. With weaker models or unstructured prompts, it fails often.
✅ Pass · ⚠️ Some claims unsupported · ❌ Frequently unsupported (discard)
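The spot-check can be partly automated when the output includes an evidence array. A sketch, with one stated simplification: it matches normalized evidence strings as exact substrings of the source, but "near-quoted" evidence may paraphrase slightly and still be grounded, so treat a non-empty result as a cue for manual review rather than a verdict.

```python
import random
import re

def _normalize(text: str) -> str:
    """Lowercase, collapse whitespace, drop punctuation for forgiving matching."""
    return re.sub(r"[^a-z0-9 ]+", "", re.sub(r"\s+", " ", text.lower())).strip()

def spot_check_grounding(output: dict, source_text: str, k: int = 3, seed=None):
    """Sample up to k findings and verify each evidence string appears in the source.

    Returns a list of (claim, evidence) pairs that could not be located;
    an empty list means the sampled claims passed the check.
    """
    rng = random.Random(seed)
    findings = output.get("findings", [])
    sample = rng.sample(findings, min(k, len(findings)))
    src = _normalize(source_text)
    failures = []
    for finding in sample:
        for quote in finding.get("evidence", []):
            if _normalize(quote) not in src:
                failures.append((finding.get("claim"), quote))
    return failures
```

A reviewer still reads the matched passages in context; the script only answers "does this quote exist in the source at all," which is the fastest way to catch outright fabrication.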
2. Clinical Validity
Does the output align with current evidence-based practice in your field? Watch for outdated terminology, discredited interventions, or recommendations that contradict current standards. The frontier models are trained on broad clinical literature, but your discipline-specific knowledge still trumps theirs.
✅ Pass · ⚠️ Some validity issues to fix · ❌ Material conflicts with established practice
3. Bias / Over-pathologizing
Does the output describe the individual more negatively than the source supports? Frontier models trained on clinical literature sometimes interpret typical neurodivergent behaviors as deficits, frame strengths as compensations, or escalate severity beyond the source. Read the output through a strengths-based lens — does it feel proportionate?
✅ Pass · ⚠️ Some over-pathologizing · ❌ Substantially deficit-framed
4. Specificity vs. Generic
Could this output apply to "any child with autism" rather than to this individual? AI defaults to averages. A response that strips out the individual-specific detail has lost the value of the source. Mentally swap a different but similar narrative in as the input: would the output change meaningfully? If not, the prompt needs sharper "respond to this specific scenario" framing.
✅ Pass · ⚠️ Partly generic · ❌ Effectively generic
5. Actionability
Can you do something useful with this output? Vague recommendations ("continue to monitor," "a multidisciplinary approach is recommended") are filler. Strong outputs surface specific patterns, name specific candidate interventions, or flag specific gaps in the source.
✅ Pass · ⚠️ Partly actionable · ❌ Vague filler
Quick Decision Rubric
After running the five checks:
| Result | Action |
|---|---|
| All ✅ | Use the output, with clinical judgment |
| Mostly ✅ with one or two ⚠️ | Use the output, fix or annotate flagged items |
| Multiple ⚠️ or any ❌ | Discard. Re-prompt with sharper instructions, or fall back to manual analysis |
| ❌ on Source Grounding | Always discard — grounding failure contaminates everything else |
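The rubric rows above can be expressed as a small decision function. Check names and return strings are illustrative; one judgment call is made explicit in a comment, since the table itself says "one or two ⚠️" is usable but "multiple ⚠️" is not.

```python
def rubric_decision(results: dict) -> str:
    """Map the five check results ('pass', 'warn', 'fail') to a rubric action.

    The special-casing of source grounding mirrors the last rubric row:
    a grounding failure always discards, regardless of other checks.
    """
    if results.get("source_grounding") == "fail":
        return "discard: grounding failure contaminates everything else"
    values = list(results.values())
    # Interpreting "one or two warnings" as usable and "multiple" as three+.
    if values.count("fail") > 0 or values.count("warn") > 2:
        return "discard: re-prompt with sharper instructions or analyze manually"
    if values.count("warn") > 0:
        return "use: fix or annotate flagged items"
    return "use: with clinical judgment"
```

The point is not to automate the judgment, only to make the precedence explicit: grounding failures short-circuit everything else, exactly as the table states.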
One Last Principle
This framework evaluates AI output. It does not transfer responsibility from the clinician to the tool. If you put your name on the documentation, you own its accuracy — including the parts the AI helped you write.
— Doug Kerwin · doug@villagemetrics.com · villagemetrics.com
This material is educational and not clinical guidance. AI tools should supplement, not replace, professional clinical judgment. © 2026 VillageMetrics.