Three Prompt Templates: Turning Narratives into Metrics
AAIDD 2026 Annual Meeting — Doug Kerwin, VillageMetrics
This page is take-home material from a poster presentation at the AAIDD 2026 Annual Meeting in Chicago. AAIDD — the American Association on Intellectual and Developmental Disabilities — is the field's primary professional society for clinicians, researchers, and educators. The resources below extend that conference work to anyone applying AI in IDD practice. — Doug Kerwin, Founder, VillageMetrics.
🔒 Use only with de-identified data — or with a BAA-covered tool. Consumer AI tools (ChatGPT, Claude.ai) require genuinely Safe-Harbor-compliant input. See the companion Two-Path De-identification Guide for the workflow, including tool recommendations and the IDD-specific second-pass review.
These three templates demonstrate the core technique behind VillageMetrics: defining a structured rubric for the individual you support, then having AI score every observation against it consistently. The output is quantitative, auditable, and trend-able — not just a one-off summary.
Pick your model first
The prompts below assume a frontier reasoning model. Strong defaults as of mid-2026:
- Anthropic Claude Opus 4.7 (claude.ai or via API) — strongest default for grounded extraction
- OpenAI GPT-5.5 thinking mode (chatgpt.com paid plans) — strong alternative
Avoid for clinical-adjacent work: free-tier models, mini-class models, older Haiku/Sonnet generations, anything labeled "fast" rather than "thinking." For structured extraction tasks like the ones below, the gap between frontier and free-tier is the difference between auditable output and confident-sounding guesswork.
Template 1 — Score Against Custom Behavior Goals
Why this is the primary technique. Generic AI scoring rubrics produce generic results. The leverage comes from defining your own behavior goals for the specific individual you support — then scoring every observation against that rubric. Each score traces back to evidence in the source. This is what makes the output trend-able week over week.
Step 1 — Define 3–5 behavior goals for this individual (do this once).
Each goal should be:
- Specific to this individual — not generic
- Behavioral — observable, not inferred
- Phrased positively — what you want to see, not what you don't
Example goal set for a 9-year-old with autism:
- Follow Directions — responds to first or second request without escalation
- Maintain Safety — stays in safe spaces, no aggression toward self or others
- Use Functional Communication — asks for breaks, help, or what she needs instead of meltdown
- Stay Calm During Transitions — handles unexpected changes without prolonged dysregulation
Step 2 — Paste this prompt with your goals filled in:
You analyze caregiver journal entries to score behavior goal progress
for a child with intellectual or developmental disabilities. Use a
0.0–1.0 scale with named anchors:
0.0 = Not at all (the goal was not met or the child regressed)
0.25 = A little (some movement toward the goal, but inconsistent or
heavily prompted)
0.75 = Mostly (the goal was met more often than not, with reasonable
independence)
1.0 = Completely (the goal was met consistently and independently)
You may use values between the anchors (e.g., 0.5, 0.85, 0.95) when the
observation falls between named levels.
Use null when the entry contains insufficient evidence to score that
goal. Do not guess.
For each scored goal, include an `evidence` array of 1–5 concrete,
near-quoted statements from the entry that support the score. If you
cannot point to specific source text, use null instead of guessing.
Behavior goals to score:
1. Follow Directions — responds to first or second request without escalation
2. Maintain Safety — stays in safe spaces, no aggression
3. Use Functional Communication — asks for breaks/help/what she needs
4. Stay Calm During Transitions — handles changes without prolonged dysregulation
Return a single JSON object in this format:
{
"follow_directions": { "score": <0.0-1.0 or null>, "evidence": [...] },
"maintain_safety": { "score": <0.0-1.0 or null>, "evidence": [...] },
"use_functional_communication": { "score": <0.0-1.0 or null>, "evidence": [...] },
"stay_calm_during_transitions": { "score": <0.0-1.0 or null>, "evidence": [...] },
"overall": <average of non-null scores, two decimals>,
"notes": "<one sentence highlighting anything notable about this entry>"
}
Journal entry:
[PASTE DE-IDENTIFIED ENTRY HERE]
Sample input (de-identified caregiver journal entry):
S. had a really good day today. She was playing with her blocks and stayed focused on building. She built a tower and then said "I'm gonna knock it down, daddy!" She used "one more minute" on her own when I asked her to come to dinner. There was one moment where the dog barked and she covered her ears and went under the table for a few minutes, but she came back out on her own.
Sample output:
{
"follow_directions": {
"score": 0.95,
"evidence": [
"Used 'one more minute' independently when asked to come to dinner"
]
},
"maintain_safety": {
"score": null,
"evidence": null
},
"use_functional_communication": {
"score": 0.95,
"evidence": [
"Used the phrase 'one more minute' to negotiate a transition",
"Narrated her play: 'I'm gonna knock it down, daddy!'"
]
},
"stay_calm_during_transitions": {
"score": 0.70,
"evidence": [
"Dog barking triggered withdrawal under the table",
"Came back out on her own without adult intervention"
]
},
"overall": 0.87,
"notes": "Strong self-recovery after an unexpected sensory event."
}
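Because the prompt pins down a strict contract (bounded scores, evidence-or-null, recomputable overall), you can check each response mechanically before logging it. A minimal sketch in Python — the function and constant names are illustrative, not part of any VillageMetrics API:

```python
import json

# Goal keys mirror the JSON format in the prompt above.
GOAL_KEYS = [
    "follow_directions",
    "maintain_safety",
    "use_functional_communication",
    "stay_calm_during_transitions",
]

def validate_entry_scores(raw: str) -> dict:
    """Parse the model's JSON reply and enforce the rubric's contract."""
    data = json.loads(raw)
    scores = []
    for key in GOAL_KEYS:
        goal = data[key]
        score, evidence = goal["score"], goal["evidence"]
        if score is None:
            # A null score must come with null evidence — no orphan quotes.
            assert evidence is None, f"{key}: evidence attached to a null score"
        else:
            assert 0.0 <= score <= 1.0, f"{key}: score out of bounds"
            assert evidence, f"{key}: scored without supporting evidence"
            scores.append(score)
    # Recompute 'overall' rather than trusting the model's arithmetic.
    expected = round(sum(scores) / len(scores), 2) if scores else None
    assert data["overall"] == expected, "overall does not match recomputation"
    return data
```

Run against the sample output above, this passes: (0.95 + 0.95 + 0.70) / 3 rounds to 0.87, and the unscored "Maintain Safety" goal correctly carries null evidence.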
Why this works:
- A 0.0–1.0 scale with named anchors at 0 / 0.25 / 0.75 / 1.0 is interpretable and consistent across observers; the AI can interpolate between anchors for nuanced cases
- Forcing an `evidence` array prevents invented support and makes each score audit-traceable to source text
- Allowing `null` prevents the model from filling gaps with guesses ("Maintain Safety" wasn't observable in this entry, so it's null rather than a fabricated 1.0)
- JSON output makes the data usable downstream — pivot tables, trend charts, monthly summaries
Common adaptations:
- ABA settings: align goals with the child's BIP or treatment plan goals
- School / IEP: align goals with IEP behavioral objectives — and you can do this from the parent side too, separate from the school's tracking, to triangulate
- SLP: replace behavior goals with communication goals ("Initiates social interaction with peers")
- OT: sensory regulation or functional independence goals
- Direct support staff: community access goals, daily living independence goals
Template 2 — Multi-Dimensional Entry Scoring
Why this is useful. Goal scoring tells you how the individual did. Multi-dimensional scoring tells you how the entry itself should be triaged — was today a routine day or a key moment? Is the caregiver depleted? Was there a crisis worth flagging? This catches things that goal scoring alone misses, especially caregiver wellbeing, which most clinical tools ignore entirely.
Paste this prompt:
You analyze caregiver journal entries about a child with intellectual
or developmental disabilities. Score the entry on the following six
dimensions. Each score should be a float between the bounds shown.
DIMENSIONS:
1. Sentiment: -1.0 (overwhelmingly negative) to +1.0 (overwhelmingly
positive). 0 = neutral or mixed. Reflects the emotional tone of the
entry overall, not individual events.
2. Detail Score: 0.0 to 1.0. How much actionable specificity?
0.0 = vague generalities ("had a tough day")
1.0 = specific behaviors, durations, contexts, antecedents, outcomes
3. Key Moment Score: 0.0 to 1.0. How significant or unusual were the
events?
0.0 = routine, typical day
1.0 = significant, unusual, highly positive or negative event
IMPORTANT: a vague "had a really great day" without specifics
should score LOW here even if the tone is positive.
4. Crisis Intensity: 0.0 to 1.0. Severity of any safety/crisis moments,
independent of the child's overall behavior. Did anything happen
that needed immediate intervention or posed a safety risk?
5. Caregiver Wellbeing: 0.0 to 1.0. The caregiver's apparent emotional
state and resilience, separate from how the child did.
Low = caregiver sounds depleted, overwhelmed, distressed.
High = caregiver sounds steady, supported, resourced.
6. Effective Strategies Score: 0.0 to 1.0. Are specific strategies or
approaches mentioned that positively affected behavior? Can be high
even on a hard day if the caregiver is naming what worked.
Return a single JSON object:
{
"sentiment": <float>,
"detailScore": <float>,
"keyMomentScore": <float>,
"crisisIntensityScore": <float>,
"caregiverWellbeingScore": <float>,
"effectiveStrategiesScore": <float>,
"summary": "<one-sentence summary of the entry>"
}
Journal entry:
[PASTE DE-IDENTIFIED ENTRY HERE]
Sample input (a harder day):
Today was a rough one. S. melted down at the grocery store when they were out of her brand of chicken nuggets. We had to leave the cart and walk out. She kept screaming in the parking lot for about 20 minutes. I forgot to bring her noise-canceling headphones. I was crying by the time we got home. I don't know how I'm going to get through next week — we have two doctor's appointments and a school IEP meeting.
Sample output:
{
"sentiment": -0.65,
"detailScore": 0.80,
"keyMomentScore": 0.55,
"crisisIntensityScore": 0.40,
"caregiverWellbeingScore": 0.20,
"effectiveStrategiesScore": 0.10,
"summary": "A grocery store meltdown triggered by unavailable preferred food; caregiver expresses depletion and anticipatory anxiety about upcoming week."
}
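Once entries carry these six dimensions, triage becomes mechanical. A small sketch — the threshold values here are assumptions to tune with your team, not part of the prompt:

```python
# Illustrative triage rules over a Template 2 result dict.
# Cutoffs are placeholders, not clinically validated values.
def triage_flags(scores: dict,
                 crisis_cut: float = 0.6,
                 wellbeing_cut: float = 0.3,
                 key_moment_cut: float = 0.7) -> list[str]:
    flags = []
    if scores["crisisIntensityScore"] >= crisis_cut:
        flags.append("review: possible safety incident")
    if scores["caregiverWellbeingScore"] <= wellbeing_cut:
        flags.append("check in: caregiver sounds depleted")
    if scores["keyMomentScore"] >= key_moment_cut:
        flags.append("highlight: unusual day worth discussing")
    return flags
```

Applied to the sample output above, only the caregiver check-in flag fires: crisis intensity (0.40) and key-moment (0.55) sit below their cutoffs, but wellbeing (0.20) does not.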
Why the caregiver wellbeing dimension matters. Most clinical tools track only the individual being supported. But the system that produces good outcomes for the child is the caregiver-child dyad — and a depleted caregiver predicts harder days ahead. Scoring caregiver wellbeing across 30 days of entries can surface burnout patterns weeks before they become a crisis, and pairs naturally with team or respite intervention.
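One way to sketch that burnout-surfacing idea: a rolling mean over daily wellbeing scores, flagging sustained lows. Window length and threshold below are assumptions to adjust, not clinical cutoffs:

```python
from statistics import mean

def low_wellbeing_windows(daily_scores: list[float],
                          window: int = 7,
                          threshold: float = 0.35) -> list[tuple[int, float]]:
    """Return (start_index, rolling_mean) for each window whose
    average caregiverWellbeingScore falls below the threshold."""
    hits = []
    for i in range(len(daily_scores) - window + 1):
        avg = mean(daily_scores[i:i + window])
        if avg < threshold:
            hits.append((i, round(avg, 2)))
    return hits
```

A single bad day won't trip the flag; a week of low scores will — which is exactly the "weeks before crisis" signal the dimension is designed to catch.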
Template 3 — Pattern Analysis Across Scored Entries
Why this is the payoff. Templates 1 and 2 produce structured records per entry. Template 3 turns 10–30 of those records into quantitative patterns — the aggregate view that single-entry review can't deliver. This is the take-home that matches the abstract's "pattern identification across multiple observations" promise, but with quantitative grounding rather than qualitative impressions.
Paste this prompt with a batch of already-scored entries:
You are analyzing a batch of scored journal entries about a child
with intellectual or developmental disabilities. Each entry has
behavior goal scores and a date attached. Identify quantitative
trends and patterns across the batch.
For each behavior goal:
- Calculate the average score (excluding nulls)
- Compare the first half of the batch to the second half
- Identify the trend direction (improving, stable, declining)
- Flag any individual entries that scored notably above or below the
average — quote the date and the entry's notable detail
Also identify:
- Days of the week or times that correlate with higher or lower scores
- Activities, environments, or people mentioned in higher- vs.
lower-scoring entries
- Goals with consistently null evidence (suggests the goal isn't
showing up in observations OR the goal needs redefinition)
Return a structured analysis:
## Trends by Goal
[For each goal: average, trend direction, first-half vs second-half average, brief commentary]
## Correlations
[Any clear context-to-score correlations]
## Anomalies
[Specific dates that stood out, with quoted detail]
## Goals to Revisit
[Goals with consistently null evidence]
DO NOT speculate beyond what the data supports. If a pattern has only
one or two supporting entries, mark it "weak signal — not yet a
trend." If you cannot find a pattern, say so.
Scored entries:
[PASTE SCORED ENTRIES HERE — one per day, with date stamps]
Sample input (5 days of scored entries):
Mon 11/03: { follow_directions: 1.0, maintain_safety: 0.75, communication: null,
stay_calm: 0.75, sentiment: 0.55,
notes: "Quiet morning, school went well, family dinner at home" }
Tue 11/04: { follow_directions: 0.75, maintain_safety: 1.0, communication: 1.0,
stay_calm: 1.0, sentiment: 0.65,
notes: "Therapy day, swim class after school" }
Wed 11/05: { follow_directions: 0.50, maintain_safety: 0.50, communication: null,
stay_calm: 0.25, sentiment: -0.40,
notes: "Substitute teacher, missed swim because of sub. Tough afternoon." }
Thu 11/06: { follow_directions: 1.0, maintain_safety: 1.0, communication: 1.0,
stay_calm: 1.0, sentiment: 0.75,
notes: "Therapy day, regular routine" }
Fri 11/07: { follow_directions: 0.75, maintain_safety: 0.75, communication: 0.75,
stay_calm: 0.75, sentiment: 0.45,
notes: "Long week, end-of-week tiredness showing" }
Sample output:
## Trends by Goal
- Follow Directions: average 0.80. Bimodal — strong on therapy days
(Tue/Thu both 0.75–1.0), dipped Wednesday (0.50) with the schedule
disruption.
- Maintain Safety: average 0.80. Same bimodal pattern.
- Communication: average 1.0 (n=2). Insufficient data — only scored
on Tue and Thu. Likely a goal-evidence-coverage issue, not a skill
issue. See Goals to Revisit.
- Stay Calm During Transitions: average 0.75. Wednesday's score of 0.25
pulls the average down sharply; the other four days average 0.875.
## Correlations
- Therapy days (Tue, Thu) show the highest scores across all goals.
- Schedule disruption (Wed: substitute teacher + missed swim) correlates
with scores 0.25–0.50 lower across every goal scored that day.
- This is a 5-day batch — pattern is plausible but needs more data
before treating as causal.
## Anomalies
- Wed 11/05: stay_calm=0.25 (vs week average 0.75). Quoted detail:
"Substitute teacher, missed swim because of sub." Schedule
disruption is the most likely candidate.
## Goals to Revisit
- Communication scored only Tue and Thu (null on Mon, Wed, Fri).
Either restructure the goal definition to capture more contexts,
or accept that this goal requires structured therapy time to
observe and only score it on therapy days.
Weak signal — not yet a trend: 5 days is too short for stable
conclusions. Re-run with 30+ days for confident trends.
Why this matters. Single-entry impressions are unreliable. But after 30+ scored entries, real patterns emerge — and they're patterns the caregiver or clinician can act on. "Schedule disruption days reliably score 0.25–0.50 lower across every goal" is actionable in a way that "Wednesday was rough" is not.
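Because Templates 1 and 2 emit JSON, you can also cross-check the model's Template 3 arithmetic yourself. A minimal sketch of the per-goal average and first-half vs. second-half comparison — the entry shape mirrors the sample input above, and the `eps` band that defines "stable" is an assumption:

```python
from statistics import mean

def goal_trend(entries: list[dict], goal: str, eps: float = 0.05) -> dict:
    """Average a goal's scores (nulls excluded) and compare the
    first half of the batch to the second half."""
    vals = [e[goal] for e in entries if e.get(goal) is not None]
    if not vals:
        return {"average": None, "trend": "no data"}
    half = len(vals) // 2
    first = mean(vals[:half] or vals)   # guard against a one-entry batch
    second = mean(vals[half:])
    if second - first > eps:
        trend = "improving"
    elif first - second > eps:
        trend = "declining"
    else:
        trend = "stable"
    return {"average": round(mean(vals), 2),
            "first_half": round(first, 2),
            "second_half": round(second, 2),
            "trend": trend}
```

On the five-day sample, Follow Directions averages 0.80 with a first half of 0.88 and a second half of 0.75 — a useful sanity check against whatever the model reports.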
Why these three templates work together
The three templates form a small data pipeline you can run by hand:
[ Narrative entry ] → Template 1 → [ Goal scores + evidence ]
↓
Template 2 → [ Multi-dim scores ]
↓
(run for 30+ entries)
↓
Template 3 → [ Patterns ]
The output of Template 1 feeds Template 3. The output of Template 2 adds caregiver and crisis dimensions that goal scoring alone misses. With a frontier model and well-defined goals, an individual practitioner can produce structured, trend-able behavioral data from narrative observations on their own — without custom infrastructure.
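The pipeline above is plain data plumbing once the model call is abstracted away. A sketch — `call_model` is a placeholder for whichever chat interface or BAA-covered API you use, not a real client:

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to your chosen frontier model
    and return its text reply."""
    raise NotImplementedError("wire this to your BAA-covered tool or API")

def score_day(template1: str, template2: str, entry: str) -> dict:
    """Run one de-identified entry through Templates 1 and 2,
    returning a single flat record for that day."""
    goal_scores = json.loads(call_model(f"{template1}\n\nJournal entry:\n{entry}"))
    dim_scores = json.loads(call_model(f"{template2}\n\nJournal entry:\n{entry}"))
    return {**goal_scores, **dim_scores}

# Accumulate 30+ of these records, then paste them (with dates)
# into Template 3 for the pattern analysis.
```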
This is what the VillageMetrics platform automates at scale — including continuous data flow from a child's whole "village" of caregivers (parents, ABA therapists, teachers, babysitters, family) so the structured record reflects more than one observer's view. These templates show the technique by hand so you can adapt it to your own work.
A note on qualitative outputs
The same prompt-engineering principles work when you want narrative output instead of scores. For example, a prompt that asks for BCBA-level behavioral analysis of an entry — antecedent, behavior, consequence, environmental factors — produces a structured paragraph rather than a JSON object, but uses the same five-component skeleton: role, evidence requirement, output schema, do-not-invent rule, plus a clinical framework anchor (ABC, in the BCBA case). The scoring templates above are the highest-leverage starting point because they produce trend-able data; qualitative analysis is a natural next step once you've got those working.
— Doug Kerwin · doug@villagemetrics.com · villagemetrics.com
This material is educational and not clinical guidance. AI tools should supplement, not replace, professional clinical judgment. © 2026 VillageMetrics.