Background and Rationale
Problem Statement
Consumer medical AI systems (symptom checkers, diagnostic assistants) are increasingly used by millions of people to assess health concerns. These systems influence:
- Whether users seek emergency care
- How urgently users pursue treatment
- What conditions users consider
- How seriously users take their symptoms
Prior research has demonstrated that AI systems exhibit bias based on demographic signals, including names. In educational contexts, identical essays receive different grades when submitted under different names (Bertrand & Mullainathan, 2004). In hiring, resumes with names signaling minority status receive fewer callbacks.
Critical Question
Do medical AI systems exhibit similar name-based bias that could affect health outcomes?
Potential Impact
If medical AI systems discriminate based on names:
- Minority patients may receive lower urgency ratings
- Pain may be undertreated for certain groups
- Serious conditions may be missed or deprioritized
- Healthcare disparities may be amplified at algorithmic scale
Objectives
Primary Objective
Determine whether consumer medical AI systems produce systematically different outputs (diagnoses, urgency ratings, recommended actions) based on patient names associated with different demographic groups.
Secondary Objectives
- Quantify the effect size of any observed bias
- Identify which types of name contrasts produce the largest effects
- Compare bias levels across different AI systems
- Identify which symptom categories show the most bias
Study Design
Design Type
Matched-pair experimental design with repeated measures
Design Description
For each symptom profile:
- Submit identical symptoms to the same AI system
- Vary only the patient name between submissions
- Record all outputs (diagnoses, urgency, recommendations)
- Compare outputs across name pairs
Sample Size
Variables
Independent Variable: Patient Name
Varied to signal different demographic characteristics:
| Contrast Category | Example Pair |
|---|---|
| Anglo vs. African-American associated | Emily Johnson vs. Lakisha Williams |
| Anglo vs. Hispanic/Latino associated | Michael Smith vs. José Rodriguez |
| Anglo vs. Asian associated | Sarah Miller vs. Wei Chen |
| Male vs. Female | James vs. Jennifer (within ethnicity) |
| High-SES vs. Low-SES signals | Thurston vs. Cletus |
| Professional title vs. none | Dr. Smith vs. Smith |
Dependent Variables
Primary Outcomes
- Urgency Rating: Emergency / Urgent / Non-urgent / Self-care
- Top Diagnosis: First suggested condition
- Diagnosis List: All suggested conditions
Secondary Outcomes
- Recommended action
- Specialist referral type
- Urgency language (qualitative)
Statistical Analysis Plan
Primary Analysis: Effect Size
Cohen's d for continuous/ordinal outcomes:
d = (M₁ - M₂) / SD_pooled
| Effect Size | Interpretation |
|---|---|
| d < 0.2 | Negligible |
| d = 0.2 – 0.5 | Small (concerning) |
| d = 0.5 – 0.8 | Medium (actionable) |
| d > 0.8 | Large (severe) |
Multiple Comparison Correction
Bonferroni correction for primary analyses:
- 6 name contrast types × 5 systems = 30 comparisons
- Adjusted α = 0.05 / 30 = 0.00167
Literature Basis
This research follows established methodologies from:
Obermeyer et al. (2019)
Science
Demonstrated racial bias in healthcare algorithm affecting millions of patients
Hoffman et al. (2016)
PNAS
Documented racial bias in pain assessment among medical professionals
Schulman et al. (1999)
NEJM
Found cardiac referral disparities based on race and gender
Bertrand & Mullainathan (2004)
American Economic Review
Established name-based discrimination methodology in labor markets
Continue Learning
Explore our methodology in detail or see the research in action.