RLEF-DataGen

Harnessing Expert Feedback to Generate Representative Synthetic Clinical Data for Underserved Populations

Sean Sylvia, PhD

DHEPlab, UNC Gillings School of Global Public Health

June 27, 2025

The Data Representation Gap

The Problem

Electronic Health Record (EHR) Adoption Disparities:

  • High-income countries (HICs): 78% of physicians use EHRs (US, 2021)
  • Low- and middle-income countries (LMICs): <15% adoption rates
  • Only 27 of 194 WHO member states have national EHR systems

AI Training Data Bias:

  • >90% of clinical AI datasets from HICs
  • <3% representation from African populations
  • 74% of AI studies use data from US/China only

“Clinical AI models risk perpetuating healthcare inequities precisely where they could have the most impact”

Synthetic Data as a Potential Solution

The Growing Use of Synthetic Data

What is Synthetic Clinical Data?

  • Artificially generated patient records
  • Preserves statistical properties
  • Protects privacy while enabling research

Market for Synthetic Data:

  • 2023: $351M → 2030: $2.3B
  • 31.1% CAGR (2023-2030)
  • 75% of businesses will generate synthetic customer data by 2026, up from less than 5% in 2023 (Gartner)
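As a quick sanity check, the two market figures and the growth rate quoted above can be tested for mutual consistency. The sketch below only uses the numbers from this slide and the standard compound annual growth rate formula; the variable names are illustrative.

```python
# Check that the market figures quoted above are mutually consistent.
start_value_musd = 351.0    # 2023 market size, millions of USD (from the slide)
end_value_musd = 2300.0     # 2030 market size, millions of USD (from the slide)
years = 2030 - 2023         # 7 compounding periods

# CAGR implied by the two endpoints: (end / start) ** (1 / years) - 1
implied_cagr = (end_value_musd / start_value_musd) ** (1 / years) - 1

# 2030 value implied by compounding the quoted 31.1% rate from the 2023 base
quoted_cagr = 0.311
projected_2030_musd = start_value_musd * (1 + quoted_cagr) ** years

print(f"Implied CAGR from endpoints: {implied_cagr:.1%}")
print(f"2030 value at 31.1% CAGR: ${projected_2030_musd / 1000:.2f}B")
```

The implied rate lands within a fraction of a percentage point of the quoted 31.1%, which is what one would expect if the $2.3B figure is a rounded value.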

Key Challenges with Synthetic Data

  • Validation Paradox: How to verify accuracy without real data?
  • Rare Event Capture: May miss critical edge cases & comorbidities
  • Bias Amplification: Risk of encoding generation assumptions as “truth”
  • May not preserve causal relationships

Traditionally, generating synthetic data requires a real dataset to train the generative model (e.g., a Generative Adversarial Network, or GAN).

Current Approaches & Limitations

  • GANs: Require substantial training data (doesn’t solve our problem!)
  • LLMs: Can generate from limited examples (but quality?)

The Fundamental Challenge

How do we generate representative data when none exists for these populations?

Hypothesis: We can use local expert knowledge (from clinicians and other types of providers) to guide generation and validate quality.

Our Approach: Reinforcement Learning with Expert Feedback

Note:

RLHF (Reinforcement Learning from Human Feedback) and RLEF are not the same thing. RLHF learns what humans prefer; RLEF learns to do what experts do. RLHF handles subjective alignment, while RLEF handles competence transfer.

How RLEF Addresses Key Challenges

Traditional Limitations:

  • ❌ Need existing data
  • ❌ Miss rare events
  • ❌ Can’t validate without ground truth

Potential RLEF Solutions:

  • ✓ Bootstrap from expert knowledge
  • ✓ Experts flag missing conditions
  • ✓ Continuous validation through feedback
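To make the feedback loop concrete, here is a toy simulation. It is not the RLEF method itself: it swaps in a simple Thompson-sampling bandit as a stand-in, the three generator configurations are hypothetical placeholders, and the "expert" is simulated with made-up hidden realism rates. The only point is to show how binary expert judgments could act as a reward signal that steers generation with no real dataset.

```python
import random

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical generator configurations (e.g., prompt templates); names are placeholders.
# The hidden rates are invented and stand in for how often a real expert
# would judge each configuration's output to be realistic.
hidden_realism_rate = {"template_a": 0.3, "template_b": 0.5, "template_c": 0.8}

# Beta(1, 1) prior over each configuration's realism rate, stored as [alpha, beta].
posterior = {name: [1.0, 1.0] for name in hidden_realism_rate}
times_chosen = {name: 0 for name in hidden_realism_rate}


def simulated_expert_says_realistic(config_name):
    """Stand-in for a clinician's binary 'Real or Synthetic?' judgment."""
    return rng.random() < hidden_realism_rate[config_name]


for _ in range(500):
    # Thompson sampling: draw a plausible rate per configuration, use the best draw.
    draws = {name: rng.betavariate(a, b) for name, (a, b) in posterior.items()}
    chosen = max(draws, key=draws.get)
    times_chosen[chosen] += 1

    # Expert feedback is the reward that updates beliefs about the generator.
    if simulated_expert_says_realistic(chosen):
        posterior[chosen][0] += 1.0
    else:
        posterior[chosen][1] += 1.0

for name, (a, b) in posterior.items():
    print(f"{name}: chosen {times_chosen[name]} times, estimated realism {a / (a + b):.2f}")
```

In a real system the simulated expert would be replaced by clinician responses, and the "configurations" would be whatever the generator can adjust; both are assumptions here.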

Why Not Train AI Models Directly?

1. Cost-effectiveness: One dataset → Many applications

2. Reusability: Create once, use repeatedly

3. Flexibility: Application-agnostic resource

The Economics of Expert Feedback

Optimization Framework

\[\text{Maximize: } Q(n,e) = f(n \times I(e) \times E(g,s,a)) - CL(e)\]

\[\text{Subject to: } n \times c(e) \leq B\]

Where:

  • \(Q\) = Quality of synthetic data
  • \(n\) = Number of expert evaluations
  • \(I(e)\) = Information value per response
  • \(E(g,s,a)\) = Engagement multiplier (gamification, skills, analytics)
  • \(CL(e)\) = Cognitive load cost
  • \(c(e)\) = Monetary cost per response
  • \(B\) = Budget constraint
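A minimal numerical sketch of this framework follows, under several loudly stated assumptions. The slides do not specify \(f\), \(I(e)\), or \(CL(e)\), so the code assumes \(e\) indexes the feedback format, uses log1p for \(f\) (diminishing returns), and plugs in made-up values for information value and cognitive load. Response times borrow the rough figures from the effort spectrum in this deck (~30 sec, ~1 min, ~3 min); the budget and hourly rate are hypothetical. If \(f\) is increasing, the budget constraint binds, so \(n\) is simply the largest number of responses the budget can buy.

```python
import math
from dataclasses import dataclass


@dataclass
class FeedbackDesign:
    """One candidate feedback format e. All numeric values are illustrative assumptions."""
    name: str
    seconds_per_response: float  # rough response times from the effort spectrum
    info_value: float            # I(e): assumed information per response (arbitrary units)
    cognitive_load: float        # CL(e): assumed penalty, in the same units as Q


def affordable_responses(design, budget, hourly_rate):
    """Largest n with n * c(e) <= B, where c(e) = hourly_rate * seconds / 3600."""
    return math.floor(budget * 3600.0 / (hourly_rate * design.seconds_per_response))


def quality(design, n, engagement=1.0):
    """Q(n, e) = f(n * I(e) * E) - CL(e), with f assumed to be log1p."""
    return math.log1p(n * design.info_value * engagement) - design.cognitive_load


designs = [
    FeedbackDesign("binary", 30, info_value=1.0, cognitive_load=0.1),
    FeedbackDesign("pairwise", 60, info_value=2.5, cognitive_load=0.3),
    FeedbackDesign("feature-level", 180, info_value=6.0, cognitive_load=0.9),
]

budget_usd = 500.0       # B: hypothetical budget
hourly_rate_usd = 20.0   # hypothetical expert pay rate

results = {}
for d in designs:
    n = affordable_responses(d, budget_usd, hourly_rate_usd)
    results[d.name] = (n, quality(d, n))
    print(f"{d.name:>13}: n = {n:5d}, Q = {results[d.name][1]:.2f}")

best_design = max(results, key=lambda name: results[name][1])
print("Highest Q under these assumptions:", best_design)
```

Which format comes out ahead depends entirely on the assumed values, which is the reason the pilot needs to estimate them empirically.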

RLEF as a Socio-Technical System

Socio-Technical Systems: Social and technical elements are fundamentally intertwined.

Healthcare Elements:

🔧 Technology
AI algorithms, devices, platforms

👥 Practitioners
Doctors, nurses, administrators

🏥 Patients
Individuals, families, communities

System Context:

📋 Policies
Regulations, protocols, guidelines

💰 Incentives
Financial, professional, social

🔄 Dynamic Interactions
All elements continuously influence each other

Optimizing technical components alone can lead to system-wide failure

Traditional AI Implementation Framework: CRISP-DM (Cross-Industry Standard Process for Data Mining)

Extended Socio-Technical Framework

Missing Elements

  • Feedback loops and performativity
  • Multiple stakeholders with conflicting objectives
  • Dynamic data distributions
  • Behavioral adaptation to AI systems

New Components Needed

  • System dynamics mapping
  • Continuous behavioral monitoring
  • Intervention design as core activity
  • Stakeholder impact assessment

CRISP-DM Extended

Key Addition: Feedback Loops. Models change the world they operate in

Integration

Economics / Behavioral Science

  • Incentives
  • Market mechanisms
  • Behavioral economics
  • Game theory applications
  • Decision-making under uncertainty

Specialist Knowledge

  • Implementation science
  • Clinical
  • Biomedical
  • Data Engineering and Architecture

Causal Inference

  • Moving beyond correlation
  • Identification strategies
  • Confounding control
  • Treatment effect estimation

RLEF as a Socio-Technical System

AugMed Platform


Key Socio-Technical Elements

Technical Components:

  • AI/ML models
  • Data infrastructure
  • Platform interface
  • Feedback mechanisms

Social Components:

  • Expert motivations
  • Trust dynamics
  • Learning effects
  • Community norms

“Healthcare AI systems exist within complex human and organizational contexts”

Platform Features as Leverage Points

Cognitive Burden Factors:

  • Task complexity
  • Time requirements
  • Mental effort
  • Decision fatigue

Engagement Boosters:

  • 🏆 Status badges
  • 📊 Skill measurement
  • 📈 Peer comparisons
  • 🎯 Adaptive learning

Mechanism Design

Spectrum of Expert Effort

Low Effort

  • Binary: “Real or Synthetic?”
  • Quick response (~30 sec)
  • Need many samples

Medium Effort

  • Pairwise comparison
  • “Which is more realistic?”
  • Moderate time (~1 min)

High Effort

  • Feature-level feedback
  • “What’s unrealistic?”
  • Detailed review (~3 min)
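One possible way to turn the medium-effort pairwise judgments into a usable signal is a Bradley-Terry model, a standard technique for ranking from pairwise comparisons. This is an assumption about how such feedback could be aggregated, not a description of the platform. The record names and judgments below are hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=200):
    """Estimate a realism score per record from (winner, loser) pairs.

    Uses the standard minorization-maximization update for Bradley-Terry:
        score_i <- wins_i / sum_j [ n_ij / (score_i + score_j) ]
    Assumes every record wins at least once and all records are connected
    through comparisons; otherwise scores can collapse to zero.
    """
    items = sorted({record for pair in comparisons for record in pair})
    wins = defaultdict(float)
    pair_counts = defaultdict(float)  # n_ij: how often records i and j were compared
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    scores = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(
                pair_counts[frozenset((i, j))] / (scores[i] + scores[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(updated.values())
        scores = {i: s / total for i, s in updated.items()}  # normalize to sum to 1
    return scores


# Hypothetical expert answers to "Which is more realistic?" as (winner, loser) pairs.
judgments = (
    [("record_A", "record_B")] * 3 + [("record_B", "record_A")] * 1
    + [("record_A", "record_C")] * 3 + [("record_C", "record_A")] * 1
    + [("record_B", "record_C")] * 2 + [("record_C", "record_B")] * 1
)

realism_scores = bradley_terry(judgments)
for record, score in sorted(realism_scores.items(), key=lambda kv: -kv[1]):
    print(f"{record}: {score:.3f}")
```

The resulting scores could serve as a reward signal for the generator. Binary and feature-level feedback would need different aggregation rules.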

Example Platform UI - Smartphone Version


Here: Clear binary choice

Research Questions

  1. What design will maximize information gain per dollar spent?
  2. Can the system be designed so it is feasible and scalable?
  3. Would the system be cost-effective, accounting for costs and downstream benefits at scale?

Study Design: Kibera Pilot

Setting

Why Kibera?

  • Often described as Africa’s largest urban slum
  • ~250,000 residents
  • Limited healthcare infrastructure
  • High disease burden

Partner: CFK Africa

  • Community health organization
  • 20+ years serving Kibera
  • Deep local clinical expertise
  • Trusted by community

Research Timeline

  • Phase 1: Remote interviews with 10 clinicians
  • 🔄 Phase 2: UI/UX prototyping & testing
  • 📊 Phase 3: Systems mapping & agent-based modeling
  • 🚀 Phase 4: 1-month pilot implementation

Expected Impact & Next Steps

Beyond Technology: Systems Change

Technical Innovation:

  • RLEF methodology validated
  • Cost-effectiveness proven
  • Quality metrics established
  • Platform features optimized

Social Innovation:

  • Expert engagement models
  • Trust-building mechanisms
  • Community ownership
  • Sustainable incentives

Path Forward (if successful)

  1. Q3 2025: Complete pilot & analyze socio-technical dynamics
  2. Q4 2025: Publish framework with behavioral insights
  3. Q1 2026: Find someone to pay for the full study (the hardest part)
  4. 2026-2030ish: Design multi-site study with systems perspective
  5. 2030 and beyond: Scale through partner networks

“Creating AI that learns from and serves those who need it most”

Thank You!

Questions?

Contact:

  • Sean Sylvia: ssylvia@email.unc.edu
  • DHEPlab: dheplab.org

Partners:

  • CFK Africa
  • UNC Gillings School of Global Public Health

Funders:

  • Gillings Gift (Gillings Innovation Labs)
  • NIH AIM-AHEAD Consortium (previous funding for AugMed Platform)

References

Key Sources:

  • OECD (2023). Progress on implementing and using electronic health record systems.
  • Celi et al. (2022). Sources of bias in artificial intelligence that perpetuate healthcare disparities. PLOS Digital Health.
  • Fortune Business Insights (2023). Synthetic Data Generation Market Forecast 2030.
  • ONC/HealthIT.gov (2021). National Trends in Hospital and Physician Adoption of Electronic Health Records.
  • Gartner (2023). Predicts 2024: AI and Machine Learning.
  • NIH N3C (2020). National COVID Cohort Collaborative synthetic data initiative.
  • Sylvia, S. (2025). Digital Health and the Leapfrog Illusion: Socio-technical Systems in Global Health.