RLEF-DataGen

The Data Representation Gap

The Problem

EHR Adoption Disparities:

HICs: 78% of physicians use EHRs (US, 2021)
LMICs: <15% adoption rates
Only 27 of 194 WHO member states have national EHR systems

AI Training Data Bias:

>90% of clinical AI datasets from HICs
<3% representation from African populations
74% of AI studies use data from US/China only

“Clinical AI models risk perpetuating healthcare inequities precisely where they could have the most impact”

Synthetic Data as a Potential Solution

The Growing Use of Synthetic Data

What is Synthetic Clinical Data?

Artificially generated patient records
Preserves statistical properties
Protects privacy while enabling research

Market For Synthetic Data:

2023: $351M → 2030: $2.3B
31.1% CAGR (2023-2030)
75% of businesses will generate synthetic customer data by 2026, up from less than 5% in 2023 (Gartner)
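As a quick consistency check on the market figures above, the sketch below recomputes the growth rate implied by the two endpoints and, conversely, the 2030 value implied by the stated CAGR. The function names are illustrative, and the only inputs are the numbers quoted on the slide.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` once per year for `years` years."""
    return start_value * (1 + rate) ** years


START_2023 = 351e6  # USD, from the slide
END_2030 = 2.3e9    # USD, from the slide (rounded)
YEARS = 2030 - 2023

implied_rate = cagr(START_2023, END_2030, YEARS)
projected_2030 = project(START_2023, 0.311, YEARS)

print(f"Implied CAGR from endpoints: {implied_rate:.1%}")
print(f"2030 value at 31.1% CAGR: ${projected_2030 / 1e9:.2f}B")
```

The two views agree to within rounding of the $2.3B endpoint, so the three market figures are mutually consistent.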

Key Challenges with Synthetic Data

Validation Paradox: How to verify accuracy without real data?
Rare Event Capture: May miss critical edge cases & comorbidities
Bias Amplification: Risk of encoding generation assumptions as “truth”
May not preserve causal relationships

Traditionally, a real dataset is needed to train the model that produces the synthetic data (e.g., a Generative Adversarial Network)

Current Approaches & Limitations

GANs: Require substantial training data (doesn’t solve our problem!)
LLMs: Can generate from limited examples (but quality?)

The Fundamental Challenge

How do we generate representative data when none exists for these populations?

Hypothesis: We can use local expert knowledge (clinicians and other types of providers) to guide generation and validate quality

Our Approach: Reinforcement Learning with Expert Feedback

Note:

RLHF and RLEF are not exactly the same thing. RLHF is about learning what humans like; RLEF is about learning to do what experts do. RLHF handles subjective alignment, while RLEF handles competence transfer.

How RLEF Addresses Key Challenges

Traditional Limitations:

❌ Need existing data
❌ Miss rare events
❌ Can’t validate without ground truth

Potential RLEF Solutions:

✓ Bootstrap from expert knowledge
✓ Experts flag missing conditions
✓ Continuous validation through feedback
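One way "continuous validation through feedback" could be operationalized is a running estimate of how often experts judge generated records to be realistic. The sketch below is an assumption for illustration, not the project's method: it uses a Beta-Bernoulli update with a uniform prior, and `RealismTracker` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class RealismTracker:
    """Running estimate of the share of records experts judge realistic.

    Assumption: each expert vote is treated as an independent Bernoulli
    draw, with a uniform Beta(1, 1) prior on the realism rate.
    """
    alpha: float = 1.0  # prior pseudo-count of "looks real" votes
    beta: float = 1.0   # prior pseudo-count of "looks synthetic" votes

    def update(self, judged_real: bool) -> None:
        # Each vote increments exactly one of the two pseudo-counts.
        if judged_real:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior mean of the realism rate.
        return self.alpha / (self.alpha + self.beta)


tracker = RealismTracker()
for vote in [True] * 8 + [False] * 2:  # hypothetical expert votes
    tracker.update(vote)
print(round(tracker.mean, 2))
```

A falling estimate after a change to the generator would be a signal to collect more detailed feedback before continuing.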

Why Not Train AI Models Directly?

1. Cost-effectiveness: One dataset → Many applications

2. Reusability: Create once, use repeatedly

3. Flexibility: Application-agnostic resource

The Economics of Expert Feedback

Optimization Framework

\[\text{Maximize: } Q(n,e) = f(n \times I(e) \times E(g,s,a)) - CL(e)\]

\[\text{Subject to: } n \times c(e) \leq B\]

Where:

$Q$ = Quality of synthetic data
$n$ = Number of expert evaluations
$I(e)$ = Information value per response
$E(g,s,a)$ = Engagement multiplier (gamification, skills, analytics)
$CL(e)$ = Cognitive load cost
$c(e)$ = Monetary cost per response
$B$ = Budget constraint
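The framework above can be explored numerically once functional forms are chosen. The sketch below is illustrative only: it assumes $f = \log(1 + x)$ to represent diminishing returns, treats $E(g,s,a)$ as a single scalar multiplier, and uses made-up values for $I(e)$, $c(e)$, and $CL(e)$ at three feedback designs. None of these choices come from the study.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortLevel:
    name: str
    info_value: float      # I(e): information per response (hypothetical units)
    cost: float            # c(e): monetary cost per response (hypothetical USD)
    cognitive_load: float  # CL(e): cognitive load penalty (hypothetical units)


def quality(n, level, engagement=1.0, f=math.log1p):
    """Q(n, e) = f(n * I(e) * E) - CL(e), with f assumed to be log(1 + x)."""
    return f(n * level.info_value * engagement) - level.cognitive_load


def best_design(levels, budget, engagement=1.0):
    """Pick the effort level maximizing Q subject to n * c(e) <= B."""
    best = None
    for level in levels:
        # Q is nondecreasing in n under the assumed f, so spend the full budget.
        n = int(budget // level.cost)
        q = quality(n, level, engagement)
        if best is None or q > best[2]:
            best = (level, n, q)
    return best


# Hypothetical parameter values, for illustration only.
LEVELS = [
    EffortLevel("binary", info_value=1.0, cost=0.5, cognitive_load=0.1),
    EffortLevel("pairwise", info_value=2.5, cost=1.0, cognitive_load=0.3),
    EffortLevel("feature_level", info_value=8.0, cost=3.0, cognitive_load=0.8),
]
BUDGET = 300.0

level, n, q = best_design(LEVELS, BUDGET)
print(f"{level.name}: n={n}, Q={q:.3f}")
```

The ranking of designs depends entirely on the assumed parameters and on the shape of $f$, which is why estimating them empirically matters.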

RLEF as a Socio-Technical System

Socio-Technical Systems: Social and technical elements are fundamentally intertwined.

Healthcare Elements:

🔧 Technology
AI algorithms, devices, platforms

👥 Practitioners
Doctors, nurses, administrators

🏥 Patients
Individuals, families, communities

System Context:

📋 Policies
Regulations, protocols, guidelines

💰 Incentives
Financial, professional, social

🔄 Dynamic Interactions
All elements continuously influence each other

Optimizing technical components alone can lead to system-wide failure

Traditional AI Implementation Framework: CRISP-DM

Extended Socio-Technical Framework

Missing Elements

Feedback loops and performativity
Multiple stakeholders with conflicting objectives
Dynamic data distributions
Behavioral adaptation to AI systems

New Components Needed

System dynamics mapping
Continuous behavioral monitoring
Intervention design as core activity
Stakeholder impact assessment

CRISP-DM Extended

Key Addition: Feedback Loops. Models change the world they operate in

Integration

Economics / Behavioral Science

Incentives
Market mechanisms
Behavioral economics
Game theory applications
Decision-making under uncertainty

Specialist Knowledge

Implementation science
Clinical
Biomedical
Data Engineering and Architecture

Causal Inference

Moving beyond correlation
Identification strategies
Confounding control
Treatment effect estimation

RLEF as a Socio-Technical System

AugMed Platform


Key Socio-Technical Elements

Technical Components:

AI/ML models
Data infrastructure
Platform interface
Feedback mechanisms

Social Components:

Expert motivations
Trust dynamics
Learning effects
Community norms

“Healthcare AI systems exist within complex human and organizational contexts”

Platform Features as Leverage Points

Cognitive Burden Factors:

Task complexity
Time requirements
Mental effort
Decision fatigue

Engagement Boosters:

🏆 Status badges
📊 Skill measurement
📈 Peer comparisons
🎯 Adaptive learning

Mechanism Design

Spectrum of Expert Effort

Low Effort

Binary: “Real or Synthetic?”
Quick response (~30 sec)
Need many samples

Medium Effort

Pairwise comparison
“Which is more realistic?”
Moderate time (~1 min)

High Effort

Feature-level feedback
“What’s unrealistic?”
Detailed review (~3 min)
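The approximate response times above translate directly into expert throughput. The sketch below does that arithmetic; the times come from the slide, while the function names are illustrative and the calculation ignores breaks, fatigue, and task-switching.

```python
# Approximate seconds per response, taken from the effort spectrum above.
SECONDS_PER_RESPONSE = {
    "binary": 30,          # "Real or Synthetic?"
    "pairwise": 60,        # "Which is more realistic?"
    "feature_level": 180,  # "What's unrealistic?"
}


def responses_per_hour(seconds_per_response):
    """Whole responses an expert can complete in one uninterrupted hour."""
    return 3600 // seconds_per_response


def expert_hours_needed(n_responses, seconds_per_response):
    """Total expert time, in hours, to collect n_responses."""
    return n_responses * seconds_per_response / 3600


for design, seconds in SECONDS_PER_RESPONSE.items():
    print(design, responses_per_hour(seconds), "responses/hour")
```

Binary review yields six times as many responses per hour as feature-level review, so the detailed format is only worth it if each response carries substantially more information.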

Example Platform UI - Smartphone Version


Here: Clear binary choice

Research Questions

What design will maximize information gain per dollar spent?

Can the system be designed so it is feasible and scalable?

Would the system be cost-effective, accounting for costs and downstream benefits at scale?

Study Design: Kibera Pilot

Setting

Why Kibera?

Often described as Africa’s largest urban slum
~250,000 residents
Limited healthcare infrastructure
High disease burden

Partner: CFK Africa

Community health organization
20+ years serving Kibera
Deep local clinical expertise
Trusted by community

Research Timeline

✅ Phase 1: Remote interviews with 10 clinicians
🔄 Phase 2: UI/UX prototyping & testing
📊 Phase 3: Systems mapping & agent-based modeling
🚀 Phase 4: 1-month pilot implementation

Expected Impact & Next Steps

Beyond Technology: Systems Change

Technical Innovation:

RLEF methodology validated
Cost-effectiveness proven
Quality metrics established
Platform features optimized

Social Innovation:

Expert engagement models
Trust-building mechanisms
Community ownership
Sustainable incentives

Path Forward (if successful)

Q3 2025: Complete pilot & analyze socio-technical dynamics
Q4 2025: Publish framework with behavioral insights
Q1 2026: Find someone to pay for full study (hardest part)
2026-2030ish: Design multi-site study with systems perspective
2030 and beyond: Scale through partner networks

“Creating AI that learns from and serves those who need it most”

Thank You!

Questions?

Contact:

Sean Sylvia: ssylvia@email.unc.edu
DHEPlab: dheplab.org

Partners:

CFK Africa
UNC Gillings School of Global Public Health

Funders:

Gillings Gift (Gillings Innovation Labs)
NIH AIM-AHEAD Consortium (previous funding for AugMed Platform)

References

Key Sources:

OECD (2023). Progress on implementing and using electronic health record systems.
Celi et al. (2022). Sources of bias in artificial intelligence that perpetuate healthcare disparities. PLOS Digital Health.
Fortune Business Insights (2023). Synthetic Data Generation Market Forecast 2030.
ONC/HealthIT.gov (2021). National Trends in Hospital and Physician Adoption of Electronic Health Records.
Gartner (2023). Predicts 2024: AI and Machine Learning.
NIH N3C (2020). National COVID Cohort Collaborative synthetic data initiative.
Sylvia, S. (2025). Digital Health and the Leapfrog Illusion: Socio-technical Systems in Global Health.