Integrating LLMs with R Shiny for Clinical Data Review

Ensuring Data Privacy and Validity in AI-Powered Applications

Tags: AI, LLM, Shiny, Privacy
Authors

Zhen Wu (CIMS Global)

Peng Zhang (CIMS Global)

Overview

Level: Intermediate | Topics: AI/LLM, Shiny, Data Privacy

The pharmaceutical industry is shifting from traditional SAS-based workflows toward the open-source R ecosystem. This workshop presents {DataChat}, an innovative R Shiny application that enables users to β€œchat with data” through a conversational interface while maintaining strict compliance with data privacy requirements and statistical validity standards.

What You’ll Learn

  • πŸ›‘οΈ Data privacy in LLM applications
  • πŸ’¬ Conversational interfaces for clinical data
  • πŸ” RAG (Retrieval-Augmented Generation) for the pharma domain
  • βœ… Statistical validity in AI-generated results
  • 🎯 User-friendly design for non-programmers

Prerequisites

Required Knowledge:

  • Intermediate R and Shiny
  • Basic understanding of clinical trial data structures
  • Familiarity with data privacy regulations (GDPR, HIPAA)

Technical Setup:

  • R/RStudio with Shiny
  • Access to sample clinical datasets

Key Packages & Tools

  • {ellmer} - LLM orchestration and tool calling
  • {shinychat} - chat UI components for Shiny
  • {ragnar} - retrieval-augmented generation (RAG)
  • {shiny} - web application framework
  • Internal statistical tools - validated calculations

The Challenge

Traditional R Shiny applications for clinical data often require:

  • πŸ“š Strong understanding of data structures (SDTM, ADaM)
  • πŸ–±οΈ Familiarity with complex UI components (dropdowns, filters)
  • πŸ’» Programming knowledge for data exploration

This creates barriers for clinical reviewers, physicians, and medical writers who need to access insights but lack technical expertise.

The Solution: {DataChat}

An AI-powered conversational interface that allows natural language interaction with clinical data while ensuring:

  • πŸ”’ Data never leaves the secure environment
  • βœ… Statistical calculations are validated
  • πŸ“Š Results are reproducible and auditable
  • πŸ‘₯ Accessible to non-technical users

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          User Interface (Shiny)                 β”‚
β”‚  "Show me adverse events for patients >65"      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚       LLM Orchestration ({ellmer})              β”‚
β”‚  β€’ Intent classification                        β”‚
β”‚  β€’ Tool selection                               β”‚
β”‚  β€’ Response generation                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RAG System   β”‚   β”‚ Statistical     β”‚
β”‚ ({ragnar})   β”‚   β”‚ Tools           β”‚
β”‚              β”‚   β”‚ (validated)     β”‚
β”‚ β€’ Document   β”‚   β”‚ β€’ Summaries     β”‚
β”‚   retrieval  β”‚   β”‚ β€’ Plots         β”‚
β”‚ β€’ Context    β”‚   β”‚ β€’ Models        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Features

1. Conversational Data Exploration

Natural language queries like:

  • β€œWhat’s the average age of patients in the treatment arm?”
  • β€œShow me serious adverse events by system organ class”
  • β€œCompare baseline demographics between arms”

2. RAG for Domain Knowledge

{ragnar} provides retrieval-augmented generation capabilities:

library(ragnar)

# Build a local vector store from study documents
# (illustrative sketch; store location and embedding model are placeholders)
store <- ragnar_store_create(
  "study_docs.ragnar.duckdb",
  embed = embed_ollama(model = "nomic-embed-text")
)

for (doc in list.files("study_protocols/", full.names = TRUE)) {
  chunks <- ragnar_read(doc) |>
    markdown_chunk()   # ragnar_chunk() on older {ragnar} releases
  ragnar_store_insert(store, chunks)
}
ragnar_store_build_index(store)

# Retrieve context relevant to the user's question
context <- ragnar_retrieve(store, user_question, top_k = 5)
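
Because the store is a local DuckDB file and the embeddings can be computed by a local model, the indexed protocols and reports never have to leave the secure environment.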

3. Privacy-Preserving Design

Critical Privacy Features:

  • βœ… On-premise deployment - No data sent to external APIs
  • βœ… Local LLMs supported - Can use llama.cpp or similar
  • βœ… Query sanitization - Remove PII before processing
  • βœ… Audit logging - Track all data access
  • βœ… Role-based access - Control data visibility

4. Statistical Validity

Ensuring Accurate Results:

  • All statistical calculations use validated R functions
  • LLM suggests the approach; validated code executes the calculation (see the sketch after this list)
  • Results include confidence intervals and p-values
  • Automatic flagging of statistical assumptions
  • Human review required for critical decisions
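
As a sketch of that division of labour (compare_arm_means() is a hypothetical helper, not part of {DataChat}), each tool the LLM can call simply wraps a validated R function and returns the estimate together with its confidence interval, p-value, and an assumption note:

# Hypothetical validated tool: the LLM may request it, but the
# computation itself is done by standard, tested R functions
compare_arm_means <- function(data, variable, arm_var = "ARM") {
  # Welch two-sample t-test (assumes exactly two arms)
  fit <- t.test(data[[variable]] ~ data[[arm_var]])

  list(
    estimate        = unname(fit$estimate[1] - fit$estimate[2]),
    conf_int        = unname(fit$conf.int),
    p_value         = fit$p.value,
    # Flag assumptions so reviewers can judge whether the test is appropriate
    assumption_note = "Welch t-test assumes approximate normality within arms"
  )
}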

Workshop Content

Module 1: Setting Up Secure LLM Integration

  • Configuring {ellmer} for private deployments (see the sketch after this list)
  • Local vs. cloud LLM considerations
  • API security and authentication
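
A minimal configuration sketch, assuming either an on-premise Ollama server or an internal OpenAI-compatible gateway (model names and the URL below are placeholders):

library(ellmer)

# Option 1: fully local model - prompts and data stay on the host
chat_local <- chat_ollama(
  model = "llama3.1",
  system_prompt = "You are a clinical data assistant."
)

# Option 2: private, OpenAI-compatible gateway inside the firewall
chat_gateway <- chat_openai(
  base_url = "https://llm.internal.example.org/v1",
  api_key  = Sys.getenv("INTERNAL_LLM_API_KEY"),
  model    = "internal-clinical-llm",
  system_prompt = "You are a clinical data assistant."
)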

Module 2: Building the Conversational Interface

Using {shinychat} for user interaction:

library(shiny)
library(shinychat)
library(ellmer)

ui <- fluidPage(
  chat_ui("clinical_chat")
)

server <- function(input, output, session) {
  # Private or local model (see Module 1)
  chat <- chat_ollama(
    model = "llama3.1",
    system_prompt = paste(
      "You are a clinical data assistant.",
      "Only answer questions about the loaded study data.",
      "Never make up information."
    )
  )

  # Register validated R functions as tools the model may call.
  # summarize_demographics_tool, plot_adverse_events_tool, and
  # query_database_tool are placeholders for ellmer::tool() definitions
  # created elsewhere in the workshop.
  chat$register_tool(summarize_demographics_tool)
  chat$register_tool(plot_adverse_events_tool)
  chat$register_tool(query_database_tool)

  observeEvent(input$clinical_chat_user_input, {
    stream <- chat$stream_async(input$clinical_chat_user_input)
    chat_append("clinical_chat", stream)
  })
}
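
In this pattern {shinychat} only renders the conversation; the {ellmer} chat object decides when to call a registered tool, and each tool wraps an ordinary, testable R function.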

Module 3: Implementing RAG

Domain-specific context retrieval (a prompt-assembly sketch follows this list):

  • Indexing study protocols and SAPs
  • Medical terminology databases
  • Previous study reports
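
One way to wire retrieval into the conversation is to prepend the retrieved passages to the user's question before it reaches the model. A sketch, reusing the store built in the {ragnar} example above (the name of the text column may vary across {ragnar} versions):

# Retrieve supporting passages and prepend them to the user question
augment_question <- function(store, question, top_k = 5) {
  chunks <- ragnar_retrieve(store, question, top_k = top_k)

  paste0(
    "Use only the following study context to answer.\n\n",
    paste(chunks$text, collapse = "\n\n---\n\n"),
    "\n\nQuestion: ", question
  )
}

# response <- chat$chat(augment_question(store, user_question))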

Module 4: Privacy Controls

Practical Implementation:

# Anonymize and screen a query before LLM processing
# (sketch: remove_pii(), contains_sensitive_terms(), and log_query()
#  stand in for internal, validated helpers)
sanitize_query <- function(query, session) {
  # Remove patient identifiers
  query <- remove_pii(query)
  
  # Block queries that touch sensitive fields
  if (contains_sensitive_terms(query)) {
    return(list(
      allowed = FALSE,
      message = "Query contains sensitive information"
    ))
  }
  
  # Log for audit
  log_query(query, user_id = session$user)
  
  list(allowed = TRUE, query = query)
}
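
The audit hook can start as an append-only log. A minimal sketch of the hypothetical log_query() helper (a production system would more likely write to a database with controlled access):

# Append-only audit trail of who asked what, and when
log_query <- function(query, user_id, log_file = "audit_log.csv") {
  entry <- data.frame(
    timestamp = format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    user_id   = user_id,
    query     = query,
    stringsAsFactors = FALSE
  )
  write.table(
    entry, log_file,
    sep = ",",
    append = file.exists(log_file),
    col.names = !file.exists(log_file),
    row.names = FALSE
  )
  invisible(entry)
}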

Module 5: Validation Strategy

Ensuring Reliability:

  1. Tool validation - Each statistical function tested independently
  2. Response validation - LLM output checked against expected format (see the sketch after this list)
  3. User verification - Results shown with source data
  4. Expert review - Critical decisions flagged for human oversight
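
For example, a tool result can be checked against its expected structure before anything is shown to the user (a sketch based on the hypothetical compare_arm_means() output from earlier):

# Check that a tool result has the expected shape before display
validate_tool_result <- function(result,
                                 required = c("estimate", "conf_int", "p_value")) {
  missing <- setdiff(required, names(result))
  if (length(missing) > 0) {
    stop("Tool result is missing fields: ", paste(missing, collapse = ", "))
  }
  if (!is.numeric(result$p_value) || result$p_value < 0 || result$p_value > 1) {
    stop("Implausible p-value; flag for expert review")
  }
  result
}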

Use Cases in Pharma

1. Clinical Review Meetings

  • Quick ad-hoc analyses during discussions
  • Exploration of safety signals
  • Subgroup identification

2. Medical Writing

  • Extracting statistics for CSR
  • Verifying data consistency
  • Generating descriptive text

3. Safety Monitoring

  • DSMB data reviews
  • Adverse event trending
  • Safety signal detection

4. Regulatory Queries

  • Rapid response to agency questions
  • Data subsetting and analysis
  • Documentation generation

Privacy Compliance

GDPR Considerations

  • βœ… Data minimization
  • βœ… Purpose limitation
  • βœ… Right to explanation (audit logs)
  • βœ… Data encryption at rest and in transit

HIPAA Compliance

  • βœ… Access controls
  • βœ… Audit trails
  • βœ… De-identification support
  • βœ… Business associate agreements (if using cloud LLMs)

21 CFR Part 11

  • βœ… Electronic signatures
  • βœ… Audit trails
  • βœ… System validation
  • βœ… Controlled access

Validation Approach

IQ (Installation Qualification)

  • Environment setup documentation
  • Version control
  • Access controls verification

OQ (Operational Qualification)

  • Test each statistical tool independently (unit-test sketch after this list)
  • Verify LLM response formatting
  • Confirm privacy controls function
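
Each statistical tool can be covered by ordinary unit tests, for example with {testthat} (again using the hypothetical compare_arm_means() tool):

library(testthat)

test_that("compare_arm_means() reproduces a reference calculation", {
  dm <- data.frame(
    ARM = rep(c("Placebo", "Treatment"), each = 50),
    AGE = c(rnorm(50, mean = 60, sd = 8), rnorm(50, mean = 62, sd = 8))
  )

  res <- compare_arm_means(dm, "AGE")
  ref <- t.test(AGE ~ ARM, data = dm)

  expect_equal(res$p_value, ref$p.value)
  expect_length(res$conf_int, 2)
})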

PQ (Performance Qualification)

  • End-to-end testing with real scenarios
  • User acceptance testing
  • Performance benchmarking

Learning Outcomes

By the end of this workshop, you will be able to:

βœ… Design privacy-preserving LLM applications
βœ… Implement RAG for pharmaceutical domain knowledge
βœ… Build conversational interfaces with {shinychat}
βœ… Ensure statistical validity in AI-generated results
βœ… Deploy compliant AI solutions in regulated environments
βœ… Create user-friendly tools for non-technical stakeholders

Demo Application

The workshop includes hands-on work with the {DataChat} demo:

  • Sample CDISC SDTM/ADaM datasets
  • Pre-configured LLM (local or API)
  • Example queries and workflows
  • Privacy controls demonstration

Best Practices

Do’s βœ…

  • Always validate statistical outputs
  • Log all data access for audit
  • Use validated tools for calculations
  • Implement role-based access control
  • Test privacy controls thoroughly

Don’ts ❌

  • Never send raw clinical data to external APIs (unless approved)
  • Don’t rely solely on LLM for critical decisions
  • Avoid exposing PII in queries
  • Don’t skip validation documentation
  • Never deploy without proper testing

Future Directions

  • Integration with electronic data capture (EDC) systems
  • Multi-lingual support for global trials
  • Advanced visualization capabilities
  • Automated report generation
  • Real-time safety monitoring

Additional Resources

  • CDISC standards: cdisc.org
  • FDA guidance on AI/ML: fda.gov
  • Privacy regulations: GDPR, HIPAA guidelines
⚠️ Important Note

This workshop demonstrates privacy-preserving approaches but should not be considered legal or regulatory advice. Always consult with your organization’s legal, compliance, and IT security teams before deploying AI applications with clinical data.


Last updated: November 2025 | R/Pharma 2025 Conference