datasetjson: Read and Write CDISC Dataset JSON

Modern data exchange format for clinical trials in R and Python

CDISC

Data Exchange

Intermediate

Authors

Michael Stackhouse (Chief Innovation Officer, Atorus)

Sam Hume (Research Data Engineer, CDISC)

Nick Masel (Associate Director, Johnson & Johnson)

Eli Miller (Senior Manager, Atorus)

Overview

Join us for a workshop on Dataset-JSON, the CDISC format for exchanging clinical trial datasets as JSON. Learn why Dataset-JSON is preferable to formats like Parquet for data exchange, explore the specification in detail, and get hands-on experience implementing it in both R and Python.

What You’ll Learn

📋 Dataset-JSON specification - Format details
🆚 Comparison - vs Parquet, XPT, and other formats
🐍 R and Python - Implementation in both languages
🔄 Adoption plans - Industry roadmap
🚀 API integration - Future developments
🔧 Practical implementation - Real-world usage

Prerequisites

Required Knowledge:

Basic R or Python programming
Understanding of clinical trial data
Familiarity with CDISC standards (helpful, not required)

For Hands-on:

Laptop with R or Python installed
Workshop materials (provided)

Key Tools

{datasetjson} (R)

datasetjson (Python)

CDISC Dataset-JSON

Workshop Materials

Resources

Package Documentation: atorus-research.github.io/datasetjson

Workshop Materials: atorus-research.github.io/datasetjson_workshop

What is Dataset-JSON?

CDISC Standard Format

Dataset-JSON is an emerging CDISC standard for representing clinical trial datasets in JSON format.

Key Features:

📊 Self-describing - Metadata included
🔍 Human-readable - Text-based format
🌐 Web-friendly - Native JSON support
🔗 Linked data - Relationships preserved
✅ Validated - Schema-based validation

Example Structure

{
  "datasetJSONVersion": "1.0.0",
  "fileOID": "example.adsl",
  "datasetName": "ADSL",
  "datasetLabel": "Subject-Level Analysis Dataset",
  "records": 254,
  "columns": [
    {
      "name": "USUBJID",
      "label": "Unique Subject Identifier",
      "dataType": "string",
      "length": 20
    },
    {
      "name": "AGE",
      "label": "Age",
      "dataType": "integer"
    }
  ],
  "rows": [
    ["ABC-001", 65],
    ["ABC-002", 72]
  ]
}
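
The structure above can be read with nothing more than a JSON parser. The following is a minimal sketch, assuming the field names shown in this page's example (`columns`, `rows`, `records`) and a trimmed two-row copy of the data with `records` set to match; it uses only the Python standard library, not the workshop packages.

```python
import json

# Trimmed copy of the example above; field names follow this page's example.
doc = json.loads("""
{
  "datasetJSONVersion": "1.0.0",
  "datasetName": "ADSL",
  "records": 2,
  "columns": [
    {"name": "USUBJID", "label": "Unique Subject Identifier",
     "dataType": "string", "length": 20},
    {"name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [["ABC-001", 65], ["ABC-002", 72]]
}
""")

# The order of "columns" defines the position of each value within a row.
names = [col["name"] for col in doc["columns"]]
records = [dict(zip(names, row)) for row in doc["rows"]]

# The declared record count should agree with the rows actually present.
assert doc["records"] == len(doc["rows"])

print(records[0])  # {'USUBJID': 'ABC-001', 'AGE': 65}
```

Because the rows are stored as arrays rather than objects, the column names appear once in the metadata instead of once per row.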

Why Dataset-JSON?

Advantages Over Other Formats

vs XPT (SAS Transport)

| Feature         | Dataset-JSON | XPT        |
|-----------------|--------------|------------|
| Human-readable  | ✅ Yes       | ❌ No      |
| Self-describing | ✅ Yes       | ⚠️ Limited |
| Web APIs        | ✅ Native    | ❌ Poor    |
| Modern tools    | ✅ Excellent | ⚠️ Legacy  |
| Size efficiency | ⚠️ Larger    | ✅ Smaller |

vs Parquet

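| Feature           | Dataset-JSON | Parquet          |
|-------------------|--------------|------------------|
| CDISC standard    | ✅ Yes       | ❌ No            |
| Metadata          | ✅ Rich      | ⚠️ Basic         |
| Human-readable    | ✅ Yes       | ❌ Binary        |
| Query performance | ⚠️ Slower    | ✅ Very fast     |
| Interoperability  | ✅ Excellent | ⚠️ Tool-specific |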

Use Cases

Ideal for:

✅ Regulatory submissions
✅ Data exchange between organizations
✅ Web APIs and services
✅ Documentation and review
✅ Long-term archival

Not ideal for:

❌ Big data analytics (use Parquet)
❌ Real-time processing
❌ Embedded systems

Workshop Structure

Part 1: Introduction (30 min)

Topics:

Dataset-JSON motivation
Format specification walkthrough
Comparison with alternatives
Industry adoption status

Part 2: R Implementation (45 min)

Hands-on with R package:

library(datasetjson)

# Read Dataset-JSON
adsl <- read_dataset_json("adsl.json")

# Standard R data frame
head(adsl)
str(adsl)

# Access metadata
attributes(adsl)$column_labels
attributes(adsl)$column_types

# Write Dataset-JSON
write_dataset_json(adsl, "output.json")

# Validation
validate_dataset_json("adsl.json")

Features:

Read/write Dataset-JSON files
Automatic metadata handling
Schema validation
Integration with tidyverse

Part 3: Python Implementation (45 min)

Hands-on with Python package:

import datasetjson as dsj
import pandas as pd

# Read Dataset-JSON
adsl = dsj.read_json("adsl.json")

# Pandas DataFrame
print(adsl.head())
adsl.info()  # info() prints its summary directly and returns None

# Access metadata
print(dsj.get_metadata(adsl))

# Write Dataset-JSON
dsj.write_json(adsl, "output.json")

# Validation
dsj.validate("adsl.json")

Features:

Pandas integration
Metadata preservation
Schema validation
Type handling

Part 4: Advanced Topics (45 min)

Topics:

Custom metadata
Large file handling
Compression options
Streaming implementations
API integration patterns
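
One way to approach large files and streaming is to avoid holding every row in memory at once. The sketch below assumes a line-delimited layout chosen here for illustration (metadata on the first line, one row array per following line); the helper names `write_lines` and `read_rows` are hypothetical and are not part of either workshop package.

```python
import io
import json
from typing import Iterator, TextIO


def write_lines(doc: dict, out: TextIO) -> None:
    """Write the metadata on the first line, then one JSON array per row."""
    header = {key: value for key, value in doc.items() if key != "rows"}
    out.write(json.dumps(header) + "\n")
    for row in doc["rows"]:
        out.write(json.dumps(row) + "\n")


def read_rows(src: TextIO) -> Iterator[dict]:
    """Yield one record at a time instead of parsing the whole file."""
    header = json.loads(src.readline())
    names = [col["name"] for col in header["columns"]]
    for line in src:
        if line.strip():
            yield dict(zip(names, json.loads(line)))


doc = {
    "datasetName": "ADSL",
    "records": 2,
    "columns": [
        {"name": "USUBJID", "dataType": "string"},
        {"name": "AGE", "dataType": "integer"},
    ],
    "rows": [["ABC-001", 65], ["ABC-002", 72]],
}

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_lines(doc, buffer)
buffer.seek(0)

ages = [record["AGE"] for record in read_rows(buffer)]
print(ages)  # [65, 72]
```

The reader only ever parses a single line at a time, so memory use stays roughly constant as the number of rows grows.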

Part 5: Future Directions (30 min)

Discussion:

Regulatory acceptance timeline
Tool support roadmap
Community adoption
API standards
Your organization’s plans

Practical Exercises

Exercise 1: Read and Explore

Load a Dataset-JSON file and explore its structure.

Tasks:

Read ADSL dataset
Examine metadata
Compare with XPT
Validate format

Exercise 2: Convert Formats

Convert between Dataset-JSON and other formats.

Tasks:

XPT → Dataset-JSON
CSV → Dataset-JSON
Dataset-JSON → Parquet
Preserve metadata
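
The CSV → Dataset-JSON task can be sketched with the standard library alone. CSV carries no labels or types, so the column metadata has to be supplied by hand. This is an illustrative sketch, not the workshop solution: `csv_to_dataset_json` is a hypothetical helper, the output keys follow this page's example structure, and only the `string`, `integer`, and `float` data types are handled.

```python
import csv
import io

# Converters for the data types handled in this sketch.
CASTS = {"string": str, "integer": int, "float": float}


def csv_to_dataset_json(text: str, name: str, label: str,
                        columns: list[dict]) -> dict:
    """Convert CSV text to a dict shaped like this page's example structure."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = []
        for col in columns:
            raw = record[col["name"]]
            # Empty CSV cells become JSON null; other cells are cast by type.
            row.append(None if raw == "" else CASTS[col["dataType"]](raw))
        rows.append(row)
    return {
        "datasetJSONVersion": "1.0.0",
        "datasetName": name,
        "datasetLabel": label,
        "records": len(rows),
        "columns": columns,
        "rows": rows,
    }


csv_text = "USUBJID,AGE\nABC-001,65\nABC-002,\n"
columns = [
    {"name": "USUBJID", "label": "Unique Subject Identifier",
     "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"},
]

dataset = csv_to_dataset_json(csv_text, "ADSL",
                              "Subject-Level Analysis Dataset", columns)
print(dataset["rows"])  # [['ABC-001', 65], ['ABC-002', None]]
```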

Exercise 3: Create from Scratch

Build a Dataset-JSON file from raw data.

Tasks:

Define column metadata
Set data types
Add labels and descriptions
Validate output
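
The first three tasks can be sketched as follows: starting from plain records, derive a `columns` list with names, labels, data types, and string lengths. This is a rough sketch under stated assumptions: `build_columns` is a hypothetical helper, the type names are those used in this page's example, and each column is assumed to hold values of a single Python type.

```python
import json

# Mapping from Python types to the data type names used on this page.
TYPE_NAMES = {str: "string", int: "integer", float: "float"}


def build_columns(records: list[dict], labels: dict) -> list[dict]:
    """Derive column metadata from the non-missing values in each column."""
    columns = []
    for name in records[0]:
        values = [rec[name] for rec in records if rec[name] is not None]
        found = {type(value) for value in values}
        if len(found) != 1:
            raise ValueError(f"{name}: expected exactly one value type")
        column = {
            "name": name,
            "label": labels[name],
            "dataType": TYPE_NAMES[found.pop()],
        }
        if column["dataType"] == "string":
            # Record the longest observed value as the column length.
            column["length"] = max(len(value) for value in values)
        columns.append(column)
    return columns


records = [
    {"USUBJID": "ABC-001", "AGE": 65},
    {"USUBJID": "ABC-002", "AGE": None},
]
labels = {"USUBJID": "Unique Subject Identifier", "AGE": "Age"}

columns = build_columns(records, labels)
dataset = {
    "datasetName": "ADSL",
    "datasetLabel": "Subject-Level Analysis Dataset",
    "records": len(records),
    "columns": columns,
    # Rows follow the column order; missing values are written as null.
    "rows": [[rec[col["name"]] for col in columns] for rec in records],
}
print(json.dumps(dataset, indent=2))
```

In practice, labels and types would usually come from a specification or define.xml rather than being inferred from the data.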

Exercise 4: API Integration

Serve Dataset-JSON via a REST API.

R (Plumber):

library(plumber)
library(datasetjson)  # provides read_dataset_json()

#* Get ADSL dataset
#* @get /datasets/adsl
#* @serializer json
function() {
  read_dataset_json("data/adsl.json")
}

Python (FastAPI):

from fastapi import FastAPI
import datasetjson as dsj

app = FastAPI()

@app.get("/datasets/adsl")
def get_adsl():
    return dsj.read_json("data/adsl.json")

CDISC Dataset-JSON Specification

Key Components

File Metadata:

datasetJSONVersion
fileOID
datasetName
datasetLabel
records (row count)

Column Definitions:

name
label
dataType (string, integer, float, datetime)
length
displayFormat

Data Rows:

Array of arrays (efficient)
Column order matches definitions
Null handling

Additional Metadata:

Study information
Variable relationships
Code lists
Value-level metadata
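
The relationship between these components can be checked mechanically: every row should have one value per column, values should match the declared `dataType`, and `null` marks a missing value. The sketch below is a hand-written structural check, not the official JSON Schema validation; `check_structure` is a hypothetical helper, and only the `string`, `integer`, and `float` types are checked.

```python
# Python types accepted for each data type checked in this sketch.
EXPECTED = {"string": str, "integer": int, "float": (int, float)}


def check_structure(doc: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    columns = doc.get("columns", [])
    rows = doc.get("rows", [])
    if doc.get("records") != len(rows):
        problems.append(
            f"records is {doc.get('records')}, but {len(rows)} rows are present"
        )
    for index, row in enumerate(rows):
        if len(row) != len(columns):
            problems.append(
                f"row {index}: expected {len(columns)} values, found {len(row)}"
            )
            continue
        for col, value in zip(columns, row):
            if value is None:
                continue  # null represents a missing value
            accepted = EXPECTED.get(col["dataType"])
            if accepted is None:
                continue  # data types outside this sketch are not checked
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, accepted):
                problems.append(
                    f"row {index}: {col['name']} is not {col['dataType']}"
                )
    return problems


columns = [
    {"name": "USUBJID", "dataType": "string"},
    {"name": "AGE", "dataType": "integer"},
]
good = {"records": 2, "columns": columns,
        "rows": [["ABC-001", 65], ["ABC-002", None]]}
bad = {"records": 3, "columns": columns,
       "rows": [["ABC-001", "65"], ["ABC-002"]]}

print(check_structure(good))  # []
for problem in check_structure(bad):
    print(problem)
```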

Industry Adoption

Current Status

CDISC:

✅ Published specification
🔄 Active development
📢 Industry engagement

Tools:

✅ R package (Atorus)
✅ Python package (CDISC/Atorus)
🔄 SAS support in development
🔄 Validator tools emerging

Pharma:

🔍 Pilot projects underway
📋 Evaluation phase
🎯 2026+ adoption expected

Roadmap

2025:

Finalize specification v1.0
Expand tool support
Pilot submissions

2026+:

Broader industry adoption
Regulatory acceptance
Replace XPT in workflows

Best Practices

Creating Dataset-JSON

✅ Do:

Include comprehensive metadata
Use CDISC terminology
Validate against schema
Document conventions
Version your files

❌ Don’t:

Omit critical metadata
Use inconsistent naming
Skip validation
Ignore data types
Create huge files (split if needed)

Integration

API Design:

RESTful endpoints
Pagination for large datasets
Caching strategies
Error handling
Authentication
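
Pagination for large datasets can be sketched as slicing the `rows` array while repeating the metadata on every page. This is one possible design, not a CDISC-defined API: `paginate` and the `pagination` key are hypothetical and used only for illustration.

```python
def paginate(doc: dict, offset: int = 0, limit: int = 100) -> dict:
    """Return the dataset metadata plus one slice of its rows."""
    page = {key: value for key, value in doc.items() if key != "rows"}
    rows = doc["rows"][offset:offset + limit]
    page["rows"] = rows
    # Hypothetical envelope telling the client where this page sits.
    page["pagination"] = {
        "offset": offset,
        "limit": limit,
        "returned": len(rows),
        "total": len(doc["rows"]),
    }
    return page


doc = {
    "datasetName": "ADSL",
    "records": 5,
    "columns": [{"name": "USUBJID", "dataType": "string"}],
    "rows": [[f"ABC-{n:03d}"] for n in range(1, 6)],
}

second_page = paginate(doc, offset=2, limit=2)
print(second_page["rows"])  # [['ABC-003'], ['ABC-004']]
```

Keeping `records` at the full row count while reporting the slice separately lets a client tell a partial page from a complete dataset.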

Storage:

Compress for archival
Index for search
Back up metadata separately
Keep files version-control friendly
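
Because Dataset-JSON is repetitive text, general-purpose compression tends to work well for archival. The following sketch uses gzip from the Python standard library on a synthetic dataset; the data is made up for illustration, and the actual ratio will depend on the content of real files.

```python
import gzip
import json

# Synthetic dataset with repetitive values, in the shape of this page's example.
doc = {
    "datasetName": "ADSL",
    "records": 1000,
    "columns": [
        {"name": "USUBJID", "dataType": "string"},
        {"name": "AGE", "dataType": "integer"},
    ],
    "rows": [[f"ABC-{n:04d}", 40 + n % 50] for n in range(1000)],
}

# Compact separators remove whitespace before compression.
payload = json.dumps(doc, separators=(",", ":")).encode("utf-8")
compressed = gzip.compress(payload)

# Round trip: decompressing and parsing returns the original content.
restored = json.loads(gzip.decompress(compressed))
assert restored == doc

print(len(payload), "bytes uncompressed")
print(len(compressed), "bytes compressed")
```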

Learning Outcomes

✅ Understand Dataset-JSON specification
✅ Read/write Dataset-JSON in R
✅ Read/write Dataset-JSON in Python
✅ Convert between formats
✅ Integrate with APIs
✅ Plan adoption strategy
✅ Contribute to development

Resources

Official:

CDISC Dataset-JSON specification
CDISC website and wiki
Package documentation

Community:

GitHub repositories
CDISC Dataset-JSON working group
Stack Overflow tags

Learning:

Workshop materials
Example datasets
Tutorial notebooks

Getting Involved

Contribute:

Test packages
Report issues
Submit use cases
Join working groups

Stay Informed:

CDISC newsletters
Package releases
Conference presentations

Similar Workshops

SDTM Programming - SDTM dataset creation
Python for CSR - Modern data formats

Related Presentations

Mosaic: ARS-Driven Automation - CDISC standards

Next Steps

For SDTM: See {sdtm.oak} workshop
Standards evolution: ARS/ARM trends

Last updated: November 2025 | R/Pharma 2025 Conference
