datasetjson: Read and Write CDISC Dataset JSON
Modern data exchange format for clinical trials in R and Python
Overview
Intermediate CDISC Data Exchange
Join us for an engaging workshop on Dataset-JSON, a powerful format for sharing datasets. Learn why Dataset-JSON is preferable to formats like Parquet, explore the specification in detail, and get hands-on experience implementing it in both R and Python.
What You'll Learn
- Dataset-JSON specification - Format details
- Comparison - vs Parquet, XPT, and other formats
- R and Python - Implementation in both languages
- Adoption plans - Industry roadmap
- API integration - Future developments
- Practical implementation - Real-world usage
Prerequisites
Required Knowledge:
- Basic R or Python programming
- Understanding of clinical trial data
- Familiarity with CDISC standards helpful
For Hands-on:
- Laptop with R or Python installed
- Workshop materials (provided)
Key Tools
{datasetjson} (R)
datasetjson (Python)
CDISC Dataset-JSON
Workshop Materials
Package Documentation: atorus-research.github.io/datasetjson
Workshop Materials: atorus-research.github.io/datasetjson_workshop
What is Dataset-JSON?
CDISC Standard Format
Dataset-JSON is an emerging CDISC standard for representing clinical trial datasets in JSON format.
Key Features:
- Self-describing - Metadata included
- Human-readable - Text-based format
- Web-friendly - Native JSON support
- Linked data - Relationships preserved
- Validated - Schema-based validation
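To make the "self-describing" and "validated" points concrete, here is a minimal Python sketch using only the standard library. It assumes the simplified layout shown in the Example Structure on this page (`records`, `columns`, `rows`); the `structural_problems` helper is hypothetical and is not a substitute for validating against the official CDISC schema.

```python
import json

# Hypothetical payload mirroring the simplified layout used on this page.
payload = json.loads("""
{
  "datasetName": "ADSL",
  "records": 2,
  "columns": [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [["ABC-001", 65], ["ABC-002", 72]]
}
""")


def structural_problems(ds):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    n_cols = len(ds["columns"])
    if ds["records"] != len(ds["rows"]):
        problems.append(f"records={ds['records']} but {len(ds['rows'])} rows present")
    for i, row in enumerate(ds["rows"]):
        if len(row) != n_cols:
            problems.append(f"row {i} has {len(row)} values, expected {n_cols}")
    return problems


# The metadata travels with the data, so labels can be read straight from the file.
column_labels = {c["name"]: c["label"] for c in payload["columns"]}
print(column_labels)
print(structural_problems(payload))
```

Because the column definitions sit next to the rows, a reader needs no external file to interpret the values, which is the property the list above calls self-describing.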
Example Structure
{
"datasetJSONVersion": "1.0.0",
"fileOID": "example.adsl",
"datasetName": "ADSL",
"datasetLabel": "Subject-Level Analysis Dataset",
"records": 254,
"columns": [
{
"name": "USUBJID",
"label": "Unique Subject Identifier",
"dataType": "string",
"length": 20
},
{
"name": "AGE",
"label": "Age",
"dataType": "integer"
}
],
"rows": [
["ABC-001", 65],
["ABC-002", 72]
]
}
Why Dataset-JSON?
Advantages Over Other Formats
vs XPT (SAS Transport)
| Feature | Dataset-JSON | XPT |
|---|---|---|
| Human-readable | Yes | No |
| Self-describing | Yes | Limited |
| Web APIs | Native | Poor |
| Modern tools | Excellent | Legacy |
| Size efficiency | Larger | Smaller |
vs Parquet
| Feature | Dataset-JSON | Parquet |
|---|---|---|
| CDISC standard | Yes | No |
| Metadata | Rich | Basic |
| Human-readable | Yes | No (binary) |
| Query performance | Slower | Very fast |
| Interoperability | Excellent | Tool-specific |
Use Cases
Ideal for:
- Regulatory submissions
- Data exchange between organizations
- Web APIs and services
- Documentation and review
- Long-term archival
Not ideal for:
- Big data analytics (use Parquet)
- Real-time processing
- Embedded systems
Workshop Structure
Part 1: Introduction (30 min)
Topics:
- Dataset-JSON motivation
- Format specification walkthrough
- Comparison with alternatives
- Industry adoption status
Part 2: R Implementation (45 min)
Hands-on with R package:
library(datasetjson)
# Read Dataset-JSON
adsl <- read_dataset_json("adsl.json")
# Standard R data frame
head(adsl)
str(adsl)
# Access metadata
attributes(adsl)$column_labels
attributes(adsl)$column_types
# Write Dataset-JSON
write_dataset_json(adsl, "output.json")
# Validation
validate_dataset_json("adsl.json")
Features:
- Read/write Dataset-JSON files
- Automatic metadata handling
- Schema validation
- Integration with tidyverse
Part 3: Python Implementation (45 min)
Hands-on with Python package:
import datasetjson as dsj
import pandas as pd
# Read Dataset-JSON
adsl = dsj.read_json("adsl.json")
# Pandas DataFrame
print(adsl.head())
adsl.info()  # info() prints its summary itself and returns None
# Access metadata
print(dsj.get_metadata(adsl))
# Write Dataset-JSON
dsj.write_json(adsl, "output.json")
# Validation
dsj.validate("adsl.json")
Features:
- Pandas integration
- Metadata preservation
- Schema validation
- Type handling
Part 4: Advanced Topics (45 min)
Topics:
- Custom metadata
- Large file handling
- Compression options
- Streaming implementations
- API integration patterns
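As an illustration of the compression topic above, the following sketch writes a dataset to a gzip-compressed JSON file and reads it back with the Python standard library. Gzip and the `.json.gz` file name are assumptions chosen for the example, not a statement about what the specification prescribes, and the dataset itself is synthetic.

```python
import gzip
import json
import os
import tempfile

# Synthetic dataset in this page's simplified layout (not real trial data).
dataset = {
    "datasetName": "ADSL",
    "records": 1000,
    "columns": [
        {"name": "USUBJID", "dataType": "string"},
        {"name": "AGE", "dataType": "integer"},
    ],
    "rows": [[f"ABC-{i:04d}", 40 + i % 40] for i in range(1000)],
}

raw = json.dumps(dataset).encode("utf-8")

# Write the compressed file into a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "adsl.json.gz")
with gzip.open(path, "wb") as fh:
    fh.write(raw)

# Read it back in text mode; json.load decompresses transparently via the file object.
with gzip.open(path, "rt", encoding="utf-8") as fh:
    restored = json.load(fh)

compressed_size = os.path.getsize(path)
print(f"uncompressed: {len(raw)} bytes, compressed: {compressed_size} bytes")
```

Repetitive row data tends to compress well, which partly offsets the size disadvantage of a text format relative to binary ones.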
Part 5: Future Directions (30 min)
Discussion:
- Regulatory acceptance timeline
- Tool support roadmap
- Community adoption
- API standards
- Your organization's plans
Practical Exercises
Exercise 1: Read and Explore
Load a Dataset-JSON file and explore its structure.
Tasks:
- Read ADSL dataset
- Examine metadata
- Compare with XPT
- Validate format
Exercise 2: Convert Formats
Convert between Dataset-JSON and other formats.
Tasks:
- XPT → Dataset-JSON
- CSV → Dataset-JSON
- Dataset-JSON → Parquet
- Preserve metadata
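The CSV conversion task can be sketched without any third-party package. The example below is a rough illustration under stated assumptions: the CSV text, the hand-written column definitions, and the `cast` helper are all hypothetical, and the output follows this page's simplified layout rather than the full specification.

```python
import csv
import io
import json

# Hypothetical CSV input; a real exercise would read a file from the workshop materials.
csv_text = "USUBJID,AGE\nABC-001,65\nABC-002,\n"

# CSV carries no types or labels, so the metadata is supplied by hand here.
column_defs = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"},
]

# Assumed mapping from dataType names to Python constructors.
CASTS = {"string": str, "integer": int, "float": float}


def cast(value, data_type):
    """Convert a CSV cell to the declared type; empty cells become None (JSON null)."""
    if value == "":
        return None
    return CASTS[data_type](value)


reader = csv.DictReader(io.StringIO(csv_text))
rows = [[cast(rec[c["name"]], c["dataType"]) for c in column_defs] for rec in reader]

dataset = {
    "datasetName": "ADSL",
    "records": len(rows),
    "columns": column_defs,
    "rows": rows,
}
print(json.dumps(dataset, indent=2))
```

The step worth noticing is that metadata has to come from somewhere: CSV stores everything as text, so types and labels must be declared during conversion if they are to be preserved afterwards.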
Exercise 3: Create from Scratch
Build Dataset-JSON from raw data.
Tasks:
- Define column metadata
- Set data types
- Add labels and descriptions
- Validate output
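One possible shape for this exercise, shown as a hedged sketch: define the column metadata, assemble the dataset, and check each value against its declared type before writing. The `build_dataset` helper and the `PYTHON_TYPES` mapping are hypothetical, and the checks are much weaker than validation against the official schema.

```python
import json

# Assumed mapping from the dataType names listed on this page to Python types.
PYTHON_TYPES = {"string": str, "integer": int, "float": (int, float)}


def build_dataset(name, label, columns, rows):
    """Assemble a dict in this page's simplified layout, checking values against column types."""
    for r, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(f"row {r}: expected {len(columns)} values, got {len(row)}")
        for col, value in zip(columns, row):
            expected = PYTHON_TYPES[col["dataType"]]
            # None is allowed everywhere and serializes to JSON null.
            if value is not None and not isinstance(value, expected):
                raise TypeError(f"row {r}, {col['name']}: {value!r} is not {col['dataType']}")
    return {
        "datasetName": name,
        "datasetLabel": label,
        "records": len(rows),
        "columns": columns,
        "rows": rows,
    }


columns = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 20},
    {"name": "AGE", "label": "Age", "dataType": "integer"},
]
adsl = build_dataset(
    "ADSL",
    "Subject-Level Analysis Dataset",
    columns,
    [["ABC-001", 65], ["ABC-002", None]],
)
print(json.dumps(adsl, indent=2))
```

Failing early with a row and column name in the message makes type mistakes much easier to trace than a schema error reported after the file has been written.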
Exercise 4: API Integration
Serve Dataset-JSON via REST API.
R (Plumber):
library(plumber)
library(datasetjson)  # provides read_dataset_json()
#* Get ADSL dataset
#* @get /datasets/adsl
#* @serializer json
function() {
  read_dataset_json("data/adsl.json")
}
Python (FastAPI):
from fastapi import FastAPI
import datasetjson as dsj
app = FastAPI()
@app.get("/datasets/adsl")
def get_adsl():
    # FastAPI serializes the returned object to JSON
    return dsj.read_json("data/adsl.json")
CDISC Dataset-JSON Specification
Key Components
File Metadata:
- datasetJSONVersion
- fileOID
- datasetName
- datasetLabel
- records (row count)
Column Definitions:
- name
- label
- dataType (string, integer, float, datetime)
- length
- displayFormat
Data Rows:
- Array of arrays (efficient)
- Column order matches definitions
- Null handling
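Because rows are stored as arrays whose order matches the column definitions, turning them into named records is a matter of zipping each row with the column names. A small sketch follows, assuming this page's simplified layout and made-up values; JSON `null` arrives in Python as `None`.

```python
columns = [
    {"name": "USUBJID", "dataType": "string"},
    {"name": "AGE", "dataType": "integer"},
]
# The second subject has a missing AGE, i.e. null in the JSON file.
rows = [["ABC-001", 65], ["ABC-002", None]]

# Column order in each row matches the order of the column definitions.
names = [c["name"] for c in columns]
records = [dict(zip(names, row)) for row in rows]

# Missing values have to be skipped explicitly when computing on a column.
ages = [rec["AGE"] for rec in records if rec["AGE"] is not None]

print(records)
print(ages)
```

Storing names once in `columns` rather than in every row is what keeps the array-of-arrays form compact.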
Additional Metadata:
- Study information
- Variable relationships
- Code lists
- Value-level metadata
Industry Adoption
Current Status
CDISC:
- Published specification
- Active development
- Industry engagement
Tools:
- R package (Atorus)
- Python package (CDISC/Atorus)
- SAS support in development
- Validator tools emerging
Pharma:
- Pilot projects underway
- Evaluation phase
- 2026+ adoption expected
Roadmap
2025:
- Finalize specification v1.0
- Expand tool support
- Pilot submissions
2026+:
- Broader industry adoption
- Regulatory acceptance
- Replace XPT in workflows
Best Practices
Creating Dataset-JSON
Do:
- Include comprehensive metadata
- Use CDISC terminology
- Validate against schema
- Document conventions
- Version your files
Don't:
- Omit critical metadata
- Use inconsistent naming
- Skip validation
- Ignore data types
- Create huge files (split if needed)
Integration
API Design:
- RESTful endpoints
- Pagination for large datasets
- Caching strategies
- Error handling
- Authentication
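Pagination for large datasets could look like the sketch below. It is framework-independent and rests on assumptions: the `paginate` helper and the `paging` block it returns are hypothetical conventions for illustration and are not part of the Dataset-JSON specification.

```python
def paginate(dataset, offset=0, limit=100):
    """Return one page of rows plus paging info, keeping the column metadata."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    total = len(dataset["rows"])
    page_rows = dataset["rows"][offset:offset + limit]
    # nextOffset is None once the last row has been served.
    next_offset = offset + limit if offset + limit < total else None
    return {
        "datasetName": dataset["datasetName"],
        "columns": dataset["columns"],  # repeated so each page stays self-describing
        "records": len(page_rows),
        "rows": page_rows,
        "paging": {
            "offset": offset,
            "limit": limit,
            "total": total,
            "nextOffset": next_offset,
        },
    }


# Synthetic five-row dataset in this page's simplified layout.
dataset = {
    "datasetName": "ADSL",
    "columns": [{"name": "USUBJID", "dataType": "string"}],
    "rows": [[f"ABC-{i:03d}"] for i in range(5)],
}

page = paginate(dataset, offset=2, limit=2)
print(page["rows"], page["paging"])
```

A client can keep requesting pages with the returned `nextOffset` until it is `None`. Repeating `columns` on every page costs a little bandwidth but lets each response be interpreted on its own.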
Storage:
- Compress for archival
- Index for search
- Backup metadata separately
- Version control friendly
Learning Outcomes
- Understand the Dataset-JSON specification
- Read/write Dataset-JSON in R
- Read/write Dataset-JSON in Python
- Convert between formats
- Integrate with APIs
- Plan an adoption strategy
- Contribute to development
Resources
Official:
- CDISC Dataset-JSON specification
- CDISC website and wiki
- Package documentation
Community:
- GitHub repositories
- CDISC Dataset-JSON working group
- Stack Overflow tags
Learning:
- Workshop materials
- Example datasets
- Tutorial notebooks
Getting Involved
Contribute:
- Test packages
- Report issues
- Submit use cases
- Join working groups
Stay Informed:
- CDISC newsletters
- Package releases
- Conference presentations
Similar Workshops
- SDTM Programming - SDTM dataset creation
- Python for CSR - Modern data formats
Next Steps
- For SDTM: See {sdtm.oak} workshop
- Standards evolution: ARS/ARM trends
Last updated: November 2025 | R/Pharma 2025 Conference