datasetjson: Read and Write CDISC Dataset JSON

Modern data exchange format for clinical trials in R and Python

CDISC
Data Exchange
Intermediate
Authors

Michael Stackhouse (Chief Innovation Officer, Atorus)

Sam Hume (Research Data Engineer, CDISC)

Nick Masel (Associate Director, Johnson & Johnson)

Eli Miller (Senior Manager, Atorus)

Overview


Join us for a hands-on workshop on Dataset-JSON, the CDISC format for exchanging clinical trial datasets. Learn where Dataset-JSON is preferable to formats like Parquet and XPT, explore the specification in detail, and implement it in both R and Python.

What You’ll Learn

  • 📋 Dataset-JSON specification - Format details
  • 🆚 Comparison - vs Parquet, XPT, and other formats
  • 🐍 R and Python - Implementation in both languages
  • 🔄 Adoption plans - Industry roadmap
  • 🚀 API integration - Future developments
  • 🔧 Practical implementation - Real-world usage

Prerequisites

Required Knowledge:

  • Basic R or Python programming
  • Understanding of clinical trial data
  • Familiarity with CDISC standards helpful

For Hands-on:

  • Laptop with R or Python installed
  • Workshop materials (provided)

Key Tools

{datasetjson} (R)

datasetjson (Python)

CDISC Dataset-JSON

Workshop Materials

Resources

What is Dataset-JSON?

CDISC Standard Format

Dataset-JSON is an emerging CDISC standard for representing clinical trial datasets in JSON format.

Key Features:

  • 📊 Self-describing - Metadata included
  • 🔍 Human-readable - Text-based format
  • 🌐 Web-friendly - Native JSON support
  • 🔗 Linked data - Relationships preserved
  • ✅ Validated - Schema-based validation

Example Structure

{
  "datasetJSONVersion": "1.0.0",
  "fileOID": "example.adsl",
  "datasetName": "ADSL",
  "datasetLabel": "Subject-Level Analysis Dataset",
  "records": 254,
  "columns": [
    {
      "name": "USUBJID",
      "label": "Unique Subject Identifier",
      "dataType": "string",
      "length": 20
    },
    {
      "name": "AGE",
      "label": "Age",
      "dataType": "integer"
    }
  ],
  "rows": [
    ["ABC-001", 65],
    ["ABC-002", 72]
  ]
}
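
Because rows are positional, a reader typically pairs each row with the column names before doing anything else. The following is a minimal sketch using only Python's standard library, assuming the field names shown in the example above; `rows_to_records` is a hypothetical helper, not a function from any package.

```python
import json

# The example document from above, trimmed to the fields this sketch uses.
DOC = """
{
  "datasetName": "ADSL",
  "records": 2,
  "columns": [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 20},
    {"name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [["ABC-001", 65], ["ABC-002", 72]]
}
"""


def rows_to_records(dataset: dict) -> list[dict]:
    """Pair each positional row with the column names, in column order."""
    names = [col["name"] for col in dataset["columns"]]
    return [dict(zip(names, row)) for row in dataset["rows"]]


dataset = json.loads(DOC)
records = rows_to_records(dataset)
print(records[0])  # {'USUBJID': 'ABC-001', 'AGE': 65}
```

Keeping the column definitions separate from the rows is what makes the format compact: names and labels are stored once instead of once per record.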

Why Dataset-JSON?

Advantages Over Other Formats

vs XPT (SAS Transport)

Feature | Dataset-JSON | XPT
Human-readable | ✅ Yes | ❌ No
Self-describing | ✅ Yes | ⚠️ Limited
Web APIs | ✅ Native | ❌ Poor
Modern tools | ✅ Excellent | ⚠️ Legacy
Size efficiency | ⚠️ Larger | ✅ Smaller

vs Parquet

Feature | Dataset-JSON | Parquet
CDISC standard | ✅ Yes | ❌ No
Metadata | ✅ Rich | ⚠️ Basic
Human-readable | ✅ Yes | ❌ Binary
Query performance | ⚠️ Slower | ✅ Very fast
Interoperability | ✅ Excellent | ⚠️ Tool-specific

Use Cases

Ideal for:

  • ✅ Regulatory submissions
  • ✅ Data exchange between organizations
  • ✅ Web APIs and services
  • ✅ Documentation and review
  • ✅ Long-term archival

Not ideal for:

  • ❌ Big data analytics (use Parquet)
  • ❌ Real-time processing
  • ❌ Embedded systems

Workshop Structure

Part 1: Introduction (30 min)

Topics:

  • Dataset-JSON motivation
  • Format specification walkthrough
  • Comparison with alternatives
  • Industry adoption status

Part 2: R Implementation (45 min)

Hands-on with R package:

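library(datasetjson)

# Read Dataset-JSON
adsl <- read_dataset_json("adsl.json")

# The result behaves like a standard R data frame
head(adsl)
str(adsl)

# Access metadata
# Attribute names can differ between package versions, so list them first
names(attributes(adsl))
attributes(adsl)$column_labels
attributes(adsl)$column_types

# Write Dataset-JSON
write_dataset_json(adsl, "output.json")

# Validate the file against the Dataset-JSON schema
validate_dataset_json("adsl.json")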

Features:

  • Read/write Dataset-JSON files
  • Automatic metadata handling
  • Schema validation
  • Integration with tidyverse

Part 3: Python Implementation (45 min)

Hands-on with Python package:

import datasetjson as dsj
import pandas as pd

# Read Dataset-JSON
adsl = dsj.read_json("adsl.json")

# Pandas DataFrame
print(adsl.head())
adsl.info()  # info() prints its own summary and returns None

# Access metadata
print(dsj.get_metadata(adsl))

# Write Dataset-JSON
dsj.write_json(adsl, "output.json")

# Validation
dsj.validate("adsl.json")

Features:

  • Pandas integration
  • Metadata preservation
  • Schema validation
  • Type handling

Part 4: Advanced Topics (45 min)

Topics:

  • Custom metadata
  • Large file handling
  • Compression options
  • Streaming implementations
  • API integration patterns
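
For large files, one streaming approach is a newline-delimited layout that can be read one row at a time instead of loading the whole document. The sketch below assumes a layout in which the first line holds the metadata object and every later line holds one row array; that layout, and the helper name `stream_rows`, are assumptions for illustration, so check the specification for the exact rules before relying on it.

```python
import io
import json
from typing import Iterator, TextIO


def stream_rows(handle: TextIO) -> tuple[dict, Iterator[list]]:
    """Read metadata from the first line, then yield rows lazily."""
    metadata = json.loads(handle.readline())

    def rows() -> Iterator[list]:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)

    return metadata, rows()


# In-memory stand-in for a file on disk (assumed layout, see lead-in).
sample = io.StringIO(
    '{"datasetName": "ADSL", "columns": [{"name": "USUBJID"}, {"name": "AGE"}]}\n'
    '["ABC-001", 65]\n'
    '["ABC-002", 72]\n'
)
metadata, row_iter = stream_rows(sample)
ages = [row[1] for row in row_iter]
print(metadata["datasetName"], ages)  # ADSL [65, 72]
```

Only one row is parsed at a time, so memory use stays roughly constant regardless of the number of records.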

Part 5: Future Directions (30 min)

Discussion:

  • Regulatory acceptance timeline
  • Tool support roadmap
  • Community adoption
  • API standards
  • Your organization’s plans

Practical Exercises

Exercise 1: Read and Explore

Load a Dataset-JSON file and explore its structure.

Tasks:

  • Read ADSL dataset
  • Examine metadata
  • Compare with XPT
  • Validate format

Exercise 2: Convert Formats

Convert between Dataset-JSON and other formats.

Tasks:

  • XPT → Dataset-JSON
  • CSV → Dataset-JSON
  • Dataset-JSON → Parquet
  • Preserve metadata
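
The CSV conversion is a useful illustration of why metadata matters: CSV stores neither labels nor types, so both have to be supplied by the converter. Here is a rough sketch with the standard library only, assuming the field names of the example structure shown earlier; `COLUMNS`, `CASTS`, and `csv_to_dataset` are hypothetical names, not part of any package.

```python
import csv
import io
import json

# Hypothetical column metadata; CSV has no place to store labels or types.
COLUMNS = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
    {"name": "AGE", "label": "Age", "dataType": "integer"},
]
CASTS = {"string": str, "integer": int, "float": float}


def csv_to_dataset(text: str, name: str, label: str, columns: list[dict]) -> dict:
    """Build a dict shaped like the example structure from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        row = []
        for col in columns:
            raw = record[col["name"]]
            # Empty CSV cells become JSON null.
            row.append(None if raw == "" else CASTS[col["dataType"]](raw))
        rows.append(row)
    return {
        "datasetJSONVersion": "1.0.0",
        "datasetName": name,
        "datasetLabel": label,
        "records": len(rows),
        "columns": columns,
        "rows": rows,
    }


csv_text = "USUBJID,AGE\nABC-001,65\nABC-002,\n"
dataset = csv_to_dataset(csv_text, "ADSL", "Subject-Level Analysis Dataset", COLUMNS)
print(json.dumps(dataset["rows"]))  # [["ABC-001", 65], ["ABC-002", null]]
```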

Exercise 3: Create from Scratch

Build a Dataset-JSON file from raw data.

Tasks:

  • Define column metadata
  • Set data types
  • Add labels and descriptions
  • Validate output
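
One possible way to approach these tasks without any package is to model each column as a small record and derive the row count from the data. This is only a sketch under the assumption that the output should match the example structure shown earlier; the `Column` class and `build_dataset` function are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Column:
    """One column definition; field names mirror the JSON keys."""
    name: str
    label: str
    dataType: str
    length: Optional[int] = None

    def to_json(self) -> dict:
        # Leave out optional fields that were not set.
        return {k: v for k, v in asdict(self).items() if v is not None}


def build_dataset(name: str, label: str, columns: list[Column], rows: list[list]) -> dict:
    """Assemble a dict shaped like the example structure."""
    return {
        "datasetJSONVersion": "1.0.0",
        "datasetName": name,
        "datasetLabel": label,
        "records": len(rows),  # derived, so it cannot drift from the data
        "columns": [col.to_json() for col in columns],
        "rows": rows,
    }


columns = [
    Column("USUBJID", "Unique Subject Identifier", "string", length=20),
    Column("AGE", "Age", "integer"),
]
dataset = build_dataset(
    "ADSL", "Subject-Level Analysis Dataset", columns, [["ABC-001", 65], ["ABC-002", 72]]
)

# Round-trip through JSON text as a basic sanity check on the output.
text = json.dumps(dataset, indent=2)
assert json.loads(text) == dataset
print(dataset["records"])  # 2
```

A round-trip check like this only confirms that the output is well-formed JSON; it does not replace validation against the published schema.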

Exercise 4: API Integration

Serve Dataset-JSON via a REST API.

R (Plumber):

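library(plumber)

#* Get ADSL dataset
#* @get /datasets/adsl
#* @serializer contentType list(type="application/json")
function() {
  # Return the file bytes unchanged; the default json serializer would
  # re-encode the data frame and lose the Dataset-JSON structure
  path <- "data/adsl.json"
  readBin(path, "raw", n = file.info(path)$size)
}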

Python (FastAPI):

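from fastapi import FastAPI
from fastapi.responses import FileResponse

app = FastAPI()

@app.get("/datasets/adsl")
def get_adsl():
    # Serve the file as-is; FastAPI cannot serialize a pandas DataFrame
    # return value directly
    return FileResponse("data/adsl.json", media_type="application/json")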

CDISC Dataset-JSON Specification

Key Components

File Metadata:

  • datasetJSONVersion
  • fileOID
  • datasetName
  • datasetLabel
  • records (row count)

Column Definitions:

  • name
  • label
  • dataType (e.g., string, integer, float, datetime)
  • length
  • displayFormat

Data Rows:

  • Array of arrays (efficient)
  • Column order matches definitions
  • Null handling
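
The row rules above (positional values, column order matching the definitions, nulls for missing values) can be checked with a few lines of code. The sketch below is a lightweight structural check, not schema validation; the `TYPE_CHECKS` table and `find_problems` function are hypothetical, and the type names are the ones listed under Column Definitions.

```python
import json

# Hypothetical per-type checks; "datetime" is only checked to be text here.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "datetime": lambda v: isinstance(v, str),
}


def find_problems(dataset: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    columns = dataset["columns"]
    for i, row in enumerate(dataset["rows"]):
        if len(row) != len(columns):
            problems.append(f"row {i}: expected {len(columns)} values, got {len(row)}")
            continue
        for col, value in zip(columns, row):
            if value is None:  # JSON null marks a missing value
                continue
            check = TYPE_CHECKS.get(col["dataType"])
            if check is not None and not check(value):
                problems.append(f"row {i}: {col['name']} is not {col['dataType']}")
    return problems


doc = json.loads("""
{
  "columns": [
    {"name": "USUBJID", "dataType": "string"},
    {"name": "AGE", "dataType": "integer"}
  ],
  "rows": [["ABC-001", 65], ["ABC-002", null], ["ABC-003", "72"], ["ABC-004"]]
}
""")
problems = find_problems(doc)
print(problems)
# ['row 2: AGE is not integer', 'row 3: expected 2 values, got 1']
```

Note that the null in the second row is accepted, while the quoted "72" in the third row is flagged because it is text in an integer column.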

Additional Metadata:

  • Study information
  • Variable relationships
  • Code lists
  • Value-level metadata

Industry Adoption

Current Status

CDISC:

  • ✅ Published specification
  • 🔄 Active development
  • 📒 Industry engagement

Tools:

  • ✅ R package (Atorus)
  • ✅ Python package (CDISC/Atorus)
  • 🔄 SAS support in development
  • 🔄 Validator tools emerging

Pharma:

  • 🔍 Pilot projects underway
  • 📋 Evaluation phase
  • 🎯 2026+ adoption expected

Roadmap

2025:

  • Finalize specification v1.0
  • Expand tool support
  • Pilot submissions

2026+:

  • Broader industry adoption
  • Regulatory acceptance
  • Replace XPT in workflows

Best Practices

Creating Dataset-JSON

✅ Do:

  • Include comprehensive metadata
  • Use CDISC terminology
  • Validate against schema
  • Document conventions
  • Version your files

❌ Don’t:

  • Omit critical metadata
  • Use inconsistent naming
  • Skip validation
  • Ignore data types
  • Create huge files (split if needed)

Integration

API Design:

  • RESTful endpoints
  • Pagination for large datasets
  • Caching strategies
  • Error handling
  • Authentication
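
Pagination for large datasets can be sketched independently of any web framework: slice the rows, keep the column definitions, and report where the next page starts. The function below is a hypothetical illustration that assumes the field names of the example structure; whether `records` should report the page size or the total is a design choice an API would need to document.

```python
from typing import Optional


def paginate(dataset: dict, offset: int = 0, limit: int = 100) -> tuple[dict, Optional[int]]:
    """Return one page of the dataset and the offset of the next page (or None)."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    total = len(dataset["rows"])
    page_rows = dataset["rows"][offset : offset + limit]
    # Here "records" reports the number of rows in this page, not the total.
    page = {**dataset, "records": len(page_rows), "rows": page_rows}
    next_offset = offset + limit if offset + limit < total else None
    return page, next_offset


dataset = {
    "datasetName": "ADSL",
    "records": 5,
    "columns": [{"name": "USUBJID", "dataType": "string"}],
    "rows": [[f"ABC-{i:03d}"] for i in range(1, 6)],
}

offset = 0
pages = []
while offset is not None:
    page, offset = paginate(dataset, offset=offset, limit=2)
    pages.append(page["rows"])
print(pages)
# [[['ABC-001'], ['ABC-002']], [['ABC-003'], ['ABC-004']], [['ABC-005']]]
```

Each page repeats the column definitions, so a client can interpret any page on its own.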

Storage:

  • Compress for archival
  • Index for search
  • Backup metadata separately
  • Version control friendly
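
Compression for archival is straightforward because the format is plain text. As a small sketch, the standard-library `gzip` module can compress a serialized dataset and restore it unchanged; the dataset here is synthetic and exists only to show the round trip.

```python
import gzip
import json

# Synthetic dataset for illustration only.
dataset = {
    "datasetName": "ADSL",
    "columns": [
        {"name": "USUBJID", "dataType": "string"},
        {"name": "AGE", "dataType": "integer"},
    ],
    "rows": [[f"ABC-{i:03d}", 40 + i % 50] for i in range(1, 1001)],
}
dataset["records"] = len(dataset["rows"])

# Compact separators avoid spending bytes on whitespace before compression.
raw = json.dumps(dataset, separators=(",", ":")).encode("utf-8")
packed = gzip.compress(raw)

# Decompress and parse to confirm nothing was lost.
restored = json.loads(gzip.decompress(packed))
print(f"{len(raw)} bytes raw, {len(packed)} bytes compressed")
```

The actual ratio depends on the data; repetitive text such as subject identifiers tends to compress well.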

Learning Outcomes

✅ Understand Dataset-JSON specification
✅ Read/write Dataset-JSON in R
✅ Read/write Dataset-JSON in Python
✅ Convert between formats
✅ Integrate with APIs
✅ Plan adoption strategy
✅ Contribute to development

Resources

Official:

  • CDISC Dataset-JSON specification
  • CDISC website and wiki
  • Package documentation

Community:

  • GitHub repositories
  • CDISC Dataset-JSON working group
  • Stack Overflow tags

Learning:

  • Workshop materials
  • Example datasets
  • Tutorial notebooks

Getting Involved

Contribute:

  • Test packages
  • Report issues
  • Submit use cases
  • Join working groups

Stay Informed:

  • CDISC newsletters
  • Package releases
  • Conference presentations



Last updated: November 2025 | R/Pharma 2025 Conference