Managing Metadata

Metadata in Calypr is formatted using the Fast Healthcare Interoperability Resources (FHIR) schema. If you bring your own FHIR data as newline-delimited JSON, create a directory called “META” in the root of your git-drs repository (the same directory where you initialized it) and place your metadata files there.
The META/ folder contains newline-delimited JSON (.ndjson) files representing FHIR resources that describe the project, its data, and related entities. Large files are tracked using Git LFS, and each data file must correspond to a DocumentReference resource. This standardized structure keeps large research data files and their FHIR metadata under version control in a DRS- and FHIR-compatible format.
Each file must contain only one FHIR resource type; for example, META/ResearchStudy.ndjson contains only ResearchStudy resources. The file name doesn’t have to match the resource type name, with one exception: if you bring your own document references, the file must be named DocumentReference.ndjson. For all other resource types, matching names are simply good practice for organizing your FHIR metadata.
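
The one-resource-type-per-file rule can be checked locally before running any tooling. The following is a minimal sketch, not part of forge; the function names `resource_types` and `is_single_type` are hypothetical and exist only for this illustration.

```python
import json
from collections import Counter


def resource_types(lines):
    """Count the resourceType of each non-blank NDJSON line."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line).get("resourceType")] += 1
    return counts


def is_single_type(lines):
    """True when the file holds at least one resource and all share one resourceType."""
    return len(resource_types(lines)) == 1


# Two in-memory lines standing in for the contents of META/ResearchStudy.ndjson.
sample = [
    '{"resourceType": "ResearchStudy", "id": "study-1"}',
    '{"resourceType": "ResearchStudy", "id": "study-2"}',
]
print(is_single_type(sample))
```

To check a real file, pass an open file handle instead of the list, for example `is_single_type(open("META/ResearchStudy.ndjson", encoding="utf-8"))`.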

META/ResearchStudy.ndjson

  • The root of the file directory structure is based on the first ResearchStudy in the file. The autogenerated document references are connected to this research study. Any additional research studies are ignored when populating the miller table file tree.
  • Contains at least one FHIR ResearchStudy resource describing the project.
  • Defines project identifiers, title, description, and key attributes.

META/DocumentReference.ndjson

  • Contains one FHIR DocumentReference resource per Git LFS-managed file.
  • Each DocumentReference.content.attachment.url field must exactly match the relative path of the corresponding file in the repository.
  • Example:

{
  "resourceType": "DocumentReference",
  "id": "docref-file1",
  "status": "current",
  "content": [
    {
      "attachment": {
        "url": "data/file1.bam",
        "title": "BAM file for Sample X"
      }
    }
  ]
}
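
The example above is pretty-printed for readability, but in an .ndjson file each resource occupies exactly one line. The sketch below shows one way to produce such a line for a given file path. The helper name `document_reference_line` is hypothetical, and it assumes the minimal fields from the example are sufficient for your project; the platform may expect more.

```python
import json


def document_reference_line(path, doc_id, title=None):
    """Serialize a minimal DocumentReference as a single NDJSON line."""
    attachment = {"url": path}  # must match the file's relative path in the repository
    if title:
        attachment["title"] = title
    resource = {
        "resourceType": "DocumentReference",
        "id": doc_id,
        "status": "current",
        "content": [{"attachment": attachment}],
    }
    # json.dumps without an indent argument emits no line breaks.
    return json.dumps(resource)


print(document_reference_line("data/file1.bam", "docref-file1", "BAM file for Sample X"))
```

Appending each returned line, followed by a newline, to META/DocumentReference.ndjson keeps the file in newline-delimited form.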

Place your custom FHIR ndjson files in the META/ directory:

# Copy your prepared FHIR metadata
cp ~/my-data/patients.ndjson META/
cp ~/my-data/observations.ndjson META/
cp ~/my-data/specimens.ndjson META/
cp ~/my-data/document-references.ndjson META/

Other FHIR data

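In addition to ResearchStudy and DocumentReference, the META/ directory can hold files for other FHIR resource types, for example: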

  • Patient.ndjson: Participant records.
  • Specimen.ndjson: Biological specimens.
  • ServiceRequest.ndjson: Requested procedures.
  • Observation.ndjson: Measurements or results.
  • Other valid FHIR resource types as required.

Ensure that your FHIR DocumentReference resources reference the DRS URIs of the files they describe.

Example DocumentReference linking to an S3 file:

{
  "resourceType": "DocumentReference",
  "id": "doc-001",
  "status": "current",
  "content": [{
    "attachment": {
      "url": "drs://calypr-public.ohsu.edu/your-drs-id",
      "title": "sample1.bam",
      "contentType": "application/octet-stream"
    }
  }],
  "subject": {
    "reference": "Patient/patient-001"
  }
}


Validating Metadata

To ensure that the FHIR files you have added to the project are correct and pass schema checking, you can use the forge software.

forge validate

Successful output:

✓ Validating META/patients.ndjson... OK
✓ Validating META/observations.ndjson... OK
✓ Validating META/specimens.ndjson... OK
✓ Validating META/document-references.ndjson... OK
All metadata files are valid.

Fix any validation errors and re-run until all files pass.

Forge Data Quality Assurance Command Line Commands

If you have provided your own FHIR resources, two commands can help ensure that your FHIR metadata will appear on the CALYPR data platform as expected: validate and check-edge.

Validate- Example:

```forge validate META``` or ```forge validate META/DocumentReference.ndjson```

Validate checks whether the provided directory or file will be accepted by the CALYPR data platform, or whether it contains validation errors that would cause it to be rejected. Validation errors range from improper JSON formatting to FHIR schema violations. We currently use FHIR version R5, so resources written for earlier versions will not validate against our schema.

Check-edge- Example:

```forge check-edge META``` or ```forge check-edge META/DocumentReference.ndjson```

Check edge emulates exactly what will happen to your FHIR files during data submission. Your FHIR files will be loaded into a graph database. To build the graph, edges must be generated from the references specified in your FHIR data; these edges connect the vertices, which are essentially the resources in the rest of the NDJSON FHIR files that you have provided.

Check edge aims to ensure that the references specified in the files connect to known vertices and aren’t ‘orphaned’. Check edge does not take into account vertices that already exist in the CALYPR graph, so it may report that certain edges do not connect to anything when they point to vertices that are in CALYPR but outside the data provided for the edge check.
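
To make the idea of orphaned references concrete, here is a minimal sketch of the kind of check described above. It is not forge's implementation: the function names are hypothetical, and it assumes every reference is a relative "ResourceType/id" string stored under a "reference" key.

```python
def iter_references(node):
    """Yield every "reference" string found anywhere inside a nested resource."""
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str):
            yield ref
        for value in node.values():
            yield from iter_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item)


def orphaned_references(resources):
    """Return (source, target) pairs whose target is not among the given resources."""
    known = {f'{r["resourceType"]}/{r["id"]}' for r in resources}  # the vertices
    orphans = []
    for r in resources:
        source = f'{r["resourceType"]}/{r["id"]}'
        for target in iter_references(r):  # the candidate edges
            if target not in known:
                orphans.append((source, target))
    return orphans


resources = [
    {"resourceType": "Patient", "id": "patient-001"},
    {"resourceType": "DocumentReference", "id": "doc-001",
     "subject": {"reference": "Patient/patient-001"}},
    {"resourceType": "Specimen", "id": "specimen-1",
     "subject": {"reference": "Patient/patient-999"}},
]
print(orphaned_references(resources))
```

As with check-edge itself, a target that exists only on the platform and not in the supplied resources would be reported as orphaned by this sketch.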

Validation Process

1. Schema Validation

  • Each .ndjson file in META/ (like ResearchStudy.ndjson, DocumentReference.ndjson, etc.) is read line by line.
  • Every line is parsed as JSON and checked against the corresponding FHIR schema for that resourceType.
  • Syntax errors, missing required fields, or invalid FHIR values trigger clear error messages with line numbers.
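
The line-by-line reading described above can be sketched as follows. This illustration only checks JSON syntax and the presence of resourceType; it does not perform FHIR schema validation, and the function name and message format are hypothetical rather than forge's own.

```python
import json


def check_ndjson_lines(lines, filename="<memory>"):
    """Return one error message per problematic line, with 1-based line numbers."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            resource = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"{filename} line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(resource, dict) or "resourceType" not in resource:
            errors.append(f"{filename} line {number}: missing resourceType")
    return errors


lines = [
    '{"resourceType": "Patient", "id": "p1"}',
    '{"resourceType": "Patient", ',  # truncated JSON
    '{"id": "p3"}',                  # valid JSON, but no resourceType
]
for message in check_ndjson_lines(lines, "META/Patient.ndjson"):
    print(message)
```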

2. Mandatory Files Presence

  • Confirms that ResearchStudy.ndjson exists and has at least one valid record.
  • Confirms that DocumentReference.ndjson exists and contains at least one record.
  • If either is missing or empty, validation fails.
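
A presence check like the one above is straightforward to reproduce in a script or CI job. The following is a minimal sketch with a hypothetical function name; it only tests that each file exists and has at least one non-blank line, and does not check that the records are valid.

```python
from pathlib import Path

REQUIRED = ("ResearchStudy.ndjson", "DocumentReference.ndjson")


def missing_required(meta_dir):
    """List the required files that are missing or contain no records."""
    problems = []
    for name in REQUIRED:
        path = Path(meta_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif not any(line.strip() for line in path.read_text(encoding="utf-8").splitlines()):
            problems.append(f"{name}: empty")
    return problems


if __name__ == "__main__":
    # Assumes the script is run from the repository root, next to META/.
    print(missing_required("META"))
```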

3. One-to-One Mapping of Files to DocumentReference

  • Scans the working directory for Git LFS-managed files in expected locations (e.g., data/).
  • For each file, locates a corresponding DocumentReference resource whose content.attachment.url matches the file’s relative path.
  • Validates that all LFS files have a matching DocumentReference.
  • Validates that all DocumentReferences point to existing files.
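
The two-way comparison above amounts to a set difference in each direction. The sketch below illustrates it with hypothetical function names; it takes the list of LFS-managed paths as input rather than discovering them itself.

```python
def attachment_urls(resources):
    """Yield content.attachment.url from each DocumentReference-like dict."""
    for resource in resources:
        for content in resource.get("content", []):
            url = content.get("attachment", {}).get("url")
            if url:
                yield url


def compare_files_to_docrefs(lfs_paths, docref_urls):
    """Report mismatches in both directions between files and DocumentReferences."""
    files, urls = set(lfs_paths), set(docref_urls)
    return {
        "files_without_docref": sorted(files - urls),
        "docrefs_without_file": sorted(urls - files),
    }


docrefs = [
    {"resourceType": "DocumentReference", "id": "docref-file1",
     "content": [{"attachment": {"url": "data/file1.bam"}}]},
    {"resourceType": "DocumentReference", "id": "docref-file2",
     "content": [{"attachment": {"url": "data/some_missing.bam"}}]},
]
lfs_files = ["data/file1.bam", "data/file3.bam"]
print(compare_files_to_docrefs(lfs_files, attachment_urls(docrefs)))
```

One possible source for the path list is the output of `git lfs ls-files --name-only`, split into lines; whether forge obtains the list this way is not stated here.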

4. Project-level Referential Checks

  • Validates that DocumentReference resources reference the same ResearchStudy via relatesTo or other linking mechanisms.
  • If FHIR resources like Patient, Specimen, ServiceRequest, or Observation are present, ensures that their id fields are unique.
  • Also ensures that DocumentReference correctly refers to those resources (e.g., via subject or related fields).
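
The id uniqueness requirement can be illustrated with a short sketch. It assumes ids need to be unique within each resource type, which is an interpretation rather than something stated above, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_ids(resources):
    """Return (resourceType, id) pairs that occur more than once."""
    counts = Counter(
        (r["resourceType"], r["id"])
        for r in resources
        if "resourceType" in r and "id" in r
    )
    return [key for key, count in counts.items() if count > 1]


resources = [
    {"resourceType": "Patient", "id": "patient-001"},
    {"resourceType": "Patient", "id": "patient-001"},   # duplicate within Patient
    {"resourceType": "Specimen", "id": "patient-001"},  # same id, different type
]
print(duplicate_ids(resources))
```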

5. Cross-Entity Consistency

  • If multiple optional FHIR .ndjson files exist, confirms that IDs referenced in one file exist in the others.
  • Detects dangling references (e.g., a DocumentReference subject that points to a Patient ID that's not in Patient.ndjson).

✅ Example Error Output

ERROR META/DocumentReference.ndjson line 4: url "data/some_missing.bam" does not resolve to an existing file
ERROR META/Specimen.ndjson line 2: id "specimen-123" referenced in Observation.ndjson but not defined


🎯 Purpose & Benefits

  • Ensures all files and metadata are in sync before submission.
  • Prevents submission failures due to missing pointers or invalid FHIR payloads.
  • Enables CI integration, catching issues early in the development workflow.

Validation Requirements

Automated tools or CI processes must:

  • Verify presence of META/ResearchStudy.ndjson with at least one record.
  • Verify presence of META/DocumentReference.ndjson with one record per LFS-managed file.
  • Confirm every DocumentReference.content.attachment.url matches an existing file path.
  • Check proper .ndjson formatting.