Step 2: Curating a BIDS Dataset#
Curation of neuroimaging datasets is a critical step in ensuring that your data are well-organized, accurate, and ready for processing by standardized pipelines. CuBIDS [Covitz et al., 2022] is a specialized tool designed to facilitate the curation of BIDS datasets by automatically identifying and flagging potential issues. Here’s why this step is essential:
**Quality Assurance and Consistency**

- **Standardization:** CuBIDS verifies that every element of your dataset adheres to the BIDS specification. This includes checking file naming, folder structures, and associated metadata, which helps maintain consistency across large, multi-site, or multi-session datasets.
- **Error Detection:** By automatically identifying inconsistencies, missing files, or mislabeled data, CuBIDS allows you to address problems early, preventing errors from propagating through your data processing pipelines.

**Enhanced Reproducibility**

- **Reliable Analyses:** A curated dataset minimizes the risk of errors that could compromise your neuroimaging analyses. When all files are in a predictable, standardized format, downstream tools like fMRIPrep or QSIPrep can operate more reliably, ultimately leading to reproducible results.
- **Transparent Documentation:** Automated curation provides a record of detected issues and the corrections applied. This documentation is crucial for ensuring transparency in your research methods and for facilitating reproducibility by other researchers.

**Facilitating Collaboration and Data Sharing**

- **Interoperability:** Standardized datasets are easier to share with collaborators and integrate into multi-site studies. When everyone adheres to the same conventions, the barrier to entry for using and analyzing shared data is greatly reduced.
- **User Confidence:** Well-curated data build confidence among lab members and external collaborators, ensuring that analyses are based on high-quality, standardized inputs.
Step 1: Let’s look at our data#
For the rest of this workshop (unless you want to look at your own data), we will be using a different dataset that I have pre-BIDSified. It can be found in `${SHARED_DATA_DIR}/grmpy_bids/bids_orig`. I only made a few modifications after DICOM conversion:

- Deleting the original non-defaced anatomicals and DICOMs (for privacy)
- Removing some protected health information (PHI; for privacy)
- Alphabetizing the JSON metadata fields (for convenience/readability)
- Adding some information to the `dataset_description.json` and `README`
- Adding participant names to `participants.tsv` (false information, for privacy)
The first thing you will do is create your own copy (not a symlink) of the data, then point the `BIDS` environment variable at it:
cp -r $SHARED_DATA_DIR/grmpy_bids/bids_orig /path/where/you/want/bids_orig
export BIDS=/path/where/you/want/bids_orig/
From here we will essentially be following the great CuBIDS documentation.
We will be using DataLad to version control our dataset. It is a pretty amazing tool. But before we version control everything, let's confirm that there is no PHI we should remove. We can run the following, which prints all available JSON metadata fields:
cubids print-metadata-fields $BIDS
which returns:
Loading BIDS schema version: 0.11.3. BIDS version: 1.10.0 Acknowledgments AcquisitionMatrixPE AcquisitionNumber AcquisitionTime Authors B0FieldIdentifier B0FieldSource BIDSVersion BandwidthPerPixelPhaseEncode BaseResolution BidsGuess BodyPartExamined CoilCombinationMethod ConsistencyInfo ConversionSoftware ConversionSoftwareVersion DatasetDOI Dcm2bidsVersion DerivedVendorReportedEchoSpacing DeviceSerialNumber DiffusionScheme DwellTime EchoNumber EchoTime EchoTime1 EchoTime2 EchoTrainLength EffectiveEchoSpacing FlipAngle Funding HowToAcknowledge ImageComments ImageOrientationPatientDICOM ImageOrientationText ImageType ImagingFrequency InPlanePhaseEncodingDirectionDICOM InstitutionAddress InstitutionName InstitutionalDepartmentName IntendedFor InversionTime License MRAcquisitionType MagneticFieldStrength Manufacturer ManufacturersModelName MatrixCoilMode Modality MultibandAccelerationFactor Name NonlinearGradientCorrection ParallelReductionFactorInPlane PartialFourier PatientName PatientPosition PercentPhaseFOV PercentSampling PhaseEncodingDirection PhaseEncodingSteps PhaseResolution PixelBandwidth ProcedureStepDescription ProtocolName PulseSequenceDetails ReceiveCoilActiveElements ReceiveCoilName ReconMatrixPE RefLinesPE ReferencesAndLinks RepetitionTime SAR ScanOptions ScanningSequence SequenceName SequenceVariant SeriesDescription SeriesNumber ShimSetting SliceThickness SliceTiming SoftwareVersions SpacingBetweenSlices SpoilingState StationName TaskName TotalReadoutTime TxRefAmp VariableFlipAngleFlag WipMemBlock age group sex
Note
The lowercase age and sex are from the participants.json file.
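Conceptually, this command amounts to collecting the union of top-level keys across every JSON file in the dataset. A minimal Python sketch of that idea (the function name is illustrative, not CuBIDS internals, and CuBIDS itself does more, such as loading the BIDS schema):

```python
import json
from pathlib import Path

def print_metadata_fields(bids_dir):
    """Collect the union of top-level keys across every JSON sidecar."""
    fields = set()
    for sidecar in Path(bids_dir).rglob("*.json"):
        with open(sidecar) as f:
            metadata = json.load(f)
        if isinstance(metadata, dict):
            fields.update(metadata.keys())
    return sorted(fields)
```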
For this exercise, I added a fake PatientName to one of the files, which is DEFINITELY PHI that should be removed. We can remove it with:
cubids remove-metadata-fields $BIDS --fields PatientName
Rerunning the print-metadata-fields command now should reveal that PatientName is no longer present.
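Under the hood, removing a field means visiting every sidecar, deleting the key if present, and rewriting the file. A hedged Python sketch of that logic (illustrative only; the real command also integrates with DataLad and the CuBIDS config):

```python
import json
from pathlib import Path

def remove_metadata_fields(bids_dir, fields):
    """Delete the named keys from every JSON sidecar under bids_dir."""
    for sidecar in Path(bids_dir).rglob("*.json"):
        metadata = json.loads(sidecar.read_text())
        if isinstance(metadata, dict) and any(k in metadata for k in fields):
            for k in fields:
                metadata.pop(k, None)  # only rewrite files that had the key
            sidecar.write_text(json.dumps(metadata, indent=4, sort_keys=True))
```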
Step 2: Check the Dataset into DataLad#
Now we are going to create a version-controlled version of our dataset, starting with initializing an empty directory:
datalad create -c text2git /path/where/you/want/BIDS_Dataset_DataLad
Now let’s copy our files from the original BIDS directory to the tracked version:
cp -r $BIDS/* /path/where/you/want/BIDS_Dataset_DataLad
When that’s done we can safely remove our original BIDS data and redefine BIDS to be our tracked version moving forward:
rm -rf $BIDS/
export BIDS=/path/where/you/want/BIDS_Dataset_DataLad
We can see that in our tracked version, we now have new data that is waiting to be saved. cd $BIDS && datalad status returns:
untracked: CHANGES (file)
untracked: README (file)
untracked: dataset_description.json (file)
untracked: participants.json (file)
untracked: participants.tsv (file)
untracked: sub-20017 (directory)
untracked: sub-20153 (directory)
untracked: sub-20326 (directory)
untracked: sub-20461 (directory)
untracked: sub-20589 (directory)
untracked: sub-20649 (directory)
untracked: sub-20809 (directory)
untracked: sub-20888 (directory)
untracked: sub-20963 (directory)
untracked: sub-21016 (directory)
untracked: task-fracback_acq-singleband_events.tsv (file)
Let’s save these initial versions of our data now:
datalad save -d $BIDS -m "checked dataset into datalad"
We can see this is now tracked when we run cd $BIDS && git log:
commit 3f1d126550fe8ccec2d143ee3c8347e64f807aa4 (HEAD -> master)
Author: Steven Meisler <steven.meisler@pennmedicine.upenn.edu>
Date: Thu Mar 13 14:28:27 2025 -0400
checked dataset into datalad
commit 91e457485db085252ee7a1d04dc010393537f24d
Author: Steven Meisler <steven.meisler@pennmedicine.upenn.edu>
Date: Thu Mar 13 14:23:29 2025 -0400
Instruct annex to add text files to Git
commit e6ebaab0c0740518425091398d5cf940b41618ff
Author: Steven Meisler <steven.meisler@pennmedicine.upenn.edu>
Date: Thu Mar 13 14:23:27 2025 -0400
[DATALAD] new dataset
Step 3: Add NIFTI data to the JSONS#
Next, we seek to add more image parameters to our sidecars so that we can better define our Entity Sets. Historically, only a subset of parameters in the NIfTI image header have been included in a BIDS sidecar. Parameters such as image dimensions, number of volumes, image obliquity, and voxel sizes are all important data that can change how our pipelines will eventually run!
To add them to the sidecar metadata, run:
cubids add-nifti-info $BIDS --use-datalad
The --use-datalad flag saves the changes for you! Now we can see the following fields were added to our image files:
Dim1Size
Dim2Size
Dim3Size
NumVolumes
Obliquity
VoxelSizeDim1
VoxelSizeDim2
VoxelSizeDim3
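These values come from the NIfTI image itself rather than the DICOM-derived sidecar. As a rough sketch of where two of them come from: voxel sizes are the Euclidean norms of the affine's spatial columns, and an image counts as oblique when those columns are tilted away from the scanner's cardinal axes. A stdlib-only Python illustration (function names and tolerance are my own, not CuBIDS internals; CuBIDS reads the header via nibabel):

```python
import math

def voxel_sizes(affine):
    """Voxel size along each axis = norm of the affine's first three columns."""
    return [math.sqrt(sum(affine[i][j] ** 2 for i in range(3))) for j in range(3)]

def is_oblique(affine, tol=1e-4):
    """True if any voxel axis is tilted off the scanner's cardinal axes,
    i.e. a normalized affine column has more than one nonzero component."""
    sizes = voxel_sizes(affine)
    for j in range(3):
        col = [abs(affine[i][j]) / sizes[j] for i in range(3)]
        if max(col) < 1 - tol:  # largest component is ~1 when axis-aligned
            return True
    return False
```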
Step 4: Run CuBIDS Validation#
CuBIDS uses the BIDS validator output to create tabular summaries of validation errors and warnings. We already know our dataset here is valid, and we are not concerned with any of the warnings, but this is still good to have.
cubids validate $BIDS v0
This validates the data and creates validation summaries in $BIDS/code/CuBIDS/.
If there were errors (e.g., a scan that was stopped early and is missing too many volumes) and you wanted to safely purge that scan from the dataset while updating the IntendedFor fields accordingly, refer to the cubids purge function documented here.
Now that we added the v0 validation summaries to the code directory, we have to datalad save our status before we proceed:
datalad save -d $BIDS -m "Ran v0 validation"
Step 5: Check for Variant Groups#
Using the metadata of the images, CuBIDS will check for images that have variants in fields such as voxel dimensions, repetition times, and other important fields. To do that, we run:
cubids group $BIDS v0
Now let’s look at the outputs with cat $BIDS/code/CuBIDS/v0_AcqGroupInfo.txt:
1 9 ('datatype-anat_reconstruction-defaced_suffix-T1w', 1) ('datatype-anat_reconstruction-defaced_suffix-T2w', 1) ('datatype-dwi_direction-AP_suffix-dwi_acquisition-multiband', 1) ('datatype-fmap_direction-PA_fmap-epi_suffix-epi_acquisition-multiband', 1) ('datatype-fmap_fmap-magnitude1_suffix-magnitude1', 1) ('datatype-fmap_fmap-magnitude2_suffix-magnitude2', 1) ('datatype-fmap_fmap-phasediff_suffix-phasediff', 1) ('datatype-func_suffix-bold_task-fracback_acquisition-singleband', 1) ('datatype-func_suffix-bold_task-rest_acquisition-multiband', 1) ('datatype-func_suffix-bold_task-rest_acquisition-singleband', 1)
2 1 ('datatype-anat_reconstruction-defaced_suffix-T1w', 1) ('datatype-anat_reconstruction-defaced_suffix-T2w', 1) ('datatype-dwi_direction-AP_suffix-dwi_acquisition-multiband', 1) ('datatype-fmap_direction-PA_fmap-epi_suffix-epi_acquisition-multiband', 1) ('datatype-fmap_fmap-magnitude1_suffix-magnitude1', 1) ('datatype-fmap_fmap-magnitude2_suffix-magnitude2', 2) ('datatype-fmap_fmap-phasediff_suffix-phasediff', 2) ('datatype-func_suffix-bold_task-fracback_acquisition-singleband', 2) ('datatype-func_suffix-bold_task-rest_acquisition-multiband', 2) ('datatype-func_suffix-bold_task-rest_acquisition-singleband', 2)
We can see that there are two acquisition groups in this data: a dominant group with 9 subjects, and another group with one subject that had variant BOLD runs and variant magnitude2 and phasediff fieldmap files. We see with cat $BIDS/code/CuBIDS/v0_AcqGrouping.tsv that the subject is sub-20017.
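If you wanted to tabulate this programmatically rather than eyeballing the TSV, a short Python sketch could map each acquisition group to its subjects. The column names (`subject`, `AcqGroup`) are assumptions here; check the header of your own `v0_AcqGrouping.tsv`:

```python
import csv
from collections import defaultdict

def subjects_per_acq_group(acq_grouping_tsv):
    """Map each acquisition group number to the subjects it contains.
    Column names ('subject', 'AcqGroup') are assumptions -- verify
    against the actual TSV header."""
    groups = defaultdict(list)
    with open(acq_grouping_tsv) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            groups[row["AcqGroup"]].append(row["subject"])
    return dict(groups)
```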
How were these files different?
Looking at cat $BIDS/code/CuBIDS/v0_summary.tsv, we see that the variants have names:
datatype-fmap_fmap-magnitude2_suffix-magnitude2_acquisition-VARIANTMultibandAccelerationFactor2
datatype-fmap_fmap-phasediff_suffix-phasediff_acquisition-VARIANTMultibandAccelerationFactor2
datatype-func_suffix-bold_task-fracback_acquisition-singlebandVARIANTObliquityFalse
datatype-func_suffix-bold_task-rest_acquisition-multibandVARIANTObliquityFalse
datatype-func_suffix-bold_task-rest_acquisition-singlebandVARIANTObliquityFalse
This implies that our variant subject’s fieldmaps had different Multiband Acceleration Factors and BOLD files that were not acquired obliquely.
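These entity-set names follow a simple pattern: fields separated by underscores, each a key and value joined by the first hyphen. A small parser makes them easier to inspect (a sketch of the naming convention as shown above, not a CuBIDS function):

```python
def parse_entity_set(name):
    """Split a CuBIDS entity-set name into an {entity: value} dict:
    fields are separated by '_', and each field is 'key-value'
    (split on the first '-' only, since values may contain none)."""
    entities = {}
    for pair in name.split("_"):
        key, _, value = pair.partition("-")
        entities[key] = value
    return entities
```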
Note
This is an odd dataset, in that being non-oblique was the variant; most data isn't collected obliquely! Note that fMRIPrep and QSIPrep can handle oblique files fine, but they have to make adjustments, so this is good to know about.
Following instructions here, we can decide how we want these variants renamed and whether we want to keep them in the dataset. But for the sake of this exercise, we will use the default renaming scheme and not do any purging.
We save our progress before proceeding:
datalad save -d $BIDS -m "run cubids grouping"
Step 6: Apply the Variant Renaming#
Now we want to rename our files to indicate the variant groups. Note that if we were to do this manually, we might miss important details like updating paths in IntendedFor entries, and CuBIDS takes care of this for us! We can apply our renaming with:
cubids apply $BIDS v0_summary.tsv v0_files.tsv v1 --use-datalad
Note
This step can take a long time on large datasets, and could be worth submitting to a compute node.
When this finishes, we see that the variant subject’s files have been renamed, with ls $BIDS/sub-20017/ses-1/func:
sub-20017_ses-1_task-fracback_acq-singlebandVARIANTObliquityFalse_bold.json sub-20017_ses-1_task-rest_acq-multibandVARIANTObliquityFalse_bold.nii.gz
sub-20017_ses-1_task-fracback_acq-singlebandVARIANTObliquityFalse_bold.nii.gz sub-20017_ses-1_task-rest_acq-singlebandVARIANTObliquityFalse_bold.json
sub-20017_ses-1_task-rest_acq-multibandVARIANTObliquityFalse_bold.json sub-20017_ses-1_task-rest_acq-singlebandVARIANTObliquityFalse_bold.nii.gz
and in the fieldmaps, we see, with cat $BIDS/sub-20017/ses-1/fmap/sub-20017_ses-1_phasediff.json:
{
"IntendedFor": [
"ses-1/func/sub-20017_ses-1_task-fracback_acq-singlebandVARIANTObliquityFalse_bold.nii.gz",
"ses-1/func/sub-20017_ses-1_task-rest_acq-multibandVARIANTObliquityFalse_bold.nii.gz",
"ses-1/func/sub-20017_ses-1_task-rest_acq-singlebandVARIANTObliquityFalse_bold.nii.gz"
]
}
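This IntendedFor bookkeeping is exactly the kind of detail that is easy to miss when renaming by hand. A hedged Python sketch of the idea (the function and the rename-map structure are illustrative; `cubids apply` handles all of this for you):

```python
import json
from pathlib import Path

def update_intended_for(fmap_json, rename_map):
    """Rewrite IntendedFor entries after files are renamed. rename_map
    maps old relative paths to new ones; unmatched paths are left alone."""
    path = Path(fmap_json)
    metadata = json.loads(path.read_text())
    if "IntendedFor" in metadata:
        metadata["IntendedFor"] = [
            rename_map.get(p, p) for p in metadata["IntendedFor"]
        ]
        path.write_text(json.dumps(metadata, indent=4, sort_keys=True))
```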
CuBIDS errantly put a file called v1_full_cmd.sh in the BIDS root directory. Let’s move it to the code directory:
mv $BIDS/v1_full_cmd.sh $BIDS/code/CuBIDS/
Rerunning the BIDS validator (while ignoring warnings) reveals the dataset is still valid:
deno run -ERWN jsr:@bids/validator $PWD --ignoreWarnings
This dataset appears to be BIDS compatible.
Summary:               Available Tasks:    Available Modalities:
246 Files, 7.95 GB     Fractal N-Back      MRI
10 - Subjects          Resting State
1 - Sessions
If you have any questions, please post on https://neurostars.org/tags/bids.
Now that this is all ready to go, we save one last time:
datalad save -d $BIDS -m "finished v1 curation"
If we wanted to be really careful, we could create an exemplar dataset with one subject per variant group to test our pipelines. That could be helpful for large datasets with several variants, but for our purposes, we will just proceed with processing everyone.
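The selection logic behind an exemplar dataset is simple: keep the first subject listed in each acquisition group. A hedged Python sketch (column names `subject` and `AcqGroup` are assumptions; check your grouping TSV header, and see the CuBIDS documentation for the built-in exemplar-copying functionality):

```python
import csv

def pick_exemplars(acq_grouping_tsv):
    """Choose one exemplar subject (the first listed) per acquisition
    group. Column names are assumptions -- verify against your TSV."""
    exemplars = {}
    with open(acq_grouping_tsv) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            exemplars.setdefault(row["AcqGroup"], row["subject"])
    return exemplars
```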