Prepare the Lookup CSV

Create a lookup.csv for prepare.py, which uses the file to determine which subjects and sessions will be uploaded.

  1. Run the make_lookupcsv.py under the utilities directory within this repository.

  2. The --info_file can either be the abcd_mri01.txt or the abcd_fastqc01.txt file. Specify which file you're using with the --abcdmri or --fasttrack flag.

  3. While either file can be used, the fastqc file typically will have more subjects than the mri file.

  4. You will also need to provide the path to where you want your lookup.csv with the --lookup_csv flag.

  5. Verify that all of the subjects you want are included by running validate-lookupcsv.py also under the utilities directory.

  6. validate-lookupcsv.py checks that every subject in your subject list is present in the lookup.csv.

  7. When a subject,session pair appears on more than one line with different interview dates, the script keeps only the line with the earliest date and removes the others.

  8. Please note that this will not fix the issue of the same subject,session pair having different age/sex markers (an issue found in the fastqc file). That will have to be fixed manually by comparing the lookup.csv to the abcd_mri file.
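
The de-duplication rule in step 7 and the age/sex problem in step 8 can be sketched in a few lines of Python. This is an illustration only, not the code in validate-lookupcsv.py: the function name dedupe_earliest is hypothetical, and the column names are assumed to match the header keys listed under Contents of the Lookup CSV.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # MM/DD/YYYY, as required for interview_date


def dedupe_earliest(rows):
    """Keep one row per (subject, session), preferring the earliest date.

    `rows` is an iterable of dicts such as those produced by csv.DictReader.
    Returns (kept_rows, conflicts), where `conflicts` lists the pairs whose
    duplicate rows disagree on interview_age or sex and need manual review.
    """
    kept = {}
    conflicts = set()
    for row in rows:
        key = (row["bids_subject"], row["bids_session"])
        previous = kept.get(key)
        if previous is None:
            kept[key] = row
            continue
        # Flag pairs whose duplicates carry different age/sex markers.
        if (row["interview_age"], row["sex"]) != (
            previous["interview_age"],
            previous["sex"],
        ):
            conflicts.add(key)
        # Replace the stored row only if this one is strictly earlier.
        row_date = datetime.strptime(row["interview_date"], DATE_FORMAT)
        prev_date = datetime.strptime(previous["interview_date"], DATE_FORMAT)
        if row_date < prev_date:
            kept[key] = row
    return list(kept.values()), sorted(conflicts)
```

Rows could be read with csv.DictReader and passed straight in. Any pair returned in conflicts still has to be resolved by hand against the abcd_mri file, as described in step 8.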

How to Download the Information File

  1. Log in to the NIMH Data Archive

  2. For the abcd_mri01.txt file, navigate to this page. For the abcd_fastqc01.txt file, navigate to this page.

  3. Click Add to Filter Cart at the bottom

  4. Once the filter cart in the top right corner updates, click on Create Data Package/Add Data to Study

    • For the fastqc file, double check that the ABCD Dataset and ABCD Fasttrack QC Instrument checkboxes are selected (they should be by default)

  5. Click Create Data Package and name it something identifiable to you

  6. Make sure Include Documentation is selected before clicking Create Data Package. The package will take a while to create.

  7. Once the package is created, download it to your system using the downloadcmd within nda-tools by running:

    downloadcmd -dp data_package_id -d /download/output/directory
    

NOTE: nda-tools must be installed, either directly on your system or in a conda environment. If it is in a conda environment, that environment must be active when you run this command.

Contents of the Lookup CSV

The lookup.csv contains metadata about all of the subjects included in the collection. The file must exist and reside in the upload folder. It has one column per key in the header row and N+1 rows, where N is the number of subject and session pairs. The first row MUST be a header row with the exact keys:

bids_subject,bids_session,subjectkey,src_subject_id,interview_date,interview_age,sex

On each row, every column should hold the information for that row's bids_subject_session in a specific way.  Read on for the exact specifications.

To see an example of how we do this for the ABCC, see Appendix: Links for more information.

bids_subject_session

This follows the formatting below.

sub-<SUBJECT>[_ses-<SESSION>]

Where:

  • <SUBJECT> needs to be replaced by your actual BIDS-standard subject label

  • <SESSION> should be used if you have multiple sessions for single subjects within a dataset.  When in use, <SESSION> needs to be replaced by your actual BIDS-standard session label.  The square brackets around [_ses-<SESSION>] imply this block is optional.

Remember: BIDS labels (<SUBJECT> and <SESSION>) are ONLY alphanumeric. Spaces, underscores, hyphens, and any other separators are NOT ALLOWED. To be explicit, there are no - (hyphens) or _ (underscores) allowed in the subject or session IDs.
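
The naming rule above can be checked with a regular expression. The following is a minimal sketch, assuming labels are limited to ASCII letters and digits; the function name is hypothetical and not part of this repository.

```python
import re

# sub-<SUBJECT> followed by an optional _ses-<SESSION>; labels are
# alphanumeric only, so hyphens and underscores inside a label fail.
BIDS_SUBJECT_SESSION = re.compile(r"sub-[A-Za-z0-9]+(_ses-[A-Za-z0-9]+)?")


def is_valid_bids_subject_session(value):
    """Return True if `value` matches sub-<SUBJECT>[_ses-<SESSION>]."""
    return BIDS_SUBJECT_SESSION.fullmatch(value) is not None
```

For example, sub-01 and sub-01_ses-baseline pass, while sub-NDAR_INV123 and sub-01_ses-year-2 fail because their labels contain an underscore or a hyphen.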

subjectkey

This is generated by the NDA GUID tool and always starts with NDAR. It is a globally unique identifier for the subject. If you are not the one who originally uploaded the data to the NDA, the subjectkey should be in the NDA manifest text file.

src_subject_id

This is the subject ID that was used by the lab or project.  Alphanumeric-only formatting is recommended, though not strictly necessary, for src_subject_id.

interview_date

The date on which the interview/genetic test/sampling/imaging/biospecimen was completed. It MUST be in the format MM/DD/YYYY. Whether to use the true acquisition dates or "masked dates" (anonymized to the first of the acquisition month) for the interview_date depends on any relevant IRB restrictions.
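
If masked dates are required, the conversion is mechanical. The sketch below assumes the input is already in MM/DD/YYYY format; the helper name mask_interview_date is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # MM/DD/YYYY


def mask_interview_date(date_string):
    """Anonymize a date to the first day of the same month and year."""
    parsed = datetime.strptime(date_string, DATE_FORMAT)
    return parsed.replace(day=1).strftime(DATE_FORMAT)
```

For example, mask_interview_date("03/17/2019") returns "03/01/2019".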

interview_age

The age in months of the subject at the time of the interview_date. This value MUST always be an integer.

sex

The sex of the subject. There are four values accepted by the NDA at the time of this writing (January 2022):

  • M = Male

  • F = Female

  • O = Other

  • NR = Not Reported
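
The per-column rules for subjectkey, interview_date, interview_age, and sex can be checked row by row before running prepare.py. This is a minimal sketch under the assumption that each row is a dict keyed by column name, as csv.DictReader produces; check_row is a hypothetical helper, not a utility shipped with this repository.

```python
import re
from datetime import datetime

ACCEPTED_SEX = {"M", "F", "O", "NR"}
DATE_PATTERN = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")


def check_row(row):
    """Return a list of problems found in one lookup.csv row."""
    problems = []
    if not row["subjectkey"].startswith("NDAR"):
        problems.append("subjectkey does not start with NDAR")
    date = row["interview_date"]
    if DATE_PATTERN.fullmatch(date) is None:
        problems.append("interview_date is not MM/DD/YYYY")
    else:
        try:
            # Rejects impossible dates such as 02/30/2019.
            datetime.strptime(date, "%m/%d/%Y")
        except ValueError:
            problems.append("interview_date is not a real calendar date")
    if re.fullmatch(r"[0-9]+", row["interview_age"]) is None:
        problems.append("interview_age is not an integer number of months")
    if row["sex"] not in ACCEPTED_SEX:
        problems.append("sex is not one of M, F, O, NR")
    return problems
```

An empty list means the row passed these checks. The check does not cover the BIDS subject and session columns or src_subject_id.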