Processing pipelines

The processing pipelines provided out-of-the-box by the Data Factory enable automated processing of the data made available to MIP Local or MIP Federated.

Overview of all pipelines

```mermaid
graph LR
  data_in(Anonymised data from Data Capture or other sources)
  data_out(Research-grade data)
  reorg_pipeline>Reorganisation pipeline]
  ehr_pipeline>EHR curation pipeline]
  metadata_pipeline>Metadata curation pipeline]
  preprocessing_pipeline>MRI pre-processing and feature extraction pipeline]
  normalisation_pipeline>Normalisation and data export pipeline]
  data_in --> reorg_pipeline
  reorg_pipeline --> ehr_pipeline
  reorg_pipeline --> metadata_pipeline
  reorg_pipeline --> preprocessing_pipeline
  ehr_pipeline --> normalisation_pipeline
  metadata_pipeline --> normalisation_pipeline
  preprocessing_pipeline --> normalisation_pipeline
  normalisation_pipeline --> data_out
```

Reorganisation pipeline

This pipeline takes data stored on disk in its original format and reorganises it into the layout expected by the downstream pipelines (EHR curation, metadata curation, MRI pre-processing).

```mermaid
graph LR
  data_in(Anonymised data from Data Capture or other sources)
  data_out(Reorganised data)
  processing>Reorganisation of MRI scans and EHR data]
  data_in --> processing
  processing --> data_out
```
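The reorganisation step can be pictured as computing, for each incoming MRI series, the target path in the layout the later pipelines expect (/PatientID/StudyID/SeriesProtocol/SeriesID/). The sketch below is illustrative only; the record field names are assumptions, not the Data Factory's actual schema.

```python
from pathlib import PurePosixPath

def target_path(record: dict) -> PurePosixPath:
    """Build the reorganised path for one MRI series.

    The keys used here (patient_id, study_id, series_protocol, series_id)
    are hypothetical names for the identifiers extracted from the source
    data; the directory order follows the layout expected downstream.
    """
    return PurePosixPath(
        record["patient_id"],
        record["study_id"],
        record["series_protocol"],
        record["series_id"],
    )

# Example: one incoming series described by its anonymised identifiers.
record = {
    "patient_id": "PAT001",
    "study_id": "STU001",
    "series_protocol": "T1w",
    "series_id": "SER001",
}
# target_path(record) -> PAT001/STU001/T1w/SER001
```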

EHR curation pipeline

This pipeline captures as many variables as possible from the patient records and stores the data in a database compliant with the I2B2 schema (the ‘I2B2 capture’ database).

```mermaid
graph LR
  data_in(CSV files or other files containing EHR data)
  data_out(I2B2 capture database)
  processing>ETL with light mapping of EHR data to I2B2 schema]
  data_in --> processing
  processing --> data_out
```
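A "light mapping" of EHR data to the I2B2 schema can be sketched as turning each CSV row into one observation fact per captured variable. The column names and concept codes below are hypothetical placeholders, not the project's actual mapping.

```python
import csv
import io

# Hypothetical mapping from source CSV columns to I2B2 concept codes.
CONCEPT_MAP = {"age": "demo:age", "diagnosis": "icd:diag"}

def row_to_facts(row: dict) -> list:
    """Map one EHR CSV row to a list of I2B2-style observation facts."""
    facts = []
    for column, concept_cd in CONCEPT_MAP.items():
        value = row.get(column)
        if value not in (None, ""):
            facts.append({
                "patient_num": row["patient_id"],
                "concept_cd": concept_cd,
                "value": value,
            })
    return facts

# Example input: a tiny CSV file with one patient record.
csv_data = "patient_id,age,diagnosis\nP001,72,G30.9\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))
facts = row_to_facts(rows[0])
# facts now holds one observation fact per mapped, non-empty column.
```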

Metadata curation pipeline

This pipeline collects the information associated with MRI scans, present either in DICOM headers or in associated metadata files, and stores it in the ‘I2B2 capture’ database.

```mermaid
graph LR
  data_in(Metadata extracted from MRI scans)
  data_out(I2B2 capture database)
  processing>ETL with light mapping of metadata to I2B2 schema]
  data_in --> processing
  processing --> data_out
```
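The metadata side of this ETL can be sketched the same way: select a few items from the extracted headers (whether they came from DICOM tags or a sidecar metadata file) and shape them as I2B2-style facts. The header keys and concept codes here are illustrative assumptions.

```python
# Hypothetical selection of MRI metadata items and their concept codes.
WANTED = {
    "MagneticFieldStrength": "mri:field_strength",
    "RepetitionTime": "mri:tr",
}

def metadata_to_facts(patient_id: str, headers: dict) -> list:
    """Keep only the wanted metadata items, as I2B2-style facts."""
    return [
        {"patient_num": patient_id, "concept_cd": concept, "value": headers[key]}
        for key, concept in WANTED.items()
        if key in headers
    ]

# Example: headers already extracted from a scan; unmapped items are dropped.
headers = {"MagneticFieldStrength": 3.0, "RepetitionTime": 2300, "Modality": "MR"}
facts = metadata_to_facts("P001", headers)
```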

MRI pre-processing and feature extraction pipeline

This pipeline takes MRI data organised following the directory structure /PatientID/StudyID/SeriesProtocol/SeriesID/ and applies a series of processing steps to it, including:

  • Conversion from DICOM to NIfTI
  • Neuromorphometric pipeline
  • Quality control

For each step, data provenance is tracked and stored in a ‘Data Catalog’ database.
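Per-step provenance tracking can be sketched as appending one entry per processing step, recording its input, output, and a timestamp, to be stored in the ‘Data Catalog’ database. The function and field names are illustrative, not the Data Factory's actual provenance model.

```python
from datetime import datetime, timezone

def record_provenance(catalog: list, step: str, input_ref: str, output_ref: str) -> None:
    """Append one provenance entry for a completed processing step."""
    catalog.append({
        "step": step,
        "input": input_ref,
        "output": output_ref,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

# Example: track two steps applied to one MRI series.
catalog = []
record_provenance(catalog, "dicom_to_nifti", "SER001/dicom", "SER001/t1.nii")
record_provenance(catalog, "quality_control", "SER001/t1.nii", "SER001/qc.json")
# The catalog now records which output each step produced from which input.
```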

```mermaid
graph LR
  data_in(MRI scans)
  data_out(Features stored into 'I2B2 capture' database)
  processing>Neuromorphometric pipeline + quality control + provenance]
  data_in --> processing
  processing --> data_out
```

Normalisation and data export pipeline

This pipeline is triggered on a patient record once enough information has been collected (both EHR data and MRI-derived biomarkers are required). It uses the data mapping and transformation specifications provided by the DGDS committee to select the variables of interest and normalise them to the MIP Common Data Elements (CDE) reference.

```mermaid
graph LR
  data_in('I2B2 capture' database)
  data_normalised('I2B2 MIP CDE' database)
  data_out(Features table containing research-grade data)
  processing>Selection of variables and normalisation]
  export>Export MIP CDE variables and other variables to a Features table]
  data_in --> processing
  processing --> data_normalised
  data_normalised --> export
  export --> data_out
```
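The selection-and-normalisation step can be sketched as applying a mapping table, in the spirit of the DGDS specifications, that renames captured variables to their CDE counterparts and applies a transform; unmapped variables are dropped. All variable names and transforms below are hypothetical.

```python
# Hypothetical mapping: captured variable name -> (CDE name, transform).
CDE_MAPPING = {
    "age": ("subjectage", float),
    "mmse": ("minimentalstate", int),
}

def normalise(record: dict) -> dict:
    """Select the variables of interest and normalise them to CDE form."""
    out = {}
    for var, (cde_name, transform) in CDE_MAPPING.items():
        if var in record:
            out[cde_name] = transform(record[var])
    return out

# Example: a captured record with one local-only variable that is not
# part of the CDE reference and is therefore dropped.
captured = {"age": "72", "mmse": "28", "local_only_code": "X1"}
# normalise(captured) keeps only the mapped variables, typed per the transform.
```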

This pipeline provides the final results produced by the Data Factory.

From Data Factory output specifications
The output of the Data Factory is a set of research-grade data containing the biomarkers extracted from MRI scans and the variables extracted from the patients' (or research subjects') EHR records.

This information is sent to the Hospital Database and made available to the machine learning and statistical analysis algorithms of the Algorithm Factory, as well as to distributed queries when the hospital's MIP instance is connected to the Federation.