# Processing pipelines

The processing pipelines provided out-of-the-box by the Data Factory enable
automated processing of the data made available to MIP Local or MIP Federated.
## Overview of all pipelines
```mermaid
graph LR
data_in(Anonymised data from Data Capture or other sources)
data_out(Research-grade data)
reorg_pipeline> Reorganisation pipeline]
ehr_pipeline> EHR curation pipeline]
metadata_pipeline> Metadata curation pipeline]
preprocessing_pipeline> MRI pre-processing and feature extraction pipeline]
normalisation_pipeline> Normalisation and data export pipeline]
data_in --> reorg_pipeline
reorg_pipeline --> ehr_pipeline
reorg_pipeline --> metadata_pipeline
reorg_pipeline --> preprocessing_pipeline
ehr_pipeline --> normalisation_pipeline
metadata_pipeline --> normalisation_pipeline
preprocessing_pipeline --> normalisation_pipeline
normalisation_pipeline --> data_out
```
## Reorganisation pipeline

This pipeline takes data organised on disk in its original format and reorganises it
into the layout expected by the downstream pipelines (EHR curation, metadata curation, MRI pre-processing).
```mermaid
graph LR
data_in(Anonymised data from Data Capture or other sources)
data_out(Reorganised data)
processing> Reorganisation of MRI scans and EHR data]
data_in --> processing
processing --> data_out
```
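As an illustration, the reorganisation step can be sketched as a small Python script. This is a minimal sketch, not the actual implementation: it assumes (hypothetically) that input filenames encode the patient and study identifiers as `<PatientID>_<StudyID>_<filename>`; the real naming convention depends on the data provider.

```python
import os
import shutil

def reorganise(source_dir: str, target_dir: str) -> list[str]:
    """Reorganise a flat directory of input files into the
    PatientID/StudyID layout expected by the downstream pipelines.

    Hypothetical assumption: filenames follow <PatientID>_<StudyID>_<rest>.
    """
    moved = []
    for name in sorted(os.listdir(source_dir)):
        parts = name.split("_", 2)
        if len(parts) < 3:
            continue  # skip files that do not match the naming convention
        patient_id, study_id, rest = parts
        dest = os.path.join(target_dir, patient_id, study_id)
        os.makedirs(dest, exist_ok=True)
        shutil.move(os.path.join(source_dir, name), os.path.join(dest, rest))
        moved.append(os.path.join(patient_id, study_id, rest))
    return moved
```

A file `P001_S01_scan.dcm` would thus be moved to `P001/S01/scan.dcm` under the target directory.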
## EHR curation pipeline

This pipeline captures as many variables as possible from the patient records and stores
the data in a database compliant with the I2B2 schema (the 'I2B2 capture' database).
```mermaid
graph LR
data_in(CSV files or other files containing EHR data)
data_out(I2B2 capture database)
processing> ETL with light mapping of EHR data to I2B2 schema]
data_in --> processing
processing --> data_out
```
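The 'light mapping' ETL can be sketched as follows. This is an illustrative sketch only: the column names and I2B2 concept codes below are hypothetical placeholders, as the real mapping is site-specific and defined elsewhere.

```python
import csv
import io

# Hypothetical mapping from source CSV columns to I2B2 concept codes;
# the actual mapping is defined per data provider.
COLUMN_TO_CONCEPT = {
    "age": "DEM|AGE",
    "mmse": "COG|MMSE",
}

def ehr_to_i2b2_facts(csv_text: str) -> list[dict]:
    """Light ETL: turn each mapped CSV column into one I2B2-style
    observation fact (patient_num, concept_cd, nval_num)."""
    facts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, concept in COLUMN_TO_CONCEPT.items():
            value = row.get(column)
            if value not in (None, ""):
                facts.append({
                    "patient_num": row["patient_id"],
                    "concept_cd": concept,
                    "nval_num": float(value),
                })
    return facts
```

Each row of the source CSV thus yields one observation fact per mapped variable, ready to be loaded into the 'I2B2 capture' database.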
## Metadata curation pipeline

This pipeline collects the information associated with MRI scans, present either in
DICOM headers or in associated metadata files, and stores it into the 'I2B2 capture' database.
```mermaid
graph LR
data_in(Metadata extracted from MRI scans)
data_out(I2B2 capture database)
processing> ETL with light mapping of metadata to I2B2 schema]
data_in --> processing
processing --> data_out
```
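The metadata mapping follows the same pattern as the EHR ETL. In this sketch the scan metadata is represented as a plain dictionary of already-extracted header attributes; the attribute names and concept codes are hypothetical examples, and the real pipeline reads them from DICOM headers or sidecar metadata files.

```python
# Hypothetical subset of scan metadata attributes to capture and their
# I2B2 concept codes; the actual list is defined by the mapping spec.
TAG_TO_CONCEPT = {
    "MagneticFieldStrength": "MRI|FIELD_STRENGTH",
    "RepetitionTime": "MRI|TR",
}

def metadata_to_i2b2_facts(patient_id: str, header: dict) -> list[dict]:
    """Map selected scan metadata attributes to I2B2-style
    observation facts for the 'I2B2 capture' database."""
    return [
        {
            "patient_num": patient_id,
            "concept_cd": concept,
            "nval_num": float(header[attr]),
        }
        for attr, concept in TAG_TO_CONCEPT.items()
        if attr in header  # silently skip attributes absent from this scan
    ]
```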
## MRI pre-processing and feature extraction pipeline

This pipeline takes MRI data organised following the directory structure
`/PatientID/StudyID/SeriesProtocol/SeriesID/` and applies a series of processing steps to it, including:

- Conversion from DICOM to NIfTI
- Neuromorphometric pipeline
- Quality control
For each step, data provenance is tracked and stored in a ‘Data Catalog’ database.
```mermaid
graph LR
data_in(MRI scans)
data_out(Features stored into 'I2B2 capture' database)
processing> Neuromorphometric pipeline + quality control + provenance]
data_in --> processing
processing --> data_out
```
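The first step of such a pipeline is to enumerate the series found under the expected directory layout. The sketch below only indexes the `/PatientID/StudyID/SeriesProtocol/SeriesID/` tree; a real run would then convert each series from DICOM to NIfTI, run the neuromorphometric and quality-control steps, and record provenance for each step in the 'Data Catalog' database.

```python
import os

def index_scans(root: str) -> list[dict]:
    """Walk the /PatientID/StudyID/SeriesProtocol/SeriesID/ layout and
    return one record per series, ready for further processing."""
    records = []
    for patient in sorted(os.listdir(root)):
        patient_dir = os.path.join(root, patient)
        for study in sorted(os.listdir(patient_dir)):
            study_dir = os.path.join(patient_dir, study)
            for protocol in sorted(os.listdir(study_dir)):
                protocol_dir = os.path.join(study_dir, protocol)
                for series in sorted(os.listdir(protocol_dir)):
                    records.append({
                        "patient_id": patient,
                        "study_id": study,
                        "protocol": protocol,
                        "series_id": series,
                        "path": os.path.join(protocol_dir, series),
                    })
    return records
```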
## Normalisation and data export pipeline

This pipeline is triggered on a patient record once enough information has been collected
(both EHR data and biomarkers from MRI are required). It uses the data mapping and transformation
specifications provided by the DGDS committee to select the variables of interest and normalise
them into the MIP Common Data Elements reference.
```mermaid
graph LR
data_in('I2B2 capture' database)
data_normalised('I2B2 MIP CDE' database)
data_out(Features table containing research-grade data)
processing> Selection of variables and normalisation]
export> Export MIP CDE variables and other variables to a Features table]
data_in --> processing
processing --> data_normalised
data_normalised --> export
export --> data_out
```
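The selection-and-normalisation step can be sketched as a lookup through a mapping specification. This is a minimal sketch: the spec entries below are hypothetical placeholders standing in for the real DGDS mapping and transformation specifications, and captured facts are represented as a simple dictionary keyed by concept code.

```python
# Hypothetical excerpt of a mapping specification: for each MIP Common
# Data Element, the source concept code and a transformation to apply.
CDE_SPEC = {
    "agevalue": {"source": "DEM|AGE", "transform": float},
    "minimentalstate": {"source": "COG|MMSE", "transform": float},
}

def normalise_record(capture_facts: dict) -> dict:
    """Select the variables of interest from a patient's 'I2B2 capture'
    facts and normalise them into MIP CDE variables, producing one row
    of the Features table."""
    cde_row = {}
    for cde, spec in CDE_SPEC.items():
        if spec["source"] in capture_facts:
            cde_row[cde] = spec["transform"](capture_facts[spec["source"]])
    return cde_row
```

Variables absent from the capture database are simply omitted from the resulting row, so the export step only receives values that were actually observed.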
This pipeline provides the final results produced by the Data Factory.
*From the Data Factory output specifications:*
The output of the Data Factory is a set of research-grade data containing the biomarkers extracted
from MRI scans and the variables extracted from the patient (or research subject) EHR records.
This information is sent to the Hospital Database and provided to the
machine learning and statistical analysis algorithms of the Algorithm Factory,
as well as to the distributed queries, when the instance of
MIP at a hospital is connected to the Federation.