`mip_dmp.process`

The mip_dmp.process subpackage contains modules that provide processing functions for the MIP Dataset Mapper.

`mip_dmp.process.embedding`

Module that provides functions to handle word embeddings and operations on them.

mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)

Find the chars2vec embedding for the text.

Parameters

text (str): Text to be embedded.
chars2vec_model (str): chars2vec model to be used, loaded by the gensim library.

Returns

numpy.ndarray: chars2vec embedding for the text.

mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)

Find the matches based on chars2vec embeddings and cosine similarity.

Parameters

x_embedding (str): String to compare against.
y_embedding (str): String to compare.
chars2vec_model (str): chars2vec model to be used, loaded by the gensim library.

Returns

float: Cosine similarity between the two chars2vec embeddings of the strings.
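
As a rough illustration of the computation this function is described as performing, the sketch below computes the cosine similarity between two vectors using only the standard library. The helper name `cosine_similarity` and the handling of zero vectors are assumptions made for this example and are not taken from `mip_dmp`.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("Embeddings must have the same dimension.")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Undefined for a zero vector; returning 0.0 is a choice made in this sketch.
        return 0.0
    return dot / (norm_x * norm_y)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

A value close to 1 indicates that the two embeddings point in nearly the same direction.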

mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)

Find the n closest embeddings to the given embedding.

Parameters

word_embedding (numpy.ndarray): Embedding to find the n closest embeddings to.
embeddings (list): List of embeddings in which to search for the closest embeddings.
embedding_words (list): List of words corresponding to the embeddings, which is re-sorted and reduced accordingly.
n (int): Number of closest embeddings to find.

Returns

dict: Dictionary containing the n closest embeddings, their distances to the given embedding, and the words corresponding to the embeddings, in the form:

{
    "distances": [float],
    "embeddings": [numpy.ndarray],
    "embedding_words": [str]
}
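
The following sketch shows one way a result of this shape could be produced. It assumes cosine distance (1 minus cosine similarity) as the metric and non-zero vectors; the metric actually used by `find_n_closest_embeddings` is not stated here, and the names `cosine_distance` and `n_closest` are hypothetical.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; the metric is an assumption of this sketch."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms  # assumes neither vector is all zeros


def n_closest(word_embedding, embeddings, embedding_words, n=5):
    """Return the n entries closest to word_embedding, nearest first."""
    distances = [cosine_distance(word_embedding, e) for e in embeddings]
    # Indices of the candidates, sorted by increasing distance and truncated to n.
    order = sorted(range(len(embeddings)), key=lambda i: distances[i])[:n]
    return {
        "distances": [distances[i] for i in order],
        "embeddings": [embeddings[i] for i in order],
        "embedding_words": [embedding_words[i] for i in order],
    }


result = n_closest(
    [1.0, 0.0],
    [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]],
    ["age", "sex", "weight"],
    n=2,
)
print(result["embedding_words"])  # ['sex', 'weight']
```

The three lists stay aligned by position, so the word at index `i` belongs to the embedding and distance at the same index.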

mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')

Generate embeddings for a list of words.

Parameters

words (list): List of words to generate embeddings for.
embedding_method (str): Embedding method to be used, either “chars2vec” or “glove”.

Returns

list: List of embeddings for the words.

mip_dmp.process.embedding.glove_embedding(text, glove_model)

Find the GloVe embedding for the text.

Parameters

text (str): Text to be embedded.
glove_model (str): GloVe model to be used, loaded by the gensim library.

Returns

numpy.ndarray: GloVe embedding for the text.

mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)

Reduce the dimension of the embeddings, mainly for visualization purposes.

Parameters

embeddings (list): List of embeddings to reduce the dimension of.
reduce_method (str): Method to use to reduce the dimension, either “tsne” or “pca”.
n_components (int): Number of components to reduce the dimension to.

Returns

list: List of reduced embeddings.

`mip_dmp.process.mapping`

Module that provides functions to support the mapping of datasets to a specific CDE metadata schema.

mip_dmp.process.mapping.apply_transform_map(dataset_column, transform)

Apply the transform map for binomial and multinomial variables.

Parameters

dataset_column (pandas.DataFrame): Dataset column to be transformed.
transform (str): Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor.

Returns

dataset_column (pandas.DataFrame): The transformed dataset column.
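
To make the “map” transformation more concrete, here is a minimal sketch. The exact JSON layout expected by `apply_transform_map` is not specified here, so the example assumes a flat JSON object that maps source values to target values, and it operates on a plain list rather than a pandas object. The helper name `apply_map` is hypothetical.

```python
import json


def apply_map(values, transform):
    """Replace each value using a JSON-encoded mapping (hypothetical helper)."""
    mapping = json.loads(transform)
    # Values missing from the mapping are kept unchanged in this sketch.
    return [mapping.get(str(v), v) for v in values]


column = ["M", "F", "F", "M"]
transform = '{"M": "male", "F": "female"}'
print(apply_map(column, transform))  # ['male', 'female', 'female', 'male']
```

Keeping unmapped values unchanged is a design choice of the sketch; a stricter variant could raise an error instead.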

mip_dmp.process.mapping.apply_transform_scale(dataset_column, cde_code, cde_type, scaling_factor)

Apply the transform scale for real and integer variables.

Parameters

dataset_column (pandas.DataFrame): Dataset column to be transformed.
cde_code (str): CDE code of the dataset column.
cde_type (str): CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”.
scaling_factor (float): Scaling factor to be applied to the dataset column.

Returns

dataset_column (pandas.DataFrame): The transformed dataset column.
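
A scale transformation can be pictured as a multiplication followed by a type-dependent cast. The sketch below is only an illustration under that assumption: the rounding of “integer” values and the rejection of other CDE types are choices made here, and `apply_scale` is a hypothetical name that works on a plain list.

```python
def apply_scale(values, cde_type, scaling_factor):
    """Multiply each value by scaling_factor (hypothetical helper)."""
    if cde_type not in ("real", "integer"):
        # The scale transform is documented for real and integer variables only.
        raise ValueError(f"Cannot scale a column of type {cde_type!r}.")
    scaled = [v * scaling_factor for v in values]
    if cde_type == "integer":
        # Rounding back to int is an assumption of this sketch.
        return [int(round(v)) for v in scaled]
    return scaled


print(apply_scale([1.5, 2.0], "real", 10.0))  # [15.0, 20.0]
print(apply_scale([3, 4], "integer", 2.0))    # [6, 8]
```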

mip_dmp.process.mapping.map_dataset(dataset, mappings, cde_codes)

Map the dataset to the schema.

Parameters

dataset (pandas.DataFrame): Dataset to be mapped.
mappings (dict): Mappings of the dataset columns to the schema columns.
cde_codes (list): List of codes of the CDE metadata schema.

Returns

pandas.DataFrame: Mapped dataset.

mip_dmp.process.mapping.transform_dataset_column(dataset_column, cde_code, cde_type, transform_type, transform)

Transform the dataset column.

Parameters

dataset_column (pandas.DataFrame): Dataset column to be transformed.
cde_code (str): CDE code of the dataset column.
cde_type (str): CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”.
transform_type (str): Type of transformation to be applied to the dataset column. Can be “map” or “scale”.
transform (str): Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor.

Returns

dataset_column (pandas.DataFrame): The transformed dataset column.

`mip_dmp.process.matching`

Module that provides functions to support the matching of dataset columns to CDEs.

mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)

Generate the initial transform.

Parameters

dataset_column_values (list of str): Dataset column values.
cde_code_values (list of str): CDE code values.
dataset_column (str): Dataset column.

Returns

initial_transform (str): Initial transform.

mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)

Make the n closest match distance vector for a given input dataset column.

Parameters

matchedCdeCodes (dict): Dictionary of the matching results in the form:

{
    "inputDatasetColumn1": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    "inputDatasetColumn2": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    ...
}

inputDatasetColumn (str): Input dataset column name.

Returns

distanceVector (numpy.ndarray): Similarity/distance vector.
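
The sketch below illustrates how a distance vector could be read out of a dictionary with the structure shown above. It assumes the vector is simply the "distances" list stored for the requested column and returns a plain list instead of a NumPy array; the function name `distance_vector` and the sample data are hypothetical.

```python
def distance_vector(matched_cde_codes, input_dataset_column):
    """Return the stored distances for one dataset column (hypothetical helper)."""
    # A copy is returned so that callers cannot modify the matching results.
    return list(matched_cde_codes[input_dataset_column]["distances"])


# Made-up matching results following the documented structure.
matched = {
    "subject_age": {
        "words": ["age", "agegroup"],
        "distances": [0.9, 0.6],
        "embeddings": [None, None],
    },
    "sex": {
        "words": ["gender", "age"],
        "distances": [0.5, 0.3],
        "embeddings": [None, None],
    },
}

print(distance_vector(matched, "subject_age"))  # [0.9, 0.6]
```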

mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)

Make the initial transform.

Parameters

dataset (pandas.DataFrame): Dataset to be mapped.
schema (pandas.DataFrame): Schema to which the dataset is mapped.
dataset_column (str): Dataset column.
cde_code (str): CDE code.

Returns

dict: Initial transform.

mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)

Match a dataset column to CDEs using fuzzy matching.

Parameters

dataset_column (str): Dataset column.
schema (pandas.DataFrame): Schema to which the dataset is mapped.

Returns

list: List of matched CDE codes ordered by decreasing fuzzy ratio.
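
The idea of ranking CDE codes by string similarity can be sketched with the standard library. Note that this example swaps in `difflib.SequenceMatcher` as a stand-in for the fuzzy ratio, so the scores will differ from those of the library used by `mip_dmp`. It also takes a plain list of codes instead of a schema DataFrame, and the name `rank_by_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def rank_by_similarity(dataset_column, cde_codes):
    """Order cde_codes by decreasing similarity to dataset_column (hypothetical helper)."""

    def ratio(code):
        # Case-insensitive similarity in [0, 1]; a stand-in for a fuzzy ratio.
        return SequenceMatcher(None, dataset_column.lower(), code.lower()).ratio()

    return sorted(cde_codes, key=ratio, reverse=True)


print(rank_by_similarity("Age", ["gender", "age", "agegroup"]))
# ['age', 'agegroup', 'gender']
```

An exact (case-insensitive) match scores 1.0 and therefore always comes first.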

mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')

Initialize the mapping table by matching the dataset columns with the CDE codes.

Different matching methods can be used:

- “fuzzy”: Fuzzy matching using the Levenshtein distance (https://github.com/seatgeek/thefuzz).
- “glove”: Embedding matching using GloVe embeddings at the character level (https://nlp.stanford.edu/projects/glove/).
- “chars2vec”: Embedding matching using chars2vec embeddings (https://github.com/IntuitionEngineeringTeam/chars2vec).

Parameters

dataset (pandas.DataFrame): Dataset to be mapped.
schema (pandas.DataFrame): Schema to which the dataset is mapped.
nb_kept_matches (int): Number of matches to keep for each dataset column.
matching_method (str): Method to be used for matching the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”.

Returns

pandas.DataFrame: Mapping table represented as a pandas DataFrame.

matched_cde_codes (dict): Dictionary of dictionaries storing, for each dataset column (key), the first 10 matched CDE codes with the corresponding fuzzy ratio or cosine similarity and the embedding vector. It has the form:

{
    "dataset_column_1": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    "dataset_column_2": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    ...
}

dataset_column_embeddings (list): List of embedding vectors for the dataset columns.

schema_code_embeddings (list): List of embedding vectors for the CDE codes.

`mip_dmp.process.utils`

Module that provides functions to support the modules of the mip_dmp.process subpackage.

mip_dmp.process.utils.is_number(s)

Check if a string is a number.

Parameters

s (str): String to check.

Returns

bool: True if the string is a number, False otherwise.
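
A common way to implement such a check is to attempt a conversion with `float` and catch the failure. The sketch below follows that approach as an assumption; it is not taken from the `mip_dmp` source, and the name `looks_like_number` is hypothetical.

```python
def looks_like_number(s):
    """Return True if s can be parsed as a float (hypothetical helper)."""
    try:
        float(s)
    except (TypeError, ValueError):
        return False
    return True


print(looks_like_number("3.14"))  # True
print(looks_like_number("abc"))   # False
```

With this approach, strings such as "1e3", "nan" and "inf" also count as numbers, because `float` accepts them.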
