mip_dmp.process

The mip_dmp.process subpackage contains modules that provide processing functions for the MIP Dataset Mapper.

mip_dmp.process.embedding

Module that provides functions to handle word embeddings and operations on them.

mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)

Find the chars2vec embedding for the text.

Parameters

text : str

Text to be embedded.

chars2vec_model : str

chars2vec model to be used, loaded by the gensim library.

Returns

numpy.ndarray

chars2vec embedding for the text.

mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)

Compute the cosine similarity between two chars2vec embeddings.

Parameters

x_embedding : numpy.ndarray

Embedding to compare against.

y_embedding : numpy.ndarray

Embedding to compare.

Returns

float

Cosine similarity between the two chars2vec embeddings of the strings.
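As an illustration of the quantity returned here, the following is a minimal standard-library sketch of cosine similarity between two vectors. The helper name `cosine_similarity` and the use of plain lists are assumptions made for this example; it is not the package's implementation.

```python
import math


def cosine_similarity(x, y):
    """Return the cosine similarity between two equal-length vectors.

    Hypothetical helper for illustration only (not part of mip_dmp).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Vectors pointing in the same direction have a similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors have a similarity of 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The value depends only on the angle between the vectors, not on their magnitudes.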

mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)

Find the n closest embeddings to the given embedding.

Parameters

word_embedding : numpy.ndarray

Embedding to find the n closest embeddings to.

embeddings : list

List of candidate embeddings in which to search for the closest matches to the given embedding.

embedding_words : list

List of words corresponding to the embeddings; it is re-sorted and truncated in the same way as the embeddings.

n : int

Number of closest embeddings to find.

Returns

dict

Dictionary containing the n closest embeddings, their distances to the given embedding, and the words corresponding to the embeddings in the form:

{
    "distances": [float],
    "embeddings": [numpy.ndarray],
    "embedding_words": [str]
}
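
The selection described above can be sketched as follows. This is an illustrative standard-library version that assumes cosine distance as the metric (the documentation does not state which metric is used); `cosine_distance` and `n_closest` are hypothetical names, not functions of the package.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cosine similarity (assumed metric for this sketch)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def n_closest(word_embedding, embeddings, embedding_words, n=5):
    """Return the n candidates closest to word_embedding, nearest first."""
    # Score every candidate once, then sort by distance and keep the first n.
    scored = sorted(
        (
            (cosine_distance(word_embedding, emb), emb, word)
            for emb, word in zip(embeddings, embedding_words)
        ),
        key=lambda item: item[0],
    )[:n]
    return {
        "distances": [dist for dist, _, _ in scored],
        "embeddings": [emb for _, emb, _ in scored],
        "embedding_words": [word for _, _, word in scored],
    }


candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
words = ["age", "sex", "height"]
result = n_closest([1.0, 0.0], candidates, words, n=2)
```

The three lists in the result stay aligned by position, so `result["embedding_words"][i]` is the word for `result["embeddings"][i]`.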

mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')

Generate embeddings for a list of words.

Parameters

words : list

List of words to generate embeddings for.

embedding_method : str

Embedding method to be used, either “chars2vec” or “glove”.

Returns

list

List of embeddings for the words.

mip_dmp.process.embedding.glove_embedding(text, glove_model)

Find the GloVe embedding for the text.

Parameters

text : str

Text to be embedded.

glove_model : str

GloVe model to be used, loaded by the gensim library.

Returns

numpy.ndarray

GloVe embedding for the text.

mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)

Reduce the dimension of the embeddings, mainly for visualization purposes.

Parameters

embeddings : list

List of embeddings to reduce the dimension of.

reduce_method : str

Method to use to reduce the dimension, either “tsne” or “pca”.

n_components : int

Number of components to reduce the dimension to.

Returns

list

List of reduced embeddings.

mip_dmp.process.mapping

Module that provides functions to support the mapping of datasets to a specific CDEs metadata schema.

mip_dmp.process.mapping.apply_transform_map(dataset_column, transform)

Apply the transform map for binomial and multinomial variables.

Parameters

dataset_column : pandas.DataFrame

Dataset column to be transformed.

transform : str

Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor.

Returns

dataset_column : pandas.DataFrame

The transformed dataset column.
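
To make the “map” transformation concrete, here is a minimal sketch under stated assumptions: the JSON string is taken to be a flat object of source value to target value, values without an entry are passed through unchanged, and a plain list stands in for the pandas column to keep the example dependency-free. `apply_map` is a hypothetical name; the package's actual behavior may differ.

```python
import json


def apply_map(values, transform):
    """Recode categorical values using a JSON-encoded mapping.

    Hypothetical helper; the pass-through of unmapped values is an assumption.
    """
    mapping = json.loads(transform)
    # JSON object keys are always strings, so look values up by their string form.
    return [mapping.get(str(value), value) for value in values]


column = ["M", "F", "F", "M"]
transform = '{"M": "Male", "F": "Female"}'
mapped = apply_map(column, transform)
```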

mip_dmp.process.mapping.apply_transform_scale(dataset_column, cde_code, cde_type, scaling_factor)

Apply the transform scale for real and integer variables.

Parameters

dataset_column : pandas.DataFrame

Dataset column to be transformed.

cde_code : str

CDE code of the dataset column.

cde_type : str

CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”.

scaling_factor : float

Scaling factor to be applied to the dataset column.

Returns

dataset_column : pandas.DataFrame

The transformed dataset column.
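
A rough sketch of what a “scale” transformation could look like is given below. It assumes that scaling is a plain multiplication and that “integer” columns are rounded back to whole numbers afterwards; both points are assumptions for illustration, and `scale_values` is a hypothetical name operating on a plain list rather than a pandas column.

```python
def scale_values(values, cde_type, scaling_factor):
    """Multiply numeric values by a factor, rounding for integer CDEs.

    Hypothetical helper; rounding for "integer" is an assumption.
    """
    scaled = [value * scaling_factor for value in values]
    if cde_type == "integer":
        return [int(round(value)) for value in scaled]
    return scaled


# Metres to centimetres for a "real" variable.
heights_cm = scale_values([1.5, 1.82], "real", 100.0)
# Months to years for an "integer" variable.
ages_years = scale_values([24, 36], "integer", 1 / 12)
```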

mip_dmp.process.mapping.map_dataset(dataset, mappings, cde_codes)

Map the dataset to the schema.

Parameters

dataset : pandas.DataFrame

Dataset to be mapped.

mappings : dict

Mappings of the dataset columns to the schema columns.

cde_codes : list

List of codes of the CDE metadata schema.

Returns

pandas.DataFrame

Mapped dataset.

mip_dmp.process.mapping.transform_dataset_column(dataset_column, cde_code, cde_type, transform_type, transform)

Transform the dataset column.

Parameters

dataset_column : pandas.DataFrame

Dataset column to be transformed.

cde_code : str

CDE code of the dataset column.

cde_type : str

CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”.

transform_type : str

Type of transformation to be applied to the dataset column. Can be “map” or “scale”.

transform : str

Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor.

Returns

dataset_column : pandas.DataFrame

The transformed dataset column.

mip_dmp.process.matching

Module that provides functions to support the matching of dataset columns to CDEs.

mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)

Generate the initial transform.

Parameters

dataset_column_values : list of str

Dataset column values.

cde_code_values : list of str

CDE code values.

dataset_column : str

Dataset column.

Returns

initial_transform : str

Initial transform.

mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)

Make the n closest match distance vector for a given input dataset column.

Parameters

matchedCdeCodes : dict

Dictionary of the matching results in the form:

{
    "inputDatasetColumn1": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    "inputDatasetColumn2": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    ...
}

inputDatasetColumn : str

Input dataset column name.

Returns

distanceVector : numpy.ndarray

Similarity/distance vector.
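
Given the dictionary layout documented above, one plausible reading of this function is a lookup of the stored distances for a single column. The sketch below follows that reading; `distance_vector` is a hypothetical name, the sample values are made up, and a plain list is returned instead of a numpy.ndarray to keep the example dependency-free.

```python
def distance_vector(matched_cde_codes, input_dataset_column):
    """Return a copy of the match distances stored for one dataset column.

    Hypothetical helper illustrating the documented dictionary layout.
    """
    return list(matched_cde_codes[input_dataset_column]["distances"])


# Illustrative matching results for a single dataset column.
matched = {
    "subject_age": {
        "words": ["age_value", "gender"],
        "distances": [0.12, 0.87],
        "embeddings": [None, None],
    },
}
vector = distance_vector(matched, "subject_age")
```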

mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)

Make the initial transform.

Parameters

dataset : pandas.DataFrame

Dataset to be mapped.

schema : pandas.DataFrame

Schema to which the dataset is mapped.

dataset_column : str

Dataset column.

cde_code : str

CDE code.

Returns

dict

Initial transform.

mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)

Match a dataset column to CDEs using fuzzy matching.

Parameters

dataset_column : str

Dataset column.

schema : pandas.DataFrame

Schema to which the dataset is mapped.

Returns

list

List of matched CDE codes ordered by decreasing fuzzy ratio.
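
The ranking idea can be sketched with the standard library. Note that this example swaps in `difflib.SequenceMatcher` as a stand-in for the fuzzy ratio, so the scores will not match those of the package's fuzzy matcher; `rank_cde_codes` and the sample CDE codes are hypothetical.

```python
from difflib import SequenceMatcher


def rank_cde_codes(dataset_column, cde_codes):
    """Order CDE codes by decreasing string similarity to a column name.

    Hypothetical helper; uses difflib instead of a Levenshtein-based ratio.
    """

    def ratio(code):
        # Compare case-insensitively so "Age" and "age" score as identical.
        return SequenceMatcher(None, dataset_column.lower(), code.lower()).ratio()

    return sorted(cde_codes, key=ratio, reverse=True)


ranked = rank_cde_codes("subject_age", ["gender", "age_value", "subjectage"])
```

The best candidate comes first, which mirrors the documented ordering by decreasing ratio.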

mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')

Initialize the mapping table by matching the dataset columns with the CDE codes.

Different matching methods can be used:

- “fuzzy”: Fuzzy matching using the Levenshtein distance. (https://github.com/seatgeek/thefuzz)
- “glove”: Embedding matching using GloVe embeddings at the character level. (https://nlp.stanford.edu/projects/glove/)
- “chars2vec”: Embedding matching using Chars2Vec embeddings. (https://github.com/IntuitionEngineeringTeam/chars2vec)

Parameters

dataset : pandas.DataFrame

Dataset to be mapped.

schema : pandas.DataFrame

Schema to which the dataset is mapped.

nb_kept_matches : int

Number of matches to keep for each dataset column.

matching_method : str

Method to be used for matching the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”.

Returns

pandas.DataFrame

Mapping table represented as a Pandas DataFrame.

matched_cde_codes : dict

Dictionary of dictionaries that stores, for each dataset column (key), the nb_kept_matches best-matching CDE codes (10 by default) together with their fuzzy ratio or cosine similarity and their embedding vector (value). It has the form:

{
    "dataset_column_1": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    "dataset_column_2": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    ...
}

dataset_column_embeddings : list

List of embedding vectors for the dataset columns.

schema_code_embeddings : list

List of embedding vectors for the CDE codes.

mip_dmp.process.utils

Module that provides functions to support the modules of the mip_dmp.process subpackage.

mip_dmp.process.utils.is_number(s)

Check if a string is a number.

Parameters

s : str

String to check.

Returns

bool

True if the string is a number, False otherwise.
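
A common way to implement such a check is to attempt a float conversion, as in the sketch below. This is an assumption about the approach, not the package's code, and `looks_like_number` is a hypothetical name.

```python
def looks_like_number(s):
    """Return True if s can be parsed as a float, False otherwise.

    Hypothetical helper; the float-conversion approach is an assumption.
    """
    try:
        float(s)
    except (TypeError, ValueError):
        return False
    return True


checks = [looks_like_number(value) for value in ["3.14", "1e-3", "-7", "abc", ""]]
```

With this approach, strings such as "nan" and "inf" also count as numbers, because `float` accepts them; a stricter check would need to reject those explicitly.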