mip_dmp.process.matching module

Module that provides functions to support the matching of dataset columns to CDEs.

mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)

Generate the initial transform.

Parameters

dataset_column_values (list of str): Dataset column values.
cde_code_values (list of str): CDE code values.
dataset_column (str): Dataset column.

Returns

initial_transform (str): Initial transform.

mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)

Build the distance vector of the n closest matches for a given input dataset column.

Parameters

matchedCdeCodes (dict)

Dictionary of the matching results in the form:

{
    "inputDatasetColumn1": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    "inputDatasetColumn2": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    ...
}

inputDatasetColumn (str)

Input dataset column name.

Returns

distanceVector (numpy.ndarray): Similarity/distance vector.
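
As an illustration of the input structure, the sketch below reads the distances recorded for one column out of a dictionary shaped like matchedCdeCodes. The helper name `distances_for_column` and the sample values are hypothetical. A plain list is returned to keep the example free of dependencies; the documented function returns a numpy.ndarray and may apply processing that is not described here.

```python
def distances_for_column(matched_cde_codes, input_dataset_column):
    """Return the stored distances for one dataset column as a list of floats.

    Hypothetical helper for illustration only, not the library's implementation.
    """
    entry = matched_cde_codes[input_dataset_column]
    return [float(d) for d in entry["distances"]]


# Hypothetical matching results for a single dataset column.
matched = {
    "age": {
        "words": ["age_value", "agegroup"],
        "distances": [0.9, 0.6],
        "embeddings": [None, None],
    }
}

vector = distances_for_column(matched, "age")
print(vector)
```

The list has one entry per matched word, in the same order as the "words" list.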

mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)

Make the initial transform.

Parameters

dataset (pandas.DataFrame): Dataset to be mapped.
schema (pandas.DataFrame): Schema to which the dataset is mapped.
dataset_column (str): Dataset column.
cde_code (str): CDE code.

Returns

dict: Initial transform.

mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)

Match a dataset column to CDEs using fuzzy matching.

Parameters

dataset_column (str): Dataset column.
schema (pandas.DataFrame): Schema to which the dataset is mapped.

Returns

list: List of matched CDE codes ordered by decreasing fuzzy ratio.
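
The ordering described above can be sketched as follows. This is not the library's implementation: it swaps in the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a fuzzy ratio, takes a plain list of codes instead of a schema DataFrame, and the name `rank_codes_by_ratio` and the sample codes are hypothetical.

```python
from difflib import SequenceMatcher


def rank_codes_by_ratio(dataset_column, cde_codes):
    """Return cde_codes sorted by decreasing similarity to dataset_column."""

    def ratio(code):
        # Case-insensitive similarity in [0, 1]; higher means more similar.
        return SequenceMatcher(None, dataset_column.lower(), code.lower()).ratio()

    return sorted(cde_codes, key=ratio, reverse=True)


codes = ["gender", "agegroup", "age_value"]  # hypothetical CDE codes
ranked = rank_codes_by_ratio("age", codes)
print(ranked)
```

Codes that share a long common substring with the column name come first, which mirrors the "decreasing fuzzy ratio" ordering of the returned list.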

mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')

Initialize the mapping table by matching the dataset columns with the CDE codes.

Different matching methods can be used:

- “fuzzy”: Fuzzy matching using the Levenshtein distance (https://github.com/seatgeek/thefuzz).
- “glove”: Embedding matching using GloVe embeddings at the character level (https://nlp.stanford.edu/projects/glove/).
- “chars2vec”: Embedding matching using Chars2Vec embeddings (https://github.com/IntuitionEngineeringTeam/chars2vec).
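
For readers unfamiliar with the Levenshtein distance behind the “fuzzy” method, here is a minimal pure-Python sketch of the classic dynamic-programming computation. It is illustrative only: the function name `levenshtein` is hypothetical, and this is not how thefuzz computes or normalizes its ratios.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

A smaller distance means the two strings are closer; fuzzy-matching libraries typically rescale such a distance into a similarity score.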

Parameters

dataset (pandas.DataFrame): Dataset to be mapped.
schema (pandas.DataFrame): Schema to which the dataset is mapped.
nb_kept_matches (int): Number of matches to keep for each dataset column.
matching_method (str): Method to be used for matching the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”.

Returns

pandas.DataFrame

Mapping table represented as a Pandas DataFrame.

matched_cde_codes (dict)

Dictionary of dictionaries that stores, for each dataset column (key), the first 10 matched CDE codes together with their fuzzy ratio or cosine similarity and their embedding vectors (value). It has the form:

{
    "dataset_column_1": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    "dataset_column_2": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    ...
}

dataset_column_embeddings (list)

List of embedding vectors for the dataset columns.

schema_code_embeddings (list)

List of embedding vectors for the CDE codes.
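
For the embedding-based methods, the cosine similarity between a dataset column embedding and each CDE code embedding could be computed along these lines. This is a sketch under stated assumptions: the two-dimensional vectors are made up, the function name `cosine_similarity` is hypothetical, and returning 0.0 for a zero vector is a convention chosen here, not documented library behavior.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: one dataset column and three CDE codes.
column_embedding = [1.0, 0.0]
code_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

similarities = [cosine_similarity(column_embedding, e) for e in code_embeddings]
print(similarities)
```

Values close to 1 indicate embeddings pointing in nearly the same direction, so sorting by decreasing similarity ranks the closest CDE codes first.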