mip_dmp.process.matching module

Module that provides functions to support the matching of dataset columns to CDEs.

mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)

Generate the initial transform.

Parameters

dataset_column_values : list of str

Dataset column values.

cde_code_values : list of str

CDE code values.

dataset_column : str

Dataset column.

Returns

initial_transform : str

Initial transform.

mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)

Build the distance vector of the n closest matches for a given input dataset column.

Parameters

matchedCdeCodes : dict

Dictionary of the matching results in the form:

{
    "inputDatasetColumn1": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    "inputDatasetColumn2": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    ...
}

inputDatasetColumn : str

Input dataset column name.

Returns

distanceVector : numpy.ndarray

Similarity/distance vector.
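
To illustrate the dictionary layout described above, here is a minimal sketch of how the scores for one column could be read out of it. The function name `distance_vector_sketch` is hypothetical and this is not the mip_dmp implementation; it also returns a plain list instead of a numpy.ndarray so that it runs with the standard library alone.

```python
def distance_vector_sketch(matched_cde_codes, input_dataset_column):
    """Return the stored match scores of one dataset column as a list of floats.

    Hypothetical helper: it only reads the "distances" list of the entry
    for the requested column and does not recompute anything.
    """
    entry = matched_cde_codes[input_dataset_column]
    return [float(score) for score in entry["distances"]]


# Toy matching results following the documented structure (values are made up).
matched = {
    "age": {
        "words": ["age_value", "gender"],
        "distances": [0.5, 0.25],
        "embeddings": [None, None],
    },
}

vector = distance_vector_sketch(matched, "age")
```

Under these assumptions, the i-th entry of `vector` is the score of the i-th code in `matched["age"]["words"]`.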

mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)

Make the initial transform.

Parameters

dataset : pandas.DataFrame

Dataset to be mapped.

schema : pandas.DataFrame

Schema to which the dataset is mapped.

dataset_column : str

Dataset column.

cde_code : str

CDE code.

Returns

dict

Initial transform.

mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)

Match a dataset column to CDEs using fuzzy matching.

Parameters

dataset_column : str

Dataset column.

schema : pandas.DataFrame

Schema to which the dataset is mapped.

Returns

list

List of matched CDE codes ordered by decreasing fuzzy ratio.
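
The idea of ordering candidate codes by decreasing similarity ratio can be sketched as follows. This is an illustration only: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a fuzzy ratio, takes a plain list of codes instead of a schema DataFrame, and the function name `rank_codes_by_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def rank_codes_by_similarity(dataset_column, cde_codes):
    """Return cde_codes sorted by decreasing similarity to dataset_column."""

    def ratio(code):
        # Case-insensitive similarity in [0, 1]; 1.0 means identical strings.
        return SequenceMatcher(None, dataset_column.lower(), code.lower()).ratio()

    return sorted(cde_codes, key=ratio, reverse=True)


# Made-up CDE codes, for illustration only.
codes = ["gender", "dataset", "age_value"]
ranked = rank_codes_by_similarity("age", codes)
```

Here `ranked[0]` is `"age_value"`, since it shares the longest common run of characters with `"age"`.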

mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')

Initialize the mapping table by matching the dataset columns with the CDE codes.

Different matching methods can be used:

- “fuzzy”: Fuzzy matching using the Levenshtein distance (https://github.com/seatgeek/thefuzz).
- “glove”: Embedding matching using GloVe embeddings at the character level (https://nlp.stanford.edu/projects/glove/).
- “chars2vec”: Embedding matching using Chars2Vec embeddings (https://github.com/IntuitionEngineeringTeam/chars2vec).

Parameters

dataset : pandas.DataFrame

Dataset to be mapped.

schema : pandas.DataFrame

Schema to which the dataset is mapped.

nb_kept_matches : int

Number of matches to keep for each dataset column.

matching_method : str

Method to be used for matching the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”.

Returns

pandas.DataFrame

Mapping table represented as a Pandas DataFrame.

matched_cde_codes : dict

Dictionary of dictionaries that stores, for each dataset column (key), the best-matching CDE codes (10 by default, see nb_kept_matches) together with their fuzzy ratio or cosine similarity and their embedding vector. It has the form:

{
    "dataset_column_1": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    "dataset_column_2": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    ...
}

dataset_column_embeddings : list

List of embedding vectors for the dataset columns.

schema_code_embeddings : list

List of embedding vectors for the CDE codes.
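
As a rough sketch of how a matched_cde_codes dictionary of the documented shape could be assembled, the example below keeps the nb_kept_matches best-scoring codes per column. It is not the mip_dmp implementation: `build_matched_codes` is a hypothetical name, the inputs are plain lists instead of DataFrames, and `difflib.SequenceMatcher` stands in for the “fuzzy” method, so the embeddings are left as None.

```python
from difflib import SequenceMatcher


def build_matched_codes(dataset_columns, cde_codes, nb_kept_matches=10):
    """Map each dataset column to its best-matching codes and their scores."""
    matched = {}
    for column in dataset_columns:
        # Score every code against the column, best score first.
        scored = sorted(
            ((SequenceMatcher(None, column, code).ratio(), code) for code in cde_codes),
            key=lambda pair: pair[0],
            reverse=True,
        )[:nb_kept_matches]
        matched[column] = {
            "words": [code for _, code in scored],
            "distances": [score for score, _ in scored],
            # No embedding vectors are computed by a string-ratio method.
            "embeddings": [None] * len(scored),
        }
    return matched


# Made-up column names and CDE codes, for illustration only.
matched_codes = build_matched_codes(
    ["age", "sex"], ["age_value", "gender", "sex_code"], nb_kept_matches=2
)
```

With `nb_kept_matches=2`, each column keeps only its two highest-scoring codes, listed in decreasing order of score.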