mip_dmp.process.matching module
Module that provides functions to support the matching of dataset columns to CDEs.
- mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)
Generate the initial transform.
Parameters
- dataset_column_values : list of str
Dataset column values.
- cde_code_values : list of str
CDE code values.
- dataset_column : str
Dataset column.
Returns
- initial_transform : str
Initial transform.
- mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)
Make the distance vector of the n closest matches for a given input dataset column.
Parameters
- matchedCdeCodes : dict
Dictionary of the matching results in the form:
{
    "inputDatasetColumn1": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    "inputDatasetColumn2": {
        "words": ["word1", "word2", ...],
        "distances": [distance1, distance2, ...],
        "embeddings": [embedding1, embedding2, ...]
    },
    ...
}
- inputDatasetColumn : str
Input dataset column name.
Returns
- distanceVector : numpy.ndarray
Similarity/distance vector.
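To illustrate the input structure described above, here is a minimal sketch of reading one column's distances out of the matching-results dictionary. It assumes the vector is simply the stored "distances" list for that column; the helper name `distance_vector_for` and the sample values are hypothetical, and a plain list is returned instead of a numpy.ndarray to keep the sketch dependency-free.

```python
def distance_vector_for(matched_cde_codes, input_dataset_column):
    """Return the stored distances for one dataset column as a list of floats.

    Hypothetical helper: it assumes the vector is just the "distances"
    entry of the matching results, which the documentation does not confirm.
    """
    entry = matched_cde_codes[input_dataset_column]
    return [float(distance) for distance in entry["distances"]]


# Made-up matching results following the documented dictionary layout.
matched = {
    "Age": {
        "words": ["subjectage", "gender"],
        "distances": [0.46, 0.44],
        "embeddings": [None, None],
    },
}

vector = distance_vector_for(matched, "Age")
```

With NumPy available, wrapping the result in `numpy.asarray(...)` would give an array like the one the documented function returns.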
- mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)
Make the initial transform.
Parameters
- dataset : pandas.DataFrame
Dataset to be mapped.
- schema : pandas.DataFrame
Schema to which the dataset is mapped.
- dataset_column : str
Dataset column.
- cde_code : str
CDE code.
Returns
- dict
Initial transform.
- mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)
Match a dataset column to CDEs using fuzzy matching.
Parameters
- dataset_column : str
Dataset column.
- schema : pandas.DataFrame
Schema to which the dataset is mapped.
Returns
- list
List of matched CDE codes ordered by decreasing fuzzy ratio.
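The ordering by decreasing fuzzy ratio can be sketched as follows. This is not the module's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for a fuzzy ratio, takes a plain list of codes instead of a schema DataFrame, and the function name `rank_cde_codes` and the sample codes are hypothetical.

```python
from difflib import SequenceMatcher


def rank_cde_codes(dataset_column, cde_codes):
    """Return CDE codes ordered by decreasing similarity to dataset_column."""

    def ratio(code):
        # Case-insensitive similarity in [0, 1]; a stand-in for a fuzzy ratio.
        return SequenceMatcher(None, dataset_column.lower(), code.lower()).ratio()

    # sorted() is stable, so codes with equal scores keep their input order.
    return sorted(cde_codes, key=ratio, reverse=True)


# Made-up CDE codes for illustration only.
codes = ["gender", "subjectage", "lefthippocampus"]
ranked = rank_cde_codes("Age", codes)
```

In this toy example "subjectage" ranks first because it contains "age" as a contiguous substring.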
- mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')
Initialize the mapping table by matching the dataset columns with the CDE codes.
Different matching methods can be used:
- “fuzzy”: Fuzzy matching using the Levenshtein distance (https://github.com/seatgeek/thefuzz).
- “glove”: Embedding matching using GloVe embeddings at the character level (https://nlp.stanford.edu/projects/glove/).
- “chars2vec”: Embedding matching using Chars2Vec embeddings (https://github.com/IntuitionEngineeringTeam/chars2vec).
Parameters
- dataset : pandas.DataFrame
Dataset to be mapped.
- schema : pandas.DataFrame
Schema to which the dataset is mapped.
- nb_kept_matches : int
Number of matches to keep for each dataset column. Defaults to 10.
- matching_method : str
Method used to match the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”. Defaults to “fuzzy”.
Returns
- pandas.DataFrame
Mapping table represented as a Pandas DataFrame.
- matched_cde_codes : dict
Dictionary of dictionaries storing, for each dataset column (key), the first 10 matched CDE codes together with their fuzzy ratio or cosine similarity and their embedding vector. It has the form:
{
    "dataset_column_1": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    "dataset_column_2": {
        "words": ["cde_code_1", "cde_code_2", ...],
        "distances": [0.9, 0.8, ...],
        "embeddings": [None, None, ...]
    },
    ...
}
- dataset_column_embeddings : list
List of embedding vectors for the dataset columns.
- schema_code_embeddings : list
List of embedding vectors for the CDE codes.
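As a rough illustration of the `matched_cde_codes` layout, the sketch below builds a dictionary of that shape for a few column names. It is an assumption-laden stand-in, not the module's code: `difflib.SequenceMatcher` replaces the fuzzy ratio, plain lists replace the dataset and schema DataFrames, embeddings are left as `None` as in the fuzzy case shown above, and the name `build_matched_cde_codes` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def build_matched_cde_codes(dataset_columns, cde_codes, nb_kept_matches=10):
    """Build a dict with "words", "distances" and "embeddings" per column."""
    matched = {}
    for column in dataset_columns:
        # Score every CDE code against the column name (case-insensitive).
        scored = [
            (code, SequenceMatcher(None, column.lower(), code.lower()).ratio())
            for code in cde_codes
        ]
        # Best matches first, then keep only the requested number.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        kept = scored[:nb_kept_matches]
        matched[column] = {
            "words": [code for code, _ in kept],
            "distances": [score for _, score in kept],
            # No embedding vectors in this string-similarity sketch.
            "embeddings": [None] * len(kept),
        }
    return matched


# Made-up column names and CDE codes for illustration only.
result = build_matched_cde_codes(
    ["Age", "Sex"],
    ["gender", "subjectage", "lefthippocampus"],
    nb_kept_matches=2,
)
```

Each inner dictionary keeps its three lists aligned by position, so `words[i]`, `distances[i]` and `embeddings[i]` describe the same candidate match.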