mip_dmp.process
The mip_dmp.process subpackage contains modules that provide processing functions for the MIP Dataset Mapper.
mip_dmp.process.embedding
Module that provides functions to handle word embeddings and operations on them.
- mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)
Find the chars2vec embedding for the text.
Parameters
- text: str
Text to be embedded.
- chars2vec_model: str
chars2vec model to be used, loaded by the gensim library.
Returns
- numpy.ndarray
chars2vec embedding for the text.
- mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)
Find the matches based on chars2vec embeddings and cosine similarity.
Parameters
- x_embedding: str
String to compare against.
- y_embedding: str
String to compare.
- chars2vec_model: str
chars2vec model to be used, loaded by the gensim library. Note that this parameter is documented here but does not appear in the function signature.
Returns
- float
Cosine similarity between the two chars2vec embeddings of the strings.
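As a point of reference, cosine similarity between two embedding vectors can be sketched as below. This is a minimal standard-library illustration and not the package's implementation: the helper name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made for this example.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length numeric vectors.

    Hypothetical helper for illustration only; it is not part of mip_dmp.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_x * norm_y)
```

A value close to 1 means the two vectors point in nearly the same direction, and a value close to 0 means they are nearly orthogonal.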
- mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)
Find the n closest embeddings to the given embedding.
Parameters
- word_embedding: numpy.ndarray
Embedding for which the n closest embeddings are searched.
- embeddings: list
List of candidate embeddings to search.
- embedding_words: list
List of words corresponding to the embeddings; it is re-sorted and truncated in the same way as the embeddings.
- n: int
Number of closest embeddings to find.
Returns
- dict
Dictionary containing the n closest embeddings, their distances to the given embedding, and the words corresponding to the embeddings in the form:
{ "distances": [float], "embeddings": [numpy.ndarray], "embedding_words": [str] }
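The selection described above can be sketched as follows. The function name `n_closest`, the use of cosine distance as the metric, and the assumption that all vectors are non-zero are choices made for this illustration; the documentation does not state which distance the package uses.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def n_closest(word_embedding, embeddings, embedding_words, n=5):
    """Return the n candidates closest to word_embedding (hypothetical helper)."""
    distances = [cosine_distance(word_embedding, e) for e in embeddings]
    # Indices of the candidates, sorted by increasing distance and truncated to n.
    order = sorted(range(len(embeddings)), key=lambda i: distances[i])[:n]
    return {
        "distances": [distances[i] for i in order],
        "embeddings": [embeddings[i] for i in order],
        "embedding_words": [embedding_words[i] for i in order],
    }


result = n_closest(
    [1.0, 0.0],
    [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]],
    ["a", "b", "c"],
    n=2,
)
print(result["embedding_words"])
```

The three lists in the returned dictionary stay aligned: position i in each list refers to the same candidate.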
- mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')
Generate embeddings for a list of words.
Parameters
- words: list
List of words to generate embeddings for.
- embedding_method: str
Embedding method to be used, either “chars2vec” or “glove”.
Returns
- list
List of embeddings for the words.
- mip_dmp.process.embedding.glove_embedding(text, glove_model)
Find the Glove embedding for the text.
Parameters
- text: str
Text to be embedded.
- glove_model: str
Glove model to be used, loaded by the gensim library.
Returns
- numpy.ndarray
Glove embedding for the text.
- mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)
Reduce the dimension of the embeddings, mainly for visualization purposes.
Parameters
- embeddings: list
List of embeddings to reduce the dimension of.
- reduce_method: str
Method to use to reduce the dimension, either “tsne” or “pca”.
- n_components: int
Number of components to reduce the dimension to.
Returns
- list
List of reduced embeddings.
mip_dmp.process.mapping
Module that provides functions to support the mapping of datasets to a specific CDEs metadata schema.
- mip_dmp.process.mapping.apply_transform_map(dataset_column, transform)
Apply the transform map for binomial and multinomial variables.
Parameters
- dataset_column: pandas.DataFrame
Dataset column to be transformed.
- transform: str
Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor.
Returns
- dataset_column: pandas.DataFrame
The transformed dataset column.
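To make the “map” transformation concrete, here is a minimal sketch that applies a JSON-encoded value mapping to a plain list of values. It uses a list instead of a pandas object to stay dependency-free. The name `apply_map`, the exact JSON layout (source value as key, target value as value), and the choice to leave unmapped values unchanged are assumptions for this example and not confirmed behavior of the package.

```python
import json


def apply_map(values, transform):
    """Replace each value using a JSON object of {source: target} pairs.

    Hypothetical helper; JSON keys are always strings, so values are
    converted with str() before the lookup.
    """
    mapping = json.loads(transform)
    # Assumption of this sketch: values without a mapping entry pass through.
    return [mapping.get(str(v), v) for v in values]


print(apply_map(["M", "F", "M"], '{"M": "male", "F": "female"}'))
```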
- mip_dmp.process.mapping.apply_transform_scale(dataset_column, cde_code, cde_type, scaling_factor)
Apply the transform scale for real and integer variables.
Parameters
- dataset_column: pandas.DataFrame
Dataset column to be transformed.
- cde_code: str
CDE code of the dataset column.
- cde_type: str
CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”.
- scaling_factor: float
Scaling factor to be applied to the dataset column.
Returns
- dataset_column: pandas.DataFrame
The transformed dataset column.
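The “scale” transformation for numeric variables can be sketched in the same dependency-free style. The name `apply_scale`, the rounding of “integer” results, and the rejection of non-numeric CDE types are assumptions made for this illustration; the package's actual handling may differ.

```python
def apply_scale(values, cde_type, scaling_factor):
    """Multiply numeric values by scaling_factor (hypothetical helper)."""
    if cde_type not in ("integer", "real"):
        raise ValueError("scaling only applies to integer or real variables")
    scaled = [float(v) * scaling_factor for v in values]
    if cde_type == "integer":
        # Assumption of this sketch: integer variables are rounded after scaling.
        return [int(round(v)) for v in scaled]
    return scaled


print(apply_scale([1, 2], "real", 0.5))
```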
- mip_dmp.process.mapping.map_dataset(dataset, mappings, cde_codes)
Map the dataset to the schema.
Parameters
- dataset: pandas.DataFrame
Dataset to be mapped.
- mappings: dict
Mappings of the dataset columns to the schema columns.
- cde_codes: list
List of codes of the CDE metadata schema.
Returns
- pandas.DataFrame
Mapped dataset.
- mip_dmp.process.mapping.transform_dataset_column(dataset_column, cde_code, cde_type, transform_type, transform)
Transform the dataset column.
Parameters
- dataset_column: pandas.DataFrame
Dataset column to be transformed.
- cde_code: str
CDE code of the dataset column.
- cde_type: str
CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”.
- transform_type: str
Type of transformation to be applied to the dataset column. Can be “map” or “scale”.
- transform: str
Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor.
Returns
- dataset_column: pandas.DataFrame
The transformed dataset column.
mip_dmp.process.matching
Module that provides functions to support the matching of dataset columns to CDEs.
- mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)
Generate the initial transform.
Parameters
- dataset_column_values: list of str
Dataset column values.
- cde_code_values: list of str
CDE code values.
- dataset_column: str
Dataset column.
Returns
- initial_transform: str
Initial transform.
- mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)
Make the n closest match distance vector for a given input dataset column.
Parameters
- matchedCdeCodes: dict
Dictionary of the matching results in the form:
{
  "inputDatasetColumn1": {
    "words": ["word1", "word2", ...],
    "distances": [distance1, distance2, ...],
    "embeddings": [embedding1, embedding2, ...]
  },
  "inputDatasetColumn2": {
    "words": ["word1", "word2", ...],
    "distances": [distance1, distance2, ...],
    "embeddings": [embedding1, embedding2, ...]
  },
  ...
}
- inputDatasetColumn: str
Input dataset column name.
Returns
- distanceVector: numpy.ndarray
Similarity/distance vector.
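The matching-results dictionary described above can be read as in the following sketch. It only looks up the "distances" entry of one dataset column and returns it as a list. The name `distance_vector` and the sample column names and values are invented for this example, and the real function may apply further processing before returning a numpy.ndarray.

```python
def distance_vector(matched_cde_codes, input_dataset_column):
    """Return the distances stored for one dataset column (hypothetical helper)."""
    return list(matched_cde_codes[input_dataset_column]["distances"])


# Invented sample data following the documented dictionary layout.
matched = {
    "age_years": {
        "words": ["age", "agegroup"],
        "distances": [0.9, 0.7],
        "embeddings": [None, None],
    },
    "sex": {
        "words": ["gender", "age"],
        "distances": [0.4, 0.2],
        "embeddings": [None, None],
    },
}

print(distance_vector(matched, "age_years"))
```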
- mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)
Make the initial transform.
Parameters
- dataset: pandas.DataFrame
Dataset to be mapped.
- schema: pandas.DataFrame
Schema to which the dataset is mapped.
- dataset_column: str
Dataset column.
- cde_code: str
CDE code.
Returns
- dict
Initial transform.
- mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)
Match a dataset column to CDEs using fuzzy matching.
Parameters
- dataset_column: str
Dataset column.
- schema: pandas.DataFrame
Schema to which the dataset is mapped.
Returns
- list
List of matched CDE codes ordered by decreasing fuzzy ratio.
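The ranking idea can be illustrated with the standard library alone. The sketch below substitutes `difflib.SequenceMatcher` for the fuzzy-matching library, so its scores are not the same as a Levenshtein-based fuzzy ratio. The function name `rank_cde_codes`, the plain list of codes used in place of a schema DataFrame, and the case folding are all assumptions for this example.

```python
from difflib import SequenceMatcher


def rank_cde_codes(dataset_column, cde_codes):
    """Order CDE codes by decreasing string similarity to the column name.

    Hypothetical helper using difflib as a stand-in for a fuzzy ratio.
    """
    def ratio(a, b):
        # Case-insensitive comparison is a choice made for this sketch.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    return sorted(cde_codes, key=lambda code: ratio(dataset_column, code), reverse=True)


ranked = rank_cde_codes("Age", ["gender", "age", "agegroup"])
print(ranked)
```

Because `sorted` is stable, codes with identical scores keep their original relative order.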
- mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')
Initialize the mapping table by matching the dataset columns with the CDE codes.
Different matching methods can be used:
  - “fuzzy”: Fuzzy matching using the Levenshtein distance. (https://github.com/seatgeek/thefuzz)
  - “glove”: Embedding matching using Glove embeddings at the character level. (https://nlp.stanford.edu/projects/glove/)
  - “chars2vec”: Embedding matching using Chars2Vec embeddings. (https://github.com/IntuitionEngineeringTeam/chars2vec)
Parameters
- dataset: pandas.DataFrame
Dataset to be mapped.
- schema: pandas.DataFrame
Schema to which the dataset is mapped.
- nb_kept_matches: int
Number of matches to keep for each dataset column.
- matching_method: str
Method to be used for matching the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”.
Returns
- pandas.DataFrame
Mapping table represented as a Pandas DataFrame.
- matched_cde_codes: dict
Dictionary of dictionaries storing, for each dataset column (key), the first nb_kept_matches (10 by default) matched CDE codes together with the corresponding fuzzy ratio or cosine similarity and the embedding vector. It has the form:
{
  "dataset_column_1": {
    "words": ["cde_code_1", "cde_code_2", ...],
    "distances": [0.9, 0.8, ...],
    "embeddings": [None, None, ...]
  },
  "dataset_column_2": {
    "words": ["cde_code_1", "cde_code_2", ...],
    "distances": [0.9, 0.8, ...],
    "embeddings": [None, None, ...]
  },
  ...
}
- dataset_column_embeddings: list
List of embedding vectors for the dataset columns.
- schema_code_embeddings: list
List of embedding vectors for the CDE codes.
mip_dmp.process.utils
Module that provides functions to support the modules of the mip_dmp.process sub-package.