mip_dmp.process
The mip_dmp.process subpackage contains modules that provide processing functions for the MIP Dataset Mapper.
mip_dmp.process.embedding
Module that provides functions to handle word embeddings and operations on them.
- mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)
- Find the chars2vec embedding for the text. - Parameters- textstr
- Text to be embedded. 
- chars2vec_modelstr
- chars2vec model to be used, loaded by the gensim library. 
 - Returns- numpy.ndarray
- chars2vec embedding for the text. 
 
- mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)
- Compute the cosine similarity between two embeddings. - Parameters- x_embeddingnumpy.ndarray
- Embedding of the string to compare against. 
- y_embeddingnumpy.ndarray
- Embedding of the string to compare. 
 - Returns- float
- Cosine similarity between the two chars2vec embeddings of the strings. 
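As an illustration of the quantity returned here, the following is a minimal sketch of cosine similarity between two embedding vectors. It uses plain Python lists rather than numpy arrays and does not call mip_dmp; the helper name `cosine_similarity` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Return the cosine similarity between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical 3-dimensional embeddings, for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.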
 
- mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)
- Find the n closest embeddings to the given embedding. - Parameters- word_embeddingnumpy.ndarray
- Embedding to find the n closest embeddings to. 
- embeddingslist
- List of embeddings to find the closest embeddings to the given embedding in. 
- embedding_wordslist
- List of words corresponding to the embeddings that will be resorted and reduced accordingly. 
- nint
- Number of closest embeddings to find. 
 - Returns- dict
- Dictionary containing the n closest embeddings, their distances to the given embedding, and the words corresponding to the embeddings in the form: - { "distances": [float], "embeddings": [numpy.ndarray], "embedding_words": [str] } 
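The documented return structure can be sketched as follows. This is not the library's implementation: it assumes Euclidean distance (the actual metric may differ), works on plain lists, and uses a hypothetical helper name `n_closest`.

```python
import math


def n_closest(word_embedding, embeddings, embedding_words, n=5):
    """Return the n entries of `embeddings` closest to `word_embedding`."""
    # Assumption: Euclidean distance; the library may use another metric.
    distances = [math.dist(word_embedding, e) for e in embeddings]
    # Indices of the embeddings sorted by increasing distance, truncated to n.
    order = sorted(range(len(embeddings)), key=lambda i: distances[i])[:n]
    return {
        "distances": [distances[i] for i in order],
        "embeddings": [embeddings[i] for i in order],
        "embedding_words": [embedding_words[i] for i in order],
    }


# Hypothetical 2-dimensional embeddings and words, for illustration only.
result = n_closest(
    [0.0, 0.0],
    [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]],
    ["far", "near", "middle"],
    n=2,
)
```

The three lists in the result stay aligned, so the word at a given index corresponds to the distance and embedding at that index.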
 
- mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')
- Generate embeddings for a list of words. - Parameters- wordslist
- List of words to generate embeddings for. 
- embedding_methodstr
- Embedding method to be used, either “chars2vec” or “glove”. 
 - Returns- list
- List of embeddings for the words. 
 
- mip_dmp.process.embedding.glove_embedding(text, glove_model)
- Find the Glove embedding for the text. - Parameters- textstr
- Text to be embedded. 
- glove_modelstr
- Glove model to be used, loaded by the gensim library. 
 - Returns- numpy.ndarray
- Glove embedding for the text. 
 
- mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)
- Reduce the dimension of the embeddings, mainly for visualization purposes. - Parameters- embeddingslist
- List of embeddings to reduce the dimension of. 
- reduce_methodstr
- Method to use to reduce the dimension, either “tsne” or “pca”. 
- n_componentsint
- Number of components to reduce the dimension to. 
 - Returns- list
- List of reduced embeddings. 
 
mip_dmp.process.mapping
Module that provides functions to support the mapping of datasets to a specific CDEs metadata schema.
- mip_dmp.process.mapping.apply_transform_map(dataset_column, transform)
- Apply the transform map for binomial and multinomial variables. - Parameters- dataset_columnpandas.DataFrame
- Dataset column to be transformed. 
- transformstr
- Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor. 
 - Returns- dataset_column: pandas.DataFrame
- The transformed dataset column. 
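To make the "map" transformation concrete, here is a minimal sketch that applies a JSON-encoded mapping to a column of categorical values. It assumes the JSON string is a flat object of source-to-target labels and that unmapped values are left unchanged; both are assumptions, as is the helper name `apply_map`. It operates on a plain list rather than a pandas object.

```python
import json


def apply_map(values, transform):
    """Replace each value using a JSON-encoded {source: target} mapping."""
    mapping = json.loads(transform)
    # Assumption: values absent from the mapping are left unchanged.
    return [mapping.get(str(v), v) for v in values]


# Hypothetical mapping from dataset labels to CDE labels.
transform = '{"M": "Male", "F": "Female"}'
mapped = apply_map(["M", "F", "M", "unknown"], transform)
```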
 
- mip_dmp.process.mapping.apply_transform_scale(dataset_column, cde_code, cde_type, scaling_factor)
- Apply the transform scale for real and integer variables. - Parameters- dataset_columnpandas.DataFrame
- Dataset column to be transformed. 
- cde_codestr
- CDE code of the dataset column. 
- cde_typestr
- CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”. 
- scaling_factorfloat
- Scaling factor to be applied to the dataset column. 
 - Returns- dataset_column: pandas.DataFrame
- The transformed dataset column. 
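The "scale" transformation can be sketched in a similar way. The example below multiplies numeric values by a factor and, as an assumption not confirmed by this reference, rounds the result when the CDE type is "integer". The helper name `apply_scale` is hypothetical and the sketch works on a plain list.

```python
def apply_scale(values, cde_type, scaling_factor):
    """Multiply numeric values by a factor, rounding for integer CDEs."""
    if cde_type not in ("integer", "real"):
        raise ValueError("scaling applies to 'integer' or 'real' CDE types")
    scaled = [v * scaling_factor for v in values]
    # Assumption: integer CDEs are rounded to the nearest integer after scaling.
    if cde_type == "integer":
        return [int(round(v)) for v in scaled]
    return scaled


# Hypothetical example: convert heights from metres to centimetres.
heights_cm = apply_scale([1.5, 1.75, 2.0], "integer", 100.0)
halved = apply_scale([1.0, 2.0], "real", 0.5)
```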
 
- mip_dmp.process.mapping.map_dataset(dataset, mappings, cde_codes)
- Map the dataset to the schema. - Parameters- datasetpandas.DataFrame
- Dataset to be mapped. 
- mappingsdict
- Mappings of the dataset columns to the schema columns. 
- cde_codeslist
- List of codes of the CDE metadata schema. 
 - Returns- pandas.DataFrame
- Mapped dataset. 
 
- mip_dmp.process.mapping.transform_dataset_column(dataset_column, cde_code, cde_type, transform_type, transform)
- Transform the dataset column. - Parameters- dataset_columnpandas.DataFrame
- Dataset column to be transformed. 
- cde_codestr
- CDE code of the dataset column. 
- cde_typestr
- CDE type of the dataset column. Can be “binomial”, “multinomial”, “integer” or “real”. 
- transform_typestr
- Type of transformation to be applied to the dataset column. Can be “map” or “scale”. 
- transformstr
- Transformation to be applied to the dataset column. Can be a JSON string for the “map” transformation type or a scaling factor. 
 - Returns- dataset_column: pandas.DataFrame
- The transformed dataset column. 
 
mip_dmp.process.matching
Module that provides functions to support the matching of dataset columns to CDEs.
- mip_dmp.process.matching.generate_initial_transform(dataset_column_values, cde_code_values, dataset_column)
- Generate the initial transform. - Parameters- dataset_column_valueslist of str
- Dataset column values. 
- cde_code_valueslist of str
- CDE code values. 
- dataset_columnstr
- Dataset column. 
 - Returns- initial_transformstr
- Initial transform. 
 
- mip_dmp.process.matching.make_distance_vector(matchedCdeCodes, inputDatasetColumn)
- Make the n closest match distance vector for a given input dataset column. - Parameters- matchedCdeCodesdict
- Dictionary of the matching results in the form: - { "inputDatasetColumn1": { "words": ["word1", "word2", ...], "distances": [distance1, distance2, ...], "embeddings": [embedding1, embedding2, ...] }, "inputDatasetColumn2": { "words": ["word1", "word2", ...], "distances": [distance1, distance2, ...], "embeddings": [embedding1, embedding2, ...] }, ... } 
- inputDatasetColumnstr
- Input dataset column name. 
 - Returns- distanceVectornumpy.ndarray
- Similarity/distance vector. 
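For orientation, the sketch below shows one plausible way to read the distances for a single column out of a dictionary in the documented form. It returns a plain list rather than a numpy.ndarray, and the helper name `distance_vector` and the sample data are hypothetical; the actual function may do more than this.

```python
def distance_vector(matched_cde_codes, input_dataset_column):
    """Return the list of match distances stored for one dataset column."""
    return list(matched_cde_codes[input_dataset_column]["distances"])


# Hypothetical matching results in the documented form.
matched = {
    "age_years": {
        "words": ["age", "agegroup"],
        "distances": [0.9, 0.6],
        "embeddings": [None, None],
    },
}
vector = distance_vector(matched, "age_years")
```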
 
- mip_dmp.process.matching.make_initial_transform(dataset, schema, dataset_column, cde_code)
- Make the initial transform. - Parameters- datasetpandas.DataFrame
- Dataset to be mapped. 
- schemapandas.DataFrame
- Schema to which the dataset is mapped. 
- dataset_columnstr
- Dataset column. 
- cde_codestr
- CDE code. 
 - Returns- dict
- Initial transform. 
 
- mip_dmp.process.matching.match_column_to_cdes(dataset_column, schema)
- Match a dataset column to CDEs using fuzzy matching. - Parameters- dataset_columnstr
- Dataset column. 
- schemapandas.DataFrame
- Schema to which the dataset is mapped. 
 - Returns- list
- List of matched CDE codes ordered by decreasing fuzzy ratio. 
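The idea of ranking CDE codes by decreasing string similarity can be sketched with the standard library. Note the substitutions: this example uses `difflib.SequenceMatcher` as a stand-in for the fuzzy ratio, and takes a plain list of codes instead of a schema DataFrame. The helper name `rank_codes` and the sample codes are hypothetical.

```python
from difflib import SequenceMatcher


def rank_codes(dataset_column, cde_codes):
    """Return CDE codes sorted by decreasing similarity to the column name."""
    def ratio(code):
        # Case-insensitive similarity in [0, 1]; a stand-in for a fuzzy ratio.
        return SequenceMatcher(None, dataset_column.lower(), code.lower()).ratio()

    return sorted(cde_codes, key=ratio, reverse=True)


# Hypothetical column name and CDE codes, for illustration only.
ranked = rank_codes("gender", ["age", "gender", "xyz"])
```

An exact match comes first, and a code sharing no characters with the column name comes last.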
 
- mip_dmp.process.matching.match_columns_to_cdes(dataset, schema, nb_kept_matches=10, matching_method='fuzzy')
- Initialize the mapping table by matching the dataset columns with the CDE codes. - Different matching methods can be used: - “fuzzy”: Fuzzy matching using the Levenshtein distance. (https://github.com/seatgeek/thefuzz) - “glove”: Embedding matching using Glove embeddings at the character level. (https://nlp.stanford.edu/projects/glove/) - “chars2vec”: Embedding matching using Chars2Vec embeddings. (https://github.com/IntuitionEngineeringTeam/chars2vec) - Parameters- datasetpandas.DataFrame
- Dataset to be mapped. 
- schemapandas.DataFrame
- Schema to which the dataset is mapped. 
- nb_kept_matchesint
- Number of matches to keep for each dataset column. 
- matching_methodstr
- Method to be used for matching the dataset columns with the CDE codes. Can be “fuzzy”, “glove” or “chars2vec”. 
 - Returns- pandas.DataFrame
- Mapping table represented as a Pandas DataFrame. 
- matched_cde_codesdict
- Dictionary of dictionaries that stores, for each dataset column (key), the first 10 matched CDE codes together with their fuzzy ratios or cosine similarities and their embedding vectors (value). It has the form: - { "dataset_column_1": { "words": ["cde_code_1", "cde_code_2", ...], "distances": [0.9, 0.8, ...], "embeddings": [None, None, ...] }, "dataset_column_2": { "words": ["cde_code_1", "cde_code_2", ...], "distances": [0.9, 0.8, ...], "embeddings": [None, None, ...] }, ... } 
- dataset_column_embeddingslist
- List of embedding vectors for the dataset columns. 
- schema_code_embeddingslist
- List of embedding vectors for the CDE codes. 
 
mip_dmp.process.utils
Module that provides functions to support the modules of the mip_dmp.process subpackage.