mip_dmp.process.embedding module
Module that provides functions to handle word embeddings and operations on them.
- mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)
Find the chars2vec embedding for the text.
Parameters
- text : str
Text to be embedded.
- chars2vec_model : str
chars2vec model to be used, loaded by the gensim library.
Returns
- numpy.ndarray
chars2vec embedding for the text.
- mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)
Compute the cosine similarity between two chars2vec embeddings.
Parameters
- x_embedding : numpy.ndarray
Embedding to compare against.
- y_embedding : numpy.ndarray
Embedding to compare.
Returns
- float
Cosine similarity between the two chars2vec embeddings of the strings.
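For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The following is a minimal sketch in plain Python, not the mip_dmp implementation: the helper name `cosine_similarity` is hypothetical, and returning 0.0 for a zero vector is an assumption made here for illustration.

```python
import math


def cosine_similarity(x, y):
    """Return the cosine similarity of two equal-length numeric sequences.

    Hypothetical helper for illustration; not part of mip_dmp.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: similarity with a zero vector is defined as 0.0 here.
        return 0.0
    return dot / (norm_x * norm_y)
```

Values close to 1 indicate embeddings pointing in nearly the same direction, values near 0 indicate unrelated directions, and values close to -1 indicate opposite directions.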
- mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)
Find the n closest embeddings to the given embedding.
Parameters
- word_embedding : numpy.ndarray
Query embedding whose n closest embeddings are to be found.
- embeddings : list
List of candidate embeddings to search.
- embedding_words : list
List of words corresponding to the embeddings; it is re-sorted and truncated in the same way as the embeddings.
- n : int
Number of closest embeddings to find.
Returns
- dict
Dictionary containing the n closest embeddings, their distances to the given embedding, and the words corresponding to the embeddings in the form:
{ "distances": [float], "embeddings": [numpy.ndarray], "embedding_words": [str] }
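The lookup described above can be sketched as follows. This is an illustrative version under stated assumptions, not the mip_dmp code: the helper name `n_closest` is hypothetical, and Euclidean distance is assumed because the documentation does not say which distance metric the library uses.

```python
import math


def n_closest(query, embeddings, words, n=5):
    """Return the n embeddings closest to `query`, with distances and words.

    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    if len(embeddings) != len(words):
        raise ValueError("embeddings and words must have the same length")
    # Distance from the query to every candidate embedding.
    distances = [math.dist(query, emb) for emb in embeddings]
    # Indices of the candidates, sorted by increasing distance, cut to n.
    order = sorted(range(len(embeddings)), key=lambda i: distances[i])[:n]
    return {
        "distances": [distances[i] for i in order],
        "embeddings": [embeddings[i] for i in order],
        "embedding_words": [words[i] for i in order],
    }
```

Sorting the indices rather than the embeddings themselves keeps the three output lists aligned, which is the point of re-sorting `embedding_words` together with `embeddings`.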
- mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')
Generate embeddings for a list of words.
Parameters
- words : list
List of words to generate embeddings for.
- embedding_method : str
Embedding method to be used, either “chars2vec” or “glove”.
Returns
- list
List of embeddings for the words.
- mip_dmp.process.embedding.glove_embedding(text, glove_model)
Find the GloVe embedding for the text.
Parameters
- text : str
Text to be embedded.
- glove_model : str
GloVe model to be used, loaded by the gensim library.
Returns
- numpy.ndarray
GloVe embedding for the text.
- mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)
Reduce the dimension of the embeddings, mainly for visualization purposes.
Parameters
- embeddings : list
List of embeddings to reduce the dimension of.
- reduce_method : str
Method to use to reduce the dimension, either “tsne” or “pca”.
- n_components : int
Number of components to reduce the dimension to.
Returns
- list
List of reduced embeddings.
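To illustrate what the “pca” option does conceptually, here is a minimal NumPy sketch of PCA via singular value decomposition. It is not the mip_dmp implementation, which may rely on a different library: the helper name `pca_reduce` is hypothetical, and it returns a NumPy array rather than a list.

```python
import numpy as np


def pca_reduce(embeddings, n_components=3):
    """Project embeddings onto their first n_components principal axes.

    Hypothetical helper for illustration; not part of mip_dmp.
    """
    X = np.asarray(embeddings, dtype=float)
    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T
```

With `n_components=3`, each embedding becomes a point in 3D space suitable for a scatter plot. t-SNE, the other documented option, is a nonlinear method and is not covered by this sketch.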