mip_dmp.process.embedding module

Module that provides functions to handle word embeddings and perform operations on them.

mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)

Find the chars2vec embedding for the text.

Parameters

text (str): Text to be embedded.
chars2vec_model (str): chars2vec model to be used, loaded by the gensim library.

mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)

Find the matches based on chars2vec embeddings and cosine similarity.

Parameters

x_embedding (str): String to compare against.
y_embedding (str): String to compare.
chars2vec_model (str): chars2vec model to be used, loaded by the gensim library.

Returns

float: Cosine similarity between the two chars2vec embeddings of the strings.
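
The cosine similarity returned here is the dot product of the two embedding vectors divided by the product of their norms. The following is a minimal sketch of that calculation using only the Python standard library. The helper name `cosine_similarity` is hypothetical, it is not part of `mip_dmp`, and plain lists stand in for the embedding vectors.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors.

    Hypothetical helper for illustration; not the mip_dmp implementation.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # The angle to a zero vector is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1, independent of their magnitudes.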

mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)

Find the n closest embeddings to the given embedding.

Parameters

word_embedding (numpy.ndarray): Embedding to find the n closest embeddings to.
embeddings (list): List of embeddings in which to search for those closest to the given embedding.
embedding_words (list): List of words corresponding to the embeddings; it is re-sorted and reduced to match the selected embeddings.
n (int): Number of closest embeddings to find.

Returns

dict: Dictionary containing the n closest embeddings, their distances to the given embedding, and the corresponding words, in the form:

{
    "distances": [float],
    "embeddings": [numpy.ndarray],
    "embedding_words": [str]
}
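
A nearest-neighbour lookup that produces a dictionary of this shape might look like the sketch below. It assumes Euclidean distance, which this page does not confirm as the metric used by `find_n_closest_embeddings`. The function name `n_closest` is hypothetical, and plain lists stand in for `numpy.ndarray` values to keep the example dependency-free.

```python
import math


def n_closest(word_embedding, embeddings, embedding_words, n=5):
    """Return the n embeddings closest to word_embedding.

    Hypothetical sketch; Euclidean distance is an assumption.
    """
    # Distance from the query embedding to every candidate.
    distances = [math.dist(word_embedding, e) for e in embeddings]
    # Candidate indices sorted by increasing distance, truncated to n.
    order = sorted(range(len(embeddings)), key=lambda i: distances[i])[:n]
    return {
        "distances": [distances[i] for i in order],
        "embeddings": [embeddings[i] for i in order],
        "embedding_words": [embedding_words[i] for i in order],
    }
```

The three lists in the result are aligned by position, so the word at index `i` belongs to the embedding and distance at the same index.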

mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')

Generate embeddings for a list of words.

Parameters

words (list): List of words to generate embeddings for.
embedding_method (str): Embedding method to be used, either “chars2vec” or “glove”.

mip_dmp.process.embedding.glove_embedding(text, glove_model)

Find the GloVe embedding for the text.

mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)

Reduce the dimension of the embeddings, mainly for visualization purposes.

Parameters

embeddings (list): List of embeddings to reduce the dimension of.
reduce_method (str): Method to use to reduce the dimension, either “tsne” or “pca”.
n_components (int): Number of components to reduce the dimension to.