mip_dmp.process.embedding module

Module that provides functions to handle word embeddings and operations on them.

mip_dmp.process.embedding.chars2vec_embedding(text, chars2vec_model)

Find the chars2vec embedding for the text.

Parameters

text : str

Text to be embedded.

chars2vec_model : str

chars2vec model to be used, loaded by the gensim library.

Returns

numpy.ndarray

chars2vec embedding for the text.

mip_dmp.process.embedding.embedding_similarity(x_embedding, y_embedding)

Compute the cosine similarity between two embeddings.

Parameters

x_embedding : numpy.ndarray

Embedding to compare against.

y_embedding : numpy.ndarray

Embedding to compare.

Returns

float

Cosine similarity between the two embeddings.
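
As an illustration of the quantity returned here, the following is a minimal standard-library sketch of cosine similarity. The helper name cosine_similarity and the zero-vector handling are assumptions made for this example; they are not taken from the mip_dmp source, which may compute the value differently (for instance with NumPy or SciPy).

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors.

    Hypothetical helper for illustration only; not part of mip_dmp.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat the undefined case as an error rather than 0.0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

The result lies in [-1, 1]: values near 1 indicate embeddings pointing in almost the same direction, 0 indicates orthogonal embeddings, and -1 indicates opposite directions.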

mip_dmp.process.embedding.find_n_closest_embeddings(word_embedding: array, embeddings: list, embedding_words: list, n: int = 5)

Find the n closest embeddings to the given embedding.

Parameters

word_embedding : numpy.ndarray

Embedding for which the n closest embeddings are sought.

embeddings : list

List of candidate embeddings to search for the closest matches to the given embedding.

embedding_words : list

List of words corresponding to the embeddings; it is re-sorted and truncated in the same way as the embeddings.

n : int

Number of closest embeddings to find.

Returns

dict

Dictionary containing the n closest embeddings, their distances to the given embedding, and the words corresponding to the embeddings in the form:

{
    "distances": [float],
    "embeddings": [numpy.ndarray],
    "embedding_words": [str]
}
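
To make the shape of this return value concrete, here is a small standard-library sketch that ranks candidates by distance and keeps the n nearest. The function name n_closest and the choice of Euclidean distance are assumptions for illustration; the documentation above does not say which distance metric mip_dmp uses.

```python
import math
from typing import Dict, List, Sequence


def n_closest(
    query: Sequence[float],
    embeddings: List[Sequence[float]],
    words: List[str],
    n: int = 5,
) -> Dict[str, list]:
    """Return the n candidates nearest to `query`, closest first.

    Hypothetical helper for illustration only; not part of mip_dmp.
    Assumption: Euclidean distance is used as the distance measure.
    """
    if len(embeddings) != len(words):
        raise ValueError("embeddings and words must have the same length")
    distances = [math.dist(query, e) for e in embeddings]
    # Indices of the candidates, sorted by increasing distance, cut to n.
    order = sorted(range(len(embeddings)), key=distances.__getitem__)[:n]
    return {
        "distances": [distances[i] for i in order],
        "embeddings": [embeddings[i] for i in order],
        "embedding_words": [words[i] for i in order],
    }
```

The three lists stay aligned: position i in each list refers to the same candidate, so the words are re-sorted and truncated together with their embeddings.
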
mip_dmp.process.embedding.generate_embeddings(words: list, embedding_method: str = 'chars2vec')

Generate embeddings for a list of words.

Parameters

words : list

List of words to generate embeddings for.

embedding_method : str

Embedding method to be used, either “chars2vec” or “glove”.

Returns

list

List of embeddings for the words.

mip_dmp.process.embedding.glove_embedding(text, glove_model)

Find the GloVe embedding for the text.

Parameters

text : str

Text to be embedded.

glove_model : str

GloVe model to be used, loaded by the gensim library.

Returns

numpy.ndarray

GloVe embedding for the text.

mip_dmp.process.embedding.reduce_embeddings_dimension(embeddings: list, reduce_method: str = 'tsne', n_components: int = 3)

Reduce the dimension of the embeddings, mainly for visualization purposes.

Parameters

embeddings : list

List of embeddings whose dimension is to be reduced.

reduce_method : str

Method to use to reduce the dimension, either “tsne” or “pca”.

n_components : int

Number of components to reduce the dimension to.

Returns

list

List of reduced embeddings.
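
For readers unfamiliar with the “pca” option, the sketch below shows one common way to project embeddings onto their first principal components using NumPy's singular value decomposition. The function name pca_reduce is hypothetical, and this is not necessarily how mip_dmp implements the reduction (it may delegate to a library such as scikit-learn); the “tsne” option is not covered here.

```python
import numpy as np


def pca_reduce(embeddings, n_components=3):
    """Project embeddings onto their first `n_components` principal axes.

    Hypothetical helper for illustration only; not part of mip_dmp.
    """
    X = np.asarray(embeddings, dtype=float)
    # PCA operates on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    reduced = X_centered @ vt[:n_components].T
    # Return a list of reduced vectors, mirroring the documented return type.
    return [row for row in reduced]
```

With n_components set to 2 or 3, the reduced vectors can be passed directly to a scatter plot, which matches the visualization purpose described above.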