Hugging Face's sentence-transformers library simplifies the process of generating dense vector representations (embeddings) for text, which are useful for semantic similarity, clustering, and search tasks. To start, install the library using pip install sentence-transformers
. Once installed, you can load a pre-trained model like all-MiniLM-L6-v2
(a popular lightweight option) with from sentence_transformers import SentenceTransformer; model = SentenceTransformer('all-MiniLM-L6-v2')
. This model maps each input text to a 384-dimensional vector. For example, embeddings = model.encode(["Hello, world!", "How are you?"])
generates two vectors that numerically represent the input sentences.
The library handles most preprocessing automatically, including tokenization and padding. The encode()
method supports lists of strings or individual sentences, and you can adjust parameters like batch_size
for large datasets. For instance, model.encode(sentences, batch_size=32, convert_to_tensor=True)
processes 32 sentences at a time and returns PyTorch tensors instead of NumPy arrays. You can also normalize embeddings to unit vectors with normalize_embeddings=True
, which is useful for cosine similarity calculations. If you need to save or reload a model, use model.save('path/')
and SentenceTransformer('path/')
to retain custom configurations.
Practical applications include semantic search and text similarity. For example, to compare two sentences, compute their embeddings and measure cosine similarity using from sklearn.metrics.pairwise import cosine_similarity; similarity = cosine_similarity([emb1], [emb2])[0][0]
. For clustering, run an algorithm such as K-means on the embeddings (from sklearn.cluster import KMeans; kmeans = KMeans(n_clusters=n).fit(embeddings)
). The library also supports multi-lingual models (e.g., paraphrase-multilingual-MiniLM-L12-v2
) and fine-tuning custom models with your data using model.fit(train_objectives=[(train_dataloader, train_loss)])
. Pre-trained models are available on the Hugging Face Hub, allowing you to choose models optimized for specific tasks or languages.
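To make the similarity and clustering steps concrete without downloading a model, the sketch below substitutes small hand-made vectors for real embeddings; in practice emb1, emb2, and embeddings would come from model.encode():

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for model.encode() output; real embeddings would be 384-dim.
emb1 = np.array([0.9, 0.1, 0.0])
emb2 = np.array([0.8, 0.2, 0.1])

# cosine_similarity expects 2-D inputs, hence the wrapping lists.
similarity = cosine_similarity([emb1], [emb2])[0][0]  # close to 1.0 here

# Cluster a small batch of "embeddings" into two groups.
embeddings = np.array([
    [1.0, 0.0], [0.9, 0.1],   # cluster A
    [0.0, 1.0], [0.1, 0.9],   # cluster B
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_  # cluster ids are arbitrary, but A and B separate
```

The same two calls work unchanged on real 384-dimensional embeddings; only the input arrays change.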