SentenceTransformersSparseDocumentEmbedder
Use this component to enrich a list of documents with their sparse embeddings using Sentence Transformers models.
| | |
| --- | --- |
| Most common position in a pipeline | Before a DocumentWriter in an indexing pipeline |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of documents (enriched with sparse embeddings) |
| API reference | Embedders |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/embedders/sentence_transformers_sparse_document_embedder.py |
To compute a sparse embedding for a string, use the SentenceTransformersSparseTextEmbedder.
Overview
SentenceTransformersSparseDocumentEmbedder computes the sparse embeddings of a list of documents and stores the obtained vectors in the sparse_embedding field of each document. It uses sparse embedding models supported by the Sentence Transformers library.
The vectors computed by this component are necessary to perform sparse embedding retrieval on a collection of documents. At retrieval time, the sparse vector representing the query is compared with those of the documents to find the most similar or relevant ones.
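This comparison amounts to a dot product that only touches the indices where both vectors are non-zero. A minimal pure-Python sketch (the helper below is illustrative, not part of Haystack; real stores such as Qdrant perform this with inverted-index machinery):

```python
def sparse_dot(q_indices, q_values, d_indices, d_values):
    """Score a query against a document by summing products of weights at shared indices."""
    doc_weights = dict(zip(d_indices, d_values))
    return sum(qv * doc_weights.get(qi, 0.0) for qi, qv in zip(q_indices, q_values))

# The query and document sparse vectors overlap only at index 1045,
# so the score is 0.8 * 0.7.
query = ([999, 1045], [0.9, 0.8])
doc = ([1045, 2031], [0.7, 0.5])

print(sparse_dot(*query, *doc))  # ~0.56
```

Documents that share no non-zero indices with the query score zero, which is what makes sparse retrieval both fast and interpretable.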
Compatible Models
The default embedding model is prithivida/Splade_PP_en_v2. You can specify another model with the model parameter when initializing this component.
Compatible models are based on SPLADE (SParse Lexical AnD Expansion), a technique for producing sparse representations for text, where each non-zero value in the embedding is the importance weight of a term in the vocabulary. This approach combines the benefits of learned sparse representations with the efficiency of traditional sparse retrieval methods. For more information, see our docs that explain sparse embedding-based Retrievers further.
You can find compatible SPLADE models on the Hugging Face Model Hub.
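To build intuition for what such a representation looks like, here is a toy illustration (hypothetical vocabulary, indices, and weights, unrelated to any real SPLADE model): each non-zero index points into the model's vocabulary, so the embedding can be read as a weighted bag of terms, including expansion terms that never appear in the input text.

```python
# Toy vocabulary; a real SPLADE model has a WordPiece vocabulary of ~30k terms.
vocab = {997: "pizza", 999: "italy", 1045: "food", 2031: "horse"}

# Hypothetical sparse embedding for "I love pizza!": note the expansion
# terms "italy" and "food", which do not occur in the text itself.
indices = [997, 999, 1045]
values = [1.21, 0.35, 0.58]

terms = {vocab[i]: v for i, v in zip(indices, values)}
print(terms)  # {'pizza': 1.21, 'italy': 0.35, 'food': 0.58}
```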
Authentication
Authentication with a Hugging Face API Token is only required to access private or gated models.
The component uses an HF_API_TOKEN or HF_TOKEN environment variable, or you can pass a Hugging Face API token at initialization. See our Secret Management page for more information.
```python
from haystack.utils import Secret
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

document_embedder = SentenceTransformersSparseDocumentEmbedder(
    token=Secret.from_token("<your-api-key>")
)
```
Backend Options
This component supports multiple backends for model execution:
- torch (default): Standard PyTorch backend
- onnx: Optimized ONNX Runtime backend for faster inference
- openvino: Intel OpenVINO backend for additional optimizations on Intel hardware
You can specify the backend during initialization:
```python
embedder = SentenceTransformersSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v2",
    backend="onnx"
)
```
For more information on acceleration and quantization options, refer to the Sentence Transformers documentation.
Embedding Metadata
Text documents often include metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the document's text to improve retrieval.
You can do this easily by using the Sparse Document Embedder:
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18}
)

embedder = SentenceTransformersSparseDocumentEmbedder(
    meta_fields_to_embed=["title"]
)
embedder.warm_up()

docs_w_sparse_embeddings = embedder.run(documents=[doc])["documents"]
```
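Conceptually, the embedder prepends the selected metadata values to the document content before encoding, joined by a separator (configurable via the embedding_separator init parameter, newline by default). A rough pure-Python sketch of that behavior, not the actual Haystack implementation:

```python
def text_to_embed(content, meta, meta_fields_to_embed, separator="\n"):
    """Join selected metadata values and the document content into one string."""
    meta_values = [str(meta[f]) for f in meta_fields_to_embed if meta.get(f) is not None]
    return separator.join(meta_values + [content])

print(text_to_embed("some text", {"title": "relevant title", "page number": 18}, ["title"]))
# relevant title
# some text
```

This is why only distinctive, semantically meaningful fields are worth embedding: every selected value becomes part of the text the model encodes.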
Usage
On its own
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(content="I love pizza!")

doc_embedder = SentenceTransformersSparseDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run(documents=[doc])
print(result["documents"][0].sparse_embedding)

# SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])
```
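If downstream code expects dense vectors, the parallel indices and values lists can be expanded into a mostly-zero list of vocabulary size (a sketch only; the hypothetical size below matches BERT-style vocabularies, and the right value depends on the model):

```python
def to_dense(indices, values, size):
    """Expand a sparse (indices, values) pair into a dense list of length `size`."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

dense = to_dense([999, 1045], [0.918, 0.867], size=30522)
print(dense[999], dense[1045])  # 0.918 0.867
```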
In a pipeline
Currently, sparse embedding retrieval is only supported by QdrantDocumentStore.
First, install the required package:
```shell
pip install qdrant-haystack
```
Then, try out this pipeline:
```python
from haystack import Document, Pipeline
from haystack.components.embedders import (
    SentenceTransformersSparseDocumentEmbedder,
    SentenceTransformersSparseTextEmbedder,
)
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    use_sparse_embeddings=True,
)

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="Sentence Transformers provides sparse embedding models."),
]

# Indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(
    "sparse_document_embedder",
    SentenceTransformersSparseDocumentEmbedder(),
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE),
)
indexing_pipeline.connect("sparse_document_embedder", "writer")
indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})

# Query pipeline
query_pipeline = Pipeline()
query_pipeline.add_component(
    "sparse_text_embedder",
    SentenceTransformersSparseTextEmbedder(),
)
query_pipeline.add_component(
    "sparse_retriever",
    QdrantSparseEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")

query = "Who provides sparse embedding models?"
result = query_pipeline.run({"sparse_text_embedder": {"text": query}})
print(result["sparse_retriever"]["documents"][0])

# Document(id=...,
#   content: 'Sentence Transformers provides sparse embedding models.',
#   score: 0.75...)
```