SentenceWindowRetrieval
Use this component to retrieve neighboring sentences around relevant sentences to get the full context.
Name | SentenceWindowRetrieval |
Folder path | /retrievers/ |
Most common position in a pipeline | Used after the main Retriever component, like the InMemoryEmbeddingRetriever or any other Retriever. |
Mandatory input variables | "retrieved_documents": A list of already retrieved documents for which you want to get a context window |
Output variables | "context_windows": A list of strings |
Overview
Sentence window retrieval is a technique that returns not only the sentences relevant to a query but also the context surrounding them.
During indexing, documents are broken into smaller chunks or sentences and indexed. During retrieval, the sentences most relevant to a given query, based on a certain similarity metric, are retrieved.
Once we have the relevant sentences, we can retrieve neighboring sentences to provide full context. The number of neighboring sentences to retrieve is defined by a fixed number of sentences before and after the relevant sentence.
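The window selection described above can be sketched in plain Python, independent of Haystack. The function name `sentence_window` and its parameters are illustrative assumptions, not the component's actual implementation; the sketch only shows how a fixed number of neighbors on each side of a hit is selected.

```python
from typing import List


def sentence_window(sentences: List[str], hit_index: int, window_size: int) -> List[str]:
    """Return the sentence at hit_index plus up to window_size neighbors on each side."""
    if not 0 <= hit_index < len(sentences):
        raise IndexError("hit_index is outside the list of sentences")
    # Clamp the window so it never runs past either end of the list.
    start = max(0, hit_index - window_size)
    end = min(len(sentences), hit_index + window_size + 1)
    return sentences[start:end]


sentences = [
    "This is a text with some words.",
    "There is a second sentence.",
    "And there is also a third sentence.",
    "It also contains a fourth sentence.",
    "And a fifth sentence.",
]

# Assume the third sentence (index 2) was the relevant hit; take one neighbor on each side.
window = sentence_window(sentences, hit_index=2, window_size=1)
print(" ".join(window))
```

Near the start or end of a document the window is simply shorter, because there are fewer neighbors to include.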
This component is meant to be used with other Retrievers, such as the InMemoryEmbeddingRetriever. These Retrievers find relevant sentences by comparing a query against indexed sentences using a similarity metric. Then, the SentenceWindowRetrieval component retrieves neighboring sentences around the relevant ones by leveraging metadata stored in the Document object.
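As a rough illustration of how stored metadata can drive that lookup, here is a minimal sketch in plain Python rather than Haystack classes. The `Chunk` dataclass and its `source_id` and `split_id` fields are hypothetical stand-ins for whatever metadata the splitter records on each Document; the real component's field names and logic may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    content: str
    source_id: str  # hypothetical: identifies the original document
    split_id: int   # hypothetical: position of the chunk within that document


def context_window(store: List[Chunk], hit: Chunk, window_size: int) -> str:
    """Join the hit with its neighbors from the same source, ordered by split_id."""
    neighbors = [
        c for c in store
        if c.source_id == hit.source_id
        and abs(c.split_id - hit.split_id) <= window_size
    ]
    neighbors.sort(key=lambda c: c.split_id)
    return " ".join(c.content for c in neighbors)


store = [
    Chunk("First sentence.", "doc-a", 0),
    Chunk("Second sentence.", "doc-a", 1),
    Chunk("Third sentence.", "doc-a", 2),
    Chunk("Fourth sentence.", "doc-a", 3),
    Chunk("Unrelated sentence.", "doc-b", 1),
]

# Assume the main Retriever returned the third chunk of "doc-a".
window = context_window(store, hit=store[2], window_size=1)
print(window)
```

Filtering on the source identifier keeps chunks from other documents out of the window, even when their positions happen to fall inside the same range.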
Usage
On its own
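from haystack import Document
from haystack.components.retrievers import SentenceWindowRetrieval
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Split the text into overlapping 10-word chunks and write them to the store.
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = ("This is a text with some words. There is a second sentence. And there is also a third sentence. "
"It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence")
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])
retriever = SentenceWindowRetrieval(document_store=doc_store, window_size=3)
# Pass already retrieved documents (here, simply the first chunk) to get their context windows.
result = retriever.run(retrieved_documents=docs["documents"][:1])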
In a Pipeline
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetrieval
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
"This is a text with some words. There is a second sentence. And there is also a third sentence. "
"It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])
rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetrieval(document_store=doc_store, window_size=3))
rag.connect("bm25_retriever", "sentence_window_retriever")
rag.run({'bm25_retriever': {"query":"third"}})