Module haystack_experimental.core.super_component.super_component
A class for super components that wrap around a pipeline.
InvalidMappingError
Raised when input or output mappings are invalid or type conflicts are found.
SuperComponent
A class for creating super components that wrap around a Pipeline.
This component allows for remapping of input and output socket names between the wrapped pipeline and the external interface. It handles type checking and verification of all mappings.
Arguments:
pipeline
: The pipeline wrapped by the component
input_mapping
: Mapping from component input names to lists of pipeline socket paths in format "component_name.socket_name"
output_mapping
: Mapping from pipeline socket paths to component output names
Raises:
InvalidMappingError
: If any input or output mappings are invalid or if type conflicts are detected
ValueError
: If no pipeline is provided
SuperComponent.__init__
def __init__(pipeline: Pipeline,
input_mapping: Optional[Dict[str, List[str]]] = None,
output_mapping: Optional[Dict[str, str]] = None) -> None
Initialize the component with optional I/O mappings.
Arguments:
pipeline
: The pipeline to wrap
input_mapping
: Optional input name mapping configuration
output_mapping
: Optional output name mapping configuration
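A minimal construction sketch, assuming the import path implied by the module name above; the wrapped components and the mapped socket names ("files", "chunks") are illustrative choices, not part of the SuperComponent API:
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack_experimental.core.super_component.super_component import SuperComponent  # import path assumed
pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
pipeline.connect("converter.documents", "splitter.documents")
# Expose converter.sources as the wrapper input "files" and
# splitter.documents as the wrapper output "chunks".
wrapper = SuperComponent(
    pipeline=pipeline,
    input_mapping={"files": ["converter.sources"]},
    output_mapping={"splitter.documents": "chunks"},
)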
SuperComponent.warm_up
def warm_up() -> None
Warms up the pipeline if it has not been warmed up before.
SuperComponent.to_dict
def to_dict() -> Dict[str, Any]
Convert the SuperComponent to a dictionary representation.
Must be overridden by custom component implementations that inherit from SuperComponent.
Returns:
Dictionary containing serialized super component data
SuperComponent.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SuperComponent"
Create the SuperComponent instance from a dictionary representation.
Must be overridden by custom component implementations that inherit from SuperComponent.
Arguments:
data
: Dictionary containing serialized super component data
Returns:
New SuperComponent instance
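A hedged round-trip sketch, reusing the wrapper from the construction example above; the exact layout of the serialized dictionary is an implementation detail and is not assumed here:
data = wrapper.to_dict()
restored = SuperComponent.from_dict(data)
# restored is a new SuperComponent wrapping an equivalent pipeline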
SuperComponent.run
def run(**kwargs: Any) -> Dict[str, Any]
Run the wrapped pipeline with the given inputs.
This method:
- Maps input kwargs to pipeline component inputs
- Executes the pipeline
- Maps pipeline outputs back to wrapper outputs
Arguments:
kwargs
: Keyword arguments matching wrapper input names
Raises:
ValueError
: If no pipeline is configured
InvalidMappingError
: If output conflicts occur during auto-mapping
Returns:
Dictionary mapping wrapper output names to values
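Continuing the construction sketch above, where "files" was mapped to converter.sources and splitter.documents was mapped to "chunks", a call could look like this (the file path is a placeholder):
wrapper.warm_up()
result = wrapper.run(files=["notes.txt"])  # placeholder path
print(result["chunks"])  # the split Documents produced by the wrapped pipeline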
Module haystack_experimental.super_components.converters.multi_file_converter
MultiFileConverter
A file converter that handles conversion of multiple file types.
The MultiFileConverter handles the following file types:
- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX
Usage:
converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
MultiFileConverter.__init__
def __init__(encoding: str = "utf-8",
json_content_key: str = "content") -> None
Initialize the MultiFileConverter.
Arguments:
encoding
: The encoding to use when reading files.
json_content_key
: The key to use as the content field in a document when converting JSON files.
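A short initialization sketch with non-default settings; the file names are placeholders and the "documents" output key is an assumption about the converter's output socket, not stated above:
from haystack_experimental.super_components.converters.multi_file_converter import MultiFileConverter
# Decode files as latin-1 and take the "text" key as document content for JSON files.
converter = MultiFileConverter(encoding="latin-1", json_content_key="text")
result = converter.run(sources=["report.pdf", "data.json"], meta={})
print(result["documents"])  # output socket name assumed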
MultiFileConverter.to_dict
def to_dict() -> Dict[str, Any]
Serialize this instance to a dictionary.
MultiFileConverter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "MultiFileConverter"
Load this instance from a dictionary.
Module haystack_experimental.super_components.indexers.sentence_transformers_document_indexer
SentenceTransformersDocumentIndexer
A document indexer that takes a list of documents, embeds them using SentenceTransformers, and stores them.
Usage:
>>> from haystack import Document
>>> from haystack.document_stores.in_memory import InMemoryDocumentStore
>>> document_store = InMemoryDocumentStore()
>>> doc = Document(content="I love pizza!")
>>> indexer = SentenceTransformersDocumentIndexer(document_store=document_store)
>>> indexer.warm_up()
>>> result = indexer.run(documents=[doc])
>>> print(result)
{'documents_written': 1}
>>> document_store.count_documents()
1
SentenceTransformersDocumentIndexer.__init__
def __init__(
document_store: DocumentStore,
model: str = "sentence-transformers/all-mpnet-base-v2",
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
prefix: str = "",
suffix: str = "",
batch_size: int = 32,
progress_bar: bool = True,
normalize_embeddings: bool = False,
meta_fields_to_embed: Optional[List[str]] = None,
embedding_separator: str = "\n",
trust_remote_code: bool = False,
truncate_dim: Optional[int] = None,
model_kwargs: Optional[Dict[str, Any]] = None,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
config_kwargs: Optional[Dict[str, Any]] = None,
precision: Literal["float32", "int8", "uint8", "binary",
"ubinary"] = "float32",
duplicate_policy: DuplicatePolicy = DuplicatePolicy.OVERWRITE) -> None
Initialize the SentenceTransformersDocumentIndexer component.
Arguments:
document_store
: The document store where the documents should be stored.
model
: The embedding model to use (local path or Hugging Face model ID).
device
: The device to use for loading the model.
token
: The API token to download private models from Hugging Face.
prefix
: String to add at the beginning of each document text.
suffix
: String to add at the end of each document text.
batch_size
: Number of documents to embed at once.
progress_bar
: If True, shows a progress bar when embedding documents.
normalize_embeddings
: If True, embeddings are L2 normalized.
meta_fields_to_embed
: List of metadata fields to embed along with the document text.
embedding_separator
: Separator used to concatenate metadata fields to document text.
trust_remote_code
: If True, allows custom models and scripts.
truncate_dim
: Dimension to truncate sentence embeddings to.
model_kwargs
: Additional keyword arguments for model initialization.
tokenizer_kwargs
: Additional keyword arguments for tokenizer initialization.
config_kwargs
: Additional keyword arguments for model configuration.
precision
: The precision to use for the embeddings.
duplicate_policy
: The duplicate policy to use when writing documents.
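A hedged initialization sketch with a few non-default arguments; the model name and the "title" metadata field are example choices, not requirements:
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_experimental.super_components.indexers.sentence_transformers_document_indexer import SentenceTransformersDocumentIndexer
document_store = InMemoryDocumentStore()
# Embed the "title" metadata field together with the content, use a smaller
# model, and skip documents that already exist in the store.
indexer = SentenceTransformersDocumentIndexer(
    document_store=document_store,
    model="sentence-transformers/all-MiniLM-L6-v2",
    meta_fields_to_embed=["title"],
    batch_size=16,
    duplicate_policy=DuplicatePolicy.SKIP,
)
indexer.warm_up()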
SentenceTransformersDocumentIndexer.to_dict
def to_dict() -> Dict[str, Any]
Serialize this instance to a dictionary.
SentenceTransformersDocumentIndexer.from_dict
@classmethod
def from_dict(cls, data: Dict[str,
Any]) -> "SentenceTransformersDocumentIndexer"
Load an instance of this component from a dictionary.
Module haystack_experimental.super_components.preprocessors.document_preprocessor
DocumentPreProcessor
A SuperComponent that cleans documents and then splits them.
This component composes a DocumentCleaner followed by a DocumentSplitter in a single pipeline. It takes a list of documents as input and returns a processed list of documents.
Usage:
from haystack import Document
doc = Document(content="I love pizza!")
preprocessor = DocumentPreProcessor()
results = preprocessor.run(documents=[doc])
print(results["documents"])
DocumentPreProcessor.__init__
def __init__(remove_empty_lines: bool = True,
remove_extra_whitespaces: bool = True,
remove_repeated_substrings: bool = False,
keep_id: bool = False,
remove_substrings: Optional[List[str]] = None,
remove_regex: Optional[str] = None,
unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
"NFKD"]] = None,
ascii_only: bool = False,
split_by: Literal["function", "page", "passage", "period", "word",
"line", "sentence"] = "word",
split_length: int = 250,
split_overlap: int = 0,
split_threshold: int = 0,
splitting_function: Optional[Callable[[str], List[str]]] = None,
respect_sentence_boundary: bool = False,
language: Language = "en",
use_split_rules: bool = True,
extend_abbreviations: bool = True) -> None
Initialize a DocumentPreProcessor that first cleans documents and then splits them.
Arguments:
Cleaner Params:
remove_empty_lines
: If True, removes empty lines.
remove_extra_whitespaces
: If True, removes extra whitespaces.
remove_repeated_substrings
: If True, removes repeated substrings like headers/footers across pages.
keep_id
: If True, keeps the original document IDs.
remove_substrings
: A list of strings to remove from the document content.
remove_regex
: A regex pattern whose matches will be removed from the document content.
unicode_normalization
: Unicode normalization form to apply to the text, e.g. "NFC".
ascii_only
: If True, convert text to ASCII only.
Splitter Params:
split_by
: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
split_length
: The maximum number of units (words, lines, pages, etc.) in each split.
split_overlap
: The number of overlapping units between consecutive splits.
split_threshold
: The minimum number of units per split. If a split is smaller than this, it's merged with the previous split.
splitting_function
: A custom function for splitting if split_by="function".
respect_sentence_boundary
: If True, splits by words but tries not to break inside a sentence.
language
: Language used by the sentence tokenizer if split_by="sentence" or respect_sentence_boundary=True.
use_split_rules
: Whether to apply additional splitting heuristics for the sentence splitter.
extend_abbreviations
: Whether to extend the sentence splitter with curated abbreviations for certain languages.
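A hedged configuration sketch combining cleaner and splitter settings; the specific values are illustrative:
from haystack import Document
from haystack_experimental.super_components.preprocessors.document_preprocessor import DocumentPreProcessor
# Clean whitespace, then split into chunks of up to 120 words with a 20-word
# overlap, keeping sentences intact where possible.
preprocessor = DocumentPreProcessor(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    split_by="word",
    split_length=120,
    split_overlap=20,
    respect_sentence_boundary=True,
    language="en",
)
results = preprocessor.run(documents=[Document(content="Long raw text ...")])
print(results["documents"])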
DocumentPreProcessor.to_dict
def to_dict() -> Dict[str, Any]
Serialize this instance to a dictionary.
DocumentPreProcessor.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "DocumentPreProcessor"
Load this instance from a dictionary.