API Reference

A class for super components that wrap around a pipeline.

Module haystack_experimental.core.super_component.super_component

InvalidMappingError

Raised when input or output mappings are invalid or type conflicts are found.

SuperComponent

A class for creating super components that wrap around a Pipeline.

This component allows for remapping of input and output socket names between the wrapped pipeline and the external interface. It handles type checking and verification of all mappings.

Arguments:

  • pipeline: The pipeline wrapped by the component
  • input_mapping: Mapping from component input names to lists of pipeline socket paths in the format "component_name.socket_name"
  • output_mapping: Mapping from pipeline socket paths to component output names

Raises:

  • InvalidMappingError: If any input or output mappings are invalid or if type conflicts are detected
  • ValueError: If no pipeline is provided
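
For example, a minimal sketch that wraps a cleaning-and-splitting pipeline (the component choices and the import path, taken from the module name above, are assumptions to adapt to your own pipeline):

from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack_experimental.core.super_component.super_component import SuperComponent

pipeline = Pipeline()
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=250))
pipeline.connect("cleaner.documents", "splitter.documents")

wrapper = SuperComponent(
    pipeline=pipeline,
    # expose the cleaner's "documents" input under the external name "documents"
    input_mapping={"documents": ["cleaner.documents"]},
    # expose the splitter's "documents" output under the external name "documents"
    output_mapping={"splitter.documents": "documents"},
)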

SuperComponent.__init__

def __init__(pipeline: Pipeline,
             input_mapping: Optional[Dict[str, List[str]]] = None,
             output_mapping: Optional[Dict[str, str]] = None) -> None

Initialize the component with optional I/O mappings.

Arguments:

  • pipeline: The pipeline to wrap
  • input_mapping: Optional input name mapping configuration
  • output_mapping: Optional output name mapping configuration

SuperComponent.warm_up

def warm_up() -> None

Warms up the pipeline if it has not been warmed up before.

SuperComponent.to_dict

def to_dict() -> Dict[str, Any]

Convert the SuperComponent to a dictionary representation.

Must be overridden by custom component implementations that inherit from SuperComponent.

Returns:

Dictionary containing serialized super component data

SuperComponent.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SuperComponent"

Create the SuperComponent instance from a dictionary representation.

Must be overridden by custom component implementations that inherit from SuperComponent.

Arguments:

  • data: Dictionary containing serialized super component data

Returns:

New SuperComponent instance
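
A minimal sketch of a custom subclass that overrides both methods (the class name MyDocumentSplitter and its split_length parameter are illustrative; default_to_dict and default_from_dict are Haystack's standard serialization helpers, and depending on your Haystack version the subclass may also need the @component decorator):

from typing import Any, Dict

from haystack import Pipeline, default_from_dict, default_to_dict
from haystack.components.preprocessors import DocumentSplitter

class MyDocumentSplitter(SuperComponent):
    def __init__(self, split_length: int = 250) -> None:
        self.split_length = split_length
        pipeline = Pipeline()
        pipeline.add_component("splitter", DocumentSplitter(split_length=split_length))
        super().__init__(pipeline=pipeline)

    def to_dict(self) -> Dict[str, Any]:
        # serialize only this subclass's own init parameters
        return default_to_dict(self, split_length=self.split_length)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MyDocumentSplitter":
        return default_from_dict(cls, data)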

SuperComponent.run

def run(**kwargs: Any) -> Dict[str, Any]

Run the wrapped pipeline with the given inputs.

This method:

  1. Maps input kwargs to pipeline component inputs
  2. Executes the pipeline
  3. Maps pipeline outputs back to wrapper outputs

Arguments:

  • kwargs: Keyword arguments matching wrapper input names

Raises:

  • ValueError: If no pipeline is configured
  • InvalidMappingError: If output conflicts occur during auto-mapping

Returns:

Dictionary mapping wrapper output names to values
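
Continuing the sketch from the class description above, run is called with the remapped input names and returns the remapped output names:

from haystack import Document

result = wrapper.run(documents=[Document(content="  I love pizza!  ")])
print(result["documents"])  # split Documents produced by the wrapped pipeline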

Module haystack_experimental.super_components.converters.multi_file_converter

MultiFileConverter

A file converter that handles conversion of multiple file types.

The MultiFileConverter handles the following file types:

  • CSV
  • DOCX
  • HTML
  • JSON
  • MD
  • TEXT
  • PDF (no OCR)
  • PPTX
  • XLSX

Usage:

from haystack_experimental.super_components.converters.multi_file_converter import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})

MultiFileConverter.__init__

def __init__(encoding: str = "utf-8",
             json_content_key: str = "content") -> None

Initialize the MultiFileConverter.

Arguments:

  • encoding: The encoding to use when reading files.
  • json_content_key: The key to use as the content field in a document when converting JSON files.
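
For example, a minimal sketch of converting a JSON file whose text lives under a custom key (the key name "text" and the "documents" output key are assumptions for illustration):

import json
import tempfile
from pathlib import Path

converter = MultiFileConverter(json_content_key="text")

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "note.json"
    source.write_text(json.dumps({"text": "I love pizza!"}), encoding="utf-8")
    result = converter.run(sources=[source], meta={})
    print(result["documents"])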

MultiFileConverter.to_dict

def to_dict() -> Dict[str, Any]

Serialize this instance to a dictionary.

MultiFileConverter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "MultiFileConverter"

Load this instance from a dictionary.
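
A serialization round trip, assuming the converter from the usage example above:

data = converter.to_dict()
restored = MultiFileConverter.from_dict(data)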

Module haystack_experimental.super_components.indexers.sentence_transformers_document_indexer

SentenceTransformersDocumentIndexer

A document indexer that takes a list of documents, embeds them using SentenceTransformers, and stores them.

Usage:

>>> from haystack import Document
>>> from haystack_experimental.super_components.indexers.sentence_transformers_document_indexer import SentenceTransformersDocumentIndexer
>>> from haystack.document_stores.in_memory import InMemoryDocumentStore
>>> document_store = InMemoryDocumentStore()
>>> doc = Document(content="I love pizza!")
>>> indexer = SentenceTransformersDocumentIndexer(document_store=document_store)
>>> indexer.warm_up()
>>> result = indexer.run(documents=[doc])
>>> print(result)
{'documents_written': 1}
>>> document_store.count_documents()
1

SentenceTransformersDocumentIndexer.__init__

def __init__(
        document_store: DocumentStore,
        model: str = "sentence-transformers/all-mpnet-base-v2",
        device: Optional[ComponentDevice] = None,
        token: Optional[Secret] = Secret.from_env_var(
            ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
        prefix: str = "",
        suffix: str = "",
        batch_size: int = 32,
        progress_bar: bool = True,
        normalize_embeddings: bool = False,
        meta_fields_to_embed: Optional[List[str]] = None,
        embedding_separator: str = "\n",
        trust_remote_code: bool = False,
        truncate_dim: Optional[int] = None,
        model_kwargs: Optional[Dict[str, Any]] = None,
        tokenizer_kwargs: Optional[Dict[str, Any]] = None,
        config_kwargs: Optional[Dict[str, Any]] = None,
        precision: Literal["float32", "int8", "uint8", "binary",
                           "ubinary"] = "float32",
        duplicate_policy: DuplicatePolicy = DuplicatePolicy.OVERWRITE) -> None

Initialize the SentenceTransformersDocumentIndexer component.

Arguments:

  • document_store: The document store where the documents should be stored.
  • model: The embedding model to use (local path or Hugging Face model ID).
  • device: The device to use for loading the model.
  • token: The API token to download private models from Hugging Face.
  • prefix: String to add at the beginning of each document text.
  • suffix: String to add at the end of each document text.
  • batch_size: Number of documents to embed at once.
  • progress_bar: If True, shows a progress bar when embedding documents.
  • normalize_embeddings: If True, embeddings are L2 normalized.
  • meta_fields_to_embed: List of metadata fields to embed along with the document text.
  • embedding_separator: Separator used to concatenate metadata fields to document text.
  • trust_remote_code: If True, allows custom models and scripts.
  • truncate_dim: Dimension to truncate sentence embeddings to.
  • model_kwargs: Additional keyword arguments for model initialization.
  • tokenizer_kwargs: Additional keyword arguments for tokenizer initialization.
  • config_kwargs: Additional keyword arguments for model configuration.
  • precision: The precision to use for the embeddings.
  • duplicate_policy: The duplicate policy to use when writing documents.
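
As an illustration, a minimal sketch that combines a few of these options (the model name, metadata field, and values are arbitrary choices, not defaults):

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

store = InMemoryDocumentStore()
indexer = SentenceTransformersDocumentIndexer(
    document_store=store,
    model="sentence-transformers/all-MiniLM-L6-v2",
    meta_fields_to_embed=["title"],
    batch_size=16,
    normalize_embeddings=True,
    duplicate_policy=DuplicatePolicy.SKIP,
)
indexer.warm_up()
indexer.run(documents=[Document(content="I love pizza!", meta={"title": "Food"})])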

SentenceTransformersDocumentIndexer.to_dict

def to_dict() -> Dict[str, Any]

Serialize this instance to a dictionary.

SentenceTransformersDocumentIndexer.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SentenceTransformersDocumentIndexer"

Load an instance of this component from a dictionary.

Module haystack_experimental.super_components.preprocessors.document_preprocessor

DocumentPreProcessor

A SuperComponent that cleans documents and then splits them.

This component composes a DocumentCleaner followed by a DocumentSplitter in a single pipeline. It takes a list of documents as input and returns a processed list of documents.

Usage:

from haystack import Document
from haystack_experimental.super_components.preprocessors.document_preprocessor import DocumentPreProcessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreProcessor()
results = preprocessor.run(documents=[doc])
print(results["documents"])

DocumentPreProcessor.__init__

def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[List[str]] = None,
             remove_regex: Optional[str] = None,
             unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
                                                     "NFKD"]] = None,
             ascii_only: bool = False,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Optional[Callable[[str], List[str]]] = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True) -> None

Initialize a DocumentPreProcessor that first cleans documents and then splits them.

Arguments:

Cleaner Params:

  • remove_empty_lines: If True, removes empty lines.
  • remove_extra_whitespaces: If True, removes extra whitespaces.
  • remove_repeated_substrings: If True, removes repeated substrings such as headers and footers across pages.
  • keep_id: If True, keeps the original document IDs.
  • remove_substrings: A list of strings to remove from the document content.
  • remove_regex: A regex pattern whose matches will be removed from the document content.
  • unicode_normalization: Unicode normalization form to apply to the text, e.g. "NFC".
  • ascii_only: If True, converts text to ASCII only.

Splitter Params:

  • split_by: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
  • split_length: The maximum number of units (words, lines, pages, etc.) in each split.
  • split_overlap: The number of overlapping units between consecutive splits.
  • split_threshold: The minimum number of units per split. If a split is smaller than this, it's merged with the previous split.
  • splitting_function: A custom function for splitting if split_by="function".
  • respect_sentence_boundary: If True, splits by words but tries not to break inside a sentence.
  • language: Language used by the sentence tokenizer if split_by="sentence" or respect_sentence_boundary=True.
  • use_split_rules: Whether to apply additional splitting heuristics for the sentence splitter.
  • extend_abbreviations: Whether to extend the sentence splitter with curated abbreviations for certain languages.
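
For example, a minimal sketch that keeps the default cleaning but splits into overlapping sentence-based chunks (the values are illustrative):

from haystack import Document

preprocessor = DocumentPreProcessor(
    split_by="sentence",
    split_length=5,
    split_overlap=1,
    language="en",
)
doc = Document(content="I love pizza! It is my favorite food. I eat it every day.")
result = preprocessor.run(documents=[doc])
print(result["documents"])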

DocumentPreProcessor.to_dict

def to_dict() -> Dict[str, Any]

Serialize this instance to a dictionary.

DocumentPreProcessor.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "DocumentPreProcessor"

Load this instance from a dictionary.