Module haystack_experimental.core.super_component.super_component
A class for super components that wrap around a pipeline.
InvalidMappingError
Raised when input or output mappings are invalid or type conflicts are found.
SuperComponent
A class for creating super components that wrap around a Pipeline.
This component allows for remapping of input and output socket names between the wrapped pipeline and the external interface. It handles type checking and verification of all mappings.
Arguments:
pipeline
: The pipeline wrapped by the component
input_mapping
: Mapping from component input names to lists of pipeline socket paths in format "component_name.socket_name"
output_mapping
: Mapping from pipeline socket paths to component output names
Raises:
InvalidMappingError
: If any input or output mappings are invalid or if type conflicts are detected
ValueError
: If no pipeline is provided
SuperComponent.__init__
def __init__(pipeline: Pipeline,
input_mapping: Optional[Dict[str, List[str]]] = None,
output_mapping: Optional[Dict[str, str]] = None) -> None
Initialize the component with optional I/O mappings.
Arguments:
pipeline
: The pipeline to wrap
input_mapping
: Optional input name mapping configuration
output_mapping
: Optional output name mapping configuration
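A minimal construction sketch, assuming the import path implied by the module name above; the wrapped components and the mapped socket names ("files", "chunks") are illustrative choices, not part of the SuperComponent API:
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack_experimental.core.super_component.super_component import SuperComponent  # import path assumed
pipeline = Pipeline()
pipeline.add_component("converter", TextFileToDocument())
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
pipeline.connect("converter.documents", "splitter.documents")
# Expose converter.sources as the wrapper input "files" and
# splitter.documents as the wrapper output "chunks".
wrapper = SuperComponent(
    pipeline=pipeline,
    input_mapping={"files": ["converter.sources"]},
    output_mapping={"splitter.documents": "chunks"},
)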
SuperComponent.warm_up
def warm_up() -> None
Warms up the pipeline if it has not been warmed up before.
SuperComponent.to_dict
def to_dict() -> Dict[str, Any]
Convert the SuperComponent to a dictionary representation.
Must be overridden by custom component implementations that inherit from SuperComponent.
Returns:
Dictionary containing serialized super component data
SuperComponent.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "SuperComponent"
Create the SuperComponent instance from a dictionary representation.
Must be overridden by custom component implementations that inherit from SuperComponent.
Arguments:
data
: Dictionary containing serialized super component data
Returns:
New SuperComponent instance
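A hedged round-trip sketch, reusing the wrapper from the construction example above; the exact layout of the serialized dictionary is an implementation detail and is not assumed here:
data = wrapper.to_dict()
restored = SuperComponent.from_dict(data)
# restored is a new SuperComponent wrapping an equivalent pipeline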
SuperComponent.run
def run(**kwargs: Any) -> Dict[str, Any]
Run the wrapped pipeline with the given inputs.
This method:
- Maps input kwargs to pipeline component inputs
- Executes the pipeline
- Maps pipeline outputs back to wrapper outputs
Arguments:
kwargs
: Keyword arguments matching wrapper input names
Raises:
ValueError
: If no pipeline is configured
InvalidMappingError
: If output conflicts occur during auto-mapping
Returns:
Dictionary mapping wrapper output names to values
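Continuing the construction sketch above, where "files" was mapped to converter.sources and splitter.documents was mapped to "chunks", a call could look like this (the file path is a placeholder):
wrapper.warm_up()
result = wrapper.run(files=["notes.txt"])  # placeholder path
print(result["chunks"])  # the split Documents produced by the wrapped pipeline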
Module haystack_experimental.super_components.converters.multi_file_converter
MultiFileConverter
A file converter that handles conversion of multiple file types.
The MultiFileConverter handles the following file types:
- CSV
- DOCX
- HTML
- JSON
- MD
- TEXT
- PDF (no OCR)
- PPTX
- XLSX
Usage:
converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})
MultiFileConverter.__init__
def __init__(encoding: str = "utf-8",
json_content_key: str = "content") -> None
Initialize the MultiFileConverter.
Arguments:
encoding
: The encoding to use when reading files.
json_content_key
: The key to use as the content field in a document when converting JSON files.
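A short initialization sketch with non-default settings; the file names are placeholders and the "documents" output key is an assumption about the converter's output socket, not stated above:
from haystack_experimental.super_components.converters.multi_file_converter import MultiFileConverter
# Decode files as latin-1 and take the "text" key as document content for JSON files.
converter = MultiFileConverter(encoding="latin-1", json_content_key="text")
result = converter.run(sources=["report.pdf", "data.json"], meta={})
print(result["documents"])  # output socket name assumed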
MultiFileConverter.to_dict
def to_dict() -> Dict[str, Any]
Serialize this instance to a dictionary.
MultiFileConverter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "MultiFileConverter"
Load this instance from a dictionary.
Module haystack_experimental.super_components.indexers.sentence_transformers_document_indexer
SentenceTransformersDocumentIndexer
A document indexer that takes a list of documents, embeds them using SentenceTransformers, and stores them.
Usage:
>>> from haystack import Document
>>> from haystack.document_stores.in_memory import InMemoryDocumentStore
>>> document_store = InMemoryDocumentStore()
>>> doc = Document(content="I love pizza!")
>>> indexer = SentenceTransformersDocumentIndexer(document_store=document_store)
>>> indexer.warm_up()
>>> result = indexer.run(documents=[doc])
>>> print(result)
{'documents_written': 1}
>>> document_store.count_documents()
1
SentenceTransformersDocumentIndexer.__init__
def __init__(
document_store: DocumentStore,
model: str = "sentence-transformers/all-mpnet-base-v2",
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
prefix: str = "",
suffix: str = "",
batch_size: int = 32,
progress_bar: bool = True,
normalize_embeddings: bool = False,
meta_fields_to_embed: Optional[List[str]] = None,
embedding_separator: str = "\n",
trust_remote_code: bool = False,
truncate_dim: Optional[int] = None,
model_kwargs: Optional[Dict[str, Any]] = None,
tokenizer_kwargs: Optional[Dict[str, Any]] = None,
config_kwargs: Optional[Dict[str, Any]] = None,
precision: Literal["float32", "int8", "uint8", "binary",
"ubinary"] = "float32",
duplicate_policy: DuplicatePolicy = DuplicatePolicy.OVERWRITE) -> None
Initialize the SentenceTransformersDocumentIndexer component.
Arguments:
document_store
: The document store where the documents should be stored.
model
: The embedding model to use (local path or Hugging Face model ID).
device
: The device to use for loading the model.
token
: The API token to download private models from Hugging Face.
prefix
: String to add at the beginning of each document text.
suffix
: String to add at the end of each document text.
batch_size
: Number of documents to embed at once.
progress_bar
: If True, shows a progress bar when embedding documents.
normalize_embeddings
: If True, embeddings are L2 normalized.
meta_fields_to_embed
: List of metadata fields to embed along with the document text.
embedding_separator
: Separator used to concatenate metadata fields to document text.
trust_remote_code
: If True, allows custom models and scripts.
truncate_dim
: Dimension to truncate sentence embeddings to.
model_kwargs
: Additional keyword arguments for model initialization.
tokenizer_kwargs
: Additional keyword arguments for tokenizer initialization.
config_kwargs
: Additional keyword arguments for model configuration.
precision
: The precision to use for the embeddings.
duplicate_policy
: The duplicate policy to use when writing documents.
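A hedged initialization sketch with a few non-default arguments; the model name and the "title" metadata field are example choices, not requirements:
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_experimental.super_components.indexers.sentence_transformers_document_indexer import SentenceTransformersDocumentIndexer
document_store = InMemoryDocumentStore()
# Embed the "title" metadata field together with the content, use a smaller
# model, and skip documents that already exist in the store.
indexer = SentenceTransformersDocumentIndexer(
    document_store=document_store,
    model="sentence-transformers/all-MiniLM-L6-v2",
    meta_fields_to_embed=["title"],
    batch_size=16,
    duplicate_policy=DuplicatePolicy.SKIP,
)
indexer.warm_up()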
SentenceTransformersDocumentIndexer.to_dict
def to_dict() -> Dict[str, Any]
Serialize this instance to a dictionary.
SentenceTransformersDocumentIndexer.from_dict
@classmethod
def from_dict(cls, data: Dict[str,
Any]) -> "SentenceTransformersDocumentIndexer"
Load an instance of this component from a dictionary.
Module haystack_experimental.super_components.preprocessors.document_preprocessor
DocumentPreProcessor
A SuperComponent that cleans documents and then splits them.
This component composes a DocumentCleaner followed by a DocumentSplitter in a single pipeline. It takes a list of documents as input and returns a processed list of documents.
Usage:
from haystack import Document
doc = Document(content="I love pizza!")
preprocessor = DocumentPreProcessor()
results = preprocessor.run(documents=[doc])
print(results["documents"])
DocumentPreProcessor.__init__
def __init__(remove_empty_lines: bool = True,
remove_extra_whitespaces: bool = True,
remove_repeated_substrings: bool = False,
keep_id: bool = False,
remove_substrings: Optional[List[str]] = None,
remove_regex: Optional[str] = None,
unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
"NFKD"]] = None,
ascii_only: bool = False,
split_by: Literal["function", "page", "passage", "period", "word",
"line", "sentence"] = "word",
split_length: int = 250,
split_overlap: int = 0,
split_threshold: int = 0,
splitting_function: Optional[Callable[[str], List[str]]] = None,
respect_sentence_boundary: bool = False,
language: Language = "en",
use_split_rules: bool = True,
extend_abbreviations: bool = True) -> None
Initialize a DocumentPreProcessor that first cleans documents and then splits them.
Arguments:
Cleaner Params:
remove_empty_lines
: If True, removes empty lines.
remove_extra_whitespaces
: If True, removes extra whitespaces.
remove_repeated_substrings
: If True, removes repeated substrings like headers/footers across pages.
keep_id
: If True, keeps the original document IDs.
remove_substrings
: A list of strings to remove from the document content.
remove_regex
: A regex pattern whose matches will be removed from the document content.
unicode_normalization
: Unicode normalization form to apply to the text, e.g. "NFC".
ascii_only
: If True, convert text to ASCII only.
Splitter Params:
split_by
: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
split_length
: The maximum number of units (words, lines, pages, etc.) in each split.
split_overlap
: The number of overlapping units between consecutive splits.
split_threshold
: The minimum number of units per split. If a split is smaller than this, it's merged with the previous split.
splitting_function
: A custom function for splitting if split_by="function".
respect_sentence_boundary
: If True, splits by words but tries not to break inside a sentence.
language
: Language used by the sentence tokenizer if split_by="sentence" or respect_sentence_boundary=True.
use_split_rules
: Whether to apply additional splitting heuristics for the sentence splitter.
extend_abbreviations
: Whether to extend the sentence splitter with curated abbreviations for certain languages.
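A hedged configuration sketch combining cleaner and splitter settings; the specific values are illustrative:
from haystack import Document
from haystack_experimental.super_components.preprocessors.document_preprocessor import DocumentPreProcessor
# Clean whitespace, then split into chunks of up to 120 words with a 20-word
# overlap, keeping sentences intact where possible.
preprocessor = DocumentPreProcessor(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    split_by="word",
    split_length=120,
    split_overlap=20,
    respect_sentence_boundary=True,
    language="en",
)
results = preprocessor.run(documents=[Document(content="Long raw text ...")])
print(results["documents"])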
DocumentPreProcessor.to_dict
def to_dict() -> Dict[str, Any]
Serialize this instance to a dictionary.
DocumentPreProcessor.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "DocumentPreProcessor"
Load this instance from a dictionary.