API Reference

Extractors

Enriches documents with information extracted using machine learning models.

Module haystack_experimental.components.extractors.llm_document_content_extractor

LLMDocumentContentExtractor

Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image using the DocumentToImageContent component, then passes the image together with an instruction prompt to a vision-enabled ChatGenerator, which returns the extracted textual content.

The prompt must not contain variables; it should only include instructions for the LLM. Image data and the prompt are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate failed_documents list. These failed documents will have a content_extraction_error entry in their metadata. This metadata can be used for debugging or for reprocessing the documents later.

Usage example

from haystack import Document
from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.components.extractors import LLMDocumentContentExtractor
chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#           meta={'file_path': 'image.jpg'}),
#  ...]

LLMDocumentContentExtractor.__init__

def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: Optional[str] = None,
             detail: Optional[Literal["auto", "high", "low"]] = None,
             size: Optional[Tuple[int, int]] = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)

Initialize the LLMDocumentContentExtractor component.

Arguments:

  • chat_generator: A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain text response. Currently, the experimental versions of OpenAIChatGenerator and AmazonBedrockChatGenerator are supported.
  • prompt: Instructional text provided to the LLM. It must not contain Jinja variables. The prompt should only contain instructions on how to extract the content of the image-based document.
  • file_path_meta_field: The metadata field in the Document that contains the file path to the image or PDF.
  • root_path: The root directory path where document files are located. If provided, file paths in document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
  • detail: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This will be passed to chat_generator when processing the images.
  • size: If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.
  • raise_on_failure: If True, exceptions raised during LLM calls are propagated. If False, the failure is logged and the affected document is returned in the failed_documents output.
  • max_workers: Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor.
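
As an illustration of these parameters, here is a minimal configuration sketch; the prompt text, root path, and size values are placeholder assumptions, not defaults of the component.

from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.components.extractors import LLMDocumentContentExtractor

# Illustrative instruction prompt: plain instructions only, no Jinja variables.
custom_prompt = (
    "Transcribe all text visible in the document image. "
    "Preserve headings, paragraphs, and tables as plain text."
)

extractor = LLMDocumentContentExtractor(
    chat_generator=OpenAIChatGenerator(),
    prompt=custom_prompt,
    file_path_meta_field="file_path",  # metadata key holding the image or PDF path
    root_path="/data/scans",           # assumed root; file paths in metadata resolve against it
    detail="low",                      # image detail level, only honored by OpenAI
    size=(1024, 1024),                 # resize to fit within 1024x1024, keeping aspect ratio
    raise_on_failure=False,            # return failed documents instead of raising
    max_workers=3,                     # parallel LLM calls via a ThreadPoolExecutor
)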

LLMDocumentContentExtractor.warm_up

def warm_up()

Warm up the ChatGenerator if it has a warm_up method.
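
When the extractor is used on its own rather than inside a pipeline, you can call warm_up explicitly before run; for generators that define no warm_up method (for example, a remote API client), the call is effectively a no-op. A minimal sketch:

from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.components.extractors import LLMDocumentContentExtractor

extractor = LLMDocumentContentExtractor(chat_generator=OpenAIChatGenerator())
# Delegates to chat_generator.warm_up() if that method exists; otherwise does nothing.
extractor.warm_up()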

LLMDocumentContentExtractor.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

LLMDocumentContentExtractor.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "LLMDocumentContentExtractor"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary with serialized data.

Returns:

An instance of the component.
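
A minimal serialization round-trip sketch, assuming a default OpenAIChatGenerator:

from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.components.extractors import LLMDocumentContentExtractor

extractor = LLMDocumentContentExtractor(chat_generator=OpenAIChatGenerator())

# Serialize the component, including its chat_generator settings, to a dictionary.
data = extractor.to_dict()

# Rebuild an equivalent component from that dictionary.
restored = LLMDocumentContentExtractor.from_dict(data)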

LLMDocumentContentExtractor.run

@component.output_types(documents=List[Document],
                        failed_documents=List[Document])
def run(documents: List[Document]) -> Dict[str, List[Document]]

Run content extraction on a list of image-based documents using a vision-capable LLM.

Each document is passed to the LLM along with a predefined prompt. The response is used to update the document's content. If the extraction fails, the document is returned in the failed_documents list with metadata describing the failure.

Arguments:

  • documents: A list of image-based documents to process. Each must have a valid file path in its metadata.

Returns:

A dictionary with:

  • "documents": Successfully processed documents, updated with extracted content.
  • "failed_documents": Documents that failed processing, annotated with failure metadata.