Version: 2.27

Tika

haystack_integrations.components.converters.tika.converter

XHTMLParser

Bases: HTMLParser

Custom parser to extract pages from Tika XHTML content.

__init__

python
__init__() -> None

Initialize the XHTMLParser.

handle_starttag

python
handle_starttag(tag: str, attrs: list[tuple[str, str | None]]) -> None

Identify the start of a page div.

Parameters:

  • tag (str) – The HTML tag name.
  • attrs (list[tuple[str, str | None]]) – The HTML tag attributes.

handle_endtag

python
handle_endtag(tag: str) -> None

Identify the end of a page div.

Parameters:

  • tag (str) – The HTML tag name.

handle_data

python
handle_data(data: str) -> None

Populate the page content.

Parameters:

  • data (str) – The text content of an HTML node.
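The three handlers above work together as a small state machine. The following is a minimal sketch of that idea, not the integration's actual code: it assumes that Tika wraps each page in `<div class="page">`, and the class name `PageParserSketch` and its attributes are hypothetical.

```python
from html.parser import HTMLParser


class PageParserSketch(HTMLParser):
    """Illustrative only: collects the text of each <div class="page"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_page = False          # True while inside a page div
        self.current = ""             # text gathered for the current page
        self.pages: list[str] = []    # finished pages, in document order

    def handle_starttag(self, tag, attrs):
        # Assumption: a page starts at <div class="page">.
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True

    def handle_endtag(self, tag):
        # The first closing div after a page start ends the page.
        # Nested divs inside a page are not handled in this sketch.
        if self.in_page and tag == "div":
            self.in_page = False
            self.pages.append(self.current.strip())
            self.current = ""

    def handle_data(self, data):
        # Text is kept only while a page is open.
        if self.in_page:
            self.current += data


sample = (
    '<html><body>'
    '<div class="page"><p>First page.</p></div>'
    '<div class="page"><p>Second page.</p></div>'
    '</body></html>'
)
parser = PageParserSketch()
parser.feed(sample)
pages = parser.pages
```

Feeding the sample markup leaves one string per page div in `pages`.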

TikaDocumentConverter

Converts files of different types to Documents using Apache Tika.

This component relies on Apache Tika to parse files, so it requires a running Tika server. For more options on running Tika, see the official documentation.

Usage example:

python
from haystack_integrations.components.converters.tika import TikaDocumentConverter
from datetime import datetime

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)
# >> 'This is a text from the docx file.'

__init__

python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None

Create a TikaDocumentConverter component.

Parameters:

  • tika_url (str) – Tika server URL.
  • store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
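The effect of `store_full_path` can be pictured with a small sketch. The helper name `stored_file_path` is hypothetical and is not part of the integration; it only illustrates the difference between keeping the full path and keeping the file name.

```python
from pathlib import PurePosixPath


def stored_file_path(file_path: str, store_full_path: bool) -> str:
    """Hypothetical helper: return the value that would end up in the metadata."""
    if store_full_path:
        # Keep the path exactly as it was given.
        return file_path
    # Otherwise keep only the final path component (the file name).
    return PurePosixPath(file_path).name


full = stored_file_path("/data/reports/sample.docx", store_full_path=True)
name_only = stored_file_path("/data/reports/sample.docx", store_full_path=False)
```

`PurePosixPath` is used here only so the sketch behaves the same on every platform.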

run

python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]

Convert files to Documents.

Parameters:

  • sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
  • meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a single dictionary or a list of dictionaries. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped together. If sources contains ByteStream objects, their meta is also added to the output Documents.

Returns:

  • dict[str, list[Document]] – A dictionary with the following keys:
      • documents: Created Documents