Tika
haystack_integrations.components.converters.tika.converter
XHTMLParser
Bases: HTMLParser
Custom parser to extract pages from Tika XHTML content.
__init__
Initialize the XHTMLParser.
handle_starttag
Identify the start of a page div.
Parameters:
- tag (str) – The HTML tag name.
- attrs (list[tuple[str, str | None]]) – The HTML tag attributes.
handle_endtag
Identify the end of a page div.
Parameters:
- tag (str) – The HTML tag name.
handle_data
Populate the page content.
Parameters:
- data (str) – The text content of an HTML node.
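The three handlers above can be illustrated with a minimal sketch built on the standard library's HTMLParser. This is not the integration's actual implementation: the class name PageParser, the `<div class="page">` marker, and the `pages` attribute are assumptions made for illustration, and nested divs inside a page are deliberately ignored.

```python
from html.parser import HTMLParser
from typing import Optional


class PageParser(HTMLParser):
    """Hypothetical stand-in for XHTMLParser: collects the text of each page div."""

    def __init__(self) -> None:
        super().__init__()
        self.in_page = False
        self.pages: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, Optional[str]]]) -> None:
        # Assumption: each page is wrapped in <div class="page">.
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True
            self.pages.append("")

    def handle_endtag(self, tag: str) -> None:
        # Simplification: any closing </div> ends the current page.
        if tag == "div" and self.in_page:
            self.in_page = False

    def handle_data(self, data: str) -> None:
        # Text is appended only while the parser is inside a page div.
        if self.in_page:
            self.pages[-1] += data


xhtml = (
    '<html><body>'
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div>'
    '</body></html>'
)
parser = PageParser()
parser.feed(xhtml)
parser.close()
pages = [page.strip() for page in parser.pages]
```

Keeping one string per page in a list preserves page boundaries, so a caller can join the pages with whatever separator it prefers.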
TikaDocumentConverter
Converts files of different types to Documents using Apache Tika.
This component uses Apache Tika to parse the files and therefore requires a running Tika server. For more options on running Tika, see the official documentation.
Usage example:
```python
from haystack_integrations.components.converters.tika import TikaDocumentConverter
from datetime import datetime

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# >> 'This is a text from the docx file.'
```
__init__
```python
__init__(
    tika_url: str = "http://localhost:9998/tika", store_full_path: bool = False
) -> None
```
Create a TikaDocumentConverter component.
Parameters:
- tika_url (str) – Tika server URL.
- store_full_path (bool) – If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
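The effect of store_full_path can be sketched with a small standalone function. The helper name file_path_meta is hypothetical and is not part of the integration; it only mirrors the rule described above, where the stored value is either the full path or just the file name.

```python
from pathlib import Path


def file_path_meta(file_path: str, store_full_path: bool = False) -> str:
    """Hypothetical helper: pick the path value that ends up in the metadata."""
    # Keep the path untouched when requested, otherwise keep only the final component.
    return file_path if store_full_path else Path(file_path).name


full = file_path_meta("/data/reports/sample.docx", store_full_path=True)
short = file_path_meta("/data/reports/sample.docx")
```

Storing only the file name avoids leaking local directory layouts into the Documents, which is presumably why it is the default.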
run
```python
run(
    sources: list[str | Path | ByteStream],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None,
) -> dict[str, list[Document]]
```
Convert files to Documents.
Parameters:
- sources (list[str | Path | ByteStream]) – List of file paths or ByteStream objects.
- meta (dict[str, Any] | list[dict[str, Any]] | None) – Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, the length of the list must match the number of sources, because the two lists will be zipped. If sources contains ByteStream objects, their meta will be added to the output Documents.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: Created Documents
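The rules for the meta parameter can be made concrete with a short sketch. The function normalize_meta below is a hypothetical helper written for this example, not the component's own code; it assumes that a missing meta yields empty dictionaries and that a length mismatch raises a ValueError.

```python
from typing import Any, Optional, Union


def normalize_meta(
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]],
    sources_count: int,
) -> list[dict[str, Any]]:
    """Hypothetical helper: expand meta into one dictionary per source."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        # Zipping lists of different lengths would silently drop entries.
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(entry) for entry in meta]


sources = ["sample.docx", "my_document.rtf"]
shared = normalize_meta({"language": "en"}, len(sources))
per_source = normalize_meta([{"author": "a"}, {"author": "b"}], len(sources))
paired = list(zip(sources, per_source))
```

Checking the length before zipping matters because zip stops at the shorter input, so a mismatch would otherwise leave some sources without metadata and raise no error.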