Module azure

AzureConverter

class AzureConverter(BaseConverter)

File converter that makes use of Microsoft Azure's Form Recognizer service (https://azure.microsoft.com/en-us/services/form-recognizer/). This Converter extracts both text and tables. Supported file formats are: PDF, JPEG, PNG, BMP and TIFF.

To use this Converter, you need an active Azure account and a Form Recognizer or Cognitive Services resource. For setup instructions, see https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-v3-python-sdk#prerequisites

AzureConverter.__init__

def __init__(endpoint: str,
             credential_key: str,
             model_id: str = "prebuilt-document",
             valid_languages: Optional[List[str]] = None,
             save_json: bool = False,
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             id_hash_keys: Optional[List[str]] = None,
             add_page_number: bool = True)

Arguments:

endpoint: Your Form Recognizer or Cognitive Services resource's endpoint.
credential_key: Your Form Recognizer or Cognitive Services resource's subscription key.
model_id: The identifier of the model you want to use to extract information out of your file. Default: "prebuilt-document". General purpose models are "prebuilt-document" and "prebuilt-layout". List of available prebuilt models: https://azuresdkdocs.blob.core.windows.net/$web/python/azure-ai-formrecognizer/3.2.0b1/index.html#documentanalysisclient
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
save_json: Whether to save the output of the Form Recognizer to a JSON file.
preceding_context_len: Number of lines before a table to extract as preceding context (will be returned as part of meta data).
following_context_len: Number of lines after a table to extract as subsequent context (will be returned as part of meta data).
merge_multiple_column_headers: Some tables contain more than one row as a column header (i.e., column description). This parameter lets you choose whether to merge multiple column header rows into a single row.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
add_page_number: Adds the number of the page a table occurs in to the Document's meta field "page".
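
To illustrate what preceding_context_len and following_context_len control, here is a minimal sketch that slices the lines around a table. It is not the converter's implementation: the function name table_context and the table_start/table_end index convention are assumptions made for this example.

```python
from typing import List, Tuple


def table_context(
    lines: List[str],
    table_start: int,
    table_end: int,
    preceding_context_len: int = 3,
    following_context_len: int = 3,
) -> Tuple[List[str], List[str]]:
    """Return the lines just before and just after a table.

    Assumption for this sketch: `table_start` is the index of the table's
    first line and `table_end` is the index one past its last line.
    """
    before = lines[max(0, table_start - preceding_context_len):table_start]
    after = lines[table_end:table_end + following_context_len]
    return before, after


page = ["Title", "Intro", "See table:", "a | b", "1 | 2", "Source: X", "End"]
before, after = table_context(
    page, table_start=3, table_end=5,
    preceding_context_len=2, following_context_len=1,
)
print(before)  # ['Intro', 'See table:']
print(after)   # ['Source: X']
```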

AzureConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, Any]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "utf-8",
            id_hash_keys: Optional[List[str]] = None,
            pages: Optional[str] = None,
            known_language: Optional[str] = None) -> List[Document]

Extract text and tables from a PDF, JPEG, PNG, BMP or TIFF file using Azure's Form Recognizer service.

Arguments:

file_path: Path to the file you want to convert.
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: Not applicable.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Not applicable.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
pages: Custom page numbers for multi-page documents (PDF/TIFF). Input the page numbers and/or ranges of pages you want to get in the result. For a range of pages, use a hyphen, like pages="1-3, 5-6". Separate each page number or range with a comma.
known_language: Locale hint of the input document. See supported locales here: https://aka.ms/azsdk/formrecognizer/supportedlocales.
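
The format of the pages string can be made concrete with a small parser. This is only a sketch of what a value such as "1-3, 5-6" denotes; the helper parse_pages is hypothetical, and the source does not say whether the converter itself interprets the string or forwards it to the service.

```python
from typing import List


def parse_pages(pages: str) -> List[int]:
    """Expand a string like "1-3, 5-6" into the page numbers it selects."""
    result: List[int] = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "1-3" includes both end points.
            start, end = part.split("-", 1)
            result.extend(range(int(start), int(end) + 1))
        else:
            result.append(int(part))
    return result


print(parse_pages("1-3, 5-6"))  # [1, 2, 3, 5, 6]
```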

AzureConverter.convert_azure_json

def convert_azure_json(
        file_path: Path,
        meta: Optional[Dict[str, Any]] = None,
        valid_languages: Optional[List[str]] = None,
        id_hash_keys: Optional[List[str]] = None) -> List[Document]

Extract text and tables from the JSON output of Azure's Form Recognizer service.

Arguments:

file_path: Path to the JSON-file you want to convert.
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.

AzureConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate whether the language of the text is one of the valid languages.

AzureConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
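
The ligature clean-up described for known_ligatures amounts to a find-and-replace over a mapping. The sketch below shows the idea with a three-entry mapping; EXAMPLE_LIGATURES and replace_ligatures are hypothetical names, and the library's default mapping (KNOWN_LIGATURES) may hold different entries.

```python
from typing import Dict, Optional

# Hypothetical, much smaller stand-in for the library's default mapping.
EXAMPLE_LIGATURES: Dict[str, str] = {
    "\ufb00": "ff",  # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
}


def replace_ligatures(text: str, known_ligatures: Optional[Dict[str, str]] = None) -> str:
    """Replace each ligature character with its split counterpart."""
    mapping = EXAMPLE_LIGATURES if known_ligatures is None else known_ligatures
    for ligature, letters in mapping.items():
        text = text.replace(ligature, letters)
    return text


print(replace_ligatures("An e\ufb00icient \ufb01le \ufb02ow"))  # An efficient file flow
```

Passing your own dictionary as known_ligatures replaces the default mapping in this sketch rather than extending it.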

Module csv

CsvTextConverter

class CsvTextConverter(BaseConverter)

Converts a CSV file containing FAQs to text Documents. The CSV file must have two columns: 'question' and 'answer'. Use this node for FAQ-style question answering.

CsvTextConverter.convert

def convert(file_path: Union[Path, List[Path], str, List[str],
                             List[Union[Path, str]]],
            meta: Optional[Dict[str, Any]],
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "UTF-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Load a CSV file containing question-answer pairs and convert it to Documents.

Arguments:

file_path: Path to the CSV file you want to convert. The file must have two columns called 'question' and 'answer'. The first will be interpreted as a question, the second as content.
meta: A dictionary of metadata key-value pairs that you want to append to the returned document. It's optional.
encoding: Specifies the file encoding. It's optional. The default value is UTF-8.
id_hash_keys: Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore when texts are not unique, modify the metadata and pass, for example, "meta" to this field (example: ["content", "meta"]). Then the ID is generated by using the content and the metadata you defined.
remove_numeric_tables: Unused.
valid_languages: Unused.

Returns:

List of Documents, one Document per line in the CSV.
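
As a rough picture of the expected input, the sketch below reads a two-column FAQ CSV with the standard library and yields one record per line. It uses plain dictionaries instead of Document objects; the helper name faq_rows_to_dicts and the sample rows are made up for this example, and how the converter maps the two columns onto Document fields is not shown.

```python
import csv
import io
from typing import Dict, List


def faq_rows_to_dicts(csv_text: str) -> List[Dict[str, str]]:
    """Read 'question'/'answer' columns and return one record per CSV line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Both column names are required by the converter's documentation.
        records.append({"question": row["question"], "answer": row["answer"]})
    return records


sample = (
    "question,answer\n"
    "How do I reset my password?,Use the reset link on the login page.\n"
    "Is there a free tier?,Yes.\n"
)
records = faq_rows_to_dicts(sample)
print(len(records))            # 2
print(records[1]["answer"])    # Yes.
```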

CsvTextConverter.__init__

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True)

Arguments:

remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
progress_bar: Show a progress bar for the conversion.
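
The effect of id_hash_keys can be sketched as hashing only the selected attributes. The example below is an illustration under stated assumptions: make_document_id is a hypothetical helper, documents are plain dictionaries, and SHA-256 from the standard library stands in for whatever hash function the library actually uses.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def make_document_id(doc: Dict[str, Any], id_hash_keys: Optional[List[str]] = None) -> str:
    """Derive an ID from the attributes named in `id_hash_keys`."""
    keys = id_hash_keys or ["content"]  # assumption: content only by default
    payload = ":".join(json.dumps(doc.get(key), sort_keys=True) for key in keys)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = {"content": "Same text", "meta": {"source": "a.csv"}}
b = {"content": "Same text", "meta": {"source": "b.csv"}}

# Hashing the content alone gives both documents the same ID.
print(make_document_id(a) == make_document_id(b))  # True
# Including the metadata tells them apart.
print(make_document_id(a, ["content", "meta"]) == make_document_id(b, ["content", "meta"]))  # False
```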

CsvTextConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate whether the language of the text is one of the valid languages.

CsvTextConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
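
The behaviour of raise_on_failure over a batch of files can be sketched as a loop that either re-raises or skips. This is a simplified model, not the library's code: convert_all and fake_convert are hypothetical, and the results are plain strings rather than Documents.

```python
from pathlib import Path
from typing import Callable, List


def convert_all(
    file_paths: List[Path],
    convert: Callable[[Path], List[str]],
    raise_on_failure: bool = True,
) -> List[str]:
    """Convert every file; on error, either re-raise or skip that file."""
    results: List[str] = []
    for path in file_paths:
        try:
            results.extend(convert(path))
        except Exception:
            if raise_on_failure:
                raise
            # raise_on_failure=False: skip this file and continue.
    return results


def fake_convert(path: Path) -> List[str]:
    """Stand-in converter that only accepts .txt files."""
    if path.suffix != ".txt":
        raise ValueError(f"Unsupported file: {path}")
    return [f"text of {path.name}"]


paths = [Path("a.txt"), Path("b.bin"), Path("c.txt")]
print(convert_all(paths, fake_convert, raise_on_failure=False))
# ['text of a.txt', 'text of c.txt']
```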

Module docx

DocxToTextConverter

class DocxToTextConverter(BaseConverter)

DocxToTextConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Extract text from a .docx file.

Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here. For compliance with other converters, we nevertheless opted for keeping the method's name.

Arguments:

file_path: Path to the .docx file you want to convert
meta: dictionary of meta data key-value pairs to append in the returned document.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Not applicable
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
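
One way to picture the remove_numeric_tables heuristic is a per-row filter that drops rows made up mostly of number-like tokens. The sketch below is a simplified illustration only: the function remove_numeric_rows, the 0.4 threshold, and the rule "a token is numeric if it contains a digit" are assumptions, not the documented behaviour.

```python
from typing import List


def remove_numeric_rows(lines: List[str], max_digit_ratio: float = 0.4) -> List[str]:
    """Drop rows in which most tokens contain digits; keep text-heavy rows."""
    kept: List[str] = []
    for line in lines:
        words = line.split()
        numeric = [word for word in words if any(ch.isdigit() for ch in word)]
        if words and len(numeric) / len(words) > max_digit_ratio:
            continue  # mostly numbers: treat as a numeric table row
        kept.append(line)
    return kept


rows = [
    "Region Sales Notes",
    "12 34 56",
    "North shows steady growth",
    "7.5 8.1 total",
]
print(remove_numeric_rows(rows))
# ['Region Sales Notes', 'North shows steady growth']
```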

DocxToTextConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate whether the language of the text is one of the valid languages.

DocxToTextConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.

Module image

ImageToTextConverter

class ImageToTextConverter(BaseConverter)

ImageToTextConverter.__init__

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None)

Arguments:

remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
valid_languages: Validate languages from a list of languages specified here: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text. If no value is provided, English is set as the default. To list the available language packs, run: print(pytesseract.get_languages(config=''))
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.

ImageToTextConverter.convert

def convert(file_path: Union[Path, str],
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Extract text from an image file using the pytesseract library (https://github.com/madmaze/pytesseract).

Arguments:

file_path: path to image file
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
valid_languages: Validate languages from a list of languages supported by Tesseract (https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html). This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Not applicable
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.

ImageToTextConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate whether the language of the text is one of the valid languages.

ImageToTextConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.

Module json

JsonConverter

class JsonConverter(BaseConverter)

Extracts text from JSON files and casts it into Document objects.

JsonConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, Any]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "UTF-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Reads a JSON file and converts it into a list of Documents.

It's a wrapper around Document.from_dict() and, as such, acts as the inverse of Document.to_dict().

It expects one of these formats:

A JSON file with a list of Document dicts.
A JSONL file with every line containing either a Document dict or a list of dicts.
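
The two accepted layouts can be told apart with a few lines of standard-library code. The sketch below only parses the text into Document-style dictionaries; the helper load_document_dicts and its jsonl flag are hypothetical, and the step that turns each dictionary into a Document is left out.

```python
import json
from typing import Any, Dict, List


def load_document_dicts(text: str, jsonl: bool) -> List[Dict[str, Any]]:
    """Parse either a JSON list of dicts or JSONL lines of dicts/lists of dicts."""
    if not jsonl:
        return json.loads(text)
    dicts: List[Dict[str, Any]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        parsed = json.loads(line)
        if isinstance(parsed, list):
            dicts.extend(parsed)  # a line holding a list of dicts
        else:
            dicts.append(parsed)  # a line holding a single dict
    return dicts


json_text = '[{"content": "first"}, {"content": "second"}]'
jsonl_text = '{"content": "first"}\n[{"content": "second"}, {"content": "third"}]\n'

print(len(load_document_dicts(json_text, jsonl=False)))  # 2
print(len(load_document_dicts(jsonl_text, jsonl=True)))  # 3
```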

Arguments:

file_path: Path to the JSON file you want to convert.
meta: Optional dictionary with metadata you want to attach to all resulting documents. Can be any custom keys and values. The result contains the union of the metadata specified here and the metadata already present in the JSON. If the same key is used in both, the value passed here takes precedence and overwrites the one from the JSON.
remove_numeric_tables: Uses heuristics to remove numeric rows from the tables. Note: Not currently used in this Converter.
valid_languages: Validates languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. Note: Not currently used in this Converter.
encoding: Encoding used when opening the JSON file.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore if texts are not unique, modify the metadata and pass, for example, "meta" to this field (example: ["content", "meta"]). The id is then generated by using the content and the defined metadata. If specified here or during initialization of the JsonConverter, it will overwrite any id_hash_keys present in the json file.
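
The precedence rule for meta described above matches an ordinary dictionary merge in which the passed values are applied last. A minimal sketch, with merge_meta as a hypothetical helper name:

```python
from typing import Any, Dict, Optional


def merge_meta(json_meta: Dict[str, Any], passed_meta: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Union of both dictionaries; keys in `passed_meta` win on conflict."""
    return {**json_meta, **(passed_meta or {})}


merged = merge_meta({"source": "file", "lang": "en"}, {"source": "upload"})
print(merged)  # {'source': 'upload', 'lang': 'en'}
```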

JsonConverter.__init__

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True)

Arguments:

remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
progress_bar: Show a progress bar for the conversion.

JsonConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate whether the language of the text is one of the valid languages.

JsonConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that could be possible candidates for answers. Rows containing strings are therefore retained when this option is used.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.

Module markdown

MarkdownConverter

class MarkdownConverter(BaseConverter)

MarkdownConverter.init

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True,
             remove_code_snippets: bool = True,
             extract_headlines: bool = False,
             add_frontmatter_to_meta: bool = False)

Arguments:

remove_numeric_tables: Not applicable.
valid_languages: Not applicable.
id_hash_keys: Generate the document ID from a custom list of strings that refer to the document's attributes. To make sure you don't have duplicate documents in your DocumentStore if texts are not unique, you can modify the metadata and pass for example, "meta" to this field (["content", "meta"]). In this case, the ID is generated by using the content and the defined metadata.
progress_bar: Show a progress bar for the conversion.
remove_code_snippets: Whether to remove code snippets from the markdown file.
extract_headlines: Whether to extract headings from the markdown file.
add_frontmatter_to_meta: Whether to add the contents of the frontmatter to meta.
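
To make extract_headlines and remove_code_snippets more concrete, here is a rough sketch of collecting headings from markdown text while ignoring lines inside fenced code blocks. It is an illustration under assumptions, not the converter's actual logic: the function name, the regular expression, and the "headline" and "level" keys are all hypothetical.

```python
import re
from typing import Dict, List, Union

# ATX-style headings: one to six '#' characters followed by the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def extract_headlines(markdown: str) -> List[Dict[str, Union[str, int]]]:
    """Collect headings with their level, skipping fenced code blocks."""
    headlines: List[Dict[str, Union[str, int]]] = []
    in_code_block = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code_block = not in_code_block
            continue
        match = None if in_code_block else HEADING.match(line)
        if match:
            headlines.append({"headline": match.group(2), "level": len(match.group(1))})
    return headlines
```

Skipping fenced blocks matters because a shell comment such as "# install" inside a code snippet would otherwise be mistaken for a level-1 heading.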

MarkdownConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, Any]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "utf-8",
            id_hash_keys: Optional[List[str]] = None,
            remove_code_snippets: Optional[bool] = None,
            extract_headlines: Optional[bool] = None,
            add_frontmatter_to_meta: Optional[bool] = None) -> List[Document]

Reads text from a markdown file and executes optional preprocessing steps.

Arguments:

file_path: path of the file to convert
meta: dictionary of meta data key-value pairs to append in the returned document.
encoding: Select the file encoding (default is utf-8)
remove_numeric_tables: Not applicable
valid_languages: Not applicable
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
remove_code_snippets: Whether to remove code snippets from the markdown file.
extract_headlines: Whether to extract headings from the markdown file.
add_frontmatter_to_meta: Whether to add the contents of the frontmatter to meta.

MarkdownConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate if the language of the text is one of valid languages.
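
A language check of this kind might look like the sketch below. The reference does not state which detector is used or what happens when valid_languages is empty, so both are assumptions here: the detect parameter is a hypothetical injectable callable (its default is a stand-in that always answers "en"), and an empty or missing list is treated as "accept everything".

```python
from typing import Callable, List, Optional


def validate_language(
    text: str,
    valid_languages: Optional[List[str]] = None,
    detect: Callable[[str], str] = lambda text: "en",  # hypothetical stand-in detector
) -> bool:
    """Return True if the detected language is in `valid_languages`."""
    # Assumption: with no restriction configured, every text is accepted.
    if not valid_languages:
        return True
    try:
        language = detect(text)
    except Exception:
        # A detector that cannot decide is treated as "no valid language".
        language = None
    return language in valid_languages
```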

MarkdownConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, we automatically find and replace the most common ligatures with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
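
The effect of id_hash_keys can be illustrated with a small sketch. This is not how the library computes IDs: it uses hashlib.sha256 from the standard library purely for illustration, and make_document_id is a hypothetical helper. The point is only that adding "meta" to the keys makes two documents with identical text but different metadata receive different IDs.

```python
import hashlib
from typing import Any, Dict, List, Optional


def make_document_id(content: str, meta: Dict[str, Any],
                     id_hash_keys: Optional[List[str]] = None) -> str:
    """Derive an ID from the selected document attributes (illustrative only)."""
    keys = id_hash_keys or ["content"]  # assumption: content alone by default
    attributes = {"content": content, "meta": meta}
    parts = []
    for key in keys:
        value = attributes[key]
        # Sort dict items so the hash does not depend on insertion order.
        parts.append(repr(sorted(value.items())) if isinstance(value, dict) else str(value))
    return hashlib.sha256(":".join(parts).encode("utf-8")).hexdigest()


same_text = "Quarterly report"
id_a = make_document_id(same_text, {"year": 2021}, ["content", "meta"])
id_b = make_document_id(same_text, {"year": 2022}, ["content", "meta"])
print(id_a != id_b)  # True
```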

Module parsr

ParsrConverter

class ParsrConverter(BaseConverter)

File converter that makes use of the open-source Parsr tool by axa-group (https://github.com/axa-group/Parsr). This Converter extracts both text and tables. Supported file formats are: PDF and DOCX.

ParsrConverter.init

def __init__(parsr_url: str = "http://localhost:3001",
             extractor: Literal["pdfminer", "pdfjs"] = "pdfminer",
             table_detection_mode: Literal["lattice", "stream"] = "lattice",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             remove_page_headers: bool = False,
             remove_page_footers: bool = False,
             remove_table_of_contents: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             add_page_number: bool = True,
             extract_headlines: bool = True,
             timeout: Union[float, Tuple[float, float]] = 10.0)

Arguments:

parsr_url: URL endpoint to Parsr's REST API.
extractor: Backend used to extract textual structure from PDFs. ("pdfminer" or "pdfjs")
table_detection_mode: Parsing method used to detect tables and their cells. "lattice" detects tables and their cells by demarcated lines between cells. "stream" detects tables and their cells by looking at whitespace between cells.
preceding_context_len: Number of lines before a table to extract as preceding context (will be returned as part of meta data).
following_context_len: Number of lines after a table to extract as subsequent context (will be returned as part of meta data).
remove_page_headers: Whether to remove text that Parsr detected as a page header.
remove_page_footers: Whether to remove text that Parsr detected as a page footer.
remove_table_of_contents: Whether to remove text that Parsr detected as a table of contents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
add_page_number: Adds the number of the page a table occurs in to the Document's meta field "page".
extract_headlines: Whether to extract headings from the PDF file.
timeout: How many seconds to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple. Defaults to 10 seconds.
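
The roles of preceding_context_len and following_context_len can be sketched on a flat list of text lines. This is a simplification: Parsr returns structured elements rather than plain lines, and table_context together with its table_start and table_end indices is hypothetical.

```python
from typing import List, Tuple


def table_context(lines: List[str], table_start: int, table_end: int,
                  preceding_context_len: int = 3,
                  following_context_len: int = 3) -> Tuple[List[str], List[str]]:
    """Return the lines just before and just after lines[table_start:table_end]."""
    # Clamp at 0 so a table near the top of the file does not wrap around.
    preceding = lines[max(0, table_start - preceding_context_len):table_start]
    following = lines[table_end:table_end + following_context_len]
    return preceding, following


lines = ["a", "b", "c", "d", "T1", "T2", "e", "f"]
print(table_context(lines, 4, 6, preceding_context_len=3, following_context_len=1))
# (['b', 'c', 'd'], ['e'])
```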

ParsrConverter.convert

def convert(
        file_path: Path,
        meta: Optional[Dict[str, Any]] = None,
        remove_numeric_tables: Optional[bool] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "utf-8",
        id_hash_keys: Optional[List[str]] = None,
        extract_headlines: Optional[bool] = None,
        timeout: Union[float, Tuple[float, float]] = 10.0) -> List[Document]

Extract text and tables from a PDF or DOCX using the open-source Parsr tool.

Arguments:

file_path: Path to the file you want to convert.
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: Not applicable.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Not applicable.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
extract_headlines: Whether to extract headings from the PDF file.
timeout: How many seconds to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple. Defaults to 10 seconds.
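
The two accepted shapes of timeout can be normalized as in the sketch below. It assumes the value follows the convention of the requests library, where a single number applies to both the connect and the read phase; split_timeout is a hypothetical helper and not part of the converter.

```python
from typing import Tuple, Union

Timeout = Union[float, Tuple[float, float]]


def split_timeout(timeout: Timeout = 10.0) -> Tuple[float, float]:
    """Return (connect_timeout, read_timeout) for either accepted form."""
    if isinstance(timeout, tuple):
        connect, read = timeout
    else:
        # Assumption: a single value is used for both phases.
        connect = read = timeout
    return float(connect), float(read)


print(split_timeout())            # (10.0, 10.0)
print(split_timeout((3.05, 27)))  # (3.05, 27.0)
```

A tuple is useful when connecting should fail fast but a large file may take a long time to process on the server.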

ParsrConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate if the language of the text is one of valid languages.

ParsrConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, we automatically find and replace the most common ligatures with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
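
The run signature accepts either a single path or a list for file_paths, and either one dictionary or a list of dictionaries for meta. One plausible way to pair them up is sketched below. The pairing rules are an assumption (a single dictionary is shared by all files, a list must match the files one to one), and pair_paths_with_meta is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union

Meta = Optional[Dict[str, str]]


def pair_paths_with_meta(
    file_paths: Union[Path, List[Path]],
    meta: Optional[Union[Dict[str, str], List[Meta]]] = None,
) -> List[Tuple[Path, Meta]]:
    """Pair every file path with the metadata that should be attached to it."""
    paths = [file_paths] if isinstance(file_paths, Path) else list(file_paths)
    if meta is None or isinstance(meta, dict):
        # One dictionary (or nothing) applies to every file.
        metas: List[Meta] = [meta] * len(paths)
    else:
        metas = list(meta)
    if len(metas) != len(paths):
        # Assumption of this sketch: a mismatch is treated as a caller error.
        raise ValueError("meta list must have one entry per file path")
    return list(zip(paths, metas))
```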

Module pdf

PDFToTextConverter

class PDFToTextConverter(BaseConverter)

PDFToTextConverter.init

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             encoding: Optional[str] = None,
             keep_physical_layout: Optional[bool] = None,
             sort_by_position: bool = False,
             ocr: Optional[Literal["auto", "full"]] = None,
             ocr_language: str = "eng",
             multiprocessing: Union[bool, int] = True) -> None

Arguments:

remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
encoding: This parameter is being deprecated. It will be automatically detected by PyMuPDF.
keep_physical_layout: This parameter is being deprecated.
sort_by_position: Specifies whether to sort the extracted text by positional coordinates or logical reading order. If set to True, the text is sorted first by vertical position, and then by horizontal position. If set to False (default), the logical reading order in the PDF is used.
ocr: Specifies whether to use OCR to extract text from images in the PDF. If set to "auto", OCR is used only to extract text from images and integrate into the existing text. If set to "full", OCR is used to extract text from the entire PDF.
ocr_language: Specifies the language to use for OCR. The default language is English, whose language code is eng. For a list of supported languages and their codes, see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. You can combine multiple languages by passing a string with the language codes separated by +. For example, to use English and German, pass eng+deu.
multiprocessing: Multiprocessing is used to speed up PyMuPDF conversion. You can disable it by setting this parameter to False. If set to True (the default value), the total number of cores is used. To specify the number of cores to use, set it to an integer.
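
The difference between the two settings of sort_by_position can be shown on a few text blocks with coordinates. This sketch is only an illustration: TextBlock and order_blocks are hypothetical, and it assumes page coordinates where y grows downward, so a smaller y means higher on the page.

```python
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    x: float  # horizontal position of the block
    y: float  # vertical position; assumed to grow downward
    text: str


def order_blocks(blocks: List[TextBlock], sort_by_position: bool = False) -> List[str]:
    """Return block texts in stored order or sorted by position."""
    if sort_by_position:
        # Vertical position first, then horizontal position.
        blocks = sorted(blocks, key=lambda block: (block.y, block.x))
    return [block.text for block in blocks]


blocks = [
    TextBlock(300, 50, "right column"),
    TextBlock(50, 50, "left column"),
    TextBlock(50, 10, "title"),
]
print(order_blocks(blocks))                         # stored order is kept
print(order_blocks(blocks, sort_by_position=True))  # ['title', 'left column', 'right column']
```

Positional sorting can help when the stored order is scrambled, but it may interleave lines from multi-column layouts, which is one reason the stored reading order is the default.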

PDFToTextConverter.convert

def convert(
        file_path: Path,
        meta: Optional[Dict[str, Any]] = None,
        remove_numeric_tables: Optional[bool] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = None,
        id_hash_keys: Optional[List[str]] = None,
        start_page: Optional[int] = None,
        end_page: Optional[int] = None,
        keep_physical_layout: Optional[bool] = None,
        sort_by_position: Optional[bool] = None,
        ocr: Optional[Literal["auto", "full"]] = None,
        ocr_language: Optional[str] = None,
        multiprocessing: Optional[Union[bool, int]] = None) -> List[Document]

Extract text from a PDF file and convert it to a Document.

Arguments:

file_path: Path to the .pdf file you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: This parameter is being deprecated. It will be automatically detected by PyMuPDF.
keep_physical_layout: This parameter is being deprecated.
sort_by_position: Specifies whether to sort the extracted text by positional coordinates or logical reading order. If set to True, the text is sorted first by vertical position, and then by horizontal position. If set to False (default), the logical reading order in the PDF is used.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
start_page: The page number where to start the conversion
end_page: The page number where to end the conversion.
ocr: Specifies whether to use OCR to extract text from images in the PDF. If set to "auto", OCR is used only to extract text from images and integrate into the existing text. If set to "full", OCR is used to extract text from the entire PDF. To use this feature you must install Tesseract-OCR. For more information, see https://github.com/tesseract-ocr/tesseract#installing-tesseract.
ocr_language: Specifies the language to use for OCR. The default language is English, whose language code is eng. For a list of supported languages and their codes, see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. You can combine multiple languages by passing a string with the language codes separated by +. For example, to use English and German, pass eng+deu.
multiprocessing: Multiprocessing is used to speed up PyMuPDF conversion. You can disable it by setting this parameter to False. If set to None (the default value), the value defined in the class initialization is used. If set to True, the total number of cores is used. To specify the number of cores to use, set it to an integer.
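
The three-way meaning of multiprocessing in convert (None, a boolean, or an integer) can be resolved as in the following sketch. resolve_workers and init_value are hypothetical names, with init_value standing in for whatever was passed to the class initialization. Clamping to at least one worker is an assumption of this sketch.

```python
import os
from typing import Optional, Union


def resolve_workers(multiprocessing: Optional[Union[bool, int]],
                    init_value: Union[bool, int] = True) -> int:
    """Translate the multiprocessing setting into a number of worker processes."""
    # None means "use the value given at initialization".
    value = init_value if multiprocessing is None else multiprocessing
    if value is True:
        return os.cpu_count() or 1  # all available cores
    if value is False:
        return 1  # multiprocessing disabled
    return max(1, int(value))  # explicit number of cores


print(resolve_workers(False))               # 1
print(resolve_workers(None, init_value=2))  # 2
```

The identity checks against True and False come before the integer branch because bool is a subclass of int in Python, so True would otherwise be read as one core.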

PDFToTextConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate if the language of the text is one of valid languages.

PDFToTextConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, we automatically find and replace the most common ligatures with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
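
The kind of heuristic that remove_numeric_tables refers to could look like the sketch below: a line is dropped when most of its words contain digits, while lines that are mostly text are kept. The exact rule used by the converters is not given in this reference, so the function name and the 0.4 threshold are assumptions chosen for illustration.

```python
def remove_numeric_rows(text: str, max_numeric_ratio: float = 0.4) -> str:
    """Drop lines in which the share of digit-bearing words exceeds the threshold."""
    kept = []
    for line in text.splitlines():
        words = line.split()
        numeric = [word for word in words if any(ch.isdigit() for ch in word)]
        if words and len(numeric) / len(words) > max_numeric_ratio:
            continue  # mostly numbers: treated as a numeric table row
        kept.append(line)
    return "\n".join(kept)


sample = "Revenue by region\n2019 120 340 560\nEurope grew strongly in 2020"
print(remove_numeric_rows(sample))
# Revenue by region
# Europe grew strongly in 2020
```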

Module pptx

PptxConverter

class PptxConverter(BaseConverter)

PptxConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Extract text from a .pptx file.

Note: As pptx doesn't contain "page" information, we actually extract and return a list of texts from each slide here. For compliance with other converters, we nevertheless opted for keeping the method's name.

Arguments:

file_path: Path to the .pptx file you want to convert
meta: dictionary of meta data key-value pairs to append in the returned document.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Not applicable
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.

PptxConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate if the language of the text is one of valid languages.

PptxConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, we automatically find and replace the most common ligatures with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
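
The behaviour described for raise_on_failure amounts to a try/except around each file. The sketch below shows that pattern with a hypothetical convert_all helper. The convert argument stands in for a converter's convert method, and plain strings are used in place of Document objects to keep the example free of dependencies.

```python
import logging
from pathlib import Path
from typing import Callable, List

logger = logging.getLogger(__name__)


def convert_all(file_paths: List[Path],
                convert: Callable[[Path], List[str]],
                raise_on_failure: bool = True) -> List[str]:
    """Convert every file, either failing fast or skipping broken files."""
    documents: List[str] = []
    for path in file_paths:
        try:
            documents.extend(convert(path))
        except Exception as error:
            if raise_on_failure:
                raise  # propagate the first failure to the caller
            logger.warning("Skipping %s: %s", path, error)
    return documents
```

Setting raise_on_failure to False is convenient for bulk ingestion, where one corrupt file should not abort the whole batch. The logged warning keeps the skipped files traceable.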

Module tika

TikaConverter

class TikaConverter(BaseConverter)

TikaConverter.init

def __init__(tika_url: str = "http://localhost:9998/tika",
             remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             timeout: Union[float, Tuple[float, float]] = 10.0)

Arguments:

tika_url: URL of the Tika server
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
timeout: How many seconds to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple. Defaults to 10 seconds.

TikaConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Arguments:

file_path: path of the file to convert
meta: dictionary of meta data key-value pairs to append in the returned document.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Not applicable
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.

Returns:

A list of pages and the extracted meta data of the file.

TikaConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate if the language of the text is one of valid languages.

TikaConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path to the files you want to convert
meta: Optional dictionary with metadata that shall be attached to all resulting documents. Can be any custom keys and values.
remove_numeric_tables: This option uses heuristics to remove numeric rows from the tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also have long strings that could be possible candidates for searching answers. The rows containing strings are thus retained in this option.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures, however, make text hard to compare with the content of other files, which are generally ligature-free. Therefore, we automatically find and replace the most common ligatures with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, then it is likely an encoding error resulting in garbled text.
encoding: Select the file encoding (default is UTF-8)
id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
raise_on_failure: If true, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
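The known_ligatures replacement can be pictured as a plain string substitution. The following is a minimal sketch, not the library's implementation: EXAMPLE_LIGATURES and replace_ligatures are hypothetical names, and the mapping is a small illustrative subset rather than the actual contents of KNOWN_LIGATURES.

```python
from typing import Dict, Optional

# Hypothetical, illustrative subset. The real default mapping lives in
# haystack.nodes.file_converter.base.KNOWN_LIGATURES.
EXAMPLE_LIGATURES: Dict[str, str] = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
}


def replace_ligatures(text: str,
                      known_ligatures: Optional[Dict[str, str]] = None) -> str:
    """Replace each ligature character with its split counterpart."""
    mapping = EXAMPLE_LIGATURES if known_ligatures is None else known_ligatures
    for ligature, letters in mapping.items():
        text = text.replace(ligature, letters)
    return text


print(replace_ligatures("e\ufb00ective work\ufb02ow"))  # prints "effective workflow"
```

Passing your own dictionary as the second argument mirrors how a custom known_ligatures mapping would take the place of the default.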

Module txt

TextConverter

class TextConverter(BaseConverter)

TextConverter.convert

def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "utf-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Reads text from a txt file and executes optional preprocessing steps.

Arguments:

file_path: Path of the file to convert.
meta: Dictionary of metadata key-value pairs to attach to the returned document.
remove_numeric_tables: Uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for a reader model that has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained.
valid_languages: Validates languages against a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: The file encoding (default is utf-8).
id_hash_keys: Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore when texts are not unique, you can modify the metadata and pass, for example, "meta" to this field (e.g. ["content", "meta"]). In this case, the ID is generated from the content and the defined metadata.
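To illustrate how id_hash_keys affects document IDs, here is a minimal sketch. It assumes the ID is a hash over the selected attributes; make_document_id is a hypothetical helper, and SHA-256 is a stand-in chosen for illustration, so the resulting IDs will not match those Haystack produces.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def make_document_id(content: str,
                     meta: Dict[str, Any],
                     id_hash_keys: Optional[List[str]] = None) -> str:
    """Hash the attributes named in id_hash_keys into a hex ID (illustrative only)."""
    keys = id_hash_keys or ["content"]
    # This sketch only knows the two attributes used in the example above.
    attributes = {"content": content, "meta": meta}
    # Serialize deterministically so equal inputs always give equal IDs.
    payload = json.dumps([attributes[key] for key in keys], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


text = "Same text in two files."

# Only the content is hashed, so both files collapse to one ID.
print(make_document_id(text, {"name": "a.txt"}) ==
      make_document_id(text, {"name": "b.txt"}))  # prints True

# With "meta" included, differing metadata yields differing IDs.
print(make_document_id(text, {"name": "a.txt"}, ["content", "meta"]) ==
      make_document_id(text, {"name": "b.txt"}, ["content", "meta"]))  # prints False
```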

TextConverter.init

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True)

Arguments:

remove_numeric_tables: Uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for a reader model that has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained.
valid_languages: Validates languages against a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
id_hash_keys: Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore when texts are not unique, you can modify the metadata and pass, for example, "meta" to this field (e.g. ["content", "meta"]). In this case, the ID is generated from the content and the defined metadata.
progress_bar: Whether to show a progress bar for the conversion.
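The remove_numeric_tables heuristic can be sketched as a line filter. The version below is an assumption for illustration only: the helper name drop_numeric_rows, the 40% threshold, and the rule that lines ending in a period are kept are illustrative choices, not confirmed details of the converter.

```python
from typing import List


def drop_numeric_rows(text: str, max_numeric_ratio: float = 0.4) -> str:
    """Drop lines that consist mostly of numeric tokens (illustrative heuristic)."""
    kept: List[str] = []
    for line in text.split("\n"):
        words = line.split()
        # A token counts as numeric if it contains at least one digit.
        numeric = [word for word in words if any(ch.isdigit() for ch in word)]
        mostly_numeric = bool(words) and len(numeric) / len(words) > max_numeric_ratio
        # Sentences that happen to contain numbers usually end with a period; keep them.
        if mostly_numeric and not line.strip().endswith("."):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "Region Revenue Growth\n2019 1200 3.4\nNorth region performed well in 2019."
print(drop_numeric_rows(sample))
# prints:
# Region Revenue Growth
# North region performed well in 2019.
```

The header row and the sentence survive because they are mostly made of strings, which matches the behavior described for rows containing strings.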

TextConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Checks whether the language of the text is one of the valid languages.
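The check can be sketched as follows. This is a minimal illustration under stated assumptions: is_valid_language is a hypothetical function, the language detector is passed in as a callable (detect_language, also hypothetical), and toy_detector is a deliberately naive stand-in rather than a real language identification method.

```python
from typing import Callable, List, Optional


def is_valid_language(text: str,
                      valid_languages: Optional[List[str]],
                      detect_language: Callable[[str], Optional[str]]) -> bool:
    """Return True if no restriction is set or the detected language is allowed."""
    if not valid_languages:
        return True  # nothing to validate against
    detected = detect_language(text)  # ISO 639-1 code, or None if detection failed
    return detected in valid_languages


def toy_detector(text: str) -> Optional[str]:
    """Naive stand-in: guess from a few common stop words (illustration only)."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"der", "und", "ist"}:
        return "de"
    return None


print(is_valid_language("The report is ready", ["en"], toy_detector))     # prints True
print(is_valid_language("Der Bericht ist fertig", ["en"], toy_detector))  # prints False
# Garbled text yields no detected language, so it fails validation.
print(is_valid_language("\ufffd\ufffd \ufffd\ufffd\ufffd", ["en", "de"], toy_detector))  # prints False
```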

TextConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from a file.

Arguments:

file_paths: Path or list of paths to the files you want to convert.
meta: Optional dictionary with metadata to attach to all resulting documents. It can contain any custom keys and values.
remove_numeric_tables: Uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for a reader model that has no table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained.
known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). Such ligatures make text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages: Validates languages against a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors: if the extracted text is not in one of the valid languages, an encoding error has likely produced garbled text.
encoding: The file encoding (default is UTF-8).
id_hash_keys: Generates the document ID from a custom list of strings that refer to the document's attributes. To ensure you don't have duplicate documents in your DocumentStore when texts are not unique, you can modify the metadata and pass, for example, "meta" to this field (e.g. ["content", "meta"]). In this case, the ID is generated from the content and the defined metadata.
raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
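How run treats a single path versus a list, shared versus per-file metadata, and raise_on_failure can be sketched as below. This is an illustration under assumptions, not the method's actual code: run_conversion and fake_convert are hypothetical names, documents are represented as plain strings, and the remaining arguments are left out.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union

Meta = Optional[Dict[str, str]]


def run_conversion(file_paths: Union[Path, List[Path]],
                   convert: Callable[[Path, Meta], List[str]],
                   meta: Union[Meta, List[Meta]] = None,
                   raise_on_failure: bool = True) -> List[str]:
    """Convert every file; either propagate or skip per-file failures."""
    paths = [file_paths] if isinstance(file_paths, Path) else list(file_paths)
    # A single dict (or None) is attached to every file; a list is matched per file.
    metas = meta if isinstance(meta, list) else [meta] * len(paths)
    documents: List[str] = []
    for path, file_meta in zip(paths, metas):
        try:
            documents.extend(convert(path, file_meta))
        except Exception:
            if raise_on_failure:
                raise
            # Otherwise skip this file and continue with the rest.
    return documents


def fake_convert(path: Path, meta: Meta) -> List[str]:
    """Stand-in converter that fails on anything but .txt files."""
    if path.suffix != ".txt":
        raise ValueError(f"cannot convert {path}")
    return [f"{path.name}:{(meta or {}).get('source', 'n/a')}"]


files = [Path("a.txt"), Path("broken.bin"), Path("b.txt")]
print(run_conversion(files, fake_convert, meta={"source": "demo"}, raise_on_failure=False))
# prints ['a.txt:demo', 'b.txt:demo']
```

With raise_on_failure left at True, the same call would stop at broken.bin and raise the ValueError instead of skipping it.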
