API Reference

Abstract class for implementing file converters.

Module base

BaseConverter

class BaseConverter(BaseComponent)

Base class for implementing file converters that transform input documents into a text format for ingestion into a DocumentStore.

BaseConverter.__init__

def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True)

Arguments:

  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers. Rows containing strings are therefore retained with this option.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, it is likely the result of an encoding error that produced garbled text.
  • id_hash_keys: Generate the document ID from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but your texts are not unique, you can modify the metadata and pass, for example, "meta" in this field (e.g. ["content", "meta"]). In this case, the ID is generated from the content and the defined metadata.
  • progress_bar: Show a progress bar for the conversion.
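
Since BaseConverter is abstract, these arguments are typically passed to one of its concrete subclasses. A minimal sketch, assuming the TextConverter subclass shipped with Haystack:

from haystack.nodes import TextConverter

# Instantiate a concrete converter with the BaseConverter arguments described above.
converter = TextConverter(
    remove_numeric_tables=True,        # drop table rows that are mostly numeric
    valid_languages=["en"],            # ISO 639-1 codes used to flag encoding errors
    id_hash_keys=["content", "meta"],  # derive document IDs from content plus metadata
    progress_bar=False,
)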

BaseConverter.convert

@abstractmethod
def convert(file_path: Path,
            meta: Optional[Dict[str, Any]],
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "UTF-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Convert a file to a list of Documents containing the extracted text and any associated metadata.

File converters may extract file metadata such as the name or size. In addition, user-supplied metadata such as author, URL, or external IDs can be supplied as a dictionary.

Arguments:

  • file_path: Path of the file to convert.
  • meta: Dictionary of metadata key-value pairs to attach to the returned documents.
  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers. Rows containing strings are therefore retained with this option.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, it is likely the result of an encoding error that produced garbled text.
  • encoding: Select the file encoding (default is UTF-8).
  • id_hash_keys: Generate the document ID from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but your texts are not unique, you can modify the metadata and pass, for example, "meta" in this field (e.g. ["content", "meta"]). In this case, the ID is generated from the content and the defined metadata.
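
Custom converters implement this method in a subclass. A minimal sketch, assuming a plain-text input format and the Document class from haystack.schema:

from pathlib import Path
from typing import Any, Dict, List, Optional

from haystack.nodes.file_converter.base import BaseConverter
from haystack.schema import Document

class PlainTextConverter(BaseConverter):
    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, Any]] = None,
        remove_numeric_tables: Optional[bool] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
    ) -> List[Document]:
        # Read the raw text and merge file metadata with any user-supplied metadata.
        text = Path(file_path).read_text(encoding=encoding or "UTF-8")
        doc_meta = {"name": Path(file_path).name, **(meta or {})}
        return [Document(content=text, meta=doc_meta, id_hash_keys=id_hash_keys)]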

BaseConverter.validate_language

def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool

Validate whether the language of the text is one of the valid languages.
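
A quick usage sketch, assuming the converter instance from the earlier example (language detection may require the langdetect package to be installed):

text = "This is plain English text extracted from a file."
is_valid = converter.validate_language(text, valid_languages=["en"])
# is_valid is True if the detected language is among valid_languages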

BaseConverter.run

def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Optional[Dict[str, str]] = None,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        raise_on_failure: bool = True)

Extract text from one or more files.

Arguments:

  • file_paths: Path (or list of paths) to the files you want to convert.
  • meta: Optional dictionary with metadata to attach to all resulting documents. It can contain any custom keys and values. A list of dictionaries, one per file, can be passed instead.
  • remove_numeric_tables: This option uses heuristics to remove numeric rows from tables. The tabular structures in documents might be noise for the reader model if it does not have table parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers. Rows containing strings are therefore retained with this option.
  • known_ligatures: Some converters tend to recognize clusters of letters as ligatures, such as "ff" (double f). Such ligatures, however, make the text hard to compare with the content of other files, which are generally ligature-free. Therefore, the most common ligatures are automatically found and replaced with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
  • valid_languages: Validate languages from a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used to add a test for encoding errors. If the extracted text is not in one of the valid languages, it is likely the result of an encoding error that produced garbled text.
  • encoding: Select the file encoding (default is UTF-8).
  • id_hash_keys: Generate the document ID from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but your texts are not unique, you can modify the metadata and pass, for example, "meta" in this field (e.g. ["content", "meta"]). In this case, the ID is generated from the content and the defined metadata.
  • raise_on_failure: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
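
A hedged sketch of calling run() directly; within a Pipeline, this method is invoked for you. The file path and metadata below are illustrative only:

from pathlib import Path

output, _ = converter.run(
    file_paths=[Path("my_file.txt")],
    meta={"source": "local-upload"},
    valid_languages=["en"],
    raise_on_failure=False,   # skip files that fail to convert instead of raising
)
documents = output["documents"]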