Abstract class for implementing file converters.
Module base
BaseConverter
class BaseConverter(BaseComponent)
Base class for implementing file converters that transform input documents into text format for ingestion into a DocumentStore.
BaseConverter.__init__
def __init__(remove_numeric_tables: bool = False,
valid_languages: Optional[List[str]] = None,
id_hash_keys: Optional[List[str]] = None,
progress_bar: bool = True)
Arguments:
remove_numeric_tables
: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it does not have table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
valid_languages
: Validate the language of the extracted text against a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used as a check for encoding errors: if the extracted text is not in one of the valid languages, it is likely the result of an encoding error that produced garbled text.
id_hash_keys
: Generate the document ID from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but the texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the ID is generated from the content and the defined metadata.
progress_bar
: Show a progress bar for the conversion.
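For illustration, a minimal usage sketch of these constructor options. It assumes TextConverter, a concrete BaseConverter subclass importable from haystack.nodes; the parameter values are hypothetical examples only.
from haystack.nodes import TextConverter

converter = TextConverter(
    remove_numeric_tables=True,        # drop rows of purely numeric table cells
    valid_languages=["en", "de"],      # sanity-check extracted text for encoding errors
    id_hash_keys=["content", "meta"],  # derive document IDs from content plus metadata
    progress_bar=False,
)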
BaseConverter.convert
@abstractmethod
def convert(file_path: Path,
meta: Optional[Dict[str, Any]],
remove_numeric_tables: Optional[bool] = None,
valid_languages: Optional[List[str]] = None,
encoding: Optional[str] = "UTF-8",
id_hash_keys: Optional[List[str]] = None) -> List[Document]
Convert a file into Documents containing the extracted text and any associated metadata.
File converters may extract file metadata such as name or size. In addition, user-supplied metadata such as author, URL, or external IDs can be passed as a dictionary.
Arguments:
file_path
: Path of the file to convert.
meta
: Dictionary of metadata key-value pairs to append to the returned document.
remove_numeric_tables
: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it does not have table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
valid_languages
: Validate the language of the extracted text against a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used as a check for encoding errors: if the extracted text is not in one of the valid languages, it is likely the result of an encoding error that produced garbled text.
encoding
: Select the file encoding (default is UTF-8).
id_hash_keys
: Generate the document ID from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but the texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the ID is generated from the content and the defined metadata.
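Since convert is abstract, each converter subclass supplies its own implementation. The following is a hedged sketch of a hypothetical subclass that wraps the raw text of a file in a single Document; it assumes Document is importable from haystack.schema and BaseConverter from haystack.nodes.file_converter.base (the latter path is also referenced under known_ligatures below).
from pathlib import Path
from typing import Any, Dict, List, Optional

from haystack.nodes.file_converter.base import BaseConverter
from haystack.schema import Document


class PlainTextConverter(BaseConverter):
    # Hypothetical example: read a plain-text file and return one Document.
    def convert(self,
                file_path: Path,
                meta: Optional[Dict[str, Any]] = None,
                remove_numeric_tables: Optional[bool] = None,
                valid_languages: Optional[List[str]] = None,
                encoding: Optional[str] = "UTF-8",
                id_hash_keys: Optional[List[str]] = None) -> List[Document]:
        text = Path(file_path).read_text(encoding=encoding or "UTF-8")
        return [Document(content=text, meta=meta or {}, id_hash_keys=id_hash_keys)]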
BaseConverter.validate_language
def validate_language(text: str,
valid_languages: Optional[List[str]] = None) -> bool
Validate whether the language of the text is one of the valid languages.
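A hedged usage sketch: although the signature above omits self, validate_language is assumed to be called on a converter instance, for example the converter from the earlier sketch.
is_english = converter.validate_language(
    text="The quick brown fox jumps over the lazy dog.",
    valid_languages=["en"],
)
# Expected to be True when the detected language is among valid_languages.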
BaseConverter.run
def run(file_paths: Union[Path, List[Path]],
meta: Optional[Union[Dict[str, str],
List[Optional[Dict[str, str]]]]] = None,
remove_numeric_tables: Optional[bool] = None,
known_ligatures: Optional[Dict[str, str]] = None,
valid_languages: Optional[List[str]] = None,
encoding: Optional[str] = "UTF-8",
id_hash_keys: Optional[List[str]] = None,
raise_on_failure: bool = True)
Extract text from a file.
Arguments:
file_paths
: Paths of the files you want to convert.
meta
: Optional dictionary with metadata to attach to all resulting documents. It can contain any custom keys and values.
remove_numeric_tables
: This option uses heuristics to remove numeric rows from tables. Tabular structures in documents can be noise for the reader model if it does not have table-parsing capability for finding answers. However, tables may also contain long strings that are possible candidates for answers, so rows containing strings are retained when this option is enabled.
known_ligatures
: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f). However, such ligatures make the text hard to compare with the content of other files, which are generally ligature-free. Therefore we automatically find and replace the most common ligatures with their split counterparts. The default mapping is in haystack.nodes.file_converter.base.KNOWN_LIGATURES: it is rather biased towards Latin alphabets but excludes all ligatures that are known to be used in IPA. If no value is provided, this default is created and used. You can use this parameter to provide your own set of ligatures to clean up from the documents.
valid_languages
: Validate the language of the extracted text against a list of languages specified in the ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) format. This option can be used as a check for encoding errors: if the extracted text is not in one of the valid languages, it is likely the result of an encoding error that produced garbled text.
encoding
: Select the file encoding (default is UTF-8).
id_hash_keys
: Generate the document ID from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but the texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the ID is generated from the content and the defined metadata.
raise_on_failure
: If True, raises an exception if the conversion of a single file fails. If False, skips the file without failing.
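A hedged sketch of calling run directly, continuing from the converter built in the earlier sketch. As a pipeline node, run is assumed to return a tuple of an output dictionary (containing the resulting documents) and the name of the output edge; the file paths and metadata below are hypothetical.
from pathlib import Path

output, _edge = converter.run(
    file_paths=[Path("docs/report_2021.txt"), Path("docs/report_2022.txt")],
    meta={"source": "annual-reports"},
    valid_languages=["en"],
    raise_on_failure=False,  # skip files whose conversion fails instead of raising
)
documents = output["documents"]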