Distinguishes between text, PDF, Markdown, Docx and HTML files and routes them to the appropriate File Converter in an indexing pipeline.
Module file_type
FileTypeClassifier
class FileTypeClassifier(BaseComponent)
Route files in an Indexing Pipeline to corresponding file converters.
FileTypeClassifier.__init__
def __init__(supported_types: Optional[List[str]] = None,
             full_analysis: bool = False,
             raise_on_error: bool = True)
Node that sends out files on a different output edge depending on their extension.
Arguments:
supported_types
: The file types this node distinguishes. Optional. If you don't provide a value, the default is: txt, pdf, md, docx, and html. You can't use lists with duplicate elements.
full_analysis
: If True, the whole file is analyzed to determine the file type. If False, only the first 2049 bytes are analyzed.
raise_on_error
: If True, the node will raise an exception if the file type is not supported.
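The routing idea described by these arguments can be sketched in plain Python. This is a standalone illustration, not the library's implementation: the helper name `edge_for_file`, the 1-based edge numbering that follows the order of `supported_types`, the case-insensitive extension match, and returning `None` when `raise_on_error` is False are all assumptions made for the example.

```python
from pathlib import Path
from typing import List, Optional, Union

# Default types as listed in the argument description above.
DEFAULT_TYPES = ["txt", "pdf", "md", "docx", "html"]


def edge_for_file(
    path: Union[str, Path],
    supported_types: Optional[List[str]] = None,
    raise_on_error: bool = True,
) -> Optional[int]:
    """Hypothetical helper: return a 1-based output edge for a file extension."""
    types = DEFAULT_TYPES if supported_types is None else supported_types

    # The docs state that lists with duplicate elements are not allowed.
    if len(set(types)) != len(types):
        raise ValueError("supported_types must not contain duplicates")

    # Assumption: compare extensions without the leading dot, ignoring case.
    extension = Path(path).suffix.lstrip(".").lower()

    if extension not in types:
        if raise_on_error:
            raise ValueError(f"Unsupported file type: {extension!r}")
        # Assumption: signal "no edge" with None when errors are suppressed.
        return None

    # Assumption: edge N corresponds to the N-th entry of supported_types.
    return types.index(extension) + 1
```

Under these assumptions, `edge_for_file("report.pdf")` returns 2 with the default types, and passing `supported_types=["html", "pdf"]` would route an HTML file to edge 1 instead.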
FileTypeClassifier.run
def run(file_paths: Union[Path, List[Path], str, List[str], List[Union[Path, str]]])
Sends out files on a different output edge depending on their extension.
Arguments:
file_paths
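: The paths of the files to route to different output edges. Accepts a single path or a list of paths, given as strings or Path objects.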
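Because the signature accepts a single path or a list, and either strings or Path objects, a caller-side view of that flexibility can be sketched as a small normalization step. The helper `normalize_paths` is hypothetical and only illustrates how such a union type can be reduced to one shape; it does not claim to mirror what the node does internally.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[Path, str]


def normalize_paths(file_paths: Union[PathLike, List[PathLike]]) -> List[Path]:
    """Hypothetical helper: coerce any accepted input shape to a list of Path objects."""
    # A single str or Path is wrapped in a list first.
    if isinstance(file_paths, (str, Path)):
        file_paths = [file_paths]
    # Mixed lists of str and Path are converted element by element.
    return [Path(p) for p in file_paths]
```

With this sketch, `normalize_paths("a.txt")` and `normalize_paths([Path("a.md"), "b.md"])` both yield a list of `Path` objects that can then be inspected by extension.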