Data Handling

Component	Available Classes	Description
Crawler	Crawler	Scrapes text from websites. Example usage: Collecting your website content so that you can run searches on it.
DocumentClassifier	TransformersDocumentClassifier	Classifies documents and attaches the predicted label to them as metadata. Example usage: Labeling documents by a characteristic (for example, sentiment).
DocumentLanguageClassifier	LangdetectDocumentLanguageClassifier TransformersDocumentLanguageClassifier	Detects the language of the Documents you pass to it and adds it to the document metadata.
EntityExtractor	EntityExtractor	Extracts predefined entities out of a piece of text. Example usage: Named entity recognition (NER).
FileClassifier	FileTypeClassifier	Distinguishes between text, PDF, Markdown, DOCX, and HTML files. Example usage: Routing files to appropriate converters (for example, it routes PDF files to PDFToTextConverter).
FileConverter	AzureConverter CSVTextConverter DocxToTextConverter ImageToTextConverter MarkdownConverter PDFToTextConverter ParsrConverter TikaConverter TextConverter	Converts files in different formats into Documents. Example usage: In indexing pipelines, extracting text from a file and casting it into the Document format.
PreProcessor	PreProcessor	Cleans and splits documents. Example usage: Normalizing white spaces, getting rid of headers and footers, splitting documents into smaller ones.
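
To make the PreProcessor row more concrete, here is a minimal, library-free sketch of the two operations it describes: normalizing white space and splitting a document into smaller ones. It is not the PreProcessor implementation. The function names `clean_text` and `split_by_words` and their parameters are hypothetical and chosen only for illustration.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Collapse runs of spaces and tabs, and drop empty lines (hypothetical helper)."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def split_by_words(text: str, split_length: int = 5, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words (hypothetical helper).
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must satisfy 0 <= split_overlap < split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


raw = "Haystack   builds \t search pipelines.\n\n\nIt cleans   and splits documents."
cleaned = clean_text(raw)
chunks = split_by_words(cleaned, split_length=4, split_overlap=1)
print(chunks)
```

The overlap keeps a little shared context between neighboring chunks, which can help when a relevant sentence straddles a split boundary.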