Data Handling
The components in this group operate on your data: they crawl websites, convert and preprocess files, and classify documents. Use this section to find out which components are available for data tasks.
Component | Available Classes | Description |
---|---|---|
Crawler | Crawler | Scrapes text from websites. Example usage: Running searches on your website content. |
DocumentClassifier | TransformersDocumentClassifier | Classifies documents and attaches the predicted label to them as metadata. Example usage: Labeling documents by a characteristic (for example, sentiment). |
DocumentLanguageClassifier | LangdetectDocumentLanguageClassifier, TransformersDocumentLanguageClassifier | Detects the language of the Documents you pass to it and adds it to the Document metadata. |
EntityExtractor | EntityExtractor | Extracts predefined entities from a piece of text. Example usage: Named entity recognition (NER). |
FileClassifier | FileTypeClassifier | Distinguishes between text, PDF, Markdown, Docx, and HTML files. Example usage: Routing files to appropriate converters (for example, it routes PDF files to PDFToTextConverter). |
FileConverter | AzureConverter, CSVTextConverter, DocxToTextConverter, ImageToTextConverter, MarkdownConverter, PDFToTextConverter, ParsrConverter, TikaConverter, TextConverter | Converts files of different formats into Documents. Example usage: In indexing pipelines, extracting text from a file and casting it into the Document format. |
PreProcessor | PreProcessor | Cleans and splits documents. Example usage: Normalizing white spaces, getting rid of headers and footers, splitting documents into smaller ones. |
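
To make the division of labor in the table more concrete, here is a minimal plain-Python sketch of two of the steps it describes: telling file types apart (the job of a FileClassifier) and cleaning and splitting text (the job of a PreProcessor). It deliberately does not use the library's classes. The names `FILE_TYPES`, `classify_file`, and `clean_and_split`, as well as the extension mapping and the word-based splitting, are assumptions made for illustration only.

```python
from pathlib import Path
from typing import List

# Hypothetical mapping from file extension to file type (illustration only,
# not the library's own logic).
FILE_TYPES = {
    ".txt": "text",
    ".pdf": "pdf",
    ".md": "markdown",
    ".docx": "docx",
    ".html": "html",
}


def classify_file(path: str) -> str:
    """Return the file type for a path, judged by its extension alone."""
    suffix = Path(path).suffix.lower()
    if suffix not in FILE_TYPES:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return FILE_TYPES[suffix]


def clean_and_split(text: str, split_length: int = 5) -> List[str]:
    """Normalize whitespace, then split into chunks of at most split_length words."""
    # str.split() with no arguments collapses any run of whitespace
    # and drops leading and trailing whitespace.
    words = text.split()
    return [
        " ".join(words[i : i + split_length])
        for i in range(0, len(words), split_length)
    ]
```

In an indexing flow, the result of `classify_file` would decide which converter handles a file, and the converted text would then go through something like `clean_and_split` before being stored. The actual components offer more options than this sketch covers, such as the header and footer removal mentioned in the table.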