
Data Handling

The components in this group operate on your data: they crawl websites, convert and preprocess files, and classify documents. Use this section to find out which components are available for data tasks.

| Component | Available Classes | Description |
| --- | --- | --- |
| Crawler | Crawler | Scrapes text from websites. Example usage: running searches on your website content. |
| DocumentClassifier | TransformersDocumentClassifier | Classifies documents by attaching metadata to them. Example usage: labeling documents by a characteristic (for example, sentiment). |
| DocumentLanguageClassifier | LangdetectDocumentLanguageClassifier, TransformersDocumentLanguageClassifier | Detects the language of the Documents you pass to it and adds it to the document metadata. |
| EntityExtractor | EntityExtractor | Extracts predefined entities from a piece of text. Example usage: named entity recognition (NER). |
| FileClassifier | FileTypeClassifier | Distinguishes between text, PDF, Markdown, Docx, and HTML files. Example usage: routing files to the appropriate converters (for example, PDF files to PDFToTextConverter). |
| FileConverter | AzureConverter, CSVTextConverter, DocxToTextConverter, ImageToTextConverter, MarkdownConverter, PDFToTextConverter, ParsrConverter, TikaConverter, TextConverter | Converts files in different formats into Documents. Example usage: in indexing pipelines, extracting text from a file and casting it into the Document format. |
| PreProcessor | PreProcessor | Cleans and splits documents. Example usage: normalizing white space, removing headers and footers, splitting documents into smaller ones. |
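
To make the roles of FileClassifier and PreProcessor from the table above more concrete, here is a minimal, library-free sketch of the ideas behind them: picking a converter by file type, normalizing white space, and splitting text into smaller pieces. It does not call the actual component classes. The function names (`pick_converter`, `clean_whitespace`, `split_by_words`), the `split_length` parameter, and the extension-to-converter mapping are assumptions made for illustration only; the real components may behave differently.

```python
import re
from pathlib import Path
from typing import List, Optional

# Hypothetical lookup table: file extension -> converter class name taken from
# the table above. HTML is left out because the table names no dedicated
# HTML converter.
CONVERTER_BY_EXTENSION = {
    ".txt": "TextConverter",
    ".pdf": "PDFToTextConverter",
    ".md": "MarkdownConverter",
    ".docx": "DocxToTextConverter",
}


def pick_converter(file_path: str) -> Optional[str]:
    """Return the converter name for a file, or None if the extension is unknown."""
    return CONVERTER_BY_EXTENSION.get(Path(file_path).suffix.lower())


def clean_whitespace(text: str) -> str:
    """Collapse every run of white space into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_by_words(text: str, split_length: int) -> List[str]:
    """Split text into consecutive chunks of at most split_length words."""
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]
```

In an indexing flow, the routing step would run first to choose a converter, and the cleaning and splitting steps would then be applied to the text that the converter extracts.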