HTMLToDocument
A component that converts HTML files to documents.
| Name | HTMLToDocument |
| Folder path | /converters/ |
| Most common position in a pipeline | Before PreProcessors, or right at the beginning of an indexing pipeline |
| Mandatory input variables | "sources": A list of HTML file paths or ByteStream objects |
| Output variables | "documents": A list of documents |
Overview
The HTMLToDocument component takes a list of HTML file paths or ByteStream objects as input and converts them into a list of documents. Use it in an indexing pipeline to write the contents of HTML files into a Document Store, or in a query pipeline after the LinkContentFetcher. Optionally, you can attach metadata to the documents through the meta input parameter.
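As a rough illustration of the meta input, the sketch below builds one metadata dictionary per source and passes the list along with the sources. It assumes meta accepts a list aligned with sources; the helper names build_meta and convert_with_meta and the metadata keys are hypothetical, chosen only for this example.

```python
from pathlib import Path
from typing import Any, Dict, List


def build_meta(paths: List[Path]) -> List[Dict[str, Any]]:
    """Build one metadata dict per source file (the keys are illustrative)."""
    return [{"file_name": p.name, "file_type": "html"} for p in paths]


def convert_with_meta(paths: List[Path]):
    """Convert HTML files and attach the metadata built above."""
    # Imported lazily so build_meta can be used without Haystack installed.
    from haystack.components.converters import HTMLToDocument

    converter = HTMLToDocument()
    # Assumption: `meta` takes a list with one dict per entry in `sources`.
    result = converter.run(sources=paths, meta=build_meta(paths))
    return result["documents"]
```

Calling convert_with_meta([Path("saved_page.html")]) should then return documents whose meta field includes the keys set in build_meta.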
When you initialize the component, you can optionally set the extractor_type, which is the type of boilerpy3 extractor to use. It defaults to DefaultExtractor. For more information on extractors, refer to the boilerpy3 documentation.
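Choosing a non-default extractor could look like the sketch below. The tuple lists extractor class names from the boilerpy3 README; that HTMLToDocument accepts each of them as extractor_type is an assumption, and make_converter is a hypothetical helper, not part of Haystack.

```python
# Extractor class names from the boilerpy3 README. Whether every one of them
# is accepted by `extractor_type` is an assumption of this sketch.
BOILERPY3_EXTRACTORS = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "LargestContentExtractor",
    "CanolaExtractor",
    "KeepEverythingExtractor",
    "NumWordsRulesExtractor",
)


def make_converter(extractor_type: str = "DefaultExtractor"):
    """Create an HTMLToDocument that uses the named boilerpy3 extractor."""
    if extractor_type not in BOILERPY3_EXTRACTORS:
        raise ValueError(f"Unknown extractor type: {extractor_type!r}")
    # Imported lazily so the name check above works without Haystack installed.
    from haystack.components.converters import HTMLToDocument

    return HTMLToDocument(extractor_type=extractor_type)
```

For example, make_converter("ArticleExtractor") would select an extractor that boilerpy3 describes as tuned for news articles.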
Usage
On its own
from pathlib import Path
from haystack.components.converters import HTMLToDocument
converter = HTMLToDocument()
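# run() returns a dictionary; the converted documents are under the "documents" key.
result = converter.run(sources=[Path("saved_page.html")])
docs = result["documents"]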
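To try the snippet above without a real page at hand, one option is to write a small HTML file first. The sample markup and the helper name write_sample_page are made up for this sketch.

```python
from pathlib import Path

# A tiny, made-up page used only to exercise the converter.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Sample page</h1>
    <p>This paragraph is the text the converter should extract.</p>
  </body>
</html>
"""


def write_sample_page(path: Path = Path("saved_page.html")) -> Path:
    """Write the sample HTML to `path` and return the path."""
    path.write_text(SAMPLE_HTML, encoding="utf-8")
    return path
```

After calling write_sample_page(), the path saved_page.html used in the example exists and can be passed to the converter as a source.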
In a pipeline
Here's an example of an indexing pipeline that writes the contents of an HTML file into an InMemoryDocumentStore:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")
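# Paths of the HTML files to index; "saved_page.html" is a placeholder.
file_names = ["saved_page.html"]
pipeline.run({"converter": {"sources": file_names}})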
