LinkContentFetcher

LinkContentFetcher fetches the contents of one or more URLs so that you can use them as the data for your pipeline. It works in both indexing and query pipelines.


Name	LinkContentFetcher
Folder Path	/fetchers/
Position in a Pipeline	In indexing or query pipelines as the data fetching step.
Inputs	"urls": A list of URLs (strings)
Outputs	"streams": A list of ByteStream objects

Overview

LinkContentFetcher fetches the contents of the URLs you give it and returns a list of content streams. Each item in this list is a ByteStream object holding the content of one link that was fetched successfully. The metadata of each ByteStream contains its content type (in the content_type key) and its URL (in the url key).

For example, if you pass ten URLs to LinkContentFetcher and it manages to fetch six of them, then the output will be a list of six ByteStream objects, each containing information about its content type and URL.

Some sites may block LinkContentFetcher from getting their content. In that case, it logs the error and returns only the ByteStream objects that it fetched successfully.
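
Because failed URLs are only logged, the output alone does not say which links were skipped. The sketch below shows one possible way to recover that list. Note that find_unfetched is a hypothetical helper, not part of Haystack, and it assumes that the url value in each stream's metadata is the same string that was passed in. The stand-in objects only imitate the shape of a fetched stream so the snippet runs without network access.

```python
from types import SimpleNamespace


def find_unfetched(requested_urls, streams):
    """Return the requested URLs that have no matching stream.

    Hypothetical helper: it relies only on each stream exposing a
    ``meta`` dict with a "url" key, as described above.
    """
    fetched = {stream.meta.get("url") for stream in streams}
    return [url for url in requested_urls if url not in fetched]


# Stand-in objects that mimic fetched streams (not real ByteStream objects).
example_streams = [
    SimpleNamespace(
        data=b"<html>a</html>",
        meta={"url": "https://example.com/a", "content_type": "text/html"},
    ),
]

unfetched = find_unfetched(
    ["https://example.com/a", "https://example.com/b"],
    example_streams,
)
print(unfetched)
```

With a real fetcher, the second argument would be the list stored under the "streams" key of the run output.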

Often, to use this component in a pipeline, you must convert the returned list of ByteStream objects into a list of Document objects. To do so, you can use the HTMLToDocument component.

You can use LinkContentFetcher at the beginning of an indexing pipeline to index the contents of URLs into a Document Store. Alternatively, you can use it directly in a query pipeline, such as a retrieval-augmented generation (RAG) pipeline, to use the contents of a URL as the data source.

Usage

On its own

Below is an example where LinkContentFetcher fetches the contents of a URL. It initializes the component with the default settings. To change the default component settings, such as retry_attempts, check out the API reference docs.

from haystack.components.fetchers import LinkContentFetcher

fetcher = LinkContentFetcher()

fetcher.run(urls=["https://haystack.deepset.ai"])
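
The call returns a dictionary whose "streams" entry holds the fetched content. The sketch below shows one way to summarize such a result. Here summarize_streams is a hypothetical helper, and the stand-in object only imitates the data and meta attributes of a ByteStream so that the snippet runs without Haystack or network access.

```python
from types import SimpleNamespace


def summarize_streams(result):
    """Build one summary row per stream in a fetcher-style result dict."""
    rows = []
    for stream in result["streams"]:
        rows.append(
            {
                "url": stream.meta.get("url"),
                "content_type": stream.meta.get("content_type"),
                "size_bytes": len(stream.data),
            }
        )
    return rows


# Stand-in for the dictionary returned by fetcher.run(...); assumed shape only.
fake_result = {
    "streams": [
        SimpleNamespace(
            data=b"<html></html>",
            meta={"url": "https://haystack.deepset.ai", "content_type": "text/html"},
        )
    ]
}

summary = summarize_streams(fake_result)
print(summary)
```

With a real fetcher, you would pass the return value of fetcher.run(...) to the helper instead of fake_result.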

In a Pipeline

Below is an example of an indexing pipeline that uses the LinkContentFetcher to index the contents of the specified URLs into an InMemoryDocumentStore. Notice how it uses the HTMLToDocument component to convert the list of ByteStream objects to Document objects.

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
fetcher = LinkContentFetcher()
converter = HTMLToDocument()
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=converter, name="converter")
indexing_pipeline.add_component(instance=writer, name="writer")

indexing_pipeline.connect("fetcher.streams", "converter.sources")
indexing_pipeline.connect("converter.documents", "writer.documents")

indexing_pipeline.run(data={"fetcher": {"urls": ["https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2"]}})
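
When the URLs come from user input or a crawl, it can help to tidy the list before handing it to the pipeline. The following is a minimal sketch under that assumption, not a required step. The build_run_data helper is hypothetical, it assumes only http and https links are wanted, and its default component name "fetcher" simply matches the name used in the pipeline above.

```python
from urllib.parse import urlparse


def build_run_data(urls, component_name="fetcher"):
    """Return pipeline run data with cleaned, de-duplicated URLs.

    Hypothetical helper: keeps the first occurrence of each URL and
    drops entries that are not absolute http(s) links.
    """
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        # Skip anything that is not an absolute http or https URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        # Skip duplicates while preserving the original order.
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return {component_name: {"urls": cleaned}}


run_data = build_run_data(
    [
        " https://haystack.deepset.ai ",
        "https://haystack.deepset.ai",
        "ftp://example.com/file",
        "not a url",
    ]
)
print(run_data)
```

The resulting dictionary has the same shape as the data argument in the example above, so it could be passed as indexing_pipeline.run(data=run_data).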
