
LinkContentFetcher

With LinkContentFetcher, you can use the contents of several URLs as the data for your pipeline. You can use it in indexing and query pipelines to fetch the contents of the URLs you give it.

Most common position in a pipeline: In indexing or query pipelines as the data fetching step
Mandatory run variables: "urls": A list of URLs (strings)
Output variables: "streams": A list of ByteStream objects
API reference: Fetchers
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/fetchers/link_content.py

Overview

LinkContentFetcher fetches the contents of the URLs you give it and returns a list of content streams. Each item in this list is the content of one successfully fetched link in the form of a ByteStream object. Each ByteStream in the returned list carries metadata with its content type (in the content_type key) and its URL (in the url key).

For example, if you pass ten URLs to LinkContentFetcher and it manages to fetch six of them, then the output will be a list of six ByteStream objects, each containing information about its content type and URL.
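For instance, here is a minimal sketch that fetches a single URL and prints the metadata of each returned stream (the URL is just an example):

python
from haystack.components.fetchers import LinkContentFetcher

fetcher = LinkContentFetcher()
result = fetcher.run(urls=["https://haystack.deepset.ai"])

# Each ByteStream records its source URL and content type in its metadata.
for stream in result["streams"]:
    print(stream.meta["url"], stream.meta["content_type"])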

Some sites may block LinkContentFetcher from getting their content. In that case, the component logs the error and returns only the ByteStream objects it fetched successfully.

To use this component in a pipeline, you typically need to convert the returned list of ByteStream objects into a list of Document objects. You can do this with the HTMLToDocument component.
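As a minimal sketch of that conversion outside a pipeline (the URL is an example):

python
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher

fetcher = LinkContentFetcher()
converter = HTMLToDocument()

streams = fetcher.run(urls=["https://haystack.deepset.ai"])["streams"]
# HTMLToDocument accepts the fetched ByteStream objects directly as sources.
documents = converter.run(sources=streams)["documents"]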

You can use LinkContentFetcher at the beginning of an indexing pipeline to index the contents of URLs into a Document Store. You can also use it directly in a query pipeline, such as a retrieval-augmented generation (RAG) pipeline, to use the contents of a URL as the data source.
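For illustration, here is a sketch of such a query pipeline that answers a question from the contents of a URL. The prompt template, the example question, and the use of OpenAIGenerator (which needs an OPENAI_API_KEY environment variable) are assumptions for this sketch, not part of the component:

python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.generators import OpenAIGenerator

template = """Answer the question using the context below.

Context:
{% for document in documents %}{{ document.content }}{% endfor %}

Question: {{ query }}
"""

rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=LinkContentFetcher(), name="fetcher")
rag_pipeline.add_component(instance=HTMLToDocument(), name="converter")
rag_pipeline.add_component(instance=PromptBuilder(template=template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")

rag_pipeline.connect("fetcher.streams", "converter.sources")
rag_pipeline.connect("converter.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")

result = rag_pipeline.run(
    data={
        "fetcher": {"urls": ["https://haystack.deepset.ai"]},
        "prompt_builder": {"query": "What is Haystack?"},
    }
)
print(result["llm"]["replies"][0])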

Security considerations

LinkContentFetcher requests the URLs passed to it. If those URLs come directly from end users, this can expose your environment to server-side request forgery (SSRF) risks.

Before calling LinkContentFetcher, an application should therefore validate and sanitize user-provided URLs. For example:

  • Allow only expected schemes, for example https
  • Use an allowlist of trusted domains when possible
  • Block localhost, link-local, and private-network destinations
  • Consider using an outbound proxy or network-level egress restrictions in production

For example, an application could block private, loopback, link-local, and reserved addresses, as well as custom IP ranges, using the standard library's ipaddress module:

python
import ipaddress
from urllib.parse import urlparse

# Custom ranges to block, in addition to the ipaddress convenience checks below.
PRIVATE_RANGES = (
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("169.254.0.0/16"),
)


def is_unsafe_url(url: str) -> bool:
    parsed = urlparse(url)
    # Allow only HTTPS URLs that actually have a hostname.
    if parsed.scheme != "https" or not parsed.hostname:
        return True
    hostname = parsed.hostname.lower()
    if hostname == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        # Hostname, not a raw IP address. Apply your own domain allowlist policy here.
        return False
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or any(ip in net for net in PRIVATE_RANGES)
    )
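An application could then filter user-supplied URLs before passing them to the component. Note that this check inspects only literal IP addresses: a hostname that resolves to a private address would still pass, so pair it with a domain allowlist or network-level egress controls. A minimal usage sketch (the URLs are placeholders):

python
from haystack.components.fetchers import LinkContentFetcher

user_urls = ["https://haystack.deepset.ai", "https://127.0.0.1/internal"]
safe_urls = [url for url in user_urls if not is_unsafe_url(url)]  # drops the loopback URL

fetcher = LinkContentFetcher()
result = fetcher.run(urls=safe_urls)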

Usage

On its own

Below is an example where LinkContentFetcher fetches the contents of a URL. It initializes the component with the default settings. To change the defaults, such as retry_attempts, check out the API reference docs.

python
from haystack.components.fetchers import LinkContentFetcher

fetcher = LinkContentFetcher()

fetcher.run(urls=["https://haystack.deepset.ai"])
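As a sketch with non-default settings (retry_attempts is mentioned above; the timeout parameter, in seconds, is an assumption based on the API reference, so verify the parameter names available in your version):

python
from haystack.components.fetchers import LinkContentFetcher

# More retries and a longer per-request timeout for slow sites.
fetcher = LinkContentFetcher(retry_attempts=3, timeout=10)

fetcher.run(urls=["https://haystack.deepset.ai"])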

In a pipeline

Below is an example of an indexing pipeline that uses LinkContentFetcher to index the contents of the specified URLs into an InMemoryDocumentStore. Notice how it uses the HTMLToDocument component to convert the list of ByteStream objects into Document objects.

python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
fetcher = LinkContentFetcher()
converter = HTMLToDocument()
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=converter, name="converter")
indexing_pipeline.add_component(instance=writer, name="writer")

indexing_pipeline.connect("fetcher.streams", "converter.sources")
indexing_pipeline.connect("converter.documents", "writer.documents")

indexing_pipeline.run(
    data={
        "fetcher": {
            "urls": [
                "https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2",
            ],
        },
    },
)
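After the run completes, you can confirm that the fetched page was indexed, for example by counting the documents in the store:

python
# Should print 1 if the page was fetched and converted successfully.
print(document_store.count_documents())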