API Reference

Fetches content from a list of URLs and returns a list of extracted content streams.

Module link_content

LinkContentFetcher

Fetches and extracts content from URLs.

It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests. Use it as the data-fetching step in your pipelines.

You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.

Usage example

from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data

LinkContentFetcher.__init__

def __init__(raise_on_failure: bool = True,
             user_agents: Optional[List[str]] = None,
             retry_attempts: int = 2,
             timeout: int = 3)

Initializes the component.

Arguments:

  • raise_on_failure: If True, raises an exception when fetching a single URL fails. When multiple URLs are passed, errors are logged and the successfully fetched content is returned.
  • user_agents: User agents for fetching content. If None, a default user agent is used.
  • retry_attempts: The number of times to retry fetching a URL's content.
  • timeout: Timeout in seconds for the request.
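The interplay of retry_attempts and user_agents can be illustrated with a small, self-contained sketch. It is not the component's actual implementation: the function fetch_with_rotation, the send callback, and DEFAULT_USER_AGENT are hypothetical names, and the sketch assumes that retry_attempts counts retries after an initial attempt and that the user agent is switched after each failure. A fake sender is used so the snippet runs without network access.

```python
from typing import Callable, List, Optional

# Hypothetical default for this sketch; not the library's real default user agent.
DEFAULT_USER_AGENT = "example-fetcher/0.1"


def fetch_with_rotation(
    url: str,
    send: Callable[[str, str], bytes],
    user_agents: Optional[List[str]] = None,
    retry_attempts: int = 2,
) -> bytes:
    """Try `url` once, then retry up to `retry_attempts` times.

    After every failure the next user agent in the list is used,
    wrapping around when the list is exhausted.
    """
    agents = user_agents or [DEFAULT_USER_AGENT]
    last_error: Optional[Exception] = None
    for attempt in range(retry_attempts + 1):
        agent = agents[attempt % len(agents)]
        try:
            return send(url, agent)
        except Exception as error:  # remember the failure and rotate
            last_error = error
    raise RuntimeError(f"could not fetch {url}") from last_error


# Fake sender that fails twice and then succeeds, recording the agents it saw.
calls: List[str] = []


def flaky_send(url: str, agent: str) -> bytes:
    calls.append(agent)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return b"<html>ok</html>"


body = fetch_with_rotation(
    "https://example.com", flaky_send, user_agents=["agent-a", "agent-b"]
)
```

With two user agents and two retries, the third attempt wraps around to the first agent again.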

LinkContentFetcher.run

@component.output_types(streams=List[ByteStream])
def run(urls: List[str])

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a ByteStream object containing the extracted content as binary data. Each ByteStream object in the returned list corresponds to the contents of a single URL. The content type of each stream is stored in the metadata of the ByteStream object under the key "content_type". The URL of the fetched content is stored under the key "url".
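As a rough illustration of how the "content_type" and "url" metadata keys might be used downstream, the sketch below groups fetched streams by content type. The helper group_by_content_type is a hypothetical name, and SimpleNamespace objects stand in for ByteStream objects so the snippet runs without network access; it assumes only that each stream exposes the data and meta attributes described above.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_by_content_type(streams: Iterable[Any]) -> Dict[str, List[str]]:
    """Map each content type to the URLs whose streams carry it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for stream in streams:
        content_type = stream.meta.get("content_type", "unknown")
        grouped[content_type].append(stream.meta.get("url", ""))
    return dict(grouped)


# Stand-ins for ByteStream objects (assumption: same `data` and `meta` attributes).
streams = [
    SimpleNamespace(
        data=b"<html></html>",
        meta={"content_type": "text/html", "url": "https://example.com"},
    ),
    SimpleNamespace(
        data=b"%PDF-1.7",
        meta={"content_type": "application/pdf", "url": "https://example.com/report.pdf"},
    ),
    SimpleNamespace(
        data=b"<p>hi</p>",
        meta={"content_type": "text/html", "url": "https://example.com/about"},
    ),
]

by_type = group_by_content_type(streams)
```

Grouping like this can help route each stream to a converter suited to its content type.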

Arguments:

  • urls: A list of URLs to fetch content from.

Raises:

  • Exception: Raised if content retrieval fails when urls contains a single URL and raise_on_failure is True. In all other cases, retrieval errors are logged and a list of the successfully retrieved ByteStream objects is returned.
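Because a single-URL call can raise when raise_on_failure is True, callers may want to guard it. The following is a minimal sketch of one way to do so; fetch_single_or_empty, FailingFetcher, and EchoFetcher are hypothetical names, and the two stand-in classes only mimic the run(urls=...) interface shown in the usage example so the snippet runs without network access.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def fetch_single_or_empty(fetcher: Any, url: str) -> List[Any]:
    """Fetch one URL and return its streams, or an empty list on failure."""
    try:
        return fetcher.run(urls=[url])["streams"]
    except Exception as error:  # the documented exception type is the generic Exception
        logger.warning("Fetching %s failed: %s", url, error)
        return []


class FailingFetcher:
    """Hypothetical stand-in that always fails, like a single-URL fetch error."""

    def run(self, urls: List[str]) -> Dict[str, List[bytes]]:
        raise ConnectionError("simulated network failure")


class EchoFetcher:
    """Hypothetical stand-in that returns each URL as bytes under "streams"."""

    def run(self, urls: List[str]) -> Dict[str, List[bytes]]:
        return {"streams": [url.encode() for url in urls]}


failed = fetch_single_or_empty(FailingFetcher(), "https://example.com")
succeeded = fetch_single_or_empty(EchoFetcher(), "https://example.com")
```

An alternative is to construct the component with raise_on_failure=False, in which case, per the description above, errors are logged rather than raised.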

Returns:

A dictionary with the key "streams", which holds the list of ByteStream objects representing the extracted content.