Fetches content from a list of URLs and returns a list of extracted content streams.
Module link_content
LinkContentFetcher
Fetches and extracts content from URLs.
It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests. Use it as the data-fetching step in your pipelines.
You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.
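The retry-with-rotation behavior can be pictured with a small standalone sketch. This is only an illustration of the idea, not the component's actual implementation; `USER_AGENTS` and `fetch_with_rotation` are hypothetical names:

```python
from itertools import cycle

# Hypothetical user-agent pool; the real component ships its own defaults.
USER_AGENTS = ["agent-a", "agent-b", "agent-c"]

def fetch_with_rotation(url, attempt_fn, retry_attempts=2):
    """Try attempt_fn(url, user_agent) up to retry_attempts + 1 times,
    switching to the next user agent after each failure."""
    agents = cycle(USER_AGENTS)
    last_error = None
    for _ in range(retry_attempts + 1):
        user_agent = next(agents)
        try:
            return attempt_fn(url, user_agent)
        except Exception as exc:  # noqa: BLE001 - retry on any fetch error
            last_error = exc
    raise last_error
```

Each failed attempt advances to the next user agent, so transient blocks tied to a particular agent string do not exhaust the retry budget against the same identity.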
Usage example
```python
from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data
```
For async usage:
```python
import asyncio

from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
```
LinkContentFetcher.__init__
```python
def __init__(raise_on_failure: bool = True,
             user_agents: Optional[List[str]] = None,
             retry_attempts: int = 2,
             timeout: int = 3,
             http2: bool = False,
             client_kwargs: Optional[Dict] = None)
```
Initializes the component.
Arguments:
- `raise_on_failure`: If `True`, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched.
- `user_agents`: User agents for fetching content. If `None`, a default user agent is used.
- `retry_attempts`: The number of times to retry fetching a URL's content.
- `timeout`: Timeout in seconds for the request.
- `http2`: Whether to enable HTTP/2 support for requests. Defaults to `False`. Requires the `h2` package to be installed (via `pip install httpx[http2]`).
- `client_kwargs`: Additional keyword arguments to pass to the httpx client. If `None`, default values are used.
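As a sketch, the options above might be combined as follows. All values here are illustrative; `follow_redirects` is one example of a keyword the httpx client accepts:

```python
# Illustrative settings for the parameters documented above.
fetcher_config = {
    "raise_on_failure": False,      # log errors instead of raising
    "user_agents": ["my-app/1.0"],  # hypothetical user-agent string
    "retry_attempts": 3,
    "timeout": 10,
    "http2": False,                 # True requires `pip install httpx[http2]`
    "client_kwargs": {"follow_redirects": True},  # passed through to httpx
}
# fetcher = LinkContentFetcher(**fetcher_config)
```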
LinkContentFetcher.__del__
```python
def __del__()
```
Clean up resources when the component is deleted.
Closes both the synchronous and asynchronous HTTP clients to prevent resource leaks.
LinkContentFetcher.run
```python
@component.output_types(streams=List[ByteStream])
def run(urls: List[str])
```
Fetches content from a list of URLs and returns a list of extracted content streams.
Each content stream is a `ByteStream` object containing the extracted content as binary data.
Each `ByteStream` object in the returned list corresponds to the contents of a single URL.
The content type of each stream is stored in the metadata of the `ByteStream` object under
the key `"content_type"`, and the URL of the fetched content is stored under the key `"url"`.
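For instance, the metadata keys described above could be used to index the results by URL. This is a small hypothetical helper, not part of the component's API:

```python
def content_types_by_url(streams):
    """Map each fetched URL to its reported content type using the
    'url' and 'content_type' metadata keys of each stream."""
    return {stream.meta["url"]: stream.meta["content_type"] for stream in streams}
```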
Arguments:
- `urls`: A list of URLs to fetch content from.
Raises:
- `Exception`: If the provided list of URLs contains only a single URL and `raise_on_failure` is set to `True`, an exception is raised in case of an error during content retrieval. In all other scenarios, any retrieval errors are logged, and a list of successfully retrieved `ByteStream` objects is returned.
Returns:
A dictionary with the key `streams`: a list of `ByteStream` objects representing the extracted content.
LinkContentFetcher.run_async
```python
@component.output_types(streams=List[ByteStream])
async def run_async(urls: List[str])
```
Asynchronously fetches content from a list of URLs and returns a list of extracted content streams.
This is the asynchronous version of the `run` method, with the same parameters and return values.
Arguments:
- `urls`: A list of URLs to fetch content from.
Returns:
A dictionary with the key `streams`: a list of `ByteStream` objects representing the extracted content.
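Because `run_async` is a coroutine, several fetch calls can be awaited concurrently. A minimal sketch of that pattern, where `fetch_one` stands in for a hypothetical coroutine wrapping `run_async`:

```python
import asyncio

async def fetch_batches(fetch_one, url_batches):
    """Run one coroutine per batch of URLs concurrently and collect
    the results in the order the batches were given."""
    return await asyncio.gather(*(fetch_one(urls) for urls in url_batches))
```

`asyncio.gather` preserves input order, so each result lines up with its batch even though the fetches overlap in time.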