Fetchers
link_content
LinkContentFetcher
Fetches and extracts content from URLs.
It supports various content types, retries on failures, and automatic user-agent rotation for failed web requests. Use it as the data-fetching step in your pipelines.
You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.
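As a rough sketch of that conversion step, the function below chains the fetcher with HTMLToDocument. The name fetch_documents is made up for illustration. The imports sit inside the function only so that defining it needs neither Haystack nor network access. The sketch assumes a Haystack 2.x installation in which HTMLToDocument.run accepts a sources argument and returns a dictionary with a "documents" key.

```python
def fetch_documents(urls):
    """Fetch each URL and convert the resulting streams into Documents.

    Hypothetical helper, not part of Haystack itself.
    """
    # Lazy imports: Haystack is only required when the function is called.
    from haystack.components.converters import HTMLToDocument
    from haystack.components.fetchers import LinkContentFetcher

    fetcher = LinkContentFetcher()
    converter = HTMLToDocument()

    # The fetcher returns its ByteStream objects under the "streams" key.
    streams = fetcher.run(urls=urls)["streams"]

    # Assumption: the converter returns its output under the "documents" key.
    return converter.run(sources=streams)["documents"]
```

Calling fetch_documents(["https://www.google.com"]) performs a real HTTP request, so the result depends on the network and on what the page returns at that moment.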
Usage example
from haystack.components.fetchers.link_content import LinkContentFetcher
fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]
assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data
For async usage:
import asyncio
from haystack.components.fetchers import LinkContentFetcher
async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
init
__init__(
raise_on_failure: bool = True,
user_agents: list[str] | None = None,
retry_attempts: int = 2,
timeout: int = 3,
http2: bool = False,
client_kwargs: dict | None = None,
request_headers: dict[str, str] | None = None,
)
Initializes the component.
Parameters:
- raise_on_failure (bool) – If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched.
- user_agents (list[str] | None) – User agents for fetching content. If None, a default user agent is used.
- retry_attempts (int) – The number of times to retry to fetch the URL's content.
- timeout (int) – Timeout in seconds for the request.
- http2 (bool) – Whether to enable HTTP/2 support for requests. Defaults to False. Requires the 'h2' package to be installed (via pip install httpx[http2]).
- client_kwargs (dict | None) – Additional keyword arguments to pass to the httpx client. If None, default values are used.
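To make the interplay of user_agents and retry_attempts concrete, here is a simplified, standard-library-only sketch of retrying with user-agent rotation. It is not the component's actual implementation. fetch_with_rotation, send, and flaky_send are hypothetical names. The sketch also assumes that retry_attempts counts retries after an initial attempt, which the parameter description does not state explicitly.

```python
from __future__ import annotations

from typing import Callable


def fetch_with_rotation(
    url: str,
    user_agents: list[str],
    retry_attempts: int,
    send: Callable[[str, str], bytes],
) -> bytes:
    """Call send(url, agent), switching to the next user agent after each failure."""
    last_error: Exception | None = None
    # Assumption: one initial attempt plus `retry_attempts` retries.
    for attempt in range(retry_attempts + 1):
        agent = user_agents[attempt % len(user_agents)]
        try:
            return send(url, agent)
        except Exception as error:  # remember the failure and try the next agent
            last_error = error
    assert last_error is not None
    raise last_error


# Hypothetical transport that rejects the first user agent.
calls: list[str] = []


def flaky_send(url: str, agent: str) -> bytes:
    calls.append(agent)
    if agent == "agent-a":
        raise ConnectionError("blocked")
    return b"<html>ok</html>"


body = fetch_with_rotation("https://example.com", ["agent-a", "agent-b"], 2, flaky_send)
```

In this run the first attempt fails with "agent-a" and the second succeeds with "agent-b", so no third attempt is made.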
run
Fetches content from a list of URLs and returns a list of extracted content streams.
Each content stream is a ByteStream object containing the extracted content as binary data.
Each ByteStream object in the returned list corresponds to the contents of a single URL.
The content type of each stream is stored in the metadata of the ByteStream object under
the key "content_type". The URL of the fetched content is stored under the key "url".
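One way to use this metadata is to group fetched URLs by content type before choosing a converter for each group. The sketch below uses a hypothetical Stream dataclass as a stand-in for ByteStream so that it runs without Haystack installed. It mirrors only the data and meta attributes, and group_by_content_type is likewise an illustrative name.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical stand-in exposing the two ByteStream attributes used here."""

    data: bytes
    meta: dict = field(default_factory=dict)


def group_by_content_type(streams):
    """Map each content type to the list of URLs fetched with that type."""
    groups = defaultdict(list)
    for stream in streams:
        content_type = stream.meta.get("content_type", "unknown")
        groups[content_type].append(stream.meta.get("url"))
    return dict(groups)


streams = [
    Stream(b"<html></html>", {"content_type": "text/html", "url": "https://a.example"}),
    Stream(b"%PDF-", {"content_type": "application/pdf", "url": "https://b.example/file.pdf"}),
]
grouped = group_by_content_type(streams)
```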
Parameters:
- urls (
list[str]) – A list of URLs to fetch content from.
Returns:
- ByteStream objects representing the extracted content, returned under the "streams" key.
Raises:
- Exception – If the provided list of URLs contains only a single URL, and raise_on_failure is set to True, an exception will be raised in case of an error during content retrieval. In all other scenarios, any retrieval errors are logged, and a list of successfully retrieved ByteStream objects is returned.
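The error-handling rule above can be illustrated with a small standard-library sketch. It is a simplified model of the documented behavior, not the component's source code. fetch_all, fetch_one, and fake_fetch are hypothetical names, and plain bytes stand in for ByteStream objects.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_all(urls, fetch_one, raise_on_failure=True):
    """Model of the documented rule: raise only for a single URL with raise_on_failure=True."""
    if len(urls) == 1 and raise_on_failure:
        # Single URL: let any exception propagate to the caller.
        return [fetch_one(urls[0])]

    results = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except Exception as error:
            # All other scenarios: log the error and keep the successful results.
            logger.warning("Could not fetch %s: %s", url, error)
    return results


def fake_fetch(url):
    """Hypothetical fetcher that fails for any URL containing 'bad'."""
    if "bad" in url:
        raise ValueError("unreachable")
    return url.encode()


mixed = fetch_all(["https://good.example", "https://bad.example"], fake_fetch)
quiet = fetch_all(["https://bad.example"], fake_fetch, raise_on_failure=False)

try:
    fetch_all(["https://bad.example"], fake_fetch)
    single_raised = False
except ValueError:
    single_raised = True
```

Here mixed contains only the result for the reachable URL, quiet is an empty list, and the single-URL call with the default raise_on_failure=True raises.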
run_async
Asynchronously fetches content from a list of URLs and returns a list of extracted content streams.
This is the asynchronous version of the run method with the same parameters and return values.
Parameters:
- urls (
list[str]) – A list of URLs to fetch content from.
Returns:
- ByteStream objects representing the extracted content, returned under the "streams" key.