Version: 2.23

Firecrawl

haystack_integrations.components.fetchers.firecrawl.firecrawl_crawler

FirecrawlCrawler

A component that uses Firecrawl to crawl one or more URLs and return the content as Haystack Documents.

Crawling starts from each given URL and follows links to discover subpages, up to a configurable limit. This is useful for ingesting entire websites or documentation sites, not just single pages.

Firecrawl is a service that crawls websites and returns content in a structured format (e.g. Markdown) suitable for LLMs. You need a Firecrawl API key from firecrawl.dev.

Usage example

python
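from haystack.utils import Secret
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

# Read the API key from the environment and cap the crawl with a limit of 5
crawler = FirecrawlCrawler(
    api_key=Secret.from_env_var("FIRECRAWL_API_KEY"),
    params={"limit": 5},
)
crawler.warm_up()

# Crawl starting from the given URL; the result holds the Documents
result = crawler.run(urls=["https://docs.haystack.deepset.ai/docs/intro"])
documents = result["documents"]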
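
Once a crawl returns, it can help to inspect what came back before passing it on. The following is a minimal sketch, not part of the integration: `summarize_documents` is a hypothetical helper, the stand-in objects only mimic the `content` and `meta` attributes of Haystack Documents, and the `url` metadata key is an assumption made for illustration.

```python
from types import SimpleNamespace


def summarize_documents(documents):
    """Return (content length, sorted metadata keys) for each document."""
    summary = []
    for doc in documents:
        text = doc.content or ""  # content may be None for empty pages
        summary.append((len(text), sorted(doc.meta)))
    return summary


# Stand-ins with the same `content` and `meta` attributes as Haystack Documents.
# The "url" key is hypothetical; check the real metadata returned by your crawl.
fake_docs = [SimpleNamespace(content="# Intro\nHello", meta={"url": "https://example.com"})]

print(summarize_documents(fake_docs))
```

With a real crawl, the same helper could be called as `summarize_documents(result["documents"])`.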

__init__

python
__init__(
    api_key: Secret = Secret.from_env_var("FIRECRAWL_API_KEY"),
    params: dict[str, Any] | None = None,
) -> None

Initialize the FirecrawlCrawler.

Parameters:

  • api_key (Secret) – API key for Firecrawl. Defaults to the FIRECRAWL_API_KEY environment variable.
  • params (dict[str, Any] | None) – Parameters for the crawl request. See the Firecrawl API reference for available parameters. Defaults to {"limit": 1, "scrape_options": {"formats": ["markdown"]}}. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
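
As a rough illustration of the params shape described above, the sketch below builds a dictionary with an explicit limit and the documented default format. `build_crawl_params` is a hypothetical helper introduced here, not part of the integration, and only the two keys named in the documented default are used.

```python
from __future__ import annotations

from typing import Any


def build_crawl_params(limit: int = 1, formats: list[str] | None = None) -> dict[str, Any]:
    """Build crawl params that always carry an explicit limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {
        "limit": limit,
        # Fall back to the documented default format when none is given
        "scrape_options": {"formats": formats or ["markdown"]},
    }


params = build_crawl_params(limit=5)
print(params)
```

The resulting dictionary could then be passed as the `params` argument when constructing the component. Always setting a limit keeps credit usage predictable.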

run

python
run(urls: list[str], params: dict[str, Any] | None = None) -> dict[str, Any]

Crawls the given URLs and returns the extracted content as Documents.

Parameters:

  • urls (list[str]) – List of URLs to crawl.
  • params (dict[str, Any] | None) – Optional override of crawl parameters for this run. If provided, fully replaces the init-time params.

Returns:

  • dict[str, Any] – A dictionary with the following key:
      • documents: List of Documents, one for each crawled page, including subpages discovered from the given URLs.
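
Because run-time params fully replace the init-time params rather than merging with them, passing only `{"limit": 10}` to `run` drops any other init-time settings. The sketch below models that rule in plain Python and shows one way to keep the other keys. `effective_params` and `with_overrides` are hypothetical helpers that mirror the documented behavior; they are not the component's actual implementation.

```python
from __future__ import annotations

import copy
from typing import Any

# Defaults as documented for the init-time params
INIT_PARAMS: dict[str, Any] = {"limit": 1, "scrape_options": {"formats": ["markdown"]}}


def effective_params(init_params: dict[str, Any], run_params: dict[str, Any] | None = None) -> dict[str, Any]:
    """Model the documented rule: run-time params replace init-time params wholesale."""
    return init_params if run_params is None else run_params


def with_overrides(init_params: dict[str, Any], **overrides: Any) -> dict[str, Any]:
    """Copy the init-time params and apply overrides so untouched keys survive."""
    merged = copy.deepcopy(init_params)  # deep copy so nested dicts are not shared
    merged.update(overrides)
    return merged


replaced = effective_params(INIT_PARAMS, {"limit": 10})
merged = with_overrides(INIT_PARAMS, limit=10)
print(replaced)  # {'limit': 10}
print(merged)    # {'limit': 10, 'scrape_options': {'formats': ['markdown']}}
```

Passing `merged` as the `params` argument to `run` would raise the limit for that call while keeping the Markdown format setting.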

run_async

python
run_async(
    urls: list[str], params: dict[str, Any] | None = None
) -> dict[str, Any]

Asynchronously crawls the given URLs and returns the extracted content as Documents.

Parameters:

  • urls (list[str]) – List of URLs to crawl.
  • params (dict[str, Any] | None) – Optional override of crawl parameters for this run. If provided, fully replaces the init-time params.

Returns:

  • dict[str, Any] – A dictionary with the following key:
      • documents: List of Documents, one for each crawled page, including subpages discovered from the given URLs.
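
A minimal sketch of awaiting `run_async` from an event loop follows. To keep it runnable without an API key, it uses `StubCrawler`, a hypothetical stand-in that only copies the `run_async` signature shown above and returns fake results. In real code the component itself would be passed to `crawl_all`, which is also a hypothetical helper.

```python
import asyncio


class StubCrawler:
    """Stand-in with the same run_async signature; returns fake results."""

    async def run_async(self, urls, params=None):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return {"documents": [f"doc for {url}" for url in urls]}


async def crawl_all(crawler, urls, params=None):
    """Await one crawl and return just the list of documents."""
    result = await crawler.run_async(urls=urls, params=params)
    return result["documents"]


documents = asyncio.run(crawl_all(StubCrawler(), ["https://example.com"]))
print(documents)
```

Inside an already running event loop (for example in a notebook or an async web handler), `await crawl_all(...)` would be used directly instead of `asyncio.run`.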

warm_up

python
warm_up() -> None

Warm up the component by initializing the Firecrawl clients. Calling this before the first run avoids cold-start delays when crawling many URLs.