The Crawler scrapes the text from a website, creates a Haystack Document object out of it, and saves it to a JSON file.
Module crawler
Crawler
class Crawler(BaseComponent)
Crawl texts from a website so that we can use them later in Haystack as a corpus for search / question answering etc.
Example:
from haystack.nodes.connector import Crawler
crawler = Crawler(output_dir="crawled_files")
# crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
                     filter_urls=["haystack.deepset.ai/overview/"])
Crawler.__init__
def __init__(urls: Optional[List[str]] = None,
crawler_depth: int = 1,
filter_urls: Optional[List] = None,
id_hash_keys: Optional[List[str]] = None,
extract_hidden_text=True,
loading_wait_time: Optional[int] = None,
output_dir: Union[str, Path, None] = None,
overwrite_existing_files=True,
file_path_meta_field_name: Optional[str] = None,
crawler_naming_function: Optional[Callable[[str, str],
str]] = None,
webdriver_options: Optional[List[str]] = None)
Init object with basic params for crawling (can be overwritten later).
Arguments:
urls
: List of http(s) address(es) (can also be supplied later when calling crawl()).
crawler_depth
: How many sublinks to follow from the initial list of URLs. Current options:
0: Only the initial list of URLs
1: Follow links found on the initial URLs (but no further)
filter_urls
: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
id_hash_keys
: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but the texts are
not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]).
In this case the ID will be generated from the content and the defined metadata.
extract_hidden_text
: Whether to extract the hidden text contained in the page.
E.g. the text can be inside a span with style="display: none".
loading_wait_time
: Seconds to wait for the page to load before scraping. Recommended when the page relies on
dynamic DOM manipulations. Use carefully and only when needed, because waiting slows down scraping.
E.g. 2: the Crawler will wait 2 seconds before scraping each page.
output_dir
: If provided, the crawled documents will be saved as JSON files in this directory.
overwrite_existing_files
: Whether to overwrite existing files in output_dir with new content.
file_path_meta_field_name
: If provided, the file path will be stored in this meta field.
crawler_naming_function
: A function mapping the crawled page to a file name.
By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of the unprocessed page URL.
E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)
This example generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
This example generates a file name from the MD5 hash of the concatenation of the URL and the page content.
webdriver_options
: A list of options to send to the Selenium webdriver. If none is provided,
the Crawler uses a default selection suited to running locally or in restricted Docker containers,
and avoids using the GPU.
The Crawler always appends the following option: "--headless"
For example: 1) ["--disable-gpu", "--no-sandbox", "--disable-dev-shm-usage", "--single-process"]
These are the default options, which disable the GPU, disable the sandbox, disable shared memory usage
and spawn a single process.
2) ["--no-sandbox"]
This option disables the sandbox, which is required for running Chrome as root.
3) ["--remote-debugging-port=9222"]
This option enables remote debugging over HTTP.
See Chromium Command Line Switches for more details on the available options.
If your crawler fails, raising a selenium.WebDriverException, this Stack Overflow thread can be helpful. It contains useful suggestions for webdriver_options.
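The two crawler_naming_function examples above can also be written as plain named functions, which are easier to test than lambdas. This is a minimal sketch using only the standard library. The names sanitized_name and hashed_name are hypothetical and not part of Haystack, and the set of unsafe characters mirrors the example above rather than the Crawler's own defaults.

```python
import hashlib
import re

# Characters treated as unsafe in file names (mirrors the example above).
UNSAFE_CHARS = re.compile(r"[<>:'/|?*\x00 ]")


def sanitized_name(url: str, page_content: str) -> str:
    """Hypothetical naming function: replace unsafe URL characters with underscores."""
    return UNSAFE_CHARS.sub("_", url)


def hashed_name(url: str, page_content: str) -> str:
    """Hypothetical naming function: MD5 hex digest of URL + page content."""
    return hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()


print(sanitized_name("https://example.com/a b", "<html></html>"))
print(hashed_name("https://example.com/a b", "<html></html>"))
```

Either function could then be passed as crawler_naming_function, since both accept the URL and the page content and return a string.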
Crawler.crawl
def crawl(
urls: Optional[List[str]] = None,
crawler_depth: Optional[int] = None,
filter_urls: Optional[List] = None,
id_hash_keys: Optional[List[str]] = None,
extract_hidden_text: Optional[bool] = None,
loading_wait_time: Optional[int] = None,
output_dir: Union[str, Path, None] = None,
overwrite_existing_files: Optional[bool] = None,
file_path_meta_field_name: Optional[str] = None,
crawler_naming_function: Optional[Callable[[str, str], str]] = None
) -> List[Document]
Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON
file per URL, including text and basic metadata).
You can optionally use filter_urls to crawl only URLs that match a certain pattern.
All parameters are optional here and only meant to overwrite instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during init will be used.
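The override behaviour described above follows a common pattern: an argument left as None falls back to the value stored at construction time. The sketch below illustrates that pattern with a hypothetical MiniCrawler class. It is not the Crawler's actual code and performs no crawling.

```python
from typing import List, Optional


class MiniCrawler:
    """Hypothetical stand-in that only demonstrates the None-fallback pattern."""

    def __init__(self, urls: Optional[List[str]] = None, crawler_depth: int = 1):
        self.urls = urls
        self.crawler_depth = crawler_depth

    def resolve(self, urls: Optional[List[str]] = None,
                crawler_depth: Optional[int] = None) -> dict:
        # An argument left as None falls back to the instance attribute.
        # Comparing against None (not truthiness) keeps crawler_depth=0 valid.
        if urls is None:
            urls = self.urls
        if crawler_depth is None:
            crawler_depth = self.crawler_depth
        return {"urls": urls, "crawler_depth": crawler_depth}


crawler = MiniCrawler(urls=["https://example.com"], crawler_depth=1)
print(crawler.resolve())                 # uses the values passed to __init__
print(crawler.resolve(crawler_depth=0))  # overrides only the depth for this call
```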
Arguments:
urls
: List of http addresses or a single http address
crawler_depth
: How many sublinks to follow from the initial list of URLs. Current options:
0: Only the initial list of URLs
1: Follow links found on the initial URLs (but no further)
filter_urls
: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
overwrite_existing_files
: Whether to overwrite existing files in output_dir with new content.
id_hash_keys
: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but the texts are
not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]).
In this case the ID will be generated from the content and the defined metadata.
loading_wait_time
: Seconds to wait for the page to load before scraping. Recommended when the page relies on
dynamic DOM manipulations. Use carefully and only when needed, because waiting slows down scraping.
E.g. 2: the Crawler will wait 2 seconds before scraping each page.
output_dir
: If provided, the crawled documents will be saved as JSON files in this directory.
file_path_meta_field_name
: If provided, the file path will be stored in this meta field.
crawler_naming_function
: A function mapping the crawled page to a file name.
By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of the unprocessed page URL.
E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)
This example generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
This example generates a file name from the MD5 hash of the concatenation of the URL and the page content.
Returns:
List of Documents that were created during crawling
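To make the filter_urls rule concrete, the sketch below keeps a URL only if at least one regular expression matches it. It assumes search semantics (a match anywhere in the URL), which is consistent with how the introductory example passes "haystack.deepset.ai/overview/". The helper name filter_links is hypothetical, and the Crawler's internal matching may differ in detail.

```python
import re
from typing import List, Optional


def filter_links(links: List[str], filter_urls: Optional[List[str]]) -> List[str]:
    """Keep links matched by at least one pattern (hypothetical helper)."""
    # Assumption: with no filters configured, nothing is dropped.
    if not filter_urls:
        return list(links)
    patterns = [re.compile(p) for p in filter_urls]
    return [link for link in links if any(p.search(link) for p in patterns)]


links = [
    "https://haystack.deepset.ai/overview/get-started",
    "https://haystack.deepset.ai/overview/intro",
    "https://github.com/deepset-ai/haystack",
]
kept = filter_links(links, [r"haystack\.deepset\.ai/overview/"])
print(kept)
```

Escaping the dots, as done here, stops the pattern from matching unintended hosts, because an unescaped "." in a regular expression matches any character.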
Crawler.run
def run(
urls: Optional[List[str]] = None,
crawler_depth: Optional[int] = None,
filter_urls: Optional[List] = None,
id_hash_keys: Optional[List[str]] = None,
extract_hidden_text: Optional[bool] = True,
loading_wait_time: Optional[int] = None,
output_dir: Union[str, Path, None] = None,
overwrite_existing_files: Optional[bool] = None,
crawler_naming_function: Optional[Callable[[str, str], str]] = None,
file_path_meta_field_name: Optional[str] = None
) -> Tuple[Dict[str, List[Document]], str]
Method to be executed when the Crawler is used as a Node within a Haystack pipeline.
Arguments:
output_dir
: Path for the directory to store files
urls
: List of http addresses or a single http address
crawler_depth
: How many sublinks to follow from the initial list of URLs. Current options:
0: Only the initial list of URLs
1: Follow links found on the initial URLs (but no further)
filter_urls
: Optional list of regular expressions that the crawled URLs must comply with.
All URLs not matching at least one of the regular expressions will be dropped.
overwrite_existing_files
: Whether to overwrite existing files in output_dir with new content.
return_documents
: Return the content of the JSON files
id_hash_keys
: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but the texts are
not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]).
In this case the ID will be generated from the content and the defined metadata.
extract_hidden_text
: Whether to extract the hidden text contained in the page.
E.g. the text can be inside a span with style="display: none".
loading_wait_time
: Seconds to wait for the page to load before scraping. Recommended when the page relies on
dynamic DOM manipulations. Use carefully and only when needed, because waiting slows down scraping.
E.g. 2: the Crawler will wait 2 seconds before scraping each page.
file_path_meta_field_name
: If provided, the file path will be stored in this meta field.
crawler_naming_function
: A function mapping the crawled page to a file name.
By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of the unprocessed page URL.
E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)
This example generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
This example generates a file name from the MD5 hash of the concatenation of the URL and the page content.
Returns:
Tuple({"documents": List of Documents, ...}, Name of output edge)
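Because run returns a two-element tuple, callers typically unpack it into the output dictionary and the edge name. Running the real node requires Selenium and a browser, so this sketch uses a hypothetical stand-in function, fake_run, with plain dictionaries in place of Document objects. The edge name "output_1" is likewise an assumption made for illustration.

```python
from typing import Any, Dict, List, Tuple


def fake_run() -> Tuple[Dict[str, List[Any]], str]:
    """Hypothetical stand-in that mimics the shape of the value returned by run."""
    documents = [
        {"content": "Example page text", "meta": {"url": "https://example.com"}},
    ]
    return {"documents": documents}, "output_1"


# Unpack the tuple: first the output dictionary, then the name of the output edge.
output, edge_name = fake_run()
for doc in output["documents"]:
    print(doc["meta"]["url"], len(doc["content"]))
print(edge_name)
```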