---
title: "Crawling a website"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Crawling a website}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

```{r setup}
library(crawlee)
```

This article follows the same path as the
[Crawlee](https://crawlee.dev) fundamentals: start from a single page, then
teach the crawler to **follow links**, **control its scope**, **route**
different page types and **discover URLs from a sitemap**. The examples target
[`books.toscrape.com`](https://books.toscrape.com), a public sandbox built for
practising web scraping.

## The model

A crawler owns three things:

* a **request queue** — a deduplicating, resumable list of URLs to visit;
* one or more **handlers** — functions run on each fetched page;
* a **dataset** — the structured records your handlers produce.

You build a crawler with `crawler()` and configure it with `cr_*` verbs that
compose through the native pipe (`|>`), then run it with `cr_run()`.

## Your first crawler

Fetch a single page and extract a couple of fields. The handler receives a
context object (`ctx`) exposing the request (`ctx$request`), the parsed page
(`ctx$page`) and the action `push_data()`, which appends a record to the dataset.

```{r}
result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url   = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()

result
```
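
The extraction step can be tried offline before pointing a crawler at a live
site. The sketch below assumes only that `ctx$page` behaves like an
xml2/rvest document, as the handler above suggests; the inline HTML and the
`.no-such-class` selector are made up for illustration.

```{r}
library(rvest)

# Stand-in for the parsed page a handler would receive (inline HTML, no network).
page <- read_html(
  "<html><head><title>All products | Books to Scrape</title></head>
   <body><h1>All products</h1></body></html>"
)

# The same extraction pattern the handler applies to ctx$page.
title   <- page |> html_element("title") |> html_text2()
heading <- page |> html_element("h1") |> html_text2()

# A selector that matches nothing yields NA instead of raising an error,
# so a record can still be built when a field is absent.
missing <- page |> html_element(".no-such-class") |> html_text2()

record <- list(title = title, heading = heading, missing = missing)
str(record)
```

Because a non-matching selector gives `NA`, handlers written this way tend to
degrade gracefully on pages whose layout differs from the one you tested.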

## Following links

Real crawls discover new URLs as they go. `ctx$enqueue_links()` extracts links
from the current page and adds them to the queue; the crawler keeps going until
the queue drains. Because the queue deduplicates by a normalised URL, each page
is visited at most once.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links() # follow every same-domain link
  }) |>
  cr_options(max_requests = 50) |>
  cr_run()
```

By default `enqueue_links()` follows only same-domain links, so a crawl cannot
wander off across the whole web; `max_requests = 50` additionally caps the run
at 50 requests.
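
As a rough mental model of what happens while the queue drains, the base-R
sketch below crawls a made-up in-memory link graph. It is not how crawlee is
implemented: `site`, `normalise_url()` and `crawl()` are hypothetical names, and
the normalisation rule (drop the fragment and trailing slashes) is an
assumption, since the package's actual rules are not described here.

```{r}
# Hypothetical site: each (normalised) page URL maps to the links found on it.
site <- list(
  "https://example.test"   = c("https://example.test/a", "https://example.test/b#top"),
  "https://example.test/a" = c("https://example.test/", "https://example.test/b"),
  "https://example.test/b" = c("https://example.test/a/", "https://other.test/x")
)

# Assumed normalisation: drop the fragment, then any trailing slashes.
normalise_url <- function(url) {
  url <- sub("#.*$", "", url)
  sub("/+$", "", url)
}

host_of <- function(url) sub("^https?://([^/]+).*$", "\\1", url)

crawl <- function(start, max_requests = Inf) {
  queue   <- normalise_url(start) # URLs waiting to be fetched
  seen    <- queue                # every URL ever enqueued (for deduplication)
  visited <- character()

  while (length(queue) > 0 && length(visited) < max_requests) {
    url     <- queue[[1]]
    queue   <- queue[-1]
    visited <- c(visited, url)

    links <- site[[url]]                           # "fetch" the page
    if (is.null(links)) links <- character()
    links <- normalise_url(links)
    links <- links[host_of(links) == host_of(url)] # same-domain only
    fresh <- setdiff(links, seen)                  # skip anything already enqueued
    seen  <- c(seen, fresh)
    queue <- c(queue, fresh)
  }
  visited
}

crawl("https://example.test/")
crawl("https://example.test/", max_requests = 2)
```

In this toy graph the fragment and trailing-slash variants collapse onto pages
already seen, and the off-domain link is never enqueued, so each page is
visited once.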

## Controlling scope

You rarely want *every* link. `enqueue_links()` accepts `include` and `exclude`
patterns (`glob` is shorthand for `include`) and a `same_domain` flag; the
crawler itself enforces `max_depth` and `max_requests`.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_options(max_depth = 3, max_requests = 200) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url, depth = ctx$request$depth))
    ctx$enqueue_links(
      glob    = "*/catalogue/*", # only follow catalogue pages
      exclude = "*/category/*"
    )
  }) |>
  cr_run() |>
  cr_collect()
```
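
To preview which links a pair of patterns would keep, you can approximate the
filter in base R. This sketch assumes the patterns are shell-style globs in
which `*` also matches `/`, which is what `utils::glob2rx()` produces;
crawlee's own matching may differ in detail. `matches_any()` and
`filter_links()` are hypothetical helpers, and the URLs are illustrative.

```{r}
# TRUE for each URL that matches at least one glob.
matches_any <- function(urls, globs) {
  hit <- rep(FALSE, length(urls))
  for (g in globs) hit <- hit | grepl(utils::glob2rx(g), urls)
  hit
}

# Keep URLs that match an include pattern (if any is given) and no exclude pattern.
filter_links <- function(urls, include = NULL, exclude = NULL) {
  keep <- if (is.null(include)) rep(TRUE, length(urls)) else matches_any(urls, include)
  keep & !matches_any(urls, exclude)
}

links <- c(
  "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "https://books.toscrape.com/index.html",
  "https://books.toscrape.com/catalogue/page-2.html"
)

kept <- filter_links(links, include = "*/catalogue/*", exclude = "*/category/*")
links[kept]
```

Under these assumptions the exclude pattern takes precedence: the category
page matches the include glob but is still dropped.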

## Routing different page types

Most sites have a few kinds of page — listings vs. detail pages, say. Give a
`label` when enqueuing and register a handler for that label. Listing pages
enqueue detail pages; detail pages extract the data.

```{r}
books <- crawler("https://books.toscrape.com/") |>
  # listing pages: enqueue book detail pages, labelled "book"
  cr_on_html(function(ctx) {
    ctx$enqueue_links(
      glob    = "*/catalogue/*index.html",
      exclude = "*/category/*", # category listings also end in index.html
      label   = "book"
    )
    ctx$enqueue_links(glob = "*/page-*.html") # pagination, default handler
  }) |>
  # detail pages
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(
      title = ctx$page |> rvest::html_element("h1") |> rvest::html_text2(),
      price = ctx$page |> rvest::html_element(".price_color") |> rvest::html_text2()
    ))
  }) |>
  cr_run() |>
  cr_collect()

books
```

A request's `label` always wins over the content-kind default, so labelled
routing and `cr_on_html()`/`cr_on_pdf()` defaults compose cleanly.
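
The dispatch rule can be pictured as a lookup keyed by label with a fallback.
The base-R sketch below is only an illustration of that idea, not crawlee's
internals: `handlers` and `route()` are hypothetical, and falling back to the
default handler for an unregistered label is an assumption.

```{r}
# Handlers keyed by label; "default" stands in for the unlabelled handler.
handlers <- list(
  default = function(request) paste("listing:", request$url),
  book    = function(request) paste("detail:", request$url)
)

# Pick the handler registered for the request's label, else the default one.
route <- function(request) {
  label <- request$label
  key <- if (!is.null(label) && label %in% names(handlers)) label else "default"
  handlers[[key]](request)
}

route(list(url = "https://example.test/catalogue/book-1/index.html", label = "book"))
route(list(url = "https://example.test/catalogue/page-2.html"))
```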

## Crawling from a sitemap

When a site publishes a `sitemap.xml`, you can seed the queue directly from it
instead of discovering links page by page — `cr_from_sitemap()` handles sitemap
indexes and gzipped sitemaps, and can filter by glob or by `<lastmod>` date.

```{r}
crawler() |>
  cr_from_sitemap("https://books.toscrape.com/sitemap.xml", label = "book") |>
  cr_on_html(label = "book", function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()
```

The companion `cr_from_rss()` does the same for RSS and Atom feeds.
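
To make the sitemap filtering concrete, here is a sketch of the kind of work
involved, written with xml2 against an inline, made-up sitemap. It is not
`cr_from_sitemap()` itself: the URLs, dates and the cut-off are illustrative,
and the glob is interpreted with `utils::glob2rx()` as an assumption.

```{r}
library(xml2)

# A tiny sitemap in the standard sitemaps.org format (hypothetical URLs).
sitemap <- read_xml(
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url><loc>https://example.test/catalogue/book-1/index.html</loc><lastmod>2024-01-10</lastmod></url>
     <url><loc>https://example.test/catalogue/book-2/index.html</loc><lastmod>2024-03-05</lastmod></url>
     <url><loc>https://example.test/about.html</loc><lastmod>2024-03-20</lastmod></url>
   </urlset>'
)
xml_ns_strip(sitemap) # drop the default namespace so plain XPath works

# One row per <url> entry, with its location and last-modified date.
entries <- xml_find_all(sitemap, "//url")
urls <- data.frame(
  loc     = xml_text(xml_find_first(entries, "loc")),
  lastmod = as.Date(xml_text(xml_find_first(entries, "lastmod")))
)

# Keep catalogue pages modified on or after the cut-off date.
keep  <- grepl(utils::glob2rx("*/catalogue/*"), urls$loc) &
  urls$lastmod >= as.Date("2024-02-01")
seeds <- urls$loc[keep]
seeds
```

The resulting `seeds` vector is the set of URLs that would be placed on the
queue; a sitemap index would need one more level of the same parsing.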

## Rendering JavaScript pages

If a page builds its content with JavaScript, the plain HTTP backend sees an
empty shell. Switch to the headless-browser backend with `cr_use_browser()`
(requires the `chromote` package and a Chrome/Chromium install). Handlers
are unchanged; you additionally get `ctx$screenshot()`.

```{r}
crawler("https://example.com") |>
  cr_use_browser(wait_selector = ".content") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$screenshot()
  }) |>
  cr_run()
```

## Where next

* **Politeness & speed** — `robots.txt` is respected by default; `cr_options(delay = )`
  rate-limits, and `cr_parallel()` fetches concurrently.
* **Documents** — `cr_on_pdf()` extracts text from PDFs; `ctx$save_body()` stores
  raw files in a key-value store.
* **Reproducible, resumable runs** — `cr_persist(dir)` checkpoints the queue and
  persists the dataset, so an interrupted crawl continues where it left off.
* **RAG** — `cr_chunk()`, `cr_embed()` and `cr_export()` turn crawled text into a
  retrieval-ready table.
