| Type: | Package |
| Title: | Parse, Clean, and Normalize URLs |
| Version: | 1.2.0 |
| Language: | en-US |
| Description: | A lightweight toolkit for extracting structured information from URLs. Includes functions for parsing, normalizing protocols, extracting domains, and constructing clean URLs. The package includes a processed copy of the Public Suffix List from https://publicsuffix.org for domain extraction. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| Collate: | 'utils.R' 'domain.R' 'path-query.R' 'parse-phases.R' 'parse.R' 'accessors.R' 'canonical_join.R' 'zzz.R' |
| Imports: | utils, curl, stringi, punycoder (≥ 1.0.0) |
| URL: | https://github.com/bart-turczynski/rurl |
| BugReports: | https://github.com/bart-turczynski/rurl/issues |
| Suggests: | testthat (≥ 3.0.0), knitr, rmarkdown, withr |
| Config/testthat/edition: | 3 |
| Depends: | R (≥ 3.5) |
| VignetteBuilder: | knitr |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | no |
| Packaged: | 2026-06-13 10:02:53 UTC; bartturczynski |
| Author: | Bart Turczynski [aut, cre] |
| Maintainer: | Bart Turczynski <bartek+rurl@turczynski.pl> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-19 15:50:02 UTC |
Canonical Join of Two URL Sets (Base R Version)
Description
Performs a join between two data frames by canonicalizing URLs to a shared
"clean" format using safe_parse_urls and then matching on
that key.
This is suitable for large crawl exports.
Usage
canonical_join(
data_A,
data_B,
col_A = "URL",
col_B = "URL",
suffix_A = "_A",
suffix_B = "_B",
name_A = NULL,
name_B = NULL,
join = c("inner", "left", "right", "full"),
collision = c("first", "all", "error"),
on_parse_error = c("keep", "drop", "error"),
join_parse_status = c("ok", "ok_or_warning"),
...
)
Arguments
data_A |
A data frame containing URLs for the left side of the join. |
data_B |
A data frame containing URLs for the right side of the join. |
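col_A |
Character string, the name of the column in data_A that contains the URLs. Defaults to "URL". |
col_B |
Character string, the name of the column in data_B that contains the URLs. Defaults to "URL". |
suffix_A |
Character string, suffix to append to the names of columns coming from data_A. Defaults to "_A". |
suffix_B |
Character string, suffix to append to the names of columns coming from data_B. Defaults to "_B". |
name_A |
Character string, the name of the output column holding the
original URLs from data_A. If NULL (the default), the name is derived from the input expression. |
name_B |
Character string, the name of the output column holding the
original URLs from data_B. If NULL (the default), the name is derived from the input expression. |
join |
Join type: "inner" (default), "left", "right", or "full". |
collision |
How to handle duplicate canonical keys within inputs.
One of "first" (default), "all", or "error". |
on_parse_error |
How to handle URLs that fail canonicalization.
One of "keep" (default), "drop", or "error". |
join_parse_status |
Which parse statuses yield joinable canonical keys.
One of "ok" (default) or "ok_or_warning". |
... |
Additional arguments forwarded to safe_parse_urls, such as protocol_handling or www_handling. |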
Value
A data frame representing the join. The output includes:
The original URL columns (named via name_A/name_B, or after the input expressions when those are NULL).
JoinKey: the canonicalized URL used for matching.
All other columns from data_A and data_B with suffixes applied.
Returns an empty data frame with the expected structure if no matches are found or if inputs are invalid.
Examples
A <- data.frame(
URL = c("http://Example.com/Page", "http://example.com/Other"),
ValA = 1:2, stringsAsFactors = FALSE
)
B <- data.frame(
URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
ValB = c("x", "y"), stringsAsFactors = FALSE
)
canonical_join(
A, B,
protocol_handling = "strip",
www_handling = "strip",
case_handling = "lower_host",
trailing_slash_handling = "strip"
)
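The idea of joining on a canonical key can be illustrated without the package. The following is a minimal base R sketch, not rurl's implementation: toy_key() is a hypothetical, deliberately simplified normalizer (it strips the scheme, a leading www label, and trailing slashes, and lowercases only the host), and the join itself is a plain merge() on the derived key.

```r
# Hypothetical, simplified normalizer. This is NOT rurl's canonicalization;
# it only mimics the options used in the example above.
toy_key <- function(x) {
  x <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", x)  # drop the scheme
  x <- sub("^www[0-9]*\\.", "", x)                # drop a leading www / wwwN label
  x <- sub("/+$", "", x)                          # drop trailing slashes
  host <- sub("/.*$", "", x)                      # text before the first slash
  rest <- substring(x, nchar(host) + 1L)          # path, original case preserved
  paste0(tolower(host), rest)
}

A <- data.frame(
  URL = c("http://Example.com/Page", "http://example.com/Other"),
  ValA = 1:2, stringsAsFactors = FALSE
)
B <- data.frame(
  URL = c("https://www.example.com/Page/", "http://example.com/Miss"),
  ValB = c("x", "y"), stringsAsFactors = FALSE
)

A$JoinKey <- toy_key(A$URL)
B$JoinKey <- toy_key(B$URL)

# Inner join on the shared key; suffixes disambiguate the two URL columns.
joined <- merge(A, B, by = "JoinKey", suffixes = c("_A", "_B"))
print(joined)
```

Only the first rows of A and B share a key here, so the inner join yields a single row. The real function additionally handles parse errors and key collisions, which this sketch ignores.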
Get cleaned URLs
Description
This function returns the cleaned version of the URLs after applying
protocol, www, case, and trailing slash handling rules. The result is a
normalized canonical key composed of scheme, host, and path only; port,
query, fragment, and userinfo are intentionally excluded (use
get_port, get_query, get_fragment,
or get_userinfo for those).
Usage
get_clean_url(
url,
protocol_handling = "keep",
www_handling = "none",
case_handling = "lower_host",
trailing_slash_handling = "none",
index_page_handling = "keep",
path_normalization = "none",
scheme_relative_handling = "keep",
subdomain_levels_to_keep = NULL,
host_encoding = "keep",
path_encoding = "keep"
)
Arguments
url |
A character vector containing URLs to be parsed. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
case_handling |
A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved).
|
trailing_slash_handling |
A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none".
|
index_page_handling |
A character string specifying how to handle index/default pages. Defaults to "keep".
|
path_normalization |
How to normalize path structure. Defaults to "none".
|
scheme_relative_handling |
How to handle URLs starting with "//". Defaults to "keep".
|
subdomain_levels_to_keep |
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'.
|
host_encoding |
How to present the host in 'clean_url'. Defaults to "keep".
|
path_encoding |
How to handle percent-encoding in the path for 'clean_url'. Defaults to "keep".
|
Value
A character vector of cleaned URLs.
Examples
get_clean_url("Example.COM/Path") # Default lower_host: host folds, path kept
get_clean_url(
"Example.COM/Path",
case_handling = "keep",
trailing_slash_handling = "keep"
)
get_clean_url(
"Example.COM/Path/",
case_handling = "upper",
trailing_slash_handling = "strip"
)
get_clean_url("http://example.com", www_handling = "strip")
get_clean_url(
"http://deep.sub.domain.example.com/path",
subdomain_levels_to_keep = 0
)
# -> "http://example.com/path"
get_clean_url(
"http://www.deep.sub.domain.example.com/path",
subdomain_levels_to_keep = 1,
www_handling = "strip"
)
# -> "http://domain.example.com/path"
get_clean_url(
"http://www.deep.sub.domain.example.com/path",
subdomain_levels_to_keep = 1,
www_handling = "keep"
)
# -> "http://www.domain.example.com/path"
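The default "lower_host" behaviour described above (scheme and host folded to lowercase, path left untouched) can be sketched in base R. The helper fold_scheme_host() below is hypothetical and is not how the package implements it; it simply lowercases everything up to the first "/", "?", or "#" after an optional scheme.

```r
# Hypothetical helper, not part of rurl: lowercase scheme and authority only.
# Limitation of this sketch: any userinfo in the authority is lowercased too.
fold_scheme_host <- function(url) {
  # Length of the optional "scheme://" plus everything before the first
  # "/", "?" or "#". The pattern can match the empty string, so it always matches.
  head_len <- attr(
    regexpr("^([A-Za-z][A-Za-z0-9+.-]*://)?[^/?#]*", url),
    "match.length"
  )
  paste0(tolower(substr(url, 1L, head_len)), substring(url, head_len + 1L))
}

folded <- fold_scheme_host(c("HTTP://Example.COM/Path?Q=1", "Example.COM/Path"))
print(folded)
```

The path and query keep their original case because only the leading scheme-and-host span is passed through tolower().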
Get domain names
Description
Extracts the registered domain name from a URL (e.g., "example.com"). Relies on the Public Suffix List.
Usage
get_domain(
url,
protocol_handling = "keep",
www_handling = "none",
subdomain_levels_to_keep = NULL,
source = c("all", "private", "icann")
)
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
subdomain_levels_to_keep |
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'.
|
source |
Which PSL source to use: "all", "private", or "icann". |
Value
A character vector of domain names.
Examples
get_domain("http://www.example.co.uk/path")
Get URL fragments
Description
Extracts the fragment component of a URL.
Usage
get_fragment(url, protocol_handling = "keep")
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
Value
A character vector of fragments.
Examples
get_fragment("http://example.com/path#section")
Get URL hosts
Description
Extracts the host component of a URL.
Usage
get_host(
url,
protocol_handling = "keep",
www_handling = "none",
subdomain_levels_to_keep = NULL,
case_handling = c("lower", "keep", "upper", "lower_host")
)
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
subdomain_levels_to_keep |
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'.
|
case_handling |
How to handle casing of the returned host. Defaults to "lower". |
Value
A character vector of URL hosts.
Examples
get_host("http://sub.example.com:8080")
get_host(
"http://www.two.one.example.com",
subdomain_levels_to_keep = 1
) # Result: "www.one.example.com"
get_host(
"http://www.two.one.example.com",
www_handling = "strip",
subdomain_levels_to_keep = 1
) # Result: "one.example.com"
get_host(
"http://www.two.one.example.com",
www_handling = "keep",
subdomain_levels_to_keep = 1
) # Result: "www.one.example.com"
get_host(
"http://three.two.one.example.com",
subdomain_levels_to_keep = 0
) # Result: "example.com"
get_host(
"http://www.three.two.one.example.com",
subdomain_levels_to_keep = 0
) # Result: "www.example.com"
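The interaction between subdomain_levels_to_keep and the www prefix in the examples above can be reproduced with a small base R sketch. The helper trim_subdomains() is hypothetical and not part of rurl; it takes the registered domain as an argument because the sketch has no Public Suffix List, and its keep_www flag stands in for the www_handling option.

```r
# Hypothetical helper, not part of rurl. `domain` is the registered domain,
# supplied by the caller because this sketch does not consult the PSL.
trim_subdomains <- function(host, domain, n, keep_www = TRUE) {
  if (!endsWith(host, paste0(".", domain))) return(host)  # no subdomain labels
  prefix <- substr(host, 1L, nchar(host) - nchar(domain) - 1L)
  labels <- strsplit(prefix, ".", fixed = TRUE)[[1]]
  # Set a leading www / wwwN label aside; it is handled separately.
  has_www <- length(labels) > 0L && grepl("^www[0-9]*$", labels[1])
  www <- if (has_www) labels[1] else character(0)
  if (has_www) labels <- labels[-1]
  kept <- utils::tail(labels, n)  # the n labels closest to the registered domain
  paste(c(if (keep_www) www, kept, domain), collapse = ".")
}

r1 <- trim_subdomains("www.two.one.example.com", "example.com", 1)
r2 <- trim_subdomains("www.two.one.example.com", "example.com", 1, keep_www = FALSE)
r3 <- trim_subdomains("three.two.one.example.com", "example.com", 0)
r4 <- trim_subdomains("www.three.two.one.example.com", "example.com", 0)
print(c(r1, r2, r3, r4))
```

The four calls mirror the documented get_host() examples: the www label survives independently of the level count, and the labels that are kept are those nearest the registered domain.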
Get the parse status of URLs
Description
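Returns the parsing outcome for each URL, such as "ok" or "error". See the parse_status component of safe_parse_url for the possible values.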
Usage
get_parse_status(
url,
protocol_handling = "keep",
www_handling = "none",
subdomain_levels_to_keep = NULL
)
Arguments
url |
A character vector of URLs to be parsed. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
subdomain_levels_to_keep |
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'.
|
Value
A character vector with the parse status of each URL.
Examples
get_parse_status(
c("http://example.com", "ftp://example.com", "mailto:user@example.com")
)
get_parse_status(c("http://example.com", "not-a-url"))
Get URL passwords
Description
Extracts the password component of a URL.
Usage
get_password(url, protocol_handling = "keep")
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
Value
A character vector of passwords.
Get URL paths
Description
Extracts the path component of a URL.
Usage
get_path(
url,
protocol_handling = "keep",
case_handling = c("lower_host", "keep", "lower", "upper")
)
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
case_handling |
How to handle casing of the returned path. Defaults to "lower_host", which preserves the path's original casing (paths are case-sensitive per RFC 3986 §6.2.2.1). Use "lower"/"upper" to force a case. |
Value
A character vector of URL paths.
Examples
get_path("http://example.com/some/path?query=1")
Get URL ports
Description
Extracts the port component of a URL.
Usage
get_port(url, protocol_handling = "keep")
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
Value
An integer vector of ports.
Examples
get_port("http://example.com:8080/path")
Get URL query strings
Description
Extracts the query component of a URL, optionally parsing it into a list.
Usage
get_query(
url,
protocol_handling = "keep",
format = c("string", "list"),
decode = TRUE
)
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
format |
Return format: "string" (default) or "list" for parsed elements. |
decode |
Logical; if TRUE and format="list", percent-decodes keys/values. |
Value
A character vector (format="string") or list (format="list").
Examples
get_query("http://example.com/path?a=1&b=2")
get_query("http://example.com/path?a=1&b=2", format = "list")
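To show what parsing a query string into key/value pairs involves, here is a base R sketch. The function parse_query() is hypothetical; the exact shape of the list returned by get_query(format = "list") is not specified above, so this only illustrates the general technique of splitting on "&" and "=" and optionally percent-decoding with utils::URLdecode().

```r
# Hypothetical helper, not part of rurl: split one query string into a named list.
parse_query <- function(query, decode = TRUE) {
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  pairs <- pairs[nzchar(pairs)]                # ignore empty segments ("a=1&&b=2")
  keys <- sub("=.*$", "", pairs)               # text before the first "="
  has_eq <- grepl("=", pairs, fixed = TRUE)
  vals <- character(length(pairs))             # keys without "=" get an empty value
  vals[has_eq] <- sub("^[^=]*=", "", pairs[has_eq])
  if (decode) {
    # URLdecode handles %XX escapes; it does not turn "+" into a space.
    keys <- vapply(keys, utils::URLdecode, character(1), USE.NAMES = FALSE)
    vals <- vapply(vals, utils::URLdecode, character(1), USE.NAMES = FALSE)
  }
  stats::setNames(as.list(vals), keys)
}

decoded <- parse_query("a=1&b=hello%20world")
raw_vals <- parse_query("a=1&b=hello%20world", decode = FALSE)
str(decoded)
```

With decode = FALSE the values are returned exactly as they appear in the URL, which is useful when the encoded form must be preserved.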
Get URL schemes
Description
Extracts the scheme (protocol) of a URL.
Usage
get_scheme(url, protocol_handling = "keep")
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
Value
A character vector of URL schemes.
Examples
get_scheme("https://example.com")
Get URL subdomains
Description
Extracts the subdomain component of a URL.
Usage
get_subdomain(
url,
protocol_handling = "keep",
www_handling = "none",
source = c("all", "private", "icann"),
include_www = FALSE,
format = c("string", "labels")
)
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
source |
Which PSL source to use: "all", "private", or "icann". |
include_www |
Logical; if FALSE (default), removes a leading www/www[0-9]* label only when it is the sole subdomain label. |
format |
Return format: "string" (default) or "labels" for a character vector of labels. |
Value
A character vector (format="string") or list of label vectors (format="labels").
Examples
get_subdomain("http://www.blog.example.co.uk")
get_subdomain("http://www.blog.example.co.uk", format = "labels")
Extract the top-level domain (TLD) from a URL
Description
Extracts the top-level domain from each URL. Uses safe_parse_url internally, so results benefit from all memoization layers.
Usage
get_tld(url, source = c("all", "private", "icann"))
Arguments
url |
A character vector of URLs. |
source |
Which TLD source to use: "all", "icann", or "private". |
Value
A character vector of TLDs.
Examples
get_tld("example.com")
Get URL user names
Description
Extracts the user component of a URL.
Usage
get_user(url, protocol_handling = "keep")
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
Value
A character vector of user names.
Get URL userinfo
Description
Extracts the userinfo component of a URL (user or user:password).
Usage
get_userinfo(url, protocol_handling = "keep")
Arguments
url |
A character vector of URLs. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
Value
A character vector of userinfo values.
Configure the rurl memoization caches
Description
Enables or disables individual caches and sets an optional bound on the
full_parse cache. Called with no arguments, it leaves the
configuration unchanged and returns the current state.
Usage
rurl_cache_config(
full_parse = NULL,
domain = NULL,
tld = NULL,
puny_encode = NULL,
puny_decode = NULL,
max_full_parse = NULL
)
Arguments
full_parse |
Logical; enable/disable the full URL parse cache. |
domain |
Logical; enable/disable the registered-domain cache. |
tld |
Logical; enable/disable the TLD-extraction cache. |
puny_encode |
Logical; enable/disable the IDNA/Punycode encode cache. |
puny_decode |
Logical; enable/disable the Punycode decode cache. |
max_full_parse |
A single number giving the maximum number of entries allowed in the full_parse cache (Inf for unbounded). |
Details
Disabling a cache stops new writes to it (existing entries are left in
place until rurl_clear_caches is called). When
full_parse reaches max_full_parse entries, it is reset before
the next new entry is stored, so its peak size never exceeds the bound; the
default of Inf preserves the historical unbounded behavior. The
domain, tld, puny_encode, and puny_decode caches
are unbounded by design (each stays small — bounded by the number of unique
hosts/labels seen, not URL+option combinations).
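The reset-when-full policy described above can be sketched with an environment-backed cache in base R. This is an illustration of the policy only, under the assumption that a full cache is emptied entirely before the next new key is stored; make_bounded_cache() and its lookup/put/size members are hypothetical names, not rurl internals.

```r
# Hypothetical sketch of a bounded memo cache that resets when full.
make_bounded_cache <- function(max_entries = Inf) {
  entries <- new.env(parent = emptyenv())
  size <- function() length(ls(entries, all.names = TRUE))
  list(
    lookup = function(key) {
      if (exists(key, envir = entries, inherits = FALSE)) {
        get(key, envir = entries, inherits = FALSE)
      } else {
        NULL
      }
    },
    put = function(key, value) {
      is_new <- !exists(key, envir = entries, inherits = FALSE)
      # Reset before storing a new key once the bound is reached,
      # so the entry count never exceeds max_entries.
      if (is_new && size() >= max_entries) {
        rm(list = ls(entries, all.names = TRUE), envir = entries)
      }
      assign(key, value, envir = entries)
      invisible(value)
    },
    size = size
  )
}

cache <- make_bounded_cache(max_entries = 2)
cache$put("a", 1)
cache$put("b", 2)
cache$put("c", 3)  # the cache is full, so it is emptied before "c" is stored
print(cache$size())
```

A full reset is cheaper to implement than per-entry eviction such as LRU, at the cost of occasionally recomputing entries that were still useful.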
Value
Invisibly, the updated rurl_cache_info data.frame.
See Also
rurl_cache_info, rurl_clear_caches
Examples
rurl_cache_config(max_full_parse = 10000)
rurl_cache_config(domain = FALSE)
rurl_cache_config() # inspect current configuration
Inspect the rurl memoization caches
Description
Reports the number of entries currently held in each memoization cache, along with whether the cache is enabled and any configured entry bound.
Usage
rurl_cache_info()
Value
A data.frame with one row per cache (full_parse,
domain, tld, puny_encode, puny_decode) and
columns entries, enabled, and max_entries.
See Also
rurl_cache_config, rurl_clear_caches
Examples
get_domain("https://www.example.com")
rurl_cache_info()
Clear all rurl caches
Description
Clears the memoization caches used by rurl functions. This is useful if you need to free memory or if you've updated the PSL data.
Usage
rurl_clear_caches()
Value
Invisibly returns NULL.
Examples
rurl_clear_caches()
Parse a URL comprehensively, extracting and deriving all relevant components.
Description
This function serves as the core URL processing engine. It parses a URL, handles protocol and www prefix modifications, detects IP addresses, and derives components like the registered domain and top-level domain (TLD). Results are memoized for performance when processing large datasets.
Usage
safe_parse_url(
url,
protocol_handling = c("keep", "none", "strip", "http", "https"),
www_handling = c("none", "strip", "keep", "if_no_subdomain"),
tld_source = c("all", "private", "icann"),
case_handling = c("lower_host", "keep", "lower", "upper"),
trailing_slash_handling = c("none", "keep", "strip"),
index_page_handling = c("keep", "strip"),
path_normalization = c("none", "collapse_slashes", "dot_segments", "both"),
scheme_relative_handling = c("keep", "http", "https", "error"),
subdomain_levels_to_keep = NULL,
host_encoding = c("keep", "idna", "unicode"),
path_encoding = c("keep", "encode", "decode")
)
Arguments
url |
A single URL string to be parsed. For vectors, use
safe_parse_urls. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
tld_source |
Which TLD source to use for TLD extraction: "all", "icann", or "private". Defaults to "all". |
case_handling |
A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved).
|
trailing_slash_handling |
A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none".
|
index_page_handling |
A character string specifying how to handle index/default pages. Defaults to "keep".
|
path_normalization |
How to normalize path structure. Defaults to "none".
|
scheme_relative_handling |
How to handle URLs starting with "//". Defaults to "keep".
|
subdomain_levels_to_keep |
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'.
|
host_encoding |
How to present the host in 'clean_url'. Defaults to "keep".
|
path_encoding |
How to handle percent-encoding in the path for 'clean_url'. Defaults to "keep".
|
Value
A named list with the following components:
'original_url': The original URL string provided.
'scheme': The scheme (e.g., "http", "https").
'host': The host (e.g., "www.example.com"). NA if the host becomes empty after processing.
'port': The port number.
'path': The path component (e.g., "/path/to/resource").
'query': The query string (e.g., "name=value").
'fragment': The fragment identifier (e.g., "section").
'user': The user name for authentication.
'password': The password for authentication.
'domain': The registered domain name (e.g., "example.com"). NA if host is an IP, empty, or derivation fails.
'tld': The top-level domain (e.g., "com"). NA if host is an IP, empty, or derivation fails.
'is_ip_host': Logical, TRUE if the host is an IP address.
'clean_url': A normalized canonical key reconstructed from scheme, host, and path only, after processing and with case handling applied. Port, query, fragment, and userinfo are intentionally excluded (use the dedicated components above to retrieve them). With 'path_encoding = "decode"' the path is shown decoded, so 'clean_url' is human-readable rather than guaranteed URL-safe. NA if host is empty/NA.
'parse_status': Character string indicating parsing outcome ("ok", "ok-ftp", "ok-scheme-relative", "error", "warning-no-tld", "warning-invalid-tld", "warning-public-suffix").
Returns 'NULL' if the URL is fundamentally unparseable (e.g., NA, empty) or uses a disallowed scheme.
Examples
safe_parse_url(
"http://www.Example.com/Path?q=1#Frag",
protocol_handling = "keep",
case_handling = "lower"
)
safe_parse_url(
"Example.com/Another",
protocol_handling = "none",
www_handling = "keep",
case_handling = "upper",
trailing_slash_handling = "keep"
)
safe_parse_url(
"example.com",
www_handling = "if_no_subdomain"
) # -> www.example.com
safe_parse_url(
"sub.example.com",
www_handling = "if_no_subdomain"
) # -> sub.example.com
safe_parse_url(
"www1.example.com",
www_handling = "if_no_subdomain"
) # -> www.example.com
safe_parse_url(
"www1.sub.example.com",
www_handling = "if_no_subdomain"
) # -> www.sub.example.com
safe_parse_url(
"http://www.example.com/path/",
trailing_slash_handling = "strip"
)
safe_parse_url("192.168.1.1/test")
safe_parse_url("ftp://user:pass@ftp.example.co.uk:21/file.txt")
safe_parse_url(
"http://deep.sub.domain.example.com",
subdomain_levels_to_keep = 0
)
safe_parse_url(
"http://deep.sub.domain.example.com",
subdomain_levels_to_keep = 1
)
safe_parse_url(
"http://www.deep.sub.domain.example.com",
www_handling = "keep",
subdomain_levels_to_keep = 0
)
safe_parse_url(
"http://www.deep.sub.domain.example.com",
www_handling = "keep",
subdomain_levels_to_keep = 1
)
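The four "if_no_subdomain" examples above suggest a simple rule, sketched here in base R. This is an inference from the documented outcomes rather than the package's actual logic: www_if_no_subdomain() is a hypothetical helper, and the registered domain is passed in because the sketch does not consult the Public Suffix List.

```r
# Hypothetical helper, not part of rurl. One rule consistent with the
# documented examples: normalize an existing www / wwwN label to "www",
# otherwise add "www." only when the host is the bare registered domain.
www_if_no_subdomain <- function(host, domain) {
  if (grepl("^www[0-9]*\\.", host)) {
    return(sub("^www[0-9]*\\.", "www.", host))
  }
  if (identical(host, domain)) paste0("www.", host) else host
}

w1 <- www_if_no_subdomain("example.com", "example.com")
w2 <- www_if_no_subdomain("sub.example.com", "example.com")
w3 <- www_if_no_subdomain("www1.example.com", "example.com")
w4 <- www_if_no_subdomain("www1.sub.example.com", "example.com")
print(c(w1, w2, w3, w4))
```

The results match the hosts shown in the example comments; other inputs may be treated differently by the package itself.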
Parse multiple URLs and return a data.frame of components
Description
Vectorized wrapper around safe_parse_url that returns a
data.frame with one row per input URL.
Usage
safe_parse_urls(
url,
protocol_handling = c("keep", "none", "strip", "http", "https"),
www_handling = c("none", "strip", "keep", "if_no_subdomain"),
tld_source = c("all", "private", "icann"),
case_handling = c("lower_host", "keep", "lower", "upper"),
trailing_slash_handling = c("none", "keep", "strip"),
index_page_handling = c("keep", "strip"),
path_normalization = c("none", "collapse_slashes", "dot_segments", "both"),
scheme_relative_handling = c("keep", "http", "https", "error"),
subdomain_levels_to_keep = NULL,
host_encoding = c("keep", "idna", "unicode"),
path_encoding = c("keep", "encode", "decode")
)
Arguments
url |
A character vector of URLs to be parsed. |
protocol_handling |
A character string specifying how to handle protocols. Defaults to "keep".
|
www_handling |
A character string specifying how to handle "www" and "www[number]" prefixes in the host. Defaults to "none".
|
tld_source |
Which TLD source to use for TLD extraction: "all", "icann", or "private". Defaults to "all". |
case_handling |
A character string specifying how to handle the case of the cleaned URL. Defaults to "lower_host", the RFC 3986 §6.2.2.1 normalization (scheme and host are case-insensitive and folded to lowercase; the path is case-sensitive and preserved).
|
trailing_slash_handling |
A character string specifying how to handle trailing slashes in the path component of the cleaned URL. Defaults to "none".
|
index_page_handling |
A character string specifying how to handle index/default pages. Defaults to "keep".
|
path_normalization |
How to normalize path structure. Defaults to "none".
|
scheme_relative_handling |
How to handle URLs starting with "//". Defaults to "keep".
|
subdomain_levels_to_keep |
An integer or NULL. Determines how many levels of subdomains are kept, in addition to any 'www.' prefix handled by 'www_handling'.
|
host_encoding |
How to present the host in 'clean_url'. Defaults to "keep".
|
path_encoding |
How to handle percent-encoding in the path for 'clean_url'. Defaults to "keep".
|
Value
A data.frame with one row per URL and the same fields returned by
safe_parse_url. Invalid inputs return NA fields with
parse_status = "error".
Examples
safe_parse_urls(c("example.com", "https://www.example.com/path"))