This vignette gives an overview over adaR and url
parsing in general.
A URL (Uniform Resource Locator) serves as a reference to a web resource and has specific components that give information about how the resource can be fetched. The table below gives an overview of the components of a valid URL.
| Name | Description | Example | 
|---|---|---|
| Protocol | Indicates the protocol to access the resource. | http:// | 
| Username & Password | Contains authentication info. Separated by a colon and followed by
an @. | username:password@ | 
| Hostname | Refers to the domain name or IP of the server where the resource resides. | example.comor192.168.1.1 | 
| Port | Specifies the technical gate used to access the resources on the server. | :8080 | 
| Pathname | Provides info about the location of the resource on the server, often like a filesystem path. | /directory/file.htmlor/images/pic.jpg | 
| Query | Provides additional parameters, often for search queries or data retrieval. | ?key1=value1&key2=value2 | 
| Fragment | Refers to a specific part of a web resource or document, like an anchor. | #section2 | 
A full URL might look something like this:
https://username:password@example.com:8080/directory/file.html?key1=value1&key2=value2#section2However, URLs can be as simple as just a scheme and host (e.g.,
http://example.com). The presence and specific combination
of these components can vary based on the exact nature and purpose of
the URL.
The terms are not necessarily unambiguous and there are further (sub)
terms that need explanation. The protocol can also be
called scheme. hostname+port is
called host in adaR. Additionally, the
query is referred to as search and the
fragment as hash in adaR.
Some more relevant subcomponents are given in the following table.
| Term | Description | Example | 
|---|---|---|
| Domain | A name that represents an IP address of the server which hosts the website. It’s a human-readable form of an address where web resources can be accessed. | example.com | 
| Subdomain | A subset or a smaller part of the main domain. It’s used to organize and navigate to different sections or services of a website. | blog.example.com | 
| Top-Level Domain (TLD) | The last segment of the domain name. It follows the last dot in the domain name. Indicates the purpose or origin of a domain. | .com,.net,.org | 
| Public Suffix | A domain under which Internet users can directly register their own domain names. Public suffixes include TLDs as well as certain subdomains under which domains can be registered. | co.uk,com.au,github.io | 
But wait, there is more. The table below gives the definition of
several terms that are of relevance when dealing with URLs and the
adaR package.
| Term | Description | Example | 
|---|---|---|
| Authority | Combines user info, hostname, and port. Identifies the party responsible for the resource. | userinfo@host:port | 
| Relative URL | A URL without the scheme and host, often starting with a path. Relative to a base URL. | /path/to/file.html | 
| Absolute URL | A full URL specifying scheme and host. | https://example.com/path/to/file.html | 
| Base URL | The URL to which relative URLs are resolved. | <base href="https://example.com/page/"> | 
| Percent Encoding | Encodes special characters within a URI using %followed by two hexadecimal digits. | Hello%20World(represents “Hello World”) | 
| Punycode | Represents Unicode characters in domain names using ASCII. | xn--80akhbyknj4f(representsпример) | 
| URL Canonicalization | Converts a URL into a standardized or normalized format. | From https://example.com:443/../a.htmltohttps://example.com/a.html | 
| URL Shortening | Converts a long URL into a significantly shorter version that redirects to the original URL. | Shortening https://example.com/some-long-pathmight
givehttps://exmpl.co/abc123 | 
| URL Slug | Part of a URL derived from the title of a webpage, usually human-readable and used for SEO. | For a post titled “How to Bake”, slug might be how-to-bake | 
| URI vs URL | URI is a broader category including URLs (locator) and URNs (name). All URLs are URIs, but not all URIs are URLs. | URI: mailto:john.doe@example.com, URL:https://example.com | 
The underlying C++ code of adaR, ada-url is “WHATWG copliant”.
Who/What is the WHATWG?
The Web Hypertext Application Technology Working Group (WHATWG) is a community of people interested in evolving the web through standards and tests.
It was founded by individuals of Apple, the Mozilla Foundation, and Opera Software in 2004, after a W3C workshop. Apple, Mozilla and Opera were becoming increasingly concerned about the W3C’s direction with XHTML, lack of interest in HTML, and apparent disregard for the needs of real-world web developers. So, in response, these organisations set out with a mission to address these concerns and the Web Hypertext Application Technology Working Group was born.
What is the WHATWG working on?
The WHATWG’s focus is on standards implementable in web browsers, and
their associated tests. Their existing work can be found here.
The standard of relevance for this package, is the url standard. Being “WHATWG compliant” means, that ada-url follows this url standard.
The function ada_url_parse() decomposes a url into the
components shown in the first table.
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#>                                                      href protocol username
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag   https:   user_1
#>     password             host    hostname port pathname search  hash
#> 1 password_1 example.org:8080 example.org 8080     /api   ?q=1 #fragThe function can deal with punycode and percent encoding and does generally handle all types of edge cases well.
corner_cases <- c(
    "https://example.com:8080", "http://user:password@example.com",
    "http://[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:8080", "https://example.com/path/to/resource?query=value&another=thing#fragment",
    "http://sub.sub.example.com", "ftp://files.example.com:2121/download/file.txt",
    "http://example.com/path with spaces/and&special=characters?",
    "https://user:pa%40ssword@example.com/path", "http://example.com/..//a/b/../c/./d.html",
    "https://example.com:8080/over/under?query=param#and-a-fragment",
    "http://192.168.0.1/path/to/resource", "http://3com.com/path/to/resource",
    "http://example.com/%7Eusername/", "https://example.com/a?query=value&query=value2",
    "https://example.com/a/b/c/..", "ws://websocket.example.com:9000/chat",
    "https://example.com:65535/edge-case-port", "file:///home/user/file.txt",
    "http://example.com/a/b/c/%2F%2F", "http://example.com/a/../a/../a/../a/",
    "https://example.com/./././a/", "http://example.com:8080/a;b?c=d#e",
    "http://@example.com", "http://example.com/@test", "http://example.com/@@@/a/b",
    "https://example.com:0/", "http://example.com/%25path%20with%20encoded%20chars",
    "https://example.com/path?query=%26%3D%3F%23", "http://example.com:8080/?query=value#fragment#fragment2",
    "https://example.xn--80akhbyknj4f/path/to/resource", "https://example.co.uk/path/to/resource",
    "http://username:pass%23word@example.net", "ftp://downloads.example.edu:3030/files/archive.zip",
    "https://example.com:8080/this/is/a/deeply/nested/path/to/a/resource",
    "http://another-example.com/..//test/./demo.html", "https://sub2.sub1.example.org:5000/login?user=test#section2",
    "ws://chat.example.biz:5050/livechat", "http://192.168.1.100/a/b/c/d",
    "https://secure.example.shop/cart?item=123&quantity=5", "http://example.travel/%60%21%40%23%24%25%5E%26*()",
    "https://example.museum/path/to/artifact?search=ancient", "ftp://secure-files.example.co:4040/files/document.docx",
    "https://test.example.aero/booking?flight=abc123", "http://example.asia/%E2%82%AC%E2%82%AC/path",
    "http://subdomain.example.tel/contact?name=john", "ws://game-server.example.jobs:2020/match?id=xyz",
    "http://example.mobi/path/with/mobile/content", "https://example.name/family/tree?name=smith",
    "http://192.168.2.2/path?query1=value1&query2=value2", "http://example.pro/professional/services",
    "https://example.info/information/page", "http://example.int/internal/systems/login",
    "https://example.post/postal/services", "http://example.xxx/age/verification",
    "https://example.xxx/another/edge/case/path?with=query#and-fragment"
)
df <- ada_url_parse(corner_cases)
df[, -1]
#>    protocol username  password                                host
#> 1    https:                                       example.com:8080
#> 2     http:     user  password                         example.com
#> 3     http:                    [2001:db8:85a3::8a2e:370:7334]:8080
#> 4    https:                                            example.com
#> 5     http:                                    sub.sub.example.com
#> 6      ftp:                                 files.example.com:2121
#> 7     http:                                            example.com
#> 8    https:     user pa@ssword                         example.com
#> 9     http:                                            example.com
#> 10   https:                                       example.com:8080
#> 11    http:                                            192.168.0.1
#> 12    http:                                               3com.com
#> 13    http:                                            example.com
#> 14   https:                                            example.com
#> 15   https:                                            example.com
#> 16      ws:                             websocket.example.com:9000
#> 17   https:                                      example.com:65535
#> 18    file:                                                       
#> 19    http:                                            example.com
#> 20    http:                                            example.com
#> 21   https:                                            example.com
#> 22    http:                                       example.com:8080
#> 23    http:                                            example.com
#> 24    http:                                            example.com
#> 25    http:                                            example.com
#> 26   https:                                          example.com:0
#> 27    http:                                            example.com
#> 28   https:                                            example.com
#> 29    http:                                       example.com:8080
#> 30   https:                                      example.испытание
#> 31   https:                                          example.co.uk
#> 32    http: username pass#word                         example.net
#> 33     ftp:                             downloads.example.edu:3030
#> 34   https:                                       example.com:8080
#> 35    http:                                    another-example.com
#> 36   https:                             sub2.sub1.example.org:5000
#> 37      ws:                                  chat.example.biz:5050
#> 38    http:                                          192.168.1.100
#> 39   https:                                    secure.example.shop
#> 40    http:                                         example.travel
#> 41   https:                                         example.museum
#> 42     ftp:                           secure-files.example.co:4040
#> 43   https:                                      test.example.aero
#> 44    http:                                           example.asia
#> 45    http:                                  subdomain.example.tel
#> 46      ws:                          game-server.example.jobs:2020
#> 47    http:                                           example.mobi
#> 48   https:                                           example.name
#> 49    http:                                            192.168.2.2
#> 50    http:                                            example.pro
#> 51   https:                                           example.info
#> 52    http:                                            example.int
#> 53   https:                                           example.post
#> 54    http:                                            example.xxx
#> 55   https:                                            example.xxx
#>                          hostname  port
#> 1                     example.com  8080
#> 2                     example.com      
#> 3  [2001:db8:85a3::8a2e:370:7334]  8080
#> 4                     example.com      
#> 5             sub.sub.example.com      
#> 6               files.example.com  2121
#> 7                     example.com      
#> 8                     example.com      
#> 9                     example.com      
#> 10                    example.com  8080
#> 11                    192.168.0.1      
#> 12                       3com.com      
#> 13                    example.com      
#> 14                    example.com      
#> 15                    example.com      
#> 16          websocket.example.com  9000
#> 17                    example.com 65535
#> 18                                     
#> 19                    example.com      
#> 20                    example.com      
#> 21                    example.com      
#> 22                    example.com  8080
#> 23                    example.com      
#> 24                    example.com      
#> 25                    example.com      
#> 26                    example.com     0
#> 27                    example.com      
#> 28                    example.com      
#> 29                    example.com  8080
#> 30              example.испытание      
#> 31                  example.co.uk      
#> 32                    example.net      
#> 33          downloads.example.edu  3030
#> 34                    example.com  8080
#> 35            another-example.com      
#> 36          sub2.sub1.example.org  5000
#> 37               chat.example.biz  5050
#> 38                  192.168.1.100      
#> 39            secure.example.shop      
#> 40                 example.travel      
#> 41                 example.museum      
#> 42        secure-files.example.co  4040
#> 43              test.example.aero      
#> 44                   example.asia      
#> 45          subdomain.example.tel      
#> 46       game-server.example.jobs  2020
#> 47                   example.mobi      
#> 48                   example.name      
#> 49                    192.168.2.2      
#> 50                    example.pro      
#> 51                   example.info      
#> 52                    example.int      
#> 53                   example.post      
#> 54                    example.xxx      
#> 55                    example.xxx      
#>                                       pathname                       search
#> 1                                            /                             
#> 2                                            /                             
#> 3                                            /                             
#> 4                            /path/to/resource   ?query=value&another=thing
#> 5                                            /                             
#> 6                           /download/file.txt                             
#> 7     /path with spaces/and&special=characters                             
#> 8                                        /path                             
#> 9                                 //a/c/d.html                             
#> 10                                 /over/under                 ?query=param
#> 11                           /path/to/resource                             
#> 12                           /path/to/resource                             
#> 13                                 /~username/                             
#> 14                                          /a    ?query=value&query=value2
#> 15                                       /a/b/                             
#> 16                                       /chat                             
#> 17                             /edge-case-port                             
#> 18                         /home/user/file.txt                             
#> 19                                   /a/b/c///                             
#> 20                                         /a/                             
#> 21                                         /a/                             
#> 22                                        /a;b                         ?c=d
#> 23                                           /                             
#> 24                                      /@test                             
#> 25                                    /@@@/a/b                             
#> 26                                           /                             
#> 27                   /%path with encoded chars                             
#> 28                                       /path                  ?query=&=?#
#> 29                                           /                 ?query=value
#> 30                           /path/to/resource                             
#> 31                           /path/to/resource                             
#> 32                                           /                             
#> 33                          /files/archive.zip                             
#> 34 /this/is/a/deeply/nested/path/to/a/resource                             
#> 35                            //test/demo.html                             
#> 36                                      /login                   ?user=test
#> 37                                   /livechat                             
#> 38                                    /a/b/c/d                             
#> 39                                       /cart         ?item=123&quantity=5
#> 40                                /`!@#$%^&*()                             
#> 41                           /path/to/artifact              ?search=ancient
#> 42                        /files/document.docx                             
#> 43                                    /booking               ?flight=abc123
#> 44                                    /€€/path                             
#> 45                                    /contact                   ?name=john
#> 46                                      /match                      ?id=xyz
#> 47                   /path/with/mobile/content                             
#> 48                                /family/tree                  ?name=smith
#> 49                                       /path ?query1=value1&query2=value2
#> 50                      /professional/services                             
#> 51                           /information/page                             
#> 52                     /internal/systems/login                             
#> 53                            /postal/services                             
#> 54                           /age/verification                             
#> 55                     /another/edge/case/path                  ?with=query
#>                   hash
#> 1                     
#> 2                     
#> 3                     
#> 4            #fragment
#> 5                     
#> 6                     
#> 7                     
#> 8                     
#> 9                     
#> 10     #and-a-fragment
#> 11                    
#> 12                    
#> 13                    
#> 14                    
#> 15                    
#> 16                    
#> 17                    
#> 18                    
#> 19                    
#> 20                    
#> 21                    
#> 22                  #e
#> 23                    
#> 24                    
#> 25                    
#> 26                    
#> 27                    
#> 28                    
#> 29 #fragment#fragment2
#> 30                    
#> 31                    
#> 32                    
#> 33                    
#> 34                    
#> 35                    
#> 36           #section2
#> 37                    
#> 38                    
#> 39                    
#> 40                    
#> 41                    
#> 42                    
#> 43                    
#> 44                    
#> 45                    
#> 46                    
#> 47                    
#> 48                    
#> 49                    
#> 50                    
#> 51                    
#> 52                    
#> 53                    
#> 54                    
#> 55       #and-fragmentada_url_parse() is the power horse of adaR
which always returns all components of a URL. Specific components can be
parsed with the ada_get_*() set of functions.
ada_get_hostname(corner_cases)
#>  [1] "example.com"                    "example.com"                   
#>  [3] "[2001:db8:85a3::8a2e:370:7334]" "example.com"                   
#>  [5] "sub.sub.example.com"            "files.example.com"             
#>  [7] "example.com"                    "example.com"                   
#>  [9] "example.com"                    "example.com"                   
#> [11] "192.168.0.1"                    "3com.com"                      
#> [13] "example.com"                    "example.com"                   
#> [15] "example.com"                    "websocket.example.com"         
#> [17] "example.com"                    ""                              
#> [19] "example.com"                    "example.com"                   
#> [21] "example.com"                    "example.com"                   
#> [23] "example.com"                    "example.com"                   
#> [25] "example.com"                    "example.com"                   
#> [27] "example.com"                    "example.com"                   
#> [29] "example.com"                    "example.испытание"             
#> [31] "example.co.uk"                  "example.net"                   
#> [33] "downloads.example.edu"          "example.com"                   
#> [35] "another-example.com"            "sub2.sub1.example.org"         
#> [37] "chat.example.biz"               "192.168.1.100"                 
#> [39] "secure.example.shop"            "example.travel"                
#> [41] "example.museum"                 "secure-files.example.co"       
#> [43] "test.example.aero"              "example.asia"                  
#> [45] "subdomain.example.tel"          "game-server.example.jobs"      
#> [47] "example.mobi"                   "example.name"                  
#> [49] "192.168.2.2"                    "example.pro"                   
#> [51] "example.info"                   "example.int"                   
#> [53] "example.post"                   "example.xxx"                   
#> [55] "example.xxx"ada_has_*() can be used to check if certain components
are present or not.
ada_has_search(corner_cases)
#>  [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
#> [13] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#> [25] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
#> [37] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
#> [49]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUEThe package also implements a public suffix extractor public_suffix(), based on a lookup of the Public Suffix List. Note that from this list, we only include registry suffixes (e.g., com, co.uk), which are those controlled by a domain name registry and governed by ICANN. We do not include “private” suffixes (e.g., blogspot.com) that allow people to register subdomains. Hence, we use the term domain in the sense of “top domain under a registry suffix”. See https://github.com/google/guava/wiki/InternetDomainNameExplained for more details.
urls <- c(
    "https://subsub.sub.domain.co.uk",
    "https://domain.api.gov.uk",
    "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk"                            "gov.uk"                          
#> [3] "butthisispartoftheps.kawasaki.jp"If you are wondering about the last url. The list also contains
wildcard suffixes such as *.kawasaki.jp which need to be
matched.