| Type: | Package |
| Title: | Intelligently Peruse Data |
| Version: | 0.1.0 |
| Maintainer: | Scott McKenzie <sccmckenzie@gmail.com> |
| Description: | Facilitate extraction of key information from common datasets. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| LazyData: | true |
| Imports: | pastecs (≥ 1.3.21), stats, dplyr (≥ 1.0.0), rlang (≥ 0.4.3), tidyr (≥ 1.0.0), tibble, purrr, glue, tidyselect |
| RoxygenNote: | 7.1.1 |
| Suggests: | knitr, ggplot2, testthat (≥ 3.0.0), rmarkdown, mopac, hms, stringr, readr |
| Config/testthat/edition: | 3 |
| Depends: | R (≥ 3.3.0) |
| VignetteBuilder: | knitr |
| LinkingTo: | cpp11 |
| SystemRequirements: | C++11 |
| NeedsCompilation: | yes |
| Packaged: | 2021-07-03 18:58:25 UTC; Scott |
| Author: | Scott McKenzie [aut, cre], RStudio [cph] (internal functions from dplyr.R) |
| Repository: | CRAN |
| Date/Publication: | 2021-07-05 09:10:02 UTC |
Join tables based on overlapping intervals.
Description
User-friendly interface that synthesizes power of dplyr::left_join and findInterval.
Usage
break_join(x, y, brk = character(), by = NULL, ...)
Arguments
x |
A data frame. |
y |
Data frame containing desired reference information. |
brk |
Name of column in |
by |
Joining variables, if needed. See mutate-joins. |
... |
additional arguments automatically directed to |
Value
An object of the same type as x.
All
xrows will be returned.All columns between
xandyare returned.Rows in
yare matched withxbased on overlapping values ofbrk(e.g.findInterval(x$brk, y$brk, ...)).
Examples
# joining USA + UK leaders with population time-series
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))
# simple dataset
set.seed(1)
a <- data.frame(p = c(rep("A", 10), rep("B", 10)), q = runif(20, 0, 10))
b <- data.frame(p = c("A", "A", "B", "B"), q = c(3, 5, 6, 9), r = c("a1", "a2", "b1", "b2"))
break_join(a, b, brk = "q") # p identified as common variable automatically
break_join(a, b, brk = "q", by = "p") # same result
break_join(a, b, brk = "q", all.inside = TRUE) # note missing values have been filled
# joining toll prices with vehicle time-series
library(mopac)
library(dplyr, warn.conflicts = FALSE)
library(hms)
express %>%
mutate(time_hms = as_hms(time)) %>%
break_join(rates, brk = c("time_hms" = "time"))
Simulated records of radio station communications.
Description
Dataset intended to demonstrate usage of sift::conjecture.
Usage
comms
Format
An object of class tbl_df (inherits from tbl, data.frame) with 50000 rows and 4 columns.
Specialized "long to wide" reshaping
Description
On the surface, conjecture() appears similar to tidyr::pivot_wider(), but uses different logic tailored to a specific type of dataset:
column corresponding to
names_fromcontains only 2 levelsthere is no determinate combination of elements to fill 2 columns per row.
See vignette("conjecture") for more details.
Usage
conjecture(data, sort_by, names_from, names_first)
Arguments
data |
A data frame to reshape. |
sort_by |
Column name, as symbol. Plays a similar role as |
names_from |
Column name, as symbol. Used to differentiate anterior/posterior observations. Column must only contain 2 levels (missing values not allowed). |
names_first |
level in variable specified by |
Details
conjecture() uses the following routine to match elements:
Values in
sort_byare separated into two vectors: anterior and posterior.Each anterior element is matched with the closest posterior element measured by
sort_by.
Value
An object of the same type as data.
Examples
# See vignette("conjecture") for more examples
conjecture(comms, timestamp, type, "send")
Automatically cluster 1-dimensional continuous data.
Description
Automatically cluster 1-dimensional continuous data.
Usage
kluster(x, bw = "SJ", fixed = FALSE)
Arguments
x |
Vector to be clustered. Must contain at least 1 non-missing value. |
bw |
kernel bandwidth. Default "SJ" should suffice more application, however you can supply a custom numeric value. See ?stats::density for more information. |
fixed |
logical; if |
Value
An integer vector identifying the cluster corresponding to each element in x.
Examples
# Below vector clearly has 2 groups.
# kluster will identify these groups using kernel density estimation.
kluster(c(0.1, 0.2, 1))
# kluster shines in cases where manually assigning groups via "eyeballing" is impractical.
# Suppose we obtained vector 'x' without knowing how it was generated.
set.seed(1)
nodes <- runif(10, min = 0, max = 100)
x <- lapply(nodes, function(x) rnorm(10, mean = x, sd = 0.1))
x <- unlist(x)
kluster(x) # kluster reveals the natural grouping
kluster(x, bw = 10) # adjust bandwidth depending on application
# Example with faithful dataset
faithful$k <- kluster(faithful$eruptions)
library(ggplot2)
ggplot(faithful, aes(eruptions)) +
geom_density() +
geom_rug(aes(color = factor(k))) +
theme_minimal() +
scale_color_discrete(name = "k")
2020 New York Times Headlines
Description
Includes selected headlines and additional metadata for NYT articles throughout 2020. This dataset is not a comprehensive account of all major events from 2020.
Usage
nyt2020
Format
A data frame with 1,830 rows and 6 variables:
- headline
Article Headline
- abstract
Brief summary of article
- byline
Contributing Writers
- pub_date
Date of Publication
- section_name
NYT section in which article was published
- web_url
Article URL
...
Source
Obtained using NYT Developer Portal (Archive API)
Augmented data frame filtering.
Description
Imagine dplyr::filter that includes neighboring observations.
Choose how many observations to include by adjusting inputs sift.col and scope.
Usage
sift(.data, sift.col, scope, ...)
Arguments
.data |
A data frame. |
sift.col |
Column name, as symbol, to serve as "sifting/augmenting" dimension. Must be non-missing and coercible to numeric. |
scope |
Specifies augmentation bandwidth relative to "key" observations. Parameter should share the same scale as If length 1, bandwidth used is +/- If length 2, bandwidth used is (- |
... |
Expressions passed to |
Details
sift() can be understood as a 2-step process:
-
.datais passed todplyr::filter, using subsetting expression(s) provided in.... We'll refer to these intermediate results as "key" observations. For each key observation,
siftexpands the row selection bidirectionally along dimension specified bysift.col. Any row from the original dataset withinscopeunits of a key observation is captured in the final result.
Essentially, this allows us to "peek" at neighboring rows surrounding the key observations.
Value
A sifted data frame, with 2 additional columns:
-
.cluster <int>: Identifies resulting group formed by each key observation and its neighboring rows. When the key observations are close enough together, the clusters will overlap. -
.key <lgl>:TRUEindicates key observation.
Examples
# See current events from same timeframe as 2020 Utah Monolith discovery.
sift(nyt2020, pub_date, scope = 2, grepl("Monolith", headline))
# or Biden's presidential victory.
sift(nyt2020, pub_date, scope = 2, grepl("Biden is elected", headline))
# We can specify lower & upper scope to see what happened AFTER Trump tested positive.
sift(nyt2020, pub_date, scope = c(0, 2), grepl("Trump Tests Positive", headline))
# sift recognizes dplyr group specification.
library(dplyr)
library(mopac)
express %>%
group_by(direction) %>%
sift(time, 30, plate == "EAS-1671") # row augmentation performed within groups.
Fragments of US & UK population & leaders
Description
These datasets are intended to demonstrate usage of sift::break_join.
Usage
us_uk_pop
us_uk_leaders
Source
See tidyr::who and ggplot2::presidential.