The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:
You can install {matchmaker} from CRAN:
install.packages("matchmaker")

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

- `match_vec()` translates the values in a single vector
- `match_df()` translates the values in all specified columns of a data frame

Each of these functions has four mandatory options:

- `x`: your data. This will be a vector or a data frame, depending on the function.
- `dictionary`: a data frame with at least two columns specifying the keys and values to modify
- `from`: a character or number specifying which column contains the keys
- `to`: a character or number specifying which column contains the values

Mostly, users will be working with `match_df()` to transform values across specific columns. A typical workflow is to read in the data set, read in the dictionary, and then clean the data with `match_df()`:

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

This is the top of our data set, generated for example purposes:

| id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
|---|---|---|---|---|---|---|---|---|---|---|
| ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u |
| e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui |
| b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | | oui |
| c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui |
| 40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n |
| 46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA |
The dictionary looks like this:
| options | values | grp | orders |
|---|---|---|---|
| y | Yes | readmission | 1 |
| n | No | readmission | 2 |
| u | Unknown | readmission | 3 |
| .missing | Missing | readmission | 4 |
| 0 | Yes | treated | 1 |
| 1 | No | treated | 2 |
| .missing | Missing | treated | 3 |
| 1 | Facility 1 | facility | 1 |
| 2 | Facility 2 | facility | 2 |
| 3 | Facility 3 | facility | 3 |
| 4 | Facility 4 | facility | 4 |
| 5 | Facility 5 | facility | 5 |
| 6 | Facility 6 | facility | 6 |
| 7 | Facility 7 | facility | 7 |
| 8 | Facility 8 | facility | 8 |
| 9 | Facility 9 | facility | 9 |
| 10 | Facility 10 | facility | 10 |
| .default | Unknown | facility | 11 |
| 0 | 0-9 | age_group | 1 |
| 10 | 10-19 | age_group | 2 |
| 20 | 20-29 | age_group | 3 |
| 30 | 30-39 | age_group | 4 |
| 40 | 40-49 | age_group | 5 |
| 50 | 50+ | age_group | 6 |
| high | High | .regex ^lab_result_ | 1 |
| norm | Normal | .regex ^lab_result_ | 2 |
| inc | Inconclusive | .regex ^lab_result_ | 3 |
| y | yes | .global | Inf |
| n | no | .global | Inf |
| u | unknown | .global | Inf |
| unk | unknown | .global | Inf |
| oui | yes | .global | Inf |
| .missing | missing | .global | Inf |
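
The `grp` column controls where each rule applies. Reading the table, a rule can target one named column (`readmission`), every column whose name matches a pattern (`.regex ^lab_result_`), or all columns (`.global`). The base R sketch below classifies the rules of a hand-copied excerpt by scope. It only illustrates how to read the dictionary and is not a description of how matchmaker processes it internally; the names `dict_excerpt`, `scope`, `data_cols`, and `lab_cols` are made up for this example.

```r
# Excerpt of the dictionary shown above, rebuilt by hand for illustration
dict_excerpt <- data.frame(
  options = c("y", "n", ".missing", ".default", "high", "norm", "unk", "oui"),
  values  = c("Yes", "No", "Missing", "Unknown", "High", "Normal", "unknown", "yes"),
  grp     = c("readmission", "readmission", "readmission", "facility",
              ".regex ^lab_result_", ".regex ^lab_result_", ".global", ".global"),
  stringsAsFactors = FALSE
)

# Classify each rule by the scope that its `grp` value implies
scope <- ifelse(
  dict_excerpt$grp == ".global", "all columns",
  ifelse(startsWith(dict_excerpt$grp, ".regex "),
         "columns matching a pattern",
         "one named column")
)

# Number of rules per scope
table(scope)

# Column names of the example data set
data_cols <- c("id", "date", "readmission", "treated", "facility", "age_group",
               "lab_result_01", "lab_result_02", "lab_result_03",
               "has_symptoms", "followup")

# Strip the `.regex ` prefix and find the columns the pattern would select
pattern  <- sub("^\\.regex ", "", ".regex ^lab_result_")
lab_cols <- grep(pattern, data_cols, value = TRUE)
lab_cols
```

The `.missing` and `.default` keys in the `options` column are consistent with stand-ins for missing values and for values that have no entry of their own, which matches the cleaned output further down (for example, facility `C` becomes `Unknown`).
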
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated   facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes    Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility 3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility 8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility 4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility 6       0-9
#> 6 46566e 2019-07-14         Yes Missing    Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
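
The example above only exercises `match_df()`. For a single vector, `match_vec()` can be used in the same way. The following is a minimal sketch, assuming `match_vec()` takes the `x`, `dictionary`, `from`, and `to` arguments described earlier; `toy_dict` and `responses` are hypothetical and not part of the package's example data.

```r
library("matchmaker")

# Hypothetical toy dictionary: keys in `options`, replacements in `values`
toy_dict <- data.frame(
  options = c("y", "n", "oui"),
  values  = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical coded responses; every value has a key in `toy_dict`
responses <- c("y", "n", "oui", "y")

# Translate the vector using the key and value columns named above
translated <- match_vec(responses,
  dictionary = toy_dict,
  from = "options",
  to = "values"
)
translated
```

Because every value in `responses` has a key in `toy_dict`, each element should come back as the matching entry from `values`.
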