When sharing datasets or publishing packages containing data, developers must ensure that:
1. Sensitive Personally Identifiable Information (PII) is anonymized.
2. Datasets are thoroughly documented with standard data dictionaries.
3. Package functions are covered by reliable test suites.
devkit provides modules to streamline data masking,
roxygen2 documentation generation, and unit-test scaffolding.
Before sharing research data or package datasets, PII like names, email addresses, phone numbers, and exact locations must be scrambled or removed.
mask_identity() runs an interactive console wizard that
reads a dataframe, prompts you to select columns containing sensitive
data, and applies appropriate masking algorithms (e.g., scrambling
strings, grouping ages, or replacing values with random
identifiers).
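To make two of these masking techniques concrete, here is a minimal base-R sketch of replacing values with random identifiers and grouping ages into bands. The helpers `pseudonymize()` and `band_age()` are hypothetical names used for illustration; they are not devkit functions, and the wizard's actual algorithms may differ.

```r
# Hypothetical helpers for illustration; these are NOT devkit functions.

# Replace each distinct value with a random identifier.
pseudonymize <- function(x, prefix = "id") {
  keys <- unique(x)
  # sample() shuffles the numbering so IDs do not reveal the original order
  ids <- sprintf("%s_%03d", prefix, sample(length(keys)))
  ids[match(x, keys)]
}

# Group exact ages into fixed-width bands such as "30-39".
band_age <- function(age, width = 10) {
  lower <- (age %/% width) * width
  paste0(lower, "-", lower + width - 1)
}

# Small toy input, separate from the clinical example in this vignette
toy <- data.frame(
  name = c("Ann Lee", "Raj Patel", "Ann Lee"),
  age  = c(34, 45, 59),
  stringsAsFactors = FALSE
)

set.seed(42)  # only to make this illustration repeatable
masked <- data.frame(
  name      = pseudonymize(toy$name, prefix = "person"),
  age_group = band_age(toy$age),
  stringsAsFactors = FALSE
)
print(masked)
```

Repeated values map to the same identifier, so joins and counts still work after masking. For real data, avoid publishing the seed, since a known seed makes the mapping reproducible.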
Imagine we have a dummy clinical dataset containing sensitive columns:
# Create a dummy patient dataset
patient_data <- data.frame(
  patient_id = 1:5,
  name = c("Alice Smith", "Bob Jones", "Charlie Brown", "Diana Prince", "Evan Wright"),
  age = c(34, 45, 23, 56, 41),
  email = c("alice@mail.com", "bob@mail.com", "charlie@mail.com", "diana@mail.com", "evan@mail.com"),
  diagnosis = c("Flu", "Cold", "Flu", "Allergy", "Healthy"),
  stringsAsFactors = FALSE
)
# Run the interactive masking wizard
masked_data <- mask_identity(patient_data)
# The wizard will prompt you:
# 1. Scramble/Anonymize the 'name' column? Yes -> replaces names with scrambled strings (e.g., 'Ujdfn Hsoiu')
# 2. Scramble/Anonymize the 'email' column? Yes -> replaces emails with random strings (e.g., 'mask_1@example.com')
# 3. Apply category grouping to 'age'? Yes -> groups exact ages into ranges (e.g., '30-39', '40-49')
# Verify the masked dataset
head(masked_data)

CRAN expects every dataset in a package to be documented (R CMD check
warns about undocumented data sets). With roxygen2, this is
conventionally done with a @format block listing the column names and
their descriptions. Documenting this manually is tedious.
dictate_dictionary() runs an interactive wizard that
inspects your data frame's column names and classes, prompts you to enter
a description for each column, and generates a pre-formatted
roxygen2 documentation block ready to be pasted into your package's
source files.
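As a rough, non-interactive illustration of the same idea, the sketch below assembles a roxygen2 `@format` block from a data frame and a named vector of descriptions. The helper `build_format_block()` is a hypothetical name and is not part of devkit; the example uses R's built-in `cars` dataset.

```r
# Hypothetical helper for illustration; NOT part of devkit.
build_format_block <- function(df, descriptions) {
  # Require exactly one description per column, matched by name
  stopifnot(setequal(names(df), names(descriptions)))
  items <- sprintf("#'   \\item{%s}{%s}", names(df), descriptions[names(df)])
  c(
    sprintf("#' @format A data frame with %d rows and %d variables:",
            nrow(df), ncol(df)),
    "#' \\describe{",
    items,
    "#' }"
  )
}

# Descriptions are supplied up front instead of through console prompts
block <- build_format_block(
  cars,
  c(speed = "Speed in miles per hour", dist = "Stopping distance in feet")
)
cat(block, sep = "\n")
```

Matching descriptions by column name rather than by position means a reordered data frame still gets the right text on each `\item` line.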
# Create a dummy sales dataframe
sales_df <- data.frame(
  transaction_id = 1001:1003,
  amount_usd = c(12.50, 45.00, 120.99),
  category = c("Book", "Electronics", "Clothing"),
  stringsAsFactors = FALSE
)
# Generate a roxygen2 data dictionary interactively
dict_res <- dictate_dictionary(sales_df)
# The console wizard will prompt you for descriptions:
# - 'transaction_id': Unique transaction identifier
# - 'amount_usd': Transaction amount in US Dollars
# - 'category': Category of item purchased
# Print the generated roxygen2 lines
cat(dict_res$roxygen_block, sep = "\n")

The output will be formatted like:
#' @format A data frame with 3 rows and 3 variables:
#' \describe{
#'   \item{transaction_id}{Unique transaction identifier}
#'   \item{amount_usd}{Transaction amount in US Dollars}
#'   \item{category}{Category of item purchased}
#' }

Writing test suites for your functions ensures code reliability.
scaffold_tests() creates test files under
tests/testthat/ with structural boilerplate matching your
function’s signature and return type.
# Scaffold a test file for the function 'calculate_mean'
scaffold_tests(target_func = "calculate_mean")

This generates tests/testthat/test-calculate_mean.R with
pre-configured assertions:
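The exact contents of the generated file depend on devkit and on your function. Purely as an illustration of the kind of structural assertions a testthat file can hold, the sketch below assumes a hypothetical `calculate_mean()` that wraps `mean()` and drops missing values. The function is defined inline only to keep the example self-contained; in a real package it would live under `R/` and the test file would contain just the `test_that()` blocks.

```r
library(testthat)

# Hypothetical function under test (an assumption, not devkit output).
calculate_mean <- function(x) {
  stopifnot(is.numeric(x))
  mean(x, na.rm = TRUE)
}

test_that("calculate_mean returns a single numeric value", {
  result <- calculate_mean(c(1, 2, 3))
  expect_type(result, "double")   # return type
  expect_length(result, 1)        # return shape
  expect_equal(result, 2)         # known value
})

test_that("calculate_mean rejects non-numeric input", {
  expect_error(calculate_mean("a"))
})
```

Structural checks like these (type, length, error on bad input) are a starting point; value-level expectations specific to your function still need to be written by hand.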