| Type: | Package |
| Title: | Automated Data Quality Checks for Recurring Dataset Deliveries |
| Version: | 0.1.2 |
| Date: | 2026-05-16 |
| Description: | Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks, compares the file to the previous delivery, writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/mickmioduszewski/dqcheckr |
| BugReports: | https://github.com/mickmioduszewski/dqcheckr/issues |
| Encoding: | UTF-8 |
| Language: | en-GB |
| Depends: | R (≥ 4.2) |
| Imports: | readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, dplyr, tidyr, yaml, rlang |
| Suggests: | testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | no |
| Packaged: | 2026-05-15 16:51:37 UTC; mick |
| Author: | Mick Mioduszewski [aut, cre] |
| Maintainer: | Mick Mioduszewski <mick@mioduszewski.net> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-20 08:00:07 UTC |
dqcheckr: Automated Data Quality Checks for Recurring Dataset Deliveries
Description
Automates quality verification of recurring external dataset deliveries. For each new file arrival, it runs single-snapshot quality checks (QC-01 to QC-14, SC-01/SC-02), compares the file to the previous delivery (CP-01 to CP-08), writes a self-contained 'HTML' report, and records summary statistics in a local 'SQLite' database for long-term trend tracking. Supports 'CSV' and fixed-width formats. Custom organisation-specific checks can be supplied as plain R files.
Details
The main entry point is run_dq_check. Configuration is driven
by two 'YAML' files: a global dqcheckr.yml and a per-dataset
<dataset_name>.yml.
Author(s)
Maintainer: Mick Mioduszewski <mick@mioduszewski.net>
Authors:
Mick Mioduszewski <mick@mioduszewski.net>
See Also
Useful links:
Report bugs at https://github.com/mickmioduszewski/dqcheckr/issues
Compute missing rate for a vector
Description
Compute missing rate for a vector
Usage
.missing_rate_vec(x)
Test for missing or empty values
Description
Test for missing or empty values
Usage
.missing_vals(x)
QC-09: Check for values outside the allowed set
Description
QC-09: Check for values outside the allowed set
Usage
check_allowed_values(df, config)
QC-05: Report column count
Description
QC-05: Report column count
Usage
check_col_count(df, config)
QC-08: Report distinct value counts for character columns
Description
QC-08: Report distinct value counts for character columns
Usage
check_distinct_counts(df, config)
QC-03: Check for fully-duplicate rows
Description
QC-03: Check for fully-duplicate rows
Usage
check_duplicate_rows(df, config)
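The manual does not say how QC-03 counts duplicates. As a rough illustration only, fully duplicate rows can be counted in base R with duplicated(), which marks every repeat of an earlier row. The sample data frame below is made up for the sketch.

```r
# Illustration only: the package's own counting rule is not documented here.
df <- data.frame(
  id   = c("1", "2", "2", "3"),
  name = c("Luke", "Leia", "Leia", "Han"),
  stringsAsFactors = FALSE
)

# duplicated() flags rows that repeat an earlier row in every column.
n_dupes <- sum(duplicated(df))
n_dupes
```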
QC-02: Check for entirely empty columns
Description
QC-02: Check for entirely empty columns
Usage
check_empty_column(df, config)
QC-06: Report inferred column types
Description
QC-06: Report inferred column types
Usage
check_inferred_types(df, config)
QC-12: Check uniqueness of key columns
Description
QC-12: Check uniqueness of key columns
Usage
check_key_uniqueness(df, config)
QC-14: Check minimum row count threshold
Description
QC-14: Check minimum row count threshold
Usage
check_min_row_count(df, config)
QC-01: Check missing rate per column
Description
Returns a dq_result per column flagging columns whose
proportion of missing or empty values exceeds max_missing_rate.
Usage
check_missing_rate(df, config)
Arguments
| df | A data frame with all columns as character vectors. |
| config | Named list as returned by load_config. |
Value
A list of dq_result objects, one per column.
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df <- read_dataset(path, cfg)
check_missing_rate(df, cfg)
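The threshold logic described above can be sketched in base R without the package. This is a hypothetical re-implementation, not the package's code: the helper name flag_missing is invented, results are plain lists rather than dq_result objects, and the sketch assumes that a column over the threshold is reported as "FAIL" (the manual only says such columns are flagged).

```r
# Hypothetical sketch; flag_missing() is not part of dqcheckr.
flag_missing <- function(df, max_missing_rate) {
  lapply(names(df), function(col) {
    x <- df[[col]]
    # A value counts as missing when it is NA or blank after trimming.
    rate <- mean(is.na(x) | trimws(x) == "")
    list(
      column    = col,
      # Assumption: exceeding the threshold yields "FAIL".
      status    = if (rate > max_missing_rate) "FAIL" else "PASS",
      observed  = sprintf("%.0f%% missing", 100 * rate),
      threshold = as.character(max_missing_rate)
    )
  })
}

df <- data.frame(
  name = c("Luke", "Leia", "Han", "Yoda"),
  mass = c("77", "", NA, ""),
  stringsAsFactors = FALSE
)
res <- flag_missing(df, max_missing_rate = 0.60)
statuses <- vapply(res, function(r) r$status, character(1))
statuses
```

Here mass is missing in three of four rows, so it exceeds the 0.60 threshold, while name has no missing values.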
QC-11: Check non-numeric rate in numeric columns
Description
QC-11: Check non-numeric rate in numeric columns
Usage
check_non_numeric(df, config)
QC-10: Check for out-of-range numeric values
Description
QC-10: Check for out-of-range numeric values
Usage
check_numeric_bounds(df, config)
QC-07: Report numeric summary statistics
Description
QC-07: Report numeric summary statistics
Usage
check_numeric_stats(df, config)
QC-13: Check values against a regex pattern
Description
QC-13: Check values against a regex pattern
Usage
check_pattern(df, config)
QC-04: Report row count
Description
QC-04: Report row count
Usage
check_row_count(df, config)
SC-01/SC-02: Check columns against expected schema contract
Description
SC-01/SC-02: Check columns against expected schema contract
Usage
check_schema_contract(df, config)
CP-08: Check column order consistency between deliveries
Description
CP-08: Check column order consistency between deliveries
Usage
compare_column_order(df_current, df_previous, config)
CP-06: Detect dropped distinct values in character columns
Description
CP-06: Detect dropped distinct values in character columns
Usage
compare_dropped_values(df_current, df_previous, config)
CP-03: Compare per-column missing rate between deliveries
Description
CP-03: Compare per-column missing rate between deliveries
Usage
compare_missing_rate(df_current, df_previous, config)
CP-05: Detect new distinct values in character columns
Description
CP-05: Detect new distinct values in character columns
Usage
compare_new_values(df_current, df_previous, config)
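CP-05 and CP-06 both come down to comparing the sets of distinct values in two deliveries. A minimal base R sketch follows; the helper name value_changes is invented, and ignoring NA and empty strings is an assumption rather than documented behaviour.

```r
# Hypothetical helper, not part of dqcheckr.
value_changes <- function(current, previous) {
  # Assumption: NA and empty strings are not treated as values.
  cur  <- unique(current[!is.na(current) & current != ""])
  prev <- unique(previous[!is.na(previous) & previous != ""])
  list(
    new     = setdiff(cur, prev),   # present now, absent before
    dropped = setdiff(prev, cur)    # present before, absent now
  )
}

prev_species <- c("Human", "Droid", "Wookiee")
curr_species <- c("Human", "Droid", "Ewok", NA)
changes <- value_changes(curr_species, prev_species)
changes
```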
CP-07: Compare non-numeric rate in numeric columns between deliveries
Description
CP-07: Compare non-numeric rate in numeric columns between deliveries
Usage
compare_non_numeric_rate(df_current, df_previous, config)
CP-04: Compare numeric column means between deliveries
Description
CP-04: Compare numeric column means between deliveries
Usage
compare_numeric_mean(df_current, df_previous, config)
CP-01: Compare row count between deliveries
Description
CP-01: Compare row count between deliveries
Usage
compare_row_count(df_current, df_previous, config)
CP-02: Detect schema differences between deliveries
Description
CP-02: Detect schema differences between deliveries
Usage
compare_schema(df_current, df_previous, config)
Compute per-column statistics for snapshot storage
Description
Compute per-column statistics for snapshot storage
Usage
compute_col_stats(df, config, qc_results)
Detect current and previous dataset files
Description
Resolves the current and previous file paths from the configuration. If
current_file is set explicitly, it is used directly. Otherwise the
two most recently modified files in folder are used.
Usage
detect_files(config)
Arguments
| config | Named list. Merged configuration as returned by load_config. |
Value
A named list with elements current (character path) and
previous (character path or NULL).
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
cfg$current_file <- system.file("demonstrations/data/starwars.csv",
                                package = "dqcheckr")
files <- detect_files(cfg)
files$current
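The fallback rule (take the two most recently modified files in the folder) might look like the following in base R. This is a sketch under assumptions: the helper name latest_two is invented, sub-directories are skipped, and the package may apply further filtering that is not described here.

```r
# Hypothetical helper, not part of dqcheckr.
latest_two <- function(folder) {
  paths <- list.files(folder, full.names = TRUE)
  paths <- paths[!dir.exists(paths)]            # keep files only
  info  <- file.info(paths)
  ordered <- paths[order(info$mtime, decreasing = TRUE)]
  list(
    current  = if (length(ordered) >= 1) ordered[[1]] else NULL,
    previous = if (length(ordered) >= 2) ordered[[2]] else NULL
  )
}

# Throwaway folder with two files and explicit modification times.
folder <- tempfile("deliveries_")
dir.create(folder)
old <- file.path(folder, "delivery_jan.csv")
new <- file.path(folder, "delivery_feb.csv")
writeLines(c("id", "1"), old)
writeLines(c("id", "1", "2"), new)
Sys.setFileTime(old, as.POSIXct("2024-01-31 12:00:00", tz = "UTC"))
Sys.setFileTime(new, as.POSIXct("2024-02-29 12:00:00", tz = "UTC"))

files <- latest_two(folder)
basename(files$current)
basename(files$previous)
```

With a single file in the folder, previous is NULL, which matches the documented return shape.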
Construct a data quality result object
Description
Creates the atomic result unit returned by every check function.
Usage
dq_result(
  check_id,
  check_name,
  column = NA_character_,
  status,
  observed,
  threshold = NA_character_,
  message
)
Arguments
| check_id | Character. Short identifier for the check (e.g. "QC-01"). |
| check_name | Character. Human-readable name of the check. |
| column | Character. Column the check applies to, or NA_character_ (the default). |
| status | Character. One of "PASS", "WARN", "FAIL", or "INFO". |
| observed | Character. What was observed (e.g. "0% missing"). |
| threshold | Character. The configured threshold, or NA_character_ (the default). |
| message | Character. Human-readable description of the result. |
Value
A named list with seven elements: check_id, check_name,
column, status, observed, threshold,
message.
Examples
dq_result("QC-01", "Missing rate", column = "age",
          status = "PASS", observed = "0% missing",
          message = "No missing values.")
Infer the logical type of a character column
Description
Classifies a character vector as "date", "numeric",
"character", or "unknown" by applying rules in priority order.
Usage
infer_col_type(x, threshold = 0.9)
Arguments
| x | Character vector to classify (as read from a CSV or FWF file). |
| threshold | Numeric. Minimum proportion of non-empty values that must parse as numeric for the column to be classified as "numeric". |
Value
A single character string: "date", "numeric",
"character", or "unknown".
Examples
infer_col_type(c("2024-01-01", "2024-06-15")) # "date"
infer_col_type(c("1.5", "2.0", "3.1")) # "numeric"
infer_col_type(c("high", "low", "medium")) # "character"
infer_col_type(c(NA, "", NA)) # "unknown"
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80) # "numeric"
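To make the idea of priority-ordered rules concrete, here is one possible ordering sketched in base R. The exact rules the package applies are not given in this manual, so everything below is an assumption: the helper name classify_column is invented, only ISO dates (YYYY-MM-DD) are recognised, and the same threshold is applied to the date rule as to the numeric rule.

```r
# Hypothetical sketch; not the package's implementation.
classify_column <- function(x, threshold = 0.9) {
  vals <- trimws(x[!is.na(x)])
  vals <- vals[vals != ""]

  # Rule 1: nothing left to inspect.
  if (length(vals) == 0) return("unknown")

  # Rule 2 (assumed): ISO dates that also parse as real calendar dates.
  is_date <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", vals) &
    !is.na(as.Date(vals, format = "%Y-%m-%d"))
  if (mean(is_date) >= threshold) return("date")

  # Rule 3: share of non-empty values that parse as numbers.
  is_num <- !is.na(suppressWarnings(as.numeric(vals)))
  if (mean(is_num) >= threshold) return("numeric")

  # Rule 4: everything else.
  "character"
}

classify_column(c("2024-01-01", "2024-06-15"))
classify_column(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)
classify_column(c(rep("1", 17), "a", "b", "c"))
```

In the last two calls, 17 of 20 values (85%) are numeric, so the result depends on whether the threshold is 0.80 or the default 0.9.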
Initialise the SQLite snapshot database
Description
Initialise the SQLite snapshot database
Usage
init_snapshot_db(db_path)
Load and merge dataset configuration
Description
Reads the global dqcheckr.yml and the dataset-specific YAML, merging
rule_overrides from the dataset config on top of default_rules
from the global config.
Usage
load_config(dataset_name, config_dir)
Arguments
| dataset_name | Character. Dataset name; must match the per-dataset YAML file <dataset_name>.yml. |
| config_dir | Character. Path to the directory containing both YAML files. |
Value
A named list representing the merged configuration.
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
cfg$format
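The merge described above (dataset-level rule_overrides layered on top of global default_rules) can be illustrated with utils::modifyList. This is a sketch, not the package's code: the helper name merge_rules is invented, the two configurations are written as R lists instead of being read from YAML, and the shape of the merged list returned by load_config is not documented here.

```r
# Stand-ins for the parsed contents of dqcheckr.yml and <dataset_name>.yml.
global_cfg <- list(
  default_rules = list(max_missing_rate = 0.60, min_row_count = 80)
)
dataset_cfg <- list(
  dataset_name   = "starwars_csv",
  format         = "csv",
  rule_overrides = list(max_missing_rate = 0.20)
)

# Hypothetical helper: overrides win, untouched defaults are kept.
merge_rules <- function(global_cfg, dataset_cfg) {
  overrides <- dataset_cfg$rule_overrides
  if (is.null(overrides)) overrides <- list()
  utils::modifyList(global_cfg$default_rules, overrides)
}

rules <- merge_rules(global_cfg, dataset_cfg)
rules$max_missing_rate
rules$min_row_count
```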
Compute the worst status across a list of dq_result objects
Description
Returns the single worst status in precedence order:
"FAIL" > "WARN" > "PASS" > "INFO".
Usage
overall_status(results)
Arguments
| results | A list of dq_result objects. |
Value
A single character string: "FAIL", "WARN",
"PASS", or "INFO".
Examples
r1 <- dq_result("QC-01", "test", status = "PASS", observed = "ok", message = "ok")
r2 <- dq_result("QC-02", "test", status = "WARN", observed = "ok", message = "ok")
overall_status(list(r1, r2)) # "WARN"
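The precedence rule can be expressed as a lookup into an ordered vector. The sketch below works on a plain character vector of statuses rather than on dq_result objects, and the helper name worst_status is invented. How the package treats an empty list is not documented, so the sketch simply rejects it.

```r
# Hypothetical helper, not part of dqcheckr.
worst_status <- function(statuses) {
  stopifnot(length(statuses) > 0)
  precedence <- c("FAIL", "WARN", "PASS", "INFO")  # worst first
  ranks <- match(statuses, precedence)
  if (anyNA(ranks)) stop("unknown status value")
  precedence[min(ranks)]
}

worst_status(c("PASS", "INFO", "WARN"))
worst_status(c("INFO", "PASS"))
```

Note that "PASS" outranks "INFO", so a run with only informational and passing checks is reported as "PASS".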
Read a dataset file into a data frame
Description
Reads a CSV or fixed-width file, coercing all columns to character and
trimming whitespace. Encoding and delimiter are taken from config.
Usage
read_dataset(path, config)
Arguments
| path | Character. Path to the file to read. |
| config | Named list. Merged configuration as returned by load_config. |
Value
A data frame with all columns as character vectors.
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df <- read_dataset(path, cfg)
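The "read everything as character, then trim" approach can be tried without the package. The sketch below uses base R's read.csv in place of readr (which the package imports), handles CSV only, and uses an invented helper name, so it should be read as an approximation of the behaviour described above rather than the package's reader.

```r
# Hypothetical base R approximation; not the package's reader.
read_all_character <- function(path, delimiter = ",", encoding = "UTF-8") {
  df <- utils::read.csv(
    path,
    sep = delimiter,
    colClasses = "character",   # no type guessing
    fileEncoding = encoding,
    check.names = FALSE
  )
  df[] <- lapply(df, trimws)    # strip leading and trailing whitespace
  df
}

path <- tempfile(fileext = ".csv")
writeLines(c("name,height", " Luke ,172", "Leia, 150 "), path)
df <- read_all_character(path)
str(df)
```

Keeping every column as character defers type decisions to the checks themselves, which is why functions such as check_missing_rate expect character columns.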
Read recent snapshot history from the SQLite database
Description
Retrieves the n most recent run records for a given dataset from the
snapshot database, ordered newest-first.
Usage
read_recent_snapshots(db_path, dataset_name, n = 10)
Arguments
| db_path | Character. Path to the SQLite database file. |
| dataset_name | Character. Dataset name to filter on. |
| n | Integer. Maximum number of records to return. Defaults to 10. |
Value
A data frame with one row per run and columns including
id, run_timestamp, file_name, row_count,
overall_status, check_pass_count, check_warn_count,
check_fail_count. Returns an empty data frame if the database does
not exist or contains no records for the dataset.
Examples
history <- read_recent_snapshots(tempfile(fileext = ".sqlite"), "starwars_csv")
Render the HTML data quality report
Description
Render the HTML data quality report
Usage
render_report(
  dataset_name,
  file_name,
  file_path,
  df,
  qc_results,
  cp_results,
  custom_results,
  snapshot_history,
  config,
  col_stats = NULL,
  output_dir,
  open_report = TRUE
)
Run all version comparison checks between two dataset snapshots
Description
Runs CP-01 to CP-08 comparing a current delivery against the previous one.
Usage
run_comparison_checks(df_current, df_previous, config)
Arguments
| df_current | A data frame. The current delivery. |
| df_previous | A data frame. The previous delivery. |
| config | Named list. Merged configuration as returned by load_config. |
Value
A list of dq_result objects. The list carries
attributes new_cols and dropped_cols (character vectors)
for use by the snapshot writer.
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
curr_path <- system.file("demonstrations/data2/starwars_v2.csv", package = "dqcheckr")
prev_path <- system.file("demonstrations/data2/starwars_v1.csv", package = "dqcheckr")
curr <- read_dataset(curr_path, cfg)
prev <- read_dataset(prev_path, cfg)
results <- run_comparison_checks(curr, prev, cfg)
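The new_cols and dropped_cols attributes mentioned under Value are ordinary R attributes and can be read with attr(). The sketch below shows, without the package, how such attributes could be derived from the column names of two deliveries and attached to a result list. The helper name schema_diff is invented and the empty list merely stands in for the list of dq_result objects.

```r
# Hypothetical helper, not part of dqcheckr.
schema_diff <- function(df_current, df_previous) {
  list(
    new_cols     = setdiff(names(df_current), names(df_previous)),
    dropped_cols = setdiff(names(df_previous), names(df_current))
  )
}

prev <- data.frame(name = "Luke", mass = "77", stringsAsFactors = FALSE)
curr <- data.frame(name = "Luke", height = "172", stringsAsFactors = FALSE)

results <- list()  # stand-in for the list of dq_result objects
d <- schema_diff(curr, prev)
attr(results, "new_cols") <- d$new_cols
attr(results, "dropped_cols") <- d$dropped_cols

attr(results, "new_cols")
attr(results, "dropped_cols")
```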
Run organisation-specific custom checks
Description
Sources the R file specified by config$custom_checks_file, which must
define a function custom_checks(df) returning a list of
dq_result objects. Returns an empty list if
custom_checks_file is not set in the config.
Usage
run_custom_checks(df, config)
Arguments
| df | A data frame. The current delivery. |
| config | Named list. Merged configuration as returned by load_config. |
Value
A list of dq_result objects (may be empty).
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df <- read_dataset(path, cfg)
results <- run_custom_checks(df, cfg)
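A custom checks file, as described above, only has to define custom_checks(df) and return a list of results. The sketch below writes such a file, sources it, and calls the function. Several parts are assumptions made so the example runs without the package: dq_result is replaced by a local stand-in with the documented seven fields, the check ID "ORG-01" and the rule itself are invented, and sourcing into a fresh environment is one plausible way to load the file, not necessarily what run_custom_checks does.

```r
# Stand-in for dqcheckr's dq_result(), so the sketch is self-contained.
dq_result <- function(check_id, check_name, column = NA_character_, status,
                      observed, threshold = NA_character_, message) {
  list(check_id = check_id, check_name = check_name, column = column,
       status = status, observed = observed, threshold = threshold,
       message = message)
}

# Contents of a hypothetical custom checks file.
checks_file <- tempfile(fileext = ".R")
writeLines(c(
  "custom_checks <- function(df) {",
  "  n_bad <- sum(!grepl('^[[:upper:]]', df$name))",
  "  list(dq_result(",
  "    'ORG-01', 'Name starts with a capital', column = 'name',",
  "    status = if (n_bad == 0) 'PASS' else 'WARN',",
  "    observed = paste(n_bad, 'offending values'),",
  "    message = 'Names are expected to start with an upper-case letter.'",
  "  ))",
  "}"
), checks_file)

# Load the file into its own environment and run the checks.
env <- new.env()
source(checks_file, local = env)
df <- data.frame(name = c("Luke", "leia"), stringsAsFactors = FALSE)
results <- env$custom_checks(df)
results[[1]]$status
results[[1]]$observed
```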
Run a full data quality check pipeline
Description
Orchestrates the complete dqcheckr pipeline: loads configuration, detects files, runs QC and comparison checks, writes a snapshot to SQLite, and renders an HTML report.
Usage
run_dq_check(dataset_name, config_dir = ".", open_report = TRUE)
Arguments
| dataset_name | Character. Name of the dataset; must match a YAML config file <dataset_name>.yml in config_dir. |
| config_dir | Character. Path to the directory containing dqcheckr.yml and the per-dataset YAML file. |
| open_report | Logical. Whether to open the HTML report in the browser after rendering (only takes effect in interactive sessions). |
Value
Invisibly, a named list with:
- status: Overall status string: "PASS", "WARN", "FAIL", or "INFO".
- report_path: Absolute path to the rendered HTML report.
- snapshot_id: Integer row ID of the snapshot written to SQLite, or NULL if the write failed.
Examples
tmp <- gsub("\\\\", "/", tempdir())
dat <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
writeLines(c(
  paste0('snapshot_db: "', tmp, '/snap.sqlite"'),
  paste0('report_output_dir: "', tmp, '"'),
  'default_rules:',
  ' max_missing_rate: 0.60',
  ' min_row_count: 80'
), file.path(tmp, "dqcheckr.yml"))
writeLines(c(
  'dataset_name: "starwars_csv"',
  paste0('current_file: "', dat, '"'),
  'format: csv',
  'encoding: "UTF-8"',
  'delimiter: ","'
), file.path(tmp, "starwars_csv.yml"))
result <- run_dq_check("starwars_csv", config_dir = tmp, open_report = FALSE)
result$status
Run all generic quality checks on a dataset
Description
Runs the full QC check suite (QC-01 to QC-14, SC-01, SC-02) against a single data frame snapshot.
Usage
run_qc_checks(df, config)
Arguments
| df | A data frame with all columns as character vectors (as returned by read_dataset). |
| config | Named list. Merged configuration as returned by load_config. |
Value
A list of dq_result objects.
Examples
cfg_dir <- system.file("demonstrations/config", package = "dqcheckr")
cfg <- load_config("starwars_csv", config_dir = cfg_dir)
path <- system.file("demonstrations/data/starwars.csv", package = "dqcheckr")
df <- read_dataset(path, cfg)
results <- run_qc_checks(df, cfg)
Write a run snapshot to the SQLite database
Description
Write a run snapshot to the SQLite database
Usage
write_snapshot(
  db_path,
  dataset_name,
  file_name,
  df,
  qc_results,
  cp_results,
  custom_results,
  config,
  col_stats = NULL
)