dqcheckr automates the verification of recurring
external dataset deliveries. For each new file arrival it runs a battery
of quality checks, compares the file to the previous delivery, writes a
self-contained HTML report, and records summary statistics in a local
SQLite database so that quality trends can be tracked over time.
How it works
Your data file Your YAML config
│ │
▼ ▼
read_dataset() load_config()
(all columns as (global defaults +
character, trimmed) per-dataset rules)
│ │
└──────────────┬─────────────────┘
▼
run_qc_checks() ← single-snapshot checks (QC / SC series)
run_comparison_checks()← vs previous delivery (CP series)
run_custom_checks() ← your own rules
│
▼
write_snapshot() ← SQLite: run summary + per-column stats
render_report() ← self-contained HTML
│
▼
run_dq_check() ← the one function you call
A data officer runs a single command for each arriving dataset:
run_dq_check("customer_accounts", config_dir = "path/to/configs")This prints a one-line console summary, writes an HTML report, and
returns list(status, report_path, snapshot_id)
invisibly.
Installation
dqcheckr is a first-party package — install it directly
from source:
All dependencies are on CRAN: readr, DBI,
RSQLite, rmarkdown, knitr,
kableExtra, ggplot2, gridExtra,
yaml, dplyr, tidyr,
rlang.
Configuration
Two YAML files control every run.
Global config — dqcheckr.yml
Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.
snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"
default_rules:
# Proportion of non-empty values that must parse as numeric for a column to
# be classified as numeric (affects type inference across many checks)
type_inference_threshold: 0.90
# FAIL if any column's missing-value rate exceeds this (0-1)
max_missing_rate: 0.05
# FAIL if non-numeric values in a numeric column exceed this rate
max_non_numeric_rate: 0.01
# FAIL if row count drops below this (0 = disabled)
min_row_count: 0
# Version-comparison thresholds (used when a previous file is available)
max_row_count_change_pct: 0.10
max_numeric_mean_shift_pct: 0.20
max_missing_rate_change_pp: 2.0 # percentage points
max_non_numeric_rate_change_pp: 1.0
flag_new_columns: true # set false to suppress new-column warnings in CP-02
flag_dropped_columns: true # set false to suppress dropped-column warnings in CP-02
flag_type_changes: true # set false to suppress type-change warnings in CP-02
flag_column_order_change: true # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)Per-dataset config — <dataset_name>.yml
One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.
dataset_name: "customer_accounts" # for human reference only — the functional
# identifier is the YAML filename, which must
# match the dataset_name argument to run_dq_check()
# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"
# Option B: explicit paths (set current_file only → single-file mode,
# no version comparison)
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"
format: csv
encoding: "UTF-8"
delimiter: ","
# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
- id
- name
- country_code
- account_status
- account_balance
- created_date
# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
- id
# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
max_missing_rate: 0.02
max_row_count_change_pct: 0.05
min_row_count: 1000
# --- Per-column type overrides (optional) ---
# Force specific columns to a type regardless of what the data looks like.
# Valid values: character, numeric, date.
# Eliminates false positives on phone numbers, postcodes, unit numbers, etc.
column_types:
phone: character # prevent QC-11 / CP-04 / CP-07 false positives
postcode: character # 4-digit codes are not meaningful as numeric
bsb_code: character # 6-digit bank codes are not meaningful as numeric
# --- Per-column rules (optional) ---
column_rules:
country_code:
allowed_values: ["GB", "US", "DE", "FR"] # QC-09
pattern: "^[A-Z]{2}$" # QC-13
account_status:
allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
account_balance:
min_value: 0 # QC-10
max_value: 1000000
# Per-column threshold overrides (optional, within column_rules):
email:
max_missing_rate: 1.00 # email is 99% absent by design — suppress QC-01
max_missing_rate_change_pp: 5.0 # looser than global 2.0 — suppress CP-03 noise
account_balance:
max_missing_rate: 0.00 # balance must always be present
max_numeric_mean_shift_pct: 0.05 # stricter than global 0.20
# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"
# --- Output paths (optional, override global) ---
# snapshot_db: "data/snapshots.sqlite"
# report_output_dir: "reports/"What is required vs optional
The only fields the package requires to run are the file location
(folder or current_file) and the
format. Every other config key is optional — the
corresponding checks are simply skipped when the key is absent.
| Config key | Skipped / default when absent |
|---|---|
key_columns |
QC-12 (key uniqueness) skipped |
expected_columns |
SC-01 and SC-02 (schema contract) skipped |
column_types |
All columns use automatic type inference |
column_rules.allowed_values |
QC-09 (allowed values) skipped |
column_rules.min_value / max_value
|
QC-10 (numeric bounds) skipped |
column_rules.pattern |
QC-13 (pattern / regex) skipped |
column_rules.<col>.max_missing_rate |
Falls back to dataset / global max_missing_rate
|
column_rules.<col>.max_non_numeric_rate |
Falls back to dataset / global
max_non_numeric_rate
|
column_rules.<col>.max_missing_rate_change_pp |
Falls back to dataset / global
max_missing_rate_change_pp
|
column_rules.<col>.max_numeric_mean_shift_pct |
Falls back to dataset / global
max_numeric_mean_shift_pct
|
column_rules.<col>.max_non_numeric_rate_change_pp |
Falls back to dataset / global
max_non_numeric_rate_change_pp
|
custom_checks_file |
All custom checks skipped |
previous_file / second file in folder |
All CP-01–CP-08 (version comparison) skipped |
min_row_count set to 0 |
QC-14 (minimum row count) skipped |
type_inference_threshold |
Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07) |
flag_new_columns |
Defaults to true; set to false to suppress
new-column warnings in CP-02 (still tracked in SQLite) |
flag_dropped_columns |
Defaults to true; set to false to suppress
dropped-column warnings in CP-02 (still tracked in SQLite) |
flag_type_changes |
Defaults to true; set to false to suppress
type-change warnings in CP-02 |
flag_column_order_change |
Defaults to true; set to false to skip
CP-08 entirely |
A minimal dataset config that runs basic checks with no column-level rules:
The quality checks
Each check returns one or more dq_result objects with
four key fields: check_id, status
(PASS / WARN / FAIL /
INFO), observed, and message. The
overall run status is the worst single status across all checks.
Single-snapshot checks (QC series)
These run on the current file alone.
| ID | Name | Triggers on | Severity |
|---|---|---|---|
| QC-01 | Missing rate | Column missing rate > max_missing_rate
|
FAIL |
| QC-02 | Empty column | Column is 100% empty | FAIL |
| QC-03 | Duplicate rows | Any fully-identical rows | WARN |
| QC-04 | Row count | Always | INFO |
| QC-05 | Column count | Always | INFO |
| QC-06 | Inferred type | Always (date / numeric / character / unknown) | INFO |
| QC-07 | Numeric stats | Numeric columns (min, max, mean, SD) | INFO |
| QC-08 | Distinct count | Character columns | INFO |
| QC-09 | Allowed values | Values outside allowed_values list |
FAIL |
| QC-10 | Numeric bounds | Values outside min_value / max_value
|
FAIL |
| QC-11 | Non-numeric values | Non-numeric values in a numeric column | WARN / FAIL |
| QC-12 | Key uniqueness | Duplicate values in a key_columns column |
FAIL |
| QC-13 | Pattern / regex | Values not matching pattern
|
FAIL |
| QC-14 | Minimum row count | Row count below min_row_count
|
FAIL |
Schema contract checks (SC series)
Only run when expected_columns is set.
| ID | Name | Triggers on | Severity |
|---|---|---|---|
| SC-01 | Unexpected column | Column in file not in expected_columns
|
FAIL |
| SC-02 | Missing expected column | Column in expected_columns absent from file |
FAIL |
Version comparison checks (CP series)
Only run when a previous file is available. Skipped automatically in single-file mode.
| ID | Name | Triggers on | Severity |
|---|---|---|---|
| CP-01 | Row count change | Change > max_row_count_change_pct
|
WARN |
| CP-02 | Schema diff | Columns added, dropped, or type-changed | WARN |
| CP-03 | Missing rate change | Increase > max_missing_rate_change_pp pp |
WARN |
| CP-04 | Numeric mean shift | Shift > max_numeric_mean_shift_pct
|
WARN |
| CP-05 | New distinct values | Values in current not seen in previous | INFO |
| CP-06 | Dropped distinct values | Values in previous absent from current | INFO |
| CP-07 | Non-numeric rate change | Increase > max_non_numeric_rate_change_pp pp |
WARN |
| CP-08 | Column order change | Column order differs from previous | WARN (CSV) / FAIL (FWF) |
Type inference
infer_col_type() classifies each column after whitespace
trimming. Rules are applied in priority order:
-
date — all non-empty values parse as a known date
format (
%Y-%m-%d,%d/%m/%Y,%m/%d/%Y,%Y%m%d,%d-%m-%Y) -
numeric — ≥
type_inference_thresholdof non-empty values coerce to numeric - character — everything else
-
unknown — all values are empty or
NA
The threshold defaults to 90%, meaning a column with up to 10% bad
values is still classified as numeric, making those bad values visible
to QC-11 (non-numeric contamination). It is configurable per dataset via
type_inference_threshold in default_rules or
rule_overrides:
# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
type_inference_threshold: 0.95
# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
type_inference_threshold: 0.80Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.
infer_col_type(c("2024-01-15", "2024-06-30")) # "date"
infer_col_type(c("100", "200", "N/A")) # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17))) # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA)) # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80) # "numeric" (85%)Overriding inferred types
Some columns are formatted as numbers but are semantically character
— phone numbers, postcodes, unit numbers, BSB codes. Automatic inference
classifies them as numeric, which triggers irrelevant
QC-11, CP-04, and CP-07 checks and can cause spurious CP-02 type-change
warnings when the proportion of non-numeric values drifts across the
inference threshold between deliveries.
Use column_types in the dataset YAML to force the type
for any column:
column_types:
TenantPhone: character # 10-digit numbers — not meaningful as numeric
PremisesUnit: character # values like "5", "3A", "Ground Floor"
PremisesPostCode: character # 4-digit codes validated by pattern, not numeric
BSBCode: character # 6-digit bank routing codesValid values are character, numeric, and
date. The override affects all downstream logic: the type
is stored in SQLite as the column’s inferred_type, drift
comparisons see the stable override, and QC/CP checks that would be
meaningless for the forced type are skipped automatically.
Within custom check scripts, declare config as a second
argument and use resolve_col_type() instead of
infer_col_type() so that overrides are respected. If the
function signature has only one argument (df), the config
is not passed — both signatures work:
# Two-argument form — config is passed automatically, overrides respected
custom_checks <- function(df, config) {
results <- list()
for (col in names(df)) {
if (resolve_col_type(col, df[[col]], config) == "numeric") {
# numeric-specific check logic here
}
}
results
}
# One-argument form — still works, type inference only (no override awareness)
custom_checks <- function(df) {
# ...
}Per-column threshold overrides
Global and dataset-level thresholds apply uniformly to all columns.
Some columns naturally have higher missing rates or more variability,
making a single threshold produce constant noise. Per-column threshold
keys in column_rules let you tune specific columns without
affecting others.
Resolution order for every threshold-based check:
column_rules.<col>.<threshold> → rule_overrides.<threshold> → default_rules.<threshold>
Supported per-column threshold keys
| Key | Governs | Default |
|---|---|---|
max_missing_rate |
QC-01 (missing rate) | 0.05 |
max_non_numeric_rate |
QC-11 (non-numeric contamination) | 0.01 |
max_missing_rate_change_pp |
CP-03 (missing rate change) | 2.0 pp |
max_numeric_mean_shift_pct |
CP-04 (numeric mean shift) | 0.20 |
max_non_numeric_rate_change_pp |
CP-07 (non-numeric rate change) | 1.0 pp |
Example
column_rules:
# Email is almost always absent by design — don't alert on it
LandlordEmail:
max_missing_rate: 1.00 # effectively disables QC-01 for this column
max_missing_rate_change_pp: 5.0 # allow larger swings without CP-03 noise
# Bond amount must always be present and should not shift much
BondAmount:
max_missing_rate: 0.00 # zero tolerance
max_numeric_mean_shift_pct: 0.05 # tighter than global 0.20
# Phone is forced to character via column_types, so numeric thresholds don't apply,
# but you can still tighten missing-rate expectations:
TenantPhone:
max_missing_rate: 0.10The threshold field in the HTML report and
dq_result objects reflects the effective per-column
threshold, not the global default, so the report always shows the value
that was actually used.
Running a check
result <- run_dq_check(
dataset_name = "customer_accounts",
config_dir = "path/to/configs", # contains dqcheckr.yml + customer_accounts.yml
open_report = TRUE # open HTML in browser when interactive
)
# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html
result$status # "PASS", "WARN", or "FAIL"
result$report_path # full path to the HTML file
result$snapshot_id # integer row ID in snapshots tableThe HTML report contains:
- Header — dataset name, file name, run time, overall status badge
- File summary — row count, column count, file size, format, encoding
- Quality checks table — every check result, failures first
- Custom checks table — if configured
- Version comparison — what changed since the previous delivery
- Historical trend — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded)
- Appendix — full column-level statistics table
Calling individual checks
You can call any check function directly, without going through the
full pipeline. Every check takes (df, config) and returns a
list of dq_result objects.
library(dqcheckr)
# Build a minimal config programmatically
cfg <- list(
rules = list(
max_missing_rate = 0.05,
max_non_numeric_rate = 0.01,
min_row_count = 0
),
column_rules = list(
status = list(allowed_values = c("ACTIVE", "CLOSED"))
),
key_columns = "id",
expected_columns = NULL
)
df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)
# Run a single check
results <- check_missing_rate(df, cfg)
# Inspect
results[[1]]$status # "PASS" or "FAIL"
results[[1]]$observed # e.g. "3.2% missing (4 of 125)"
# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)
# Overall status
overall_status(all_results) # "FAIL" > "WARN" > "PASS" > "INFO"Custom checks
Create a plain .R file that defines
custom_checks(df). It receives the trimmed data frame and
must return a list of dq_result objects.
# File: custom/customer_accounts_checks.R
custom_checks <- function(df) {
results <- list()
# Rule: ACTIVE accounts must not have a zero balance
active_zero <- df[df$account_status == "ACTIVE" &
!is.na(df$account_balance) &
df$account_balance == "0", ]
n <- nrow(active_zero)
results <- c(results, list(dq_result(
check_id = "CUST-01",
check_name = "No zero-balance active accounts",
column = "account_balance", # enables per-column storage in SQLite
status = if (n > 0) "FAIL" else "PASS",
observed = sprintf("%d ACTIVE account(s) with balance 0", n),
message = if (n > 0)
sprintf("%d ACTIVE account(s) have a zero balance.", n)
else
"No ACTIVE accounts have a zero balance."
)))
results
}Point to it in the dataset config:
The file is sourced in an isolated environment (not the global
session). dq_result() is available inside it. Any error —
missing file, undefined function, runtime failure — stops the run with a
clear message.
Tip: always set column = when your
check is scoped to a specific column. Results with
column = NA (the default) appear in the HTML report and
contribute to the overall status, but are not written to the
column_snapshots SQLite table and therefore do not appear
in per-column trend queries.
The snapshot database
Every run writes two rows to a shared SQLite database: one row in
snapshots (run-level summary) and one row per column-stat
in column_snapshots.
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/snapshots.sqlite")
# Recent runs for one dataset
dbGetQuery(con,
"SELECT id, file_name, overall_status, check_fail_count, run_timestamp
FROM snapshots
WHERE dataset_name = 'customer_accounts'
ORDER BY id DESC
LIMIT 10")
# Column-level stats for the most recent run
dbGetQuery(con,
"SELECT column_name, dq_check, value, threshold
FROM column_snapshots
WHERE snapshot_id = (
SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
)
ORDER BY column_name, dq_check")
dbDisconnect(con)Because all datasets share one database, you can query across them:
dbGetQuery(con,
"SELECT dataset_name, COUNT(*) AS runs,
SUM(check_fail_count) AS total_failures
FROM snapshots
GROUP BY dataset_name")The output/ directory (and database file) are created
automatically on the first run if they do not exist.
Historical drift comparison
run_dq_check() compares the current file to the previous
file in a single run. For longer-range comparisons — “how has this
dataset’s quality changed over the past two years?” — use
compare_snapshots(), which reads two historical snapshots
from SQLite without needing the original files.
List available snapshots
# All datasets
list_snapshots(config_dir = "config")
# One dataset
list_snapshots("customer_accounts", config_dir = "config")
#> id dataset_name file_name run_timestamp row_count overall_status
#> 1 1 customer_accounts delivery_20250130.csv 2025-01-30 09:15:00 150000 PASS
#> 2 4 customer_accounts delivery_20260128.csv 2026-01-28 10:22:00 158000 WARN
#> 3 7 customer_accounts delivery_20270129.csv 2027-01-29 08:45:00 162000 FAILlist_snapshots() returns a data frame invisibly.
config_dir tells it where dqcheckr.yml is (to
find snapshot_db).
Compare two snapshots
# Latest vs second-latest (default)
drift <- compare_snapshots("customer_accounts", config_dir = "config")
# Specific snapshot IDs — first argument is always treated as "previous"
drift <- compare_snapshots("customer_accounts",
snapshot_id_prev = 1,
snapshot_id_curr = 7,
config_dir = "config")
# Also write a plain-text report alongside the HTML
drift <- compare_snapshots("customer_accounts",
snapshot_id_prev = 1,
snapshot_id_curr = 7,
config_dir = "config",
text_report = TRUE)Console output is a single line:
[dqcheckr] drift: customer_accounts snapshot #1 vs #7 | reports/drift_customer_accounts_20270130_091500.html
What the drift report shows
- Header — dataset name, snapshot IDs, timestamps, file names
- Table-level drift — row count change (%), column count, check outcome counts (PASS / WARN / FAIL / INFO), threshold breach highlighted
- Schema drift — new columns, dropped columns, type changes
-
Per-column drift (sorted by magnitude, breaches
highlighted):
- Missing rate change (percentage points)
- Non-numeric rate change (percentage points)
- Numeric mean shift (%)
- Distinct count change (absolute and %)
Return value
compare_snapshots() returns the full drift data
invisibly as a named list, so you can inspect or process the results
programmatically:
drift <- compare_snapshots("customer_accounts",
snapshot_id_prev = 1,
snapshot_id_curr = 7,
config_dir = "config",
report = FALSE) # skip HTML output
# Table-level summary
drift$table_drift
# Schema changes only
drift$schema_changes
# Columns with the largest missing-rate shift
head(drift$missing_rate_changes[, c("Column", "missing_rate_prev",
"missing_rate_curr",
"missing_rate_change_pp",
"missing_rate_exceeds")])
# Columns whose numeric mean shifted most
head(drift$mean_shifts[, c("Column", "numeric_mean_prev",
"numeric_mean_curr",
"numeric_mean_shift_pct",
"numeric_mean_exceeds")])Typical workflow
# 1. Regular delivery run (writes a new snapshot each time)
run_dq_check("customer_accounts", config_dir = "config")
# 2. Two years later, a new delivery arrives
run_dq_check("customer_accounts", config_dir = "config")
# 3. Discover what snapshots are available
list_snapshots("customer_accounts", config_dir = "config")
# 4. Compare any two
compare_snapshots("customer_accounts",
snapshot_id_prev = 1,
snapshot_id_curr = 4,
config_dir = "config")Note: When explicit IDs are passed, the function treats the first as “previous” and the second as “current” regardless of their numeric order. This matters when baseline snapshots are backfilled from archived files — their IDs may be higher than earlier production runs.
Worked example — Star Wars dataset
The following uses dplyr::starwars exported to CSV and
fixed-width files.
# Config directory contains:
# dqcheckr.yml — global thresholds
# starwars_csv.yml — CSV dataset config
# starwars_fwf.yml — FWF dataset config
# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)The Star Wars config sets max_missing_rate: 0.60 because
vehicles (87%) and starships (77%) are
naturally sparse — most characters have none. Both columns FAIL, which
is the correct finding: they are genuinely incomplete and a consumer
should be aware before using them.
Expected console output:
[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html
Error handling
| Situation | Behaviour |
|---|---|
| Config file not found | Stop immediately; no output written |
| Data file not found | Stop immediately; no output written |
| Only one file available | Single-file mode; CP checks skipped |
| File cannot be parsed | Stop with filename and parse error |
| SC-01 / SC-02 violations | FAIL recorded; run continues; report written |
| Custom checks file missing | Stop with message |
custom_checks() not defined |
Stop with message |
custom_checks() runtime error |
Stop with R error detail |
| SQLite write fails | Warning emitted; HTML report still written |
The package never modifies, moves, or deletes source files.
Design principles
Owned and evolvable. Every check is plain R — readable, modifiable, and not hidden behind a third-party API.
Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.
Fail loudly. Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped.
Non-destructive. The package reads files and writes reports and database records. Source files are never touched.