Skip to contents

dqcheckr automates the verification of recurring external dataset deliveries. For each new file arrival it runs a battery of quality checks, compares the file to the previous delivery, writes a self-contained HTML report, and records summary statistics in a local SQLite database so that quality trends can be tracked over time.

How it works

Your data file                  Your YAML config
      │                                │
      ▼                                ▼
  read_dataset()            load_config()
  (all columns as           (global defaults +
   character, trimmed)       per-dataset rules)
      │                                │
      └──────────────┬─────────────────┘
                     ▼
          run_qc_checks()        ← single-snapshot checks (QC / SC series)
          run_comparison_checks()← vs previous delivery   (CP series)
          run_custom_checks()    ← your own rules
                     │
                     ▼
          write_snapshot()       ← SQLite: run summary + per-column stats
          render_report()        ← self-contained HTML
                     │
                     ▼
          run_dq_check()         ← the one function you call

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "path/to/configs")

This prints a one-line console summary, writes an HTML report, and returns list(status, report_path, snapshot_id) invisibly.


Installation

dqcheckr is a first-party package — install it directly from source:

devtools::install("path/to/dqr/dqcheckr")
library(dqcheckr)

All dependencies are on CRAN: readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, yaml, dplyr, tidyr, rlang.


Configuration

Two YAML files control every run.

Global config — dqcheckr.yml

Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.

snapshot_db:       "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Proportion of non-empty values that must parse as numeric for a column to
  # be classified as numeric (affects type inference across many checks)
  type_inference_threshold: 0.90

  # FAIL if any column's missing-value rate exceeds this (0-1)
  max_missing_rate: 0.05

  # FAIL if non-numeric values in a numeric column exceed this rate
  max_non_numeric_rate: 0.01

  # FAIL if row count drops below this (0 = disabled)
  min_row_count: 0

  # Version-comparison thresholds (used when a previous file is available)
  max_row_count_change_pct:       0.10
  max_numeric_mean_shift_pct:     0.20
  max_missing_rate_change_pp:     2.0    # percentage points
  max_non_numeric_rate_change_pp: 1.0

  flag_new_columns:         true   # set false to suppress new-column warnings in CP-02
  flag_dropped_columns:     true   # set false to suppress dropped-column warnings in CP-02
  flag_type_changes:        true   # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true   # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — <dataset_name>.yml

One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.

dataset_name: "customer_accounts"   # for human reference only — the functional
                                     # identifier is the YAML filename, which must
                                     # match the dataset_name argument to run_dq_check()

# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"

# Option B: explicit paths (set current_file only → single-file mode,
#           no version comparison)
# current_file:  "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

format:    csv
encoding:  "UTF-8"
delimiter: ","

# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
  - id

# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
  max_missing_rate:        0.02
  max_row_count_change_pct: 0.05
  min_row_count:           1000

# --- Per-column type overrides (optional) ---
# Force specific columns to a type regardless of what the data looks like.
# Valid values: character, numeric, date.
# Eliminates false positives on phone numbers, postcodes, unit numbers, etc.
column_types:
  phone:    character   # prevent QC-11 / CP-04 / CP-07 false positives
  postcode: character   # 4-digit codes are not meaningful as numeric
  bsb_code: character   # 6-digit bank codes are not meaningful as numeric

# --- Per-column rules (optional) ---
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]   # QC-09
    pattern: "^[A-Z]{2}$"                       # QC-13
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0                                 # QC-10
    max_value: 1000000
  # Per-column threshold overrides (optional, within column_rules):
  email:
    max_missing_rate: 1.00          # email is 99% absent by design — suppress QC-01
    max_missing_rate_change_pp: 5.0 # looser than global 2.0 — suppress CP-03 noise
  account_balance:
    max_missing_rate: 0.00          # balance must always be present
    max_numeric_mean_shift_pct: 0.05  # stricter than global 0.20

# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"

# --- Output paths (optional, override global) ---
# snapshot_db:       "data/snapshots.sqlite"
# report_output_dir: "reports/"

What is required vs optional

The only fields the package requires to run are the file location (folder or current_file) and the format. Every other config key is optional — the corresponding checks are simply skipped when the key is absent.

Config key Skipped / default when absent
key_columns QC-12 (key uniqueness) skipped
expected_columns SC-01 and SC-02 (schema contract) skipped
column_types All columns use automatic type inference
column_rules.allowed_values QC-09 (allowed values) skipped
column_rules.min_value / max_value QC-10 (numeric bounds) skipped
column_rules.pattern QC-13 (pattern / regex) skipped
column_rules.<col>.max_missing_rate Falls back to dataset / global max_missing_rate
column_rules.<col>.max_non_numeric_rate Falls back to dataset / global max_non_numeric_rate
column_rules.<col>.max_missing_rate_change_pp Falls back to dataset / global max_missing_rate_change_pp
column_rules.<col>.max_numeric_mean_shift_pct Falls back to dataset / global max_numeric_mean_shift_pct
column_rules.<col>.max_non_numeric_rate_change_pp Falls back to dataset / global max_non_numeric_rate_change_pp
custom_checks_file All custom checks skipped
previous_file / second file in folder All CP-01–CP-08 (version comparison) skipped
min_row_count set to 0 QC-14 (minimum row count) skipped
type_inference_threshold Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07)
flag_new_columns Defaults to true; set to false to suppress new-column warnings in CP-02 (still tracked in SQLite)
flag_dropped_columns Defaults to true; set to false to suppress dropped-column warnings in CP-02 (still tracked in SQLite)
flag_type_changes Defaults to true; set to false to suppress type-change warnings in CP-02
flag_column_order_change Defaults to true; set to false to skip CP-08 entirely

A minimal dataset config that runs basic checks with no column-level rules:

dataset_name: "my_dataset"           # for reference; must match the YAML filename
current_file: "data/incoming/my_dataset.csv"
format: csv

Fixed-width files

For fixed-width files, set format: fwf and supply column widths and names. If the file has a header row, set fwf_skip: 1 to skip it.

format: fwf
fwf_skip: 1
fwf_widths:    [10, 20, 3, 10, 12, 11]
fwf_col_names: [id, name, country_code, account_status, account_balance, created_date]

The quality checks

Each check returns one or more dq_result objects with four key fields: check_id, status (PASS / WARN / FAIL / INFO), observed, and message. The overall run status is the worst single status across all checks.

Single-snapshot checks (QC series)

These run on the current file alone.

ID Name Triggers on Severity
QC-01 Missing rate Column missing rate > max_missing_rate FAIL
QC-02 Empty column Column is 100% empty FAIL
QC-03 Duplicate rows Any fully-identical rows WARN
QC-04 Row count Always INFO
QC-05 Column count Always INFO
QC-06 Inferred type Always (date / numeric / character / unknown) INFO
QC-07 Numeric stats Numeric columns (min, max, mean, SD) INFO
QC-08 Distinct count Character columns INFO
QC-09 Allowed values Values outside allowed_values list FAIL
QC-10 Numeric bounds Values outside min_value / max_value FAIL
QC-11 Non-numeric values Non-numeric values in a numeric column WARN / FAIL
QC-12 Key uniqueness Duplicate values in a key_columns column FAIL
QC-13 Pattern / regex Values not matching pattern FAIL
QC-14 Minimum row count Row count below min_row_count FAIL

Schema contract checks (SC series)

Only run when expected_columns is set.

ID Name Triggers on Severity
SC-01 Unexpected column Column in file not in expected_columns FAIL
SC-02 Missing expected column Column in expected_columns absent from file FAIL

Version comparison checks (CP series)

Only run when a previous file is available. Skipped automatically in single-file mode.

ID Name Triggers on Severity
CP-01 Row count change Change > max_row_count_change_pct WARN
CP-02 Schema diff Columns added, dropped, or type-changed WARN
CP-03 Missing rate change Increase > max_missing_rate_change_pp pp WARN
CP-04 Numeric mean shift Shift > max_numeric_mean_shift_pct WARN
CP-05 New distinct values Values in current not seen in previous INFO
CP-06 Dropped distinct values Values in previous absent from current INFO
CP-07 Non-numeric rate change Increase > max_non_numeric_rate_change_pp pp WARN
CP-08 Column order change Column order differs from previous WARN (CSV) / FAIL (FWF)

Type inference

infer_col_type() classifies each column after whitespace trimming. Rules are applied in priority order:

  1. date — all non-empty values parse as a known date format (%Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y)
  2. numeric — ≥ type_inference_threshold of non-empty values coerce to numeric
  3. character — everything else
  4. unknown — all values are empty or NA

The threshold defaults to 90%, meaning a column with up to 10% bad values is still classified as numeric, making those bad values visible to QC-11 (non-numeric contamination). It is configurable per dataset via type_inference_threshold in default_rules or rule_overrides:

# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
  type_inference_threshold: 0.95

# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
  type_inference_threshold: 0.80

Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.

infer_col_type(c("2024-01-15", "2024-06-30"))  # "date"
infer_col_type(c("100", "200", "N/A"))          # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17)))  # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA))                   # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric" (85%)

Overriding inferred types

Some columns are formatted as numbers but are semantically character — phone numbers, postcodes, unit numbers, BSB codes. Automatic inference classifies them as numeric, which triggers irrelevant QC-11, CP-04, and CP-07 checks and can cause spurious CP-02 type-change warnings when the proportion of non-numeric values drifts across the inference threshold between deliveries.

Use column_types in the dataset YAML to force the type for any column:

column_types:
  TenantPhone:    character   # 10-digit numbers — not meaningful as numeric
  PremisesUnit:   character   # values like "5", "3A", "Ground Floor"
  PremisesPostCode: character # 4-digit codes validated by pattern, not numeric
  BSBCode:        character   # 6-digit bank routing codes

Valid values are character, numeric, and date. The override affects all downstream logic: the type is stored in SQLite as the column’s inferred_type, drift comparisons see the stable override, and QC/CP checks that would be meaningless for the forced type are skipped automatically.

Within custom check scripts, declare config as a second argument and use resolve_col_type() instead of infer_col_type() so that overrides are respected. If the function signature has only one argument (df), the config is not passed — both signatures work:

# Two-argument form — config is passed automatically, overrides respected
custom_checks <- function(df, config) {
  results <- list()
  for (col in names(df)) {
    if (resolve_col_type(col, df[[col]], config) == "numeric") {
      # numeric-specific check logic here
    }
  }
  results
}

# One-argument form — still works, type inference only (no override awareness)
custom_checks <- function(df) {
  # ...
}

Per-column threshold overrides

Global and dataset-level thresholds apply uniformly to all columns. Some columns naturally have higher missing rates or more variability, making a single threshold produce constant noise. Per-column threshold keys in column_rules let you tune specific columns without affecting others.

Resolution order for every threshold-based check:

column_rules.<col>.<threshold>  →  rule_overrides.<threshold>  →  default_rules.<threshold>

Supported per-column threshold keys

Key Governs Default
max_missing_rate QC-01 (missing rate) 0.05
max_non_numeric_rate QC-11 (non-numeric contamination) 0.01
max_missing_rate_change_pp CP-03 (missing rate change) 2.0 pp
max_numeric_mean_shift_pct CP-04 (numeric mean shift) 0.20
max_non_numeric_rate_change_pp CP-07 (non-numeric rate change) 1.0 pp

Example

column_rules:
  # Email is almost always absent by design — don't alert on it
  LandlordEmail:
    max_missing_rate:          1.00   # effectively disables QC-01 for this column
    max_missing_rate_change_pp: 5.0   # allow larger swings without CP-03 noise

  # Bond amount must always be present and should not shift much
  BondAmount:
    max_missing_rate:          0.00   # zero tolerance
    max_numeric_mean_shift_pct: 0.05  # tighter than global 0.20

  # Phone is forced to character via column_types, so numeric thresholds don't apply,
  # but you can still tighten missing-rate expectations:
  TenantPhone:
    max_missing_rate: 0.10

The threshold field in the HTML report and dq_result objects reflects the effective per-column threshold, not the global default, so the report always shows the value that was actually used.


Running a check

result <- run_dq_check(
  dataset_name = "customer_accounts",
  config_dir   = "path/to/configs",   # contains dqcheckr.yml + customer_accounts.yml
  open_report  = TRUE                 # open HTML in browser when interactive
)

# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html

result$status       # "PASS", "WARN", or "FAIL"
result$report_path  # full path to the HTML file
result$snapshot_id  # integer row ID in snapshots table

The HTML report contains:

  1. Header — dataset name, file name, run time, overall status badge
  2. File summary — row count, column count, file size, format, encoding
  3. Quality checks table — every check result, failures first
  4. Custom checks table — if configured
  5. Version comparison — what changed since the previous delivery
  6. Historical trend — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded)
  7. Appendix — full column-level statistics table

Calling individual checks

You can call any check function directly, without going through the full pipeline. Every check takes (df, config) and returns a list of dq_result objects.

library(dqcheckr)

# Build a minimal config programmatically
cfg <- list(
  rules = list(
    max_missing_rate     = 0.05,
    max_non_numeric_rate = 0.01,
    min_row_count        = 0
  ),
  column_rules     = list(
    status = list(allowed_values = c("ACTIVE", "CLOSED"))
  ),
  key_columns      = "id",
  expected_columns = NULL
)

df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)

# Run a single check
results <- check_missing_rate(df, cfg)

# Inspect
results[[1]]$status    # "PASS" or "FAIL"
results[[1]]$observed  # e.g. "3.2% missing (4 of 125)"

# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)

# Overall status
overall_status(all_results)  # "FAIL" > "WARN" > "PASS" > "INFO"

Custom checks

Create a plain .R file that defines custom_checks(df). It receives the trimmed data frame and must return a list of dq_result objects.

# File: custom/customer_accounts_checks.R

custom_checks <- function(df) {
  results <- list()

  # Rule: ACTIVE accounts must not have a zero balance
  active_zero <- df[df$account_status == "ACTIVE" &
                    !is.na(df$account_balance) &
                    df$account_balance == "0", ]
  n <- nrow(active_zero)
  results <- c(results, list(dq_result(
    check_id   = "CUST-01",
    check_name = "No zero-balance active accounts",
    column     = "account_balance",   # enables per-column storage in SQLite
    status     = if (n > 0) "FAIL" else "PASS",
    observed   = sprintf("%d ACTIVE account(s) with balance 0", n),
    message    = if (n > 0)
      sprintf("%d ACTIVE account(s) have a zero balance.", n)
    else
      "No ACTIVE accounts have a zero balance."
  )))

  results
}

Point to it in the dataset config:

custom_checks_file: "custom/customer_accounts_checks.R"

The file is sourced in an isolated environment (not the global session). dq_result() is available inside it. Any error — missing file, undefined function, runtime failure — stops the run with a clear message.

Tip: always set column = when your check is scoped to a specific column. Results with column = NA (the default) appear in the HTML report and contribute to the overall status, but are not written to the column_snapshots SQLite table and therefore do not appear in per-column trend queries.


The snapshot database

Every run writes two rows to a shared SQLite database: one row in snapshots (run-level summary) and one row per column-stat in column_snapshots.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "data/snapshots.sqlite")

# Recent runs for one dataset
dbGetQuery(con,
  "SELECT id, file_name, overall_status, check_fail_count, run_timestamp
   FROM snapshots
   WHERE dataset_name = 'customer_accounts'
   ORDER BY id DESC
   LIMIT 10")

# Column-level stats for the most recent run
dbGetQuery(con,
  "SELECT column_name, dq_check, value, threshold
   FROM column_snapshots
   WHERE snapshot_id = (
     SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
   )
   ORDER BY column_name, dq_check")

dbDisconnect(con)

Because all datasets share one database, you can query across them:

dbGetQuery(con,
  "SELECT dataset_name, COUNT(*) AS runs,
          SUM(check_fail_count) AS total_failures
   FROM snapshots
   GROUP BY dataset_name")

The output/ directory (and database file) are created automatically on the first run if they do not exist.


Historical drift comparison

run_dq_check() compares the current file to the previous file in a single run. For longer-range comparisons — “how has this dataset’s quality changed over the past two years?” — use compare_snapshots(), which reads two historical snapshots from SQLite without needing the original files.

List available snapshots

# All datasets
list_snapshots(config_dir = "config")

# One dataset
list_snapshots("customer_accounts", config_dir = "config")
#>   id  dataset_name          file_name        run_timestamp row_count overall_status
#> 1  1 customer_accounts  delivery_20250130.csv  2025-01-30 09:15:00    150000           PASS
#> 2  4 customer_accounts  delivery_20260128.csv  2026-01-28 10:22:00    158000           WARN
#> 3  7 customer_accounts  delivery_20270129.csv  2027-01-29 08:45:00    162000           FAIL

list_snapshots() returns a data frame invisibly. config_dir tells it where dqcheckr.yml is (to find snapshot_db).

Compare two snapshots

# Latest vs second-latest (default)
drift <- compare_snapshots("customer_accounts", config_dir = "config")

# Specific snapshot IDs — first argument is always treated as "previous"
drift <- compare_snapshots("customer_accounts",
                           snapshot_id_prev = 1,
                           snapshot_id_curr = 7,
                           config_dir = "config")

# Also write a plain-text report alongside the HTML
drift <- compare_snapshots("customer_accounts",
                           snapshot_id_prev = 1,
                           snapshot_id_curr = 7,
                           config_dir  = "config",
                           text_report = TRUE)

Console output is a single line:

[dqcheckr] drift: customer_accounts snapshot #1 vs #7 | reports/drift_customer_accounts_20270130_091500.html

What the drift report shows

  1. Header — dataset name, snapshot IDs, timestamps, file names
  2. Table-level drift — row count change (%), column count, check outcome counts (PASS / WARN / FAIL / INFO), threshold breach highlighted
  3. Schema drift — new columns, dropped columns, type changes
  4. Per-column drift (sorted by magnitude, breaches highlighted):
    • Missing rate change (percentage points)
    • Non-numeric rate change (percentage points)
    • Numeric mean shift (%)
    • Distinct count change (absolute and %)

Return value

compare_snapshots() returns the full drift data invisibly as a named list, so you can inspect or process the results programmatically:

drift <- compare_snapshots("customer_accounts",
                           snapshot_id_prev = 1,
                           snapshot_id_curr = 7,
                           config_dir = "config",
                           report = FALSE)   # skip HTML output

# Table-level summary
drift$table_drift

# Schema changes only
drift$schema_changes

# Columns with the largest missing-rate shift
head(drift$missing_rate_changes[, c("Column", "missing_rate_prev",
                                    "missing_rate_curr",
                                    "missing_rate_change_pp",
                                    "missing_rate_exceeds")])

# Columns whose numeric mean shifted most
head(drift$mean_shifts[, c("Column", "numeric_mean_prev",
                            "numeric_mean_curr",
                            "numeric_mean_shift_pct",
                            "numeric_mean_exceeds")])

Typical workflow

# 1. Regular delivery run (writes a new snapshot each time)
run_dq_check("customer_accounts", config_dir = "config")

# 2. Two years later, a new delivery arrives
run_dq_check("customer_accounts", config_dir = "config")

# 3. Discover what snapshots are available
list_snapshots("customer_accounts", config_dir = "config")

# 4. Compare any two
compare_snapshots("customer_accounts",
                  snapshot_id_prev = 1,
                  snapshot_id_curr = 4,
                  config_dir = "config")

Note: When explicit IDs are passed, the function treats the first as “previous” and the second as “current” regardless of their numeric order. This matters when baseline snapshots are backfilled from archived files — their IDs may be higher than earlier production runs.


Worked example — Star Wars dataset

The following uses dplyr::starwars exported to CSV and fixed-width files.

# Config directory contains:
#   dqcheckr.yml         — global thresholds
#   starwars_csv.yml     — CSV dataset config
#   starwars_fwf.yml     — FWF dataset config

# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)

The Star Wars config sets max_missing_rate: 0.60 because vehicles (87%) and starships (77%) are naturally sparse — most characters have none. Both columns FAIL, which is the correct finding: they are genuinely incomplete and a consumer should be aware before using them.

Expected console output:

[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html

Error handling

Situation Behaviour
Config file not found Stop immediately; no output written
Data file not found Stop immediately; no output written
Only one file available Single-file mode; CP checks skipped
File cannot be parsed Stop with filename and parse error
SC-01 / SC-02 violations FAIL recorded; run continues; report written
Custom checks file missing Stop with message
custom_checks() not defined Stop with message
custom_checks() runtime error Stop with R error detail
SQLite write fails Warning emitted; HTML report still written

The package never modifies, moves, or deletes source files.


Design principles

Owned and evolvable. Every check is plain R — readable, modifiable, and not hidden behind a third-party API.

Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.

Fail loudly. Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped.

Non-destructive. The package reads files and writes reports and database records. Source files are never touched.