Getting started with dqcheckr

dqcheckr automates the verification of recurring external dataset deliveries. For each new file arrival it runs a battery of quality checks, compares the file to the previous delivery, writes a self-contained HTML report, and records summary statistics in a local SQLite database so that quality trends can be tracked over time.

How it works

Your data file                  Your YAML config
      │                                │
      ▼                                ▼
  read_dataset()            load_config()
  (all columns as           (global defaults +
   character, trimmed)       per-dataset rules)
      │                                │
      └──────────────┬─────────────────┘
                     ▼
          run_qc_checks()        ← single-snapshot checks (QC / SC series)
          run_comparison_checks()← vs previous delivery   (CP series)
          run_custom_checks()    ← your own rules
                     │
                     ▼
          write_snapshot()       ← SQLite: run summary + per-column stats
          render_report()        ← self-contained HTML
                     │
                     ▼
          run_dq_check()         ← the one function you call

A data officer runs a single command for each arriving dataset:

run_dq_check("customer_accounts", config_dir = "path/to/configs")

This prints a one-line console summary, writes an HTML report, and returns list(status, report_path, snapshot_id) invisibly.

Installation

dqcheckr is a first-party package — install it directly from source:

devtools::install("path/to/dqr/dqcheckr")
library(dqcheckr)

All dependencies are on CRAN: readr, DBI, RSQLite, rmarkdown, knitr, kableExtra, ggplot2, gridExtra, yaml, dplyr, tidyr, rlang.

Configuration

Two YAML files control every run.

Global config — `dqcheckr.yml`

Placed in your config directory. Sets default thresholds used by all datasets unless a dataset overrides them.

snapshot_db:       "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Proportion of non-empty values that must parse as numeric for a column to
  # be classified as numeric (affects type inference across many checks)
  type_inference_threshold: 0.90

  # FAIL if any column's missing-value rate exceeds this (0-1)
  max_missing_rate: 0.05

  # FAIL if non-numeric values in a numeric column exceed this rate
  max_non_numeric_rate: 0.01

  # FAIL if row count drops below this (0 = disabled)
  min_row_count: 0

  # Version-comparison thresholds (used when a previous file is available)
  max_row_count_change_pct:       0.10
  max_numeric_mean_shift_pct:     0.20
  max_missing_rate_change_pp:     2.0    # percentage points
  max_non_numeric_rate_change_pp: 1.0

  flag_new_columns:         true   # set false to suppress new-column warnings in CP-02
  flag_dropped_columns:     true   # set false to suppress dropped-column warnings in CP-02
  flag_type_changes:        true   # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true   # set false to skip CP-08 entirely (WARN CSV / FAIL FWF)

Per-dataset config — `<dataset_name>.yml`

One file per dataset. Only the keys you specify are needed — everything else inherits from the global defaults.

dataset_name: "customer_accounts"   # for human reference only — the functional
                                     # identifier is the YAML filename, which must
                                     # match the dataset_name argument to run_dq_check()

# --- File location ---
# Option A: folder scan (two most recently modified files are used)
folder: "data/incoming/customer_accounts/"

# Option B: explicit paths (set current_file only → single-file mode,
#           no version comparison)
# current_file:  "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

format:    csv
encoding:  "UTF-8"
delimiter: ","

# --- Schema contract (optional) ---
# SC-01 flags extra columns; SC-02 flags missing columns.
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# --- Key columns (optional) ---
# QC-12 checks each for duplicate values.
key_columns:
  - id

# --- Rule overrides (optional) ---
# These replace the matching keys in default_rules for this dataset only.
rule_overrides:
  max_missing_rate:        0.02
  max_row_count_change_pct: 0.05
  min_row_count:           1000

# --- Per-column type overrides (optional) ---
# Force specific columns to a type regardless of what the data looks like.
# Valid values: character, numeric, date.
# Eliminates false positives on phone numbers, postcodes, unit numbers, etc.
column_types:
  phone:    character   # prevent QC-11 / CP-04 / CP-07 false positives
  postcode: character   # 4-digit codes are not meaningful as numeric
  bsb_code: character   # 6-digit bank codes are not meaningful as numeric

# --- Per-column rules (optional) ---
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]   # QC-09
    pattern: "^[A-Z]{2}$"                       # QC-13
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0                                 # QC-10
    max_value: 1000000
  # Per-column threshold overrides (optional, within column_rules):
  email:
    max_missing_rate: 1.00          # email is 99% absent by design — suppress QC-01
    max_missing_rate_change_pp: 5.0 # looser than global 2.0 — suppress CP-03 noise
  account_balance:
    max_missing_rate: 0.00          # balance must always be present
    max_numeric_mean_shift_pct: 0.05  # stricter than global 0.20

# --- Custom checks (optional) ---
# custom_checks_file: "custom/customer_accounts_checks.R"

# --- Output paths (optional, override global) ---
# snapshot_db:       "data/snapshots.sqlite"
# report_output_dir: "reports/"

What is required vs optional

The only fields the package requires to run are the file location (folder or current_file) and the format. Every other config key is optional — the corresponding checks are simply skipped when the key is absent.

Config key	Skipped / default when absent
`key_columns`	QC-12 (key uniqueness) skipped
`expected_columns`	SC-01 and SC-02 (schema contract) skipped
`column_types`	All columns use automatic type inference
`column_rules.allowed_values`	QC-09 (allowed values) skipped
`column_rules.min_value` / `max_value`	QC-10 (numeric bounds) skipped
`column_rules.pattern`	QC-13 (pattern / regex) skipped
`column_rules.<col>.max_missing_rate`	Falls back to dataset / global `max_missing_rate`
`column_rules.<col>.max_non_numeric_rate`	Falls back to dataset / global `max_non_numeric_rate`
`column_rules.<col>.max_missing_rate_change_pp`	Falls back to dataset / global `max_missing_rate_change_pp`
`column_rules.<col>.max_numeric_mean_shift_pct`	Falls back to dataset / global `max_numeric_mean_shift_pct`
`column_rules.<col>.max_non_numeric_rate_change_pp`	Falls back to dataset / global `max_non_numeric_rate_change_pp`
`custom_checks_file`	All custom checks skipped
`previous_file` / second file in folder	All CP-01–CP-08 (version comparison) skipped
`min_row_count` set to 0	QC-14 (minimum row count) skipped
`type_inference_threshold`	Defaults to 0.90 (affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, CP-07)
`flag_new_columns`	Defaults to `true`; set to `false` to suppress new-column warnings in CP-02 (still tracked in SQLite)
`flag_dropped_columns`	Defaults to `true`; set to `false` to suppress dropped-column warnings in CP-02 (still tracked in SQLite)
`flag_type_changes`	Defaults to `true`; set to `false` to suppress type-change warnings in CP-02
`flag_column_order_change`	Defaults to `true`; set to `false` to skip CP-08 entirely

A minimal dataset config that runs basic checks with no column-level rules:

dataset_name: "my_dataset"           # for reference; must match the YAML filename
current_file: "data/incoming/my_dataset.csv"
format: csv

Fixed-width files

For fixed-width files, set format: fwf and supply column widths and names. If the file has a header row, set fwf_skip: 1 to skip it.

format: fwf
fwf_skip: 1
fwf_widths:    [10, 20, 3, 10, 12, 11]
fwf_col_names: [id, name, country_code, account_status, account_balance, created_date]

The quality checks

Each check returns one or more dq_result objects with four key fields: check_id, status (PASS / WARN / FAIL / INFO), observed, and message. The overall run status is the worst single status across all checks.

Single-snapshot checks (QC series)

These run on the current file alone.

ID	Name	Triggers on	Severity
QC-01	Missing rate	Column missing rate > `max_missing_rate`	FAIL
QC-02	Empty column	Column is 100% empty	FAIL
QC-03	Duplicate rows	Any fully-identical rows	WARN
QC-04	Row count	Always	INFO
QC-05	Column count	Always	INFO
QC-06	Inferred type	Always (date / numeric / character / unknown)	INFO
QC-07	Numeric stats	Numeric columns (min, max, mean, SD)	INFO
QC-08	Distinct count	Character columns	INFO
QC-09	Allowed values	Values outside `allowed_values` list	FAIL
QC-10	Numeric bounds	Values outside `min_value` / `max_value`	FAIL
QC-11	Non-numeric values	Non-numeric values in a numeric column	WARN / FAIL
QC-12	Key uniqueness	Duplicate values in a `key_columns` column	FAIL
QC-13	Pattern / regex	Values not matching `pattern`	FAIL
QC-14	Minimum row count	Row count below `min_row_count`	FAIL

Schema contract checks (SC series)

Only run when expected_columns is set.

ID	Name	Triggers on	Severity
SC-01	Unexpected column	Column in file not in `expected_columns`	FAIL
SC-02	Missing expected column	Column in `expected_columns` absent from file	FAIL

Version comparison checks (CP series)

Only run when a previous file is available. Skipped automatically in single-file mode.

ID	Name	Triggers on	Severity
CP-01	Row count change	Change > `max_row_count_change_pct`	WARN
CP-02	Schema diff	Columns added, dropped, or type-changed	WARN
CP-03	Missing rate change	Increase > `max_missing_rate_change_pp` pp	WARN
CP-04	Numeric mean shift	Shift > `max_numeric_mean_shift_pct`	WARN
CP-05	New distinct values	Values in current not seen in previous	INFO
CP-06	Dropped distinct values	Values in previous absent from current	INFO
CP-07	Non-numeric rate change	Increase > `max_non_numeric_rate_change_pp` pp	WARN
CP-08	Column order change	Column order differs from previous	WARN (CSV) / FAIL (FWF)

Type inference

infer_col_type() classifies each column after whitespace trimming. Rules are applied in priority order:

date — all non-empty values parse as a known date format (%Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y)
numeric — ≥ type_inference_threshold of non-empty values coerce to numeric
character — everything else
unknown — all values are empty or NA

The threshold defaults to 90%, meaning a column with up to 10% bad values is still classified as numeric, making those bad values visible to QC-11 (non-numeric contamination). It is configurable per dataset via type_inference_threshold in default_rules or rule_overrides:

# raise threshold — stricter, fewer columns classified as numeric
rule_overrides:
  type_inference_threshold: 0.95

# lower threshold — more lenient, columns with more noise still classified numeric
rule_overrides:
  type_inference_threshold: 0.80

Changing the threshold affects QC-06, QC-07, QC-08, QC-11, CP-02, CP-04, CP-05, CP-06, and CP-07. Note that changing it between runs may cause CP-02 to report a spurious type change on columns that sit close to the old and new thresholds, even if the underlying data did not change.

infer_col_type(c("2024-01-15", "2024-06-30"))  # "date"
infer_col_type(c("100", "200", "N/A"))          # "character" (only 67% numeric)
infer_col_type(c("100", "200", "N/A", rep("50", 17)))  # "numeric" (95% numeric)
infer_col_type(c(NA, "", NA))                   # "unknown"
# with a custom threshold:
infer_col_type(c(rep("1", 17), "a", "b", "c"), threshold = 0.80)  # "numeric" (85%)

Overriding inferred types

Some columns are formatted as numbers but are semantically character — phone numbers, postcodes, unit numbers, BSB codes. Automatic inference classifies them as numeric, which triggers irrelevant QC-11, CP-04, and CP-07 checks and can cause spurious CP-02 type-change warnings when the proportion of non-numeric values drifts across the inference threshold between deliveries.

Use column_types in the dataset YAML to force the type for any column:

column_types:
  TenantPhone:    character   # 10-digit numbers — not meaningful as numeric
  PremisesUnit:   character   # values like "5", "3A", "Ground Floor"
  PremisesPostCode: character # 4-digit codes validated by pattern, not numeric
  BSBCode:        character   # 6-digit bank routing codes

Valid values are character, numeric, and date. The override affects all downstream logic: the type is stored in SQLite as the column’s inferred_type, drift comparisons see the stable override, and QC/CP checks that would be meaningless for the forced type are skipped automatically.

Within custom check scripts, declare config as a second argument and use resolve_col_type() instead of infer_col_type() so that overrides are respected. If the function signature has only one argument (df), the config is not passed — both signatures work:

# Two-argument form — config is passed automatically, overrides respected
custom_checks <- function(df, config) {
  results <- list()
  for (col in names(df)) {
    if (resolve_col_type(col, df[[col]], config) == "numeric") {
      # numeric-specific check logic here
    }
  }
  results
}

# One-argument form — still works, type inference only (no override awareness)
custom_checks <- function(df) {
  # ...
}

Per-column threshold overrides

Global and dataset-level thresholds apply uniformly to all columns. Some columns naturally have higher missing rates or more variability, making a single threshold produce constant noise. Per-column threshold keys in column_rules let you tune specific columns without affecting others.

Resolution order for every threshold-based check:

column_rules.<col>.<threshold>  →  rule_overrides.<threshold>  →  default_rules.<threshold>

Supported per-column threshold keys

Key	Governs	Default
`max_missing_rate`	QC-01 (missing rate)	0.05
`max_non_numeric_rate`	QC-11 (non-numeric contamination)	0.01
`max_missing_rate_change_pp`	CP-03 (missing rate change)	2.0 pp
`max_numeric_mean_shift_pct`	CP-04 (numeric mean shift)	0.20
`max_non_numeric_rate_change_pp`	CP-07 (non-numeric rate change)	1.0 pp

Example

column_rules:
  # Email is almost always absent by design — don't alert on it
  LandlordEmail:
    max_missing_rate:          1.00   # effectively disables QC-01 for this column
    max_missing_rate_change_pp: 5.0   # allow larger swings without CP-03 noise

  # Bond amount must always be present and should not shift much
  BondAmount:
    max_missing_rate:          0.00   # zero tolerance
    max_numeric_mean_shift_pct: 0.05  # tighter than global 0.20

  # Phone is forced to character via column_types, so numeric thresholds don't apply,
  # but you can still tighten missing-rate expectations:
  TenantPhone:
    max_missing_rate: 0.10

The threshold field in the HTML report and dq_result objects reflects the effective per-column threshold, not the global default, so the report always shows the value that was actually used.

Running a check

result <- run_dq_check(
  dataset_name = "customer_accounts",
  config_dir   = "path/to/configs",   # contains dqcheckr.yml + customer_accounts.yml
  open_report  = TRUE                 # open HTML in browser when interactive
)

# Console output (always printed):
# [dqcheckr] customer_accounts: FAIL - 1 warning(s), 2 failure(s). Report: reports/...html

result$status       # "PASS", "WARN", or "FAIL"
result$report_path  # full path to the HTML file
result$snapshot_id  # integer row ID in snapshots table

The HTML report contains:

Header — dataset name, file name, run time, overall status badge
File summary — row count, column count, file size, format, encoding
Quality checks table — every check result, failures first
Custom checks table — if configured
Version comparison — what changed since the previous delivery
Historical trend — row count and check outcomes across the last 10 runs (appears once at least two runs have been recorded)
Appendix — full column-level statistics table

Calling individual checks

You can call any check function directly, without going through the full pipeline. Every check takes (df, config) and returns a list of dq_result objects.

library(dqcheckr)

# Build a minimal config programmatically
cfg <- list(
  rules = list(
    max_missing_rate     = 0.05,
    max_non_numeric_rate = 0.01,
    min_row_count        = 0
  ),
  column_rules     = list(
    status = list(allowed_values = c("ACTIVE", "CLOSED"))
  ),
  key_columns      = "id",
  expected_columns = NULL
)

df <- read.csv("data/customer_accounts.csv", colClasses = "character")
df[] <- lapply(df, trimws)

# Run a single check
results <- check_missing_rate(df, cfg)

# Inspect
results[[1]]$status    # "PASS" or "FAIL"
results[[1]]$observed  # e.g. "3.2% missing (4 of 125)"

# Run all single-snapshot checks at once
all_results <- run_qc_checks(df, cfg)

# Overall status
overall_status(all_results)  # "FAIL" > "WARN" > "PASS" > "INFO"

Custom checks

Create a plain .R file that defines custom_checks(df). It receives the trimmed data frame and must return a list of dq_result objects.

# File: custom/customer_accounts_checks.R

custom_checks <- function(df) {
  results <- list()

  # Rule: ACTIVE accounts must not have a zero balance
  active_zero <- df[df$account_status == "ACTIVE" &
                    !is.na(df$account_balance) &
                    df$account_balance == "0", ]
  n <- nrow(active_zero)
  results <- c(results, list(dq_result(
    check_id   = "CUST-01",
    check_name = "No zero-balance active accounts",
    column     = "account_balance",   # enables per-column storage in SQLite
    status     = if (n > 0) "FAIL" else "PASS",
    observed   = sprintf("%d ACTIVE account(s) with balance 0", n),
    message    = if (n > 0)
      sprintf("%d ACTIVE account(s) have a zero balance.", n)
    else
      "No ACTIVE accounts have a zero balance."
  )))

  results
}

Point to it in the dataset config:

custom_checks_file: "custom/customer_accounts_checks.R"

The file is sourced in an isolated environment (not the global session). dq_result() is available inside it. Any error — missing file, undefined function, runtime failure — stops the run with a clear message.

Tip: always set column = when your check is scoped to a specific column. Results with column = NA (the default) appear in the HTML report and contribute to the overall status, but are not written to the column_snapshots SQLite table and therefore do not appear in per-column trend queries.

The snapshot database

Every run writes two rows to a shared SQLite database: one row in snapshots (run-level summary) and one row per column-stat in column_snapshots.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "data/snapshots.sqlite")

# Recent runs for one dataset
dbGetQuery(con,
  "SELECT id, file_name, overall_status, check_fail_count, run_timestamp
   FROM snapshots
   WHERE dataset_name = 'customer_accounts'
   ORDER BY id DESC
   LIMIT 10")

# Column-level stats for the most recent run
dbGetQuery(con,
  "SELECT column_name, dq_check, value, threshold
   FROM column_snapshots
   WHERE snapshot_id = (
     SELECT MAX(id) FROM snapshots WHERE dataset_name = 'customer_accounts'
   )
   ORDER BY column_name, dq_check")

dbDisconnect(con)

Because all datasets share one database, you can query across them:

dbGetQuery(con,
  "SELECT dataset_name, COUNT(*) AS runs,
          SUM(check_fail_count) AS total_failures
   FROM snapshots
   GROUP BY dataset_name")

The output/ directory (and database file) are created automatically on the first run if they do not exist.

Historical drift comparison

run_dq_check() compares the current file to the previous file in a single run. For longer-range comparisons — “how has this dataset’s quality changed over the past two years?” — use compare_snapshots(), which reads two historical snapshots from SQLite without needing the original files.

List available snapshots

# All datasets
list_snapshots(config_dir = "config")

# One dataset
list_snapshots("customer_accounts", config_dir = "config")
#>   id  dataset_name          file_name        run_timestamp row_count overall_status
#> 1  1 customer_accounts  delivery_20250130.csv  2025-01-30 09:15:00    150000           PASS
#> 2  4 customer_accounts  delivery_20260128.csv  2026-01-28 10:22:00    158000           WARN
#> 3  7 customer_accounts  delivery_20270129.csv  2027-01-29 08:45:00    162000           FAIL

list_snapshots() returns a data frame invisibly. config_dir tells it where dqcheckr.yml is (to find snapshot_db).

Compare two snapshots

# Latest vs second-latest (default)
drift <- compare_snapshots("customer_accounts", config_dir = "config")

# Specific snapshot IDs — first argument is always treated as "previous"
drift <- compare_snapshots("customer_accounts",
                           snapshot_id_prev = 1,
                           snapshot_id_curr = 7,
                           config_dir = "config")

# Also write a plain-text report alongside the HTML
drift <- compare_snapshots("customer_accounts",
                           snapshot_id_prev = 1,
                           snapshot_id_curr = 7,
                           config_dir  = "config",
                           text_report = TRUE)

Console output is a single line:

[dqcheckr] drift: customer_accounts snapshot #1 vs #7 | reports/drift_customer_accounts_20270130_091500.html

What the drift report shows

Header — dataset name, snapshot IDs, timestamps, file names
Table-level drift — row count change (%), column count, check outcome counts (PASS / WARN / FAIL / INFO), threshold breach highlighted
Schema drift — new columns, dropped columns, type changes
Per-column drift (sorted by magnitude, breaches highlighted):
- Missing rate change (percentage points)
- Non-numeric rate change (percentage points)
- Numeric mean shift (%)
- Distinct count change (absolute and %)

Return value

compare_snapshots() returns the full drift data invisibly as a named list, so you can inspect or process the results programmatically:

drift <- compare_snapshots("customer_accounts",
                           snapshot_id_prev = 1,
                           snapshot_id_curr = 7,
                           config_dir = "config",
                           report = FALSE)   # skip HTML output

# Table-level summary
drift$table_drift

# Schema changes only
drift$schema_changes

# Columns with the largest missing-rate shift
head(drift$missing_rate_changes[, c("Column", "missing_rate_prev",
                                    "missing_rate_curr",
                                    "missing_rate_change_pp",
                                    "missing_rate_exceeds")])

# Columns whose numeric mean shifted most
head(drift$mean_shifts[, c("Column", "numeric_mean_prev",
                            "numeric_mean_curr",
                            "numeric_mean_shift_pct",
                            "numeric_mean_exceeds")])

Typical workflow

# 1. Regular delivery run (writes a new snapshot each time)
run_dq_check("customer_accounts", config_dir = "config")

# 2. Two years later, a new delivery arrives
run_dq_check("customer_accounts", config_dir = "config")

# 3. Discover what snapshots are available
list_snapshots("customer_accounts", config_dir = "config")

# 4. Compare any two
compare_snapshots("customer_accounts",
                  snapshot_id_prev = 1,
                  snapshot_id_curr = 4,
                  config_dir = "config")

Note: When explicit IDs are passed, the function treats the first as “previous” and the second as “current” regardless of their numeric order. This matters when baseline snapshots are backfilled from archived files — their IDs may be higher than earlier production runs.

Worked example — Star Wars dataset

The following uses dplyr::starwars exported to CSV and fixed-width files.

# Config directory contains:
#   dqcheckr.yml         — global thresholds
#   starwars_csv.yml     — CSV dataset config
#   starwars_fwf.yml     — FWF dataset config

# Run checks on both formats
result_csv <- run_dq_check("starwars_csv", config_dir = "config", open_report = TRUE)
result_fwf <- run_dq_check("starwars_fwf", config_dir = "config", open_report = TRUE)

The Star Wars config sets max_missing_rate: 0.60 because vehicles (87%) and starships (77%) are naturally sparse — most characters have none. Both columns FAIL, which is the correct finding: they are genuinely incomplete and a consumer should be aware before using them.

Expected console output:

[dqcheckr] starwars_csv: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_csv_....html
[dqcheckr] starwars_fwf: FAIL - 0 warning(s), 2 failure(s). Report: output/reports/starwars_fwf_....html

Error handling

Situation	Behaviour
Config file not found	Stop immediately; no output written
Data file not found	Stop immediately; no output written
Only one file available	Single-file mode; CP checks skipped
File cannot be parsed	Stop with filename and parse error
SC-01 / SC-02 violations	FAIL recorded; run continues; report written
Custom checks file missing	Stop with message
`custom_checks()` not defined	Stop with message
`custom_checks()` runtime error	Stop with R error detail
SQLite write fails	Warning emitted; HTML report still written

The package never modifies, moves, or deletes source files.

Design principles

Owned and evolvable. Every check is plain R — readable, modifiable, and not hidden behind a third-party API.

Plain English outputs. Reports are written for data management staff, not engineers. No R code, no package names, no jargon.

Fail loudly. Configuration errors, missing files, and custom-check failures all stop the run with a clear message. Nothing is silently skipped.

Non-destructive. The package reads files and writes reports and database records. Source files are never touched.