Changelog
Source: NEWS.md
dqcheckr 0.2.0
Breaking changes
- DuckDB connection required for all check functions. The `con = NULL` pure-R execution path has been removed. All check, comparison, and ingest functions now require a DuckDB connection. There is no automatic fallback.
- `read_dataset()` requires `con`. The function now aborts with an informative message if called without a connection.
- DuckDB replaces SQLite for snapshot storage. Existing `.sqlite` databases must be migrated with `inst/scripts/migrate_sqlite_to_duckdb.R`. Update `snapshot_db` in `dqcheckr.yml` to point to the new `.duckdb` path.
- Quarto replaces rmarkdown for HTML reports. Install Quarto CLI from https://quarto.org.
- CP-02 split into three result objects: `CP-02a` (new columns), `CP-02b` (dropped columns), `CP-02c` (type changes). Code filtering on `check_id == "CP-02"` must be updated.
- `run_timestamp` is now stored in UTC ISO-8601 format (`YYYY-MM-DDTHH:MM:SSZ`).
- `numeric_mean` stat key renamed to `numeric_parseable_mean` in `column_snapshots`.
- `readr` moved from `Imports` to `Suggests`. Packages that relied on readr as a transitive dependency of dqcheckr must now declare it directly.
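Two of the breaking changes above touch downstream code directly: the CP-02 split and the new `run_timestamp` format. The sketch below shows one way such code might be adapted, using base R only. The `results` data frame, its `status` column, and the example timestamp are hypothetical stand-ins; only the `check_id` values and the timestamp pattern come from the notes above.

```r
# Hypothetical results table; dqcheckr's real result objects may be shaped differently.
results <- data.frame(
  check_id = c("CP-01", "CP-02a", "CP-02b", "CP-02c", "CP-03"),
  status   = c("PASS", "WARN", "PASS", "FAIL", "PASS"),
  stringsAsFactors = FALSE
)

# A pre-0.2.0 filter on the old identifier no longer matches any rows.
old_filter <- results[results$check_id == "CP-02", ]

# Match the three split identifiers explicitly instead.
schema_ids <- c("CP-02a", "CP-02b", "CP-02c")
schema_checks <- results[results$check_id %in% schema_ids, ]

# Parse a run_timestamp written as UTC ISO-8601 (illustrative value).
stamp <- "2024-01-15T09:30:00Z"
parsed <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Produce a timestamp in the same format from the current time.
now_stamp <- format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
```

Listing the three identifiers explicitly, rather than matching on a `CP-02` prefix, keeps the filter from silently picking up identifiers added later.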
New features
- Native CSV ingestion via DuckDB. CSV files are read directly into DuckDB using `read_csv` with `strict_mode = false`, eliminating the previous readr-based workaround for files with many columns and quoted comma-containing fields (a sniffer bug in DuckDB ≤ 1.5.x). Peak memory drops from ~2× to ~1× file size for large CSV inputs.
- Native FWF ingestion via DuckDB. Fixed-width files are read using a `chr(1)` (SOH) delimiter to treat each line as a single VARCHAR column, then `SUBSTR` extracts columns at their exact character positions. All three file formats (CSV, FWF, and Parquet) are now read natively by DuckDB with no intermediate R data.frame allocation.
- Parquet input: add `format: parquet` to the dataset YAML.
- Outlier detection (`check_outliers()`, QC-16): configurable Z-score threshold per column via `column_rules.<col>.max_z_score`. Off by default.
- File size check (`check_file_size()`, QC-15): FAIL when file exceeds `max_file_size_mb`; always emits an INFO with the actual size.
- Maximum row count (`max_row_count`, QC-14b).
- Multi-column composite key uniqueness: `key_columns` now accepts a list.
- `compare_snapshots()` compares any two historical snapshots by ID.
- `list_snapshots()` lists available snapshots.
- Per-column type overrides (`column_types` in dataset YAML).
- Per-column threshold overrides in `column_rules` for QC-01, QC-11, CP-03, CP-04, and CP-07.
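The fixed-width approach described above (keep each line whole, then slice columns by character position) can be illustrated in base R. This is only an analogue of the idea, not the DuckDB SQL that dqcheckr runs; the sample lines and the `layout` table are hypothetical.

```r
# Each line is kept as one string, mirroring the single-VARCHAR-column read.
lines <- c("0001Alice     042",
           "0002Bob       137")

# Hypothetical layout: 1-based start position and width of each column.
layout <- data.frame(
  name  = c("id", "name", "score"),
  start = c(1L, 5L, 15L),
  width = c(4L, 10L, 3L),
  stringsAsFactors = FALSE
)

# Slice every column at its exact character positions and trim the padding.
# In SQL the same slice would be written roughly as SUBSTR(line, start, width).
cols <- lapply(seq_len(nrow(layout)), function(i) {
  last <- layout$start[i] + layout$width[i] - 1L
  trimws(substr(lines, layout$start[i], last))
})
names(cols) <- layout$name

parsed <- as.data.frame(cols, stringsAsFactors = FALSE)
```

Values stay as character strings here, so leading zeros such as `0001` survive; any type conversion would be a separate step.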
Behaviour changes
- All QC and CP checks execute as DuckDB SQL. The test suite exercises the same code path as production.
- `detect_files()` uses filename alphabetical order as a tiebreaker when modification times are equal.
- The `observed` field in QC-09, CP-05, and CP-06 is capped at 20 values.
- CP-03 severity is configurable via `missing_rate_change_severity`.
- CP-08 severity is configurable via `column_order_severity`.
- `flag_*` keys suppress WARN from report but still write changes to snapshot db.
- QC-11 supports two-level WARN/FAIL via `warn_non_numeric_rate`.
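The `detect_files()` tiebreaker can be pictured as a two-key sort. The sketch below is a base R illustration rather than the package's implementation: the file listing is made up, and it assumes the most recently modified file should come first, which the notes above do not state.

```r
# Hypothetical file listing; real times would come from file.info()$mtime.
files <- data.frame(
  path  = c("sales_b.csv", "sales_a.csv", "sales_old.csv"),
  mtime = as.POSIXct(
    c("2024-03-01 12:00:00", "2024-03-01 12:00:00", "2024-02-01 12:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Newest first; equal modification times fall back to alphabetical filename order.
ordered <- files[order(-as.numeric(files$mtime), files$path), ]
```

Without the second sort key, the relative order of `sales_a.csv` and `sales_b.csv` would depend on the order in which the files happened to be listed.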
dqcheckr 0.1.1
- `flag_new_columns`, `flag_dropped_columns`, `flag_type_changes`, and `flag_column_order_change` are now honoured.
- `type_inference_threshold` is configurable per dataset.
dqcheckr 0.1.0
Initial release.
- Single-snapshot quality checks: QC-01 to QC-14 and SC-01/SC-02.
- Version comparison checks: CP-01 to CP-08.
- Custom organisation-specific checks via a plain R file.
- Self-contained HTML report with check tables, trend charts, and column statistics appendix.
- SQLite snapshot database for long-term trend tracking.
- Supports CSV and fixed-width (FWF) file formats.
- Configuration via global `dqcheckr.yml` and per-dataset YAML files.