A correctness and robustness release driven by a code review of 0.1.0 (see issue #12 for the full catalogue). Two changes alter previously returned numeric output and are called out separately below.

nndr() now standardises (z-scores) each numerical
column by the real-data mean and standard deviation
before the nearest-neighbour distance is computed. Without this, a
single large-scale column (e.g. income in dollars)
dominated the Euclidean distance and the score moved with measurement
units rather than with row similarity. Pass
normalize = FALSE to recover the previous behaviour
exactly.

correlation_similarity() and
contingency_similarity() now return
score = NA_real_ (rather than 1) when there
are fewer than two columns of the relevant type, and
diagnostic_report() returns NA_real_ per
column when the synthetic column is entirely NA. Aggregated
property scores in quality_report() /
diagnostic_report() skip these NAs
(na.rm = TRUE), so they no longer overstate fidelity with a
placeholder score of 1 where there is no signal to measure.

equality_constraint() gains a tolerance
argument: with tolerance > 0 on numeric columns, the
check is abs(a - b) <= tolerance instead of exact
==. Default 0 preserves prior behaviour.

custom_constraint() gains a vectorized
argument: when TRUE, the predicate is called
once with the whole data frame instead of once per row.
Substantially faster on large synthetic samples for vectorisable
predicates.

ml_efficacy() gains a seed argument for
reproducible train/test splits. The caller’s global RNG state is
restored on exit, so callers using set.seed() elsewhere are
unaffected.

nndr() gains a normalize argument (default
TRUE); see the default-output note above.

print() methods for equality_constraint,
inequality_constraint,
fixed_combinations_constraint, and
custom_constraint.

metadata_to_json() / metadata_from_json()
now round-trip the structural constraint types (equality,
inequality, and fixed_combinations).
custom_constraint cannot be serialised — it holds an R
closure — and is dropped with a warning. Previously
metadata_to_json() crashed on any
constraint, so save_metadata() was effectively broken for
non-trivial metadata.

check_constraint.equality_constraint and
check_constraint.inequality_constraint now return
FALSE (not NA) for rows containing
NA. This prevents NA from propagating into the
row selector used by sample()’s rejection loop, which
previously inserted phantom NA-only rows.

sample_conditions() now honours metadata constraints
alongside the user-supplied conditions (previously it filtered only on
the conditions).

tvd_similarity() now strips NAs from both
sides and divides by the non-NA count on each side; previously
NA-padding inflated TVD.

ks_similarity() now suppresses the
ks.test() “p-value will be approximate in the presence
of ties” warning, which previously leaked to users on any integer
column with ties (very common in tables with integer ages, capital gains,
etc.).

fixed_combinations_constraint now uses a collision-free
length-prefix key encoding ("<nchar>:<value>"),
removing a theoretical separator collision in the previous paste-based
comparison.

fit.gaussian_copula_synthesizer() errors clearly when a
modeled column is entirely NA or when no row is complete
across all modeled columns. Previously the user saw a cryptic
'dim' must be an integer (>= 2) from inside
copula::normalCopula.

ml_efficacy() validates target_col (must
be a column of real) and test_fraction (must
be strictly between 0 and 1) up front.

attribute_disclosure_risk() validates that
known_cols are present and numeric (one-hot encode
categorical knowns first); previously, missing or non-numeric
known_cols triggered a cryptic FNN::knnx.index error.

gaussian_copula_synthesizer() cross-checks
numerical_distributions names against the metadata’s
numerical columns; typos such as
list(capitl_gain = "gamma"), which were previously ignored silently,
now raise a clear error.

sample_conditions() validates that .n
values are positive whole numbers (previously, fractional values were
silently truncated and negative values were accepted).

privacy_report() errors when only one of
sensitive_col / known_cols is supplied
(previously the disclosure-risk computation was silently skipped).

set_primary_key() emits an advisory warning when the
column’s metadata type is not "id", since the column would
otherwise be modeled as ordinary data and the diagnostic key-uniqueness
check would typically fail.

The set_column_type() documentation now describes the
level-ordering rule for categorical columns: a factor keeps its
levels() order, while a character column is sorted
lexicographically (c("2", "10") becomes
levels c("10", "2")).

Initial CRAN release.
Gaussian copula synthesizer (gaussian_copula_synthesizer()) that fits a single joint
copula over all modeled columns: numerical,
categorical, and boolean.

Marginal distribution selection among norm, beta,
gamma, truncnorm, and uniform by
Kolmogorov-Smirnov distance. Per-column overrides via
numerical_distributions; global default via
default_distribution.

sample() for unconditional generation and
sample_conditions() for conditional generation on
categorical or boolean values via rejection sampling.

Metadata handling (metadata(),
set_column_type(), set_primary_key()) with
auto-detection and JSON serialization (metadata_to_json(),
save_metadata()).

Constraints (add_constraint(), check_constraints()),
enforced via rejection sampling.

quality_report() aggregates metrics into the
two-property hierarchy used by the Python SDMetrics
library: Column Shapes and Column Pair Trends
(correlation_similarity() for numerical pairs,
contingency_similarity() for categorical pairs). ML
efficacy (train-on-synthetic / test-on-real, TSTR/TRTR) is reported
separately, not folded into the overall score.

diagnostic_report() checks structural validity:
boundary adherence (numerical ranges), category adherence (categorical
values), and key uniqueness for primary keys.

privacy_report() reports the nearest-neighbour distance
ratio (NNDR) and, optionally, attribute disclosure risk.

autoplot() methods for quality, diagnostic, and privacy
reports.

adult_income: a 500-row sample of the
UCI Adult Income dataset used in examples and vignettes.
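
For readers who want to see why the nndr() normalisation change matters, here is a minimal base-R sketch of a nearest-neighbour distance ratio with optional z-scoring by the real-data mean and standard deviation. The helper nndr_sketch() and the toy columns are hypothetical and written only for illustration; the package's nndr() may define the ratio or handle edge cases differently.

```r
# Hypothetical helper for illustration; NOT the package's nndr().
# For each synthetic row: distance to the nearest real row divided by
# the distance to the second-nearest real row.
nndr_sketch <- function(real, synthetic, normalize = TRUE) {
  real <- as.matrix(real)
  synthetic <- as.matrix(synthetic)
  if (normalize) {
    # Centre and scale BOTH tables with the real-data mean and sd.
    mu <- colMeans(real)
    sigma <- apply(real, 2, sd)
    sigma[sigma == 0] <- 1  # assumption: leave constant columns unscaled
    real <- sweep(sweep(real, 2, mu, "-"), 2, sigma, "/")
    synthetic <- sweep(sweep(synthetic, 2, mu, "-"), 2, sigma, "/")
  }
  vapply(seq_len(nrow(synthetic)), function(i) {
    # Euclidean distance from synthetic row i to every real row.
    d <- sqrt(rowSums(sweep(real, 2, synthetic[i, ], "-")^2))
    two_nearest <- sort(d)[1:2]
    two_nearest[1] / two_nearest[2]
  }, numeric(1))
}

set.seed(1)
real  <- data.frame(age = rnorm(50, 40, 10), income = rnorm(50, 50000, 15000))
synth <- data.frame(age = rnorm(20, 40, 10), income = rnorm(20, 50000, 15000))

# Same data with income expressed in thousands instead of dollars.
real_k  <- transform(real,  income = income / 1000)
synth_k <- transform(synth, income = income / 1000)

# With z-scoring, the change of units leaves the ratios unchanged.
same_when_normalized <- isTRUE(all.equal(
  nndr_sketch(real, synth), nndr_sketch(real_k, synth_k)
))

# Without z-scoring, the ratios depend on the units of income.
same_when_raw <- isTRUE(all.equal(
  nndr_sketch(real, synth, normalize = FALSE),
  nndr_sketch(real_k, synth_k, normalize = FALSE)
))
```

In this sketch same_when_normalized is TRUE, because a z-score is unaffected by rescaling a column, while the raw comparison differs because the dollar-scale column dominates the Euclidean distance.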
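
The constraint entries above say that constraints are enforced via rejection sampling and that, as of 0.1.1, equality and inequality checks return FALSE rather than NA for rows containing NA. The base-R sketch below shows why that matters. All function names (check_inequality, draw_candidates, rejection_sample) are hypothetical stand-ins, not the package's API, and the candidate generator is a toy.

```r
# Hypothetical stand-ins for illustration; none of these are package functions.

# Row-wise check that df[[low]] <= df[[high]]; rows with NA fail the check.
check_inequality <- function(df, low, high) {
  ok <- df[[low]] <= df[[high]]
  ok[is.na(ok)] <- FALSE
  ok
}

# A logical NA used as a row index yields an all-NA "phantom" row.
demo <- data.frame(start = c(1, 5), end = c(2, NA))
with_na_index <- demo[demo$start <= demo$end, ]               # 2 rows, one all NA
with_false    <- demo[check_inequality(demo, "start", "end"), ]  # 1 valid row

# Toy candidate generator with roughly 20% missing values in `end`.
draw_candidates <- function(n) {
  start <- round(runif(n, 0, 10))
  end <- round(runif(n, 0, 10))
  end[runif(n) < 0.2] <- NA
  data.frame(start = start, end = end)
}

# Rejection loop: draw batches, keep rows that pass, stop once n rows are kept.
rejection_sample <- function(n, max_tries = 100) {
  kept <- draw_candidates(0)
  for (i in seq_len(max_tries)) {
    cand <- draw_candidates(n)
    keep <- check_inequality(cand, "start", "end")
    kept <- rbind(kept, cand[keep, , drop = FALSE])
    if (nrow(kept) >= n) {
      return(kept[seq_len(n), , drop = FALSE])
    }
  }
  stop("could not draw ", n, " valid rows in ", max_tries, " tries")
}

set.seed(42)
out <- rejection_sample(25)
```

Because the selector never contains NA, every row in out satisfies start <= end and none is an all-NA placeholder.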