canpumf 0.5.2
New features
- Experimental list_statcan_pumf_catalogue() crawls the
live Statistics Canada “Public use microdata” listing and returns one
row per discovered survey edition (catalogue_id,
Title, edition, format,
url, product_url) — a discovery counterpart to
the curated list_canpumf_collection() that picks up newly
released PUMFs automatically. Editions offered in several formats
collapse to a single preferred row (CSV / flat text first), but
genuinely distinct surveys or file-types that share a reference year are
kept as separate rows (e.g. the GSS cycle and the Giving/Volunteering
survey both released in 2007, or the census
individual/family/household/hierarchical files for one year). Surveys
distributed only by Electronic File Transfer report
url = "(EFT)".
- The full crawl is expensive (hundreds of requests), so
list_statcan_pumf_catalogue() caches its result for the
duration of the R session and reuses it on subsequent calls with the
same arguments. Pass refresh = TRUE to re-scrape the live
catalogue and replace the cached result, e.g. to pick up a newly
released survey mid-session.
- Census PUMF editions are decoded from their cenNN / nhsNN filename
prefix (the 2011 cycle shipped as the National Household Survey) and
ind / fam / hous / hier file type into canonical
"YYYY (individuals)"-style strings matching
list_canpumf_collection(). This is forward-compatible: the
2026 census PUMF will resolve automatically once released.
- list_statcan_pumf_catalogue() now returns a
SeriesTitle column (the plain-language series name matching
the acronym) alongside an edition-specific Title. For
umbrella products whose catalogue title is only the series name
(e.g. the consolidated General Social Survey, or a census year’s
individuals/hierarchical pair) the Title is synthesised as
"<series> — <edition>", where the structural
edition descriptor disambiguates colliding reference years
("General Social Survey — Cycle 16 (2002)",
"Census of Population — 2021 (individuals)"); per-edition
products keep Statistics Canada’s own title.
- get_pumf() now resolves download URLs from the scraped
catalogue first for the series the crawler covers (GSS, SHS, SFS, CPSS,
CIS, CHS, ITS, CCAHS), so a newly released edition is downloadable
without a package update. Series the crawler deliberately does not cover
— LFS and Census (which keep their dedicated paths) and the
Giving/Volunteering surveys (SGVP, which Statistics Canada
ships under reused zip names the umbrella crawl cannot disambiguate) —
continue to resolve through the curated
list_canpumf_collection(). URL resolution never triggers a
live crawl; it reads the cached catalogue.
- The package now ships a frozen snapshot of the full catalogue crawl
(inst/extdata/pumf_catalogue.rds). It is the terminal,
always-available fallback for both URL resolution and
list_statcan_pumf_catalogue(): a freshly installed package
with no user cache and no network still resolves every supported
survey’s download URL, so a change to the Statistics Canada website
cannot silently break get_pumf() between releases. The
shipped snapshot is regenerated at each release.
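
The session cache and refresh = TRUE behaviour described above for list_statcan_pumf_catalogue() can be pictured with a minimal sketch. It is not the package's implementation: cached_crawl(), fake_crawl() and the lang argument are hypothetical names used only for illustration, and keying the cache on the deparsed arguments is an assumption.

```r
# Session-level cache: an environment that lives as long as the R session.
.catalogue_cache <- new.env(parent = emptyenv())

# Hypothetical wrapper: run crawl_fun(...) once per distinct argument set,
# unless refresh = TRUE forces a new crawl that replaces the cached result.
cached_crawl <- function(crawl_fun, ..., refresh = FALSE) {
  # Assumption: the deparsed argument list is a good enough cache key.
  key <- paste(deparse(list(...)), collapse = "")
  if (refresh || !exists(key, envir = .catalogue_cache, inherits = FALSE)) {
    assign(key, crawl_fun(...), envir = .catalogue_cache)
  }
  get(key, envir = .catalogue_cache, inherits = FALSE)
}

# Stand-in for the expensive crawl; counts how often it actually runs.
n_crawls <- 0
fake_crawl <- function(lang = "en") {
  n_crawls <<- n_crawls + 1
  data.frame(catalogue_id = "hypothetical-id", lang = lang)
}

first  <- cached_crawl(fake_crawl, lang = "en")                  # crawls
second <- cached_crawl(fake_crawl, lang = "en")                  # cache hit
third  <- cached_crawl(fake_crawl, lang = "en", refresh = TRUE)  # crawls again
```

In this sketch the second call never reaches fake_crawl(), while the refresh = TRUE call does and overwrites the stored entry.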
Robustness
- get_pumf(), get_pumf_connection() and
pumf_metadata() now fail gracefully when Statistics Canada
is unreachable: a download failure no longer raises an error but instead
emits an informative message and returns NULL.
list_available_lfs_pumf_versions() likewise returns an
empty result with a warning rather than erroring, matching the existing
behaviour of list_canpumf_collection() and
list_statcan_pumf_catalogue().
- close_pumf(NULL) is now a no-op, so it can be called
unconditionally on a get_pumf() result that may be
NULL.
- When options(canpumf.cache_path = ) is not set, the
package now notes this once when attached and again on the first
download, explaining that data is written to a temporary directory (and
discarded at the end of the session) and how to configure a persistent
cache. The underlying behaviour is unchanged — without a cache path,
data is stored in tempdir() for the session.
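
One way to configure a persistent cache is sketched below. Only the option name canpumf.cache_path comes from the notes above; the directory chosen here is an illustrative assumption, and any writable path should work equally well.

```r
# Pick a persistent, per-user cache directory. The location is illustrative;
# tools::R_user_dir() requires R >= 4.0.0.
cache_dir <- tools::R_user_dir("canpumf", which = "cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Point canpumf at it. Placing this line in ~/.Rprofile sets the option
# before the package is attached in every session.
options(canpumf.cache_path = cache_dir)
```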
Bug fixes
- Surveys whose StatCan ZIP archives carry accented path names stored
in CP437/Latin-1 without the UTF-8 flag (e.g. the Survey of Household
Spending 2017, whose data live under a Data - Données/ folder) now
extract correctly on Linux and Windows. Previously
utils::unzip() either errored with “invalid multibyte
string” (Windows) or silently dropped the affected files under a
non-UTF-8 locale (Linux), so the survey failed to import with “No
parseable metadata files found”. Extraction now uses
zip::unzip() as the primary, locale-agnostic extractor on
every platform (with the macOS ditto/system-unzip chain retained as a
fallback for newer ZIP compression variants), giving uniform
cross-platform behaviour. zip is a new dependency.
canpumf 0.5.1
New features
- Multi-module survey support. Surveys that ship several linked files
sharing a respondent key are now modelled as several joinable tables in
one DuckDB file.
get_pumf() returns the survey’s primary
(respondent-level) module and emits a one-time message listing the
available sibling modules;
pumf_module(tbl, "<module>") opens a sibling on the
same connection so the two are joinable, and announces
the shared join key. Each module’s join key is recorded in the registry
(module_key) so it never has to be guessed (it varies:
RECID, PUMFID, MICRO_ID,
CASEID, IDNUM). Converted surveys include GSS
cycle 16 / “Aging and Social Support” 2002 (MAIN + CG4 + CG6 + CR), GSS
Time Use 1998/2010/2015/2022 (Main + Episode), the Survey of Household
Spending 2017 (Interview + Diary, each with its own bootstrap weights),
and the Giving/Volunteering/Participating cycles 1997–2010 (MAIN +
GS/VD/GIVE/VOLNTR).
- close_pumf() now also accepts a DuckDB connection
returned by get_pumf_connection(), closing it directly, in
addition to a lazy dplyr::tbl() returned by
get_pumf().
- New parse_pdf_codebook() metadata parser for StatCan
bilingual PDF frequency codebooks. This recovers variable and
value labels for surveys whose only machine-readable companion is the
data file — notably CPSS cycle 1, which (unlike CPSS 2–6) ships no
variables.csv. CPSS 1 now imports with full bilingual
labels (parity with the other cycles) when pdftools is
installed. Like the existing PDF data-dictionary parser, it is a label
fallback that only fires when no command file or codebook CSV is found,
and requires pdftools (Suggests).
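
The multi-module layout described in the first item of this list (several tables in one DuckDB file, joined on a shared respondent key) can be illustrated without canpumf itself. The sketch below uses DBI, duckdb and dplyr (with dbplyr installed) on made-up data: the table names, the AGE_GRP and DURATION columns, and the choice of PUMFID as the key are all assumptions for illustration, not the contents of any real survey.

```r
library(DBI)
library(dplyr)

# One in-memory DuckDB database standing in for a survey's DuckDB file.
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Hypothetical primary (respondent-level) module and a sibling module
# with several rows per respondent, linked by PUMFID.
main <- data.frame(PUMFID = 1:3, AGE_GRP = c("25-34", "35-44", "45-54"))
episode <- data.frame(
  PUMFID   = c(1L, 1L, 2L, 3L, 3L, 3L),
  DURATION = c(30, 45, 60, 15, 20, 90)
)
dbWriteTable(con, "main", main)
dbWriteTable(con, "episode", episode)

# Lazy tables on the same connection, so the join runs inside DuckDB.
main_tbl    <- tbl(con, "main")
episode_tbl <- tbl(con, "episode")

minutes_per_respondent <- episode_tbl %>%
  group_by(PUMFID) %>%
  summarise(total_minutes = sum(DURATION, na.rm = TRUE)) %>%
  inner_join(main_tbl, by = "PUMFID") %>%
  collect() %>%
  arrange(PUMFID)

dbDisconnect(con, shutdown = TRUE)
```

Because both lazy tables share one connection, the aggregation and the join are pushed down to DuckDB and only the three summary rows are collected into R.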
Documentation
- New “Working with multi-module PUMF surveys” vignette showing how to
load the primary module, open sibling modules with
pumf_module(), join them inside DuckDB, and use
get_pumf_connection() / close_pumf()
directly.
- New “Bootstrap weights” vignette documenting the resampling method,
how the weights are stored, stratification, estimating uncertainty, and
the incremental re-run behaviour (reuse, adding replicates, and
regeneration when rows are added).
Bug fixes
- get_pumf("LFS") (and other calls) no longer trigger
spurious RStudio “Error in dbSendQuery(…)” Connections-pane popups.
Transient internal DuckDB connections (status checks, write phases, BSW
edits) are no longer registered in the RStudio Connections pane; only
the final connection returned to the user is registered.
- add_bootstrap_weights() on an in-memory
data.frame/tibble that already has replicate
columns now extends the existing set (generating only the additional
replicates) instead of regenerating a full set and producing duplicate
column names. This matches the DuckDB-backed behaviour.
- add_bootstrap_weights() now correctly handles rows added to a
survey table that already has bootstrap weights. Previously it
generated replicates for the new rows in isolation (resampling only
among the new rows), which is statistically wrong. It now deletes and
regenerates the affected weights: every row when unstratified, or only
the strata that gained rows when strata_cols are in effect
(complete strata keep their existing weights).
- GSS Time Use 1998 now imports cleanly regardless of locale. Under a
C locale (as in R CMD check), list.files()
selected the Main module’s SAS PROC FORMAT, which injected
categorical codes onto continuous clock-time, duration, decimal-hour and
birth-year variables; these are now declared force_numeric
so their values are preserved. In addition,
merge_metadata() no longer warns about label conflicts that
arise solely from lossy supplement parsers (SAS labels, PDF
dictionary/codebook) — authoritative-source conflicts still warn.
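
The regeneration rule described above for add_bootstrap_weights() (every row when unstratified, otherwise only the strata that gained rows) can be sketched as a row selector. This is an illustration under stated assumptions, not the package's code: rows_to_regenerate() is a hypothetical helper, new rows are assumed to be recognisable by a missing value in a replicate column named BSW1, and only a single strata column is handled.

```r
# Return a logical vector marking the rows whose replicate weights
# would have to be deleted and regenerated.
rows_to_regenerate <- function(df, rep_col, strata_col = NULL) {
  # Assumption: appended rows have no replicate weight yet (NA).
  is_new <- is.na(df[[rep_col]])
  if (!any(is_new)) return(rep(FALSE, nrow(df)))        # nothing was added
  if (is.null(strata_col)) return(rep(TRUE, nrow(df)))  # unstratified: all rows
  # Stratified: every row in a stratum that gained at least one new row.
  affected <- unique(df[[strata_col]][is_new])
  df[[strata_col]] %in% affected
}

# Hypothetical survey: stratum A is complete, stratum B gained one row.
survey <- data.frame(
  stratum = c("A", "A", "B", "B", "B"),
  BSW1    = c(1, 1, 2, 0, NA)
)

stratified   <- rows_to_regenerate(survey, "BSW1", "stratum")
unstratified <- rows_to_regenerate(survey, "BSW1")
```

Here stratified selects only the three rows of stratum B, so stratum A keeps its existing weights, while unstratified selects all five rows.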
canpumf 0.5.0
Major changes
- Data is now imported into DuckDB (a breaking change, though existing
code needs only slight modification)
- Adaptable metadata parsing registry
- Several more robust strategies for parsing metadata
- Better data download and import mechanics
- Extensive test suite to prevent regressions and to detect when StatCan
re-releases data with changed metadata