This vignette collects practical advice for reading large DATASUS
files efficiently with datasusr.
The single most effective way to speed up reads is to select only the columns you need. The C parser skips unneeded fields at the byte level, so requesting fewer columns means less allocation and less parsing:
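A minimal sketch, reusing the RDPE2401.dbc file and the uf_zi and val_tot columns from the benchmark at the end of this vignette:

library(datasusr)

# Only the requested columns are allocated and parsed; every other
# field in the file is skipped at the byte level.
sih <- read_datasus_dbc("RDPE2401.dbc", select = c("uf_zi", "val_tot"))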
When guess_types = TRUE (the default), the reader scans
every value of each numeric field to decide whether it fits in an
integer or requires a double. Disabling this skips the scan and relies
on the DBF field metadata alone:
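A minimal sketch of a read with inference disabled:

# No per-value scan: column types come straight from the DBF header.
sih <- read_datasus_dbc("RDPE2401.dbc", guess_types = FALSE)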
This is safe for most analyses and can save noticeable time on files with millions of rows.
When you know exactly which types you need, col_types
lets you specify them upfront and skip inference entirely:
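A short sketch, again borrowing column names and types from the benchmark below:

# Explicit types: the reader performs no inference pass at all.
sih <- read_datasus_dbc(
  "RDPE2401.dbc",
  select = c("uf_zi", "val_tot"),
  col_types = c(uf_zi = "character", val_tot = "double")
)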
Date parsing (parse_dates = TRUE) adds a small overhead
per date field. Only enable it when you actually need Date
objects; otherwise the raw YYYYMMDD character strings are
often sufficient for filtering and grouping.
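As an illustration, fixed-width YYYYMMDD strings sort lexicographically in chronological order, so a date-range filter works directly on the raw values (dt_inter is the date field used in the benchmark below):

sih <- read_datasus_dbc("RDPE2401.dbc", select = c("dt_inter", "val_tot"))

# YYYYMMDD strings compare correctly as text, so no Date
# conversion is needed for a simple range filter.
jan <- sih[sih$dt_inter >= "20240101" & sih$dt_inter <= "20240131", ]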
Putting the options side by side, the bench package can measure them on a real file:

library(bench)

file <- "RDPE2401.dbc"

bench::mark(
  default  = read_datasus_dbc(file),
  no_guess = read_datasus_dbc(file, guess_types = FALSE),
  selected = read_datasus_dbc(file, select = c("uf_zi", "val_tot")),
  typed    = read_datasus_dbc(
    file,
    select = c("uf_zi", "dt_inter", "val_tot"),
    col_types = c(uf_zi = "character", dt_inter = "date", val_tot = "double"),
    parse_dates = TRUE,
    guess_types = FALSE
  ),
  check = FALSE,
  iterations = 5
)

datasusr allocates R vectors directly from C without
materialising intermediate files on disk. For very large datasets (tens
of millions of rows), consider processing files in batches rather than
binding everything into a single data frame.
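A sketch of that batch pattern, assuming a directory of monthly .dbc files; the data/ path and the per-uf_zi sum of val_tot are illustrative:

files <- list.files("data", pattern = "\\.dbc$", full.names = TRUE)

# Reduce each file to a small summary before combining, instead of
# binding tens of millions of raw rows into a single data frame.
summaries <- lapply(files, function(f) {
  x <- read_datasus_dbc(f, select = c("uf_zi", "val_tot"), guess_types = FALSE)
  aggregate(val_tot ~ uf_zi, data = x, FUN = sum)
})

# Re-aggregate, since the same uf_zi value can appear in several files.
totals <- aggregate(val_tot ~ uf_zi, data = do.call(rbind, summaries), FUN = sum)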