This document provides a practical overview and methodological
recommendations for using childeswordfreq.
childeswordfreq is designed as a thin, reproducible
layer on top of childesr. All data access is
remote.
Dependencies: childesr, dplyr, tidyr, tibble, rlang, writexl, readr, cachem, memoise, rappdirs.

Caching is off by default; Section 8.2 explains when and why to use it.
Installation
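A typical installation sketch; this assumes the package can be installed from CRAN, which should be checked against the package's own documentation:

install.packages("childeswordfreq")
library(childeswordfreq)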
word_counts()

word_counts() accepts either:

- a .csv file with a column named word, or
- a character vector passed via the words argument.

If word_list_file = NULL and words = NULL,
word_counts() enters “all-words” mode and
counts every type in the selected slice, as in the sketch below.
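For example, an all-words run over a single corpus slice (a minimal sketch; the output path is illustrative):

# No word list supplied, so every type in the slice is counted
out_all <- tempfile(fileext = ".xlsx")
word_counts(
  word_list_file = NULL,
  words = NULL,
  output_file = out_all,
  language = "eng",
  corpus = "Brown",
  age = c(24, 36)
)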
# Prepare a temporary CSV with a `word` column
tmp_csv <- tempfile(fileext = ".csv")
write.csv(
data.frame(word = c("go", "want", "think")),
tmp_csv,
row.names = FALSE
)
# Output path for the Excel workbook
out <- tempfile(fileext = ".xlsx")
# Run word_counts on a specific CHILDES slice
word_counts(
word_list_file = tmp_csv,
output_file = out,
language = "eng",
corpus = "Brown",
age = c(24, 36),
role = c("CHI", "MOT")
)

Requirements for .csv inputs:

- the input must be a .csv file, and
- it must contain a column named word.

All CHILDES filters (language, corpus,
collection, age, sex,
role, role_exclude) are passed directly to
childesr. In practice:

- corpus and language names must match CHILDES conventions exactly.
- age is in months and can be a single value or c(min, max).
- role and role_exclude are speaker code(s), for example "CHI", "MOT", "FAT", "ADU".
- NULL is interpreted as “no restriction” on that dimension.

Example: all English-language corpora, no restriction on age or role (passing NULL explicitly, per the convention above):
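word_counts(
  word_list_file = tmp_csv,
  output_file = out,
  language = "eng",
  corpus = NULL,  # NULL = all English-language corpora
  age = NULL,     # no age restriction
  role = NULL     # no role restriction
)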
word_counts() has two operational modes:

- Type mode, the default, uses childesr::get_types().
- Token mode is used when any of the following applies:
  - wildcard = TRUE (patterns with % or _),
  - collapse = "stem",
  - part_of_speech is non-NULL,
  - tier = "mor".

Token mode uses childesr::get_tokens() and may be
substantially slower, particularly when used over wide
age ranges or many corpora.
Examples:
Type mode (exact forms)
word_counts(
words = c("go", "went", "going"),
output_file = out,
language = "eng",
corpus = "Brown",
age = c(24, 36)
)

Token mode with wildcard and stems
word_counts(
words = c("go%", "run%"),
output_file = out,
language = "eng",
corpus = "Brown",
age = c(24, 36),
wildcard = TRUE, # "%" and "_" patterns
collapse = "stem", # aggregate inflected variants
part_of_speech = "v", # verb-only counts where MOR is available
tier = "mor" # MOR-tagged tokens only
)

In token mode, collapse = "stem" attempts to aggregate
inflected variants under a single stem where MOR analysis supports it
(for example “go”, “goes”, “going”, “went” under one stem).
Limitations:
For analyses where inflection is theoretically important, users should inspect the resulting stems and, if necessary, operate on separate forms explicitly.
The part_of_speech argument restricts counts to specific
POS values in MOR (for example c("n", "v")). This is only
available in token mode and is useful when counts must be limited to a
particular grammatical category (for example, verb-only counts).
Users should verify POS behavior on a subset of items before relying on it in a large-scale analysis, as in the sketch below.
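A sketch of that verification step: run a small probe set with and without the POS filter and compare the two workbooks (the probe words and output paths here are illustrative):

probe_all <- tempfile(fileext = ".xlsx")
probe_verbs <- tempfile(fileext = ".xlsx")

# Token-mode counts for two probe words, all POS values
word_counts(
  words = c("run", "walk"),
  output_file = probe_all,
  language = "eng",
  corpus = "Brown",
  age = c(24, 36),
  tier = "mor"
)

# The same slice restricted to verb readings only
word_counts(
  words = c("run", "walk"),
  output_file = probe_verbs,
  language = "eng",
  corpus = "Brown",
  age = c(24, 36),
  tier = "mor",
  part_of_speech = "v"
)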
word_counts() can attach normalized rates and
Zipf-scaled values:

- normalize = TRUE adds normalized rates (per per tokens, for example per 1,000 tokens) for each speaker role and for Total.
- zipf = TRUE adds Zipf columns (*_Zipf), computed as log10 of the estimated frequency per billion tokens.

Example:
word_counts(
word_list_file = tmp_csv,
output_file = out,
language = "eng",
corpus = "Brown",
age = c(24, 48),
role = c("CHI", "MOT"),
normalize = TRUE,
per = 1000,
zipf = TRUE
)

The Dataset_Summary sheet records the total token counts
used as denominators, both overall and by speaker role, so that
normalized and Zipf values can be interpreted and recomputed if
needed.
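For instance, a Zipf value can be recomputed from those denominators as follows (a sketch with illustrative numbers, not actual package output):

count <- 520            # raw count for one word, from Word_Frequencies
total_tokens <- 350000  # slice denominator, from Dataset_Summary

# Zipf = log10 of the estimated frequency per billion tokens
zipf <- log10((count / total_tokens) * 1e9)
zipf  # approximately 6.17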
By default, word_counts() applies CLAN/FREQ-style ignore
rules via freq_ignore_special = TRUE: it drops
xxx, www, and any item beginning
with 0, &, +, -,
or #.

These rules are applied both to the internal data and to the final
frequency table. Set freq_ignore_special = FALSE to retain
these items.
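For example, to retain unintelligible-speech markers alongside ordinary words (a sketch reusing the out path from earlier):

# "xxx" would normally be dropped by the CLAN/FREQ ignore rules
word_counts(
  words = c("xxx", "go"),
  output_file = out,
  language = "eng",
  corpus = "Brown",
  age = c(24, 36),
  freq_ignore_special = FALSE
)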
The arguments include_patterns and
exclude_patterns provide an additional CHILDES-style filter
layer on lexical items:
word_counts(
words = c("go", "get", "make", "mom"),
output_file = out,
language = "eng",
corpus = "Brown",
age = c(24, 48),
include_patterns = c("g%"), # keep only words starting with "g"
exclude_patterns = c("%ing") # but drop "-ing" forms
)

Patterns use % for “any number of characters” and
_ for “one character,” matching the FREQ/CLAN
convention.
phrase_counts()

phrase_counts() is an experimental
companion to word_counts(). It operates on utterance text
rather than word types and is aimed at formulaic sequences, frames, and
multiword expressions.
Basic Usage:
phr_out <- phrase_counts(
phrases = c("i don't know", "let's go"),
language = "eng",
corpus = "Brown",
age = c(24, 36),
role = c("CHI", "MOT"),
wildcard = FALSE,
normalize = TRUE
)
phr_out

Key Points:

- phrases are matched in the utterance string, not via MOR.
- wildcard = TRUE enables * (any characters) and ? (single character) in patterns.
- Normalized rates are computed per utterances (per_utts).
- A tibble is returned when output_file is NULL; otherwise an Excel workbook with counts, a dataset summary, and run metadata is written.

Because phrase matching is string-based, users should inspect a sample of matches and refine their patterns to avoid obvious false positives; see the sketch below.
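Such inspection is easiest when wildcard patterns start simple. The sketch below uses the wildcard syntax described above, with illustrative patterns:

# "*" spans any run of characters; "?" matches exactly one
phr_wild <- phrase_counts(
  phrases = c("i don't *", "let's go*"),
  language = "eng",
  corpus = "Brown",
  age = c(24, 36),
  role = c("CHI", "MOT"),
  wildcard = TRUE
)
phr_wild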
If a word or pattern in the input list appears zero
times in the selected slice of CHILDES,
word_counts() returns zero counts for that
row.

This is true in both type and token mode:

- the item is still retained as a row in Word_Frequencies, and
- its per-role counts and Total column are 0.

This behavior is deliberate: non-attested items stay visible in the output rather than being silently dropped.

When reporting results, it is good practice to state explicitly that non-attested words are retained with 0 counts.
Runtime depends chiefly on the query mode (type vs. token), the size of the word list, and how broadly the corpus, age, and role filters are drawn.

The following are approximate ranges from typical runs on a modern laptop with a stable connection:
Small type-mode query
Fewer than 10 words, single English corpus, moderate age range
→ seconds (often < 15 s).
Medium type-mode query
50–100 words, several corpora, broad age range
→ tens of seconds to a couple of minutes.
Token-mode query with wildcards/POS
Many patterns, collapse = "stem",
part_of_speech set, tier = "mor"
→ expect 2–5× the corresponding type-mode runtime.
Wide, multi-corpus token-mode queries
Large word lists or “all-words” mode over many corpora and ages
→ can take several minutes, especially without
caching.
To keep runtimes manageable, restrict language, corpus,
age, and role as tightly as your research
question allows.

childeswordfreq includes optional disk caching to
accelerate repeated queries during interactive work. Caching stores the
results of remote CHILDES queries so that subsequent calls with the same
arguments return immediately rather than re-downloading data.
What counts as “the same query”? Caching speeds up calls that hit the same CHILDES slice, meaning all of the following match exactly:

- language, corpus, collection
- age, sex, role, role_exclude
- the mode-related arguments (wildcard, collapse, part_of_speech, tier)

If any of these change, a new remote query is required and caching does not apply.
When this is useful: during development, the same analysis is regularly re-run as word lists, filters, and downstream code are revised.

In practice, you may run the same CHILDES query dozens of times. Without caching, each run contacts the API and can take seconds to minutes. With caching, repeated runs are effectively instantaneous.
Typical speedup:

- First run: full CHILDES download (seconds to minutes)
- Subsequent identical runs: near 0 seconds
This is a substantial benefit for workflows that involve frequent re-execution.
Usage
cwf_cache_enable() # turn caching on for interactive exploration
# ...run exploratory word_counts() or phrase_counts() calls...
cwf_cache_disable() # disable before final or published analyses

Final analyses should always be derived directly from the current CHILDES database, not from previously cached local data. Disabling caching ensures that the published output reflects up-to-date datasets.
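To see the benefit during interactive work, one can time an identical call twice; this is a sketch using base R's system.time(), with an illustrative query:

cwf_cache_enable()

# First call: full remote query against CHILDES
system.time(
  word_counts(words = "go", output_file = tempfile(fileext = ".xlsx"),
              language = "eng", corpus = "Brown", age = c(24, 36))
)

# Second, identical call: served from the local disk cache
system.time(
  word_counts(words = "go", output_file = tempfile(fileext = ".xlsx"),
              language = "eng", corpus = "Brown", age = c(24, 36))
)

cwf_cache_disable()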
Every word_counts() and phrase_counts() run
records run metadata alongside the counts: package and database versions, the filters applied, and the counting mode. These values support a methods statement such as:
Lexical/phrasal frequencies were computed using the childeswordfreq R package (version X.Y.Z) through childesr (version A.B.C), querying the CHILDES database (version V). We analyzed language(s) L and corpus/corpora C, with an age range of [Amin, Amax] months. Target speaker roles included R, with excluded roles R_excl. Counts were obtained in [type/token] mode. [If token mode used:] We specified: [wildcards, stem collapsing, POS filters, MOR tier]. We applied normalization [per N tokens/Zipf scaling].
Adjust the placeholders to match the values recorded in your own outputs.