Type: Package
Title: Word and Phrase Frequency Tools for CHILDES
Version: 0.2.0
Description: Tools for extracting word and phrase frequencies from the Child Language Data Exchange System (CHILDES) database via the 'childesr' API. Supports type-level word counts, token-mode searches with simple wildcard patterns and part-of-speech filters, optional stemming, and Zipf-scaled frequencies. Provides normalization per number of tokens or utterances, speaker-role breakdowns, dataset summaries, and export to Excel workbooks for reproducible child language research. The CHILDES database is maintained at https://talkbank.org/childes/.
License: MIT + file LICENSE
URL: https://github.com/n-albudoor/childeswordfreq
BugReports: https://github.com/n-albudoor/childeswordfreq/issues
Depends: R (>= 4.4.0)
Imports: cachem, childesr, dplyr, memoise, rappdirs, readr, rlang, stats, tibble, tidyr, utils, writexl
Suggests: testthat (>= 3.0.0), knitr, rmarkdown
VignetteBuilder: knitr
Encoding: UTF-8
RoxygenNote: 7.3.3
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2025-11-15 21:13:24 UTC; albudoor.1
Author: Nahar Albudoor [aut, cre]
Maintainer: Nahar Albudoor <n.albudoor@gmail.com>
Repository: CRAN
Date/Publication: 2025-11-15 22:40:09 UTC

childeswordfreq: Word and Phrase Frequency Tools for CHILDES

Description

The childeswordfreq package provides a simple, reproducible workflow for extracting word and phrase frequencies from the CHILDES database using the childesr API.

Details

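The main user-facing functions documented here are:

word_counts(): word frequencies by speaker role, written to an Excel workbook.

phrase_counts(): phrase matches in utterance text (experimental).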

Optional on-disk caching can be enabled via cwf_cache_enable() to speed up repeated queries, and disabled with cwf_cache_disable(). The current cache status can be checked with cwf_cache_enabled().

All queries are performed live against CHILDES through childesr; no local copy of the corpora is required.
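The caching helpers described above can be exercised together as in the following minimal sketch. It assumes childeswordfreq is installed; the cache location under tempdir() is a hypothetical choice for illustration, and the printed status values are not shown because they depend on the installed version.

```r
# Hypothetical cache location for this demo (not a package default).
cache_dir <- file.path(tempdir(), "cwf-cache")

# Guard so the sketch does nothing when the package is not installed.
if (requireNamespace("childeswordfreq", quietly = TRUE)) {
  # Turn on on-disk caching in the chosen directory.
  childeswordfreq::cwf_cache_enable(cache_dir = cache_dir)
  print(childeswordfreq::cwf_cache_enabled())  # status after enabling

  # Turn caching off again and re-check the status.
  childeswordfreq::cwf_cache_disable()
  print(childeswordfreq::cwf_cache_enabled())  # status after disabling
}
```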

Author(s)

Maintainer: Nahar Albudoor <n.albudoor@gmail.com>

See Also

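Useful links:

https://github.com/n-albudoor/childeswordfreq

Report bugs at https://github.com/n-albudoor/childeswordfreq/issues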


Disable caching

Description

Disables on-disk caching of CHILDES queries previously enabled with cwf_cache_enable().

Usage

cwf_cache_disable()

Enable on-disk caching of CHILDES queries

Description

Enable on-disk caching of CHILDES queries

Usage

cwf_cache_enable(cache_dir = NULL)

Arguments

cache_dir

Directory for cached results. If NULL (the default), the user cache directory is used.


Return TRUE if caching is enabled

Description

Returns TRUE if on-disk caching of CHILDES queries is currently enabled.

Usage

cwf_cache_enabled()

Count phrase matches in CHILDES utterances (experimental)

Description

Matches surface phrases in utterance text and returns counts, along with a dataset summary and run metadata. When wildcard = TRUE, phrases may contain simple wildcards: * (any characters) and ? (one character). Normalized rates are expressed per number of utterances.

Usage

phrase_counts(
  phrases,
  collection = NULL,
  language = NULL,
  corpus = NULL,
  age = NULL,
  sex = NULL,
  role = NULL,
  role_exclude = NULL,
  wildcard = FALSE,
  ignore_case = TRUE,
  normalize = FALSE,
  per_utts = 10000L,
  db_version = "current",
  cache = FALSE,
  cache_dir = NULL,
  output_file = NULL
)

Arguments

phrases

Character vector of phrases or patterns.

collection, language, corpus, age, sex, role, role_exclude

CHILDES filters.

wildcard

Logical; enable * and ? in phrases.

ignore_case

Logical; case-insensitive matching.

normalize

Logical; if TRUE, add per-N utterance rates.

per_utts

Integer; denominator for utterance rates (default 10000).

db_version

CHILDES database version label to record in metadata.

cache

Logical; cache CHILDES queries on disk.

cache_dir

Optional cache directory.

output_file

Optional path to an .xlsx file; if NULL, the result is returned as a tibble instead of being written to disk.

Details

Tier targeting is not applied in phrase mode. Phrases are matched in the main utterance text. For tier-constrained contexts around words, use contexts_for(..., mode = "word", tier = "mor").

Value

If output_file is NULL, returns a tibble of phrase counts; otherwise writes an Excel file and returns the file path (invisibly).
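To make the wildcard and normalization arguments concrete, here is a small base R sketch. It is not the package's implementation: the helper glob_to_regex(), the toy utterances, and the choice to count matching utterances (rather than individual matches) are all assumptions made for illustration.

```r
# Hypothetical helper: translate a simple glob to a regular expression,
# with * standing for any characters and ? for exactly one character.
glob_to_regex <- function(glob) {
  # Escape regex metacharacters other than * and ?.
  escaped <- gsub("([.\\\\|()\\[\\]{}^$+])", "\\\\\\1", glob, perl = TRUE)
  escaped <- gsub("*", ".*", escaped, fixed = TRUE)
  gsub("?", ".", escaped, fixed = TRUE)
}

# Toy utterances standing in for CHILDES main-tier text.
utterances <- c(
  "I want more cookies",
  "do you want a cookie",
  "more juice please",
  "want cookie"
)

pattern <- glob_to_regex("want * cookie")

# Case-insensitive matching, mirroring ignore_case = TRUE.
hits <- grepl(pattern, utterances, ignore.case = TRUE)
n_match <- sum(hits)

# Rate per 10,000 utterances, mirroring normalize = TRUE, per_utts = 10000L.
per_utts <- 10000L
rate <- n_match / length(utterances) * per_utts

print(data.frame(phrase = "want * cookie", count = n_match, rate = rate))
```

In this sketch the last utterance does not match, because the pattern requires a space on both sides of the wildcard.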


Get word counts by speaker role

Description

Reads target words from a CSV file with a word column or from an in-memory character vector, counts them by speaker role, and writes an Excel file with Word_Frequencies, Dataset_Summary, File_Speaker_Summary, and Run_Metadata. If no word list is provided, all types in the selected slice are counted (FREQ-style “all words” mode).

Usage

word_counts(
  word_list_file = NULL,
  output_file,
  words = NULL,
  collection = NULL,
  language = NULL,
  corpus = NULL,
  age = NULL,
  sex = NULL,
  role = NULL,
  role_exclude = NULL,
  wildcard = FALSE,
  collapse = c("none", "stem"),
  part_of_speech = NULL,
  tier = c("main", "mor"),
  normalize = FALSE,
  per = 1000L,
  zipf = FALSE,
  include_patterns = NULL,
  exclude_patterns = NULL,
  sort_by = c("word", "frequency"),
  min_count = 0L,
  freq_ignore_special = TRUE,
  db_version = "current",
  cache = FALSE,
  cache_dir = NULL,
  ...
)

Arguments

word_list_file

Optional path to a CSV file with a column named word. If NULL and words is also NULL, all types in the slice are counted.

output_file

Path to the output .xlsx file.

words

Optional character vector of target words/patterns. Ignored if word_list_file is provided. If both are NULL, all types are counted.

collection

Optional CHILDES filter.

language

Optional CHILDES filter.

corpus

Optional CHILDES filter.

age

Optional numeric: single value or c(min, max) in months.

sex

Optional: "male" and/or "female".

role

Optional character vector of roles to include.

role_exclude

Optional character vector of roles to exclude.

wildcard

Logical; treat "%" as any number of characters and "_" as one character (token mode).

collapse

Either "none" or "stem". Using "stem" triggers token mode.

part_of_speech

Optional POS filter, e.g., c("n","v") (token mode).

tier

Which tier to count from: "main" or "mor".

normalize

Logical; if TRUE, add per-N rate columns.

per

Integer denominator for rates (for example 1000 for per-1k).

zipf

Logical; if TRUE, also add Zipf columns (log10 per-billion).

include_patterns

Optional character vector of CHILDES-style patterns, using "%" and "_" to restrict output to matching words (FREQ-style +s).

exclude_patterns

Optional character vector of CHILDES-style patterns to drop from the output.

sort_by

Final sort order: "word" (alphabetical) or "frequency" (descending Total).

min_count

Integer; drop rows with Total < min_count (after counting).

freq_ignore_special

Logical; if TRUE, drop "xxx", "www", and any word starting with 0, &, +, -, or # (FREQ default ignore rules).

db_version

CHILDES database version label to record in metadata.

cache

Logical; if TRUE, cache CHILDES queries on disk.

cache_dir

Optional cache directory when cache = TRUE.

...

Reserved for future extensions; currently unused.
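The normalize, per, and zipf arguments describe simple arithmetic on raw counts, which can be sketched in base R as follows. The counts and corpus size are made-up numbers, and the Zipf line assumes a plain log10 of the per-billion rate with no smoothing; the package may compute its columns differently.

```r
# Made-up raw counts and a hypothetical total token count for the slice.
counts <- c(the = 1500L, go = 120L, cookie = 30L)
total_tokens <- 60000L

# Per-N rate, mirroring normalize = TRUE with per = 1000L.
per <- 1000L
rate_per_n <- counts / total_tokens * per

# Zipf-style value: log10 of the frequency per billion tokens
# (assumption: no smoothing is applied).
zipf <- log10(counts / total_tokens * 1e9)

print(data.frame(
  word  = names(counts),
  total = unname(counts),
  rate  = unname(rate_per_n),
  zipf  = round(unname(zipf), 2)
))
```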

Details

Uses exact type counts by default and switches to token mode when wildcards, stems, or POS filters are requested. Counts can optionally be taken from the MOR tier by setting tier = "mor".
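The CHILDES-style wildcards used by wildcard, include_patterns, and exclude_patterns ("%" for any number of characters, "_" for one character) can be illustrated with a base R sketch. The helper childes_to_regex() is hypothetical, and anchoring each pattern to the whole word is an assumption rather than documented package behavior.

```r
# Hypothetical helper: translate a CHILDES-style pattern to an anchored regex.
childes_to_regex <- function(pattern) {
  # Escape regex metacharacters so they are matched literally.
  escaped <- gsub("([.\\\\|()\\[\\]{}^$+*?])", "\\\\\\1", pattern, perl = TRUE)
  escaped <- gsub("%", ".*", escaped, fixed = TRUE)  # % = any number of characters
  escaped <- gsub("_", ".", escaped, fixed = TRUE)   # _ = exactly one character
  paste0("^", escaped, "$")
}

words <- c("go", "goes", "going", "gone", "dog", "cat", "cot")

# Include words matching "go%", then drop those matching "gon_".
included <- grepl(childes_to_regex("go%"), words)
excluded <- grepl(childes_to_regex("gon_"), words)
kept <- words[included & !excluded]

# A single-character wildcard in the middle of a word.
one_char <- words[grepl(childes_to_regex("c_t"), words)]

print(kept)
print(one_char)
```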

Value

Invisibly returns output_file after writing the workbook.

Examples

## Not run: 
# Minimal example (not run during R CMD check)
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(word = c("the","go")), tmp_csv, row.names = FALSE)

out_file <- tempfile(fileext = ".xlsx")
word_counts(
  word_list_file = tmp_csv,
  output_file    = out_file,
  language       = "eng",
  corpus         = "Brown",
  age            = c(24, 26)
)

# All-words mode (no word list; counts every type in the slice)
out_all <- tempfile(fileext = ".xlsx")
word_counts(
  word_list_file = NULL,
  words          = NULL,
  output_file    = out_all,
  language       = "eng",
  corpus         = "Brown",
  age            = c(24, 26)
)

## End(Not run)
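The freq_ignore_special rule described in the arguments can be approximated offline with a short base R filter. This is a sketch of the documented rule only (drop "xxx", "www", and words starting with 0, &, +, -, or #); the helper is_special() and the token vector are hypothetical.

```r
# Hypothetical helper implementing the documented ignore rule.
is_special <- function(word) {
  word %in% c("xxx", "www") | grepl("^[0&+#-]", word)
}

# Toy tokens mixing ordinary words with special forms.
tokens <- c("the", "xxx", "&um", "0is", "cookie", "+...", "www", "#pause", "-ing", "go")

kept <- tokens[!is_special(tokens)]
print(kept)
```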