Creating Survey Design Objects

Why Survey Weights Matter

NHANES uses a complex, multistage probability sampling design to select participants who represent the non-institutionalized U.S. population. Without proper survey weights, analyses will produce biased estimates. The create_design() function automates the calculation of appropriate weights when combining multiple NHANES cycles, following CDC weighting guidelines.

Understanding NHANES Weights

NHANES provides three categories of sampling weights, each reflecting a different level of participation:

  1. Interview weights (wtint2yr, wtint4yr): Used when all variables come from the household interview (demographics, questionnaires).
  2. Mobile Exam Center (MEC) weights (wtmec2yr, wtmec4yr): Used when any variable requires a physical exam (laboratory tests, body measurements, DEXA scans).
  3. Fasting weights (wtsaf2yr): Used when any variable requires fasting laboratory tests (glucose, insulin, lipids).

Each step from interview to MEC to fasting is a smaller subsample of the one before it, so fewer participants carry a valid weight at each step. When an analysis combines variables across categories, always use the weight for the most restrictive (smallest) subsample involved. For example, if your analysis includes both demographics (interview) and body measurements (MEC), use MEC weights.
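The selection rule above can be sketched as a small helper. This is an illustration only: pick_wt_type() is a hypothetical function, not part of nhanesdata, and it assumes the three wt_type values used later in this vignette ("interview", "mec", "fasting").

```r
# Hypothetical helper (not part of nhanesdata): choose the wt_type for an
# analysis from the components its variables come from.
pick_wt_type <- function(components) {
  # Ordered from least to most restrictive subsample
  ordering <- c("interview", "mec", "fasting")
  unknown <- setdiff(components, ordering)
  if (length(unknown) > 0) {
    stop("Unknown component(s): ", paste(unknown, collapse = ", "))
  }
  # The most restrictive component present determines the weight
  ordering[max(match(components, ordering))]
}

pick_wt_type(c("interview", "mec"))             # demographics + body measures
pick_wt_type(c("interview", "mec", "fasting"))  # adds a fasting lab value
```

The result can be passed as the wt_type argument of create_design().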

Weight Calculation Logic

CDC recommendations for combining cycles are based on the number of cycles present in your data, not the timespan covered. This distinction matters when you have gaps in your data.

Early Cycles (1999-2002)

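NHANES provides 4-year weights (wtint4yr, wtmec4yr) for the 1999-2000 and 2001-2002 cycles; all subsequent cycles provide only 2-year weights. When combining n cycles that include both early cycles, CDC guidance is to multiply the 4-year weight by 2/n for participants from 1999-2002 and the 2-year weight by 1/n for participants from later cycles.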

Example Calculation

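Combining 4 cycles (1999, 2001, 2003, 2005) with MEC weights: participants from the 1999 and 2001 cycles get wtmec4yr × 2/4, and participants from the 2003 and 2005 cycles get wtmec2yr × 1/4.

If you excluded the 2003 cycle, you would have 3 cycles total, so the multipliers become 2/3 for wtmec4yr (1999 and 2001) and 1/3 for wtmec2yr (2005).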

The key principle: n is the number of cycles present, not the timespan.
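The arithmetic can be checked on toy numbers. The sketch below is not the internals of create_design(); combine_mec_weights() is a hypothetical helper that applies the CDC rule (4-year weight × 2/n for the two early cycles, 2-year weight × 1/n otherwise). It assumes both early cycles are present and that a year column holds each cycle's start year.

```r
# Toy data: one row per participant; `year` is assumed to be the cycle start year.
toy <- data.frame(
  year     = c(1999, 2001, 2003, 2005),
  wtmec4yr = c(8000, 6000, NA, NA),        # only the early cycles have 4-year weights
  wtmec2yr = c(16000, 12000, 9000, 10000)
)

# Hypothetical helper illustrating the CDC rule.
combine_mec_weights <- function(df) {
  n <- length(unique(df$year))             # cycles present, not timespan
  early <- df$year %in% c(1999, 2001)
  ifelse(early, df$wtmec4yr * 2 / n, df$wtmec2yr * 1 / n)
}

w_four  <- combine_mec_weights(toy)                      # n = 4: 2/4 and 1/4
w_three <- combine_mec_weights(toy[toy$year != 2003, ])  # n = 3: 2/3 and 1/3
```

Dropping the 2003 rows changes n from 4 to 3 even though the data still span 1999 through 2005, which is why the multipliers change.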

Basic Usage

library(nhanesdata)
library(dplyr)
library(srvyr)

Example 1: Interview Weights

When analyzing demographics and questionnaire data only:

# Load demographics data
demo <- read_nhanes("demo")

# Create design with interview weights
design_int <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2011,
  wt_type = "interview"
)

# Calculate weighted means
design_int |>
  summarize(
    mean_age = survey_mean(ridageyr, na.rm = TRUE),
    pct_female = survey_mean(riagendr == 2, na.rm = TRUE)
  )

Example 2: MEC Weights

When including any examination or laboratory data:

# Load demographics and body measures
demo <- read_nhanes("demo")
bmx <- read_nhanes("bmx")

combined <- demo |>
  left_join(bmx, by = c("seqn", "year"))

# Use MEC weights because body measures require exam participation
design_mec <- create_design(
  dsn = combined,
  start_yr = 2007,
  end_yr = 2017,
  wt_type = "mec"
)

# Weighted BMI analysis
design_mec |>
  filter(!is.na(bmxbmi)) |>
  summarize(
    mean_bmi = survey_mean(bmxbmi, na.rm = TRUE),
    pct_obese = survey_mean(bmxbmi >= 30, na.rm = TRUE)
  )

Example 3: Fasting Weights

When including fasting laboratory measurements:

# Load demographics and fasting lab data
demo <- read_nhanes("demo")
glu <- read_nhanes("glu")

combined <- demo |>
  left_join(glu, by = c("seqn", "year"))

# Use fasting weights for glucose analysis
design_fast <- create_design(
  dsn = combined,
  start_yr = 2005,
  end_yr = 2015,
  wt_type = "fasting"
)

# Analyze fasting glucose
design_fast |>
  filter(!is.na(lbxglu)) |>
  summarize(
    mean_glucose = survey_mean(lbxglu, na.rm = TRUE)
  )

Handling Edge Cases

Non-Sequential Cycles

You can specify a wide year range even if some cycles are missing from your data. The function calculates weights based only on cycles actually present:

# Data might be missing 2007-2010 cycles
# Weights calculated on cycles present, not timespan
design <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2017,
  wt_type = "interview"
)
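
Before relying on this behavior, it can help to see which cycles your data actually contain. The following is a minimal sketch on a toy data frame; it assumes the year column (the join key used in the examples above) holds each cycle's start year.

```r
# Toy stand-in for a combined dataset with a gap (no 2007 or 2009 cycles)
demo_toy <- data.frame(
  seqn = 1:6,
  year = c(1999, 2001, 2003, 2005, 2011, 2013)
)

cycles_present <- sort(unique(demo_toy$year))
n_cycles <- length(cycles_present)   # cycles present in the data

# Number of 2-year cycles the range would hold if none were missing
n_in_span <- (max(cycles_present) - min(cycles_present)) / 2 + 1

# Start years of the cycles missing from the range
missing_cycles <- setdiff(
  seq(min(cycles_present), max(cycles_present), by = 2),
  cycles_present
)
```

Here n_cycles is smaller than n_in_span, and it is n_cycles that matters for the weight adjustment.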

Participants Without Valid Weights

When creating a survey design, some participants may lack the weight variable needed for your analysis. This happens naturally in NHANES because not everyone completes every component.

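How create_design() handles this: participants without a valid weight for the requested wt_type are dropped, and a message reports how many were removed.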

Example message you might see:

Filtered out 150 participants without valid mec weights.
These participants were not in the subsample for this weight category.
Learn more:
  + CDC weighting guidance:
    https://wwwn.cdc.gov/nchs/nhanes/tutorials/Weighting.aspx
  + Survey design vignette: vignette('survey-design', package = 'nhanesdata')

Zero weights are different from missing weights:
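
The two cases can be counted separately before building a design. This sketch uses a toy data frame and base R only; it shows how to tell a missing weight (NA) from a weight recorded as 0, and makes no claim about how create_design() treats zeros.

```r
# Toy data: NA means no weight was recorded, 0 means a weight of zero was recorded
toy <- data.frame(
  seqn     = 1:6,
  wtmec2yr = c(25000, NA, 0, 18000, NA, 31000)
)

n_missing  <- sum(is.na(toy$wtmec2yr))
n_zero     <- sum(!is.na(toy$wtmec2yr) & toy$wtmec2yr == 0)
n_positive <- sum(!is.na(toy$wtmec2yr) & toy$wtmec2yr > 0)
```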

Variance Estimation and Lonely PSUs

NHANES uses a stratified, multistage sampling design with Primary Sampling Units (PSUs) nested within strata. Variance estimation requires at least 2 PSUs per stratum. When subsetting data (e.g., filtering to diabetes patients only), you may create strata with only one PSU.

The create_design() function sets options(survey.lonely.psu = "adjust"), which handles this conservatively by centering single-PSU strata at the sample grand mean rather than the stratum mean. This approach:

For more details on lonely PSU handling, see Thomas Lumley’s {survey} package documentation.
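
To see the option in isolation, here is a minimal sketch that uses the {survey} package directly on toy data rather than create_design(). The column names and values are made up for illustration; stratum 2 deliberately contains a single PSU.

```r
library(survey)

# Toy data: stratum 1 has two PSUs, stratum 2 has only one (a "lonely" PSU)
toy <- data.frame(
  strata = c(1, 1, 1, 1, 2, 2),
  psu    = c(1, 1, 2, 2, 1, 1),
  wt     = c(10, 12, 9, 11, 10, 10),
  y      = c(5.1, 4.8, 6.0, 5.5, 7.2, 6.9)
)

# Set the option manually here and remember the previous setting
old <- options(survey.lonely.psu = "adjust")

des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wt,
                 data = toy, nest = TRUE)
est <- svymean(~y, des)   # variance uses the grand-mean centering for stratum 2

options(old)              # restore the previous setting
```

With "adjust" in effect, the single-PSU stratum still contributes to the variance estimate, so the standard error in est is a finite number.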

Required Variables

The function validates that your dataset contains:

These variables are automatically included in datasets loaded via read_nhanes().
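
If you build a dataset by hand rather than with read_nhanes(), a quick check along these lines can catch missing columns early. The sketch is hypothetical and is not the package's validation code: it assumes the standard NHANES design variables in lowercase (sdmvpsu, sdmvstra) plus one weight column, which may differ from what create_design() actually checks.

```r
# Hypothetical check; the default `required` names are assumptions.
missing_design_vars <- function(df,
                                required = c("sdmvpsu", "sdmvstra", "wtmec2yr")) {
  setdiff(required, names(df))
}

# Toy data frame that lacks the weight column
toy <- data.frame(seqn = 1:3, sdmvpsu = c(1, 2, 1), sdmvstra = c(5, 5, 6))
missing_design_vars(toy)
```

An empty result means every column in required is present.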

Workflow Recommendations

  1. Load and combine datasets using read_nhanes() and {dplyr} joins
  2. Preprocess variables (recode, create derived variables, apply exclusions)
  3. Create the design object with create_design()
  4. Perform weighted analyses using {srvyr} or {survey} functions

Preprocessing before design creation is strongly recommended. Once the design object is created, filtering and recoding become more complex due to the survey structure.
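
As an illustration of step 2, derived variables can be created with ordinary {dplyr} verbs while the data are still a plain data frame. The toy values and the age bands below are made up; ridageyr and bmxbmi are the variable names used in the examples above.

```r
library(dplyr)

toy <- data.frame(
  seqn     = 1:5,
  ridageyr = c(12, 34, 51, 67, 80),
  bmxbmi   = c(19.5, 31.2, 27.8, NA, 24.0)
)

prepped <- toy |>
  mutate(
    obese   = bmxbmi >= 30,   # a missing BMI stays missing
    age_grp = cut(ridageyr,
                  breaks = c(0, 19, 39, 59, Inf),
                  labels = c("0-19", "20-39", "40-59", "60+"))
  )
```

A data frame prepared this way is what would then be passed as dsn to create_design().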

Additional Resources