---
title: "Missing data (tagged_na)"
format: html
vignette: >
  %\VignetteIndexEntry{Missing data (tagged_na)}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Missing data (`NA` --> `tagged_na`)

Base R supports only one type of `NA` ('not available') to represent missing 
values. The CHMS, however, has several types of missing data or incomplete 
category responses. `chmsflow` uses [`tagged_na()`](https://haven.tidyverse.org/reference/tagged_na.html) 
from the `haven` package to allow multiple missing data categories. `tagged_na()` 
attaches an additional character (a tag) to the `NA` value, thereby allowing users 
to define additional missing data types. `tagged_na()` applies only to numeric 
values; character-based values can use any string to represent NA or missing data. 

## CHMS approach to coding missing data

CHMS category values can change across variables and survey 
cycles, but the most common values are:

| CHMS category value | Label |
|----------------|--------------|
| 6                  | not applicable   |
| 7                  | don't know       |
| 8                  | refusal          |
| 9                  | not stated       |


`chmsflow` recodes the four CHMS missing data category values into 
two NA values that are commonly used in most studies: `NA(a)` = 'not applicable' 
and `NA(b)` = 'missing'. Furthermore, a variable may be entirely absent from a 
specific cycle. This results in 'not asked' missing data, which is not included 
or coded in CHMS surveys and is recoded to `NA(c)`.

**Summary of `tagged_na` values and their corresponding CHMS category values.**

| chmsflow `tagged_na` | CHMS category value | Label |
|----------------|--------------|-----------------|
| NA(a) | 6                  | not applicable     |
| NA(b) | 7                  | don't know         |
| NA(b) | 8                  | refusal            |
| NA(b) | 9                  | not stated         |
| NA(c) | | question not asked in the survey cycle  |
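
The mapping in the table can be sketched in a few lines of R. This is a minimal illustration only, not the recoding that `chmsflow` performs internally: `recode_missing()` and the `raw` vector are hypothetical, and the sketch assumes a variable that uses the single-digit codes 6 to 9, which, as noted above, is not true of every CHMS variable or cycle.

```{r}
library(haven)

# Hypothetical raw responses: 1 and 2 are valid categories,
# 6 to 9 are the common CHMS missing-data codes from the table above
raw <- c(1, 2, 6, 7, 8, 9)

# Hypothetical helper (not part of chmsflow): map missing-data codes to tagged NAs
recode_missing <- function(x) {
  out <- as.numeric(x)
  out[x %in% 6] <- tagged_na("a")          # not applicable
  out[x %in% c(7, 8, 9)] <- tagged_na("b") # don't know, refusal, not stated
  out
}

recoded <- recode_missing(raw)
print_tagged_na(recoded)
```

A variable that was not asked in a cycle would instead be filled entirely with `tagged_na("c")`.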

## Example `haven::tagged_na()`
```{r}
library(haven)
x <- c(1:5, tagged_na("a"), tagged_na("b"))

# na_tag() returns the tag of each tagged NA; most other functions still treat these values as regular NA
na_tag(x)
```
```{r}
# print_tagged_na() prints the NA values together with their tags
print_tagged_na(x)
```
```{r}
# Otherwise, tagged NAs behave identically to regular NAs
x
```
```{r}
is.na(x)
```

## Using tagged_na for creating derived variables

See the [derived variables](derived_variables.html) and
[how to add variables](how_to_add_variables.html) pages for details on derived
variables and how to add them.

When creating derived variables from CHMS variables, it is important to
distinguish between the different NA values. Certain derived variables are built
from variables that may not be applicable to every respondent. For example,
[smoking pack-years](../reference/calculate_pack_years.html)
uses smoking variables that are not applicable to all CHMS
respondents (i.e., respondents who have never smoked). In this scenario,
respondents who have values of `NA(a)` for the base smoking variables
would have a value of `NA(a)` for smoking pack-years. Respondents who
identified as having smoked at some point in their lives, but have values of
`NA(b)` in the base smoking variables, would have a value of `NA(b)` for smoking
pack-years, because pack-years cannot be calculated from missing values.
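
The propagation rule described above can be sketched as follows. This is a simplified, hypothetical stand-in and not the logic of `calculate_pack_years()`: the function `derive_with_tags()`, its two inputs, and the formula are illustrative. The sketch also assumes that `NA(a)` takes precedence whenever any input is not applicable, which is a design choice of this example.

```{r}
library(haven)

# Hypothetical, simplified derivation (not chmsflow's calculate_pack_years())
derive_with_tags <- function(cigs_per_day, years_smoked) {
  # Plain arithmetic yields NA wherever either input is NA
  pack_years <- (cigs_per_day / 20) * years_smoked

  # Assumption of this sketch: any not-applicable input makes the result NA(a)
  not_applicable <- is_tagged_na(cigs_per_day, "a") | is_tagged_na(years_smoked, "a")
  # Every other NA result is treated as missing, NA(b)
  missing <- is.na(pack_years) & !not_applicable

  pack_years[not_applicable] <- tagged_na("a")
  pack_years[missing] <- tagged_na("b")
  pack_years
}

# Illustrative inputs: a smoker, a never-smoker, and two smokers with missing values
cigs  <- c(20, tagged_na("a"), tagged_na("b"), 10)
years <- c(10, tagged_na("a"), 15, tagged_na("b"))

result <- derive_with_tags(cigs, years)
print_tagged_na(result)
```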

On the other hand, certain derived variables use variables that
are applicable to all respondents. For example, [waist-to-height ratio](../reference/calculate_waist_height_ratio.html) uses CHMS
height and waist circumference variables, which are collected for all CHMS respondents. In this
scenario, all NA values would be tagged as `NA(b)`, because these variables are
applicable to everyone and respondents with NA values would be classified as
missing. 

When creating derived variables, it is important to examine the base CHMS
variables to check for the presence of `NA(a)` and `NA(b)`. This can be done
by reviewing CHMS documentation.
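
As a complement to the documentation, the tags present in an already recoded variable can be tabulated directly. The sketch below uses a made-up vector `x` and only `haven` functions; it is one possible way to inspect a variable, not a `chmsflow` feature.

```{r}
library(haven)

# Hypothetical recoded variable containing valid values, tagged NAs, and one untagged NA
x <- c(1.7, 1.8, tagged_na("a"), tagged_na("b"), tagged_na("b"), NA)

# Count the missing values by tag; untagged NAs appear under <NA>
tag_counts <- table(na_tag(x[is.na(x)]), useNA = "ifany")
tag_counts

# Count a single tag directly
n_not_applicable <- sum(is_tagged_na(x, "a"))
n_missing <- sum(is_tagged_na(x, "b"))
```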

## Next steps

- **See tagged NAs in practice** -- The [Analysis walkthrough](analysis_walkthrough.html) shows how tagged NAs appear in hypertension derivation results.
- **Understand derived variables** -- Learn how `Func::` and `DerivedVar::` entries work in [Derived variables](derived_variables.html).
- **Inspect the recoding rules** -- See how missing data codes are mapped in [Variable schema reference](variables_and_variable_details.html).
