---
title: "Subsetting concepts"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Subsetting concepts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

`subsetVocabularyTables()` lets you reduce a CDM vocabulary to a smaller set of concept IDs while keeping the vocabulary tables internally consistent.

This is useful when you want to:

- build a smaller mock CDM for package tests
- focus on one clinical concept or code set
- drop unused vocabulary rows after building a mock dataset

```{r}
library(omock)
library(dplyr)
```

## Start with a mock CDM

We first create a mock CDM that contains the vocabulary tables, then count the rows in `concept` as a baseline for the subsets below.

```{r}
cdm <- mockCdmReference() |>
  mockVocabularyTables()

cdm$concept |>
  tally()
```

## Keep a target concept set

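Now we subset the vocabulary to two concept IDs, `8507` and `8532`, which in the OMOP standard vocabularies are the gender concepts for male and female.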

```{r}
cdm_subset <- cdm |>
  subsetVocabularyTables(conceptSet = c(8507L, 8532L))

cdm_subset$concept |>
  select(concept_id, concept_name, domain_id, vocabulary_id)
```

By default, `subsetVocabularyTables()` also keeps:

- concepts directly related to `conceptSet`
- concepts in the `Unit`, `Visit`, and `Gender` domains

These defaults help keep the mock CDM usable after subsetting, because other tables typically reference unit, visit, and gender concepts.

## Exclude directly related concepts

If you want to keep only the requested concept IDs plus the configured kept domains, set `includeRelated = FALSE`.

```{r}
cdm_strict <- cdm |>
  subsetVocabularyTables(
    conceptSet = c(8507L, 8532L),
    includeRelated = FALSE
  )

cdm_strict$concept |>
  count(domain_id)
```

## Control which domains are always retained

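You can override the domains that are always retained with `keepDomains`. Passing `character(0)` retains no domain unconditionally, so combined with `includeRelated = FALSE` only the requested concept IDs remain.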

```{r}
cdm_no_defaults <- cdm |>
  subsetVocabularyTables(
    conceptSet = c(8507L, 8532L),
    includeRelated = FALSE,
    keepDomains = character(0)
  )

cdm_no_defaults$concept |>
  select(concept_id, concept_name, domain_id)
```

This is useful when you want the smallest possible vocabulary subset.
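
As a rough mental model of how `conceptSet`, `includeRelated`, and `keepDomains` combine, the retained IDs can be thought of as the union of three groups. The sketch below uses plain dplyr on toy tables. `toy_keep_ids()` and all `toy_*` objects are hypothetical, and treating "directly related" as sharing a row in a `concept_relationship`-like table is an assumption, not a description of how omock implements the function.

```{r}
library(dplyr)

# Toy vocabulary with made-up IDs, for illustration only.
toy_concept <- tibble(
  concept_id = c(1L, 2L, 3L, 4L, 5L),
  domain_id = c("Condition", "Condition", "Gender", "Unit", "Drug")
)

# Toy relationship table: concepts 1 and 2 are linked in both directions.
toy_relationship <- tibble(
  concept_id_1 = c(1L, 2L),
  concept_id_2 = c(2L, 1L)
)

# Hypothetical helper (not part of omock): returns the concept IDs to retain.
toy_keep_ids <- function(concept, relationship, concept_set,
                         include_related = TRUE,
                         keep_domains = c("Unit", "Visit", "Gender")) {
  ids <- concept_set
  if (include_related) {
    # Assumption: "related" means appearing in a relationship row
    # together with a requested concept.
    related <- relationship |>
      filter(concept_id_1 %in% concept_set | concept_id_2 %in% concept_set)
    ids <- unique(c(ids, related$concept_id_1, related$concept_id_2))
  }
  # Concepts in the kept domains are retained regardless of concept_set.
  domain_ids <- concept |>
    filter(domain_id %in% keep_domains) |>
    pull(concept_id)
  sort(unique(c(ids, domain_ids)))
}

# Defaults: requested + related + kept domains.
toy_keep_ids(toy_concept, toy_relationship, concept_set = 1L)

# Without related concepts.
toy_keep_ids(toy_concept, toy_relationship, concept_set = 1L,
             include_related = FALSE)

# Without related concepts and without kept domains.
toy_keep_ids(toy_concept, toy_relationship, concept_set = 1L,
             include_related = FALSE, keep_domains = character(0))
```

In this toy setup, concept `5` is never retained because it is neither requested, related to a requested concept, nor in a kept domain.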

## Apply subsetting after building a CDM

The function can also be applied after a CDM has been populated with clinical tables. In that case, rows in other OMOP tables that reference removed concepts are removed as well.

```{r}
cdm_clinical <- mockVocabularyTables() |>
  mockPerson(nPerson = 10, seed = 1) |>
  mockObservationPeriod(seed = 1) |>
  mockConditionOccurrence(seed = 1)

cdm_clinical_small <- cdm_clinical |>
  subsetVocabularyTables(conceptSet = c(8507L, 8532L))

cdm_clinical_small$concept |>
  tally()
```

The resulting CDM stays consistent because clinical rows are dropped together with the concepts they reference. For example, if a `condition_occurrence` row has `condition_concept_id = 123` and concept `123` is no longer present in the subsetted `concept` table, that row is removed as well. The following minimal example makes this visible.

```{r}
cdm_example <- mockVocabularyTables(
  concept = dplyr::tibble(
    concept_id = c(1L, 2L, 3L),
    concept_name = c("condition a", "condition b", "gender"),
    domain_id = c("Condition", "Condition", "Gender"),
    vocabulary_id = c("SNOMED", "SNOMED", "Gender"),
    standard_concept = "S",
    concept_class_id = c("Clinical Finding", "Clinical Finding", "Gender"),
    concept_code = "1",
    valid_start_date = as.Date(NA),
    valid_end_date = as.Date(NA),
    invalid_reason = NA_character_
  )
) |>
  mockCdmFromTables(tables = list(
    person = dplyr::tibble(
      person_id = c(1L, 2L),
      gender_concept_id = c(3L, 3L),
      year_of_birth = c(1990L, 1991L)
    ),
    condition_occurrence = dplyr::tibble(
      condition_occurrence_id = c(1L, 2L),
      person_id = c(1L, 2L),
      condition_concept_id = c(1L, 2L),
      condition_start_date = as.Date(c("2020-01-01", "2020-01-02")),
      condition_end_date = as.Date(c("2020-01-01", "2020-01-02")),
      condition_type_concept_id = c(0L, 0L)
    )
  ))

cdm_example_small <- cdm_example |>
  subsetVocabularyTables(
    conceptSet = 1L,
    includeRelated = FALSE,
    keepDomains = "Gender"
  )

cdm_example_small$concept |>
  select(concept_id, domain_id)

cdm_example_small$condition_occurrence |>
  select(person_id, condition_concept_id)
```

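In this example, the row using `condition_concept_id = 2` is removed because concept `2` is no longer present after subsetting. Concept `3` is retained through `keepDomains = "Gender"`, so it stays available for the `person` rows that reference it.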
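
The effect on clinical tables resembles a filtering join against the remaining `concept` rows. The following is a minimal dplyr sketch of that idea on toy tibbles. All `toy_*` names are hypothetical, and it is not omock's implementation, which may cover more columns and special cases.

```{r}
library(dplyr)

# Concepts assumed to remain after subsetting (mirrors the example above).
toy_vocab <- tibble(concept_id = c(1L, 3L))

# Toy clinical table with one row per condition record.
toy_conditions <- tibble(
  condition_occurrence_id = c(1L, 2L),
  person_id = c(1L, 2L),
  condition_concept_id = c(1L, 2L)
)

# Rows whose concept is still in the vocabulary are kept.
toy_conditions_kept <- toy_conditions |>
  semi_join(toy_vocab, by = c("condition_concept_id" = "concept_id"))

# Rows whose concept was removed; useful as a preview of what would be dropped.
toy_conditions_dropped <- toy_conditions |>
  anti_join(toy_vocab, by = c("condition_concept_id" = "concept_id"))

toy_conditions_kept
toy_conditions_dropped
```

Running an `anti_join()` like this before subsetting a real mock CDM is one way to check how many clinical rows a given concept set would cost.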