Source intelligence and AI tools

Beyond manually configuring filters, cohortBuilder can inspect a source and describe or build filters for you. These features are also the foundation for integrating a cohort with a Large Language Model (LLM), so an assistant can explore the data and apply filters on the user’s behalf.

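This article covers four building blocks: describing a source with describe(), generating filters automatically with autofilter(), inspecting a source with shape(), and connecting a cohort to an LLM.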

Describing a source

describe() builds a small description object (its text plus any extra fields) that you attach to a source via the description argument of set_source().

The description is a nested list keyed by dataset name. Within each dataset, the special key dataset_ describes the dataset itself, and any other key describes a variable of that dataset:

iris_source <- set_source(
  tblist(iris = iris),
  description = list(
    iris = list(
      dataset_ = describe("Edgar Anderson's measurements of iris flowers."),
      Species = describe("Iris species.", domain = c("setosa", "versicolor", "virginica"))
    )
  )
)
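Because the description is keyed by dataset and variable name, it can help to confirm that every key matches something in the data before passing it to set_source(). The following is a minimal base R sketch of such a check. The helper unknown_description_keys() is hypothetical (it is not part of cohortBuilder), and plain strings stand in for describe() objects because only the key names matter here.

```r
# Hypothetical helper (not part of cohortBuilder): list description keys that
# match neither a dataset name, the special `dataset_` key, nor a column name.
unknown_description_keys <- function(datasets, description) {
  unknown <- character(0)
  for (dataset_name in names(description)) {
    if (!dataset_name %in% names(datasets)) {
      # The top-level key does not name any dataset
      unknown <- c(unknown, dataset_name)
      next
    }
    allowed <- c("dataset_", names(datasets[[dataset_name]]))
    bad <- setdiff(names(description[[dataset_name]]), allowed)
    if (length(bad) > 0) {
      unknown <- c(unknown, paste(dataset_name, bad, sep = "$"))
    }
  }
  unknown
}

# Plain strings stand in for describe() objects in this sketch.
description <- list(
  iris = list(
    dataset_ = "Edgar Anderson's measurements of iris flowers.",
    Species = "Iris species.",
    species = "Lower-case typo: not a column of iris."
  )
)

unknown_keys <- unknown_description_keys(list(iris = iris), description)
print(unknown_keys)
```

An empty result means every key refers to a dataset, to dataset_, or to an existing column.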

Extra named arguments to describe() (such as domain above) are stored alongside the text and can be picked up by other features - for example autofilter() uses a supplied domain instead of scanning the data.

describe() also accepts a label - a short, human-readable name for the field. When the field describes a variable, autofilter() reuses the label as the generated filter’s name (the underlying variable is unchanged), which is handy for giving filters friendlier names in a GUI:

labelled_source <- set_source(
  tblist(iris = iris),
  description = list(
    iris = list(
      Species = describe("the species of iris", label = "Iris species")
    )
  )
) |>
  autofilter(attach_as = "meta")

species_filter <- purrr::detect(
  labelled_source$available_filters, ~ .x@id == "iris-Species"
)
species_filter@name
#> [1] "Iris species"

Generating filters automatically

autofilter() analyses each column of the source and creates a filter suited to its type (using filter rules such as rule_character, rule_factor, rule_numeric, rule_Date, rule_POSIXct). The mapping is roughly:

| Column type | Filter type |
| --- | --- |
| character / factor | discrete (or discrete_text when all values are unique) |
| numeric / integer | range |
| Date | date_range |
| POSIXct | datetime_range |
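The mapping above can be approximated in a few lines of base R. This is only an illustrative sketch: guess_filter_type() is a hypothetical function, not the package's rule_* implementation, and the way it treats missing values when testing for uniqueness is an assumption.

```r
# Hypothetical sketch of the column-type to filter-type mapping.
# Not cohortBuilder's actual rule_* logic.
guess_filter_type <- function(x) {
  if (inherits(x, "POSIXct")) return("datetime_range")
  if (inherits(x, "Date")) return("date_range")
  if (is.numeric(x)) return("range") # covers integer and double columns
  if (is.character(x) || is.factor(x)) {
    values <- x[!is.na(x)] # assumption: NAs are ignored for the uniqueness test
    if (length(values) > 0 && anyDuplicated(values) == 0) {
      return("discrete_text")
    }
    return("discrete")
  }
  NA_character_ # no filter type guessed for other column classes
}

iris_types <- vapply(iris, guess_filter_type, character(1))
print(iris_types)
```

For iris this yields range for the four measurement columns and discrete for Species, which matches the filter types reported by sum_up() in this article.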

The attach_as argument controls where the generated filters go.

With attach_as = "step" (the default) the filters are added as a filtering step, so the cohort is immediately filterable:

iris_cohort <- set_source(tblist(iris = iris)) |>
  autofilter(attach_as = "step") |>
  cohort()

sum_up(iris_cohort)
#> >> Step ID: 1 [pending]
#> -> Filter ID: iris-SepalLength
#>    Filter Type: range
#>    Filter Parameters:
#>      active: TRUE
#>      description: 
#>      domain: 4.3, 7.9
#>      dataset: iris
#>      variable: Sepal.Length
#>      range: NA
#>      keep_na: TRUE
#> -> Filter ID: iris-SepalWidth
#>    Filter Type: range
#>    Filter Parameters:
#>      active: TRUE
#>      description: 
#>      domain: 2, 4.4
#>      dataset: iris
#>      variable: Sepal.Width
#>      range: NA
#>      keep_na: TRUE
#> -> Filter ID: iris-PetalLength
#>    Filter Type: range
#>    Filter Parameters:
#>      active: TRUE
#>      description: 
#>      domain: 1, 6.9
#>      dataset: iris
#>      variable: Petal.Length
#>      range: NA
#>      keep_na: TRUE
#> -> Filter ID: iris-PetalWidth
#>    Filter Type: range
#>    Filter Parameters:
#>      active: TRUE
#>      description: 
#>      domain: 0.1, 2.5
#>      dataset: iris
#>      variable: Petal.Width
#>      range: NA
#>      keep_na: TRUE
#> -> Filter ID: iris-Species
#>    Filter Type: discrete
#>    Filter Parameters:
#>      active: TRUE
#>      description: 
#>      domain: setosa, versicolor, virginica
#>      dataset: iris
#>      variable: Species
#>      value: NA
#>      keep_na: TRUE

With attach_as = "meta" the filters are stored in source$available_filters rather than applied. This is the “menu” of filters a GUI or an LLM can choose from, without forcing them onto the data:

meta_source <- iris_source |>
  autofilter(attach_as = "meta")

length(meta_source$available_filters)
#> [1] 5

When a domain was provided via describe(), the generated filter inherits it instead of scanning the data:

species_filter <- purrr::detect(
  meta_source$available_filters, ~ .x@id == "iris-Species"
)
species_filter@domain
#> [1] "setosa"     "versicolor" "virginica"

Inspecting a source with shape()

shape(source) returns a structured list(datasets, filters) describing the source - ideal for programmatic inspection or passing to an LLM:

result <- shape(meta_source)

# Dataset descriptions
result$datasets
#> $iris
#> [1] "Edgar Anderson's measurements of iris flowers."

# One filter entry
str(result$filters$`iris-Species`)
#> List of 6
#>  $ name       : chr "Species"
#>  $ dataset    : chr "iris"
#>  $ type       : chr "discrete"
#>  $ description: chr "Iris species."
#>  $ variables  :List of 1
#>   ..$ :List of 2
#>   .. ..$ name       : chr "Species"
#>   .. ..$ description: chr "Iris species."
#>  $ domain     : chr [1:3] "setosa" "versicolor" "virginica"

When a filter’s own @domain is unset, shape() falls back to the domain stored in the source’s metadata statistics, so the domain field is populated whenever possible.
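For programmatic inspection, the nested result can be flattened into a table with one row per filter. The sketch below uses base R only. The shape_like list is a hand-written stand-in for shape(meta_source), trimmed to the fields it uses, with domain values copied from the output shown earlier. summarise_filters() is a hypothetical helper, not a cohortBuilder function.

```r
# Stand-in for shape(meta_source), trimmed to the fields used below.
# Domain values are copied from the output shown earlier in this article.
shape_like <- list(
  datasets = list(iris = "Edgar Anderson's measurements of iris flowers."),
  filters = list(
    `iris-SepalLength` = list(
      dataset = "iris", type = "range", domain = c(4.3, 7.9)
    ),
    `iris-Species` = list(
      dataset = "iris", type = "discrete",
      domain = c("setosa", "versicolor", "virginica")
    )
  )
)

# Hypothetical helper: one row per filter, with the domain collapsed to text.
summarise_filters <- function(shape_result) {
  filters <- shape_result$filters
  data.frame(
    id = names(filters),
    dataset = vapply(filters, function(f) f$dataset, character(1)),
    type = vapply(filters, function(f) f$type, character(1)),
    domain = vapply(
      filters, function(f) paste(f$domain, collapse = ", "), character(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

filter_table <- summarise_filters(shape_like)
print(filter_table)
```

A flat table like this is convenient for display in a GUI or for quick checks, while the nested list remains the better input for JSON serialisation.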

Note. Called with a field (and optional subfield), shape(source, field, subfield) instead performs a description-text lookup - this is the form used internally by Cohort$show_help().

Connecting a cohort to an LLM

The functions in R/ai_tools.R wrap cohort operations as tools an ellmer chat can call. Each tool is a cb_tool object (a function plus a name, description, and argument schema).

The built-in tool factories each take a cohort and return a cb_tool:

| Tool factory | Purpose |
| --- | --- |
| cb_tool_filters_meta() | Return available-filter metadata (via shape()) as JSON |
| cb_tool_describe_state() | Describe current steps, filters, and pending state |
| cb_tool_get_data_summary() | Report row counts per dataset and step |
| cb_tool_get_code() | Return reproducible filtering code |
| cb_tool_add_filters() | Add filters (no values) to a new or existing step |
| cb_tool_set_filter_values() | Set values on existing filters |
| cb_tool_apply_filters() | Add filters and set their values in one call |
| cb_tool_toggle_filters() | Activate / deactivate filters |
| cb_tool_clear_filters() | Reset filters to their defaults |
| cb_tool_remove_filters() | Remove filters from a step |
| cb_tool_remove_step() | Remove the last step |
| cb_tool_run() | Run the pipeline (when auto-run is disabled) |

A cb_tool prints its name, description, and arguments:

coh <- cohort(meta_source)
tool <- cb_tool_filters_meta(coh)
print(tool)
#> cohortBuilder tool: cb_get_filters_meta 
#> Description: Returns metadata about the datasets and available filters in JSON format. The JSON has two top-level keys: 'datasets' - an object keyed by dataset name, mapping each dataset to its description (or null). 'filters' - an object keyed by filter id, where each value describes one filter with fields: 'name' (the filter's display name), 'dataset' (the dataset the filter belongs to), 'type' (the filter type, e.g. 'discrete', 'range', 'date_range'), 'description' (a human-readable summary combining the filter's purpose and its variables, or null), 'variables' (an array of objects, each with 'name' and 'description', for the columns the filter covers), 'domain' (the set of valid values the filter accepts: an array of allowed values for discrete-type filters, or a two-element [min, max] array for range-type filters). Use the filter id keys when referring to filters in other tools.

For LLM-driven filtering to work, the source must expose a menu of filters via autofilter(attach_as = "meta") so the assistant knows what it can apply.

To register tools with an ellmer chat, use cb_register_tool() for a single tool or cb_register_tools() to register all of them at once:

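library(ellmer)

# Build the menu of available filters the assistant can choose from
source <- set_source(tblist(iris = iris)) |>
  autofilter(attach_as = "meta")
coh <- cohort(source)

# chat_openai() reads the API key from the OPENAI_API_KEY environment variable
chat <- chat_openai()
# Register every built-in cohortBuilder tool with the chat
chat |> cb_register_tools(coh)

# The assistant can now call the registered tools to fulfil the request
chat$chat("Filter the data to setosa flowers with sepal length over 5")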

By default the cohort runs automatically after each tool modifies it. Set options(cb_tool_run_cohort = FALSE) to require an explicit call to the cb_run tool (created by cb_tool_run()) instead.

To trace which tools the LLM invokes (and with which arguments), set options(cb_tool_verbose = TRUE). Each call then emits an informative message() such as [cohortBuilder AI tool] cb_apply_filters (filters = ...; action = new_step). Logging is off by default, so tools stay silent during normal use.
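Both settings are ordinary R options, so they can be switched on for one session and restored afterwards. The sketch below uses only base R and the two option names mentioned above. The placement of the chat call is indicated by a comment rather than executed.

```r
# options() returns the previous values, which makes restoring them easy.
old_options <- options(
  cb_tool_run_cohort = FALSE, # tools no longer run the cohort automatically
  cb_tool_verbose = TRUE      # each tool call emits a message()
)

# ... register the tools and chat with the assistant here ...

# Values in effect while the chat runs
during <- list(
  run_cohort = getOption("cb_tool_run_cohort"),
  verbose = getOption("cb_tool_verbose")
)

# Restore whatever was set before (unset options go back to NULL)
options(old_options)
```

Inside a function, the same effect can be achieved by calling on.exit(options(old_options)) straight after setting the options, so they are restored even if the chat call fails.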

Note. The AI tools require the suggested ellmer package.