In any data analysis project involving multiple tables, a primary
task is to identify how these tables can be joined together. This
typically involves matching foreign keys in one table to primary keys in
another. The map_join_paths() function automates this
discovery process, making it faster and less error-prone.
This function operates in two powerful modes:
Metadata-Driven Discovery: It can find join
paths based solely on a metadata definition, matching tables where a
defined grouping_variable identically matches another
table’s identifier_columns (primary key). This is extremely
fast and useful for well-documented schemas.
Data-Driven Discovery: Optionally, by providing the actual data, the function can scan column values to find “inferred” joins where key names do not match. This is invaluable for exploring new or messy datasets.
First, let’s define metadata for a standard retail scenario with
customers, products, and
transactions tables.
# Define metadata for each table
customers_meta <- table_info("customers", "c.csv", "customer_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))
products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category")))))
transactions_meta <- table_info("transactions", "t.csv", "trans_id", list(
list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
# This grouping variable will match the primary key of 'customers'
list(AggregatedName="a", AggregationFunction="sum", GroupingVariables="customer_id"),
# This one will match the primary key of 'products'
list(AggregatedName="b", AggregationFunction="sum", GroupingVariables="product_id")
))
))
# Combine into a master metadata object
master_meta <- rbindlist(list(customers_meta, products_meta, transactions_meta))Now, we can find the join paths using only this metadata.
# Find paths without looking at the data
metadata_paths <- map_join_paths(master_meta)
print(metadata_paths)
#> table_from table_to key_from key_to
#> <char> <char> <list> <list>
#> 1: transactions customers customer_id customer_id
#> 2: transactions products product_id product_idThe function correctly identifies two METADATA paths:
transactions can be joined to customers on
customer_id, and to products on
product_id.
The function also handles composite keys. Let’s imagine a table
daily_promos whose primary key is the combination of
product_id and region.
daily_promos_meta <- table_info("daily_promos", "d.csv", c("product_id", "region"), list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))
# Add a grouping variable to transactions that matches this composite key
transactions_multi_meta <- table_info("transactions", "t.csv", "trans_id", list(
list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
list(AggregatedName="promo_rev", AggregationFunction="sum", GroupingVariables=c("product_id", "region"))
))
))
multi_key_meta <- rbindlist(list(daily_promos_meta, transactions_multi_meta))multi_key_paths <- map_join_paths(multi_key_meta)
print(multi_key_paths)
#> table_from table_to key_from key_to
#> <char> <char> <list> <list>
#> 1: transactions daily_promos product_id,region product_id,regionThe function correctly identifies the single, multi-variable join path.
What if our data is messy and key names don’t align? Consider an
inventory table where the product key is called
sku, and an orders table that refers to it as
product_code.
# Define the data
inventory_data <- data.table(sku = c("s1", "s2", "s3"), stock = c(10, 20, 5))
orders_data <- data.table(order_id = 1:2, customer_ref = "c1", product_code = c("s1", "s2"))
data_list <- list(
inventory = inventory_data,
orders = orders_data
)
# Define the metadata. Note the mismatched names.
inventory_meta <- table_info("inventory", "i.csv", "sku", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="stock")))))
orders_meta <- table_info("orders", "o.csv", "order_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="product_code")))))
inferred_meta <- rbindlist(list(inventory_meta, orders_meta))Running map_join_paths with only metadata would fail to
find a path here. But by providing the data_list, we enable
the data-driven search.
inferred_paths <- map_join_paths(inferred_meta, data_list = data_list)
print(inferred_paths)
#> table_from table_to key_from key_to
#> <char> <char> <list> <list>
#> 1: orders inventory product_code skuSuccess! The function scanned the values and found that
orders$product_code can be joined to
inventory$sku. It correctly marks this path as
INFERRED.
The true power of the function is when it combines both methods, automatically de-duplicating and prioritizing explicit metadata joins over inferred ones. This provides a complete and reliable map of all possible connections in your data ecosystem.