fuzzystring provides fast, flexible fuzzy string
joins for data frames using approximate string matching. Built on top of
data.table and stringdist, it’s designed for
efficiently merging datasets where exact matches aren’t possible due to
misspellings, inconsistent formatting, or slight variations in text.
You can install the development version of fuzzystring from GitHub:
Here’s a simple example matching diamond cuts with slight misspellings:
# Your messy data
x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)

# Reference data
y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

# Fuzzy join with max distance of 2 edits
fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
#>         name    id approx_name    grp distance
#>       <char> <int>      <char> <char>    <num>
#> 1:      Idea     1       Ideal      A        1
#> 2:   Premiom     2     Premium      B        1
#> 3: Very Good     3    VeryGood      C        1

fuzzystring supports all standard join types. Below is a small, reusable example dataset so you can compare the behavior of each join family.
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)

y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)

- fuzzystring_inner_join(): Only matching rows.
- fuzzystring_left_join(): All rows from x, matching rows from y.
- fuzzystring_right_join(): All rows from y, matching rows from x.
- fuzzystring_full_join(): All rows from both tables.
- fuzzystring_semi_join(): Rows from x that have a match in y.
- fuzzystring_anti_join(): Rows from x that don’t have a match in y.

If you prefer a single entry point, you can use fuzzystring_join() directly by specifying mode.
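To make the difference between the inner, semi, and anti variants concrete, here is a conceptual sketch in base R. It is not fuzzystring's implementation and does not call the package: it uses utils::adist (generalized Levenshtein distance) in place of stringdist, and the threshold max_dist = 1 is an assumption chosen so that one row of x_join is left unmatched.

```r
# Conceptual sketch only: base R, not fuzzystring's implementation.
x_join <- data.frame(
  name = c("Idea", "Premiom", "Very Good", "Gooood"),
  id = 1:4
)
y_join <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood", "Good"),
  grp = c("A", "B", "C", "D")
)

max_dist <- 1  # illustrative threshold (assumption)

# Pairwise Levenshtein distances: rows index x_join, columns index y_join.
d <- utils::adist(x_join$name, y_join$approx_name)

# Row/column indices of every pair within the threshold.
hits <- which(d <= max_dist, arr.ind = TRUE)

# Inner join: one output row per matching pair, plus the distance.
inner <- cbind(
  x_join[hits[, "row"], , drop = FALSE],
  y_join[hits[, "col"], , drop = FALSE],
  distance = d[hits]
)
rownames(inner) <- NULL

# Semi join keeps rows of x with at least one match; anti join keeps the rest.
has_match <- rowSums(d <= max_dist) > 0
semi <- x_join[has_match, , drop = FALSE]
anti <- x_join[!has_match, , drop = FALSE]

print(inner)
print(anti)
```

With this threshold, "Gooood" is two edits away from "Good", so it appears only in the anti join result. Raising max_dist to 2 would move it into the inner and semi join results.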
You can choose from various distance metrics provided by the
stringdist package:
# Optimal String Alignment (default)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")
# Damerau-Levenshtein
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")
# Jaro-Winkler (good for names)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")
# Soundex (phonetic matching)
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")

You can match on multiple columns using different matching functions for each:
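As a conceptual illustration of per-column matching, the following base R sketch pairs rows only when every column pair passes its own test. It does not use fuzzystring's API: the data, the price columns, the helper functions name_match and price_match, and both thresholds are hypothetical, and utils::adist stands in for a stringdist metric.

```r
# Conceptual sketch only: base R, with hypothetical data and thresholds.
x_multi <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  price = c(100, 250, 400)
)
y_multi <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  approx_price = c(102, 300, 399)
)

# One matching function per column pair; each returns a logical vector.
name_match <- function(a, b) {
  # Elementwise Levenshtein distance between a[i] and b[i], at most 2 edits.
  dist <- mapply(function(s, t) utils::adist(s, t)[1, 1], a, b,
                 USE.NAMES = FALSE)
  dist <= 2
}
price_match <- function(a, b) abs(a - b) <= 5  # numeric tolerance

# Enumerate every candidate (x row, y row) pair.
pairs <- expand.grid(
  i = seq_len(nrow(x_multi)),
  j = seq_len(nrow(y_multi))
)

# Keep a pair only if all column-level tests pass.
keep <- name_match(x_multi$name[pairs$i], y_multi$approx_name[pairs$j]) &
  price_match(x_multi$price[pairs$i], y_multi$approx_price[pairs$j])

matched <- cbind(
  x_multi[pairs$i[keep], , drop = FALSE],
  y_multi[pairs$j[keep], , drop = FALSE]
)
rownames(matched) <- NULL

print(matched)
```

Here "Premiom" is close to "Premium" by name, but the pair is dropped because the prices differ by more than the tolerance. Enumerating all pairs with expand.grid is quadratic in the number of rows, so this is suitable only for small illustrative inputs.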
fuzzystring uses a C++ implementation for row
binding combined with a data.table backend for fast
performance on large datasets. It is optimized for memory efficiency and
type safety.