Function translation

There are two parts to dbplyr SQL translation: translating dplyr verbs, and translating expressions within those verbs. This vignette describes how individual expressions (function calls) are translated; vignette("translation-verb") describes how entire verbs are translated.

library(dbplyr)
library(dplyr)

con <- simulate_dbi()

dbplyr::translate_sql() powers translation of individual function calls, and I’ll use it extensively in this vignette to show what’s happening. You shouldn’t need to use it ordinary code as dbplyr takes care of the translation automatically.

translate_sql((x + y) / 2, con = con)
#> <SQL> (`x` + `y`) / 2.0

translate_sql() takes an optional con parameter. If not supplied, this causes dplyr to generate (approximately) SQL-92 compliant SQL. If supplied, dplyr uses sql_translation() to look up a custom environment which makes it possible for different databases to generate slightly different SQL: see vignette("new-backend") for more details. You can use the various simulate helpers to see the translations used by different backends:

translate_sql(x ^ 2L, con = con)
#> <SQL> POWER(`x`, 2)
translate_sql(x ^ 2L, con = simulate_sqlite())
#> <SQL> POWER(`x`, 2)
translate_sql(x ^ 2L, con = simulate_access())
#> <SQL> `x` ^ 2

Perfect translation is not possible because databases don’t have all the functions that R does. The goal of dplyr is to provide a semantic rather than a literal translation: what you mean, rather than precisely what is done. In fact, even for functions that exist both in databases and R, you shouldn’t expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, mean() loops through the data twice. R’s mean() also provides a trim option for computing trimmed means; this is something that databases do not provide.

If you’re interested in how translate_sql() is implemented, the basic techniques that underlie the implementation of translate_sql() are described in “Advanced R”.

Basic differences

The following examples work through some of the basic differences between R and SQL.

Known functions

Mathematics

Modulo arithmetic

dbplyr translates %% to the SQL equivalents but note that it’s not precisely the same: most databases use truncated division where the modulo operator takes the sign of the dividend, where R using the mathematically preferred floored division with the modulo sign taking the sign of the divisor.

df <- tibble(
  x = c(10L, 10L, -10L, -10L), 
  y = c(3L, -3L, 3L, -3L)
)
mf <- tbl_memdb(df)

df %>% mutate(x %% y)
#> # A tibble: 4 × 3
#>       x     y `x%%y`
#>   <int> <int>  <int>
#> 1    10     3      1
#> 2    10    -3     -2
#> 3   -10     3      2
#> 4   -10    -3     -1
mf %>% mutate(x %% y)
#> # Source:   SQL [4 x 3]
#> # Database: sqlite 3.45.0 [:memory:]
#>       x     y `x%%y`
#>   <int> <int>  <int>
#> 1    10     3      1
#> 2    10    -3      1
#> 3   -10     3     -1
#> 4   -10    -3     -1

dbplyr no longer translates %/% because there’s no robust cross-database translation available.

Logical comparisons

Aggregation

All database provide translation for the basic aggregations: mean(), sum(), min(), max(), sd(), var(). Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. The aggregation functions warn you about this important difference:

translate_sql(mean(x), con = con)
#> <SQL> AVG(`x`) OVER ()
translate_sql(mean(x, na.rm = TRUE), con = con)
#> <SQL> AVG(`x`) OVER ()

Note that, by default, translate() assumes that the call is inside a mutate() or filter() and generates a window translation. If you want to see the equivalent summarise()/aggregation translation, use window = FALSE:

translate_sql(mean(x, na.rm = TRUE), window = FALSE, con = con)
#> <SQL> AVG(`x`)

Conditional evaluation

if and switch() are translate to CASE WHEN:

translate_sql(if (x > 5) "big" else "small", con = con)
#> <SQL> CASE WHEN (`x` > 5.0) THEN 'big' WHEN NOT (`x` > 5.0) THEN 'small' END
translate_sql(switch(x, a = 1L, b = 2L, 3L), con = con)
#> <SQL> CASE `x` WHEN ('a') THEN (1) WHEN ('b') THEN (2) ELSE (3) END

String manipulation

Date/time

Unknown functions

Any function that dplyr doesn’t know how to convert is left as is. This means that database functions that are not covered by dplyr can often be used directly via translate_sql().

Prefix functions

Any function that dbplyr doesn’t know about will be left as is:

translate_sql(foofify(x, y), con = con)
#> <SQL> foofify(`x`, `y`)

Because SQL functions are general case insensitive, I recommend using upper case when you’re using SQL functions in R code. That makes it easier to spot that you’re doing something unusual:

translate_sql(FOOFIFY(x, y), con = con)
#> <SQL> FOOFIFY(`x`, `y`)

Infix functions

As well as prefix functions (where the name of the function comes before the arguments), dbplyr also translates infix functions. That allows you to use expressions like LIKE which does a limited form of pattern matching:

translate_sql(x %LIKE% "%foo%", con = con)
#> <SQL> `x` LIKE '%foo%'

Or use || for string concatenation (although most backends will translate paste() and paste0() for you):

translate_sql(x %||% y, con = con)
#> <SQL> `x` || `y`

Special forms

SQL functions tend to have a greater variety of syntax than R. That means there are a number of expressions that can’t be translated directly from R code. To insert these in your own queries, you can use literal SQL inside sql():

translate_sql(sql("x!"), con = con)
#> <SQL> x!
translate_sql(x == sql("ANY VALUES(1, 2, 3)"), con = con)
#> <SQL> `x` = ANY VALUES(1, 2, 3)

This gives you a lot of freedom to generate the SQL you need:

mf <- memdb_frame(x = 1, y = 2)

mf %>% 
  transmute(factorial = sql("x!")) %>% 
  show_query()
#> <SQL>
#> SELECT x! AS `factorial`
#> FROM `dbplyr_zAmugrm8vk`

mf %>% 
  transmute(factorial = sql("CAST(x AS FLOAT)")) %>% 
  show_query()
#> <SQL>
#> SELECT CAST(x AS FLOAT) AS `factorial`
#> FROM `dbplyr_zAmugrm8vk`

Error for unknown translations

If needed, you can also force dbplyr to error if it doesn’t know how to translate a function with the dplyr.strict_sql option:

options(dplyr.strict_sql = TRUE)
translate_sql(glob(x, y), con = con)
#> Error in `glob()`:
#> ! Don't know how to translate `glob()`

Window functions

Things get a little trickier with window functions, because SQL’s window functions are considerably more expressive than the specific variants provided by base R or dplyr. They have the form [expression] OVER ([partition clause] [order clause] [frame_clause]):

To see how individual window functions are translated to SQL, we can again use translate_sql():

translate_sql(mean(G), con = con)
#> <SQL> AVG(`G`) OVER ()
translate_sql(rank(G), con = con)
#> <SQL> CASE
#> WHEN (NOT((`G` IS NULL))) THEN RANK() OVER (PARTITION BY (CASE WHEN ((`G` IS NULL)) THEN 1 ELSE 0 END) ORDER BY `G`)
#> END
translate_sql(ntile(G, 2), con = con)
#> <SQL> NTILE(2) OVER (ORDER BY `G`)
translate_sql(lag(G), con = con)
#> <SQL> LAG(`G`, 1, NULL) OVER ()

If the tbl has been grouped or arranged previously in the pipeline, then dplyr will use that information to set the “partition by” and “order by” clauses. For interactive exploration, you can achieve the same effect by setting the vars_group and vars_order arguments to translate_sql()

translate_sql(cummean(G), vars_order = "year", con = con)
#> <SQL> AVG(`G`) OVER (ORDER BY `year` ROWS UNBOUNDED PRECEDING)
translate_sql(rank(), vars_group = "ID", con = con)
#> <SQL> RANK() OVER (PARTITION BY `ID`)

There are some challenges when translating window functions between R and SQL, because dplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you’re using:

The three options are illustrated in the snippet below:

mutate(players,
  min_rank(yearID),
  order_by(yearID, cumsum(G)),
  lead(G, order_by = yearID)
)

Currently there is no way to order by multiple variables, except by setting the default ordering with arrange(). This will be added in a future release.