vtree is a flexible tool for generating variable trees — diagrams that display information about nested subsets of a data frame. Given simple specifications, vtree produces these diagrams and automatically labels them with counts, percentages, and other summaries. With vtree, you can:
explore a data set interactively, and
produce customized figures for reports and publications.
Subsets play an important role in almost any data analysis. Imagine a data set of countries, with variables named population, continent, and landlocked. We might wish to examine subsets of the data set based on the continent variable. Within each of these subsets, we might wish to examine nested subsets based on the population variable, for example, countries with populations under 30 million and over 30 million. We might continue to a third level of nesting based on the landlocked variable. vtree provides a general solution to the problem of calculating nested subsets and displaying information about them. Nested subsets help us to answer questions like the following: Among African countries with a population over 30 million, what percentage are landlocked?
The variable tree below answers this question:
Even in simple situations like this, it can be a chore to keep track of nested subsets and to calculate percentages. But it’s often even more tedious—and there are two reasons why. First, the presence of missing values makes it harder to determine denominators. Second, as the number of variables increases, the number of nested subsets grows rapidly. In spite of these difficulties, people often calculate nested subsets by hand (along with percentages and other summaries). Not only is this tiresome work, it is extremely error prone.
Nested subsets arise in all kinds of situations. Consider, for example, flow diagrams for clinical studies, such as the following rudimentary CONSORT diagram, which is also a variable tree.
Because manual calculation and transcription are error-prone, mistakes in published flow diagrams are all too common. And although the errors that make it to publication are often small, they can sometimes be disastrous.
Note that at the end of this vignette, there is a collection of examples using R datasets that you can try.
The examples that follow use a data set called FakeData which represents 46 fictitious patients. The variable tree below depicts subsets defined by Sex (M or F) nested within subsets defined by disease Severity (Mild, Moderate, Severe, or NA). Although this example (like many subsequent ones) uses just two variables, variable trees are especially useful with three or more variables.
A variable tree consists of nodes connected by arrows. At the top of the diagram above, the root node of the tree contains all 46 patients. The rest of the nodes are arranged in successive levels, where each level corresponds to a variable. Note that this highlights one difference between variable trees and some other kinds of trees: at each level of a variable tree, regardless of the branch, the nodes represent values of the same variable. (Decision trees, in contrast, can have splits on different variables at the same level.)
Continuing with the variable tree above, the nodes immediately below the root represent values of Severity and are referred to as the children of the root node. In this case, Severity was missing (NA) for 6 patients, and there is a node for these patients. Inside each of the nodes, the number of patients is displayed and—except for in the missing value node—the corresponding percentage is also shown. Note that, by default, vtree displays “valid” percentages, i.e. the denominator used to calculate the percentage is the total number of non-missing values, 40.
The final level of the tree corresponds to values of Sex. These nodes represent males and females within subsets defined by each value of Severity. In each of these nodes the percentage is calculated in terms of the number of patients in its parent node.
Like any node, a missing-value node can have children. For example, of the 6 patients for whom Severity is missing, 3 are female and 3 are male. By default, vtree displays the full missing-value structure of the specified variables.
Also by default, vtree automatically assigns a color palette to each variable. Severity has been assigned red hues (lightest for Mild, darkest for Severe), while Sex has been assigned blue hues (light blue for females, dark blue for males). The node representing missing values of Severity is colored white to draw attention to it.
A tree with two variables is similar to a two-way contingency table. In the example above, Sex is shown within levels of Severity. This corresponds to the following contingency table, where the percentages within each column add to 100%. These are called column percentages.
| | Mild | Moderate | Severe | NA |
|---|---|---|---|---|
| F | 11 (58%) | 11 (69%) | 2 (40%) | 3 (50%) | 
| M | 8 (42%) | 5 (31%) | 3 (60%) | 3 (50%) | 
Likewise, a tree with Severity shown within levels of Sex corresponds to a contingency table with row percentages.
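The column percentages in the table above can be checked with base R. This is only a sketch: it assumes the vtree package is installed (it supplies FakeData), and the object names tab and col_pct are illustrative.

```r
library(vtree)  # provides the FakeData data frame

# Cross-tabulate Sex (rows) by Severity (columns), keeping missing values
tab <- table(FakeData$Sex, FakeData$Severity, useNA = "ifany")

# Column percentages (unrounded): each column sums to 100
col_pct <- 100 * prop.table(tab, margin = 2)

print(tab)
print(col_pct)
```

Using margin = 1 instead would give row percentages, matching a tree with Severity shown within levels of Sex.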
The contingency table above is more compact than the corresponding variable tree, but some people may find the variable tree easier to interpret. When three or more variables are of interest, multi-way contingency tables are often used. These are typically displayed using several two-way tables. In this situation, variable trees are generally easier to interpret.
It is also worth noting that contingency tables are not always more compact than variable trees. When most cells of a large contingency table are empty (in which case the table is said to be sparse), the corresponding variable tree may be more compact since empty nodes are not shown.
vtree is designed to be quick and easy to use, so that it is convenient for data exploration, but also flexible enough that it can be used to prepare publication-ready figures. To generate a basic variable tree, it is only necessary to provide vtree with a data frame and some variable names. However, extra features make vtree much more useful. vtree provides:
control over labeling, colors, legends, line wrapping, text formatting and other customization features;
flexible pruning to remove parts of the tree that are of lesser interest, which is particularly useful when a tree gets large;
display of information about other variables in each node, including a variety of summary statistics;
special displays for indicator variables, patterns of values, and missingness;
support for checkbox variables from REDCap databases;
features for dichotomizing variables and checking for outliers; and
automatic generation of PNG image files and embedding in R Markdown documents.
In many cases, you may wish to generate several different variable trees to investigate a collection of variables in a data frame. For example, it is often useful to change the order of variables, prune parts of the tree, etc.
vtree is built on open-source software: in particular Richard Iannone’s DiagrammeR package, which provides an interface to the Graphviz software using the htmlwidgets framework. A formal description of variable trees follows.
The root node of the variable tree represents the entire data frame. The root node has a child for each observed value of the first variable that was specified. Each of these child nodes represents a subset of the data frame with a specific value of the variable, and is labeled with the number of observations in the subset and the corresponding percentage of the number of observations in the entire data frame. The nth level below the root of the variable tree corresponds to the nth variable specified. Apart from the root node, each node in the variable tree represents the subset of its parent defined by a specific observed value of the variable at that level of the tree, and is labeled with the number of observations in that subset and the corresponding percentage of the number of observations in its parent node.
Note that a node always represents at least one observation. And unlike a contingency table, which can have empty cells, a variable tree has no empty nodes.
vtree function

Consider a data frame named df, which includes discrete variables v1 and v2. In this case, a variable tree can be displayed using the following command:
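A call of this form might look as follows. The toy data frame is hypothetical and exists only so that the sketch runs on its own; df, v1 and v2 are the placeholder names used in the text.

```r
library(vtree)

# Hypothetical toy data standing in for df
df <- data.frame(
  v1 = c("a", "a", "b", "b", "b"),
  v2 = c("x", "y", "x", "x", "y")
)

# Variable names are given in a single space-separated string
tree <- vtree(df, "v1 v2")
# print(tree) displays the diagram
```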
For additional details about how variables can be specified, see the section on specification of variables below. Note that if vtree is called without a list of variables, it uses all of the variables in the data frame in the order in which they appear.
Numerous additional parameters can be supplied. For example, by default vtree produces a horizontal tree (that is, a tree that grows from left to right). To generate a vertical tree, specify horiz=FALSE.
To display a variable tree for a single variable, say Severity, use the following command:
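A minimal sketch of such a call, assuming the vtree package is installed so that FakeData is available:

```r
library(vtree)

# One-variable tree: the root node plus one child per value of Severity
tree <- vtree(FakeData, "Severity")
# print(tree) displays the diagram
```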
Next, consider a vertical variable tree with two variables, Severity and Sex. A less colorful display with more spacing can be requested by specifying plain=TRUE:
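Combining the parameters mentioned so far, the call might be written like this (a sketch, assuming vtree is installed):

```r
library(vtree)

# Sex nested within Severity, drawn top to bottom with plain styling
tree <- vtree(FakeData, "Severity Sex", horiz = FALSE, plain = TRUE)
# print(tree) displays the diagram
```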
By default, “valid percentages” are shown, i.e. the denominator is the total number of non-missing values. In the case of Severity, there are 6 missing values, so the denominator is 46 - 6 , or 40. There are 19 Mild cases, and 19/40 = 0.475 so the percentage shown is 48%. No percentage is shown in the NA node since missing values are not included in the denominator.
If you prefer that the denominator represent the complete set of observations (including any missing values), specify vp=FALSE. With this setting, a percentage will be shown in each of the nodes, including any NA nodes.
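The two denominator conventions can be reproduced in base R from the counts quoted above. This sketch does not call vtree; the vector severity simply rebuilds the reported counts, and the object names are illustrative.

```r
# Severity counts as reported in the text: 19 Mild, 16 Moderate, 5 Severe, 6 missing
severity <- c(rep("Mild", 19), rep("Moderate", 16), rep("Severe", 5), rep(NA, 6))

n_total <- length(severity)        # all observations
n_valid <- sum(!is.na(severity))   # non-missing observations only

# Valid percentages: missing values are excluded from the denominator
valid_pct <- 100 * table(severity) / n_valid

# Overall percentages: the NA category gets its own percentage
overall_pct <- 100 * table(severity, useNA = "always") / n_total

valid_pct[["Mild"]]   # 47.5, which the tree displays as 48%
```

The second calculation corresponds to the convention that vtree uses when vp=FALSE is specified.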
If you don’t wish to see percentages, specify showpct=FALSE, and if you don’t wish to see counts, specify showcount=FALSE.
To display a legend, specify showlegend=TRUE. Next to each level of the tree, the variable name is displayed together with color discs and the values they correspond to. For each of the values, overall (marginal) counts are shown, together with percentages.
When the legend is shown, the node labels become redundant, since the colors identify the values of the variables (although the labels may aid readability). If you prefer, you can hide the node labels, by specifying shownodelabels=FALSE:
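For instance, a legend without node labels could be requested as in this sketch (assuming vtree is installed; the choice of variables is illustrative):

```r
library(vtree)

# Show the legend and rely on it, rather than node labels, to identify values
tree <- vtree(FakeData, "Severity Sex",
  showlegend = TRUE, shownodelabels = FALSE)
# print(tree) displays the diagram
```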
The legend shows how colors are assigned to the different values of each variable, and additionally provides marginal (that is, overall) counts and percentages for each variable. Since Severity is the first variable in the tree (i.e., it is not nested within another variable), the marginal counts and percentages for Severity are identical to those displayed in the nodes. In contrast, for Sex, the marginal counts and percentages are different from what is shown in the nodes because the nodes for Sex are nested within levels of Severity.
(Unfortunately the NA circle in the legend is oddly sized and positioned due to an issue with the corresponding unicode symbols.)
When a variable tree is large, it can be difficult to display it in a readable way. One approach that helps is to display the tree horizontally and also to put the node labels on the same line as the counts and percentage by specifying sameline=TRUE. For example, the following results in nodes with single-line labels such as Moderate, 16 (40%), etc.:
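A sketch of such a call, assuming vtree is installed (the tree is horizontal by default, so only sameline needs to be set):

```r
library(vtree)

# Put each node's label, count and percentage on a single line
tree <- vtree(FakeData, "Severity Sex", sameline = TRUE)
# print(tree) displays the diagram
```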
By default, next to each level of the tree, vtree shows the variable name. These can be removed by specifying showvarnames=FALSE.
By default, vtree wraps text onto the next line whenever a space occurs after at least 20 characters. This can be adjusted, for example, to 15 characters, by specifying splitwidth=15. To disable line splitting, specify splitwidth=Inf. Text wrapping in the legend is controlled independently. To set the splitting in the legend to 8 characters, specify lsplitwidth=8. Also note that in the legend, text wrapping can take place not only at spaces, but also at any of the following characters: . - + _ = /
This concludes the mini-tutorial. vtree has many more features, described in the following sections.
When a variable tree gets too big, or you are only interested in certain parts of the tree, it may be useful to remove some nodes along with their descendants. This is known as pruning. For convenience, vtree provides several different ways to prune a tree, described below.
prune parameter

Suppose you don’t want the tree to include individuals whose disease is Mild or Moderate. Specifying prune=list(Severity=c("Mild","Moderate")) removes those nodes, and all of their descendants:
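Embedded in a full call, the specification might look like this sketch (assuming vtree is installed; the two-variable tree is an illustrative choice):

```r
library(vtree)

# Remove the Mild and Moderate nodes together with all of their descendants
tree <- vtree(FakeData, "Severity Sex",
  prune = list(Severity = c("Mild", "Moderate")))
# print(tree) displays the diagram
```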
In general, the argument of the prune parameter is a list with an element named for each variable you wish to prune. In the example above, the list has a single element, named Severity. In turn, that element is a vector c("Mild","Moderate") indicating the values of Severity to prune.
Caution: Once a variable tree has been pruned, it is no longer complete. This can sometimes be confusing since not all observations are represented at certain levels of the tree. It is particularly important to avoid pruning missing value nodes, since this makes it hard to interpret “valid” percentages (i.e. percentages calculated using the number of non-missing observations as denominator).
keep parameter

Sometimes it is more convenient to specify which nodes should be retained rather than which ones should be discarded. The keep parameter is used for this purpose, and can thus be considered the complement of the prune parameter.
For example, to retain only the Moderate Severity node:
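A sketch of the corresponding call, assuming vtree is installed (the second variable, Sex, is an illustrative choice):

```r
library(vtree)

# Retain only the Moderate node of Severity (and its descendants)
tree <- vtree(FakeData, "Severity Sex",
  keep = list(Severity = "Moderate"))
# print(tree) displays the diagram
```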
keep parameter

It is important to note how the keep parameter functions when missing values are present. Consider a variable tree for the Severity variable, shown on the left with so-called “valid” percentages. These are percentages calculated using the number of non-missing observations as denominator (which is the default, specified by vp=TRUE). On the right is the same tree, but with percentages calculated using the total number of observations as the denominator.
 
Suppose we use keep to retain only the Moderate node.
vtree(FakeData,"Severity",keep=list(Severity="Moderate"))
vtree(FakeData,"Severity",vp=FALSE,keep=list(Severity="Moderate")) 
Note that in the tree on the left (which uses valid percentages), the NA node is retained. This is done so that the percentage of Moderate cases can be interpreted. (There are 16 Moderate cases, and a total of 40 non-missing cases, which is 40%.)
On the right, the NA node has been removed, because the denominator (46) doesn’t depend on the number of missing values.
prunebelow parameter

A disadvantage of the prune parameter is that in the resulting tree, the counts shown in child nodes may not add up to the counts shown in the parent node. For example in the variable tree above, of a total of 46 patients, 5 have Severe disease and Severity is unknown for 6. One might wonder what happened to the other 35 patients.
A solution to this problem is to retain the specified nodes, but to prune below them (i.e. to prune their descendants). In the present example, this means that the Mild and Moderate nodes will be shown, but not their descendants. The prunebelow parameter is used to do this, and its argument has the same form as for the prune parameter.
follow parameter

The complement of the prunebelow parameter is the follow parameter. Instead of specifying which nodes should be pruned below, this allows you to specify which nodes should be “followed” (that is, not pruned below).
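The two parameters can be compared in a short sketch, assuming vtree is installed. The specific values chosen here are illustrative, and the two calls are not claimed to produce identical trees.

```r
library(vtree)

# Keep the Mild and Moderate nodes, but drop everything beneath them
below <- vtree(FakeData, "Severity Sex",
  prunebelow = list(Severity = c("Mild", "Moderate")))

# State instead which node should still have its descendants shown
followed <- vtree(FakeData, "Severity Sex",
  follow = list(Severity = "Severe"))
# print(below) or print(followed) displays the corresponding diagram
```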
prunesmaller parameter

As a variable tree grows it can become difficult to see the forest for the trees. For example, consider the following variable tree:
One solution is to prune nodes that contain small numbers of observations. For example if you want to only see nodes with at least 3 observations, you can specify prunesmaller=3, as in this example:
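A sketch of such a call, assuming vtree is installed. The three variables used here are an illustrative choice and may differ from the tree originally shown.

```r
library(vtree)

# Only show nodes that contain at least 3 observations
tree <- vtree(FakeData, "Severity Sex Category", prunesmaller = 3)
# print(tree) displays the diagram
```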
As with the keep parameter, when vp=TRUE (which is the default, and means that valid percentages are shown), nodes representing missing values will not be pruned. (As noted in the section on the keep parameter, this is because percentages are confusing when missing values are not shown.) When vp=FALSE, missing nodes will be pruned (if they are small enough).
By default, vtree labels variables and nodes exactly as they are in the data frame. But it is often useful to change these labels.
labelvar parameter

Suppose Severity in fact represents initial severity. To label it that way in the variable tree, specify labelvar=c(Severity="Initial severity").

labelnode parameter

By default, vtree labels nodes (except for the root node) using the values of the variable in question. (If the variable is a factor, the levels of the factor are used). Sometimes it is convenient to instead specify custom labels for nodes. You can use the labelnode argument to relabel the values. For example, you might want to use “Male” and “Female” instead of “M” and “F”.
The labelnode argument is specified as a list whose element names are variable names. To substitute New label for Old label, the syntax is: "New label"="Old label". Thus the full specification is: labelnode=list(Sex=c(Male="M",Female="F")).

tlabelnode parameter

Suppose in the example above that Group A represents children and Group B represents adults. In Group A, we would like to use the labels “girl” and “boy”, while in Group B we would like to use “woman” and “man”. The labelnode parameter cannot handle this situation because the values of Sex need to be labeled differently in different branches of the tree. The tlabelnode parameter allows “targeted” node labels.
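Both relabeling parameters can be combined in one call, as in this sketch (assuming vtree is installed; the specifications are the ones quoted in the text):

```r
library(vtree)

tree <- vtree(FakeData, "Severity Sex",
  # Relabel the variable itself
  labelvar = c(Severity = "Initial severity"),
  # Relabel values: new label on the left, old value on the right
  labelnode = list(Sex = c(Male = "M", Female = "F")))
# print(tree) displays the diagram
```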
vtree(FakeData,"Group Sex",horiz=FALSE,
  labelnode=list(Group=c(Child="A",Adult="B")),
  tlabelnode=list(
    c(Group="A",Sex="F",label="girl"),
    c(Group="A",Sex="M",label="boy"),
    c(Group="B",Sex="F",label="woman"),
    c(Group="B",Sex="M",label="man")))

Graphviz, the open source graph visualization software that vtree is built on, supports a variety of text formatting (including boldface, colors, etc.). This is used in vtree to control formatting of text such as node labels.
By default, the vtree package uses markdown-style codes for text formatting.
\n means insert a line break
\n*l means make the preceding line left-justified and insert a line break
*...* means display text in italics
**...** means display text in bold
^...^ means display text in superscript (using 10 point font)
~...~ means display text in subscript (using 10 point font)
%%red ...%% means display text in red (or whichever color is specified)

As an alternative, if you specify HTMLtext=TRUE you can use “HTML-like labels” (implemented in Graphviz), including:
<BR/> means insert a line break
<BR ALIGN='LEFT'/> means make the preceding line left-justified and insert a line break
<I> ... </I> means display text in italics
<B> ... </B> means display text in bold
<SUP> ... </SUP> means display text in superscript, but note that the font size does not change
<SUB> ... </SUB> means display text in subscript, but again note that the font size does not change
<FONT POINT-SIZE='10'> ... </FONT> means set font to 10 point
<FONT FACE='Times-Roman'> ... </FONT> means set font to Times-Roman
<FONT COLOR='red'> ... </FONT> means set font to red

See https://www.graphviz.org/doc/info/shapes.html#html for more details.
text parameter

Suppose you wish to add the italicized text “Excluding new diagnoses” to any Mild nodes in the tree. The parameter text lets you add text to nodes. It is specified as a list with an element named for each variable. In the example below the list has one element, named Severity. That element in turn is a vector c(Mild="\n*Excluding\nnew diagnoses*") indicating that the Mild node should include additional text using Markdown-style formatting (i.e. there is a linebreak and the asterisks around the text indicate that it should be displayed in italics):
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
  text=list(Severity=c(Mild="\n*Excluding\nnew diagnoses*")))

ttext parameter

In the example above, suppose that new diagnoses are only excluded from Mild cases in Group B. But the text parameter is used to add text to all Mild nodes. Thus, in situations like this, the text parameter is not sufficient. Instead, you can use the ttext parameter to target exactly which nodes should have the specified text.
The ttext parameter requires that you specify the full path from the root of the tree to the node in question, along with the text in question. The ttext parameter is specified as a list so that multiple targeted text strings can be specified at once. For example:
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
  ttext=list(
    c(Group="B",Severity="Mild",text="\n*Excluding\nnew diagnoses*"),
    c(Group="A",text="\nSweden"),
    c(Group="B",text="\nNorway")))

For convenience, vtree allows you to specify variable names (separated by whitespace) in a single character string. If, however, any of the variable names have internal spaces, the variable names must be specified as a vector of character strings.
Additionally, there are several modifiers that can be used to change the way variables are represented in a tree.
is.na:

If an individual variable name is preceded by is.na:, that variable will be replaced by a missing value indicator in the variable tree. (This differs from the check.is.na parameter, described later, which is used to replace all of the specified variables with missing value indicators.)

stem: and rc:

In datasets exported from REDCap, checkboxes are represented using multiple variables. The stem: prefix makes it easier to work with them. This is described in the section on REDCap checkboxes later in this vignette.

tri:

The tri: prefix is useful for identifying values of a numeric variable that are extreme compared to the other values in a node. Note: Unlike other variable specifications, which take effect at the level of the entire data frame, the tri: prefix takes effect within each node.
The effect of this variable specification is to trichotomize the values of a numeric variable, i.e. to divide them into three groups:
“mid”: values within plus or minus 1.5×IQR of the median,
“high”: values more than 1.5×IQR above the median,
“low”: values more than 1.5×IQR below the median.
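The three-way rule above can be written out in base R. This is an illustrative sketch of the stated rule, not vtree's own implementation; trichotomize() is a hypothetical helper and the scores vector is made-up data.

```r
# Classify values as "low", "mid" or "high" relative to median +/- 1.5 * IQR.
# trichotomize() is a hypothetical helper, not part of vtree.
trichotomize <- function(x) {
  med <- median(x, na.rm = TRUE)
  spread <- 1.5 * IQR(x, na.rm = TRUE)
  out <- rep("mid", length(x))
  out[which(x > med + spread)] <- "high"
  out[which(x < med - spread)] <- "low"
  out[is.na(x)] <- NA   # missing values stay missing
  out
}

# Made-up values with one obvious extreme
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
groups <- trichotomize(scores)
table(groups)
```

Because the tri: prefix takes effect within each node, a rule like this would be applied to the values in each node separately rather than once for the whole data frame.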
*

Specifying Ind* matches all variable names that start with Ind. In FakeData these are Ind1, Ind2, and Ind3.
The * suffix matches all variable names
#

Specifying Ind# matches all variable names that start with Ind and end with a numeric digit, namely Ind1, Ind2, and Ind3. (In this particular case, this is the same result as using Ind*).
variable=value

When a variable takes on a large number of different values, the resulting variable tree will be very large. One solution is to prune the tree, for example by keeping just the node corresponding to one value of a particular variable. An alternative is to specify the value of the variable that is of primary interest and vtree will dichotomize the variable at that value. For example if Severity=Mild is specified, the Severity variable will be dichotomized between Mild and Not Mild.
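A sketch of this specification inside a call, assuming vtree is installed (adding Sex as a second variable is an illustrative choice):

```r
library(vtree)

# Severity is dichotomized at Mild; Sex is nested within the resulting groups
tree <- vtree(FakeData, "Severity=Mild Sex")
# print(tree) displays the diagram
```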
variable<value, variable>value

These two specifications are used to dichotomize a numeric variable, splitting above and below a specified value. This can be useful for identifying subsets with extreme values.
It is often useful to display information about other variables (apart from those that define the tree) in the nodes of a variable tree. This is particularly useful for numeric variables, which generally cannot be used to build the tree since they have too many distinct values. For example, we might wish to display the mean age for individuals in each node. Or we might wish to list the ID numbers of the individuals in each node. The summary argument can be used to flexibly specify additional information to display.
The argument of the summary parameter is a character string with the following structure:
It starts with the name of the variable for which a summary is desired. (We’ll see later that variable specifications and expressions can also be used, as long as they do not contain any spaces.)
Next there is a space.
The remainder of the string specifies what to display, with text as well as special codes to indicate the type of summary desired and to control which nodes display the summary, etc.
For example %mean% indicates that the mean of the specified variable should be shown. Thus to display the mean of the numeric variable Score, you could specify summary="Score \nmean score: %mean%". Note that the part of the string following the first space is "\nmean score: %mean%". This specifies that in each node, after the usual frequency and percentage, the summary should start on a new line with the words “mean score:” followed by the mean.
To see the means without any decimals, the cdigits parameter can be used; for example:
vtree(FakeData,"Severity",summary="Score \nmean score: %mean%",cdigits=0,
  sameline=TRUE,horiz=FALSE)

The following codes can be used to show summary information:
| code | result | 
|---|---|
| %mean% | mean | 
| %SD% | standard deviation | 
| %sum% | sum | 
| %min% | minimum | 
| %max% | maximum | 
| %pX% | Xth percentile (e.g. p50 means the 50th percentile) | 
| %median% | median, i.e. p50 | 
| %IQR% | IQR, i.e. p25, p75 | 
| %npct% | frequency and percentage of a logical variable. By default “valid percentages” are used. Any missing values are also reported. | 
| %pct% | same as %npct% but percentage only (with no parentheses). | 
| %list% | list of individual values, separated by commas | 
| %listlines% | list of individual values, each on a separate line | 
| %mv% | the number of missing values | 
| %nonmv% | the number of non-missing values | 
| %v% | the name of the variable | 
The summary argument can use any number of these codes, mixed with text and formatting codes.
Sometimes it is useful to display summary information for more than one variable. To do this, specify summary as a vector of character strings:
vtree(FakeData,"Severity",horiz=FALSE,showvarnames=FALSE,splitwidth=Inf,sameline=TRUE,
  summary=c("Score \nScore: mean (SD) %mean% (%SD%)","Pre \nPre: range %min%, %max%"))

%list% code

It is sometimes convenient to see individual values of a variable in each node. For example you might want to see ID numbers. To do this, use the %list% code. By default this information will be displayed in each node. When a value occurs more than once in the subset, it will be followed by a count of the number of repetitions in parentheses. The %list% code separates values by commas. Alternatively, the %listlines% code can be used to put each value on a new line.
When there are many IDs, it is often convenient to truncate the output. If you specify %trunc=N%, summary information will be truncated after N characters with “…”.
%noroot%, %leafonly%, %var=v%, and %node=n% codes

By default, summary information is shown in all nodes. However, it may also be convenient to only show it in specific nodes. The following codes are available:
| code | summary information restricted to: | 
|---|---|
| %noroot% | all nodes except the root | 
| %leafonly% | leaf nodes | 
| %var=v% | nodes of variable v | 
| %node=n% | nodes named n | 
Variables in the summary parameter can also be specified in a way similar to the specification of variables for structuring a variable tree. For example, if we wish to know the proportion of patients in each node whose Category is single, we specify Category=single in the summary argument.
Continuous variables such as Score can be dichotomized using notation such as Score>10 or Score<20.
Rather than starting the summary argument with a variable name, an R expression involving variables in the data frame can be given, as long as it does not contain any spaces.
vtree(FakeData,"Severity Category",
  summary="(Post-Pre)/Pre \nmean = %mean%",sameline=TRUE,horiz=FALSE,cdigits=1)

Expressions involving functions can also be used; for example sqrt(abs(Post/Pre)).
Each node in a variable tree provides the frequency of a particular combination of values of the variables. The leaf nodes represent all the observed combinations of values of all of the variables. For example, in a variable tree for Severity and Sex, the leaf nodes correspond to Mild F, Mild M, Moderate F, Moderate M, etc. These combinations, or “patterns”, can be treated as an additional variable. And if this new pattern variable is used as the first variable in a tree, then the branches of the tree will be simplified: each branch will represent a unique pattern, with no sub-branches. A “pattern tree” can be easily produced by specifying pattern=TRUE:
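A pattern tree for the two variables discussed above might be requested as in this sketch (assuming vtree is installed):

```r
library(vtree)

# Each branch represents one observed combination of Severity and Sex
tree <- vtree(FakeData, "Severity Sex", pattern = TRUE)
# print(tree) displays the diagram
```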
       
Pattern trees are easier to read than ordinary variable trees, but they involve a considerable loss of information, since they only represent the nth-level subsets (where n is the number of variables).
Note that by default, when pattern=TRUE is specified, the root node is not shown (in order to simplify the display). A disadvantage of this is that the total sample size is not shown. You can override this behavior by specifying showroot=TRUE.
A pattern tree has two other special characteristics. First, note that after the first level (representing pattern), counts and percentages are not shown, since they are not informative: by definition, all nodes within a branch have the same count. Second, note that in place of arrows, undirected line segments are shown. This is because, unlike in a regular variable tree, the order of variables is irrelevant in a pattern tree. Sometimes, however, the variables do have a natural ordering, as in the case of longitudinal variables. To show arrows, specify seq=TRUE instead of pattern=TRUE, and a “sequence” (i.e. an ordered pattern) will be shown.
Summaries can be shown in pattern trees (using the summary parameter), but they only appear in the pattern node (or the sequence node if seq=TRUE).
A pattern tree has the same structure as a table. Indeed, it may be more convenient to produce a table rather than a tree. A data frame containing the information from the pattern tree can be exported by specifying ptable=TRUE:
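A sketch of the export, assuming vtree is installed; the name pt is illustrative:

```r
library(vtree)

# Return the pattern table as a data frame instead of drawing a tree
pt <- vtree(FakeData, "Severity Sex", ptable = TRUE)
print(pt)
```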
##    n pct Severity Sex
## 1  2   4   Severe   F
## 2  3   7     <NA>   F
## 3  3   7     <NA>   M
## 4  3   7   Severe   M
## 5  5  11 Moderate   M
## 6  8  17     Mild   M
## 7 11  24     Mild   F
## 8 11  24 Moderate   F

The pattern table includes a column for the counts from the pattern nodes, and a column for percentages. Compared to a variable tree, this table is much more compact, and may be more suitable for use in a manuscript.
Pattern trees are useful for indicator variables, i.e. variables that take values like 0/1, no/yes, FALSE/TRUE, etc. For convenience in this section, we’ll refer to 0 (or no, FALSE, etc.) as a negative and 1 (or yes, TRUE, etc.) as an affirmative.
The variables Ind1 through Ind4 in FakeData are 0/1 indicator variables. If these variables are interpreted as representing set membership (0 = non-member, 1 = member), then a pattern tree is an alternative representation of a Venn diagram. If you specify Venn=TRUE, the nodes (except for the pattern nodes) will be blank, with only their shade indicating their value (dark = 1, light = 0, white = missing).
A pattern tree for indicator variables provides all the information that a Venn diagram represents, but unlike a Venn diagram, missing values are also represented. This can also be shown as a pattern table. For example:
##    n pct Ind1 Ind2
## 1  1   2 <NA>    0
## 2 10  22    1    0
## 3 11  24    0    1
## 4 12  26    0    0
## 5 12  26    1    1

VennTable function

For indicator variables, there is an extra function, VennTable, which converts the pattern table to a matrix of character strings and adds some additional totals.
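A sketch of how VennTable might be applied to the pattern table for Ind1 and Ind2, assuming vtree is installed; the name vt is illustrative:

```r
library(vtree)

# Build the pattern table, then convert it to a character matrix with totals
vt <- VennTable(vtree(FakeData, "Ind1 Ind2", ptable = TRUE))
print(vt)
```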
##       n    pct   Ind1 Ind2
##       " 1" " 2"  NA   "0" 
##       "10" "22"  "1"  "0" 
##       "11" "24"  "0"  "1" 
##       "12" "26"  "0"  "0" 
##       "12" "26"  "1"  "1" 
## Total "46" "100" ""   ""  
## N     ""   ""    "22" "23"
## pct   ""   ""    "48" "50"

By default in R, when a matrix of character strings is printed, quotation marks are displayed around each element. Unfortunately the result is unattractive. Instead it’s helpful to call the print function and specify quote=FALSE:
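A sketch of printing without quotation marks, assuming vtree is installed; the name vt is illustrative:

```r
library(vtree)

vt <- VennTable(vtree(FakeData, "Ind1 Ind2", ptable = TRUE))

# quote = FALSE suppresses the quotation marks around each matrix element
print(vt, quote = FALSE)
```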
##       n  pct Ind1 Ind2
##        1  2  <NA> 0   
##       10 22  1    0   
##       11 24  0    1   
##       12 26  0    0   
##       12 26  1    1   
## Total 46 100          
## N            22   23  
## pct          48   50
Without all those quotation marks, it’s easier to see what VennTable adds:
the total sample size (46) and percentage (100), and
the total number (N) of affirmatives for each variable, together with a percentage.
The VennTable function can also be used in an R Markdown document. Specifying markdown=TRUE generates a pandoc markdown pipe table, with several formatting tweaks:
the rows and columns of the table are transposed
affirmatives are represented by checkmarks
negatives are represented by spaces
missing values are represented by dashes (which can be changed with the NAcode parameter).
To display the table in R Markdown, use this inline call:
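The inline call itself is not reproduced here. The expression below is a plausible sketch; in an R Markdown document it would be placed inside inline R code rather than a chunk. The variable name pt is only for illustration.

```r
library(vtree)

# Pattern table for the two indicator variables
pt <- vtree(FakeData, "Ind1 Ind2", ptable = TRUE)

# markdown=TRUE asks VennTable for a pandoc pipe table
VennTable(pt, markdown = TRUE)
```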
| | | | | | | Total | N | % |
|---|---|---|---|---|---|---|---|---|
| n | 1 | 10 | 11 | 12 | 12 | 46 | | |
| % | 2 | 22 | 24 | 26 | 26 | 100 | | |
| Ind1 | - | ✔ | | | ✔ | | 22 | 48 |
| Ind2 | | | ✔ | | ✔ | | 23 | 50 |
VennTable has some additional parameters. The checked parameter is used to specify values that should be interpreted as affirmative. By default, it is set to c("1","TRUE","Yes","yes","N/A"). Similarly, the unchecked parameter is used to specify values that should be interpreted as negative, with default c("0","FALSE","No","no","not N/A").
summary parameter in pattern tables
The summary parameter can also be used in pattern tables. If a single summary is requested, it appears in the summary_1 variable in the data frame. Additional summaries appear as summary_2, summary_3, etc.
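As an illustration, two summaries can be requested at once. Which variables were summarized in the original is not stated, so the choice of Score and Pre (both assumed to be numeric columns of FakeData) is a guess:

```r
library(vtree)

# Each element of summary becomes one summary_k column.
# Score and Pre are assumed here; substitute any numeric variables.
pt <- vtree(FakeData, "Severity Sex",
  summary = c("Score %mean%", "Pre %mean%"),
  ptable = TRUE)
pt
```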
##    n pct Severity Sex summary_1 summary_2
## 1  2   4   Severe   F      28.0      -0.4
## 2  3   7     <NA>   F       6.3      -0.1
## 3  3   7     <NA>   M      23.7      -0.9
## 4  3   7   Severe   M      44.0      -0.3
## 5  5  11 Moderate   M       8.2      -0.7
## 6  8  17     Mild   M       6.3       0.2
## 7 11  24     Mild   F      15.7      -0.4
## 8 11  24 Moderate   F      21.5       0.0
check.is.na parameter
If check.is.na=TRUE is specified, each variable is replaced by an indicator of whether or not it is missing, and pattern=TRUE is automatically set. As when Venn=TRUE is specified, all nodes except for the pattern node are blank, and only their shade indicates missing (dark) or not (light). Whereas the variables used to build a variable tree are normally categorical, in this situation non-categorical variables can be used, because their missingness is represented instead of their actual values.
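A minimal sketch of such a call, assuming the FakeData variables Severity, Age, Pre, and Post:

```r
library(vtree)

# Age, Pre and Post are not categorical, but only their
# missingness is displayed when check.is.na=TRUE
vtree(FakeData, "Severity Age Pre Post", check.is.na = TRUE)
```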
Specifying ptable=TRUE produces this information in a data frame, and calling VennTable shows additional information. To display the table in R Markdown, use this inline call:
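The inline call is not reproduced here. One way it might look, written as a plain R expression (in R Markdown it would go inside inline R code):

```r
library(vtree)

# Pattern table of missingness for four variables
pt <- vtree(FakeData, "Severity Age Pre Post",
  check.is.na = TRUE, ptable = TRUE)

# Pipe table with per-variable totals added
VennTable(pt, markdown = TRUE)
```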
| | | | | | | | | | Total | N | % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n | 1 | 1 | 1 | 1 | 2 | 4 | 4 | 32 | 46 | | |
| % | 2 | 2 | 2 | 2 | 4 | 9 | 9 | 70 | 100 | | |
| MISSING_Severity | | | | | ✔ | ✔ | | | | 6 | 13 |
| MISSING_Age | ✔ | | | | ✔ | | ✔ | | | 7 | 15 |
| MISSING_Pre | ✔ | ✔ | ✔ | | | | | | | 3 | 7 |
| MISSING_Post | | ✔ | | ✔ | | | | | | 2 | 4 |
The rows n and % represent the frequency and percentage of the total number of cases for each pattern of missingness, and the columns N and % on the right-hand side represent the frequency and percentage of missingness for each variable.
It may be useful to identify the ID numbers for these patterns. Here the results are truncated to 15 characters:
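A sketch of a call that could produce such a listing. It assumes that the %list% summary code is combined with vtree's %trunc=n% code to cut the list off after 15 characters:

```r
library(vtree)

# List the id values within each missingness pattern,
# truncating each list to 15 characters
pt <- vtree(FakeData, "Severity Age Pre Post",
  check.is.na = TRUE, ptable = TRUE,
  summary = "id %list%%trunc=15%")
pt
```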
##    n pct MISSING_Severity MISSING_Age MISSING_Pre MISSING_Post          summary_1
## 1  1   2          not N/A         N/A         N/A      not N/A                124
## 2  1   2          not N/A     not N/A         N/A          N/A                118
## 3  1   2          not N/A     not N/A         N/A      not N/A                108
## 4  1   2          not N/A     not N/A     not N/A          N/A                104
## 5  2   4              N/A         N/A     not N/A      not N/A           112, 135
## 6  4   9              N/A     not N/A     not N/A      not N/A 103, 116, 126, ...
## 7  4   9          not N/A         N/A     not N/A      not N/A 105, 119, 128, ...
## 8 32  70          not N/A     not N/A     not N/A      not N/A 101, 102, 106, ...
In datasets exported from REDCap, checkboxes (i.e. the boxes where you select all that apply) are represented in a special way. For each item in a checklist, a separate variable is created. Suppose survey respondents were asked to select which flavors of ice cream (Chocolate, Vanilla, Strawberry) they like. Within REDCap, the variable name for this list of checkboxes is IceCream, but when the dataset is exported, individual variables IceCream___1 (representing Chocolate), IceCream___2 (Vanilla), and IceCream___3 (Strawberry) are created. When the dataset is read into R, the names of the flavors are embedded in the attributes of these variables.
stem:
vtree includes a feature designed to make REDCap checkbox variables easier to use. Instead of typing:
you can use a special syntax where stem: precedes the REDCap variable name:
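The two forms might look like this. The data frame name redcap_export is hypothetical and stands for a REDCap export read into R, so the calls are guarded and only run when such an object exists:

```r
library(vtree)

# redcap_export is a hypothetical data frame from a REDCap export
if (exists("redcap_export")) {
  # Long form: name every checkbox variable
  vtree(redcap_export, "IceCream___1 IceCream___2 IceCream___3")

  # Short form: the stem: prefix followed by the REDCap variable name
  vtree(redcap_export, "stem:IceCream")
}
```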
By default, vtree will also extract the names of the choices and create variables with those names. (This can be disabled by specifying choicechecklist=FALSE.)
An especially convenient way to display checkbox variables with vtree is:
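The call is not shown here. One plausible candidate, given the earlier discussion of indicator variables, is a pattern tree; both that choice and the data frame name redcap_export are assumptions:

```r
library(vtree)

# redcap_export is a hypothetical data frame from a REDCap export
if (exists("redcap_export")) {
  # Show the checkbox items as a pattern tree
  vtree(redcap_export, "stem:IceCream", pattern = TRUE)
}
```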
rc:
Alternatively, if you wish to only examine specific REDCap checkbox items, the rc: prefix can be used. For example, to examine results for just Chocolate and Strawberry:
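A sketch of the rc: form, again using the hypothetical redcap_export data frame. Prefixing each exported variable name with rc: is an assumption about how the prefix is applied:

```r
library(vtree)

# redcap_export is a hypothetical data frame from a REDCap export
if (exists("redcap_export")) {
  # IceCream___1 is Chocolate, IceCream___3 is Strawberry
  vtree(redcap_export, "rc:IceCream___1 rc:IceCream___3")
}
```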
Specifying getscript=TRUE lets you capture the DOT script representing a variable tree. (DOT is a graph description language used by Graphviz, which is used by DiagrammeR, which is used by vtree.) Here is an example:
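A minimal sketch of capturing and printing the script. The variable name dotscript is only for illustration, and a single-variable tree of Severity is assumed:

```r
library(vtree)

# Return the DOT script as a character string instead of drawing the tree
dotscript <- vtree(FakeData, "Severity", getscript = TRUE)

# cat prints the script with its line breaks intact
cat(dotscript)
```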
digraph vtree {
graph [layout = dot, compound=true, nodesep=0.1, ranksep=0.5, fontsize=12]
node [fontname = Helvetica, fontcolor = black,shape = rectangle, color = black,margin=0.1]
rankdir=LR;
Node_L0[style=invisible]
Node_L1[label=<<FONT POINT-SIZE="18"><FONT COLOR="#DE2D26"><B>Severity  </B></FONT></FONT><BR/>> shape=none margin=0]
edge[style=invis];
Node_L0->Node_L1
edge[style=solid]
Node_1->Node_2 Node_1->Node_3 Node_1->Node_4 Node_1->Node_5
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_2[label=<Mild<BR/>19 (48%)> color=black style="rounded,filled" fillcolor=<#FEE0D2>  ]
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_3[label=<Moderate<BR/>16 (40%)> color=black style="rounded,filled" fillcolor=<#FC9272>  ]
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_4[label=<Severe<BR/>5 (12%)> color=black style="rounded,filled" fillcolor=<#DE2D26>  ]
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_5[label=<NA<BR/>6> color=black style="rounded,filled" fillcolor=<white>  ]
}
If you wish to directly edit this code, it can be pasted into an online Graphviz editor, for example:
vtree behaves differently depending on the context in which it is called.
If vtree is called interactively in RStudio, it displays the variable tree in the Viewer window.
If vtree is called interactively from the RGui console (i.e. from R outside of RStudio), it displays the variable tree in a browser window.
vtree has a number of special features when called from R Markdown. By default, when vtree is called from R Markdown, it generates a PNG image file.
Here’s how it does that. vtree uses the DiagrammeR package, which automatically generates an htmlwidget for display in HTML, using the htmlwidgets framework. Then vtree converts the htmlwidget into a PNG.
PNG files are useful because they allow you to display variable trees in Microsoft Word documents, and also because HTML files that use htmlwidgets can get large, and if they contain several widgets they can be slow to load.
If vtree is called while an R Markdown file is being knitted, it generates a PNG file and automatically embeds it into the knitted document. The resolution of the PNG file in pixels is determined by parameters pxwidth and pxheight. If neither is specified, pxwidth is automatically set to 2000, which provides good resolution for a printed page. The height of the image in the R Markdown output document can be specified using the imageheight parameter, for example imageheight="4in" for a 4-inch image. There is also an imagewidth parameter. If neither is specified, imageheight is automatically set to 3 inches.
The PNG file is stored in the folder specified by the folder parameter, or if not specified, a temporary folder will be used. Successive PNG files are named vtree1.png, vtree2.png, and so forth and are stored in the folder. During knitting, vtree uses the options function in base R to store a variable called vtcount to count the PNG files, and a variable called vtfolder to identify the folder where they will be stored.
To call vtree in R Markdown, you can use inline code:
Or you can use a code chunk:
```{r, results="asis"}
cat(vtree(FakeData,"Sex Severity"))
```
One advantage of code chunks is that they can also be run interactively (for example within RStudio, by clicking on the green arrow at the top right of a code chunk).
When knitting to an HTML document, an htmlwidget can be used rather than embedding a PNG file. (In fact PNG files are generated from htmlwidgets via SVG format.) To use an htmlwidget instead of a PNG file, you can use inline code:
Or a code chunk:
```{r}
vtree(FakeData,"Severity Sex",pngknit=FALSE)
```
Note the differences between this and the code chunk (in the previous section) to produce a PNG file: (1) you specify pngknit=FALSE, (2) you don’t specify the chunk option results="asis", and (3) you don’t put cat around the call to vtree.
vtree is designed to generate a variable tree based on a data frame. However, sometimes no data frame is available, but the sizes of subsets are known.
The build.data.frame function allows you to build a data frame by specifying the size of subsets. Here’s an example involving pets:
build.data.frame(
  c("pet","breed","size"),
  list("dog","golden retriever","large",5),
  list("cat","tabby","small",2))
##   pet            breed  size
## 1 dog golden retriever large
## 2 dog golden retriever large
## 3 dog golden retriever large
## 4 dog golden retriever large
## 5 dog golden retriever large
## 6 cat            tabby small
## 7 cat            tabby small
In this case there are five large golden retrievers and two small tabby cats. Although a data frame like this could easily be created without using build.data.frame, it’s a different situation when the counts are large. For example:
vtree(build.data.frame(
  c("pet","breed","size"),
  list("dog","golden retriever","large",5),
  list("cat","tabby","small",2),
  list("dog","Dalmation","various",101),
  list("cat","Abyssinian","small",5),
  list("cat","Abyssinian","large",22),
  list("cat","tabby","large",86)))
Consider the following fictitious data about a randomized controlled trial (RCT):
##      id   eligible     randomized group        followup analyzed
## 1   001   Eligible     Randomized     B     Followed up Analyzed
## 2   002   Eligible Not randomized  <NA>            <NA>     <NA>
## 3   003   Eligible     Randomized     A Not followed up     <NA>
## 4   004   Eligible     Randomized     B     Followed up Analyzed
## 5   005   Eligible     Randomized     A     Followed up Analyzed
## 6   006 Ineligible           <NA>  <NA>            <NA>     <NA>
## 7   007   Eligible     Randomized     A     Followed up Analyzed
## 8   008 Ineligible           <NA>  <NA>            <NA>     <NA>
## 9   009   Eligible     Randomized     A     Followed up Analyzed
## 10 0010 Ineligible           <NA>  <NA>            <NA>     <NA>
## 11 0011   Eligible     Randomized     B     Followed up Analyzed
## 12 0012 Ineligible           <NA>  <NA>            <NA>     <NA>
The CONSORT diagram (http://www.consort-statement.org/) shows the flow of patients through the study, starting with those who meet eligibility criteria, then those who are randomized, etc. It is easy to produce a rudimentary version of a CONSORT diagram in vtree. The key step is to prune branches for those who are not eligible, not randomized, etc. This can be done using the keep parameter:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
  keep=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
  horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
Note that this does not include all of the additional information for a full CONSORT diagram (exclusion reasons and counts, as well as numbers of patients who received their allocated interventions, who discontinued intervention, and who were excluded from analysis). It does, however, provide the main flow information.
Additional information can be obtained by viewing the nodes for patients in the pruned branches (but not their descendants). The follow parameter makes that easy:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
  follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
  horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
Finally, it may be useful to see the ID numbers in each node. This can be done using the summary parameter with the %list% code. Since IDs are less useful in the root node, the %noroot% code is also specified here:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
  follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
  horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
  summary="id \nid: %list% %noroot%")
The datasets package is loaded in R by default. In the following section, vtree is applied to several of these data sets for illustrative purposes. Note that the variable trees generated by the commands below are not shown. The reader can try these commands to see what the variable trees look like, and experiment with many other possibilities.
The esoph data set (data from a case-control study of esophageal cancer in Ille-et-Vilaine, France), has 88 different combinations of age group, alcohol consumption, and tobacco consumption. Let’s examine the total number of cases and the total number of controls among patients aged 75 and older compared to the rest of the patients:
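One way to request these totals is sketched below. It assumes the variable=value form of a variable specification and vtree's %sum% summary code; the label text is only illustrative:

```r
library(vtree)

# agegp=75+ splits patients into those aged 75 and older versus the rest.
# ncases and ncontrols are counts, so they are summed within each node.
vtree(esoph, "agegp=75+",
  summary = c("ncases \ncases: %sum%", "ncontrols \ncontrols: %sum%"),
  cdigits = 0)
```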
The HairEyeColor data set is an array representing a contingency table (also called a crosstab or crosstabulation). Before vtree can be applied to this data set, it is necessary to convert the table of crosstabulated frequencies to a data frame of cases. For convenience, the vtree package includes a helper function to do this, called crosstabToCases. It is adapted from a function listed on the Cookbook for R website.
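A short sketch of the conversion; the name hec for the resulting data frame is only an illustration:

```r
library(vtree)

# One row per person, with columns Hair, Eye and Sex
hec <- crosstabToCases(HairEyeColor)
head(hec)
```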
There are a lot of combinations but let’s say we are especially interested in green eyes (as compared to non-green eyes). We can use the variable specification Eye=Green to do this:
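As a sketch, with the crosstab first converted to a data frame of cases (called hec here for illustration). The choice and order of the other variables is an assumption:

```r
library(vtree)

hec <- crosstabToCases(HairEyeColor)

# Eye=Green collapses eye color into green versus not green
vtree(hec, "Hair Eye=Green Sex")
```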
The Titanic data set is a 4-dimensional array of counts. First, let’s convert it to a data frame of individuals:
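The conversion can be sketched with crosstabToCases; the name titanic is only for illustration:

```r
library(vtree)

# One row per passenger or crew member,
# with columns Class, Sex, Age and Survived
titanic <- crosstabToCases(Titanic)
head(titanic)
```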
We’ll specify sameline=TRUE so that the variable tree is a bit more compact:
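A possible call is shown below. The nesting order and the survival summary are assumptions; the summary relies on the variable=value form together with the %pct% code:

```r
library(vtree)

titanic <- crosstabToCases(Titanic)

# Percentage who survived, by age within sex within class
vtree(titanic, "Class Sex Age",
  summary = "Survived=Yes \n%pct% survived",
  sameline = TRUE)
```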
The mtcars data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
The rownames of the data set contain the names of the cars. Let’s move that information into a column. To do that, we’ll make a slightly altered version of the data frame which we’ll call mt:
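This step needs only base R. The column name name is an assumption:

```r
# Copy mtcars and move the row names into an ordinary column
mt <- mtcars
mt$name <- rownames(mtcars)
rownames(mt) <- NULL
head(mt)
```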
Now let’s look at the mean and standard deviation of horsepower (HP) by number of carburetors, nested within number of gears, and in turn nested within number of cylinders:
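A sketch of such a call, using the %mean% and %SD% summary codes. The data frame mt is rebuilt here so the example runs on its own:

```r
library(vtree)

mt <- mtcars
mt$name <- rownames(mtcars)
rownames(mt) <- NULL

# Mean (SD) of horsepower in every node of the cyl > gear > carb tree
vtree(mt, "cyl gear carb",
  summary = "hp \nmean (SD) HP %mean% (%SD%)")
```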
The above shows the mean and SD of horsepower by (1) number of cylinders; (2) number of gears (within number of cylinders); and (3) number of carburetors (within number of gears nested within number of cylinders). That’s a lot of information. Suppose instead that we are only interested in number 3 above, i.e. all combinations of number of cylinders, number of gears, and number of carburetors.
In that case, we can specify ptable=TRUE. To make the table a little easier to read, set the number of digits for the mean and SD to be zero, and relabel the variables.
vtree(mt,"cyl gear carb",summary="hp mean (SD) HP %mean% (%SD%)",
  cdigits=0,labelvar=c(cyl="# cylinders",gear="# gears",carb="# carburetors"),
  ptable=TRUE)
We might also like to list the names of cars by number of carburetors nested within number of gears:
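One way to do this is with the %list% code. The column name name (holding the car names) and the use of %noroot% are assumptions:

```r
library(vtree)

mt <- mtcars
mt$name <- rownames(mtcars)
rownames(mt) <- NULL

# List car names in each node, except the root
vtree(mt, "gear carb", summary = "name %list%%noroot%")
```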
The UCBAdmissions data consists of aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex. According to the data set Details, “This data set is frequently used for illustrating Simpson’s paradox, see Bickel et al. (1975). At issue is whether the data show evidence of sex bias in admission practices. There were 2691 male applicants, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants of whom 557 (30.4%) were admitted.” Furthermore, “the apparent association between admission and sex stems from differences in the tendency of males and females to apply to the individual departments (females used to apply more to departments with higher rejection rates).”
First, we’ll convert the crosstab data to a data frame of cases, ucb:
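The conversion might look like this:

```r
library(vtree)

# One row per applicant, with columns Admit, Gender and Dept
ucb <- crosstabToCases(UCBAdmissions)
head(ucb)
```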
Next, let’s look at admission rates by Gender, nested within department:
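A sketch of one possible call. The summary text and the use of sameline=TRUE are assumptions:

```r
library(vtree)

ucb <- crosstabToCases(UCBAdmissions)

# Percentage admitted, by Gender within Dept
vtree(ucb, "Dept Gender",
  summary = "Admit=Admitted \n%pct% admitted",
  sameline = TRUE)
```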
The ChickWeight data set is from an experiment on the effect of diet on early growth of chicks. Let’s look at the mean weight of chicks at birth (0 days of age) and 4 days of age, nested within type of diet. A simple variable tree can be produced like this:
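For instance (a sketch; the keep parameter is used here to retain only the nodes for days 0 and 4, and the label text is illustrative):

```r
library(vtree)

# Mean weight at days 0 and 4, within each diet
vtree(ChickWeight, "Diet Time",
  keep = list(Time = c("0", "4")),
  summary = "weight \nmean weight %mean%")
```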
To make the display a little easier to read, relabel the nodes and the Time variable:
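Relabeling could be sketched with the labelnode and labelvar parameters. The new labels below are illustrative, and labelnode is assumed to take new labels as names and existing values as elements:

```r
library(vtree)

vtree(ChickWeight, "Diet Time",
  keep = list(Time = c("0", "4")),
  summary = "weight \nmean weight %mean%",
  # new label = old value
  labelnode = list(Diet = c("Diet 1" = "1", "Diet 2" = "2",
                            "Diet 3" = "3", "Diet 4" = "4")),
  labelvar = c(Time = "Days since birth"))
```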
The InsectSprays data set contains counts of insects in agricultural experimental units treated with different insecticides. Let’s look at those counts by insecticide.
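A minimal sketch that lists the individual counts in each insecticide node. The choice of %list% and the label text are assumptions:

```r
library(vtree)

# spray identifies the insecticide; count is the number of insects
vtree(InsectSprays, "spray",
  summary = "count \ncounts: %list%%noroot%",
  cdigits = 0)
```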
The ToothGrowth data set contains the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
Let’s examine the percentage with length > 20 by dose nested within delivery method:
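One possible call is sketched below. It assumes the summary parameter accepts a comparison such as len>20 combined with the %pct% code:

```r
library(vtree)

# supp is the delivery method (OJ or VC); dose is nested within it
vtree(ToothGrowth, "supp dose",
  summary = "len>20 \n%pct% with length over 20")
```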
To make the display a little easier to read, relabel the nodes and the variables: