Trees are ubiquitous in mathematics, computer science, data science, finance, and many other fields. They are especially useful when we are facing hierarchical data.
Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no high-level hierarchical data structure that could be used as conveniently and generically as, say, data.frame.
As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames (or - perish the thought! - in Excel sheets). But hierarchies don’t marry with tables and various workarounds are usually required.
This package offers an alternative. The data.tree package allows you to create hierarchies with the Node object. Node provides basic traversal, search, and sort operations. You can decorate Nodes with attributes and methods, extending the package to your needs.
The package also provides convenient methods for neatly printing trees, and converting trees to data.frames for integration with other packages.
The example in this vignette revolves around decision trees.
Let’s start by creating a tree of Nodes. In our example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 0) is the company. On level 1, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:
library(data.tree)
acme <- Node$new("Acme Inc.")
accounting <- acme$AddChild("Accounting")
software <- accounting$AddChild("New Software")
standards <- accounting$AddChild("New Accounting Standards")
research <- acme$AddChild("Research")
newProductLine <- research$AddChild("New Product Line")
newLabs <- research$AddChild("New Labs")
it <- acme$AddChild("IT")
outsource <- it$AddChild("Outsource")
agile <- it$AddChild("Go agile")
goToR <- it$AddChild("Switch to R")
print(acme)
##                           levelName
## 1  Acme Inc.
## 2   ¦--Accounting
## 3   ¦   ¦--New Software
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research
## 6   ¦   ¦--New Product Line
## 7   ¦   °--New Labs
## 8   °--IT
## 9       ¦--Outsource
## 10      ¦--Go agile
## 11      °--Switch to R
Note that Node is an R6 reference class. Essentially, this has two implications:
1. You can call methods on a Node in OO style.
2. You can call methods on a Node that modify it, without having to re-assign to a new variable. This is different from value semantics, which is much more widely used in R.
For example, we can check if a Node is the root:
acme$isRoot
## [1] TRUE
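The difference between reference and value semantics can be sketched without data.tree. The Counter class below is hypothetical and exists only for this illustration; it assumes nothing beyond the R6 package itself.

```r
library(R6)

# Hypothetical class, used only to illustrate reference semantics
Counter <- R6Class("Counter",
  public = list(
    count = 0,
    Increment = function() {
      self$count <- self$count + 1
      invisible(self)  # returning self makes calls chainable
    }
  )
)

a <- Counter$new()
b <- a                       # b refers to the same object; no copy is made
b$Increment()$Increment()    # modifies the object in place
a$count                      # the change made through b is visible through a

# Value semantics, for comparison: changing a copy leaves the original alone
v <- c(count = 0)
w <- v
w["count"] <- 1
v["count"]                   # unchanged
```

This is why methods that modify a Node do not need their result assigned back to a variable.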
Now, let’s associate some costs with the projects. We do this by setting custom attributes on the leaf Nodes:
software$cost <- 1000000
standards$cost <- 500000
newProductLine$cost <- 2000000
newLabs$cost <- 750000
outsource$cost <- 400000
agile$cost <- 250000
goToR$cost <- 50000
Also, we set the probabilities that the projects will be executed in the next year:
software$p <- 0.5
standards$p <- 0.75
newProductLine$p <- 0.25
newLabs$p <- 0.9
outsource$p <- 0.2
agile$p <- 0.05
goToR$p <- 1
Converting to data.frame
We can now convert the tree into a data.frame. Note that we always call such methods on the root Node:
acmedf <- as.data.frame(acme)
The same can be achieved by using the OO-style method Node$ToDataFrame:
acme$ToDataFrame()
Adding the project cost to our data.frame is easy to do with the Get method. We’ll explain the Get method in more detail below.
acmedf$level <- acme$Get("level")
acmedf$cost <- acme$Get("cost")
We could have achieved the same result in one go, using the OO-style ToDataFrame method:
acme$ToDataFrame("level", "cost")
## levelName level cost
## 1 Acme Inc. 0 NA
## 2 ¦--Accounting 1 NA
## 3 ¦ ¦--New Software 2 1000000
## 4 ¦ °--New Accounting Standards 2 500000
## 5 ¦--Research 1 NA
## 6 ¦ ¦--New Product Line 2 2000000
## 7 ¦ °--New Labs 2 750000
## 8 °--IT 1 NA
## 9 ¦--Outsource 2 400000
## 10 ¦--Go agile 2 250000
## 11 °--Switch to R 2 50000
Internally, the same method is called when printing a tree:
print(acme, "level", "cost")
Using Get when converting to data.frame and for printing
Above, we saw how we can add the name of an attribute to the ellipsis argument of as.data.frame. We can also pass the results of the Get method directly to as.data.frame. This allows, for example, formatting the column in a specific way. Details of the Get method are explained in the next section.
acme$ToDataFrame("level",
probability = acme$Get("p", format = FormatPercent)
)
## levelName level probability
## 1 Acme Inc. 0
## 2 ¦--Accounting 1
## 3 ¦ ¦--New Software 2 50.00 %
## 4 ¦ °--New Accounting Standards 2 75.00 %
## 5 ¦--Research 1
## 6 ¦ ¦--New Product Line 2 25.00 %
## 7 ¦ °--New Labs 2 90.00 %
## 8 °--IT 1
## 9 ¦--Outsource 2 20.00 %
## 10 ¦--Go agile 2 5.00 %
## 11 °--Switch to R 2 100.00 %
Get method (Tree Traversal)
Tree traversal is one of the core concepts of trees. See, for example, Tree Traversal on Wikipedia. The Get method traverses the tree and collects values from each node. It then returns a vector containing the collected values.
Additional features of the Get method are:
- execute a function on each node, and append the function’s return value to the returned vector
- execute a Node method on each node, and append the method’s return value to the returned vector
- assign the returned value to a Node’s attribute
The Get method can traverse the tree in various ways. This is called traversal order.
The default traversal mode is pre-order.
This is what is used e.g. in as.data.frame and its OO-style counterpart Node$ToDataFrame:
acme$ToDataFrame("level")
## levelName level
## 1 Acme Inc. 0
## 2 ¦--Accounting 1
## 3 ¦ ¦--New Software 2
## 4 ¦ °--New Accounting Standards 2
## 5 ¦--Research 1
## 6 ¦ ¦--New Product Line 2
## 7 ¦ °--New Labs 2
## 8 °--IT 1
## 9 ¦--Outsource 2
## 10 ¦--Go agile 2
## 11 °--Switch to R 2
The post-order traversal mode returns children first, returning parents only after all their children have been traversed.
We can use it like this on the Get method:
data.frame(level = acme$Get('level', traversal = "post-order"))
## level
## New Software 2
## New Accounting Standards 2
## Accounting 1
## New Product Line 2
## New Labs 2
## Research 1
## Outsource 2
## Go agile 2
## Switch to R 2
## IT 1
## Acme Inc. 0
This is useful if your parent’s value depends on the children, as we’ll see below.
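To make the two orders concrete, here is a minimal base R sketch that does not use data.tree at all. The nested-list representation and the helpers MakeNode, PreOrder and PostOrder are assumptions made for this illustration, not part of the package.

```r
# Hypothetical mini-tree as nested lists: each node has a name and children
MakeNode <- function(name, ...) list(name = name, children = list(...))

tree <- MakeNode("Acme Inc.",
  MakeNode("Accounting",
    MakeNode("New Software"),
    MakeNode("New Accounting Standards")),
  MakeNode("IT",
    MakeNode("Outsource"))
)

# Pre-order: visit the node itself, then each child subtree, left to right
PreOrder <- function(node) {
  c(node$name, unlist(lapply(node$children, PreOrder)))
}

# Post-order: visit all child subtrees first, then the node itself
PostOrder <- function(node) {
  c(unlist(lapply(node$children, PostOrder)), node$name)
}

PreOrder(tree)   # parents appear before their children
PostOrder(tree)  # parents appear after their children
```

In the post-order result the root comes last, which is what allows a parent to rely on values already computed for its children.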
The ancestor mode is a non-standard traversal mode that does not traverse the entire tree. Instead, it starts from a Node, then walks the tree along the path from parent to parent, up to the root.
data.frame(level = agile$Get('level', traversal = "ancestor"))
## level
## Go agile 2
## IT 1
## Acme Inc. 0
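The idea behind the ancestor mode can be sketched in base R by following parent pointers until there is none left. The environment-based nodes and the helpers NewNode and AncestorNames are hypothetical and only serve this illustration.

```r
# Hypothetical node constructor: an environment holding a name and a parent
NewNode <- function(name, parent = NULL) {
  node <- new.env()
  node$name <- name
  node$parent <- parent
  node
}

company <- NewNode("Acme Inc.")
dept    <- NewNode("IT", parent = company)
project <- NewNode("Go agile", parent = dept)

# Start at the given node and walk up until the root (parent is NULL)
AncestorNames <- function(node) {
  result <- character(0)
  while (!is.null(node)) {
    result <- c(result, node$name)
    node <- node$parent
  }
  result
}

AncestorNames(project)
```

Only the nodes on the path to the root are visited, so siblings such as the other IT projects never appear in the result.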
Get using a function
You can pass a standard R function to the Get method. For example:
ExpectedCost <- function(node) {
  result <- node$cost * node$p
  # nodes without cost or p yield a zero-length result; report NA instead
  if (length(result) == 0) result <- NA
  return(result)
}
data.frame(acme$Get(ExpectedCost))
## acme.Get.ExpectedCost.
## Acme Inc. NA
## Accounting NA
## New Software 500000
## New Accounting Standards 375000
## Research NA
## New Product Line 500000
## New Labs 675000
## IT NA
## Outsource 80000
## Go agile 12500
## Switch to R 50000
The requirements for the function (ExpectedCost in the above example) are the following:
- its first argument is a Node
- it returns a single value (a scalar)
In the following examples, we use magrittr to enhance readability of the code.
library(magrittr)
ExpectedCost <- function(node) {
  result <- node$cost * node$p
  if (length(result) == 0) {
    if (node$isLeaf) result <- NA
    else {
      node$children %>% sapply(ExpectedCost) %>% sum -> result
    }
  }
  return(result)
}
data.frame(ec = acme$Get(ExpectedCost))
## ec
## Acme Inc. 2192500
## Accounting 875000
## New Software 500000
## New Accounting Standards 375000
## Research 1175000
## New Product Line 500000
## New Labs 675000
## IT 142500
## Outsource 80000
## Go agile 12500
## Switch to R 50000
The Get method accepts an ellipsis (...). Any additional parameters with which Get is called will be passed on to the ExpectedCost function. This gives us more flexibility. For instance, we don’t have to hard-code the sum function into ExpectedCost, but we can leave it to the caller to provide the function to use:
ExpectedCost <- function(node, fun = sum) {
  result <- node$cost * node$p
  if (length(result) == 0) {
    if (node$isLeaf) result <- NA
    else {
      node$children %>% sapply(function(x) ExpectedCost(x, fun = fun)) %>% fun -> result
    }
  }
  return(result)
}
data.frame(ec = acme$Get(ExpectedCost, fun = mean))
## ec
## Acme Inc. 357500
## Accounting 437500
## New Software 500000
## New Accounting Standards 375000
## Research 587500
## New Product Line 500000
## New Labs 675000
## IT 47500
## Outsource 80000
## Go agile 12500
## Switch to R 50000
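The figures in this output can be checked by hand. The sketch below copies the leaf values from the output above and applies mean level by level; the variable names are chosen for this illustration only.

```r
# Expected costs of the leaf projects, copied from the output above
accountingLeaves <- c(500000, 375000)
researchLeaves   <- c(500000, 675000)
itLeaves         <- c(80000, 12500, 50000)

# fun = mean is applied level by level: first per department ...
departmentMeans <- c(
  Accounting = mean(accountingLeaves),
  Research   = mean(researchLeaves),
  IT         = mean(itLeaves)
)

# ... then over the department values to obtain the root value
rootMean <- mean(departmentMeans)

# For comparison: the plain mean over all seven projects
leafMean <- mean(c(accountingLeaves, researchLeaves, itLeaves))

departmentMeans
rootMean
leafMean
```

Note that the root value is a mean of department means, so it generally differs from the mean over all leaves whenever departments have different numbers of projects.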
Assigning values using Get
We can tell the Get method to assign the value to a specific attribute for each Node it traverses. This is especially useful if the attribute parameter is a function, as in the previous examples. For instance, we can store the calculated expected cost for later use and printing:
acme$Get(function(x) x$p * x$cost, assign = "expectedCost")
## Acme Inc. Accounting New Software
## NA NA 500000
## New Accounting Standards Research New Product Line
## 375000 NA 500000
## New Labs IT Outsource
## 675000 NA 80000
## Go agile Switch to R
## 12500 50000
print(acme, "p", "cost", "expectedCost")
## levelName p cost expectedCost
## 1 Acme Inc. NA NA NA
## 2 ¦--Accounting NA NA NA
## 3 ¦ ¦--New Software 0.50 1000000 500000
## 4 ¦ °--New Accounting Standards 0.75 500000 375000
## 5 ¦--Research NA NA NA
## 6 ¦ ¦--New Product Line 0.25 2000000 500000
## 7 ¦ °--New Labs 0.90 750000 675000
## 8 °--IT NA NA NA
## 9 ¦--Outsource 0.20 400000 80000
## 10 ¦--Go agile 0.05 250000 12500
## 11 °--Switch to R 1.00 50000 50000
In the above recursion example, we iterate - for each node - to all descendants straight to the leaf, repeating the very same calculations various times.
We can avoid repeating calculations by piggy-backing on precalculated values. Obviously, this requires us to traverse the tree in post-order mode: We want to start calculating at the leaves, cache the results for later use, then walk back towards the root.
In the following example, we calculate the average expected cost, just as above. As this now depends only on a Node’s children, and because we walk the tree in post-order mode, we can be sure that our children have the value calculated when we traverse the parent.
ExpectedCost <- function(node, variableName = "avgExpectedCost", fun = sum) {
  # if the "cache" is filled, return it; this stops the recursion
  if (!is.null(node[[variableName]])) return(node[[variableName]])
  # otherwise, calculate from the node's own properties
  result <- node$cost * node$p
  # if the properties are not set, aggregate the children's values with fun
  if (length(result) == 0) {
    if (node$isLeaf) result <- NA
    else {
      node$children %>%
        sapply(function(x) ExpectedCost(x, variableName = variableName, fun = fun)) %>%
        fun -> result
    }
  }
  return(result)
}
We can use our method like this:
invisible(
acme$Get(ExpectedCost, fun = mean, traversal = "post-order", assign = "avgExpectedCost")
)
print(acme, "cost", "p", "avgExpectedCost")
## levelName cost p avgExpectedCost
## 1 Acme Inc. NA NA 357500
## 2 ¦--Accounting NA NA 437500
## 3 ¦ ¦--New Software 1000000 0.50 500000
## 4 ¦ °--New Accounting Standards 500000 0.75 375000
## 5 ¦--Research NA NA 587500
## 6 ¦ ¦--New Product Line 2000000 0.25 500000
## 7 ¦ °--New Labs 750000 0.90 675000
## 8 °--IT NA NA 47500
## 9 ¦--Outsource 400000 0.20 80000
## 10 ¦--Go agile 250000 0.05 12500
## 11 °--Switch to R 50000 1.00 50000
Formatting with Get
We can pass a formatting function to the Get method, which will convert the returned value to a human-readable string for printing.
PrintMoney <- function(x) {
format(x, digits=10, nsmall=2, decimal.mark=".", big.mark="'", scientific = FALSE)
}
print(acme, cost = acme$Get("cost", format = PrintMoney))
## levelName cost
## 1 Acme Inc. NA
## 2 ¦--Accounting NA
## 3 ¦ ¦--New Software 1'000'000.00
## 4 ¦ °--New Accounting Standards 500'000.00
## 5 ¦--Research NA
## 6 ¦ ¦--New Product Line 2'000'000.00
## 7 ¦ °--New Labs 750'000.00
## 8 °--IT NA
## 9 ¦--Outsource 400'000.00
## 10 ¦--Go agile 250'000.00
## 11 °--Switch to R 50'000.00
Note that the format is not used for assignment with the assign parameter, but only for the values returned by Get:
acme$Get("cost", format = PrintMoney, assign = "cost2")
## Acme Inc. Accounting New Software
## "NA" "NA" "1'000'000.00"
## New Accounting Standards Research New Product Line
## "500'000.00" "NA" "2'000'000.00"
## New Labs IT Outsource
## "750'000.00" "NA" "400'000.00"
## Go agile Switch to R
## "250'000.00" "50'000.00"
print(acme, cost = acme$Get("cost2"))
## levelName cost
## 1 Acme Inc. NA
## 2 ¦--Accounting NA
## 3 ¦ ¦--New Software 1000000
## 4 ¦ °--New Accounting Standards 500000
## 5 ¦--Research NA
## 6 ¦ ¦--New Product Line 2000000
## 7 ¦ °--New Labs 750000
## 8 °--IT NA
## 9 ¦--Outsource 400000
## 10 ¦--Go agile 250000
## 11 °--Switch to R 50000
The format function is useful not only for formatting numbers, but also for displaying a printable representation of a Node field that is not a numeric (but e.g. a matrix).
Set method
The Set method is the counterpart to the Get method. The Set method takes a vector or a single value as an input, and traverses the tree in a certain order. Each Node is assigned a value from the vector, one after the other.
employees <- c(NA, 52, NA, NA, 78, NA, NA, 39, NA, NA, NA)
acme$Set(employees)
print(acme, "employees")
## levelName employees
## 1 Acme Inc. NA
## 2 ¦--Accounting 52
## 3 ¦ ¦--New Software NA
## 4 ¦ °--New Accounting Standards NA
## 5 ¦--Research 78
## 6 ¦ ¦--New Product Line NA
## 7 ¦ °--New Labs NA
## 8 °--IT 39
## 9 ¦--Outsource NA
## 10 ¦--Go agile NA
## 11 °--Switch to R NA
The Set method can take multiple vectors as an input, and, optionally, you can define the name of the attribute:
secretaries <- c(NA, 5, NA, NA, 6, NA, NA, 2, NA, NA, NA)
acme$Set(secretaries, secPerEmployee = secretaries/employees)
print(acme, "employees", "secretaries", "secPerEmployee")
## levelName employees secretaries secPerEmployee
## 1 Acme Inc. NA NA NA
## 2 ¦--Accounting 52 5 0.09615385
## 3 ¦ ¦--New Software NA NA NA
## 4 ¦ °--New Accounting Standards NA NA NA
## 5 ¦--Research 78 6 0.07692308
## 6 ¦ ¦--New Product Line NA NA NA
## 7 ¦ °--New Labs NA NA NA
## 8 °--IT 39 2 0.05128205
## 9 ¦--Outsource NA NA NA
## 10 ¦--Go agile NA NA NA
## 11 °--Switch to R NA NA NA
Just as for the Get method, the traversal order is important for the Set method, because values are matched to Nodes by position.
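The effect of the traversal order can be illustrated without calling data.tree: the sketch below pairs the same employees vector with the node names in pre-order and in post-order, as listed in the outputs earlier in this vignette. It only models the positional matching and is not the package's implementation.

```r
# Node names in the two traversal orders, copied from the outputs above
preOrder <- c("Acme Inc.", "Accounting", "New Software",
              "New Accounting Standards", "Research", "New Product Line",
              "New Labs", "IT", "Outsource", "Go agile", "Switch to R")
postOrder <- c("New Software", "New Accounting Standards", "Accounting",
               "New Product Line", "New Labs", "Research",
               "Outsource", "Go agile", "Switch to R", "IT", "Acme Inc.")

employees <- c(NA, 52, NA, NA, 78, NA, NA, 39, NA, NA, NA)

# Values are matched to nodes purely by position in the traversal
byPre  <- setNames(employees, preOrder)
byPost <- setNames(employees, postOrder)

byPre[["Accounting"]]                 # the department receives its head count
byPost[["Accounting"]]                # NA: the value went elsewhere
byPost[["New Accounting Standards"]]  # the second value lands on a project
```

A vector prepared for one traversal order therefore has to be applied with that same order.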
Often, it is useful to use Get and Set together:
ec <- acme$Get(function(x) x$p * x$cost)
acme$Set(expectedCost = ec)
print(acme, "p", "cost", "expectedCost")
## levelName p cost expectedCost
## 1 Acme Inc. NA NA NA
## 2 ¦--Accounting NA NA NA
## 3 ¦ ¦--New Software 0.50 1000000 500000
## 4 ¦ °--New Accounting Standards 0.75 500000 375000
## 5 ¦--Research NA NA NA
## 6 ¦ ¦--New Product Line 0.25 2000000 500000
## 7 ¦ °--New Labs 0.90 750000 675000
## 8 °--IT NA NA NA
## 9 ¦--Outsource 0.20 400000 80000
## 10 ¦--Go agile 0.05 250000 12500
## 11 °--Switch to R 1.00 50000 50000
This is equivalent to using Get with the assign parameter.
The Set method can also be used to assign a single value directly to all Nodes traversed. For example, to remove the avgExpectedCost, we assign NULL on each node:
acme$Set(avgExpectedCost = NULL)
Note that attributes that were never assigned are also NULL:
acme$newAttribute
## NULL
As Node is an R6 reference object, we can chain the calls:
acme$Set(avgExpectedCost = NULL)$Set(expectedCost = NA)
print(acme, "avgExpectedCost", "expectedCost")
## levelName avgExpectedCost expectedCost
## 1 Acme Inc. NA NA
## 2 ¦--Accounting NA NA
## 3 ¦ ¦--New Software NA NA
## 4 ¦ °--New Accounting Standards NA NA
## 5 ¦--Research NA NA
## 6 ¦ ¦--New Product Line NA NA
## 7 ¦ °--New Labs NA NA
## 8 °--IT NA NA
## 9 ¦--Outsource NA NA
## 10 ¦--Go agile NA NA
## 11 °--Switch to R NA NA
This is equivalent to:
acme$Set(avgExpectedCost = NULL, expectedCost = NA)
NULL and NA
Also note that setting a value to NA or to NULL looks equivalent when printing to a data.frame, but internally it is not:
acme$avgExpectedCost
## NULL
acme$expectedCost
## [1] NA
The reason is that NULL is always converted to NA for printing, and when using the Get method.
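The distinction is a general property of R rather than something specific to Node. In the base R sketch below, a plain list stands in for a node's attributes, and NullToNA is a hypothetical helper that mimics the conversion described above.

```r
# A plain list stands in for a node's attributes (illustration only)
attrs <- list(expectedCost = NA)

is.null(attrs$avgExpectedCost)   # TRUE: the attribute was never set
is.null(attrs$expectedCost)      # FALSE: it is set, to a missing value
is.na(attrs$expectedCost)        # TRUE

length(NULL)                     # 0: NULL cannot fill a data.frame cell
length(NA)                       # 1

# Hypothetical helper mimicking the NULL-to-NA conversion
NullToNA <- function(x) if (is.null(x)) NA else x
NullToNA(attrs$avgExpectedCost)  # NA
```

Because NULL has length zero, it has to be replaced by NA before it can occupy a position in a vector or a data.frame column.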
Aggregate method
For simple cases, you don’t have to write your own function to be passed along to the Get method. For example, the Aggregate method provides a shorthand for the oft-used case when a parent’s value is the aggregate of its children’s values:
acme$Aggregate("cost", sum)
## [1] 4950000
We can use this in the Get method:
acme$Get("Aggregate", "cost", sum)
## Acme Inc. Accounting New Software
## 4950000 1500000 1000000
## New Accounting Standards Research New Product Line
## 500000 2750000 2000000
## New Labs IT Outsource
## 750000 700000 400000
## Go agile Switch to R
## 250000 50000
This is the equivalent of:
GetCost <- function(node) {
  result <- node$cost
  if (length(result) == 0) {
    # paste0 avoids the doubled spaces that paste would insert
    if (node$isLeaf) stop(paste0("Cost for ", node$name, " not available!"))
    else {
      node$children %>% sapply(GetCost) %>% sum -> result
    }
  }
  return(result)
}
acme$Get(GetCost)
## Acme Inc. Accounting New Software
## 4950000 1500000 1000000
## New Accounting Standards Research New Product Line
## 500000 2750000 2000000
## New Labs IT Outsource
## 750000 700000 400000
## Go agile Switch to R
## 250000 50000
Sort method
You can sort an entire tree by using the Sort method on the root. The method will sort recursively and, for each Node, sort children by a child attribute. As before, the child attribute can also be a function or a method (e.g. of a sub-class of Node, see below).
acme$Get(ExpectedCost, assign = "expectedCost")
## Acme Inc. Accounting New Software
## 2192500 875000 500000
## New Accounting Standards Research New Product Line
## 375000 1175000 500000
## New Labs IT Outsource
## 675000 142500 80000
## Go agile Switch to R
## 12500 50000
acme$Sort("expectedCost", decreasing = TRUE)
print(acme, "expectedCost")
## levelName expectedCost
## 1 Acme Inc. 2192500
## 2 ¦--Research 1175000
## 3 ¦ ¦--New Labs 675000
## 4 ¦ °--New Product Line 500000
## 5 ¦--Accounting 875000
## 6 ¦ ¦--New Software 500000
## 7 ¦ °--New Accounting Standards 375000
## 8 °--IT 142500
## 9 ¦--Outsource 80000
## 10 ¦--Switch to R 50000
## 11 °--Go agile 12500
Naturally, you can also sort a sub-tree by calling Sort on the sub-tree’s root node.
Subclassing Node
We can create a subclass of Node, and add custom methods to our subclass. This comes naturally to users with experience in OO languages such as Java, Python or C#:
library(R6)
MyNode <- R6Class("MyNode",
  inherit = Node,
  # allow attributes to be added to instances after creation
  # (lock_objects is the current name of R6's deprecated lock argument)
  lock_objects = FALSE,
  # public fields and functions
  public = list(
    p = NULL,
    cost = NULL,
    AddChild = function(name) {
      child <- MyNode$new(name)
      invisible(self$AddChildNode(child))
    }
  ),
  # active bindings
  active = list(
    expectedCost = function() {
      if (is.null(self$p) || is.null(self$cost)) return(NULL)
      self$p * self$cost
    }
  )
)
The AddChild utility function in the subclass allows us to construct the tree just as before.
The expectedCost function is now an active binding of the class, so we can access it like a field, in a more R6-ish way.
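How a member defined in the active list behaves can be sketched with a small stand-alone class. Project is a hypothetical class that does not inherit from Node; it assumes only the R6 package.

```r
library(R6)

# Hypothetical stand-alone class, used only to illustrate active bindings
Project <- R6Class("Project",
  public = list(
    p = NULL,
    cost = NULL,
    initialize = function(p = NULL, cost = NULL) {
      self$p <- p
      self$cost <- cost
    }
  ),
  active = list(
    # evaluated on every access, so it always reflects the current fields
    expectedCost = function() {
      if (is.null(self$p) || is.null(self$cost)) return(NULL)
      self$p * self$cost
    }
  )
)

labs <- Project$new(p = 0.9, cost = 750000)
labs$expectedCost    # accessed like a field, without parentheses

labs$p <- 0.5
labs$expectedCost    # recomputed from the updated probability

Project$new()$expectedCost  # NULL while p or cost is missing
```

Since the value is derived on access, there is no stored copy that could become stale when p or cost changes.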