| Title: | Extensible, Parallelizable Implementation of the Random Forest Algorithm |
| Version: | 0.3-11 |
| Date: | 2025-2-2 |
| Maintainer: | Mark Seligman <mseligman@suiji.org> |
| BugReports: | https://github.com/suiji/Arborist/issues |
| Description: | Scalable implementation of classification and regression forests, as described by Breiman (2001), <doi:10.1023/A:1010933404324>. |
| URL: | https://github.com/suiji/Rborist.CRAN, https://github.com/suiji/Arborist |
| License: | MPL version 2.0 | GPL-2 | GPL-3 | file LICENSE [expanded from: MPL (≥ 2) | GPL (≥ 2) | file LICENSE] |
| LazyLoad: | yes |
| Depends: | R(≥ 3.3) |
| Imports: | Rcpp (≥ 0.12.2), data.table (≥ 1.9.8), digest |
| Suggests: | testthat, knitr, rmarkdown, markdown |
| VignetteBuilder: | knitr |
| LinkingTo: | Rcpp |
| NeedsCompilation: | yes |
| Packaged: | 2025-02-02 20:41:06 UTC; mseligman |
| Author: | Mark Seligman [aut, cre] |
| Repository: | CRAN |
| Date/Publication: | 2025-02-02 23:10:16 UTC |
Exportation Format for rfArb Training Output
Description
Formats training output into a form suitable for illustration of feature contributions.
Usage
## Default S3 method:
Export(arbOut)
Arguments
arbOut |
an object of type |
Value
An object of type Export.
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
data(iris)
rb <- Rborist(iris[,-5], iris[,5])
ffe <- Export(rb)
## End(Not run)
Rapid Decision Tree Construction and Evaluation
Description
Legacy entry for accelerated implementation of the
Random Forest (trademarked name) algorithm. Calls the suggested
entry, rfArb.
Usage
## Default S3 method:
Rborist(x,
y,
...)
Arguments
x |
the design matrix expressed as a |
y |
the response (outcome) vector, either numerical or
categorical. Row count must conform with |
... |
specific to |
Value
an object of class rfArb, as documented in command of the
same name.
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
# Regression example:
nRow <- 5000
x <- data.frame(replicate(6, rnorm(nRow)))
y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
# Classification example:
data(iris)
# Generic invocation:
rb <- Rborist(x, y)
## End(Not run)
NEWS Displayer for Rborist
Description
Displays NEWS associated with Rborist releases.
Usage
RboristNews()
Value
None.
Reducing Memory Footprint of Trained Decision Forest
Description
Clears fields deemed no longer useful.
Usage
## S3 method for class 'rfArb'
Streamline(arbOut)
Arguments
arbOut |
Trained forest object of class |
Value
an object of class rfArb with sample data cleared.
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
## Trains.
rs <- Rborist(x, y)
...
## Replaces trained object with streamlined copy.
rs <- Streamline(rs)
## End(Not run)
Expands forest values into front-end readable vectors.
Description
Formats training output into a form suitable for illustration of feature contributions.
Usage
## Default S3 method:
expandfe(arbOut)
Arguments
arbOut |
an object of type |
Value
An object of type ExpandReg or ExpandCtg containing
human-readable representations of the trained forest.
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
data(iris)
rb <- Rborist(iris[,-5], iris[,5])
ffe <- expandfe(rb)
# An rfTrain counterpart is NYI.
## End(Not run)
Meinshausen forest weights
Description
Normalized observation counts across a prediction set.
Usage
## Default S3 method:
forestWeight(objTrain, prediction, sampler=objTrain$sampler,
nThread=0, verbose = FALSE, ...)
Arguments
objTrain |
an object of class |
prediction |
an object of class |
sampler |
an object of class |
nThread |
specifies a preferred thread count. |
verbose |
whether to output progress of weighting. |
... |
not currently used. |
Value
a numeric matrix having rows equal to the Meinshausen weight of each new datum.
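One way to read these weights, following Meinshausen's definition: each tree assigns, to every training observation sharing the new datum's terminal node, the reciprocal of that node's population, and the forest weight averages these contributions over trees. The base-R sketch below illustrates that reading on made-up leaf assignments. It does not call Rborist, it ignores bagging multiplicities, and all names (trainLeaf, newLeaf, w) are hypothetical.

```r
# Toy illustration in base R; no Rborist calls. All names are hypothetical.
set.seed(1)
nTrain <- 8
nTree <- 5
y <- rnorm(nTrain)

# Leaf reached by each training observation (rows) in each tree (columns).
trainLeaf <- matrix(sample(1:3, nTrain * nTree, replace = TRUE), nTrain, nTree)
# Leaves reached by one new datum; copied from row 3 so that no leaf is empty.
newLeaf <- trainLeaf[3, ]

# Per tree: indicator of sharing the new datum's leaf, scaled by leaf population.
perTree <- sapply(seq_len(nTree), function(t) {
  inLeaf <- trainLeaf[, t] == newLeaf[t]
  inLeaf / sum(inLeaf)
})
w <- rowMeans(perTree)  # forest weight of each training observation

# Forest prediction as the average of the per-tree leaf means.
leafMean <- sapply(seq_len(nTree), function(t) {
  mean(y[trainLeaf[, t] == newLeaf[t]])
})
yPred <- mean(leafMean)

# The weights sum to one, and their inner product with y recovers the prediction.
print(c(weightSum = sum(w), weighted = sum(w * y), yPred = yPred))
```

Under these assumptions the weighted sum and the averaged leaf means agree up to floating-point error, which is the identity the example in this entry checks against a trained forest.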
Author(s)
Mark Seligman at Suiji.
References
Meinshausen, N. (2006) Quantile Regression Forests. Journal of Machine Learning Research 7, 983-999.
Examples
## Not run:
# Regression example:
nRow <- 5000
x <- data.frame(replicate(6, rnorm(nRow)))
y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
rb <- Rborist(x,y)
newdata <- data.frame(replicate(6, rnorm(nRow)))
# Performs separate prediction on new data, saving indices:
pred <- predict(rb, newdata, indexing=TRUE)
weights <- forestWeight(rb, pred)
obsIdx <- 215 # Arbitrary observation index (one-based row number)
# Inner product should equal prediction, modulo numerical vagaries:
yPredApprox <- weights[obsIdx,] %*% y
print((yPredApprox - pred$yPred[obsIdx])/yPredApprox)
## End(Not run)
predict method for arbTrain result
Description
Prediction and test using Rborist.
Usage
## S3 method for class 'arbTrain'
predict(object, newdata, sampler, yTest=NULL,
keyedFrame = FALSE, quantVec=numeric(0), quantiles = length(quantVec) > 0,
ctgCensus = "votes", indexing = FALSE, trapUnobserved = FALSE,
bagging = FALSE, nThread = 0, verbose = FALSE, ...)
Arguments
object |
an object of class |
newdata |
a design frame or matrix containing new data, with the same signature of predictors as in the training command. |
sampler |
an object of class |
yTest |
a response vector against which to test the new predictions. |
keyedFrame |
whether the columns of |
quantVec |
a vector of quantiles to predict. |
quantiles |
whether to predict quantiles. |
ctgCensus |
whether/how to summarize per-category predictions. "votes" specifies the number of trees predicting a given class. "prob" specifies a normalized, probabilistic summary. "probSample" specifies sample-weighted probabilities, similar to quantile histogramming. |
indexing |
whether to record the final node index, typically terminal, of tree traversal. |
trapUnobserved |
reports score for nonterminal upon encountering values not observed during training, such as missing data. |
bagging |
whether prediction is restricted to out-of-bag samples. |
nThread |
suggests an OpenMP-style thread count. Zero denotes default processor setting. |
verbose |
whether to output progress of prediction. |
... |
not currently used. |
Value
an object of one of two classes:
- SummaryReg summarizing regression, consisting of:
  - prediction: an object of class PredictReg consisting of:
    - yPred: the estimated numerical response.
    - qPred: quantiles of prediction, if requested.
    - qEst: quantile of the estimate, if quantiles requested.
    - indices: final index of prediction, if requested.
  - validation: if validation requested, an object of class ValidReg consisting of:
    - mse: the mean-squared error of the estimate.
    - rsq: the r-squared statistic of the estimate.
    - mae: the mean absolute error of the estimate.
  - importance: if permutation importance requested, an object of class importanceReg, containing multiple instances of:
    - names: the predictor names.
    - mse: the per-predictor mean-squared error, under permutation.
- SummaryCtg summarizing classification, consisting of:
  - PredictCtg consisting of:
    - yPred: estimated categorical response.
    - census: factor-valued matrix of the estimate, by category, if requested.
    - prob: matrix of estimate probabilities, by category, if requested.
    - indices: final index of prediction, if requested.
  - validation: if validation requested, an object of class ValidCtg consisting of:
    - confusion: the confusion matrix.
    - misprediction: the misprediction rate.
    - oobError: the out-of-bag error.
  - importance: if permutation importance requested, an object of class importanceCtg, consisting of:
    - mispred: the misprediction rate, by predictor.
    - oobErr: the out-of-bag error, by predictor.
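The regression validation statistics named above have conventional definitions, sketched here in base R on made-up data. The vectors y and yPred are hypothetical, and the package may differ in detail, for example in how it normalizes the r-squared statistic.

```r
# Base-R sketch with made-up data; y and yPred are hypothetical.
set.seed(2)
y <- rnorm(100)
yPred <- y + rnorm(100, sd = 0.3)

mse <- mean((y - yPred)^2)                             # mean-squared error
mae <- mean(abs(y - yPred))                            # mean absolute error
rsq <- 1 - sum((y - yPred)^2) / sum((y - mean(y))^2)   # r-squared
print(c(mse = mse, rsq = rsq, mae = mae))
```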
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
# Regression example:
nRow <- 5000
x <- data.frame(replicate(6, rnorm(nRow)))
y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
pf <- preformat(x)
sp <- presample(y)
rb <- rfTrain(pf, sp, y)
# Performs separate prediction on new data:
xx <- data.frame(replicate(6, rnorm(nRow)))
pred <- predict(rb, xx)
yPred <- pred$yPred
rb <- Rborist(x,y)
# Performs separate prediction on new data:
xx <- data.frame(replicate(6, rnorm(nRow)))
pred <- predict(rb, xx)
yPred <- pred$yPred
# As above, but also records final indices of each tree walk:
#
pred <- predict(rb, xx, indexing=TRUE)
print(pred$indices[c(1:2), ])
# As above, but predicts over newdata with unobserved values.
# In the case of numerical data, only missing values are considered
# unobserved. Missing values are encoded as NaN, which are
# incomparable, precipitating false on every test. Prediction
# therefore takes the false branch when encountering missing
# values:
#
xxMissing <- xx
xxMissing[c(15, 32, 87, 101), 6] <- NA
pred <- predict(rb, xxMissing)
# As above, but returns a nonterminal score upon encountering
# unobserved values. Neither the true nor the false branch from the
# testing node is taken. Instead, the score returned is derived
# from all leaf nodes (terminals) reached by the testing
# (nonterminal) node.
#
pred <- predict(rb, xxMissing, trapUnobserved = TRUE)
# Performs separate prediction, using original response as test
# vector:
pred <- predict(rb, xx, yTest = y)
mse <- pred$mse
rsq <- pred$rsq
# Performs separate prediction with (default) quantiles:
pred <- predict(rb, xx, quantiles = TRUE)
qPred <- pred$qPred
# Performs separate prediction with deciles:
pred <- predict(rb, xx, quantVec = seq(0.1, 1.0, by = 0.10))
qPred <- pred$qPred
# Classification examples:
data(iris)
rb <- Rborist(iris[-5], iris[5])
# Generic prediction using training set.
# Census as (default) votes:
pred <- predict(rb, iris[-5])
yPred <- pred$yPred
census <- pred$census
# Using the keyedFrame option allows the columns of
# newdata to appear in arbitrary order, so long as the
# columns present during training appear as a subset:
#
pred <- predict(rb, iris[c(2, 4, 3, 1)], keyedFrame=TRUE)
# As above, but validation census to report class probabilities:
pred <- predict(rb, iris[-5], ctgCensus="prob")
prob <- pred$prob
# As above, but with training response as test vector:
pred <- predict(rb, iris[-5], yTest = iris[5], ctgCensus = "prob")
prob <- pred$prob
conf <- pred$confusion
misPred <- pred$misPred
# As above, but predicts nonterminal when encountering categories
# not observed during training. That is, prediction returns a score
# derived from all terminal nodes (leaves) reached from the
# (nonterminal) testing node.
#
# In this case, "unobserved" refers to categories not present in
# the subpartition over which a splitting is performed. As training
# partitions the data into smaller and smaller regions, a given
# category becomes less likely to appear in a region.
#
# More generally, unobserved data can include missing predictors as
# well as categories appearing in newdata which were not
# present during training.
#
pred <- predict(rb, trapUnobserved=TRUE)
## End(Not run)
predict method for rfArb result
Description
Prediction and test using Rborist.
Usage
## S3 method for class 'rfArb'
predict(object, newdata, sampler, yTest=NULL,
keyedFrame = FALSE, quantVec=numeric(0), quantiles = length(quantVec) > 0,
ctgCensus = "votes", indexing = FALSE, trapUnobserved = FALSE,
bagging = FALSE, nThread = 0, verbose = FALSE, ...)
Arguments
object |
an object of class |
newdata |
a design frame or matrix containing new data, with the same signature of predictors as in the training command. |
sampler |
an object of class |
yTest |
a response vector against which to test the new predictions. |
keyedFrame |
whether the columns of |
quantVec |
a vector of quantiles to predict. |
quantiles |
whether to predict quantiles. |
ctgCensus |
whether/how to summarize per-category predictions. "votes" specifies the number of trees predicting a given class. "prob" specifies a normalized, probabilistic summary. "probSample" specifies sample-weighted probabilities, similar to quantile histogramming. |
indexing |
whether to record the final node index, typically terminal, of tree traversal. |
trapUnobserved |
reports score for nonterminal upon encountering values not observed during training, such as missing data. |
bagging |
whether prediction is restricted to out-of-bag samples. |
nThread |
suggests an OpenMP-style thread count. Zero denotes default processor setting. |
verbose |
whether to output progress of prediction. |
... |
not currently used. |
Value
an object of one of two classes:
- SummaryReg summarizing regression, consisting of:
  - prediction: an object of class PredictReg consisting of:
    - yPred: the estimated numerical response.
    - qPred: quantiles of prediction, if requested.
    - qEst: quantile of the estimate, if quantiles requested.
    - indices: final index of prediction, if requested.
  - validation: if validation requested, an object of class ValidReg consisting of:
    - mse: the mean-squared error of the estimate.
    - rsq: the r-squared statistic of the estimate.
    - mae: the mean absolute error of the estimate.
  - importance: if permutation importance requested, an object of class importanceReg, containing multiple instances of:
    - names: the predictor names.
    - mse: the per-predictor mean-squared error, under permutation.
- SummaryCtg summarizing classification, consisting of:
  - PredictCtg consisting of:
    - yPred: estimated categorical response.
    - census: factor-valued matrix of the estimate, by category, if requested.
    - prob: matrix of estimate probabilities, by category, if requested.
    - indices: final index of prediction, if requested.
  - validation: if validation requested, an object of class ValidCtg consisting of:
    - confusion: the confusion matrix.
    - misprediction: the misprediction rate.
    - oobError: the out-of-bag error.
  - importance: if permutation importance requested, an object of class importanceCtg, consisting of:
    - mispred: the misprediction rate, by predictor.
    - oobErr: the out-of-bag error, by predictor.
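For classification, the confusion matrix and misprediction rate can be illustrated with base R alone. The sketch below uses made-up factors yTest and yPred (hypothetical names) and the conventional definitions. Whether the package reports an overall rate, a per-category rate, or both is not stated here, so both are shown.

```r
# Base-R sketch; yTest and yPred are hypothetical factors.
yTest <- factor(c("a", "a", "b", "b", "c", "c"))
yPred <- factor(c("a", "b", "b", "b", "c", "a"), levels = levels(yTest))

# Rows index the observed category, columns the predicted category.
confusion <- table(observed = yTest, predicted = yPred)

# Overall misprediction rate: share of off-diagonal counts.
misprediction <- 1 - sum(diag(confusion)) / sum(confusion)

# Misprediction rate by observed category.
perClass <- 1 - diag(confusion) / rowSums(confusion)

print(confusion)
print(misprediction)
print(perClass)
```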
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
# Regression example:
nRow <- 5000
x <- data.frame(replicate(6, rnorm(nRow)))
y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
pf <- preformat(x)
sp <- presample(y)
rb <- rfArb(pf, sp, y)
# Performs separate prediction on new data:
xx <- data.frame(replicate(6, rnorm(nRow)))
pred <- predict(rb, xx)
yPred <- pred$yPred
rb <- Rborist(x,y)
# Performs separate prediction on new data:
xx <- data.frame(replicate(6, rnorm(nRow)))
pred <- predict(rb, xx)
yPred <- pred$yPred
# As above, but also records final indices of each tree walk:
#
pred <- predict(rb, xx, indexing=TRUE)
print(pred$indices[c(1:2), ])
# As above, but predicts over newdata with unobserved values.
# In the case of numerical data, only missing values are considered
# unobserved. Missing values are encoded as NaN, which are
# incomparable, precipitating false on every test. Prediction
# therefore takes the false branch when encountering missing
# values:
#
xxMissing <- xx
xxMissing[c(15, 32, 87, 101), 6] <- NA
pred <- predict(rb, xxMissing)
# As above, but returns a nonterminal score upon encountering
# unobserved values. Neither the true nor the false branch from the
# testing node is taken. Instead, the score returned is derived
# from all leaf nodes (terminals) reached by the testing
# (nonterminal) node.
#
pred <- predict(rb, xxMissing, trapUnobserved = TRUE)
# Performs separate prediction, using original response as test
# vector:
pred <- predict(rb, xx, yTest = y)
mse <- pred$mse
rsq <- pred$rsq
# Performs separate prediction with (default) quantiles:
pred <- predict(rb, xx, quantiles = TRUE)
qPred <- pred$qPred
# Performs separate prediction with deciles:
pred <- predict(rb, xx, quantVec = seq(0.1, 1.0, by = 0.10))
qPred <- pred$qPred
# Classification examples:
data(iris)
rb <- Rborist(iris[-5], iris[5])
# Generic prediction using training set.
# Census as (default) votes:
pred <- predict(rb, iris[-5])
yPred <- pred$yPred
census <- pred$census
# Using the keyedFrame option allows the columns of
# newdata to appear in arbitrary order, so long as the
# columns present during training appear as a subset:
#
pred <- predict(rb, iris[c(2, 4, 3, 1)], keyedFrame=TRUE)
# As above, but validation census to report class probabilities:
pred <- predict(rb, iris[-5], ctgCensus="prob")
prob <- pred$prob
# As above, but with training response as test vector:
pred <- predict(rb, iris[-5], yTest = iris[5], ctgCensus = "prob")
prob <- pred$prob
conf <- pred$confusion
misPred <- pred$misPred
# As above, but predicts nonterminal when encountering categories
# not observed during training. That is, prediction returns a score
# derived from all terminal nodes (leaves) reached from the
# (nonterminal) testing node.
#
# In this case, "unobserved" refers to categories not present in
# the subpartition over which a splitting is performed. As training
# partitions the data into smaller and smaller regions, a given
# category becomes less likely to appear in a region.
#
# More generally, unobserved data can include missing predictors as
# well as categories appearing in newdata which were not
# present during training.
#
pred <- predict(rb, trapUnobserved=TRUE)
## End(Not run)
Preformatting for Training with Warm Starts
Description
Presorts and formats training frame into a form suitable for
subsequent training by rfArb caller or rfTrain
command. Wraps this form to spare unnecessary recomputation when
iteratively retraining, for example, under parameter sweep.
Usage
## Default S3 method:
preformat(x,
nThread = 0,
verbose=FALSE,
...)
Arguments
x |
the design frame expressed as either a |
nThread |
number of cores to run in parallel, if available. |
verbose |
indicates whether to output progress of preformatting. |
... |
unused. |
Value
an object of class Deframe consisting of:
- rleFrame: run-length encoded representation of class RLEFrame consisting of:
  - rankedFrame: run-length encoded representation of class RankedFrame consisting of:
    - nRow: the number of observations encoded.
    - runVal: the run-length encoded values.
    - runRow: the corresponding row indices.
    - rleHeight: the number of encodings, per predictor.
    - topIdx: the accumulated end index, per predictor.
  - numRanked: packed representation of sorted numerical values of class NumRanked consisting of:
    - numVal: distinct numerical values.
    - numHeight: value offset per predictor.
  - facRanked: packed representation of sorted factor values of class FacRanked consisting of:
    - facVal: distinct factor values, zero-based.
    - facHeight: value offset per predictor.
- nRow: the number of training observations.
- signature: an object of type Signature consisting of:
  - predForm: predictor class names.
  - level: per-predictor levels, regardless whether realized.
  - factor: per-predictor realized levels.
  - colNames: predictor names.
  - rowNames: observation names.
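The idea of presorting a predictor and run-length encoding its values can be sketched with base R's order and rle. This is a conceptual illustration only: it does not reproduce the packed layout of the Deframe object, and the variable names are hypothetical.

```r
# Base-R sketch; not the package's internal layout. Names are hypothetical.
x <- c(3.2, 1.5, 3.2, 1.5, 1.5, 7.0)

ord <- order(x)        # row indices in sorted order (one-based)
enc <- rle(x[ord])     # runs of equal values within the sorted predictor

runVal <- enc$values   # distinct values in sorted order
runLen <- enc$lengths  # number of rows sharing each value

# Decoding the runs recovers the sorted predictor.
decoded <- inverse.rle(enc)
print(data.frame(runVal, runLen))
```

Predictors with many repeated values compress well under this scheme, which is presumably why the preformatted frame is worth caching across retraining runs.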
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
data(iris)
pt <- preformat(iris[,-5])
ppTry <- seq(0.2, 0.5, by= 0.3/10)
nIter <- length(ppTry)
rsq <- numeric(nIter)
for (i in 1:nIter) {
rb <- Rborist(pt, iris[,5], predProb=ppTry[i])
rsq[i] <- rb$validation$rsq
}
## End(Not run)
Forest-wide Observation Sampling
Description
Observations sampled for each tree to be trained. In the case of the Random Forest algorithm, this is the bag.
Usage
## Default S3 method:
presample(y,
samplingWeight = numeric(0),
nSamp = 0,
nRep = 500,
withRepl = TRUE,
nHoldout = 0,
nFold = 1,
verbose = FALSE,
nTree = 0,
...)
Arguments
y |
A vector to be sampled, typically the response. |
samplingWeight |
Per-observation sampling weights. Default is uniform. |
nSamp |
Size of sample draw. Default draws |
nRep |
Number of samples to draw. Replaces deprecated |
withRepl |
true iff sampling is with replacement. |
nHoldout |
Number of observations to omit from sampling. Augmented by unobserved response values. |
nFold |
Number of collections into which to partition the response. |
verbose |
true iff tracing execution. |
nTree |
Number of samples to draw. Deprecated. |
... |
not currently used. |
Value
an object of class Sampler consisting of:
- yTrain: the sampled vector.
- nSamp: the sample sizes drawn.
- nRep: the number of independent samples.
- nTree: synonymous with nRep. Deprecated.
- samples: a packed data structure encoding the observation index and corresponding sample count.
- hash: a hashed digest of the data items.
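A single bag, and the pairing of observation index with sample count, can be illustrated in base R. The sketch below draws one sample with replacement. It does not reproduce the packed encoding used by the Sampler object, and the names idx, sCount, bag and oob are hypothetical.

```r
# Base-R sketch of one bag; names are hypothetical.
set.seed(3)
nObs <- 10

idx <- sample(nObs, nObs, replace = TRUE)   # observation indices drawn
sCount <- tabulate(idx, nbins = nObs)       # times each observation was drawn

# Observations in the bag, paired with their sample counts.
bag <- data.frame(obs = which(sCount > 0), count = sCount[sCount > 0])

# Observations never drawn are out-of-bag for this tree.
oob <- which(sCount == 0)

print(bag)
print(oob)
```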
References
Tille, Yves. Sampling algorithms. Springer New York, 2006.
Examples
## Not run:
y <- runif(1000)
# Samples with replacement, 500 vectors of length 1000:
ps <- presample(y)
# Samples, as above, with 63 observations held out:
ps <- presample(y, nHoldout = 63)
# Samples without replacement, 250 vectors of length 500:
ps2 <- presample(y, nRep=250, nSamp=500, withRepl = FALSE)
## End(Not run)
Rapid Decision Tree Construction and Evaluation
Description
Accelerated implementation of the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R. Invocation is similar to that provided by the randomForest package.
Usage
## Default S3 method:
rfArb(x,
y,
autoCompress = 0.25,
ctgCensus = "votes",
classWeight = numeric(0),
discardState = FALSE,
impPermute = 0,
indexing = FALSE,
maxLeaf = 0,
minInfo = 0.01,
minNode = if (is.factor(y)) 2 else 3,
nHoldout = 0,
nLevel = 0,
nSamp = 0,
nThread = 0,
nTree = 500,
noValidate = FALSE,
predFixed = 0,
predProb = 0.0,
predWeight = numeric(0),
quantVec = numeric(0),
quantiles = length(quantVec) > 0,
regMono = numeric(0),
rowWeight = numeric(0),
samplingWeight = numeric(0),
splitQuant = numeric(0),
streamline = FALSE,
thinLeaves = streamline || (is.factor(y) && !indexing),
trapUnobserved = FALSE,
treeBlock = 1,
verbose = FALSE,
withRepl = TRUE,
...)
Arguments
x |
the design matrix expressed as a |
y |
the response (outcome) vector, either numerical or
categorical. Row count must conform with |
autoCompress |
plurality above which to compress predictor values. |
ctgCensus |
report categorical validation by vote or by probability. |
classWeight |
proportional weighting of classification categories. |
discardState |
minimizes storage by discarding primary training output. Useful for parameter sweeps and cross-validation, in which only validation may be of interest. |
impPermute |
number of importance permutations: 0 or 1. |
indexing |
whether to report final index, typically terminal, of validation tree traversal. |
maxLeaf |
maximum number of leaves in a tree. Zero denotes no limit. |
minInfo |
information ratio with parent below which node does not split. |
minNode |
minimum number of distinct row references to split a node. |
nHoldout |
number of observations to omit from sampling. Augmented by missing response values. |
nLevel |
maximum number of tree levels to train, including terminals (leaves). Zero denotes no limit. |
nSamp |
number of rows to sample, per tree. |
nThread |
suggests an OpenMP-style thread count. Zero denotes the default processor setting. |
nTree |
the number of trees to train. |
noValidate |
whether to train without validation. |
predFixed |
number of trial predictors for a split ( |
predProb |
probability of selecting individual predictor as trial splitter. |
predWeight |
relative weighting of individual predictors as trial splitters. |
quantVec |
quantile levels to validate. |
quantiles |
whether to report quantiles at validation. |
regMono |
signed probability constraint for monotonic regression. |
rowWeight |
row weighting for initial sampling of tree. Deprecated. |
samplingWeight |
row weighting for initial sampling of tree. |
splitQuant |
(sub)quantile at which to place cut point for numerical splits. |
streamline |
whether to streamline sampler contents to save space. |
thinLeaves |
bypasses creation of leaf state in order to reduce storage footprint. |
trapUnobserved |
reports score for nonterminal upon encountering values not observed during training, such as missing data. |
treeBlock |
maximum number of trees to train during a single level (e.g., coprocessor computing). |
verbose |
indicates whether to output progress of training. |
withRepl |
whether row sampling is with replacement. |
... |
not currently used. |
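The difference between predFixed and predProb can be illustrated in base R: the former fixes the number of trial predictors at a node, while the latter includes each predictor independently with a given probability, so the number of candidates varies. The sketch is a conceptual illustration under that reading; how the package draws candidates internally is not described here, and the names are hypothetical.

```r
# Base-R sketch of the two candidate-selection schemes; names are hypothetical.
set.seed(4)
nPred <- 6

# predFixed = 2: exactly two distinct predictors are tried at the node.
fixedPick <- sample(nPred, 2)

# predProb = 0.3: each predictor is tried independently with probability 0.3,
# so the number of candidates can be anywhere from 0 to nPred.
probPick <- which(runif(nPred) < 0.3)

print(fixedPick)
print(probPick)
```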
Value
an object of class rfArb, a supplementary collection
consisting of the following items:
- sampler: an object of class Sampler, as described in the documentation for the presample command, that summarizes the bagging structure.
- training: a list summarizing the training task, consisting of the following fields:
  - call: the calling invocation.
  - info: a vector of forest-wide Gini (classification) or weighted variance (regression), by predictor.
  - version: the version of the Rborist package used to train.
  - diag: diagnostics accumulated over the training task.
  - samplerHash: hash value of the Sampler object used to train. Recorded for consistency of subsequent commands.
- prediction: an object of class PredictReg or PredictCtg, as described by the documentation for command predict.
- validation: an object of class ValidReg or ValidCtg, as described by the documentation for command validate, if validation is requested.
- importance: an object of class ImportanceReg or ImportanceCtg, as described by the documentation for command predict, if permutation importance has been requested.
Author(s)
Mark Seligman at Suiji.
References
Breiman, L. (2001) Random Forests, Machine Learning 45(1), 5-32.
Examples
## Not run:
# Regression example:
nRow <- 5000
x <- data.frame(replicate(6, rnorm(nRow)))
y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
# Classification example:
data(iris)
# Generic invocation:
rb <- rfArb(x, y)
# Causes 300 trees to be trained:
rb <- rfArb(x, y, nTree = 300)
# Causes rows to be sampled without replacement:
rb <- rfArb(x, y, withRepl=FALSE)
# Causes validation census to report class probabilities:
rb <- rfArb(iris[-5], iris[5], ctgCensus="prob")
# Applies table-weighting to classification categories:
rb <- rfArb(iris[-5], iris[5], classWeight = "balance")
# Weights first category twice as heavily as remaining two:
rb <- rfArb(iris[-5], iris[5], classWeight = c(2.0, 1.0, 1.0))
# Does not split nodes when doing so yields less than a 2% gain in
# information over the parent node:
rb <- rfArb(x, y, minInfo=0.02)
# Does not split nodes representing fewer than 10 unique samples:
rb <- rfArb(x, y, minNode=10)
# Trains a maximum of 20 levels:
rb <- rfArb(x, y, nLevel = 20)
# Trains, but does not perform subsequent validation:
rb <- rfArb(x, y, noValidate=TRUE)
# Chooses 500 rows (with replacement) to root each tree.
rb <- rfArb(x, y, nSamp=500)
# Chooses 2 predictors as splitting candidates at each node (or
# fewer, when choices exhausted):
rb <- rfArb(x, y, predFixed = 2)
# Causes each predictor to be selected as a splitting candidate with
# distribution Bernoulli(0.3):
rb <- rfArb(x, y, predProb = 0.3)
# Causes first three predictors to be selected as splitting candidates
# twice as often as the other three:
rb <- rfArb(x, y, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))
# Causes (default) quantiles to be computed at validation:
rb <- rfArb(x, y, quantiles=TRUE)
qPred <- rb$validation$qPred
# Causes specified quantiles (deciles) to be computed at validation:
rb <- rfArb(x, y, quantVec = seq(0.1, 1.0, by = 0.10))
qPred <- rb$validation$qPred
# Constrains modelled response to be increasing with respect to X1
# and decreasing with respect to X5.
rb <- rfArb(x, y, regMono=c(1.0, 0, 0, 0, -1.0, 0))
# Causes rows to be sampled with random weighting:
rb <- rfArb(x, y, samplingWeight=runif(nRow))
# Suppresses creation of detailed leaf information needed for
# quantile prediction and external tools.
rb <- rfArb(x, y, thinLeaves = TRUE)
# Directs prediction to take a random branch on encountering
# values not observed during training, such as NA or an
# unrecognized category.
predict(rb, trapUnobserved = FALSE)
# Directs prediction to silently trap unobserved values, reporting a
# score associated with the current nonterminal tree node.
predict(rb, trapUnobserved = TRUE)
# Sets splitting position for the first predictor to far left and the
# second predictor to far right, others to default (median) position.
spq <- rep(0.5, ncol(x))
spq[1] <- 0.0
spq[2] <- 1.0
rb <- rfArb(x, y, splitQuant = spq)
## End(Not run)
Rapid Decision Tree Training
Description
Accelerated training using the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R.
Usage
## Default S3 method:
rfTrain(preFormat,
sampler,
y,
autoCompress = 0.25,
ctgCensus = "votes",
classWeight = numeric(0),
maxLeaf = 0,
minInfo = 0.01,
minNode = if (is.factor(y)) 2 else 3,
nLevel = 0,
nThread = 0,
predFixed = 0,
predProb = 0.0,
predWeight = numeric(0),
regMono = numeric(0),
splitQuant = numeric(0),
thinLeaves = FALSE,
treeBlock = 1,
verbose = FALSE,
...)
Arguments
y |
the response (outcome) vector, either numerical or categorical. |
preFormat |
Compressed, presorted representation of the predictor
values. Row count must conform with |
sampler |
Compressed representation of the sampled response. |
autoCompress |
plurality above which to compress predictor values. |
ctgCensus |
report categorical validation by vote or by probability. |
classWeight |
proportional weighting of classification categories. |
maxLeaf |
maximum number of leaves in a tree. Zero denotes no limit. |
minInfo |
information ratio with parent below which node does not split. |
minNode |
minimum number of distinct row references to split a node. |
nLevel |
maximum number of tree levels to train, including terminals (leaves). Zero denotes no limit. |
nThread |
suggests an |
predFixed |
number of trial predictors for a split ( |
predProb |
probability of selecting individual predictor as trial splitter. |
predWeight |
relative weighting of individual predictors as trial splitters. |
regMono |
signed probability constraint for monotonic regression. |
splitQuant |
(sub)quantile at which to place cut point for numerical splits. |
thinLeaves |
bypasses creation of leaf state in order to reduce memory footprint. |
treeBlock |
maximum number of trees to train during a single level (e.g., coprocessor computing). |
verbose |
indicates whether to output progress of training. |
... |
Not currently used. |
Value
an object of class arbTrain, containing:
- version: the version of the Rborist package used to train.
- samplerHash: hash value of the Sampler object used to train. Recorded for consistency of subsequent commands.
- predInfo: a vector of forest-wide Gini (classification) or weighted variance (regression), by predictor.
- forest: an object of class Forest containing:
  - nTree: the number of trees trained.
  - node: an object of class Node consisting of:
    - treeNode: forest-wide vector of packed node representations.
    - extent: per-tree node counts.
    - scores: numeric vector of scores, for all terminals and nonterminals.
    - factor: an object of class Factor consisting of:
      - facSplit: forest-wide vector of packed factor bits.
      - extent: per-tree extent of factor bits.
      - observed: forest-wide vector of observed factor bits.
  - Leaf: an object of class Leaf containing:
    - extent: forest-wide vector of leaf populations, i.e., counts of unique samples.
    - index: forest-wide vector of sample indices.
- diag: diagnostics accumulated over the training task.
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
# Regression example:
nRow <- 5000
x <- data.frame(replicate(6, rnorm(nRow)))
y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.
# Classification example:
data(iris)
# Generic invocation, on a preformatted frame and presampled response:
preFormat <- preformat(x)
sampler <- presample(y)
rt <- rfTrain(preFormat, sampler, y)
# Causes 300 trees to be trained, one per sample drawn:
rt <- rfTrain(preFormat, presample(y, nRep = 300), y)
# Causes validation census to report class probabilities:
pfIris <- preformat(iris[-5])
spIris <- presample(iris[, 5])
rt <- rfTrain(pfIris, spIris, iris[, 5], ctgCensus="prob")
# Applies table-weighting to classification categories:
rt <- rfTrain(pfIris, spIris, iris[, 5], classWeight = "balance")
# Weights first category twice as heavily as remaining two:
rt <- rfTrain(pfIris, spIris, iris[, 5], classWeight = c(2.0, 1.0, 1.0))
# Does not split nodes when doing so yields less than a 2% gain in
# information over the parent node:
rt <- rfTrain(preFormat, sampler, y, minInfo=0.02)
# Does not split nodes representing fewer than 10 unique samples:
rt <- rfTrain(preFormat, sampler, y, minNode=10)
# Trains a maximum of 20 levels:
rt <- rfTrain(preFormat, sampler, y, nLevel = 20)
# Trains, but does not perform subsequent validation:
rt <- rfTrain(preFormat, sampler, y, noValidate=TRUE)
# Chooses 500 rows (with replacement) to root each tree.
rt <- rfTrain(preFormat, presample(y, nSamp = 500), y)
# Chooses 2 predictors as splitting candidates at each node (or
# fewer, when choices exhausted):
rt <- rfTrain(preFormat, sampler, y, predFixed = 2)
# Causes each predictor to be selected as a splitting candidate with
# distribution Bernoulli(0.3):
rt <- rfTrain(preFormat, sampler, y, predProb = 0.3)
# Causes first three predictors to be selected as splitting candidates
# twice as often as the other three:
rt <- rfTrain(preFormat, sampler, y, predWeight=c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))
# Constrains modelled response to be increasing with respect to X1
# and decreasing with respect to X5.
rt <- rfTrain(preFormat, sampler, y, regMono=c(1.0, 0, 0, 0, -1.0, 0))
# Suppresses creation of detailed leaf information needed for
# quantile prediction and external tools.
rt <- rfTrain(preFormat, sampler, y, thinLeaves = TRUE)
# Sets splitting position for the first predictor to far left and the
# second predictor to far right, others to default (median) position.
spq <- rep(0.5, ncol(x))
spq[1] <- 0.0
spq[2] <- 1.0
rt <- rfTrain(preFormat, sampler, y, splitQuant = spq)
## End(Not run)
Separate Validation of Trained Decision Forest
Description
Permits trained decision forest to be validated separately from training.
Usage
## Default S3 method:
validate(train, preFormat, sampler = NULL, ctgCensus = "votes",
impPermute = 0, quantVec = numeric(0), quantiles = length(quantVec) > 0,
indexing = FALSE, trapUnobserved = FALSE, nThread = 0, verbose = FALSE, ...)
Arguments
train |
an object of class |
sampler |
summarizes the response and its per-tree sampling. |
preFormat |
internal representation of the design matrix, of
class |
ctgCensus |
report categorical validation by vote or by probability. |
impPermute |
specifies the number of importance permutations: 0 or 1. |
quantVec |
quantile levels to validate. |
quantiles |
whether to report quantiles at validation. |
indexing |
whether to report final index, typically terminal, of tree traversal. |
trapUnobserved |
indicates whether to return a nonterminal for values unobserved during training, such as missing data. |
nThread |
suggests an OpenMP-style thread count. Zero denotes the default processor setting. |
verbose |
indicates whether to output progress of validation. |
... |
not currently used. |
Value
either of two pairs of objects:
- SummaryReg summarizing regression, as documented with the command predict.arbTrain.
- validation: an object of class ValidReg consisting of:
  - mse: the mean-squared error of the estimate.
  - rsq: the r-squared statistic of the estimate.
  - mae: the mean absolute error of the estimate.
- SummaryCtg summarizing classification, as documented with the command predict.arbTrain.
- validation: an object of class ValidCtg consisting of:
  - confusion: the confusion matrix.
  - misprediction: the misprediction rate.
  - oobError: the out-of-bag error.
Author(s)
Mark Seligman at Suiji.
Examples
## Not run:
## Trains without validation.
rb <- Rborist(x, y, noValidate=TRUE)
...
## Delayed validation using a preformatted object.
pf <- preformat(x)
v <- validate(rb, pf)
## End(Not run)