Title: | Builds Trees by Sampling Variables in Groups |
---|---|
Description: | Random Forest-like tree ensemble that works with groups of predictor variables. When building a tree, a number of variables is taken randomly from each group separately, thus ensuring that it considers variables from each group for the splits. Useful when rows contain information about different things (e.g. user information and product information) and it's not sensible to make a prediction with information from only one group of variables, or when there are far more variables from one group than the other and it's desired to have groups appear evenly on trees. Trees are grown using the C5.0 algorithm rather than the usual CART algorithm. Supports parallelization (multithreaded), missing values in predictors, and categorical variables (without doing One-Hot encoding in the processing). Can also be used to create a regular (non-stratified) Random Forest-like model, but made up of C5.0 trees and with some additional control options. As it's built with C5.0 trees, it works only for classification (not for regression). |
Authors: | David Cortes <[email protected]> |
Maintainer: | David Cortes <[email protected]> |
License: | GPL-3 |
Version: | 0.2.2 |
Built: | 2024-11-16 04:19:41 UTC |
Source: | https://github.com/david-cortes/stratifiedrf |
Make predictions from a stratified_rf model on new data.
## S3 method for class 'stratified_rf'
predict(object, data, type = "class", agg_type = "prob", vote_type = "simple", na.action = na.pass, threshold = NULL, ...)
object | A stratified_rf model. |
data | New data on which to make predictions (data.frame only). Must have the same names as the data used to build the model. |
type | Prediction type. Either "class" to get the predicted class or "prob" to get the voting scores for each class. |
agg_type | How to combine the predictions from individual trees. Either "prob" to average the probabilities output from each tree or "class" to count the final predictions from each. |
vote_type | How to weight the outputs from each tree. Either "simple" to average them, or "weighted" for a weighted average according to their OOB classification accuracy. |
na.action | Function indicating how to handle missing values (see the 'C50' documentation for details). |
threshold | Count only votes from trees whose out-of-bag classification accuracy is above this threshold. Must be a number between 0 and 1. |
... | other options (not currently used) |
Note that with the default agg_type = "prob", predictions are made by averaging the class probabilities output by each tree, rather than by the majority vote used in the original Random Forest algorithm.
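To illustrate how agg_type, vote_type and threshold could interact, here is a minimal base-R sketch. It is not the package's internal code: the per-tree probabilities, the OOB accuracies and the helper aggregate_votes are all hypothetical and exist only for this example.

```r
# Hypothetical per-tree class probabilities for ONE observation
# (rows = trees, columns = classes); all values are made up.
tree_probs <- rbind(
  c(a = 0.9, b = 0.1),
  c(a = 0.4, b = 0.6),
  c(a = 0.4, b = 0.6)
)
# Hypothetical out-of-bag accuracy of each tree.
oob_acc <- c(0.9, 0.6, 0.5)

# Hypothetical helper (not part of the package).
aggregate_votes <- function(probs, acc, agg_type = "prob",
                            vote_type = "simple", threshold = NULL) {
  # Optionally keep only trees whose OOB accuracy is above the threshold.
  keep <- if (is.null(threshold)) rep(TRUE, nrow(probs)) else acc > threshold
  probs <- probs[keep, , drop = FALSE]
  acc <- acc[keep]
  if (agg_type == "class") {
    # Turn each tree's probabilities into a hard 0/1 vote for its top class.
    top <- max.col(probs, ties.method = "first")
    votes <- matrix(0, nrow(probs), ncol(probs), dimnames = dimnames(probs))
    votes[cbind(seq_len(nrow(probs)), top)] <- 1
    probs <- votes
  }
  # Equal weights, or weights proportional to each tree's OOB accuracy.
  w <- if (vote_type == "weighted") acc else rep(1, nrow(probs))
  colSums(probs * w) / sum(w)
}

by_prob   <- aggregate_votes(tree_probs, oob_acc)                       # average probabilities
by_class  <- aggregate_votes(tree_probs, oob_acc, agg_type = "class")   # count hard votes
by_weight <- aggregate_votes(tree_probs, oob_acc, vote_type = "weighted")
```

With these made-up numbers, averaging probabilities favors class "a" while counting hard votes favors class "b", which is why the choice of agg_type can change the predicted class.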
'C50' library: https://cran.r-project.org/package=C50
data(iris)
groups <- list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width"))
mtry <- c(1, 1)
m <- stratified_rf(iris, "Species", groups, mtry, ntrees = 2, multicore = FALSE)
predict(m, iris)
Prints summary statistics from a stratified_rf model.
## S3 method for class 'stratified_rf'
print(x, ...)
x | A stratified_rf model. |
... | other options (not currently used) |
data(iris)
groups <- list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width"))
mtry <- c(1, 1)
m <- stratified_rf(iris, "Species", groups, mtry, ntrees = 2, multicore = FALSE)
print(m)
Random Forest that works with groups of predictor variables. When building a tree, a number of variables is taken from each group separately. Useful when rows contain information about different things (e.g. user information and product information) and it's not sensible to make a prediction with information from only one group of variables, or when there are far more variables from one group than the other and it's desired to have groups appear evenly on trees.
stratified_rf(df, targetvar, groups, mtry = "auto", ntrees = 500, multicore = TRUE, class_quotas = NULL, sample_weights = NULL, fulldepth = TRUE, replacement = TRUE, c50_control = NULL, na.action = na.pass, drop_threshold = NULL)
df | Data to build the model (data.frame only). |
targetvar | String indicating the name of the target or outcome variable in the data. Character types will be coerced to factors. |
groups | Unnamed list, containing at each entry a group of variables (as a string vector with their names). |
mtry | A numeric vector indicating how many variables to take from each group when building each tree. If set to "auto" then, for each group, mtry = round(sqrt(m_total) * len(m_group) / len(m_total)) (with a minimum of 1 for each group). |
ntrees | Number of trees to grow. When setting multicore=TRUE, the number of trees should be a multiple of the number of cores, otherwise it will get rounded downwards to the nearest multiple. |
multicore | Whether to use multiple CPU cores to parallelize the construction of trees. Parallelization is done with the 'parallel' library's default settings. |
class_quotas | How many rows from each class to use in each tree (useful when there is a class imbalance). Must be a numeric vector or a named list with the number of desired rows to sample for each level of the target variable. Ignored when sample_weights is passed. Note that using more rows than the data originally had might result in incorrect out-of-bag error estimates. |
sample_weights | Probability of sampling each row when building a tree. Must be a numeric vector. If not defined, then all rows have the same probability. Note that, depending on the structure of the data, setting this might result in incorrect out-of-bag error estimates. |
fulldepth | Whether to grow the trees to full depth. Ignored when passing c50_control. |
replacement | Whether to sample rows with replacement. |
c50_control | Custom parameters for growing trees. Must be a C5.0Control object compatible with the 'C50' package. |
na.action | A function indicating how to handle NAs. Default is to include missing values when building a tree (see 'C50' documentation). |
drop_threshold | Drop a tree whenever its resulting out-of-bag classification accuracy falls below a certain threshold specified here. Must be a number between 0 and 1. |
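As a rough picture of what class_quotas asks for, the following base-R sketch draws a fixed number of rows per class for one tree and derives the out-of-bag rows from that draw. It only illustrates the idea and is not taken from the package: the toy target, the quota values and the helper sample_rows_by_class are hypothetical.

```r
set.seed(1)
# Imbalanced toy target: 90 "neg" rows and 10 "pos" rows (made up).
y <- factor(c(rep("neg", 90), rep("pos", 10)))
# Hypothetical quotas: how many rows to draw from each class per tree.
class_quotas <- c(neg = 20, pos = 10)

# Hypothetical helper (not part of the package).
sample_rows_by_class <- function(y, quotas, replacement = TRUE) {
  idx <- lapply(names(quotas), function(cl) {
    rows <- which(y == cl)
    # Index into `rows` to avoid sample()'s special case for length-1 input.
    rows[sample.int(length(rows), quotas[[cl]], replace = replacement)]
  })
  unlist(idx)
}

in_bag <- sample_rows_by_class(y, class_quotas)
table(y[in_bag])                       # 20 "neg" rows and 10 "pos" rows
oob <- setdiff(seq_along(y), in_bag)   # rows never drawn for this tree
```

If a quota exceeds the number of rows available in a class, every row of that class is likely to end up in the bag, leaving few or no out-of-bag rows for it. This is one way the out-of-bag estimates mentioned above can become unreliable.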
Note that while this algorithm forces each tree to consider possible splits with variables from all groups, it doesn't guarantee that they will end up having splits with variables from different groups.
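The per-group sampling of candidate variables can be pictured with a short base-R sketch. This is an assumption about the general idea, not the package's actual code, and the helper sample_group_vars is hypothetical.

```r
set.seed(42)
groups <- list(c("Sepal.Length", "Sepal.Width"),
               c("Petal.Length", "Petal.Width"))
mtry <- c(1, 1)

# Hypothetical helper: draw mtry[i] variable names from group i,
# independently for each group and without replacement.
sample_group_vars <- function(groups, mtry) {
  unlist(Map(function(vars, k) vars[sample.int(length(vars), k)],
             groups, mtry))
}

# Candidate variables offered to a single tree: one sepal, one petal.
vars_for_tree <- sample_group_vars(groups, mtry)
```

Each tree is offered at least one variable from every group, but the tree-growing step remains free to ignore some of them when choosing splits.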
The original Random Forest algorithm recommends sampling a total of sqrt(n_features) variables, but this might not work well when the groups of variables are of unequal size.
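The "auto" rule given for mtry can be worked through in base R. The sketch below assumes that m_total stands for the total number of predictor variables across all groups and that len(m_group) is the number of variables in a group. That reading, the helper auto_mtry and the group names are assumptions made for illustration.

```r
# Hypothetical helper implementing the "auto" rule under the stated assumptions.
auto_mtry <- function(groups) {
  group_sizes <- lengths(groups)   # variables per group
  m_total <- sum(group_sizes)      # variables overall
  # Share of sqrt(m_total) proportional to group size, at least 1 per group.
  pmax(1, round(sqrt(m_total) * group_sizes / m_total))
}

# Two very unequal groups: 3 user variables and 97 product variables.
groups <- list(paste0("user_", 1:3), paste0("product_", 1:97))
auto_mtry(groups)   # 1 and 10
```

Here the small group would receive 0 variables under a purely proportional split (0.3 rounds to 0), so the minimum of 1 is what keeps it represented in every tree.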
Everything outside the tree-building is implemented in native R code, so it might be slow. Trees are grown using the C5.0 algorithm from the 'C50' library, so the model can be used for classification only (not for regression). Refer to the 'C50' library for documentation on the tree-building algorithm.
'C50' library: https://cran.r-project.org/package=C50
data(iris)
groups <- list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width"))
mtry <- c(1, 1)
m <- stratified_rf(iris, "Species", groups, mtry, ntrees = 2, multicore = FALSE)
summary(m)
Calculates error statistics for out-of-bag samples from a stratified_rf model.
## S3 method for class 'stratified_rf'
summary(object, ...)
object | A stratified_rf model. |
... | other options (not currently used) |
Predictions for a class are made by averaging class probabilities across trees rather than by a majority vote. All trees are weighted equally.
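The following base-R sketch shows one way an out-of-bag accuracy could be computed when class probabilities are averaged with equal weights. It is only an illustration under assumed inputs: the in-bag indicator matrix, the per-tree probabilities and all variable names are hypothetical, and the package may compute its statistics differently.

```r
# Toy setup: 4 rows, 3 trees, 2 classes ("a" and "b"); all numbers are made up.
y <- factor(c("a", "b", "a", "b"))
# in_bag[i, t] is TRUE when row i was sampled to build tree t.
in_bag <- cbind(c(TRUE, TRUE, FALSE, FALSE),
                c(FALSE, TRUE, TRUE, FALSE),
                c(TRUE, FALSE, FALSE, TRUE))
# prob_a[i, t]: probability of class "a" for row i according to tree t.
prob_a <- cbind(c(0.9, 0.2, 0.7, 0.4),
                c(0.8, 0.1, 0.9, 0.7),
                c(0.7, 0.3, 0.6, 0.1))

# Average each row's probability over the trees that did NOT see it.
# (A row that is in-bag for every tree would yield NaN; none is here.)
prob_a_oob <- prob_a
prob_a_oob[in_bag] <- NA
oob_prob_a <- rowMeans(prob_a_oob, na.rm = TRUE)

# Predict the class with the higher averaged probability and score it.
oob_pred <- factor(ifelse(oob_prob_a >= 0.5, "a", "b"), levels = levels(y))
oob_accuracy <- mean(oob_pred == y)   # 0.75 with these numbers
```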
data(iris)
groups <- list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width"))
mtry <- c(1, 1)
m <- stratified_rf(iris, "Species", groups, mtry, ntrees = 2, multicore = FALSE)
summary(m)
Heuristic on variable importance, taken as averages from the variable importances calculated for each tree.
varimp_stratified_rf(model, metric = "usage", agg_type = "simple")
model | A stratified_rf model. |
metric | How to calculate the variable importance from each tree. Either "usage" or "splits". |
agg_type | How to aggregate the variable importances obtained from each tree. Either "simple" for a simple average, or "weighted" for an average weighted by each tree's accuracy. |
Methods are taken directly from the C5.0 trees. Currently doesn't support permutation tests.
A named data frame with the importance score of each variable, sorted from largest to smallest.
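The two agg_type options can be compared on made-up numbers. The sketch below is not the package's code: the per-tree importances, the accuracies, the helper aggregate_importance and the column name importance are hypothetical.

```r
# Hypothetical per-tree importances (rows = trees, columns = variables).
tree_imp <- rbind(
  c(Petal.Length = 100, Petal.Width = 60, Sepal.Length = 0),
  c(Petal.Length = 80,  Petal.Width = 0,  Sepal.Length = 40)
)
tree_acc <- c(0.9, 0.6)   # hypothetical accuracy of each tree

# Hypothetical helper (not part of the package).
aggregate_importance <- function(imp, acc, agg_type = "simple") {
  # Equal weights, or weights proportional to each tree's accuracy.
  w <- if (agg_type == "weighted") acc else rep(1, nrow(imp))
  scores <- sort(colSums(imp * w) / sum(w), decreasing = TRUE)
  # One row per variable, sorted from largest to smallest score.
  data.frame(importance = scores, row.names = names(scores))
}

simple_imp   <- aggregate_importance(tree_imp, tree_acc)
weighted_imp <- aggregate_importance(tree_imp, tree_acc, agg_type = "weighted")
```

With these numbers the simple average gives 90, 30 and 20, while the weighted average gives 92, 36 and 16, because the more accurate first tree counts for more.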
data(iris)
groups <- list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width"))
mtry <- c(1, 1)
m <- stratified_rf(iris, "Species", groups, mtry, ntrees = 2, multicore = FALSE)
varimp_stratified_rf(m)