
Random Forest (随机森林)

This article is about a machine learning technique. For other kinds of random tree, see random tree.
Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman[1] and Adele Cutler, and "Random Forests" is their trademark. The term comes from "random decision forests", first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features, introduced independently by Ho[2][3] and by Amit and Geman[4], in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
Random forest is a data mining algorithm, used especially when the data contain a relatively large number of classes.

Contents

1 Learning algorithm
2 Advantages
3 Disadvantages
4 See also
5 References
6 Open source implementations
7 External links


Learning algorithm


Each tree is constructed using the following algorithm:

1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. The number m of input variables used to determine the decision at a node of the tree is given; m should be much less than M.
3. Choose the training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node, and calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
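
A minimal sketch of this procedure in Python follows. It is an illustration under assumptions rather than a reference implementation: it borrows scikit-learn's DecisionTreeClassifier, whose max_features parameter reproduces the per-node random choice of m variables, and the helper names grow_forest and predict_forest are invented for this example.

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def grow_forest(X, y, n_trees=100, m=None, seed=0):
        """Grow n_trees unpruned trees, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        N, M = X.shape                      # N training cases, M variables
        m = m or max(1, int(np.sqrt(M)))    # a common default choice; m << M
        forest = []
        for _ in range(n_trees):
            # Step 3: sample N cases with replacement (bootstrap sample).
            idx = rng.integers(0, N, size=N)
            # Step 4: max_features=m considers m randomly chosen variables
            # at each node; default settings grow the tree fully, unpruned.
            tree = DecisionTreeClassifier(max_features=m,
                                          random_state=int(rng.integers(1 << 30)))
            tree.fit(X[idx], y[idx])
            forest.append(tree)
        return forest

    def predict_forest(forest, X):
        """Output the mode of the classes output by the individual trees."""
        votes = np.stack([tree.predict(X) for tree in forest])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

The cases left out of each bootstrap sample (the "out-of-bag" cases) could likewise be used to estimate each tree's error, as step 3 above suggests.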

Advantages


The advantages of random forest are:

For many data sets, it produces a highly accurate classifier
It handles a very large number of input variables
It estimates the importance of variables in determining classification
It generates an internal unbiased estimate of the generalization error as the forest building progresses (see the sketch after this list)
It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing
It provides an experimental way to detect variable interactions
It can balance error rates in data sets where class populations are unbalanced
It computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data
Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views
Learning is fast 
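
Two of these properties, the internal generalization-error estimate and the variable-importance measure, are exposed directly by common open-source implementations. The sketch below uses scikit-learn's RandomForestClassifier; the dataset is an arbitrary choice for illustration.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    X, y = data.data, data.target

    # oob_score=True evaluates each tree on the training cases left out
    # of its bootstrap sample, giving an internal estimate of the
    # generalization error as the forest is built.
    clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    clf.fit(X, y)
    print("Out-of-bag accuracy estimate:", clf.oob_score_)

    # feature_importances_ ranks the input variables by their
    # contribution to the forest's splits (mean decrease in impurity).
    top = clf.feature_importances_.argsort()[::-1][:5]
    for i in top:
        print(data.feature_names[i], round(clf.feature_importances_[i], 3))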

Disadvantages


Random forests are prone to overfitting on some datasets. This is even more pronounced in noisy classification/regression tasks.[6]
Random forests do not handle large numbers of irrelevant features as well as ensembles of entropy-reducing decision trees.[7]
It is more efficient to select a random decision boundary than an entropy-reducing one, which makes larger ensembles more feasible. Although this may seem an advantage at first, it shifts computation from training time to evaluation time, which is actually a disadvantage for most applications.
