    Ensemble, Bagging, and Random Forest
    Data/Machine learning · 2021. 1. 6. 11:46

    Bagging (Bootstrap aggregating) is one of the Ensemble approaches.

    Ensemble

    An ensemble of predictive models averages (or takes a majority vote of) the predictions from multiple models. A simple version of an ensemble works as follows:

    1. Develop a predictive model and record the predictions for a given dataset.
    2. Repeat for multiple models, on the same data.
    3. For each record to be predicted, take an average (or weighted average, or a majority vote) of the predictions.
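
    As a minimal sketch of these three steps, assuming scikit-learn models and a synthetic binary classification dataset, one could average the predicted class-1 probabilities of a few different models:

    ```python
    # Minimal ensemble sketch: fit several models on the same data and
    # average their predicted probabilities (illustrative, not tuned).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Steps 1-2: develop several predictive models on the same data.
    models = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(max_depth=5, random_state=0),
        KNeighborsClassifier(n_neighbors=15),
    ]
    for model in models:
        model.fit(X, y)

    # Step 3: for each record, average the predicted probabilities of class 1.
    probs = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    ensemble_pred = (probs >= 0.5).astype(int)  # or take a majority vote instead
    ```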

    Ensemble methods have been applied most systematically and effectively to decision trees.

    There are two main variants of ensemble models: bagging and boosting. In this post, we address bagging.

    Bagging

    Suppose we have a response $Y$ and $P$ predictor variables $X = X_1, X_2, \cdots, X_P$ with $N$ records. Bagging is like the basic algorithm for ensembles, except that, instead of fitting the various models to the same data, each new model is fit to a bootstrap resample. Here is the algorithm presented more formally:

    1. Initialize $M$, the number of models to be fit, and $n$, the number of records to choose $(n < N)$. Set the iteration $m = 1$.
    2. [important] Take a bootstrap resample (i.e., with replacement) of $n$ records from the training data to form a subsample $Y_m$ and $X_m$ (the bag).
    3. Train a model using $Y_m$ and $X_m$ to create a set of decision rules $\hat{f}_m(X)$.
    4. Increment the model counter $m = m + 1$. If $m \leq M$, go to step 2.

    In the case where $\hat{f}_m$ predicts the probability that $Y = 1$, the bagged estimate is given by:

    $$\hat{f} = \frac{1}{M}(\hat{f}_1(X) + \hat{f}_2(X) + \cdots + \hat{f}_M(X))$$
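
    A minimal sketch of this bagging procedure, assuming scikit-learn decision trees as the base learners and a synthetic binary target, might look like the following; the final line computes the bagged probability estimate $\hat{f}$ above:

    ```python
    # Bagging sketch: fit M trees, each on a bootstrap resample of n records,
    # then average the M probability estimates (illustrative, not tuned).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    N = len(y)      # number of training records
    M = 25          # step 1: number of models to fit
    n = 800         # step 1: bootstrap sample size, n < N
    rng = np.random.default_rng(0)

    models = []
    for m in range(M):
        # Step 2: bootstrap resample (with replacement) of n records -- the "bag".
        idx = rng.integers(0, N, size=n)
        X_m, y_m = X[idx], y[idx]
        # Step 3: train a model (here a decision tree) on the bag.
        models.append(DecisionTreeClassifier(random_state=m).fit(X_m, y_m))

    # Bagged estimate of P(Y = 1): the average of the M individual estimates.
    f_hat = np.mean([tree.predict_proba(X)[:, 1] for tree in models], axis=0)
    ```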

    Random Forest

    The random forest is based on applying bagging to decision trees, with one important extension: in addition to sampling the records, the algorithm also samples the variables. Compared to the basic tree algorithm, the random forest algorithm adds two more steps: the bagging discussed earlier, and the random sampling (without replacement) of predictor variables at each split:

    1. Take a bootstrap (with replacement) subsample from the records.
    2. For the first split, sample $p < P$ variables at random without replacement.
      - How many variables should we sample at each split? A rule of thumb is to choose $\sqrt{P}$.
    3. For each of the sampled variables $X_{j(1)}, X_{j(2)}, \cdots, X_{j(p)}$, apply the splitting algorithm:
      1. For each splitting value $s_{j(k)}$ of $X_{j(k)}$:
        1. Split the records in partition $A$ with $X_{j(k)} < s_{j(k)}$ as one partition, and the remaining records where $X_{j(k)} \geq s_{j(k)}$ as another partition.
        2. Measure the homogeneity of classes within each subpartition of $A$.
      2. Select the value of $s_{j(k)}$ that produces maximum within-partition homogeneity of class.
    4. Select the variable $X_{j(k)}$ and the split value $s_{j(k)}$ that produces maximum within-partition homogeneity of class.
    5. Proceed to the next split and repeat the previous steps, starting with step 2.
    6. Continue with additional splits following the same procedure until the tree is grown.
    7. Go back to step 1, take another bootstrap subsample, and start the process over again.
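
    In practice this procedure is available off the shelf; for example, scikit-learn's RandomForestClassifier combines the bootstrap sampling of records with random variable sampling at each split, and max_features="sqrt" corresponds to the $\sqrt{P}$ rule of thumb above. A minimal sketch on synthetic data:

    ```python
    # Random forest sketch with scikit-learn (illustrative settings).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=200,      # number of bagged trees (M)
        max_features="sqrt",   # sample sqrt(P) variables at each split
        bootstrap=True,        # bootstrap resample of records for each tree
        random_state=0,
    )
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))  # accuracy on the held-out split
    ```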

     
