# Random Forest: intro and an example

* All samples will start from the root.
* At each node, one feature will split the samples.

If you check tools like Weka, ID3 (Iterative Dichotomiser 3) algo is useful, it builds a decision tree from the top down repeatedly. (By Ross Quinlan). It is a predictive classification algorithm which is IG-based.

Key Steps:

Inputs: Sample set $S$, Attributes set $A$.
Output: ID3 Decision Tree

1. If all attributes are processed, return; unless go to step 2.
2. Find out the attribute $a$ which has the max IG, set it as a node.
If $a$ can classify the samples, return; unless go to step 3.
3. For all possible values of $a$ , do :
1) Subset ${ S }_{ v }$ is a collection of the samples that meet $a=v$.
2) Generate attribute set: ${ A }_{ T }=A-\{ a\}$.
3) Set ${ S }_{ v }$ as sample set and ${ A }_{ T }$ as new inputs, iteratively ran ID3.

Statistical Methods for Prediction and Understanding.
If we build more than one decision trees using different subsets of the data, then they can vote on a new input, which will raise the accuracy (better than only on dtree).
Generally, the dtrees in the forest are independent. When a new sample is given as an input, we let every tree give a result (classification). Then based on the numbers, choose a max value of that class as the final result.
For each tree, the training set is a subset with replacements. Which means, in one tree, a record will appear more than one times in a tree, or a record never appears in one tree. When training a node, the attributes are randomly selected without replacements following a porportion, might be $\sqrt { M }$, $\sqrt{ M }$ or $\sqrt[1/2] { M }$.

•High training speed.
•Performs well on high features (no feature selection).
•Gives conclusions on features (shows the importance).
•Easy for implementation.
•Easy to be transferred to parallel process.

Algo:

Inputs:
• Training set: N examples represented by M features.
• Ensemble size T (i.e. number of trees).
• Reduced dimension size m ≪ M (i.e. number of features used to calculate the best split at each node).

Steps:

for i = 1 to T:
• Selecting N samples randomly as training set with replacement.
• Train a decision tree. For each node, randomly choose m features (not all of them) and calculate the best split (IG value).
• Use majority voting to make predictions based on all T trees.

To be continued…