One of the important considerations when training a machine learning model is "are we using appropriate complexity in our representation of the data." There is often excitement exhibited at using the latest and greatest methods without proper consideration of how more simplistic algorithms hold up. In general, there are advantages to using more simplistic models over complex models such as interpretability, generalization, etc.
Within machine learning, there are two broad problems that we try to address, those being classification and regression. A classification problem will look at identifying categorical trends in data. This could be a binary classification problem, resulting in a simple yes/no (1/0) prediction. Or a multiclass problem, where the outcome could be based on an arbitrary number of classes. Regression problems tackle predictions on a continuous output.
Binary Classification can be SPAM or HAM
In this post, we will consider binary classification. A popular example of binary classification is deciding if an email is identified as spam.
There are many classifiers that could be used in a binary classification problem. These range from relatively simple models such as a decision tree, to more complex ones such as a deep artificial neural network classifier. The best model will need to be established through a robust assessment trying out a number of different methods/setups/parameters etc.
However, it is always beneficial to start with something that is more simplistic and interpretable before progressing to a more complicated opaque methodology. By starting with simple algorithms it is possible to gain a much better understanding of the underlying data and the impact it may have on the accuracy of any proposed method.
In a classification tree, there are internal nodes representing features, the tree splits into branches based on these nodes and at the end of the branch is the decision (or leaf). A greedy algorithm is employed to locate the desired splits, forming the branch structure. This is achieved using recursive binary splitting where different split points are attempted prior to applying a cost function to the model accuracy. The split with the lowest cost is then selected before the procedure is repeated.
A popular cost function is the Gini impurity, giving a measure of how often a randomly chosen element from the group would be incorrectly labeled. The probability is related to class purity within a group created by the split. Where there is a single class represented within the group, perfect class purity has been achieved as opposed to a 50:50 split. When high purity is achieved, the chances of incorrectly labeling an element are minimized.
To help build accurate models, it is important to avoid overfitting. This can partly be achieved by terminating the recursive splitting procedure. This is usually based on some metric such as the minimum number of elements in a split, the minim number of elements at a leaf or a maximum depth of the tree. When branches utilize features of low importance it can be beneficial to prune those branches, simplifying the tree and reducing the potential to overfit.
Ensemble learning methods can enhance the underlying algorithms. An example of this is a random forest, which constructs many decision trees when training and outputs the class that is the mode (most frequent) of the classes of the individual trees. Whilst deep decision trees tend to overfit, random forests provide a way of averaging multiple deep trees, trained on different parts of the same training set, thus reducing the potential for overfitting.
Bootstrap aggregating is used to select random samples with replacement upon which to train the ensemble of decision trees. When training the decision trees, instead of using all features, a random subset of the features is also selected when training the algorithm. Further randomization can be employed, giving rise to extremely randomized trees. This is achieved by randomizing the splitting, instead of using the e.g. Gini impurity for fitting.
Decision trees provide an intuitive and interpretable method upon which it is possible to progress to ensemble methods that are likely to provide greater model precision. Whilst this comes at the expense of some interpretability, the robustness, and ease of training has meant random forests have become popular. This is because they can produce good results in many applications and can be a precursor to training more complex models with potentially greater accuracy.
RecruitSumo Inc, sharing our passion for machine learning and artificial intelligence. We specialize in predictive analytics for human capital adding value by helping build the right organization, culture, team, and talent to succeed.