Picking an algorithm
Once we have established whether we are predicting a Number or a Class, and told MagicSheets where our data is, we need to select the algorithm for the task.
Picking the right algorithm for your data and the task at hand is as much art as it is science.
Fortunately, we can follow some simple rules of thumb to select the algorithm from the available options.
Rules of thumb for picking an algorithm
Important note: Applying the “rules of thumb” below does not guarantee that your model will be “better” or “more correct”. It takes experience to understand the pros and cons of different algorithms, but for many applications any of the available methods might give reliable results.
1. Just go with the default
For many applications the default regression (Linear Regression) and classification (Logistic Classification) methods will be good enough.
2. Look at the score
You can try different algorithms and pick the one that gives the best score. (You can read more about scores here.)
3. Use the tips in the algorithm intros below
In the brief intro to each algorithm below you will find some simple tips for when to pick each algorithm.
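Rule 2 ("look at the score") can be sketched in a few lines of Python. This is only an illustration, not how MagicSheets computes its score: it assumes the score is R² (the coefficient of determination), and the names `model_a`, `model_b`, `model_c` and all numbers are made up for the example.

```python
def r_squared(actual, predicted):
    """Assumed score: R^2, where 1.0 means a perfect fit."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up known values and made-up predictions from three hypothetical models.
actual = [10.0, 12.0, 15.0, 19.0, 24.0]
predictions = {
    "model_a": [9.0, 12.5, 16.0, 19.5, 23.0],
    "model_b": [10.2, 11.9, 15.1, 18.8, 24.1],
    "model_c": [11.0, 11.0, 15.0, 21.5, 21.5],
}

# Score every candidate and keep the one with the highest score.
scores = {name: r_squared(actual, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A higher score on the data you already have is a useful hint, but as the note above says, it does not guarantee a better model.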
Regression algorithms (predicting a number)
Linear Regression
In a nutshell
The Linear Regression model assumes a simple linear relationship between your features and values. For example, if you only have 1 feature, your model might look like
\[\text{value} = \text{feature} \times \text{some number} + \text{some other number}\]
When the feature goes up, the value goes up (or down, if "some number" is negative) at a steady rate.
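To make the formula concrete, here is a minimal sketch of fitting "some number" (the slope) and "some other number" (the intercept) for a single feature using ordinary least squares. The function name `fit_line` and the data are assumptions for illustration; MagicSheets does this fitting for you, and its internals may differ.

```python
def fit_line(features, values):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, values))
    sxx = sum((x - mean_x) ** 2 for x in features)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follows value = feature * 2 + 3 exactly.
features = [1.0, 2.0, 3.0, 4.0]
values = [5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(features, values)
prediction = slope * 5.0 + intercept  # predict the value for a new feature of 5
print(slope, intercept, prediction)
```

With real data the points will not sit exactly on a line, and the fitted slope and intercept describe the straight line that comes closest to them.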
When to use
- When you have good reasons to believe the relationship between features and the value (label) is linear (when a feature goes up, the value goes up or down at a steady rate).
- When your dataset is small (~20-50 points).
- When you are unsure which model is best and are interested in a “quick and dirty” prediction to continue your work.
When not to use
- When you believe some features in the model might be impacting the value “much, much more” than others.
- When you are unsure which model to use, but need a solid, reliable forecast.
SVM Regression
In a nutshell
SVM (Support Vector Machines) regression works somewhat like Linear Regression, but it is much more powerful, as it allows more complicated relationships between features and values, for example:
\[\text{value} = \text{feature}^3 \times \text{some number} + \text{some other number}\]
As such, it is much more flexible and can learn more complicated patterns in your data.
When to use
- When you are unsure which algorithm to use and want a more reliable prediction.
- When you have “enough” data (50+ points).
When not to use
- When your dataset is very small (20-50 points).
Random Forest Regression
In a nutshell
Random Forest creates lots of “baby models” called trees, each of which predicts a value for the data. The prediction given by the model is then the average value of predictions created by all trees in the forest.
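The averaging step can be sketched as follows. The three "trees" here are hypothetical hand-written rules standing in for trees that a real Random Forest would learn from your data; only the final averaging reflects the description above.

```python
from statistics import mean

# Hypothetical stand-ins for learned trees: each maps a feature to a number.
def tree_a(x):
    return 10.0 if x < 5 else 20.0

def tree_b(x):
    return 12.0 if x < 4 else 22.0

def tree_c(x):
    return 8.0 if x < 6 else 18.0

forest = [tree_a, tree_b, tree_c]

def forest_predict(x):
    """The forest's prediction is the average of all tree predictions."""
    return mean([tree(x) for tree in forest])

print(forest_predict(3.0))  # every tree takes its "low" branch
print(forest_predict(7.0))  # every tree takes its "high" branch
```

Because each tree makes slightly different mistakes, averaging many of them tends to smooth out individual errors.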
When to use
- When you have “enough” data (100+ points).
When not to use
- When your dataset is small (<100 points).
Classification algorithms (predicting a class)
Logistic Classification
In a nutshell
The Logistic Classification model aims to separate your dataset into classes using a straight line.
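As a rough sketch of what "separating with a straight line" means for two features: the model computes a weighted sum of the features, turns it into a probability, and picks a class depending on which side of the line the point falls. The weights, the bias, and the class names "A" and "B" below are hand-picked assumptions; a real model learns the weights from your data.

```python
import math

def predict_probability(x1, x2, w1, w2, bias):
    """Probability of class "B" from a weighted sum passed through a sigmoid."""
    score = w1 * x1 + w2 * x2 + bias
    return 1.0 / (1.0 + math.exp(-score))

def predict_class(x1, x2, w1=1.0, w2=1.0, bias=-10.0):
    """With these hand-picked weights the separating line is x1 + x2 = 10."""
    probability = predict_probability(x1, x2, w1, w2, bias)
    return "B" if probability >= 0.5 else "A"

print(predict_class(2.0, 3.0))  # point below the line
print(predict_class(8.0, 7.0))  # point above the line
```

Points far from the line get probabilities close to 0 or 1, while points near the line get probabilities close to 0.5.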
When to use
- When your dataset is relatively uncomplicated and you only have 2 classes in your data.
When not to use
- When you have more than 2 classes. In that case it is better to use the SVM Classification or Random Forest Classification model.
- When your dataset is more complicated and seems difficult to split into classes. The SVM Classification or Random Forest Classification model could work better there.
SVM Classification
In a nutshell
The SVM (Support Vector Machines) classifier aims to create a line separating the different classes in your data. SVM can adjust itself and use different curves to separate your data, depending on how complicated your dataset is.
When to use
- For more complicated datasets, when you believe separating the data into classes might be non-obvious.
When not to use
- When your dataset is very small (20-50 points).
Random Forest Classification
In a nutshell
Random Forest creates lots of “baby models” called trees, each of which predicts a class for the data. The prediction given by the model is then the class predicted by most trees in the forest.
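The voting step can be sketched like this. The list of votes is made up for illustration; in a real forest each vote would come from one trained tree.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class predicted by the most trees.

    On a tie, Counter.most_common keeps the class that was seen first.
    """
    return Counter(votes).most_common(1)[0][0]

# Made-up votes from five hypothetical trees for a single data point.
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
predicted = majority_vote(tree_votes)
print(predicted)
```

Unlike the two-class straight-line approach, voting works the same way no matter how many classes appear in the votes.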
When to use
- When you have “enough” data (100+ points).
When not to use
- When your dataset is small (<100 points).