
Picking an algorithm

Table of contents

  1. Picking an algorithm
    1. Rules of thumb for picking an algorithm
      1. 1. Just go with the default
      2. 2. Look at the score
      3. 3. Use the tips in algorithm intro below
    2. Regression algorithms (predicting a number)
      1. Linear Regression
        1. In a nutshell
        2. When to use
        3. When not to use
      2. SVM Regression
        1. In a nutshell
        2. When to use
        3. When not to use
      3. Random Forest Regression
        1. In a nutshell
        2. When to use
        3. When not to use
    3. Classification algorithms (predicting a class)
      1. Logistic Classification
        1. In a nutshell
        2. When to use
        3. When not to use
      2. SVM Classification
        1. In a nutshell
        2. When to use
        3. When not to use
      3. Random Forest Classification
        1. In a nutshell
        2. When to use
        3. When not to use

Picking an algorithm

Once we have established whether we are predicting a Number or a Class, and told MagicSheets where our data is, we need to select the algorithm for the task.

Picking the right algorithm for your data and the task at hand is as much art as it is science.

Fortunately, we can follow some simple rules of thumb to select the algorithm from the available options.

Rules of thumb for picking an algorithm

Important note: Applying the “rules of thumb” below does not guarantee your model will be “better” or “more correct”. It takes experience to understand the pros and cons of different algorithms, but for many applications any of the available methods may give reliable results.

1. Just go with the default

For many applications the default regression (Linear Regression) and classification (Logistic Classification) methods will be good enough.

2. Look at the score

You can try different algorithms and pick the one that gives the best score. (You can read more about scores here.)
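If you are curious what this comparison looks like in code, here is a minimal sketch using scikit-learn and a synthetic dataset (both are assumptions, purely for illustration; in MagicSheets the models and scores are computed for you):

```python
# A minimal sketch (assumption: scikit-learn, synthetic data) of trying several
# algorithms and comparing their cross-validated scores.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "SVM Regression": SVR(),
    "Random Forest Regression": RandomForestRegressor(random_state=0),
}

for name, model in models.items():
    # A higher R^2 score means the model explains more of the variation in the values.
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.3f}")
```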

3. Use the tips in algorithm intro below

In the brief intro to each algorithm below you will find some simple tips for when to pick each algorithm.


Regression algorithms (predicting a number)

Linear Regression

In a nutshell

The Linear Regression model assumes a simple linear relationship between your features and values. For example, if you have only 1 feature, your model might look like

\[\text{value} = \text{feature} \times \text{some number} + \text{some other number}\]

When one feature goes up, the value should go up as well, and vice versa.
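If you are curious how this looks in code, here is a minimal sketch of fitting a straight line with scikit-learn; the single feature and the numbers below are made up for illustration:

```python
# A minimal sketch (assumption: scikit-learn, made-up data) of Linear Regression
# learning "value = feature x some number + some other number".
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per row
y = np.array([3.0, 5.0, 7.0, 9.0])          # value = feature * 2 + 1

model = LinearRegression().fit(X, y)
print(model.coef_[0])          # "some number" (the slope), about 2.0
print(model.intercept_)        # "some other number" (the offset), about 1.0
print(model.predict([[5.0]]))  # prediction for a new point, about 11.0
```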

When to use

  1. When you have good reasons to believe the relationship between features and the value (label) is linear (when features go up, so does the value, and vice versa).
  2. When your dataset is small (~20-50 points).
  3. When you are unsure which model is best and are interested in a “quick and dirty” prediction to continue your work.

When not to use

  1. When you believe some features in the model might be impacting the value “much, much more” than others.
  2. When you are unsure which model to use, but need a solid, reliable forecast.

SVM Regression

In a nutshell

SVM (Support Vector Machines) regression works somewhat like Linear Regression, but it is much more powerful, as it allows for more complicated relationships between features and values, for example:

\[\text{value} = \text{feature}^3 \times \text{some number} + \text{some other number}\]

As such, it is much more flexible and can learn more complicated patterns in your data.
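A minimal sketch of this flexibility, assuming scikit-learn's SVR and a synthetic cubic dataset (both are illustrative assumptions, not MagicSheets internals):

```python
# A minimal sketch (assumption: scikit-learn, synthetic data) of SVM regression
# learning a non-linear pattern, value = feature^3 + 2.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 3 + 2.0

# The RBF kernel lets the model bend its prediction curve to follow the data.
model = SVR(kernel="rbf", C=100.0).fit(X, y)
print(model.predict([[2.0]]))  # roughly 2**3 + 2 = 10
```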

When to use

  1. When you are unsure which algorithm to use and want a more reliable prediction.
  2. When you have “enough” data (50+ points)

When not to use

  1. When your data set is very small (20-50 points)

Random Forest Regression

In a nutshell

Random Forest creates lots of “baby models” called trees, each of which predicts a value for the data. The prediction given by the model is then the average value of predictions created by all trees in the forest.
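As a rough sketch of that averaging, assuming scikit-learn's RandomForestRegressor and a synthetic dataset:

```python
# A minimal sketch (assumption: scikit-learn, synthetic data) showing that a
# Random Forest's prediction is the average of its individual trees' predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Ask every "baby model" (tree) for its prediction on the first row...
tree_predictions = [tree.predict(X[:1])[0] for tree in forest.estimators_]
print(np.mean(tree_predictions))  # ...and average them
print(forest.predict(X[:1])[0])   # the forest's prediction matches that average
```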

When to use

  1. When you have “enough” data (100+ points)

When not to use

  1. When your dataset is small (<100 points)

Classification algorithms (predicting a class)

Logistic Classification

In a nutshell

The Logistic Classification model aims to separate your dataset into classes using a straight line.
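For the curious, here is a minimal sketch of this with scikit-learn's LogisticRegression and a tiny made-up two-class dataset (both are illustrative assumptions):

```python
# A minimal sketch (assumption: scikit-learn, made-up data) of Logistic
# Classification separating two classes with a straight-line boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]])  # two features
y = np.array([0, 0, 0, 1, 1, 1])                                # two classes

model = LogisticRegression().fit(X, y)
print(model.predict([[2, 2], [6, 6]]))  # expected: class 0, then class 1
```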

When to use

  1. When your dataset is relatively uncomplicated and you only have 2 classes in your data.

When not to use

  1. When you have more than 2 classes; in that case it’s better to use the SVM Classification or Random Forest Classification model.
  2. For more complicated datasets that seem difficult to split into classes, the SVM Classification or Random Forest Classification model may work better.

SVM Classification

In a nutshell

The SVM (Support Vector Machines) classifier aims to create a line separating the different classes in your data. SVM can adjust itself and use different curves to separate your data, depending on how complicated your dataset is.
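A minimal sketch of that curved separation, assuming scikit-learn's SVC and a synthetic “circles” dataset that cannot be split with a straight line:

```python
# A minimal sketch (assumption: scikit-learn, synthetic data) of SVM
# classification using a curve rather than a straight line.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class sits inside a ring formed by the other class.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# The RBF kernel lets the classifier draw a curved boundary around the inner class.
model = SVC(kernel="rbf").fit(X, y)
print(model.score(X, y))  # close to 1.0 on this toy data
```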

When to use

  1. For more complicated data sets, when you believe separating the data into classes might be non-obvious.

When not to use

  1. When your data set is very small (20-50 points)

Random Forest Classification

In a nutshell

Random Forest creates lots of “baby models” called trees, each of which predicts a class for the data. The prediction given by the model is then the class predicted by most trees in the forest.
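As a rough sketch of that voting, assuming scikit-learn's RandomForestClassifier and a synthetic dataset (note that scikit-learn itself averages the trees' predicted probabilities, which usually gives the same answer as a plain majority vote):

```python
# A minimal sketch (assumption: scikit-learn, synthetic data) of the
# majority-vote idea behind Random Forest classification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Ask every tree which class it predicts for the first row, then count the votes.
votes = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_], dtype=int)
print(np.bincount(votes).argmax())  # class with the most votes
print(forest.predict(X[:1])[0])     # the forest's prediction typically agrees
```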

When to use

  1. When you have “enough” data (100+ points)

When not to use

  1. When your dataset is small (<100 points)