A broader perspective
Let us move on from the specific application of categorizing text to a broader perspective on machine learning.
Text categorization is a form of a classification task — the problem of categorizing instances into pre-defined classes.
With just two categories (spam/not-spam), it is more specifically a binary classification problem.
Classification tasks involving more than two categories are known as multi-class classification problems.
In some scenarios, the same example may simultaneously belong to multiple classes. Such tasks are known as multi-labeled classification problems.
Whether binary or multi-class, the categorical outputs are discrete variables.
If the desired outputs are real-valued numbers, then we are dealing with the task of regression.
For example, predicting housing prices, credit scores, mortgage risk, and stock market movements are all formulated as regression problems in machine learning.
Training regression models is similar to training classifiers.
Use a training set of instances paired with their expected outputs.
Then, train the regression model to accurately predict those expected outputs.
Regression and classification both utilize training examples with known categories or expected outputs.
These form of examples are known as supervised examples, with supervision referring to the expected outputs.
The particular machine learning paradigm is known as supervised learning.
In machine learning, it is typically the case that more training data implies better predictive performance.
Sometimes, it is particularly challenging to acquire supervision on numerous examples.
For example, acquiring manually assigned categories for emails requires human involvement.
This manual assignment can be prohibitively expensive for tasks that involve scientific experiments using expensive equipment or laborious observations to arrive at a label.
In such cases, machine learners typically resort to techniques following the semi-supervised learning paradigm — learning from partially supervised examples.
Alternatively, another machine learning strategy to deal with the difficulty of acquiring enough training data involves being selective in choosing examples to supervise.
This paradigm is known as active learning — instead of passively using the provided training set, solicit supervision on intelligently chosen examples when faced with a limited supervision budget.
Some machine learning tasks do not utilize supervision.
For example, instead of assigning emails to predefined categories, we may just wish to automatically discover their natural groupings, maybe based on the similarity of their content.
The task of discovering groupings in a set of examples is known as a clustering problem.
The discovered groups are known as clusters.
Because we do not use any supervision to perform clustering, this learning paradigm is known as unsupervised learning.
Clustering reduces the original multi-dimensional data to a single dimension, with all similar examples being assigned the same value.
A more general approach is that of dimensionality reduction — the challenge of representing the instances with fewer dimensions than their original representation, while still retaining the important pieces of information in each example.
An example application of dimensionality reduction could be easier visualization of multivariate data while still retaining their nuances.
Another example of unsupervised learning is that of density estimation — the task of estimating the probability density or likelihood of certain observations.
Density estimation may be useful to identify if certain observations are commonly expected to occur or if they are rare occurrences, hence supporting tasks such as anomaly detection.
Yet another broad machine learning paradigm involves identifying the action that will result in the maximum reward, under given conditions.
For example, driverless cars, autopiloting drones, robotic vacuum cleaners, and game-playing bots are all agents that need to decide on the next steps that will lead to the best outcomes in their particular scenario.
This is the field of reinforcement learning.
This is just the tip of the iceberg.
For more details refer our comprehensive overview of types of tasks in machine learning.