Machine Learning

Introduction

The Oxford English Dictionary defines learning as "the acquisition of knowledge or skills through experience, study, or being taught." Machine learning (ML), as the name implies, is the science of learning by machines — the computational kind. To that end, ML involves the acquisition of knowledge or skills through the study of examples — the so-called data.

In this article, we provide a broad introduction to the field of machine learning. The goal of the article is to inform the reader of the variety of applications possible with machine learning, the ideas they share, and general machine learning terminology. It is intended for readers starting their journey in machine learning, deep learning, or data science.

If you already have a fair understanding of general ideas in machine learning, then go to our machine learning curriculum to read specific topics in further detail.

Applications of ML

Before we go into the specifics of learning by machines, let's list some applications of machine learning to whet our appetite for what's to come.

Here's a list of some popular applications of machine learning.

  • Tagging faces, objects, and places in pictures.
  • Captioning images and videos.
  • Driverless cars, drones, robotic vacuum cleaners, and intelligent game-playing bots.
  • Recognizing hand-written text.
  • Transcribing human speech to text.
  • Translating speech or text in one language into another.
  • Recommending shopping items, movies, books, songs, or news stories based on a person's interests.
  • Filtering out email spam.
  • Categorizing email, text documents, movies, songs, news stories, or books into pre-defined categories or genres.
  • Predicting the next earthquake or weather event such as rain, thunderstorm, drought, or snowfall.
  • Discovering faulty industrial components or predicting the time-to-failure.
  • Estimating housing prices, credit scores, mortgage risk, and stock market movements.
  • Gauging customer sentiment based on textual reviews or social interactions.
  • Recommending candidates suitable for a job opening.
  • Predicting social connections.
  • Personalized search, education, healthcare, advertisements, travel, and shopping experiences.
  • Smart home automation.
  • And many more ...

Machine learning is gradually becoming ubiquitous and part of pretty much everything we do. Even if you do not intend to be a practitioner of machine learning, it is still an essential concept to understand and appreciate.

An example application

Among the many applications outlined in the previous section, consider a popular application of machine learning — spam classification. As the name implies, the goal in spam classification is to categorize incoming email into the categories Spam and Not Spam.

Sender | Subject | Body | Category
Lorem Ipsum | Register now | Today only, \( \ldots \) | Spam
Dolor Sit | Minutes of the meeting | All, we discussed \( \ldots \) | Not Spam
Amet Consectetur | Best insurance | Save hundreds \( \ldots \) | Spam
Adipiscing Elit | Last opportunity | Do not miss \( \ldots \) | Spam
The Learning Machine | Study variational autoencoders | Improve your machine learning \( \ldots \) | Not Spam
Sed Do | Meeting request | Can we meet on \( \ldots \) | Not Spam
Eiusmod Tempor | Analytical report | Completed the analysis of \( \ldots \) | Not Spam
\( \vdots \) | \( \vdots \) | \( \vdots \) | \( \vdots \)

Email content

Let's see one way of addressing this challenge using machine learning. Intuitively, if an email contains content that is irrelevant to the recipient, then it is likely spam. Irrelevant content can be identified by watching out for certain sentences, phrases, or even words within the text of the email. (The sender's email address is also crucial information for identifying spam. We will ignore it for the moment and focus on the textual aspects of the email.)

Example: Spam

Sender: Lorem Ipsum

Subject: Register now

Hi there,

Once in a lifetime opportunity. Amazing talks by amazing professionals. Learn to make money without investing time, money, or energy. Large passive income. Take incredible vacations. Become the richest person in the world. Click here to register today. Only today. Last chance. You will regret if you do not. Act fast.

Your wellwisher,

Lorem Ipsum

Example: Not Spam

Sender: Dolor Sit

Subject: Minutes of the meeting

All,

We discussed the next steps to make the project ORION happen. We specifically discussed our strategy for market analysis, requirements gathering, delivering within the timelines, staying under the budget ...

Sincerely,

Dolor Sit

Spammy and Not spammy content

But which words constitute irrelevant content? We can infer this by monitoring the recipient's actions on past emails. Taken collectively, words, phrases, and sentences that appear mostly in emails the user chose to mark as spam are more likely to indicate irrelevant content. Intuitively, such past emails can be used to identify the differentiating aspects of the content that typically appears in spam emails versus relevant emails. And this is what machine learning models do! In the examples above, words such as money, opportunity, and register are typical of the spam email, while words such as meeting, project, and strategy are typical of the relevant one.

A simple strategy to classify spam could involve measuring the relative proportion of words and phrases that are commonly found in spam emails.

Alas, life is not so simple. Advanced machine learning models take into account the sequence of terms, their relative context, and their semantics when assigning an email to the spam folder. But hey, this is an illustrative example!
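The simple word-proportion strategy above could be sketched as follows. The word list, threshold, and example emails here are all made up for illustration; a real system would learn the word list from training data.

```python
# A toy spam scorer: the fraction of an email's words that appear in a
# made-up list of words commonly found in past spam emails.
SPAMMY_WORDS = {"money", "register", "opportunity", "free", "act", "fast"}

def spam_score(email_text, spammy_words=SPAMMY_WORDS):
    """Return the fraction of words in the email that look spammy."""
    words = [w.strip(".,!?") for w in email_text.lower().split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in spammy_words) / len(words)

def classify(email_text, threshold=0.2):
    """Label the email Spam if its spam score crosses the threshold."""
    return "Spam" if spam_score(email_text) >= threshold else "Not Spam"
```

For example, `classify("Act fast! Register now to make money")` returns `"Spam"` because four of its seven words are on the spammy list, while `classify("We discussed the minutes of the meeting")` returns `"Not Spam"`.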

Predictive model

Spam classification based on the textual content of emails is an instance of a sub-field of machine learning known as text classification or text categorization. The predictive model that achieves such categorization is known as a text classifier. For spam classification, the input to the predictive model is the email content. The output, or target, of the text classifier is the category of that particular input email.

$$ \text{email} ~~~ \overset{\text{input}}{\longrightarrow} ~~~ \text{Text classifier} ~~~ \overset{\text{output}}{\longrightarrow} ~~~ \text{category (spam/no spam)} $$

To be able to arrive at accurate outputs, we first need to train the text classifier.

Training set

To train a text classifier we use examples from the past. Each example consists of an email and its manually assigned category, the so-called label of the email. Such examples are known as labeled examples, shown here.

Email Content | Label
Today only, \( \ldots \) | Spam
All, we discussed \( \ldots \) | Not Spam
Improve your machine learning \( \ldots \) | Not Spam
Save hundreds \( \ldots \) | Spam
\( \vdots \) | \( \vdots \)

Such a collection of examples used to identify the differentiating aspects of categories is known as the training set.

Preparing the input

Textual content may not be directly amenable to learning, because much of machine learning involves mathematical operations. To facilitate numerical computation, some approaches to text classification convert the text into vectors of word counts, or some scaled proportion of their occurrences in the email versus the rest of the corpus. Similarly, textual categorical labels are converted to numerical form, say \( -1 \) for spam and \( 1 \) for not-spam. Such steps to arrive at a suitable input format for machine learning are known as preprocessing the input.

After preprocessing, each input email is represented in terms of features (the words) and their values (the counts of those words). The space spanned by these features is known as the feature space. Preprocessing converts each input example into an instance — a tuple of feature-values and corresponding labels.

In the following table, each row is an instance, and each column (except the label) represents the feature value for that column.

Word 1 feature | Word 2 feature | \( \ldots \) | Label
0.5 | 0.73 | \( \ldots \) | -1
0.25 | 0.21 | \( \ldots \) | 1
0.85 | 0.32 | \( \ldots \) | -1
0.12 | 0.95 | \( \ldots \) | -1
\( \vdots \) | \( \vdots \) | \( \vdots \) | \( \vdots \)
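The preprocessing step above could be sketched as a simple bag-of-words conversion. The emails, vocabulary, and labels below are hypothetical; a real pipeline would also scale the raw counts as described.

```python
from collections import Counter

def build_vocabulary(emails):
    """Collect the sorted set of all words seen in the training emails."""
    return sorted({w for email in emails for w in email.lower().split()})

def to_feature_vector(email, vocab):
    """Convert an email into a vector of word counts over the vocabulary."""
    counts = Counter(email.lower().split())
    return [counts[w] for w in vocab]

# Two hypothetical training emails and their numerical labels:
emails = ["register today register now", "minutes of the meeting"]
labels = [-1, 1]  # -1 for spam, 1 for not-spam, as in the text

vocab = build_vocabulary(emails)
instances = [to_feature_vector(e, vocab) for e in emails]
```

Each row of `instances` is one instance in the feature space: the first email maps to the counts of each vocabulary word it contains, with `register` counted twice.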

Some approaches may choose to arrive at intelligent transformations of the input, instead of mere word counts. For example, an approach may choose to associate counts not to individual words but to groups of words and their synonyms that most distinguish the categories. Such intelligent input transformation strategies that are guided by the classifier model are known as feature extraction.

Some words within the content may be irrelevant to the classification task. For example, common words such as articles (the, an, a) or prepositions (at, for, in, off, on, over, under, ...) may be meaningless in the context of classification because they are equally likely to appear in any communication, spam or otherwise. Removing irrelevant features from the input examples, thereby retaining only the relevant information, is known as feature selection.
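A minimal sketch of this kind of feature selection is stop-word removal. The stop-word list below is a tiny, hypothetical one; real systems use much larger lists or data-driven criteria.

```python
# A tiny, hypothetical stop-word list of articles and prepositions.
STOP_WORDS = {"the", "an", "a", "at", "for", "in", "of", "on", "to"}

def select_features(words, stop_words=STOP_WORDS):
    """Drop stop words, keeping only potentially informative features."""
    return [w for w in words if w not in stop_words]
```

For instance, `select_features("register for the conference in a hurry".split())` keeps only `["register", "conference", "hurry"]`.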

Training the predictive model

With the training dataset ready after preprocessing/feature-selection/feature-extraction, it is time to actually train the model.

Some predictive models work by comparing feature values across the categories and identifying those features and their values that are most different between the two categories.

Alternatively, some approaches may utilize a parametric form for modeling the classifier. The parameters are chosen such that, when combined with the input features, they lead to accurate predictions. For example, the parameters could be a real-valued vector, the parameter vector. We can check the sign of the inner product of this parameter vector with the input vector. If it is positive, then the email represented by the input is not spam. If it is negative, then the corresponding email is most likely spam.
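The sign-of-inner-product prediction described above could be sketched as follows. The weights and feature values are made up for illustration; in practice the parameter vector is learned from the training set.

```python
def predict_sign(weights, features):
    """Classify by the sign of the inner product of the parameter vector
    with the input vector: positive -> not spam (+1), negative -> spam (-1)."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else -1

# A made-up parameter vector: a negative weight on a spammy feature,
# positive weights on features typical of relevant email.
weights = [-2.0, 1.5, 0.5]
```

With these weights, an input dominated by the first feature, such as `[1.0, 0.0, 0.2]`, yields a negative score and the label `-1` (spam), while `[0.0, 1.0, 0.5]` yields `1` (not spam).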

Irrespective of the particular modeling approach, the goal of any classification approach is the same — accurately predict the category of the input instance. Training classifiers involves discovering patterns in the input that make such accurate predictions possible.

This means the parameters or the discerning features should be discovered such that they maximally distinguish between the examples in the training set. The process of identifying such differentiating patterns is known as learning or training the classifier. At this point, the classifier is considered a trained model.

Evaluating the trained model

Before we can be confident that the trained model is effective at automatic categorization, we also need to understand the performance of the model. The ability of the model to perform well on examples that were not part of the training set is known as the generalization ability of the model. To estimate generalization ability, we evaluate or test the model on examples that it has not seen during training. The unseen examples used to evaluate the generalization ability of the model are known as the testing set. In the case of email classification, the testing set consists of more emails with manually assigned categories. To test, we let the classifier predict categories for these test emails without revealing the actual categories. Then, we evaluate the classifier's performance by comparing the predictions to the corresponding actual categories.

Email Content | Actual Label | Predicted Label | Is prediction correct?
Today only, \( \ldots \) | Spam | Spam | Yes
All, we discussed \( \ldots \) | Not Spam | Spam | No
Improve your machine learning \( \ldots \) | Not Spam | Not Spam | Yes
Save hundreds \( \ldots \) | Spam | Spam | Yes
\( \vdots \) | \( \vdots \) | \( \vdots \) | \( \vdots \)

A simple score for evaluating classification performance is the accuracy score, calculated as the fraction of predictions that were correct. We have a comprehensive article on numerous other approaches to measure classification performance.
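The accuracy score can be computed as follows, using the four test emails from the table above:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# The four test emails from the table: three of four predictions correct.
actual = ["Spam", "Not Spam", "Not Spam", "Spam"]
predicted = ["Spam", "Spam", "Not Spam", "Spam"]
```

Here `accuracy(actual, predicted)` gives 3/4 = 0.75.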

A trained model with good generalization performance is ready to deploy.

A broader perspective

Let us move on from the specific application of categorizing text to a broader perspective on machine learning.

Text categorization is a form of classification task — the problem of categorizing instances into pre-defined classes. With just two categories (spam/not-spam), it is more specifically a binary classification problem. Classification tasks involving more than two categories are known as multi-class classification problems. In some scenarios, the same example may simultaneously belong to multiple classes. Such tasks are known as multi-label classification problems.

Whether binary or multi-class, the categorical outputs are discrete variables. If the desired outputs are real-valued numbers, then we are dealing with the task of regression. For example, predicting housing prices, credit scores, mortgage risk, and stock market movements are all formulated as regression problems in machine learning. Training regression models is similar to training classifiers. Use a training set of instances paired with their expected outputs. Then, train the regression model to accurately predict those expected outputs.
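As a minimal illustration of regression with a single feature, here is ordinary least squares in closed form. The training pairs are made up and happen to lie exactly on a line; real data would be noisy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predict a real-valued output for a new input."""
    return slope * x + intercept

# Hypothetical training pairs lying exactly on y = 2x + 1:
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Just as with the classifier, the model (here, the slope and intercept) is trained to reproduce the expected outputs of the training set, then used to predict outputs for new inputs.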

Regression and classification both utilize training examples with known categories or expected outputs. Such examples are known as supervised examples, with supervision referring to the expected outputs. This machine learning paradigm is known as supervised learning.

In machine learning, it is typically the case that more training data implies better predictive performance. Sometimes, it is particularly challenging to acquire supervision on numerous examples. For example, acquiring manually assigned categories for emails requires human involvement. This manual assignment can be prohibitively expensive for tasks that involve scientific experiments using expensive equipment or laborious observations to arrive at a label. In such cases, machine learners typically resort to techniques following the semi-supervised learning paradigm — learning from partially supervised examples.

Alternatively, another machine learning strategy to deal with the difficulty of acquiring enough training data involves being selective in choosing examples to supervise. This paradigm is known as active learning — instead of passively using the provided training set, solicit supervision on intelligently chosen examples when faced with a limited supervision budget.

Some machine learning tasks do not utilize supervision. For example, instead of assigning emails to predefined categories, we may just wish to automatically discover their natural groupings, maybe based on the similarity of their content. The task of discovering groupings in a set of examples is known as a clustering problem. The discovered groups are known as clusters. Because we do not use any supervision to perform clustering, this learning paradigm is known as unsupervised learning.
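As a toy illustration of clustering, here is a tiny k-means for one-dimensional points. The points and initial centers are made up; real clustering of emails would operate on high-dimensional feature vectors with a suitable similarity measure.

```python
def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            for p in points]

def kmeans_1d(points, centers, iters=10):
    """Alternate between assigning each point to its nearest center and
    moving each center to the mean of the points assigned to it."""
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            new_centers.append(sum(members) / len(members) if members else centers[i])
        centers = new_centers
    return centers

# Two clearly separated groups of hypothetical points:
points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centers = kmeans_1d(points, [0.0, 10.0])
```

No labels are supplied anywhere: the two groupings emerge purely from the similarity (here, proximity) of the points, which is what makes this unsupervised.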

Clustering reduces the original multi-dimensional data to a single dimension, with all similar examples being assigned the same value. A more general approach is that of dimensionality reduction — the challenge of representing the instances with fewer dimensions than their original representation, while still retaining the important pieces of information in each example. An example application of dimensionality reduction could be easier visualization of multivariate data while still retaining their nuances.

Another example of unsupervised learning is that of density estimation — the task of estimating the probability density or likelihood of certain observations. Density estimation may be useful to identify if certain observations are commonly expected to occur or if they are rare occurrences, hence supporting tasks such as anomaly detection.

Yet another broad machine learning paradigm involves identifying the action that will result in the maximum reward, under given conditions. For example, driverless cars, autopiloting drones, robotic vacuum cleaners, and game-playing bots are all agents that need to decide on the next steps that will lead to the best outcomes in their particular scenario. This is the field of reinforcement learning.

This is just the tip of the iceberg. For more details, refer to our comprehensive overview of types of tasks in machine learning.

Where to next?

To continue your journey to machine learning expertise, we recommend one of the following two paths.

For those interested in applications of machine learning, we recommend a complete walk-through of specific examples.

If you wish to understand the internals of machine learning models, we recommend you study a simple model for each task family.

In case you choose to take the more mathematical path, we highly recommend familiarity with foundational concepts in mathematics. We have the following comprehensive articles to make you comfortable.

Follow the above links to get acquainted with the corresponding concepts.

Please share

Let your friends, followers, and colleagues know about this resource you discovered.

Let's connect

Please share your comments, questions, encouragement, and feedback.