Today no one is surprised when a social network automatically recognizes their friends in an uploaded photo or when a mail service automatically gets rid of junk emails. But do you know how this became possible?
The short answer is Machine Learning (or just ML for short).
In this article, I want to give an introduction to the beautiful world of machine learning and provide a simple example of its implementation.
What is machine learning?

Formally speaking, Machine Learning is a part of Computer Science which studies algorithms that can automatically adjust their underlying model using a training subset of data for further analysis of the whole data set. In other words, it is about algorithms which are designed to generate rules for solving a specific task without being explicitly programmed or adapted for that task.
What does it mean?
Let's say you want your computer to become a florist, so that it can distinguish different types of flowers. First of all, you have to define a set of characteristics (so-called "features") which differ from one flower to another. It could be petal colour and shape, the flower's height, price (on a market), habitat, smell or even taste :)

But keep in mind that features must:
- differ as much as possible between different types of flowers (as for me, most flowers have the same taste :)
- be independent of each other (the pot size is always related to the height of a flower, so one of these parameters will be enough)
After you have finished choosing the features, you'll have to make the main choice. You can write a program which knows all these attributes for every flower in the world and can classify the input flower by its parameters. Or you can write a program which uses a set of {features -> type of flower} matches to automatically generate a set of "rules" for distinguishing one flower from another.
The first approach could be a good idea if there were only a few types of flowers in the world and the rules to distinguish them were very simple (e.g.: all red flowers are roses, all tasty flowers are chamomiles). Otherwise, you have to use --magic-- Statistics (which is almost the same :).
In terms of Statistics, we have a classical "classification" problem here. To solve it, we need a special "self-learning" classification algorithm. For simplicity's sake, let's take a look at the k-nearest neighbours approach.
First of all, we have to represent each flower (i.e. the set of its features) as a point in N-dimensional space (where N is the number of features). Let's consider a case where we use only two features (height and price) and only two classes (red roses and blue hydrangeas). In such a situation, a 10 cm rose for $2 can be represented as a red dot with coordinates (x = 10, y = 2).
Then, during the "learning" step, we'll collect the "points" of different classes into different "groups".

After that, we can classify a new sample (which is just another point) by finding the nearest point among the groups that we formed during the "learning" phase. The class of the new point will be the class of the nearest one.

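For instance, here is a minimal sketch of that idea in Python (the heights and prices below are made up purely for illustration):
# A toy nearest-neighbour check: each flower is a (height_cm, price_usd) point
training_points = [(10, 2), (12, 3), (30, 8), (28, 7)]
training_labels = ['rose', 'rose', 'hydrangea', 'hydrangea']
new_flower = (11, 2)
# Take the label of the closest training point (by squared Euclidean distance)
distances = [(p[0] - new_flower[0]) ** 2 + (p[1] - new_flower[1]) ** 2
             for p in training_points]
print(training_labels[distances.index(min(distances))])
"""
rose
"""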
That's it! Alternatively, you can build a decision tree, calculate Bayes probabilities, find a function or hyperplane that splits the points, train a Hidden Markov Model or a Neural Network, etc.
What else can Machine Learning be used for?
Of course, flowers aren't the only problem which machine learning solves. In general, we can highlight the following major directions in machine learning:
- regression analysis
- classification
- finding associations
- clustering
These directions can also be divided into two groups: supervised and unsupervised learning. In the first case (regression, classification, associations), we always have a set of "right answers" (function values) during training. In the second one, we have to make decisions relying only on the data itself.
Let's spend a few minutes and briefly consider the directions mentioned above.
Regression analysis
In general, regression analysis allows us to understand how the arguments of a function affect its result.
E.g. we have a dataset with the prices of houses in a city. Each house has its attributes: the number of bedrooms, the prestige of the district, the distance to the city centre, the presence of the underground and, of course, the price :)
Regression analysis allows us to build a dependency between the price and the other attributes: \(X_1 * \theta_1 + ... + X_n * \theta_n = Price\) (see the sketch after this list), so that we could:
- predict the price for any possible values of the attributes;
- measure the effect of each attribute on the price;
- find correlations among attributes (e.g. prestige and location in the centre might mean the same, so we can exclude one of them).
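Here is a minimal sketch of such a fit with scikit-learn (the house data below is made up purely for illustration):
# Made-up houses: [number of bedrooms, distance to the centre in km] -> price in $1000s
from sklearn.linear_model import LinearRegression
X = [[1, 10], [2, 8], [3, 3], [4, 1]]
prices = [100, 150, 280, 400]
model = LinearRegression().fit(X, prices)
print(model.coef_)              # the theta coefficients: the effect of each attribute on the price
print(model.predict([[3, 5]]))  # predicted price for a hypothetical 3-bedroom house 5 km from the centre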
Classification
The most famous example of classification is face recognition. In this case, the features are elements of a face (eyes, hair, nose, ears, mouth) and the classes are people.
The same goes for speech recognition: your features are vectors of sound characteristics, and the classes are words from a predefined dictionary.
We can also mention Biometrics as a way to solve the "authentication" problem. You use a person's fingerprints or retina fragments as features and people as classes.
Finding associations
Here we have the classical "basket analysis" problem: given a set of all the purchases in a supermarket, we want to predict what else we can suggest to a customer with specific goods in their basket. In other words, we want to calculate the probability of buying product B if A is already bought: \(P(B \mid A)\), so that we can suggest B to a customer if we see A in their basket.
E.g. a customer who has bought a bottle of milk might be interested in buying cookies.
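Such a probability can be estimated directly from purchase statistics. Here is a toy sketch (the baskets below are made up purely for illustration):
# Estimate P(cookies | milk) from a set of made-up purchase baskets
baskets = [
    {'milk', 'cookies'},
    {'milk', 'bread'},
    {'milk', 'cookies', 'tea'},
    {'bread', 'butter'},
]
milk_baskets = [b for b in baskets if 'milk' in b]
print(sum('cookies' in b for b in milk_baskets) / len(milk_baskets))
"""
0.6666666666666666
"""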
Clustering
Let's say we have a company which wants to "know its customers". For that, the company can open its CRM, define the customer characteristics it is interested in (age, gender, etc.) and, using one of the clustering algorithms, find out which kind of people it mostly deals with.
Or maybe you want to display all 10K places where you've checked in on a map of your city in your personal blog. If you render all the objects on the same screen, neither the browser nor your friends will be happy. Instead, you should group the places so that you have only 5-10 points at the current zoom level.
You can also use clustering for image compression. Let's say you have an image with large groups of pixels placed in the same area. Why not cluster them and replace each group with a special marker?
Finally, you could have an email client which should sort similar documents into autogenerated folders. In this case, you can analyze the words from the emails and use them as features for clustering. If you receive emails mostly from internet shops and food delivery sites, the system will be able to create two clusters: one with the words "buy" and "discount" in the centre, and another one with "chicken" and "tasty" :)
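A minimal clustering sketch with scikit-learn (the customers below are made up purely for illustration):
# Group made-up customers by [age, yearly spending in $] into two clusters
from sklearn.cluster import KMeans
customers = [[19, 300], [22, 350], [45, 1200], [50, 1500], [48, 1100]]
kmeans = KMeans(n_clusters=2).fit(customers)
print(kmeans.labels_)           # the cluster index assigned to each customer
print(kmeans.cluster_centers_)  # the "typical customer" of each cluster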
What about a code example?
The simplest way to start with ML is Python. In my humble opinion, just as Java is the language for enterprise, Python is de facto the language for data science :) What is more important, it has platforms like Anaconda which provide you with all the necessary libraries and tools for data science (and machine learning as a part of it).
Once you have downloaded and installed Anaconda, we can write a simple script which solves the "classification" problem.
The first thing that we need here is data. For simplicity's sake, we will use a predefined dataset which is already included in Anaconda. Do you like Irises? I do, but if you don't, feel free to use any other pre-loaded dataset.
# Import dataset with flowers
from sklearn import datasets
iris = datasets.load_iris()
# Print a few random lines from the dataset
import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['type of flower'] = pd.Categorical.from_codes(iris.target, iris.target_names)
pd.set_option('display.width', 100)
print(df.sample(n=5), sep="\n", end="\n\n")
"""
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) type of flower
102                7.1               3.0                5.9               2.1       virginica
117                7.7               3.8                6.7               2.2       virginica
96                 5.7               2.9                4.2               1.3      versicolor
61                 5.9               3.0                4.2               1.5      versicolor
41                 4.5               2.3                1.3               0.3          setosa
"""
# Extract matrix with features
X = iris.data
print(X[:3])
"""
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]]
"""
# Extract the vector with class labels (the types of the flowers)
y = iris.target
print(y[:3])
"""
[0 0 0]
"""
So, we have three types of Iris flowers:

For each flower from the dataset we know the length and the width of its sepal and petal:

Hence, a flower with sepal length = 5.1cm, sepal width = 3.5cm, petal length = 1.4cm and petal width = 0.2cm can be represented as a point/vector with the following coordinates: [5.1, 3.5, 1.4, 0.2].
The next and very important step is splitting our data into "training" and "test" subsets. Notice that both subsets must be sampled uniformly from the source data, so that they are independent of the ordering of the source data. In other words, we want to avoid situations where the first half of the source data consists of A-class elements and the second half of B-class elements. Splitting such a data set right down the middle into "training" and "test" subsets leads to a situation where our algorithm is trained on samples from only one class and tested on elements from the other one... which is a kind of fail :)
So, we'll use a special function for splitting. But you can always write your own:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
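If you did want to roll your own, a naive version (without stratification or other niceties) might look like this:
# A naive hand-rolled split (just a sketch): shuffle the indices and cut them in half
import numpy as np
indices = np.random.permutation(len(X))
half = len(X) // 2
X_train, X_test = X[indices[:half]], X[indices[half:]]
y_train, y_test = y[indices[:half]], y[indices[half:]]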
Now it's time to choose a classifier that we will use. Scikit-learn (the Python package that we're using) has a dozen predefined algorithms which you can use for classification.
All the classes that implement these algorithms have the same simple interface:
- fit(X_train, y_train) - a method for training the classifier;
- predict(X_test) - a method for predicting values using the trained classifier;
Since we've already discussed KNN, let's try another approach that is simple (at least to understand) - a DecisionTree:
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)
That's it - we've just taught our classifier to distinguish the flowers! :)
Now we can use it to predict the type of a flower based on its features. Let's check our classifier on the test set:
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(predictions, y_test))
"""
0.973333333333
"""
Well, more than 97% of the test set was predicted correctly! So, we can rely on our classifier and predict something outside of our source dataset. E.g. what is the type of an iris with "sepal length" = 7.0, "sepal width" = 2.0, "petal length" = 1.5 and "petal width" = 1.8?
prediction = classifier.predict([[7.0, 2.0, 1.5, 1.8]])
print(prediction[0])
print(iris.target_names[prediction[0]])
"""
2
virginica
"""
It seems that the machine thinks that our flower is of type "virginica". I have no reason not to believe it after all those dozens of milliseconds that it spent studying our source data :)
Putting it all together
Finally, here is our simple example of Machine Learning:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(predictions, y_test))
You can play with it using different classifiers and changing the source dataset.
Writing your own classifier
Do you feel like we've missed something? Yes, the example above doesn't contain the implementation of the "learning" logic. To fill this gap, let's implement the simplest case of the k-nearest neighbours algorithm (with k = 1) that we mentioned at the beginning of this article.
Our dataset represents a flower using 4 attributes, hence we can consider it a point in N = 4 dimensional space. Let's write code that stores the points of each type (of flower) during the training process and resolves the type of a new flower as the type of the "closest" point during the "prediction"/"test" phase.
Technically, the distance in N-dimensional space doesn't differ from the distance in 2-dimensional space (e.g. on a sheet in your workbook) and can be calculated by the Euclidean formula:
$$dist(a, b) = \sqrt{(a_1 - b_1)^2 + ... + (a_n - b_n)^2}$$
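For example, for the first two flowers from our dataset:
# The same formula by hand, for two four-dimensional points from the dataset
a = [5.1, 3.5, 1.4, 0.2]
b = [4.9, 3.0, 1.4, 0.2]
print(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5)
"""
0.5385164807134504
"""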
So, here is our classifier:
from scipy.spatial import distance
class DummyKnnClassifier:
    def fit(self, X_train, y_train):
        # "Training" is just memorizing the samples and their labels
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        # Label each test sample with the label of its nearest neighbour
        predictions = []
        for sample in X_test:
            neighbor = self.find_nearest(sample)
            predictions.append(neighbor)
        return predictions

    def find_nearest(self, sample):
        # Linear scan over the training set for the closest point
        nearest_distance = distance.euclidean(sample, self.X_train[0])
        nearest_index = 0
        for i in range(1, len(self.X_train)):
            current_distance = distance.euclidean(sample, self.X_train[i])
            if current_distance < nearest_distance:
                nearest_distance = current_distance
                nearest_index = i
        return self.y_train[nearest_index]
Let's try to use it!
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
classifier = DummyKnnClassifier()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(predictions, y_test))
"""
0.946666666667
"""
As you can see, it shows approximately the same results as the DecisionTree classifier. Of course, the reason for that is the simplicity of the dataset. But anyway, we did it! :)