Introduction
One of my common arguments is that just about anyone can develop applications – especially web applications – because development frameworks make it so easy to do so. Unfortunately, the results often leave a lot to be desired because of a lack of understanding of the fundamentals.
The same argument applies to Machine Learning. There is so much code on the web that does Machine Learning modelling that anyone can copy it and run it – no matter the nature of the dataset they are using, or their level of understanding of the fundamentals. Yet Machine Learning modelling is a grueling task by any measure.
It is vitally important to understand the fundamentals of any technology we use. In the case of Machine Learning, a great deal of Statistics is required as a foundation. Generally, one needs to be an expert in data before claiming to have done anything sensible with Machine Learning.
Choosing the right algorithm and cost (error) function, determining the appropriate data size, cleansing and engineering the dataset, identifying model overfitting and underfitting, looking for a good model fit, distinguishing static data from time-series data, applying model evaluation techniques, adjusting hyper-parameters, and interpreting the results are just some of the minimum requirements. None of these is easy for the average person, by any stretch of the imagination.
Machine Learning cannot easily be taught to the full extent needed. At some point, you will be left on your own to think, think, and think. It is this thinking machinery that makes ML really hard – nobody can teach you how to think.
A little about Machine Learning
A branch of Artificial Intelligence (the field of imparting intelligence to machines), Machine Learning is the study of systems that learn from data and improve their predictions without being explicitly programmed to produce those outcomes. ML systems look for patterns in data and use them to make decisions; ML algorithms use historical data as input to predict new output values.
Some examples of ML systems are: spam detection, image recognition, self-driving cars, etc.
At the core of ML is the concept of a dataset, which is a collection of data. A dataset has Features and a Response.
• Features: predictors, inputs, or attributes
o Feature matrix: the collection of features, in case there is more than one.
o Feature names: the list of the names of all the features.
• Response: target, output, or label
o Response vector: represents the response column; generally, we have just one response column.
o Target names: the possible values taken by the response vector.
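To make these terms concrete, here is a quick look at each of them using scikit-learn's built-in iris dataset (a standard example dataset, chosen here purely for illustration):

```python
# Inspect the parts of a dataset using scikit-learn's built-in iris data.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data               # feature matrix: 150 samples x 4 features
y = iris.target             # response vector: one label per sample
print(iris.feature_names)   # feature names, e.g. 'sepal length (cm)'
print(iris.target_names)    # target names: the possible response values
```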
Types of Machine Learning
- Supervised Learning (Regression and Classification problems). Regression – continuous response and continuous features; Classification – discrete response and continuous features. How it works: the algorithm is given a target/outcome variable (the dependent variable) to be predicted from a set of predictors (independent variables). From this set of variables, it learns a function that maps inputs to the desired outputs, and training continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression (Linear Regression) and Classification (Decision Tree, Random Forest, KNN, Logistic Regression).
- Unsupervised Learning (no response or labels). How it works: there is no target or outcome variable to predict or estimate. It is used for clustering a population into different groups, and is widely applied to segmenting customers into groups for specific interventions. Examples of Unsupervised Learning: the Apriori algorithm, K-means.
- Reinforcement Learning. How it works: the machine is trained to make specific decisions by being exposed to an environment in which it trains itself continually through trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: the Markov Decision Process.
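To make the supervised (regression) case above concrete, here is a minimal sketch that fits a line to a tiny, made-up dataset – the feature (hours studied) and response (exam score) values are invented purely for illustration:

```python
# Supervised regression sketch: learn a mapping from one continuous
# feature (hours studied) to a continuous response (exam score).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # feature matrix: one feature
y = np.array([52, 57, 61, 68, 72])        # continuous response vector

model = LinearRegression().fit(X, y)      # training: learn the input-output function
pred = model.predict([[6]])               # predict the score for 6 hours of study
print(pred[0])                            # approximately 77.3
```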
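And for the unsupervised case, a minimal K-means sketch: the points below carry no labels at all, yet the algorithm recovers the two obvious groups on its own (the coordinates are made up for illustration):

```python
# Unsupervised clustering sketch: K-means groups unlabeled points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # three points near (1.5, 1.5)
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])  # three points near (8.5, 8.5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # the first three points share one label, the last three the other
```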
Machine Learning Life Cycle
Below, I present the key steps (the life cycle) of a Machine Learning project. Where programming examples are needed, I will use Python (scikit-learn).
1. Data collection: get the right quality and size of data for accurate model results.
2. Data cleansing, wrangling, and feature engineering: handle missing and duplicate data; perform necessary data type conversions; understand and visualize the data. Pandas is good at this.
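As a minimal sketch of this step, the following Pandas pass removes duplicates and missing values from a small, hypothetical dataset (the column names and values are invented):

```python
# Minimal cleansing pass: duplicates, missing values, type conversion.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, None, 40],            # one missing value
    "income": [50000, 50000, 62000, 80000],  # row 1 duplicates row 0
})

df = df.drop_duplicates()          # remove the duplicate row
df = df.dropna()                   # drop the row with the missing age
df["age"] = df["age"].astype(int)  # convert age back from float to int
print(df)                          # two clean rows remain
```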
3. Split the data into training and testing sets: your model learns from the training set, while the testing set is used to evaluate the model's accuracy after training.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
4. Model identification: researchers have developed models suited to different tasks; choose one appropriate for your particular task.
5. Model training (fitting): pass the prepared data to your model so it can find patterns and make predictions. With further training and retraining, the model can get better at predicting. The example below fits a K-Nearest Neighbors classification model.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# our model is given the name classifier_knn
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
6. Model testing (evaluation): check how the fitted model performs on the testing data (previously unseen). This gives a realistic estimate of how your model will perform on new data.
# Evaluate the model: compute the accuracy, confusion matrix, and classification report by comparing actual response values (y_test) with predicted response values (y_pred)
y_pred = classifier_knn.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))
7. Model improvement: tune the hyper-parameters (e.g. n_neighbors in the model fitted above) to see whether the model improves, and keep the parameter values that yield the best accuracy.
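A simple way to do this is to try a range of values for n_neighbors and keep the one with the best test accuracy. The sketch below is self-contained, using scikit-learn's iris dataset in place of whatever dataset you are working with:

```python
# Hyper-parameter sweep: try several values of n_neighbors, keep the best.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

best_k, best_acc = None, 0.0
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc

print("best n_neighbors:", best_k, "accuracy:", best_acc)
```

scikit-learn's GridSearchCV automates this kind of search, adding cross-validation so the choice does not depend on a single train/test split.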
8. Model deployment: save/persist your model and deploy to production for prediction, etc.
# persist the model (joblib is now a standalone package; the old sklearn.externals.joblib import has been removed from scikit-learn)
import joblib
joblib.dump(classifier_knn, 'my_classifier_knn.joblib')
# load the model and predict on unseen data (X_unseen holds new samples)
model = joblib.load('my_classifier_knn.joblib')
y_unseen = model.predict(X_unseen)