
I shall not claim to know so much, but only that I learn new things everyday

Tuesday, 4 October 2022

The Journey to Machine Learning Engineer

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that seeks to learn from known data in order to make predictions (inference) on previously unseen data.

Traditionally, we have been programming computers using a programming language so that the computer can blindly and religiously follow the instructions developed by a human. In AI, the computer mimics human intelligence without being explicitly programmed in the traditional sense alluded to above.

Today, we talk specifically about how to learn ML. Although there is a vast array of ML literature on the web, access to it alone is no passport to learning ML. In my experience, there is an arsenal of knowledge and skills that facilitates learning ML. If you are equipped with them, learning ML becomes much easier than it is for someone without those prior skills.

It is important to first summarize the key steps in ML: i) Gather/collect data, ii) Prepare the data, iii) Choose and define a model, iv) Train the model, v) Evaluate the model, vi) Hyper-parameter tuning, vii) Deployment for prediction.

I find the following skillset really instrumental in the learning process:

  1. Data skills - if you have generally worked with electronic data in various ways before, right from collection, through storage and analysis, to reporting, then you have a great asset to start with. This means you are likely (or even certainly) familiar with tools such as SQL and Python (or R). And when it comes to Python, you are likely familiar with the Python fundamentals, Pandas, Numpy, Matplotlib, etc. If you have these under your belt, thank God for putting you on the right path.

  2. Programming skills - ML has a lot to do with programming. Knowledge of general programming fundamentals, constructs, data structures, etc. can significantly lower the learning curve as you learn ML. Imagine that you are supposed to scrape the web in order to extract some data for ML modelling. Surely, you won't do that without programming knowledge. The ML pipeline alone is a collection of code, full of standard programming constructs, which means there is no shortcut without understanding programming.

  3. Applied Statistics - this goes without saying. To see what I mean, just read through an ML modelling exercise and judge for yourself.

Let's assume that you want to learn ML using Python. Then the best way, in my view, is to follow these steps:

  1. Read about the theory of AI and ML.

  2. Learn and understand Python, Pandas, Numpy, Matplotlib.

  3. Learn and understand some basic statistics.

  4. Learn SQL (even if only basic SQL), especially if you are going to be dealing with structured data from relational databases.

  5. Install Jupyter Notebook (or Anaconda as a whole) to allow you to play with code from existing ML projects.

  6. Pick some good sample ML projects using Scikit-learn library for supervised learning (regression and classification) and unsupervised learning (clustering).

  7. Pick some good sample Deep Learning projects using Tensorflow (Keras) library.

  8. Every time you pick a project to play with, spend a lot of time manually running the code step-by-step in Jupyter Notebook. You will realize how much the skillset I listed above matters. As much as possible, try changing the code to see the effects on the model. If you dwell on this stage for as long as you can, then ML modelling will start making a lot of sense.
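The whole path above can be sketched in miniature with Scikit-learn. The example below is my own illustration, not a prescribed exercise: it uses the Iris dataset that ships with the library and a simple classifier, but any dataset and model would do for practice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# A supervised classification project end-to-end, in a few lines:
X, y = load_iris(return_X_y=True)                   # gather data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)            # prepare/split the data
model = KNeighborsClassifier(n_neighbors=3)         # choose a model
model.fit(X_train, y_train)                         # train the model
y_pred = model.predict(X_test)                      # predict on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))  # evaluate the model
```

Running each of these lines one at a time in a notebook, and changing things (the model, `n_neighbors`, the split size) to see the effect, is exactly the kind of step-by-step play recommended above.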

If you can go through the above stages, I believe you will be on your way to becoming an ML Engineer, the much-coveted job title in the data space today. And if you already have the prior skillsets of data, programming and applied statistics, then life could not be easier for you. As they say, you have just fallen into things.

Thursday, 1 September 2022

Machine Learning – everyone is modelling

Introduction

One of my common arguments is that just about anyone is developing applications – especially web-applications – because there are development frameworks that make it so easy to do so. Unfortunately, this leaves a lot to be desired due to the lack of understanding of the fundamentals.

The same argument applies to Machine Learning. There is so much code on the web that does Machine Learning modelling, so anyone can go out there, copy the code, and run it – no matter the nature of the dataset they are using, or their level of understanding of the fundamentals. Yet Machine Learning modelling is a gruelling task by any measure.

It is all too important to understand the fundamentals of any technology that we use. In the case of Machine Learning, there is a lot of Statistics knowledge that is required as a foundation. Generally, one needs to be an expert in data to be able to claim that they have done anything sensible with Machine Learning.

Choosing the right algorithm and cost function (for error), determining the appropriate data size, cleansing and engineering the dataset, identifying model overfitting and underfitting, looking for a good model fit, differentiating between static and time-series data, applying model evaluation techniques, adjusting hyper-parameters, and interpreting the results are just some of the important minimum standards. These are not easy for the average person by any stretch of the imagination.

It is not easy to be taught Machine Learning to the full extent needed. At some point, you will be left on your own to think, think, and think. It is this thinking machinery that makes ML really hard. Nobody will teach you how to think.

A little about Machine Learning

A field of Artificial Intelligence (imparting intelligence to machines), ML is a system that learns from data and improves itself in making predictions without the need for being programmed explicitly to predict outcomes. ML systems look for patterns in data and use them for making decisions. ML algorithms use historical data as input to predict new output values.

Some examples of ML systems are: spam detection, image recognition, self-driving cars, etc.

At the core of ML is the concept of dataset, which is a collection of data. A dataset has Features and Response.

•    Features: predictors, inputs or attributes
    o    Feature matrix: the collection of feature columns, when there is more than one feature.
    o    Feature names: the list of all the names of the features.
•    Response: target, output or label
    o    Response vector: represents the response column. Generally, we have just one response column.
    o    Target names: the possible values taken by the response vector.
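To make these terms concrete, here is a minimal sketch using the Iris dataset bundled with Scikit-learn (the dataset choice is mine, purely for illustration):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix: one row per sample, one column per feature
y = iris.target  # response vector: one label per sample

print(X.shape)             # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # feature names, e.g. 'sepal length (cm)'
print(iris.target_names)   # target names: 'setosa', 'versicolor', 'virginica'
```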


Types of Machine Learning

  1. Supervised Learning (Regression and Classification problems). Regression - continuous response and continuous features; Classification - discrete response and continuous features. How it works: this algorithm consists of a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression (Linear Regression) and Classification (Decision Tree, Random Forest, KNN, Logistic Regression), etc.

  2. Unsupervised Learning (no response or labels). How it works: in this algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into different groups, which is widely applied for segmenting customers into groups for specific interventions. Examples of Unsupervised Learning: the Apriori algorithm, K-means.

  3. Reinforcement Learning. How it works: using this algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment where it trains itself continually using trial and error, learns from past experience, and tries to capture the best possible knowledge to make accurate decisions. Example of Reinforcement Learning: the Markov Decision Process (the framework underlying many reinforcement learning algorithms).
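As a small, hedged illustration of unsupervised learning, the sketch below clusters unlabelled points with K-means; the toy data points are my own invention, chosen to form two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points, with no labels supplied to the algorithm
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                 [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)  # the algorithm groups the points by itself

print(labels)  # the first three points share one label, the last three another
```

Notice that, unlike supervised learning, we never told the model which group each point belongs to; it discovered the structure on its own.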



Machine Learning Life Cycle

Below, I present the key steps (the life cycle) required for a Machine Learning project. I will be specific to Python (Scikit-learn) where programming examples are required.

1.    Data collection: get the right quality and size of data for accurate model results.

2.    Data cleansing, wrangling, feature engineering: consider cleansing missing and duplicate data; perform necessary data type conversions; understand and visualize the data. Pandas is good at this.
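A small, hedged sketch of what such cleansing might look like in Pandas; the column names and rules are invented for illustration only:

```python
import pandas as pd

# A toy dataset with a missing value, a duplicate row, and a wrong data type
df = pd.DataFrame({
    "age":    ["25", "30", None, "30"],
    "income": [40000, 52000, 61000, 52000],
})

df = df.drop_duplicates()          # remove duplicate rows
df = df.dropna(subset=["age"])     # drop rows with missing age
df["age"] = df["age"].astype(int)  # data type conversion: string -> integer
print(df.describe())               # quick statistical look at the data
```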

3.    Split the data into training and testing: your model learns from the training set, while the testing set is used to evaluate the accuracy of your model after training.

# split the data: 70% for training, 30% for testing
# (X is the feature matrix, y is the response vector)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

4.    Model identification: scientists have developed models that are suitable for different tasks, so choose one for your particular task.

5.    Model training (fitting): pass the prepared data to your model to find patterns and make predictions. With further training and retraining, the model can get better at predicting. The example below fits a K-Nearest Neighbor classification model.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# our model is given the name classifier_knn
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)

6.    Model testing (evaluation): check how the fitted model is performing by using the testing data (previously unseen). This gives you an accurate measure of how your model will perform.
# Evaluate the model: find the accuracy, confusion matrix and classification
# report by comparing actual response values (y_test) with predicted
# response values (y_pred)
y_pred = classifier_knn.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))

7.    Model improvement: try to tune the parameters (e.g. n_neighbors in model fitting) to see if the model improves. Use the values of the parameters that yield maximum model accuracy.
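One common way to tune such a parameter systematically is a grid search with cross-validation. A minimal sketch (the candidate values for n_neighbors and the use of the Iris dataset are my own choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of n_neighbors and keep the one with the best
# cross-validated accuracy
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```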

8.    Model deployment: save/persist your model and deploy to production for prediction, etc.
# persist the model (joblib is now a standalone package;
# sklearn.externals.joblib was removed in newer Scikit-learn versions)
import joblib
joblib.dump(classifier_knn, 'my_classifier_knn.joblib')

# predict on unseen data
model = joblib.load('my_classifier_knn.joblib')
y_unseen = model.predict(X_unseen)

Saturday, 16 July 2022

Database Management Systems: ACID Properties

Introduction

Have you ever sent someone money using mobile money (or just transferred money electronically), and the recipient claims that they have not received it, yet your account has been debited? If so, you may want to know more about ACID properties in databases.

ACID stands for Atomicity, Consistency, Isolation and Durability, and is based on the concept of a DBMS transaction.

Look at this screenshot for some reflection on the seriousness of a related incident from Nigeria (a country I have never been to; I only hear the stories). Money lost, JUST LIKE THAT!!! Note that this may not have been caused exactly by a lack of ACIDity in the database, but that cannot be ruled out either.




Transaction and ACID

A Transaction is a sequence of database operations that satisfies the ACID properties and can be perceived as a single logical operation on the data. For example, a transfer of funds from one bank account to another is a single transaction, even though it involves multiple changes such as debiting one account and crediting another.

A Transaction is a single logical unit of work that accesses and/or modifies the contents of a database. Transactions access data using read and write SQL operations.

If a transaction consists of T1 and T2, then the pseudocode is:

Start transaction
T1 (e.g. debit sender’s account)
T2 (e.g. credit recipient’s account)
Commit transaction (or Rollback)
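The pseudocode above can be sketched in Python with the standard sqlite3 module, which wraps the two updates (T1 and T2) in one transaction; the account names and amounts are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('sender', 100), ('recipient', 0)")
conn.commit()

try:
    # sqlite3 opens a transaction implicitly before the first UPDATE
    conn.execute(
        "UPDATE accounts SET balance = balance - 40 WHERE name = 'sender'")     # T1: debit
    conn.execute(
        "UPDATE accounts SET balance = balance + 40 WHERE name = 'recipient'")  # T2: credit
    conn.commit()    # both changes become permanent together
except sqlite3.Error:
    conn.rollback()  # on any failure, neither change takes effect

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

If anything fails between the debit and the credit, the rollback undoes the debit too, so the sender's money cannot simply vanish.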

ACID plays an important role in maintaining database consistency before and after execution of transactions. Developers are aware that the transactions they implement in a database must follow ACID properties, otherwise we would end up in chaos.


Atomicity

Atomicity means that either the entire transaction takes place at once or it doesn’t take place at all – All or Nothing. A transaction is considered one unit and must therefore run successfully to completion (commit), or else it does not get executed at all (rollback).

A Transaction is often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single unit, which either succeeds completely or fails completely.

So, when you send mobile money, your account cannot be debited (T1) without the recipient’s account being credited (T2).


Consistency (Correctness)


This means that integrity constraints must be maintained so that the database is consistent before and after the transaction.

Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. A possible rule could be that the total value of all accounts must remain 100, which must hold both before and after a transaction. This prevents database corruption by an illegal transaction, but it does not guarantee that a transaction is correct.

Inconsistency can occur if T1 completes but T2 fails. As a result, the transaction is incomplete.


Isolation

Isolation ensures that multiple transactions can occur concurrently without leading to the inconsistency of the database state. Transactions occur independently without interference.

Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.

Consequently, changes occurring in a particular transaction will not be visible to any other transaction until that particular change in that transaction has been committed.


Durability

Durability is about persistence. It means that once a transaction has been committed, the updates and modifications to the database are stored permanently and persist even in the event of system failures.