May 2022

Introduction

If your routine work involves working with SQL queries, as is the case for many, then there is no doubt that you come face to face with both simple and complex queries.

Query performance becomes a real issue in production, because it directly affects delivery and user-experience. There are some really low-hanging fruits when it comes to query tuning or optimization, yet they are often overlooked.

This post will show you some of those low-hanging fruits with respect to MySQL, but the tips also generally apply to other Relational Database Management Systems (RDBMS) such as Oracle, MS SQL Server, PostgreSQL, etc.

The Low-Hanging Fruits

Inefficient or slow queries may take a toll on system resources, i.e. memory, processor, disk and network. Such queries need to be considered for optimization. You can try to profile slow queries by adding the EXPLAIN keyword in front of a SELECT statement. It will show you some details, including the execution path, and whether the query is using an index or the whole table scan. Generate query plan as follows:

EXPLAIN QUERY PLAN SELECT ...

When writing SQL queries in MySQL, the following basic tips will help to optimize your queries. These basic tips are generally extensible to any other RDBMS, not just MySQL.

The following list of top 10 "AVOID" tips assumes that the reader is conversant with SQL, and has written SQL queries before. As an individual, you will often have to think outside the box to find suitable alternatives as you try to avoid some of these pitfalls.

SELECT *: Avoid extravagance. Select only the required fields rather than SELECT *.
DISTINCT: Avoid using SELECT DISTINCT if possible. You rather creatively look out for other alternatives.
UNION: Avoid using UNION, instead use UNION ALL.
WILDCARDS: Avoid using wildcards at the beginning of a predicate. A predicate is a condition or expression that evaluates to a Boolean i.e. True or False, used for filtering in WHERE clause. A wildcard at the beginning of a search does not apply indexing, but runs a full table scan – which is costly.
FUNCTIONS: Avoid using functions in predicates e.g. WHERE len(surname) > 10. Here, len(surname) will not use index, hence slow. The solution is to ensure that you also index on the function. For example, CREATE INDEX len_idx ON table_name(LEN(suname)). This is called function-index or expression-index.
OR: Avoid using too many instances of “OR” in predicates.
WHERE: Avoid using WHERE for joining tables, Use JOIN rather than WHERE for joining tables. You need to understand INNER, LEFT, RIGHT, and CROSS JOINS very well.
JOINS: Avoid using too many JOINS in a query. The same also applies to too many subqueries.
ORDER BY: Avoid composite ORDER BY with different orders i.e. a mix of ASC and DESC. You may also combine this with INDEX to speed up the sorting.
INDEX: Avoid using non-indexed fields in predicates. Create INDEX for fields that you use for searching or filtering (i.e. where you use WHERE), JOINs and ORDER BY. In the same vein, avoid keeping unused INDEXES since they will unnecessarily slow down saves in a transactional system (OLTP). Note that indexes are updated every time a record is inserted, deleted or updated.

Conclusion

While the above tips may look trivial to some – and often taken for granted – they are indeed powerful tools in the day-to-day SQL assignments. They are pretty easy to implement as long as you can make them part of your practice.

If properly implemented, you may gradually notice that your users will complain less about poor system performance.

Important hint about Index: Using index speeds up record retrieval ... it allows us to locate the exact data without scanning the whole table, just like in a text book.. Clustered index creates physical ordering of rows. Every primary key is automatically a clustered index, and there can be only one clustered index on a table at any one time. It is like a Dictionary, without a separate index page. To create another clustered index, first remove primary key and previous index. Non-Clustered index is an index structure separate from the data stored in a table that reorders one or more selected columns. The non-clustered index is created to improve the performance of frequently used queries not covered by a clustered index. It’s like a textbook, where the index page is created separately at the beginning of that book.

Introduction

It’s amazing how many different answers you can get when you ask candidates in a data-related job interview this simple question: What does SQL stand for?

You can get responses such as Simple Query Language, Standard Query Language, etc, etc ... not forgetting the many "I don't know" responses.

What’s even more amazing are the answers they would give if you ask them what actually SQL does, or what it is used for.

Spanning over several years, I have interviewed many candidates for Data Engineering and Software Engineering positions. Many of the candidates have no clue about SQL, but present a promising outlook as quick learners with an orientation of logical-thinking. Very few candidates, if any, would get it right about SQL.

What strikes me even more is that with good mentorship and hand-holding, many of such candidates turn out to excel and make very meaningful contribution as Data Engineers (or Software Engineers). There is a great sense of satisfaction for having been instrumental in turning around those lives.

So what is this SQL thing?

To cut the long story short, SQL stands for Structured Query Language, and it is the language of Relational Databases. Period!

SQL is a domain-specific or special-purpose programming language used for querying or manipulating relational databases such as MySQL, Microsoft SQL Server, Oracle, PostgreSQL, SQLite, etc. By querying, we mean:

i) retrieving, inserting, updating, deleting data (records)
ii) creating, altering, dropping the database and its objects (tables, views, procedures, etc); and
iii) basically any sort of manipulation of the database

You can contrast SQL with general-purpose programming languages such as Python, C#, Java, etc.

It is clear that one has to be proficient in relational databases before jumping in the SQL wagon. At least one has to know the fundamentals of relational databases, such as tables, records, fields, keys, relationships, etc. In this post, I will assume knowledge of these fundamentals.

If SQL is the language of relational databases, then what about non-relational databases? Don't ask that question now because it is another big subject for another day, falling under what is called NoSQL.

Causes of Distraction

In my interaction with Data Engineers or Software Engineers, there are some major sources of distraction which, when used, make some engineers think that there is an easy substitute for SQL.

The sources of distraction are:

i) Tools/Frameworks that use GUIs or Wizards – this is common with reporting or visualization tools/frameworks
ii) Object Relational Mapper (ORM); and
iii) Data Analysis libraries such as Pandas

By removing SQL complexity, the above are somewhat like high-level abstractions of SQL. They may appear to be possible substitutes for SQL, but nothing could be farthest from the truth. They can never be absolute substitutes.

I would urge anyone in the Relational Database space to not get derailed by the availability of the above detractors. Knowledge of pure SQL is still paramount, in fact it is inevitable for all intents and purposes.

This talk of “I can do without SQL” does not hold any water. SQL is a must as long as you interact with relational databases at the backend level. Put another way, SQL is a must for every Data or Software Professional.

If you cannot "speak SQL", then we would all wonder how you would handle structured data in a relational database. Things would just not add up. I remember back in the day when I would write at least some SQL queries every single day for over several years. SQL was, and is, just part of my life.

Must I really care about SQL?

Yes, for as long as I interface with a relational database backend. The more the interaction, the more you need to understand SQL.

You start seeing the real value of SQL when you do things like report development using some reporting framework; or when you do database programming, where your job is to write SQL programs – which reminds me of the Oracle database programming days.

Imagine that you work for a national Tax Authority, with millions of tax payers. If your manager wants some data, e.g. a list of tax payers based on some criteria, how would you get that?Of course you use SQL. The list of use-cases of SQL is endless. SQL is everywhere.

One should appreciate that SQL exists in the background of every data-driven application. For some perspective, every time you access some data on your mobile or web application, SQL is the one doing the querying work in the background. This only underscores the importance of knowing SQL.

SQL Must-Know Basics

Below, I list fundamental SQL topics which must be in the toolkit of every data or software professional. These are standard (ANSI) SQL, which apply to any relational database management system (RDBMS).

Each RDBMS has its own additional extension (also called dialect) to ANSI SQL – for instance T-SQL for MS SQL Server and PL/SQL for Oracle.

The fundamental SQL topics to allow you start off are:

Data Definition Language (DDL) – managing the data structure/model; auto-committed 
create
alter
drop

Data Manipulation Language (DML) – manipulating/modifying the actual data itself; not auto-committed, can be rolled back
insert
update
delete

Data Query Language (DQL) – retrieving/fetching the actual data itself
select

Data Control Language (DCL) - grant and take back authority from any database user
grant
revoke

Transaction Control Language (TCL) - works with DML commands like INSERT, DELETE and UPDATE only
commit
rollback
save point

specifying columns (SELECT columns)
filtering rows (WHERE clause)
grouping or aggregating rows (GROUP BY clause)
sorting (ORDER BY)
joining tables (JOIN)
subquery (query within a query)

Wrap-up with a nice SQL query example (aggregating across two tables)

The following two examples are easy and hard respectively. They are meant to motivate you, especially if you are at SQL beginner level of expertise. They demonstrate nearly all elements of the DML listed above.

We assume that we have a table named contact_room_payment, with fields contact_id, payment_date, and amount_paid, among other fields.

In summary:

table contact_room_payment  (contact_id, payment_date, amount_paid, ...)

If we use the following simple select query to list the records:

select cr.contact_id
	, cr.payment_date
	, cr.amount_paid
from contact_room_payment cr
order by cr.contact_id, payment_date desc;

we get the following result:

Question 1

The first question is: for each contact_id, show the maximum (latest) payment_date.

Question 2

The second question is: for each contact_id, show the latest payment_date and latest amount_paid, i.e. the amount_paid corresponding to the latest (or maximum) payment_date.

Solution to Question 1: for each contact_id, show the maximum payment_date.

select cr.contact_id
	, max(cr.payment_date)
from contact_room_payment cr
group by cr.contact_id
order by cr.contact_id;

Query Result 1:

Solution to Question 2: for each contact_id, show the latest amount_paid. This is an interesting case where we have to use a subquery (see bold) and join. If you look at this more critically, it has very important use cases (and it is not trivial).

select cr.contact_id
	, cr.payment_date
	, cr.amount_paid
from contact_room_payment cr 
inner join (select cr1.contact_id cr1_id, max(cr1.payment_date) cr1_date from contact_room_payment cr1 group by cr1_id) s1 on s1.cr1_id=cr.contact_id and s1.cr1_date=cr.payment_date
order by cr.contact_id;

Query Result 2:

Some References

To understand the above topics, it’s easy to find appropriate material on the web. A simple Google Search will give you many choices. I will, therefore, not concern myself with the details in this post.

A related reference is my earlier post on Beginner’s Guide to MySQL, available here. Try this out although it is more MySQL-specific.

Orama's Data Blog

sub-title

Thursday, 26 May 2022

MySQL Query Tuning: the low-hanging fruits

Wednesday, 18 May 2022

SQL – Structured Query Language: are there any substitutes?