Introduction
It’s amazing how many different answers you can get when you ask candidates in a data-related
job interview this simple question: What does SQL stand for?
You can get responses such as Simple Query Language, Standard Query Language, etc,
etc ... not forgetting the many "I don't know" responses.
What’s even more amazing are the answers they would give if you ask them what
actually SQL does, or what it is used for.
Spanning over several years, I have interviewed many candidates for Data Engineering and Software Engineering positions. Many of the candidates have no clue about SQL, but present a promising outlook as quick learners with an orientation of logical-thinking. Very few candidates, if any, would get it right about SQL.
What strikes me even more is that with good mentorship and hand-holding, many of such candidates turn out to excel and make very meaningful contribution as Data Engineers (or Software Engineers). There is a great sense of satisfaction for having been instrumental in turning around those lives.
So what is this SQL thing?
To cut the long story short, SQL stands for Structured Query Language, and it
is the language of Relational Databases. Period!
SQL is a domain-specific or special-purpose programming language used for querying or manipulating relational databases such as MySQL, Microsoft
SQL Server, Oracle, PostgreSQL, SQLite, etc. By querying, we mean:
i) retrieving,
inserting, updating, deleting data (records)
ii) creating, altering, dropping the database and its objects (tables, views, procedures, etc); and
iii) basically any sort of manipulation
of the database
You can contrast SQL with general-purpose programming languages such as Python, C#, Java, etc.
It is clear that one has to be proficient in relational databases before
jumping in the SQL wagon. At least one has to know the fundamentals of
relational databases, such as tables, records, fields, keys, relationships,
etc. In this post, I will assume knowledge of these fundamentals.
If SQL is the language of relational databases, then what about non-relational databases? Don't ask that question now because it is another big subject for another day, falling under what is called NoSQL.
Causes of Distraction
In my interaction with Data Engineers or Software Engineers, there are some major sources of distraction which, when used, make some engineers think that there is an easy substitute for SQL.
The sources of distraction are:
i) Tools/Frameworks that use GUIs or Wizards – this is common with reporting or visualization tools/frameworks
ii) Object Relational Mapper (ORM); and
iii) Data
Analysis libraries such as Pandas
By removing SQL complexity, the above are somewhat like high-level abstractions of SQL. They may appear to be possible substitutes
for SQL, but nothing could be farthest from the truth. They can never be absolute substitutes.
I would urge anyone in the Relational Database space to not get derailed by the
availability of the above detractors. Knowledge of pure SQL is still
paramount, in fact it is inevitable for all intents and purposes.
This talk of “I
can do without SQL” does not hold any water. SQL is a must as long as you
interact with relational databases at the backend level. Put another way, SQL
is a must for every Data or Software Professional.
If you cannot "speak SQL", then we would all wonder how you would handle structured data in a relational database. Things would just not add up. I remember back in the day when I would write at least some SQL queries every single day for over several years. SQL was, and is, just part of my life.
Must I really care about SQL?
Yes, for as long as I interface with a relational database backend. The more the interaction, the more you need to understand SQL.
You start seeing the real value of SQL when you do things like report development using some reporting framework; or when you do database programming, where your job is to write SQL programs – which reminds me of the Oracle database programming days.
Imagine that you work for a national Tax Authority, with millions of tax payers. If your manager wants some data, e.g. a list of tax payers based on some criteria, how would you get that?Of course you use SQL. The list of use-cases of SQL is endless. SQL is everywhere.
One should appreciate that SQL exists in the background of every data-driven application. For some perspective, every time you access some data on your mobile or web application, SQL is the one doing the querying work in the background. This only underscores the importance of knowing SQL.
SQL Must-Know Basics
Below, I list fundamental SQL topics which must be in the toolkit of every data
or software professional. These are standard (ANSI) SQL, which apply to any
relational database management system (RDBMS).
Each RDBMS has its own additional
extension (also called dialect) to ANSI SQL – for instance T-SQL for MS SQL Server and PL/SQL for Oracle.
The fundamental SQL topics to allow you start off are:
Data Definition Language (DDL) – managing the data structure/model; auto-committed create
alter
drop Data Manipulation Language (DML) – manipulating/modifying the actual data itself; not auto-committed, can be rolled back
insert update delete
Data Query Language (DQL) – retrieving/fetching the actual data itself select
Data Control Language (DCL) - grant and take back authority from any database user grant
revoke
Transaction Control Language (TCL) - works with DML commands like INSERT, DELETE and UPDATE only
commit
rollback
save point specifying columns (SELECT columns) filtering rows (WHERE clause) grouping or aggregating rows (GROUP BY clause) sorting (ORDER BY) joining tables (JOIN) subquery (query within a query)
Wrap-up with a nice SQL query example (aggregating across two tables)
The following two examples are easy and hard respectively. They are meant to motivate you, especially if you are at SQL beginner level of expertise. They demonstrate nearly all elements of the DML listed above.
We assume that we have a table named contact_room_payment, with fields contact_id, payment_date, and amount_paid, among other fields.
In summary:
table contact_room_payment (contact_id, payment_date, amount_paid, ...)
If we use the following simple select query to list the records:
select cr.contact_id
, cr.payment_date
, cr.amount_paid
from contact_room_payment cr
order by cr.contact_id, payment_date desc;
we get the following result:
Question 1
The first question is: for each contact_id, show the maximum (latest) payment_date.
Question 2
The second question is: for each contact_id, show the latest payment_date and latest amount_paid, i.e. the amount_paid corresponding to the latest (or maximum) payment_date.
Solution to Question 1: for each contact_id, show the maximum payment_date.
select cr.contact_id
, max(cr.payment_date)
from contact_room_payment cr
group by cr.contact_id
order by cr.contact_id;
Query Result 1:
Solution to Question 2: for each contact_id, show the latest amount_paid. This is an interesting case where we have to use a subquery (see bold) and join. If you look at this more critically, it has very important use cases (and it is not trivial).
select cr.contact_id
, cr.payment_date
, cr.amount_paid
from contact_room_payment cr
inner join (select cr1.contact_id cr1_id, max(cr1.payment_date) cr1_date from contact_room_payment cr1 group by cr1_id) s1 on s1.cr1_id=cr.contact_id and s1.cr1_date=cr.payment_date
order by cr.contact_id;
Query Result 2:
Some References
To understand the above topics, it’s easy to find appropriate material on the web. A simple Google Search will give you many choices. I will, therefore, not concern myself with the details in this post.
A related reference is my earlier post on Beginner’s Guide to MySQL, available here. Try this out although it is more MySQL-specific.
No comments:
Post a Comment