Introduction
An operational database used by a business is optimized for the quick
updates generated by day-to-day transactions. Such a database is
referred to as an OLTP (Online Transaction Processing) database.
An OLTP database is not suitable for business intelligence or analysis for several reasons,
primarily:
- It is highly normalized, so analytical queries are complex to write and run
inefficiently;
- It serves many concurrent users, so running business intelligence or analysis tools on an operational database
would put additional strain on it, slowing it down or even bringing it down under very heavy queries; and
- Operational databases may be disparate and may not communicate well in their original form, making analysis difficult.
Therefore, businesses need to find a way of separating analytical systems (OLAP) from
transactional systems (OLTP). This is where the data warehouse comes in.
Some
businesses choose to simply create an offline copy of their operational databases and
use this copy for business intelligence. In that case, the offline copy only
removes the load from the operational database; it still suffers
heavily in the other respects listed above.
The Advent of the Data Warehouse
The DW is an important component of Business Intelligence (BI). It is a
database derived from operational databases (source transaction systems) for
the purposes of analysis and reporting. A DW is a central repository of current and
historical data drawn from disparate databases. Because it avoids the complex
normalization of an operational database, a DW makes queries easy to write.
In short, a DW is a single data repository where multiple data sources are integrated for
online analytical processing (OLAP).
Where a DW focuses on only one function or department within an organization, it is
referred to instead as a Data Mart.
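To make the contrast concrete, here is a minimal sketch (table and column names are invented for illustration) of how an analyst's question becomes a short query against a denormalized, warehouse-style table, instead of a multi-join query against normalized OLTP tables:

```python
import sqlite3

# Hypothetical denormalized warehouse-style table: in an OLTP schema, the
# same report would require joins across several normalized tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE sales_dw (
        sale_date TEXT, region TEXT, product TEXT, amount REAL
    )
""")
cur.executemany(
    "INSERT INTO sales_dw VALUES (?, ?, ?, ?)",
    [
        ("2023-01-05", "East", "Widget", 100.0),
        ("2023-01-06", "East", "Gadget", 250.0),
        ("2023-01-07", "West", "Widget", 175.0),
    ],
)

# An analytical question -- total sales by region -- is a one-liner:
rows = cur.execute(
    "SELECT region, SUM(amount) FROM sales_dw GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 350.0), ('West', 175.0)]
```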
Designing a DW
The process of designing a DW involves:
- Requirements specification: identifying business needs from users and
stakeholders to come up with a requirements specification document. This is similar to the Requirements Specification of the software development process.
- Modelling: organizing the business needs in terms of facts, dimensions,
and relationships. Facts are business measures or metrics, while dimensions are the business categories or references used to analyse those facts. Star, Snowflake, and Galaxy schemas are the three main
modelling techniques used in DW design. This stage is similar to the design stage of the
software development life cycle.
- Extract, Transform, Load (ETL) or its variant ELT: extract data from various sources and place it in a data lake, then
transform and integrate the data before loading it into the DW. Some ETL tools are
SSIS, SAS Data Management, IBM Information Server, Hive, etc. In this stage,
the ETL pipeline is developed using an appropriate ETL tool.
- OLAP cubes: an OLAP cube (also called a multidimensional cube or hypercube) is a data
structure optimized for fast analysis of data along the multiple dimensions that define a
business problem.
- User Interface (UI): to allow end-users to work with the DW, a front-end UI can be built
with tools such as Tableau, Power BI, BigQuery, etc.
- Testing and Deployment: the DW needs to be tested before deployment.
Testing ensures that the system meets user requirements and is free of bugs.
Once users have accepted the system, it can be deployed.
- Training: users need to be trained to use the system: end-users on the
front-end UI, and technical teams on system
maintenance.
- Maintenance: On an ongoing basis, we may need to update the data model, UI applications, etc to meet new user requirements.
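The facts-and-dimensions modelling described in the steps above can be sketched as a tiny star schema. All table and column names here are illustrative, not a prescribed layout: one fact table of measures surrounded by dimension tables.

```python
import sqlite3

# Minimal star-schema sketch: fact_sales holds the measures (facts);
# dim_date and dim_product hold the categories used to slice them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,  -- measure
        amount      REAL      -- measure
    );
""")
cur.execute("INSERT INTO dim_date VALUES (20230105, 2023, 1)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(20230105, 1, 3, 300.0), (20230105, 2, 1, 250.0)])

# Analysis then slices the facts by dimension attributes:
total = cur.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchone()
print(total)  # ('Hardware', 550.0)
```

A Snowflake schema would further normalize the dimension tables (e.g. splitting `category` into its own table); a Galaxy schema shares dimensions across multiple fact tables.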
Updating a DW
Data is uploaded into a DW from operational systems (OLTP), which are - by design - not
optimized for reporting and analysis. In a DW, data is optimized for analysis
and reporting (OLAP).
A DW is built using two approaches:
- Extract, Transform, Load (ETL) and
- Extract, Load, Transform (ELT)
In ETL, data cleaning and transformation are done in a separate staging area before loading into the DW. The staging database
stores raw data extracted from each of the disparate source data systems. The
raw data from disparate sources is then transformed, integrated, and finally
loaded into the DW.
In the case of ELT, data is first loaded (as-is) into the DW. It is then
transformed and integrated inside a staging area within the data
warehouse itself.
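The ETL flow described above can be sketched as three small functions. The source formats and field names are invented for illustration; the point is that transformation happens in a staging step between extraction and loading:

```python
# Minimal ETL sketch: extract raw rows from two disparate (hypothetical)
# sources, transform and integrate them in staging, then load into the DW.

def extract():
    # In practice: query source OLTP systems, read files, call APIs, ...
    source_a = [{"date": "2023-01-05", "amt": "100.50"}]       # ISO dates, string amounts
    source_b = [{"sale_date": "06/01/2023", "amount": 250}]    # DD/MM/YYYY, numeric amounts
    return source_a, source_b

def transform(source_a, source_b):
    # Staging step: clean up types and integrate the two source formats
    # into one common shape.
    staged = []
    for row in source_a:
        staged.append({"sale_date": row["date"], "amount": float(row["amt"])})
    for row in source_b:
        d, m, y = row["sale_date"].split("/")
        staged.append({"sale_date": f"{y}-{m}-{d}", "amount": float(row["amount"])})
    return staged

def load(warehouse, staged):
    # In practice: bulk-insert into the DW; here, append to a list.
    warehouse.extend(staged)

warehouse = []
load(warehouse, transform(*extract()))
print(warehouse)
```

In ELT the order of the last two steps is swapped: `load` runs first on the raw rows, and the `transform` logic runs afterwards inside the warehouse itself.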
Data Warehouse can be accessed by end-users using Business Intelligence tools
(reporting frameworks, visualization), SQL clients, spreadsheets, etc.
Conclusion
In this era, a business cannot continue to thrive while relying on a
transactional (OLTP) database for business intelligence,
given the costs of doing so (see the Introduction). A DW provides the solution
to this challenge, and it can readily be accomplished by a team of data
professionals.
Moreover, as many businesses move to the cloud, they can now take advantage of
cloud-based DW and analysis tools such as Amazon Redshift (based on PostgreSQL), Microsoft Azure SQL Data Warehouse (based on SQL Server), Google BigQuery (ANSI SQL), Snowflake, etc.
Snowflake is a fully managed Massively Parallel Processing (MPP) cloud-based data warehouse that runs on
AWS, GCP, and Azure; unlike the others, it does not run on a cloud of its own.
It is now easier
than ever before to leverage cloud technologies in the DW space. Cloud solutions vary in pricing, scalability, architecture, security features, speed, etc.
You either join the
DW bandwagon or ship out.