Introduction
An operational database used by a business is optimized for the quick
updates generated by day-to-day transactions. Such a database is
referred to as an OLTP (Online Transaction Processing) database.
An OLTP database is not suitable for business intelligence or analysis for several reasons,
primarily:
- It is highly normalized, so analytical queries are complex to write and run
inefficiently;
- It serves many concurrent users, so running business intelligence or analysis tools on an operational database
would put additional strain on it, slowing it down or even bringing it down under very heavy queries; and
- Operational databases may be disparate and may not communicate well in their original form, making analysis difficult.
Therefore, businesses need to find a way of separating analytical systems (OLAP) from
transactional systems (OLTP). This is where the data warehouse comes in.
Some
businesses choose to simply create an offline copy of their operational databases and
use this copy for business intelligence. In that case, the offline copy only
removes the load from the operational database; it still suffers
heavily in the other respects listed above.
The Advent of the Data Warehouse
The DW is an important component of Business Intelligence (BI). It is a
database derived from operational databases (source transaction systems) for
the purposes of analysis and reporting. A DW is a central repository of current and
historical data drawn from disparate databases. Because it avoids the complex
normalization of an operational database, a DW makes queries easy to write.
In short, a DW is a single data repository where multiple data sources are integrated for
online analytical processing (OLAP).
Where a DW focuses on only one function or department within an organization, it is
referred to instead as a Data Mart.
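To make the contrast concrete, here is a minimal sketch (table and column names are invented for illustration) of how an analyst's question becomes a short query against a denormalized, warehouse-style table, instead of a multi-join query against normalized OLTP tables:

```python
import sqlite3

# Hypothetical denormalized warehouse-style table: in an OLTP schema, the
# same report would require joins across several normalized tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE sales_dw (
        sale_date TEXT, region TEXT, product TEXT, amount REAL
    )
""")
cur.executemany(
    "INSERT INTO sales_dw VALUES (?, ?, ?, ?)",
    [
        ("2023-01-05", "East", "Widget", 100.0),
        ("2023-01-06", "East", "Gadget", 250.0),
        ("2023-01-07", "West", "Widget", 175.0),
    ],
)

# An analytical question -- total sales by region -- is a one-liner:
rows = cur.execute(
    "SELECT region, SUM(amount) FROM sales_dw GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 350.0), ('West', 175.0)]
```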
Designing a DW
The process of designing a DW involves:
- Requirements specification: identifying business needs from users and
stakeholders to come up with a requirements specification document. This is similar to the Requirements Specification of the software development process.
- Modelling: organizing the business needs in terms of facts, dimensions,
and relationships. Facts are business measures or metrics, while dimensions are the business categories or references used to analyse those facts. Star, Snowflake, and Galaxy schemas are the three main
modelling techniques used in DW design. This stage is similar to the design stage of the
software development life cycle.
- Extract, Transform, Load (ETL) or its variant ELT: extract data from various sources and place it in a data lake, then
transform and integrate the data before loading it into the DW. Some ETL tools are
SSIS, SAS Data Management, IBM Information Server, Hive, etc. In this stage,
the ETL pipeline is developed using an appropriate ETL tool.
- OLAP cubes: an OLAP cube (also called a multidimensional cube or hypercube) is a data
structure optimized for fast analysis of data along the multiple dimensions that define a
business problem.
- User Interface (UI): to allow end-users to work with the DW, a front-end UI can be built
with tools such as Tableau, Power BI, BigQuery, etc.
- Testing and Deployment: the DW needs to be tested before deployment.
Testing ensures that the system meets user requirements and is free of bugs.
Once users have accepted the system, it can be deployed.
- Training: users need to be trained to use the system: end-users on the
front-end UI, and technical teams on system
maintenance.
- Maintenance: On an ongoing basis, we may need to update the data model, UI applications, etc to meet new user requirements.
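The facts-and-dimensions modelling described in the steps above can be sketched as a tiny star schema. All table and column names here are illustrative, not a prescribed layout: one fact table of measures surrounded by dimension tables.

```python
import sqlite3

# Minimal star-schema sketch: fact_sales holds the measures (facts);
# dim_date and dim_product hold the categories used to slice them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,  -- measure
        amount      REAL      -- measure
    );
""")
cur.execute("INSERT INTO dim_date VALUES (20230105, 2023, 1)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(20230105, 1, 3, 300.0), (20230105, 2, 1, 250.0)])

# Analysis then slices the facts by dimension attributes:
total = cur.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchone()
print(total)  # ('Hardware', 550.0)
```

A Snowflake schema would further normalize the dimension tables (e.g. splitting `category` into its own table); a Galaxy schema shares dimensions across multiple fact tables.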
Updating a DW
Data is uploaded into a DW from operational systems (OLTP), which are - by design - not
optimized for reporting and analysis. In a DW, data is optimized for analysis
and reporting (OLAP).
A DW is built using two approaches:
- Extract, Transform, Load (ETL) and
- Extract, Load, Transform (ELT)
In ETL, data cleaning and transformation are done in a separate staging area before loading into the DW. The staging database
stores raw data extracted from each of the disparate source data systems. The
raw data from disparate sources is then transformed, integrated, and finally
loaded into the DW.
In the case of ELT, data is first loaded (as-is) into the DW. It is then
transformed and integrated inside a staging area within the data
warehouse itself.
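The ETL flow described above can be sketched as three small functions. The source formats and field names are invented for illustration; the point is that transformation happens in a staging step between extraction and loading:

```python
# Minimal ETL sketch: extract raw rows from two disparate (hypothetical)
# sources, transform and integrate them in staging, then load into the DW.

def extract():
    # In practice: query source OLTP systems, read files, call APIs, ...
    source_a = [{"date": "2023-01-05", "amt": "100.50"}]       # ISO dates, string amounts
    source_b = [{"sale_date": "06/01/2023", "amount": 250}]    # DD/MM/YYYY, numeric amounts
    return source_a, source_b

def transform(source_a, source_b):
    # Staging step: clean up types and integrate the two source formats
    # into one common shape.
    staged = []
    for row in source_a:
        staged.append({"sale_date": row["date"], "amount": float(row["amt"])})
    for row in source_b:
        d, m, y = row["sale_date"].split("/")
        staged.append({"sale_date": f"{y}-{m}-{d}", "amount": float(row["amount"])})
    return staged

def load(warehouse, staged):
    # In practice: bulk-insert into the DW; here, append to a list.
    warehouse.extend(staged)

warehouse = []
load(warehouse, transform(*extract()))
print(warehouse)
```

In ELT the order of the last two steps is swapped: `load` runs first on the raw rows, and the `transform` logic runs afterwards inside the warehouse itself.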
Data Warehouse can be accessed by end-users using Business Intelligence tools
(reporting frameworks, visualization), SQL clients, spreadsheets, etc.
Conclusion
In this era, a business cannot continue to thrive while relying on a
transactional (OLTP) database for business intelligence,
given the costs of doing so (see the Introduction). A DW provides the solution
to this challenge, and it can readily be accomplished by a team of data
professionals.
Moreover, as many businesses move to the cloud, they can now take advantage of
cloud-based DW and analysis tools such as Amazon Redshift (based on PostgreSQL), Microsoft Azure SQL Data Warehouse (based on SQL Server), Google BigQuery (ANSI SQL), Snowflake, etc.
Snowflake is a fully managed Massively Parallel Processing (MPP) cloud-based data warehouse that runs on
AWS, GCP, and Azure; unlike the others, it does not run on a cloud of its own.
It is now easier
than ever before to leverage cloud technologies in the DW space. Cloud solutions vary in pricing, scalability, architecture, security features, speed, etc.
You either join the
DW bandwagon or ship out.