In this article, We will take a look at AWS Redshift and understand it using a real-time example.
Table of Contents
- Data Warehousing
- What is Redshift?
- AWS Redshift Configuration
- What makes Redshift Special?
- Key Features of Redshift
- AWS Redshift vs AWS RDS
- A Real-Life Application of Redshift
A data warehouse is a central repository that stores the current and historical data. They are used for making many business intelligence decisions, analytical reporting, etc.
Data warehousing is the storage of data from all across an organization in a single place such that the data can be queried and analyzed easily.
What is Redshift?
AWS Redshift is a data warehousing solution offered by Amazon Web Services. Redshift is used when the data to be handled is very huge, usually petabytes and exabytes in size.
Redshift is an Online Analytics Processing System(OLAP) and column-oriented database. A column-oriented database is similar to a normal database but stores data by columns rather than rows. This allows for efficient querying of data.
It is built using a technology called Massive Parallel Processing(MPP) built by the company ParAccel(acquired by Actian) to manage large data sets and database migrations. Using this technique Redshift can perform operations on large data sets very quickly while also optimizing the database.
AWS Redshift Configuration
Redshift can have 2 types of nodes,
- Single Node
It consists of only one node of size 160GB.
A multi-node consists of several compute nodes and a leader node. The leader node receives queries from the user applications and then allocates the compute nodes for parallel execution of the query.
Once completed the client sends the results back to the leader node. The leader node aggregates the results and returns the final value to the client application. There can be up to 128 compute nodes.
Each Redshift data warehouse consists of a bunch of nodes organized a group called as Redshift cluster. Each cluster runs its Redshift engine and can contain one or many databases.
When you start Redshift, It starts with a single node of 160GB. Additional nodes can be added when you need multiprocessing.
What makes Redshift Special
Redshift is based on an older version of PostgreSQL (8.0.2). This makes Redshift compatible with regular SQL queries. Redshift is also superfast and can store petabytes of data compared to its competitors like Oracle, Teradata, Couchbase, etc.
Using Massive Parallel Processing, Redshift can run multiple queries simultaneously and very fast by making use of multiple processors running in parallel across multiple servers.
Column-based data storage is ideal for data processing and analytics since a lot of the time it involves performing aggregates over a large amount of data. Column-based data involves fewer I/O thus significantly improving performance.
Columular data is stored sequentially on the disk and can be compressed very easily compared to row-based data. While loading the data, Redshift automatically samples the data and applies appropriate compression techniques to it.
Key Features of Redshift
There key features of AWS Redshift are
AWS Redshift is 10 times faster than most of the available data warehouse software. It uses column-based data storage, compression, optimized hardware to deliver unmatched performance.
2. Fully Managed
AWS Redshift automates processes such as managing, monitoring, backing up, and scaling your data warehouse. This can help reduce the complexities of managing an onsite data warehouse.
Redshift can be deployed with a few clicks from the AWS console and it automatically handles the infrastructure for you.
Data stored in Redshift can be encrypted using AWS Key Management Service(KMS) or Hardware Security Module(HSM). Additionally, you can also use SSL to secure your data. These security features make Redshift useful for storing sensitive information.
Redshift uses a pay-as-you-go model which means you pay only for what you use. This is more economical than an on-premises data warehouse which needs to be managed separately.
Redshift costs start at $0.25 per hour with no commitment and can go up to $250 per Terabyte per year.
AWS Redshift can automatically scale up or down to add or remove more nodes as needed with a few clicks or an API call. It can also quickly run queries on exabytes of data stored in your AWS S3 without the need for loading.
6. Query S3 Data Lake
A data lake is simply a place to put all your data such that it can be queried and analyzed. S3 can be used as a data lake. Using Redshift you can perform analysis on the S3 data lake without loading the data. This allows you to work on data both in your data warehouse and your data lake together using a single service.
AWS Redshift vs RDS
AWS Redshift and Relational Database Service (RDS) sometimes get mistaken for one another. But both serve different purposes and both can be used simultaneously.
RDS is a cloud relational database. It is backed by the AWS cloud and can scale based on workload. It has a maximum database size of 64TB. It is a cloud version of your on-premise database.
Redshift is a data warehousing service that allows you to store large amounts of data from different sources and perform analysis on them. It starts at 160GB in size and can scale massively in clusters.
A Real-Life Application of Redshift
An example of Redshift being used in real-time is the very popular website Coursera, which offers many professional courses and certifications. It has a user base of more than 30 Million monthly users.
It collects terabytes of data about the courses and users through API calls from its mobile apps, website, etc. All this data is stored in a central warehouse using Redshift.
The engineering team can query the warehouse to build analytics dashboards to gain intuition about the courses. This data can also be used by the machine learning systems to recommend courses to new users, study which courses are completed most time, student dropouts, etc.
With this article at OpenGenus, you must have the complete idea of AWS Redshift in System Design.