In this article, we will understand the concept of Data Warehousing in Data Science and get an idea of some tools used for data warehousing.
Table of contents
- What is Data Warehousing?
- Tools used for data warehousing
- Google BigQuery
- MS SSIS
- IBM Db2 Warehouse
- Amazon Redshift
What is Data Warehousing?
A data warehouse is a relational database that is designed for query and analysis. It is used for aggregating data from many data sources into a single consistent format. It facilitates online analytical processing (OLAP). It is used to avoid slowdowns in transaction response time rather than transaction processing. The main advantage of a data warehouse is that it aids in analyzing data to gain better understanding of the business and to also to devise strategies to improve business. Now, you may wonder that this seems similar to big data analytics. The point of difference is that big data is a technology whereas data warehousing is a concept. It is still relevant to companies which have not adopted and implemented big data analytics due to tight finances as big data technologies are expensive. It is important to keep in our mind that even though the field of big data is growing, it is not yet time to close doors on good old data warehousing.
Let us understand data warehouse through an example. Suppose you are the general manager of a firm with 3 branches in different locations and you want the daily report of the respective branch from the branch manager. One branch manager might send the report to you as a PDF, other might send it as a word document and the last one might send it as a paper note. Now, it would be difficult for you as a general manager to compile all the data and find insights with data in various formats and it would help you if all the information are in a single place. This is where a data warehouse comes into play. It can be used to aggregate all these information into a single format.
Tools used for data warehousing
Now that we have understood what data warehousing is, we will take a look at different tools used for data warehousing:
- Google BigQuery
- MS SSIS
- IBM Db2 Warehouse
- Amazon Redshift
Google BigQuery is a cloud based data warehousing tool that does not break our credit cards. Its affordable and have some features like integrating with machine learning platforms like Cloud ML and TensorFlow. It supports analysis too, with its features like multicloud data analysis and geospatial analysis and has separate products that focuses on these like BigQuery Omni and BigQuery GIS. BigQuery supports querying using ANSI SQL and allows us to perform analysis over petabytes of data.
It has separated storage and compute that lets us manage the cost and availability of the resource. Using the Storage API, one can run large data science workloads on BigQuery as it doesn't need to have multiple copies of the same data.
It stands for Microsoft SQL Server Integration Services, which is a part of the Microsoft SQL Server database software. It is used to perform ETL (extract, transform and load) operations and can handle data from a wide variety of sources in the same package. It is integrated with SQL server and Microsoft Visual Studio. SSIS also contain graphical tools for building packages and a catalog database to store, and manage these packages. Another advantage is that in SSIS, data can be loaded in a parallel manner at various locations and also to various destinations.
IBM Db2 Warehouse
IBM Db2 Warehouse is a serverless analytics data warehouse and can be deployed on the cloud or on the premises. It is a Relational database management system with Db2 and Oracle PL/SQL compatibility this platform is suited for AI applications. It also supports Python along with SQL and facilitates the training and deployment of ML models within the ecosystem. It is client-based and offers automated scaling. Depending on the number of worker nodes selected, it automatically creates the required data warehouse environment. Db2 Warehouse offers various pricing plans for different tiers.
Amazon Redshift is another cost effective and simple data warehouse tool that can analyze almost every type of data. It is a cloud-based, and can manage petabyte-scale data. It continuously monitors the health of the cluster and allows us to scale our clusters. Also, Redshift has no up-front installation costs. Querying is done with the help of standard SQL. It monitors the workload ans uses algorithms to automatically improve the physical layout of data which enables queries to be executed faster. It also gives us the flexibility of running queries either within the console or tools like Tableau, QueryBook and other data science or SQL client tools. We can either opt for on-demand pricing plan of Redshift, which is paying by the hour with no long term commitments or choose a long term plan that fits our needs.
Snowflake is a pure cloud warehousing tool built on top of the cloud infrastructure of Microsoft Azure or Amazon web Services. The storage and compute are logically integrated but physically separated and hence can be expanded or contracted based on the requirements. This makes Snowflake more adaptable. Its multi-cluster, shared architecture allows us to clone databases and tables easily without occupying any extra space. Unlike other data warehousing tools that charge based on the amount of data processed, Snowflake has a per-second billing with a minimum of 60 seconds. The price varies according the geographic location and the pricing tier we choose.
Teradata is another cloud based data warehousing tool used by most of the competetive enterprise organizations. It provides paralled querying structure and makes analyzing data simple yet efficient. It is a relational database management system that uses SQL to connect the data warehouse to various analytical tools; both open-source and commercial. One of the eye-catching feature in Teradata is that it separates data into hot and cold data, where hot data is the frequently used data and cold data is the less frequently used data. It has a pay-as-you-go policy but does not tell its pricing.
ProgestreSQL is an open-source and a stable database management system. It supports both SQL and JSON querying and is backed up by more than twenty years of community development. It can serve as a low cost and efficient data warehousing tool and can be used as a primary database by large enterprises. It is a variant of SQL server and is a feature-rich database that can handle complex queries. After intensive analysis and certain tweaks, ProgestreSQL can perfom well in both OLAP and OLTP systems. It can be used for various purposes like working with geospatial data, web and mobile application. Since it is open-source, it is totally free of cost to use.
With this article at OpenGenus, you must have the complete idea of Data Warehousing.