This article aims to:
- Give a brief overview of the PostgreSQL relational database management system
- Provide some brief comparisons to other alternatives, such as MySQL
- Illustrate useful features in PostgreSQL
- Discuss certain systems in which PostgreSQL proves more useful
So, what is PostgreSQL?
To talk about that, we first need to talk SQL and Relational Database Management Systems.
SQL and Relational Database Management Systems
What is SQL?
SQL stands for Structured Query Language. It is a language standard for interacting with databases through data storage, retrieval, and modification. The SQL language has a set of standard functions, such as SELECT, INSERT, and DELETE.
How, then, can we interact with data using SQL? Or create tables in a database?
We use Relational Database Management Systems.
A Relational Database Management System (RDBMS) is a piece of software that manages and modifies database relations in terms of tables. In other words, it is the intermediary software between applications and databases that writes, reads, and modifies data organized into tables.
PostgreSQL is a type of RDBMS. Its documentation claims many features, such as:
- Support for the usual data types (i.e. TEXT, INTEGER, etc), as well as more advanced data types such as arrays and custom data types.
- Object-relational principles, including the ability to inherit characteristics from tables.
- Support for NoSQL queries in addition to relational queries, i.e. JSON queries.
- Support for Multiversion Concurrency Control (MVCC), guaranteeing that reading data doesn't conflict with writing data, even between concurrent operations.
- Extensibility, including the ability to define your own functions.
PostgreSQL vs Other RDBMS
If you're reading this article, you're probably already familiar with other relational database management systems, such as MySQL, SQL Server, etc. So what are some of the ways PostgreSQL is different from other database management systems?
For starters, PostgreSQL is free and open-source. Microsoft's SQL Server and Oracle DB, for instance, are closed-source and require commercial licenses. PostgreSQL also provides support for advanced data structures like arrays. In addition, PostgreSQL seems to be one of the few relational database management systems that offers support for non-relational queries.
Furthermore, due to the nature of its object-relational principles, PostgreSQL also contains many useful object-oriented features, such as the ability for a table to inherit from another table. This makes it easy for any programmer to, for instance, create a child table from a parent table without going through the trouble of creating near-duplciate tables. This feature is not available, for instance, in MySQL.
Examples of PostgreSQL's Features
So what are some of the ways programmers can utilize some of these features?
Here, we will take a look at two particularly powerful features: inheritance, and user-defined types.
In programming languages such as Java and Python, programmers may find it intuitive to make some classes inherit properties from others. For instance, the subclass "Laptop" could inherit methods from a superclass "Computer", or something along those lines.
PostgreSQL's inheritance takes this intuition and applies it in database programming. Suppose, for instance, you wanted to store a directory of computer products, along with their details. We know that laptop computers would have similar fields to other computers, such as price and os. However, a laptop computer has additional necesary fields to take into account, such as its battery life.
Without inheritance, you would have the following scenario:
-- create CREATE TABLE computers ( id INTEGER PRIMARY KEY, -- product ID name TEXT NOT NULL, -- name of product price DOUBLE PRECISION NOT NULL, -- price of product os TEXT NOT NULL -- operating system ); CREATE TABLE laptops ( id INTEGER PRIMARY KEY, -- product ID name TEXT NOT NULL, -- name of product price DOUBLE PRECISION NOT NULL, -- price of product os TEXT NOT NULL, -- operating system life INTEGER NOT NULL -- battery life (hours) ); -- insert -- Dell Optiplex, $459.99, Windows OS INSERT INTO computers VALUES(1, 'Dell Optiplex', 459.99, 'Windows'); -- Lenovo Ideapad, $359.99, Windows OS, 9 hour battery INSERT INTO laptops VALUES(2, 'Lenovo Ideapad', 359.99, 'Windows', 9); -- Macbook Air, $999.99, Apple OS, 14 hour battery INSERT INTO laptops VALUES(3, 'Macbook Air', 999.99, 'Apple OS', 14); -- iMac, $1,399.99, Apple OS INSERT INTO computers VALUES(4, 'iMac', 1399.99, 'Apple OS'); -- retrieve SELECT id, name, price FROM computers WHERE price < 500 UNION SELECT id, name, price FROM laptops WHERE price < 500;
Even just looking at table creation, the statements become messy as there are duplicate column names. In addition, suppose you wanted to find all computers, laptops or not, with prices below $500.00. The lines marked under "retrieve" are beginning to get convoltued as well, featuring the UNION clause. The output is what we expect:
id | name | price ----+----------------+-------- 2 | Lenovo Ideapad | 359.99 1 | Dell Optiplex | 459.99
The problem is, what if there were more tables that would intuitively inherit properties from another? What if, for instance, what if our hypothetical computer store expanded to general electronics? Then we'd have subtables such as "television," "laptops," "desktops," "phones," "tablets," and so on and so forth, requiring the use of more UNION statements!
Inheritance makes the process simpler. Since we intuitively say that laptops inherit from computers, we can construct our SQL as follows:
-- create CREATE TABLE computers ( id INTEGER PRIMARY KEY, -- product ID name TEXT NOT NULL, -- name of product price DOUBLE PRECISION NOT NULL, -- price of product os TEXT NOT NULL -- operating system ); CREATE TABLE laptops ( life INTEGER NOT NULL -- battery life (hours) ) INHERITS (computers); -- insert -- Dell Optiplex, $459.99, Windows OS INSERT INTO computers VALUES(1, 'Dell Optiplex', 459.99, 'Windows'); -- Lenovo Ideapad, $359.99, Windows OS, 9 hour battery INSERT INTO laptops VALUES(2, 'Lenovo Ideapad', 359.99, 'Windows', 9); -- Macbook Air, $999.99, Apple OS, 14 hour battery INSERT INTO laptops VALUES(3, 'Macbook Air', 999.99, 'Apple OS', 14); -- iMac, $1,399.99, Apple OS INSERT INTO computers VALUES(4, 'iMac', 1399.99, 'Apple OS'); -- retrieve SELECT id, name, price FROM computers WHERE price < 500; SELECT id, name, price FROM ONLY computers WHERE price < 500;
Not only does the code for table creation look significantly cleaner, but we can retrieve BOTH laptop and computer models below $500 in a single SQL statement:
id | name | price ----+----------------+-------- 1 | Dell Optiplex | 459.99 2 | Lenovo Ideapad | 359.99 (2 rows)
In addition, we can now pair our statements with the "ONLY" clause such that we can look in the parent table only without considering laptops. For instance, if we only wanted non-laptop computers with prices below $500, we could run the last statement from the code above, resulting in the following:
id | name | price ----+---------------+-------- 1 | Dell Optiplex | 459.99 (1 row)
- Flexible data options
In addition, PostgreSQL's ability to define data types makes the system very useful for programmers who need to store and query new kinds of data. PostgreSQL's documentation page for User-Defined Types provides complex numbers as an example for a user-defined type:
-- complex number --> a + (b)i CREATE TYPE complex AS ( a DOUBLE PRECISION, b DOUBLE PRECISION ); -- results from some experiment CREATE TABLE results ( trial INTEGER, result COMPLEX ); INSERT INTO results VALUES(1, (4, -9)); INSERT INTO results VALUES(2, (3, 24)); INSERT INTO results VALUES(3, (6, -12)); SELECT * FROM results;
trial | result -------+--------- 1 | (4,-9) 2 | (3,24) 3 | (6,-12) (3 rows)
As such, one can see the usefulness of PostgreSQL in scientific or statistical research that requires data processing on complex data types.
However, not all programmers may find PostgreSQL suitable for use in their applications. Due to its larger number of features, PostgreSQL may a steeper learning curve than some other relational database management systems, such as MySQL. This is paired with its tedious installation process, further steepening the learning curve.
System Use Cases
So what kinds of systems is PostgreSQL best suited for?
We can get a hint by taking a look at Stackshare, a website that showcases some of the technologies that major companies use. Some of the companies cited to have used (or are using) PostgreSQL include big names:
Instagram, for instance, wrote an article many years ago about how PostgreSQL is used to shard their databases (that is, splitting data across multiple servers and databases):
Our sharded system consists of several thousand ‘logical’ shards that are mapped in code to far fewer physical shards. Using this approach, we can start with just a few database servers, and eventually move to many more, simply by moving a set of logical shards from one database to another, without having to re-bucket any of our data. We used Postgres’ schemas feature to make this easy to script and administrate. Schemas (not to be confused with the SQL schema of an individual table) are a logical grouping feature in Postgres. Each Postgres DB can h2have several schemas, each of which can contain one or more tables. Table names must only be unique per-schema, not per-DB, and by default Postgres places everything in a schema named ‘public’.
As can be seen, Instagram cites PostgreSQL as an integral part of its solution to horizontally scaling their database.
Similarly, Spotify makes use of PostgreSQL and Cassandra (a NoSQL database management software) in horizontally scaling their databases. It states:
If the feature’s [i.e. playlist creation, follower list, etc.] data needs to be partitioned, then the squad has to implement the sharding themselves in their services, however many services rely on Cassandra doing full replicas of data between sites. Setting up a full storage cluster with replication and failover between sites is complicated so we are building infrastructure to setup and maintain multi site Cassandra or postgreSQL clusters as one unit.
What can be gleaned from this? PostgreSQL is useful for any service that wants to distribute its data among multiple servers for the sake of increased transactional performance, such as systems that rely on database sharding. This, of course, includes social media applications like Instagram or entertainment applications like Spotify.
Notably, among the companies Stackshare lists as having used PostgreSQL, Uber moved away from PostgreSQL and to MySQL. One of the biggest reasons they gave for doing so was that PostgreSQL's architecture was not efficient in dealing with database replication. By contrast, they noted that MySQL was better suited for the task:
In MySQL, only the primary index has a pointer to the on-disk offsets of rows. This has an important consequence when it comes to replication. The MySQL replication stream only needs to contain information about logical updates to rows. The replication updates are of the variety “Change the timestamp for row X from T_1 to T_2.” Replicas automatically infer any index changes that need to be made as the result of these statements.
By contrast, the Postgres replication stream contains physical changes, such as “At disk offset 8,382,491, write bytes XYZ.” With Postgres, every physical change made to the disk needs to be included in the WAL stream. Small logical changes (such as updating a timestamp) necessitate many on-disk changes: Postgres must insert the new tuple and update all indexes to point to that tuple. Thus, many changes will be put into the WAL stream. This design difference means that the MySQL replication binary log is significantly more compact than the PostgreSQL WAL stream.
Uber, however, is not the only company that acknolwedges the downsides to using PostgreSQL in database replication.
Interestingly, Twitch, a company that uses PostgreSQL as part of its backend architecture, also wrote about PostgreSQL's inefficient methods for database replication. Twitch notes that, while PostgreSQL does provide capabilities for database replication, its methods for doing so pose some problems:
The multi version concurrency control (MVCC) story on PostgreSQL is a little weak. Because the way that storage works, and perhaps other reasons, replicas must exactly match the master on disk. We used to see a lot of errors with the message,
canceling statement due to conflict with recovery, rather than getting a result set. This has not been much of an issue since aggressively limiting every role’s
setting hot_standby_feedback = onin the configuration.
As can be seen, applications that rely on database replication (i.e. to maintain and make copies of critical data, such as financial transactions) may find different database management systems, such as MySQL, to be more useful. Such applications could include financial transaction systems (i.e. banking systems) or other financially-based services (i.e. Uber).
Thus, we have discussed the following:
- PostgreSQL is one of many relational database management systems.
- PostgresQL features many handy features, such as table inheritance, user-defined types, and multiversion concurrency control.
- Many of PostgreSQL's features can be used to optimize many large and complex operations, but may pose a steep learning curve to database programmers.
- For systems that rely on distributing data across servers (i.e. database sharding), PostgreSQL is often the technology of choice.
- For systems that rely on replicating and storing data across servers (i.e. database replication), PostgreSQL may not be the best technology to use.