The database that refused to die: How Postgres survived its own creators

3 hours ago 2

FEATURE Today Postgres is one of the most widely used database systems, but its launch and subsequent development were inauspicious to say the least.

If it weren’t for a league of exceptionally devoted open source contributors, it probably would be another forgotten also-ran just like Ingres, the database system on which it was based (“Postgres” was shorthand for “Post-Ingres”).

The creator of both systems, Michael Stonebraker, is perhaps the preeminent database pioneer in the field. Earlier this month, he spoke at PGDay, a conference in Boston hosted by the U.S. PostgreSQL Association, where he detailed the complicated history of the open source database system, which actually existed long before the term "open source" was even uttered.

In a sense, “Postgres is the epitome of open source software, because it doesn't belong to anybody. It was picked up by this team of programmers without any specific affiliation,” Stonebraker said.

Stonebraker essentially abandoned Postgres in the mid-1990s. But instead of fading into obscurity, the codebase was salvaged by a fiercely-dedicated volunteer community that bolted on standard SQL while preserving Stonebraker’s revolutionary extensible architecture.

Three decades later, this stubbornly-independent database has become the bedrock of modern cloud infrastructure.

Data should be relational

When it comes to relational database systems, British computer scientist and then-IBM employee Ted Codd got the ball rolling in 1970. A database is where you store your data so it can be queried in a predictable way. A database system is the software that manages the database (don’t confuse the two).

That year, Codd decreed that all data should be stored in tables and accessed using a high-level query language. IBM implemented Codd’s idea in System R, and created SQL as the query language. The results were eventually rolled into IBM's DB2.

Stonebraker, then an assistant professor at UC Berkeley, also implemented Codd’s ideas. Stonebraker and his team of grad students created not only a working prototype, but a full-scale implementation – he later cofounded a startup, Relational Technology, to sell Ingres commercially. Ingres did not use SQL, but instead employed another query language, QUEL (Query Language), although the fundamentals were similar.

A relatively primitive version of Ingres was even released gratis for academic research. But by the early 1980s, Stonebraker had “pushed the code off a cliff” and started building something new.

Thus, Postgres was born.

Beyond Ingres: Postgres

At the time, Stonebraker explained, the business world was pushing for databases to hold additional data types beyond the integers, floats, and character strings required for basic business accounting. There was complicated CAD data and GIS data, with multiple data points that needed to be stored and reasoned against.

It was clear to Stonebraker and his colleagues that the ideal database system needed to be extended with more data types, user-defined data types, user-defined operators, and user-defined functions.

Adding more data types and such might seem simple enough, but the “devil is in the details,” he noted. “You need to be able to teach the query optimizer about new types, and that's not exactly easy.” Commutative rules had to be worked out, and they had to be optimized.

This led to what was probably Postgres’ most successful feature: support for abstract data types (ADTs).

Stonebraker had other ambitions for Postgres as well. He also wanted to incorporate new work from Chris Date on referential integrity, which brought “semantic consistency between foreign keys and primary keys” to the relational model. He wanted to add in a rules engine, which would continually monitor for changes and make decisions based on those changes. Also, he wanted crash recovery.

The crash recovery and the rules engine never quite worked out, but the ADTs took root, and now most database systems support this extensibility, pretty much exactly like they were devised by Stonebraker and Co. in 1983.

“We pretty much got it right,” he said. In fact, he reckons that his work on ADTs was probably the major reason he landed the Association for Computing Machinery’s 2014 A.M. Turing Award.

Stonebraker and his mates were eager to make money from their creation. So they rolled Postgres into a start-up, Illustra, which was eventually purchased by Informix, which promptly digested the technology into its own database server.

But they also maintained an open source version…barely. It wasn’t even called open source (which wasn’t a formal term until 1998). It was considered freely available academic software, something for fellow researchers to tinker with. And it was based on the very-permissive BSD license.

The architecture that refused to die

In 1995, two Berkeley graduate students, Andrew Yu and Jolly Chen, resurrected Postgres from the last 4.2 academic release. They jettisoned the poorly-running rules engine and disaster recovery features, and, most importantly, swapped out QUEL for the then industry standard of SQL, releasing the software as Postgre95 (and later PostgreSQL).

“I didn't know any of these people,” Stonebraker said of this all-volunteer development crew. They were “a collection of super programmers who picked up this open source project and started shepherding it forward, and they've been shepherding it for the last 30 years.”

This sovereignty made Postgres safe for anyone to use and modify. Postgres’ wire interface has been widely used as the base for building other database systems, including CockroachDB, YugoByteDB, and TimeScale. Amazon Web Services, Microsoft Azure, and Google Cloud each have their own database-as-a-service built on Postgres. Chief selling point? Each is fully Postgres compatible.

“The elephants have basically bet the ranch on Postgres,” he said.

Even AWS’ graph database service is built on Postgres (“the relational implementation of [a graph database] is almost always faster, usually substantially faster, than doing it natively,” Stonebraker quipped.)

Top of the heap

These days, Postgres sits near the top of the DB-Engines ranking of the world’s most popular database systems, just below Oracle, MySQL and Microsoft SQL Server. Unlike those rivals however, Postgres continues to steadfastly gain market share.

Tom Kincaid helped organize the PGDay meetup – and is a vice president of EDB, a Postgres service company. He offered several reasons why Postgres made such a big impact, despite its initial lack of support from any of the IT giants (unlike the fellow open source MySQL, now managed by Oracle, which many open sourcerers distrust for that reason alone).

Extensibility was a major help in adoption, especially as the role of databases expanded beyond basic business accounting. ADTs gave the database system an easy entry into an expanding geospatial market, and later, document databases.

“Postgres was quickly able to provide developers exactly what they needed for storing, retrieving and searching JSON documents,” Kincaid told The Register. “The fact that you could combine SQL with many different data types allowed it to thrive with every new trend in application development.”

Also helping was the quality of the codebase (“It is held to the highest standard of review,” Kincaid said) which attracted top developers, as did the quality of the optimizer. The permissive licensing also helped, allowing start-ups and project leaders to build derivative products without fear of legal repercussions.

Why Postgres still doesn’t have file-level encryption

Despite all the love from the open source community, Postgres is still missing features that it might need to maintain parity with commercial database systems.

This was the focus of another illuminating PGDay talk by long-time Postgres contributor (and always dapper) Bruce Momjian. He ran down a long list of missing features, most of which the development team are currently grappling with.

The database system could use 64-bit transaction IDs to accommodate very large databases. It could also use support for columnar storage, which is all the rage for large-scale data analysis. Global indexing, server-threading, internal connection pooling and sharding are also features in various stages of assembly.

The major feature Postgres currently lacks, however, is file-level encryption, or “transparent data encryption,” as it is called in the industry. TDE is supported by all commercial database vendors, and it is required by the latest Payment Card Industry (PCI DSS) specifications for storing financial transaction data. Currently, Postgres lets the operating system handle the encryption.

Current development on Postgres file-level encryption is stalled “in many ways,” Momjian said. “The code changes became too heavy for the value of the feature,” he said. Not only would the functions touching the data files themselves need modification, but all the other functions scattered through the system that write temporary files must be altered as well. This would be a “monstrous” job, he said.

Still, missing features allow commercial entities to fill in the gaps. Percona, for instance, offers the feature as part of its own Postgres commercial distribution.

Commercial database companies are very sensitive to customer requirements, whether those requirements are truly necessary in a practical or technical sense, or if they are merely external or regulatory in nature.

It’s the latter set of requirements that don’t make it to the top of the Postgres to-do list as quickly, Momjian said. “We don't want to add a feature unless it really has technical value,” he said.

Momjian pointed out that the PCI mandate itself also has questionable value purely from a technical view. Once the contents are copied into the server’s memory, the encryption protection vanishes. If an attacker can bypass a system’s file system permissions, they can probably read the raw working memory and get the encryption key.

“If we're trying to lock down the file system, we'd also have to lock down memory. We don't know how to do that,” he said.

But the missing TDE may not even be a bug at all, but an actual feature of Postgres's fundamental philosophy.

“While proprietary databases target the workloads of their largest customers, Postgres targets the workloads of general users,” he said.

And that may be the best kind of success for an open source project. ®

Read Entire Article