Transforming Your Database
What is a database? Once upon a time, it was simple. The database was a modern Bob Cratchit, dutifully putting data into tables made up of very straight columns, one row per entry: long, endless rectangles of information stretching on into the future.
The relational database has been the bedrock of modern computing. The vast majority of websites are just a bunch of CSS (Cascading Style Sheets, which control how elements are displayed on screen) painted like lipstick on top of SQL (Structured Query Language, the standard way to talk to a relational database).
Everything that makes us special is just another row in the big table of life.
The love affair with the big matrix of bits is slowly fading as developers are realizing that not everything fits into a simple table. And because developers are smart and obsessive about finding solutions for every need, they’ve started creating new and better places to store the information. The last few years have brought an explosion in other mechanisms for squirreling away our data.
Are these wonderful new options still databases? Does the data have to fit into some big matrix to be a database? Some like to use the word “data store” to differentiate the modern mechanisms because the word “database” is too tightly linked in our minds to the old tabular structure. We’ll leave that up to the philosophers. Data goes in and answers come out.
Here are eight ways that the database is being reinvented in new shapes and forms.
GPU Computing
Once upon a time, video cards were built to draw elaborate scenes for kids’ games, but now the so-called graphics processing units are doing plenty of non-graphical processing. Searching through data is just one of the best non-graphical operations for them to tackle. And why not? Plowing through endless piles of data looking for a match is an inherently parallel operation made up of lots of rudimentary jobs (testing equality) repeated millions of times. So it is pretty simple to turn the job over to the thousands of processors in the GPU.
The biggest wins aren’t in answering each query (though individual queries often run many times faster) but in the preparation work, because there is little need for preprocessing. Many databases save time by maintaining an index, which is effectively a precomputed answer to the most common searches.
If this index is corrupted or destroyed, rebuilding it can take hours, days, or maybe even months. If the data can fit inside a GPU’s memory, though, you can usually get by without the index. If the data is changing quickly and most of the index is never used, then skipping the preprocessing can be quite effective.
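To make the idea concrete, here is a minimal sketch of that index-free, brute-force scan using CuPy, a Python library for GPU arrays. The column of customer IDs and the target value are hypothetical; any CUDA-capable GPU with CuPy installed should run it.

```python
# A brute-force GPU scan: no index to build or maintain, just test every
# row in parallel. The column of customer IDs and the target are made up.
import cupy as cp

# Ten million IDs, resident in GPU memory.
customer_ids = cp.random.randint(0, 1_000_000, size=10_000_000, dtype=cp.int64)

def find_matches(column: cp.ndarray, target: int) -> cp.ndarray:
    """Return the row positions where column == target."""
    # One rudimentary equality test per element, fanned out across
    # the GPU's thousands of cores -- the parallel scan described above.
    mask = column == target
    return cp.nonzero(mask)[0]

rows = find_matches(customer_ids, 42)
print(rows.get())  # copy the (small) result back to the CPU
```

Because nothing is precomputed, newly arrived data is searchable the instant it lands in GPU memory.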
Non-Volatile Memory (NVRAM)
Programmers who cut their teeth 50 years ago had it easy. They didn’t have to juggle data between the RAM and the disk with elaborate protocols for ensuring consistency. That’s because the memory back then was magnetic core, which wasn’t erased when the power was turned off. Those good times may be back again soon because chip manufacturers are talking about replacing RAM with NVRAM, or non-volatile memory.
This is a big game changer for database programmers because one of their biggest challenges (and even their greatest reason for living) is disappearing. Some suggest that the databases can get much faster because the transaction semantics can be simpler. Others float the idea of building the recovery log after the data is written to the media, not before.
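For contrast, here is a toy sketch of the write-ahead discipline that volatile RAM imposes today, the sort of elaborate protocol NVRAM might render unnecessary. The log file name and record format are invented for illustration.

```python
# A toy write-ahead log: record the change durably *before* applying it,
# so a crash between the two steps can be replayed on restart.
import json
import os

LOG_PATH = "wal.log"
table = {}  # the in-memory copy, lost whenever the power goes out

def write(key: str, value: str) -> None:
    # Step 1: append the intent to the log and force it to stable media.
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())  # the expensive step NVRAM could remove
    # Step 2: only now is it safe to update the volatile copy.
    table[key] = value

def recover() -> dict:
    # After a crash, replay the log to rebuild the in-memory table.
    state = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                entry = json.loads(line)
                state[entry["key"]] = entry["value"]
    return state
```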
No one knows how the dust will settle. Will people still use a database at all if they don’t need a permanent record? Or will the searching and indexing keep them coming back? All of the algorithms and all of the architectures are up for rethinking. We’ll know the best way to use NVRAM in a decade or so.
Scale-out SQL
When the NoSQL movement began, one of the big features was the ability to spread your data storage across multiple nodes. NoSQL databases like Cassandra and MongoDB made it seem like getting all of the nice features of large-scale storage meant abandoning the comfortable world of SQL.
In reality, there doesn’t need to be a tradeoff. While the earliest experiments in large-scale databases were easier to create because they left behind all of the SQL baggage, there’s no reason why SQL can’t work well across multiple machines running at huge scale. Indeed, companies like Oracle have been doing it for years.
The newest large-scale databases let you use all of your SQL knowledge and convenience with a set of data spread out across a big cluster. CockroachDB, for instance, offers a standard SQL query engine that accesses data replicated across multiple nodes, all with ACID guarantees. Yes, you’ll pay for some of this belt-and-suspenders support for data consistency, but perhaps less than you’d expect.
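As a concrete illustration, CockroachDB speaks the PostgreSQL wire protocol, so a stock Python Postgres driver can talk to it. The connection string below assumes a local, insecure test cluster on the default port; the table and values are placeholders.

```python
# Ordinary SQL against a distributed store, via a standard Postgres driver.
import psycopg2

conn = psycopg2.connect(
    "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
)
with conn:  # commits the transaction on success, rolls back on error
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)"
        )
        # An ordinary ACID transaction, even though the rows may be
        # replicated across several nodes behind the scenes.
        cur.execute(
            "INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 250) "
            "ON CONFLICT (id) DO NOTHING"
        )
        cur.execute("SELECT id, balance FROM accounts WHERE balance > 150")
        print(cur.fetchall())
conn.close()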
If guaranteed consistency is important to your work, start by checking out stacks like CockroachDB, Google Cloud Spanner, Clustrix, Azure SQL, and NuoDB.
Geospatial Databases
Traditional databases are built for one-dimensional data sets, not the two-dimensional coordinates of geography. You can fake it and use a standard database to accomplish basic tasks with geographic coordinates. If you stick latitude and longitude in separate columns, it’s not hard to search for rows that fall within a box defined by a range of latitudes and longitudes. But once you want to go beyond this basic box, standard SQL queries just don’t cut it.
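Here is what that “fake it” approach looks like in practice, sketched with Python’s built-in SQLite standing in for any relational database; the places and coordinates are invented.

```python
# Plain latitude/longitude columns and a bounding box written as
# ordinary range predicates -- no geospatial support required.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE places (name TEXT, lat REAL, lon REAL)")
db.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [("Alpha", 40.7, -74.0), ("Beta", 41.9, -87.6), ("Gamma", 34.1, -118.2)],
)

# Everything inside a box spanning two degrees of latitude.
rows = db.execute(
    """SELECT name FROM places
       WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?""",
    (40.0, 42.0, -90.0, -70.0),
).fetchall()
print(rows)  # [('Alpha',), ('Beta',)]
```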
Geospatial databases add a few extra functions that make searching, sorting, and intersecting much easier in two-dimensional space. Spatial indices, for instance, often work by overlaying a grid or a tree structure (such as an R-tree) on the coordinate space, making it much faster to find rows that are adjacent in two-dimensional and three-dimensional worlds.
These indices make it possible to write queries with operations like “contains,” “overlaps,” and even “touches” on regions defined by polygons. All of this makes reasoning about the real world that much more efficient.
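A sketch of the same kind of question asked the geospatial way, with PostGIS’s spatial predicates doing the work. The credentials, table, and polygon are placeholders, and the example assumes a Postgres server with the PostGIS extension enabled.

```python
# A polygon (not just a box) and a spatial predicate instead of
# hand-rolled range math.
import psycopg2

conn = psycopg2.connect("dbname=gis user=postgres")  # placeholder credentials
with conn, conn.cursor() as cur:
    # Which places fall inside an arbitrary polygon? ST_Contains can use
    # a spatial index on the geometry column instead of scanning every row.
    cur.execute(
        """SELECT name FROM places
           WHERE ST_Contains(
               ST_GeomFromText(
                   'POLYGON((-90 40, -70 40, -70 42, -90 42, -90 40))', 4326),
               geom)"""
    )
    print(cur.fetchall())
```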
Check out Neo4j Spatial, GeoMesa, MapD, and PostGIS for some good places to begin.
Graph Databases
Tables are a good repository for many data structures, but they don’t do a great job of modeling one big, emerging data structure that has powered the last 10 years of Internet evolution: the network. As the so-called “social graph” explodes, we’re filling our computers with more and more nodes and the links between them.
And the connections between the nodes are often more important than the data in them. Sure, storing and retrieving one link between one pair of nodes is easy to do in a classic relational database, but more complicated queries quickly become impractical. Is Bob two or three hops away from Chris in the friendship network? Is Mary dating the ex of one of her friends?
Graph databases make queries like this easier to run. There is no endless fetching from tables because the query knows how to look in the neighborhood specified by the links. Tools like Neo4j, OrientDB, and DataStax are just a few of a list of options that can now barely be counted on two hands and two feet. They have their own query languages, too.
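Here, for example, is roughly what the hop question might look like in Cypher, Neo4j’s query language, via its official Python driver. The URI, credentials, and the Person/FRIEND schema are placeholders for illustration.

```python
# Counting hops between two people with a Cypher traversal.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Is Bob within three FRIEND hops of Chris? The traversal explores
    # the neighborhood around Bob instead of joining giant tables.
    result = session.run(
        """MATCH path = shortestPath(
               (a:Person {name: $a})-[:FRIEND*..3]-(b:Person {name: $b}))
           RETURN length(path) AS hops""",
        a="Bob", b="Chris",
    )
    record = result.single()
    print(record["hops"] if record else "not within three hops")

driver.close()
```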
Cloud Databases
One of the biggest changes lies in how we buy database software. In the past, we bought our own machines and signed licensing deals to run the software on our machines. Now the cloud companies are offering services that store blobs of data off somewhere that we can’t see or touch. They just say the data will be there when we want it.
The advantages are obvious. There is no need to maintain the server or the room holding it. There is no need to worry about licensing or configuration or installing patches. Someone else deals with all of those headaches. The solution is often cheaper too — especially if you don’t have a ton of data to store. The services usually charge by the byte.
But the dangers, if there are any, lurk in the shadows. Does someone else have access to the data? Is the server protected from power surges, lightning storms, or floods? Is the data backed up to a trustworthy offsite location? You’ve got to trust the cloud provider on everything.
Major cloud service providers Google, Microsoft, and Amazon offer a long list of database services. These days Oracle, MongoDB, and DataStax also make their databases available in the cloud.
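A small sketch of what renting storage by the byte feels like in code, using Amazon’s DynamoDB through the boto3 library; the table name and region are placeholders, and AWS credentials are assumed to be configured in the environment.

```python
# Writing to storage you never rack, patch, or license yourself.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("customers")  # a hypothetical existing table

# Write the item and trust the provider to keep it durable and available.
table.put_item(Item={"id": "42", "name": "Ada", "plan": "pro"})

response = table.get_item(Key={"id": "42"})
print(response.get("Item"))
```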
Artificial Intelligence (AI)
Some say that artificial intelligence is just a term for the latest generation of research now rolling out of the labs and into production. If so, there are a number of new products and solutions adorned with buzzwords like “machine learning,” “neural networks,” and “deep learning.” They may not seem like databases, but you fill them with data and ask them questions. Why not?
The good news about artificial intelligence solutions is that you don’t need to know exactly what you’re looking for. You can just wave your hand and ask for something nebulous like “most interesting” or “closest.” There is no need for the right key, the infernal reference number that the customer service folks are always asking you to write down.
The bad news is that you won’t know whether you’ve gotten the right answer, because you didn’t specify the question with any precision. Is that blog post really the most interesting? The biggest secret of Google’s success is that there is no absolute right answer. If you’re in the ballpark, no one can complain.
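Stripped to its essentials, a fuzzy “closest” query is just vector math. The toy sketch below ranks documents by cosine similarity; the four-dimensional embeddings stand in for whatever a real model would produce.

```python
# No exact key, only a similarity score: rank everything and take the top k.
import numpy as np

docs = ["intro to SQL", "GPU databases", "graph theory", "cooking pasta"]
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.7, 0.6, 0.1, 0.0],
    [0.1, 0.2, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9],
])

def closest(query_vec: np.ndarray, k: int = 2) -> list:
    """Rank documents by cosine similarity to the query vector."""
    sims = embeddings @ query_vec
    sims = sims / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

print(closest(np.array([0.8, 0.3, 0.0, 0.0])))  # a database-flavored query
```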
The list of machine learning toolkits is almost too long to contemplate. You can always ask your favorite search engine for the “most interesting” AI.
Blockchain
The word blockchain may be tangled up with the complicated economics and politics of Bitcoin, but underneath all of the talk about currency is an extremely stable and practical distributed data store. Everyone has a chance to update the data in the long table, and the big excitement is that everyone ends up sharing the same answers. It’s perfect for businesses that are frenemies.
Some developers take this a bit further and talk about “smart contracts,” which is another way of saying that the bits in the database are trustworthy enough for people to base legal matters like ownership on them. You can’t do that with a regular database, which can be tweaked by anyone with administrative privileges.
There are weak points, though. Each user must safeguard a private key, because every transaction must be digitally signed. If that key gets lost or forgotten, the data in those rows is frozen forever. If that key gets stolen, well, all bets are off. The blockchain isn’t perfect, in other words, but it’s much more tamper-resistant than the standard model.
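The kernel of that tamper-resistance can be sketched in a few lines: each block’s hash covers the previous block’s hash, so a tweak anywhere breaks every hash after it. Real blockchains layer digital signatures and a consensus protocol on top of this toy version.

```python
# A minimal hash-chained ledger: replaying the chain gives everyone the
# same answer, and any edit to history is immediately detectable.
import hashlib
import json

def block_hash(block: dict) -> str:
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

GENESIS = "0" * 64
chain = []
prev = GENESIS
for record in [{"from": "A", "to": "B", "amount": 5},
               {"from": "B", "to": "C", "amount": 2}]:
    block = {"prev": prev, "record": record}
    chain.append(block)
    prev = block_hash(block)

def verify(chain: list) -> bool:
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev:
            return False  # an earlier block was altered
        prev = block_hash(block)
    return True

print(verify(chain))                # True
chain[0]["record"]["amount"] = 500  # an "admin" quietly edits a row...
print(verify(chain))                # ...and every copy can tell: False
```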
R3, Ripple, and IBM are just three of the many competitors exploring the space. Many of the leading banks have their own internal projects. And then there are the Bitcoin and altcoin companies themselves, which are also big parts of the ecosystem.