Distributed data stores

While relational data stores handle relational datasets very efficiently, it soon became clear that they may not be the best fit for other types of data, namely semi-structured and unstructured data. Keeping relational data stores scalable at very high volumes of data storage and access also involved complicated processes and practices. These challenges were addressed by a range of distributed data stores, which emerged as distributed file systems and NoSQL (Not only SQL) data stores. Hadoop, through its Hadoop Distributed File System (HDFS), has been one of the most popular distributed file systems, while a number of NoSQL data stores have also emerged, each solving a very specific problem. NoSQL databases are generally built on similar concepts of distributed data management, but they can be further classified into the following categories:

Figure 12: NoSQL Data Stores Classification

Shown here is a broad classification of NoSQL data stores as they currently stand. Each of these types specializes in solving a particular problem related to data access and data management.

For instance, a key-value store could be most appropriate for capturing ticks or machine data, where the access requirement can be met through key-based lookups. Likewise, columnar storage provides a denormalized storage mechanism in which the data is stored as columns or families of columns instead of rows. This suits read-heavy use cases and is expected to support write-heavy scenarios as well. Document stores are mostly suited to storing an entire document against a key. Most often these documents are in JSON format, and these stores can store JSON as is and provide a JSON-friendly query engine. An index store is generally preferred where heavy search scenarios must be served across large datasets with sub-second response times, taking advantage of indexing capabilities.
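To make the four access patterns above more concrete, here is a minimal sketch that models each store type with plain in-memory Python structures. It does not use the API of any real NoSQL product; all names (`kv_store`, `column_store`, `doc_store`, `find_by_field`, `build_inverted_index`, `search`) and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Key-value store: an opaque value is written and read by its key only.
kv_store = {}
kv_store["sensor-17:2024-01-01T00:00:00"] = 21.5
reading = kv_store["sensor-17:2024-01-01T00:00:00"]

# Column-family store: row key -> column family -> column -> value.
# A read can fetch one family of columns without touching the others.
column_store = defaultdict(lambda: defaultdict(dict))
column_store["user-1"]["profile"]["name"] = "Ada"
column_store["user-1"]["profile"]["city"] = "London"
column_store["user-1"]["activity"]["last_login"] = "2024-01-01"
profile = dict(column_store["user-1"]["profile"])

# Document store: a whole JSON-like document is stored against a key,
# and queries can look inside the document's fields.
doc_store = {
    "order-1": {"customer": "Ada", "items": [{"sku": "A1", "qty": 2}]},
    "order-2": {"customer": "Bob", "items": [{"sku": "B7", "qty": 1}]},
}


def find_by_field(store, field, value):
    """Return the sorted keys of documents whose top-level field equals value."""
    return sorted(key for key, doc in store.items() if doc.get(field) == value)


# Index store: an inverted index maps each term to the documents containing it,
# so a search avoids scanning every document.
def build_inverted_index(texts):
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


texts = {
    "d1": "distributed data store",
    "d2": "relational data store",
    "d3": "distributed file system",
}
index = build_inverted_index(texts)
matches = search(index, "distributed data")
```

Real stores add distribution, persistence, and replication on top of these shapes. The sketch only shows how the unit of storage and the lookup path differ from one category to the next.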

Numerous books have been published on each of these stores, as each has a vast landscape of capabilities for large-scale data handling and management in an enterprise. With respect to Data Lakes, we will pick some of these data stores to demonstrate various aspects of the data serving layer in later chapters. In Chapter 5, Data Acquisition of Batch Data with Apache Sqoop, we will cover HBase, and in Chapter 10 we will cover Elasticsearch, as NoSQL stores used in our Data Lake.