Book: Data Lake for Enterprises
Authors: Tomcy John, Pankaj Misra
Updated: 2025-04-04 19:11:41

Batch layer

The batch layer is where raw data is stored as is, in the rawest format possible. Since nothing is omitted or transformed when the data is stored, many different use cases with different perspectives can be derived from it at different stages. This is also the store where master data is available in an immutable state and used by various analyses going forward. Since the data is immutable, updates and even deletes are forbidden operations. Data is always appended (added) with a timestamp, so that when some data is required, it can be queried by the highest timestamp to get the latest record. Deletes are forbidden because many analyses may still require the details of the deleted records.
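The append-only behaviour described above can be illustrated with a minimal sketch. It assumes a simple in-memory list as the store; the names `Record`, `MasterDataset`, `append`, `latest`, and `history`, as well as the sample customer data, are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """One immutable fact in the master dataset (hypothetical shape)."""
    key: str
    timestamp: int
    payload: Dict[str, Any]


class MasterDataset:
    """Append-only store: no update or delete methods are exposed."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, timestamp: int, payload: Dict[str, Any]) -> None:
        # Copy the payload so later caller-side mutation cannot alter history.
        self._records.append(Record(key, timestamp, dict(payload)))

    def latest(self, key: str) -> Optional[Record]:
        # The current state of a key is the record with the highest timestamp.
        matches = [r for r in self._records if r.key == key]
        return max(matches, key=lambda r: r.timestamp) if matches else None

    def history(self, key: str) -> List[Record]:
        # Older versions stay available for analyses that need them.
        return sorted((r for r in self._records if r.key == key),
                      key=lambda r: r.timestamp)


store = MasterDataset()
store.append("cust-1", 100, {"city": "Pune"})
store.append("cust-1", 200, {"city": "Dubai"})  # a change is a new record
latest = store.latest("cust-1")
```

Here a change of city is recorded as a second entry rather than an overwrite, so `latest` returns the newest record while `history` still exposes the earlier one.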

Queries run directly against raw data would take a lot of processing time. To avoid these delays, views that are closer to the required result format are generated periodically and stored; these are called batch views. Whenever a batch view is regenerated (taking in the data that has arrived since the last batch run), the old batch view is discarded. Because fault tolerance is one of the principles of this architecture, regenerating the batch view every time, even though it is really time consuming, removes the various errors that could have been introduced, as explained earlier. Different approaches can be used to make this processing take less time than conventional batches, which take hours or even days to complete.
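As a rough sketch of this regenerate-and-replace cycle, the following example recomputes a view from the full raw dataset on every run. The event shape, the `build_batch_view` function, and the choice of "total spend per customer" as the view are assumptions made for illustration, not something prescribed by the architecture.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (customer_id, timestamp, amount): a hypothetical raw event shape
Event = Tuple[str, int, float]


def build_batch_view(master: Iterable[Event]) -> Dict[str, float]:
    """Recompute total spend per customer from the full raw dataset."""
    totals: Dict[str, float] = defaultdict(float)
    for customer_id, _timestamp, amount in master:
        totals[customer_id] += amount
    return dict(totals)


master = [("c1", 1, 10.0), ("c2", 2, 5.0)]
batch_view = build_batch_view(master)

# New data arrives and is appended; nothing existing is modified.
master.append(("c1", 3, 2.5))

# The next batch run rebuilds the view from scratch and replaces the old one.
batch_view = build_batch_view(master)
```

Because each run starts from the raw data rather than patching the previous view, an error in an earlier view does not carry over into the next one.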

Figure 04: Lambda Architecture - batch layer

The persistence store for a batch layer catering to a very large amount of data must support high-scale random reads. However, it need not support random writes, as the data is bulk-loaded at a set frequency.

With respect to the single customer view use case, the following figure shows how the batch layer can be realized, by producing a so-called batch view (intermediate view) from the customer master dataset.

Figure 05: Single customer view - batch layer

For our use case of a single customer view, the customer data flows into the batch layer, where the master dataset is maintained; then, at a set batch process interval, batch views are created. The serving layer, when required, will query these views, merge them with the speed views, and return the results.
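The merge performed by the serving layer might look like the sketch below. It assumes that both views hold additive per-customer totals, so merging is a simple sum; the `merge_views` function and the sample values are hypothetical, and a real single customer view would likely need a richer merge rule.

```python
from typing import Dict


def merge_views(batch_view: Dict[str, float],
                speed_view: Dict[str, float]) -> Dict[str, float]:
    """Combine precomputed batch totals with increments not yet batched."""
    merged = dict(batch_view)  # leave the stored batch view untouched
    for customer_id, delta in speed_view.items():
        merged[customer_id] = merged.get(customer_id, 0.0) + delta
    return merged


batch_view = {"c1": 12.5, "c2": 5.0}   # built at the last batch run
speed_view = {"c1": 1.5, "c3": 4.0}    # events received since that run
result = merge_views(batch_view, speed_view)
```

A customer that appears only in the speed view (`c3` here) still shows up in the result, which is how recent data becomes visible before the next batch run.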