How a Data Lake works

In order to realize the benefits of a Data Lake, it is important to understand how a Data Lake is expected to work and which architectural components help build a fully functional one. Before we delve into the architectural details, let us understand the life cycle of data in the context of a Data Lake.

At a high level, the life cycle of data in a Data Lake may be summarized as shown here:

Figure 01: Data Lake life cycle

These can also be called the stages of data as it lives within the Data Lake. The data thus acquired can be processed and analyzed in various ways, either as a batch process or as a near-real-time process. A Data Lake implementation is expected to support both, as each pattern serves specific use cases. The choice between batch and near-real-time processing may also depend on the amount of processing or analysis to be performed: extremely elaborate operations may not be feasible within near-real-time expectations, while some business use cases cannot wait for long-running batch processes.

Likewise, the choice of storage depends on the requirements of data accessibility. For instance, if the data is expected to be accessed via SQL queries, the storage must support a SQL interface. If the requirement is to provide a data view, the data may be stored such that it can be exposed as a view, which allows for easy manageability and accessibility. A more prominent requirement in recent times is providing data as a service, which involves exposing data over a lightweight services layer, with each exposed service accurately describing and delivering its data. This mode also allows for service-based integration with systems that can consume data services.
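The SQL-interface and data-view requirements can be illustrated with an in-memory SQLite database. The table and view names here are invented for the example; the point is only that raw data stored behind a SQL interface can be exposed to consumers as a curated view without copying it.

```python
import sqlite3

# Hypothetical schema: raw acquired data lands in a table reachable via SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# A view exposes a managed, consumer-friendly shape of the same data.
conn.execute("""CREATE VIEW sales_by_region AS
                SELECT region, SUM(amount) AS total
                FROM raw_sales GROUP BY region""")

rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
```

A data-as-a-service layer would typically sit on top of such a view, serializing query results over a lightweight service endpoint rather than granting consumers direct storage access.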

As data flows into a Data Lake from the point of acquisition, its metadata is captured and managed along with data traceability, data lineage, and security aspects, based on data sensitivity, across its life cycle.

Data lineage is defined as the data's life cycle: its origins and where it moves over time. It describes what happens to the data as it goes through diverse processes. Lineage provides visibility into the data analytics pipeline and simplifies tracing errors back to their sources.
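One simple way to capture lineage is to carry a list of processing steps alongside each record as it moves through the pipeline. The record shape and step names below are hypothetical, a minimal sketch rather than a real lineage framework:

```python
import time

def with_lineage(record, source):
    """Wrap a newly acquired record with its first lineage entry (its origin)."""
    return {"data": record,
            "lineage": [{"step": "acquired", "source": source, "at": time.time()}]}

def apply_step(wrapped, step_name, fn):
    """Apply a transformation and append a lineage entry recording it."""
    return {"data": fn(wrapped["data"]),
            "lineage": wrapped["lineage"] + [{"step": step_name, "at": time.time()}]}

# Hypothetical flow: acquire a raw record, then cast one of its fields.
rec = with_lineage({"amount": "12.5"}, source="orders.csv")
rec = apply_step(rec, "cast_amount",
                 lambda d: {**d, "amount": float(d["amount"])})

steps = [entry["step"] for entry in rec["lineage"]]
```

If this record later fails a downstream check, the `lineage` list shows exactly which steps touched it and where it originated, which is the traceability property discussed next.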

Traceability is the ability to verify the history, location, or application of an item by means of documented recorded identification.

- Wikipedia