Data pull

Any mechanism used to pull data from the Data Lake can be considered a data pull mechanism. Here we discuss some of the most common ones.

  • Services: One of the most popular data delivery mechanisms is the data service. This consists of building web services (REST/SOAP) over the Data Lake so that the data is exposed to the consuming applications. It works very well for consuming relatively small volumes of data over HTTP for near-real-time application requirements. It also stems from the notion of data as a service, wherein all of the data is ready and available over services. Service request and response definitions must be concise and clearly defined so that they are generic enough for multiple consumers. This implicitly means that data access must be highly optimized to guarantee sub-second response times on large datasets, with the capability of random access. These services are geared towards read-only access to data and should not be used for data mutations.
  • Data Views: A Data Lake can also deliver data through data views, which applications connect to and fetch/pull data from. This mechanism of serving data has been very common because it combines simplicity with ease of maintenance and access. Once data is exposed through a data view, any authorized application can connect to it directly using standard drivers, and any additional data processing can be performed by the consuming application itself. These views are generally materialized views, which keeps them performant and isolates the underlying participating tables from query load. However, materialized views need data refreshes, known as refresh cycles, which can be done incrementally or may involve reconstructing the entire data view. If a refresh cycle rebuilds the entire materialized view, mechanisms must be in place so that data serving is not impacted while the refresh cycle executes. Traditionally this was done using synonyms; in some recent technologies, the same is achieved via replicated datasets.
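The full-rebuild refresh described for data views can be illustrated with a small sketch. It is not tied to any particular Data Lake product: it assumes SQLite (via Python's standard sqlite3 module) as a stand-in database, and because SQLite has neither materialized views nor synonyms, a snapshot table plays the role of the materialized view and an ordinary view plays the role of the synonym. The names sales_summary, refresh_view and read_totals are hypothetical and chosen only for illustration.

```python
import sqlite3


def refresh_view(conn, version, rows):
    """Rebuild the snapshot under a new name, then repoint the stable name at it.

    Consumers always query "sales_summary" (hypothetical name). The rebuild
    and the repointing happen in one transaction, so a failed refresh leaves
    the previous snapshot in place.
    """
    snapshot = f"sales_summary_v{int(version)}"  # int() keeps the identifier safe
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Full reconstruction into a brand-new table (the "materialized" data).
        cur.execute(
            f"CREATE TABLE {snapshot} (region TEXT PRIMARY KEY, total REAL)"
        )
        cur.executemany(f"INSERT INTO {snapshot} VALUES (?, ?)", rows)
        # Swap: the view acts like a synonym pointing at the newest snapshot.
        cur.execute("DROP VIEW IF EXISTS sales_summary")
        cur.execute(
            f"CREATE VIEW sales_summary AS SELECT region, total FROM {snapshot}"
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    return snapshot


def read_totals(conn):
    """What a consuming application would run: a query against the stable name."""
    query = "SELECT region, total FROM sales_summary ORDER BY region"
    return dict(conn.execute(query).fetchall())


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT/ROLLBACK statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

refresh_view(conn, 1, [("east", 100.0), ("west", 250.0)])
first = read_totals(conn)

refresh_view(conn, 2, [("east", 120.0), ("north", 40.0), ("west", 250.0)])
second = read_totals(conn)
```

The consuming query never changes between refresh cycles; only the object behind the stable name does. In a real deployment the old snapshot would be dropped once no reader depends on it, and the swap would use whatever the platform offers (a synonym, a view, or a replicated dataset) instead of the SQLite view assumed here.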