Tuesday, December 30, 2014

Basics: ETL/ELT Concepts with Big Data



Big Data and Hadoop analytics have been a big buzz in the IT industry, and you will often find catchy terms associated with them. But if we take a closer look, I think it would not be wrong to conclude that the Big Data community has derived some of these terms from concepts that have their roots in traditional data warehousing and ETL, and that have been around for decades.

We have observed that, over time, ETL/ELT has evolved to support integration across much more than traditional data warehouses: transactional systems, operational data stores, BI platforms, MDM hubs, the cloud, and Hadoop platforms.

Below are some terms you will come across when it comes to data processing on a Hadoop platform (these particular ones come from the Cascading framework, a Java API for building Hadoop data flows); each is listed with its corresponding concept in traditional data warehousing.

Tuple: This term defines the basic information record; it maps to one row in a physical table in an RDBMS, or to a record in a file.
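As a minimal sketch (assuming the Cascading Java API, where these terms originate; the field and value names below are my own):

    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;
    import cascading.tuple.TupleEntry;

    // one record, comparable to a single row in a customer table
    Fields fields = new Fields("cust_id", "cust_name", "city");
    Tuple row = new Tuple(1001, "Asha", "Pune");
    TupleEntry entry = new TupleEntry(fields, row); // the row plus its "column names"
    String city = entry.getString("city");          // access a value by field name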

Pipe Assembly: This is the chain of processing steps (pipes) that records flow through, not the records themselves; you can imagine it as the series of linked transformations in a mapping, while the data moving through it is the tuple stream described next.
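A hedged sketch of such a chain, reusing the customer fields from above (the filter pattern and pipe name are illustrative):

    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    // a small pipe assembly: keep only Pune customers, then group by id,
    // much like a Filter transformation followed by an Aggregator
    Pipe assembly = new Pipe("customers");
    assembly = new Each(assembly, new Fields("city"), new RegexFilter("Pune"));
    assembly = new GroupBy(assembly, new Fields("cust_id"));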

Tuple Stream: This is the group of records actually flowing through the processing steps. In any ETL tool, the source data is selected and then undergoes some processing or transformation; that operation takes place either in system memory or, in the case of pushdown optimization, in RDBMS spool space. Regardless of where the transformation is applied, it is basically the set of records under processing.
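To make this concrete, here is a sketch of a custom operation that consumes the tuple stream one record at a time, much the way an Expression transformation sees one row at a time (the class and field names are my own):

    import cascading.flow.FlowProcess;
    import cascading.operation.BaseOperation;
    import cascading.operation.Function;
    import cascading.operation.FunctionCall;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;

    // reads one tuple from the stream per call and emits an upper-cased copy
    public class UpperCase extends BaseOperation<Object> implements Function<Object> {
      public UpperCase() {
        super(1, new Fields("name_upper")); // one argument in, one field out
      }

      @Override
      public void operate(FlowProcess flowProcess, FunctionCall<Object> call) {
        String name = call.getArguments().getString(0);
        call.getOutputCollector().add(new Tuple(name.toUpperCase()));
      }
    }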

Taps: A generic component, independent of the underlying platform, that describes where data is read from or written to. The closest equivalent in Informatica or DataStage (or any other ETL tool) is a source or target definition/connection rather than a transformation step; the transformations themselves live in the pipe assembly described above.
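A sketch of a source and a sink tap (the HDFS paths and file layout are placeholders):

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    // source tap: a comma-delimited file on HDFS, like a flat-file source definition
    Tap source = new Hfs(
        new TextDelimited(new Fields("cust_id", "cust_name", "city"), ","),
        "hdfs://namenode/input/customers");

    // sink tap: where the processed records land, like a target definition
    Tap sink = new Hfs(
        new TextDelimited(new Fields("cust_id", "cust_name"), ","),
        "hdfs://namenode/output/pune_customers");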

Flow: A pipe assembly connected to its source and sink Taps: a complete, executable unit that reads the input, processes it, and stores the result into the target. In traditional ETL terms, this is a single mapping or job.
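Continuing the sketch, a Flow binds the pipe assembly from earlier to its source and sink taps and runs it (assuming the Hadoop flavor of the Cascading API):

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;

    // wire the assembly to its taps; this is the executable "mapping"
    FlowDef flowDef = FlowDef.flowDef()
        .setName("pune-customers")
        .addSource(assembly, source)
        .addTailSink(assembly, sink);

    Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
    flow.complete(); // blocks until the underlying MapReduce jobs finish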

Cascade: Finally, the term that made me go through a lot of tutorials to drill down into the science behind it. This is the traditional workflow concept: a collection of Flows executed in a designed order to produce or achieve some end value. In the traditional ETL paradigm, it is the workflow or job sequence that orchestrates multiple ETL mappings/jobs.
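As a sketch, two hypothetical flows, say stagingFlow and reportingFlow, can be chained into one workflow; the connector orders them by their data dependencies, much like dependent sessions in an Informatica workflow:

    import cascading.cascade.Cascade;
    import cascading.cascade.CascadeConnector;

    // collect the flows into a single workflow and run it end to end
    Cascade cascade = new CascadeConnector().connect(stagingFlow, reportingFlow);
    cascade.complete();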

I hope this helps people from a DWH/BI background get a quick grip on the concepts related to ETL in the Big Data domain.
