Big Data and Hadoop analytics have been big buzzwords in the IT industry, and you often find catchy terms associated with them. But if we take a closer look, I think it would not be wrong to conclude that the Big Data community has derived some terms that have their roots in traditional data warehousing and ETL implementations, and that have been around for decades.
We have observed that over time ETL/ELT is evolving to support integration across much more than traditional data warehouses. ETL can support integration across transactional systems, operational data stores, BI platforms, MDM hubs, the cloud, and Hadoop platforms.
Below are some terms you will come across in data processing on a Hadoop platform, each listed with its corresponding concept in traditional data warehousing. (These particular terms come from Cascading, an open-source Java framework for building data-processing applications on Hadoop.)
Tuple: This term is used to define the basic information record; it can be mapped to one row in a physical table in an RDBMS, or to a record in a file.
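If it helps to see the idea in code, here is a minimal sketch using the Cascading Java API (the field names and values are made up for illustration):

```java
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class TupleExample {
    public static void main(String[] args) {
        // A tuple is an ordered list of values, like one row in a table
        Tuple row = new Tuple("C1001", "John Doe", 2500.00);

        // Fields give the positions names, like column names in an RDBMS
        Fields columns = new Fields("cust_id", "cust_name", "balance");

        // A TupleEntry pairs the two so values can be accessed by name
        TupleEntry record = new TupleEntry(columns, row);
        System.out.println(record.getString("cust_name")); // John Doe
    }
}
```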
Pipe Assembly: It is defined as the chain of connected pipes that describes the processing to be applied to the records; you can imagine it as the series of transformation stages that a group of rows from a table or a file will pass through.
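A minimal sketch of a pipe assembly, again assuming the Cascading API; the order data, field names, and stages here are hypothetical, but the filter and aggregator pipes play the same roles as filter and aggregator transformations in an ETL mapping:

```java
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class PipeAssemblyExample {
    public static Pipe build() {
        // Head of the assembly; records enter here from the source tap
        Pipe assembly = new Pipe("orders");

        // Filter stage: keep only completed orders (like a filter transformation)
        assembly = new Each(assembly, new Fields("status"), new RegexFilter("COMPLETE"));

        // Aggregation stage: count orders per customer (like an aggregator stage)
        assembly = new GroupBy(assembly, new Fields("cust_id"));
        assembly = new Every(assembly, new Count(new Fields("order_count")));

        return assembly;
    }
}
```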
Tuple Stream: It is the group of records that is actually flowing through the data processing and transformation steps. In any ETL tool the source data is selected and then undergoes some processing or transformation; this operation takes place either in system memory or, in the case of pushdown optimization, in RDBMS spool space. Regardless of where the transformation is applied, it is basically the set of records under processing.
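To illustrate how records move through the stream, here is a sketch of a custom Cascading function; the TaxCalculator name and the 18% rate are purely hypothetical. Each incoming tuple is read from the stream, transformed in memory, and a new tuple is emitted back into the stream:

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class TaxCalculator extends BaseOperation implements Function {
    public TaxCalculator() {
        // expects one argument field, declares one output field
        super(1, new Fields("amount_with_tax"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall call) {
        // read the next record from the incoming tuple stream
        double amount = call.getArguments().getTuple().getDouble(0);

        // transform it in memory, then emit into the outgoing stream
        call.getOutputCollector().add(new Tuple(amount * 1.18));
    }
}
```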
Tap: A generic component, independent of the platform, that represents the physical data resource a job reads from or writes to (for example, a file on HDFS); it can be mapped to the source and target definitions in Informatica or DataStage (or any other ETL tool).
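A sketch of source and sink taps over HDFS files, assuming Cascading's Hfs tap; the paths and field names are hypothetical:

```java
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class TapExample {
    // Source tap: reads a tab-delimited HDFS file, like a source definition
    static Tap source = new Hfs(
        new TextDelimited(new Fields("cust_id", "status", "amount"), "\t"),
        "hdfs://namenode/data/orders.tsv");

    // Sink tap: writes the results out, like a target definition
    static Tap sink = new Hfs(
        new TextDelimited(new Fields("cust_id", "order_count"), "\t"),
        "hdfs://namenode/data/order_counts",
        SinkMode.REPLACE);
}
```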
Flow: It is the pipe assembly and the taps linked together to read data, process it, and store the result into the target; in the traditional paradigm this corresponds to a single ETL mapping.
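Continuing the sketches above, a flow simply binds the pipe assembly to its source and sink taps, much as an ETL mapping binds its transformations to source and target definitions:

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.tap.Tap;

public class FlowExample {
    public static Flow run(Tap source, Tap sink, Pipe assembly) {
        // Plan the flow against the Hadoop platform and execute it
        Flow flow = new HadoopFlowConnector().connect(source, sink, assembly);
        flow.complete(); // runs the underlying MapReduce job(s)
        return flow;
    }
}
```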
Cascade: Finally, the term that made me go through a lot of tutorials to drill down into the science behind it. This is the traditional workflow concept: it is defined as a collection of flows, or in the traditional ETL paradigm a set of ETL mappings/jobs, executed in a designed order to produce or achieve some value.
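A sketch of a cascade over two hypothetical flows (say, a staging flow and an aggregation flow); in Cascading, the connector works out the execution order from the taps the flows share, much like dependency-based scheduling in an ETL workflow:

```java
import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

public class CascadeExample {
    public static void run(Flow stageFlow, Flow aggregateFlow) {
        // Flows are topologically ordered by their shared source/sink taps
        Cascade cascade = new CascadeConnector().connect(stageFlow, aggregateFlow);
        cascade.complete(); // executes the whole workflow
    }
}
```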
I hope this helps people from a DWH/BI background get a grip quickly on the concepts related to ETL in the Big Data domain.