Tuesday, August 23, 2016

Big Data Infrastructure Challenges and How a Service-Oriented Architecture Can Overcome Them

The Big Data (Hadoop) market is forecast to grow at a compound annual growth rate (CAGR) of 58%, surpassing $16 billion by 2020.

Most organizations are stuck at the "Go" / "No Go" decision. The major hurdles/challenges are mainly the following:

1- Hadoop infrastructure (hardware, network, etc.) is cheap, but it is complicated to operate and maintain, especially as the cluster grows to 100-150 machines and beyond.

2- For non-social-media organizations (e.g. healthcare, insurance, banking or telecommunications), the use cases are often not large enough to justify a shift from an existing MPP platform (Teradata, Netezza, Oracle, etc.) to a distributed system.

3-  "Real" Hadoop Experts are hard to find and also do not come cheap along with another challenge that the learning curve for Big Data technologies is very steep.

4- Data security in a distributed file system still lacks many of the features that traditional RDBMS or MPP systems have long since matured.

5- Corporations have already spent large budgets purchasing, implementing and maintaining expensive DWH / data integration / data analytics solutions and cannot simply scrap them to try something new.

6- Business users and data analysts are reluctant to shift from SQL-based analysis (e.g. a traditional RDBMS or SAS) to a more programming-oriented approach (e.g. Apache Hive, Pig, Impala, Python), and given the steep learning curve it is very hard to replace the whole workforce with new Hadoop experts or retrain existing staff on the new technologies.

Building up the Big Data infrastructure in chunks, taking baby steps to put your test cases under a production workload, and evaluating the ROI is the key to overcoming these challenges. In short: "start simple", then "expand", and finally, once the ROI is proven, "shift".

Amazon Web Services offers a considerably cheaper, quicker and easier "service-oriented" architecture that addresses the above challenges and lets you focus on your data analytics and data integration, while AWS takes responsibility for the software/hardware setup, maintenance, storage and scaling; all of this can be achieved in days, not months. Following is a sample illustration.


Big Data architecture on AWS
Amazon S3: object-level storage; all of your log files, delimited data files, etc. can be stored here, used for data processing/integration, and later archived to the cheap storage vault Amazon Glacier.
Amazon DynamoDB: a robust, fully managed NoSQL database whose simple query model helps narrow the skills gap for complex/big data analysis.
Amazon Redshift: a scalable, petabyte-level structured storage solution that serves fast data analysis and ad-hoc reporting, backed by a distributed MPP engine (much like Hadoop or any other MPP platform).
And guess what? All these components are highly secure and highly available (accessible within an Amazon VPC, with VPN access if needed) and auto-scale, so no matter how small or large the workload becomes, the architecture stays the same and grows instantly and automatically with minimal effort.
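As a rough sketch of how these pieces fit together (the bucket, IAM role, table and column names below are made up purely for illustration), delimited files landed in S3 can be bulk-loaded into Redshift with a single COPY and then queried with plain SQL:

-- Hypothetical Redshift staging table, distributed and sorted for reporting
CREATE TABLE stg_web_events (
    event_id    BIGINT,
    user_id     BIGINT,
    event_type  VARCHAR(50),
    event_ts    TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (event_ts);

-- Bulk-load the delimited files staged in S3 (bucket and IAM role are placeholders)
COPY stg_web_events
FROM 's3://my-analytics-bucket/web-events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
DELIMITER ','
GZIP;

-- Ad-hoc analysis with standard SQL
SELECT event_type, COUNT(*) AS events
FROM stg_web_events
WHERE event_ts >= DATEADD(day, -7, GETDATE())
GROUP BY event_type;

Once the raw files have served their purpose, an S3 lifecycle rule can move them to Amazon Glacier for cheap long-term retention.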
So we suggest that corporations start with minimal investment, put their use cases to the test, evaluate the ROI, and, based on the results, gradually shift to Big Data environments.
Would love to hear your feedback; do share it if you like it.
Thanks, Tahir Aziz


Tuesday, August 16, 2016

Operational Data Analytics Using Big Data / Data Lake


One of the major challenges in fast-paced reporting and analytics is producing the desired reports/dashboards from near-real-time data so that decisions can be taken immediately.

The current use case we have is an organization (a healthcare provider) that would like visibility into its healthcare facilities: current or near-real-time (no more than 30 minutes old) statistics on patients, departments, hospital equipment usage, current equipment status, etc.

The challenge: a limited budget and very limited time. In most such cases, what the development team ends up doing is writing an "operational reporting query" directly against the OLTP system, and a couple of queries/dashboards like this start to impact the real production environment. It works well for a short period of time, but in the longer run it has a huge impact on production.

The solution we advise for such cases is to shift the heavy processing work to a cheap and capable candidate: Big Data. The revised workflow can look something like the following.






The main idea is to offload the production workload to a cheap Hadoop data lake environment, which can store the data and refresh it from the source every 30 minutes. The ad-hoc/operational reporting queries (SQL) can easily be translated to Hive (HQL), so with minimal changes you save a lot of effort and cost on production maintenance and, above all, even with a limited budget you can still meet critical project deadlines without impacting any OLTP system.
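As a rough illustration only (the database, table and column names below are hypothetical), an operational query written against the OLTP system can usually be ported to Hive with minor syntax changes once the 30-minute extracts land in the lake:

-- External table over the 30-minute extract files landed in HDFS (hypothetical layout)
CREATE EXTERNAL TABLE IF NOT EXISTS ods.equipment_status (
    equipment_id   STRING,
    department_id  STRING,
    status         STRING,
    status_ts      TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/equipment_status';

-- Operational dashboard query, ported almost verbatim from the OLTP SQL
SELECT department_id,
       status,
       COUNT(*) AS equipment_count
FROM   ods.equipment_status
WHERE  status_ts >= CAST(from_unixtime(unix_timestamp() - 30 * 60) AS TIMESTAMP)
GROUP BY department_id, status;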

Future perspectives of this approach:


  1. We can take advantage of the data lake for data reconciliation between different OLTP sources, e.g. claims or GL transactions.
  2. We can integrate data from multiple heterogeneous sources, e.g. business-maintained files on SharePoint combined with operational reports.
  3. It can serve as a staging area for the data warehouse environment, offloading the production access window so that near-real-time data can be pushed to the DWH.
  4. It helps ensure that the SLAs for dashboard deliveries to business stakeholders are met, for both the operational dashboards and the DWH-based ad-hoc/analytical dashboards.
  5. It can also take over some of the complex and time-consuming "ETL" transformations from the DWH or data mart: while pushing data to the EDW or data mart, Hadoop can apply those transformations and send an "enriched" version of the data set to the DWH/data mart (a small sketch follows below).
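To make point 5 concrete, here is a minimal Hive sketch (all table and column names are hypothetical, and the target table is assumed to already exist in the lake) of an enrichment step that runs in the data lake before the result is handed to the DWH/data mart load:

-- Hypothetical example: enrich claim transactions with department reference data
-- in the data lake, then push only the "enriched" result set to the DWH load process.
INSERT OVERWRITE TABLE lake.enriched_claims
SELECT c.claim_id,
       c.patient_id,
       d.department_name,
       c.claim_amount,
       c.claim_date
FROM   lake.claims c
JOIN   lake.department_ref d
  ON   c.department_id = d.department_id
WHERE  c.claim_date >= date_sub(current_date, 1);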











Thursday, June 9, 2016

Customizing Oracle ODI Knowledge Modules


Oracle ODI ships with a bundle of very useful and quite diverse knowledge modules, but a common need is to create a customized version of a knowledge module to get more control and gain more functionality. So how can we do that?

Problem Statement:
 
The scenario we are talking about today is a very simple one. If you are using any Integration Knowledge Module, it creates a flow table prefixed with I$_ and does all further processing from there onwards. But here is a flaw: if you kick off the same job, or any other job that loads the same target table, ODI will launch a new instance and mess up the currently running job.

For example, I am using "IKM Oracle Incremental Update", which creates a table prefixed with I$_. Instead, I wish to have it create the table with the current session number as a postfix, e.g. I$_<table>_<session number>.

The steps are as follows:
1 - Go to the project folder --> Knowledge Modules --> find "IKM Oracle Incremental Update".
2 - Right-click the selected module and click "Duplicate Selection".
3 - Give it a suitable name, e.g. "Custom_IKM Oracle Incremental Update".
4 - Double-click the module and click "Details" to see the list of steps for this module.
5 - You will find a step titled "Create flow table I$".
6 - Select it; the code looks like the following:

create table <%=odiRef.getTable("L", "INT_NAME", "W")%>
(
    <%=odiRef.getColList("", "[COL_NAME]\t\t[DEST_WRI_DT] NULL", ",\n\t", "", "")%>,
    IND_UPDATE        CHAR(1)
)
<%=odiRef.getUserExit("FLOW_TABLE_OPTIONS")%>

7 - Add the custom logic to append the session number to the table being created:

create table <%=odiRef.getTable("L", "INT_NAME", "W")%>_<%=odiRef.getSession("SESS_NO")%>
(
    <%=odiRef.getColList("", "[COL_NAME]\t\t[DEST_WRI_DT] NULL", ",\n\t", "", "")%>,
    IND_UPDATE        CHAR(1)
)
<%=odiRef.getUserExit("FLOW_TABLE_OPTIONS")%>

8 - Save your knowledge module and that is it.
9 - Use the custom knowledge module in your mapping.

You will see that the I$ table is now created with a session-number postfix, so any other instance of the same job will not disturb the running flow.
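For illustration only (the work schema, target table name and session number below are invented), the DDL generated at run time by the customized step would look roughly like this, followed by whatever the FLOW_TABLE_OPTIONS user exit resolves to:

create table ODI_WORK.I$_TARGET_CUSTOMERS_1234567
(
    CUSTOMER_ID        NUMBER(10)       NULL,
    CUSTOMER_NAME      VARCHAR2(100)    NULL,
    IND_UPDATE         CHAR(1)
)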

Similarly, you can add new steps, make them conditional, and add new options to customize the module further as per your requirements.

Hope that helps.

Thanks



 

Thursday, March 10, 2016

The Refined and Optimized Data Architecture for Enterprise Needs

We are all by now familiar with the hype around fancy terms like "Big Data", "Hadoop" and "NoSQL databases", but there is one principle behind all of that technology spend when we look at it from the enterprise perspective: how much value is added to the enterprise.

Believe it or not, at the end of the day IT services are a supporting platform that helps business users and stakeholders achieve business goals and forecast future projections. The main points can be shortlisted as follows:

Cost: typical EDW storage (usually an MPP solution, e.g. Teradata, Netezza, Oracle Exadata) is quite expensive and is also constrained when it comes to keeping historical data or unstructured/sensor data. The alternative is to store the "cold" data on cheap commodity hardware, e.g. Hadoop (a small sketch follows after these points).

Return on investment: replacing the entire EDW infrastructure and legacy systems with a new architecture and new technologies is quite an expensive idea; looking at the long-term prospects and the return on investment, the best approach is to slowly resolve the limitations of the existing EDW infrastructure and build a hybrid architecture that extracts the maximum value from it.

Linear scalability and long-term solution design: no doubt, a long-term solution design with a flexible architecture that can handle growing data volumes and new data types (social media, sensor, clickstream data) is a challenge, so the data architecture needs to be refined with these long-term prospects in mind.
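Returning to the Cost point, here is a minimal Hive sketch (the schema, table and path names are hypothetical) of how "cold" history exported from the EDW can sit on cheap commodity storage yet remain queryable:

-- Hypothetical example: expose archived EDW history on HDFS through a Hive external table
CREATE EXTERNAL TABLE archive.sales_history (
    sale_id       BIGINT,
    customer_id   BIGINT,
    sale_amount   DECIMAL(12,2),
    sale_date     DATE
)
PARTITIONED BY (sale_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/archive/sales_history';

-- Register an archived year exported from the EDW as delimited files
ALTER TABLE archive.sales_history ADD IF NOT EXISTS
PARTITION (sale_year = 2010) LOCATION '/archive/sales_history/sale_year=2010';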

Below is a fairly close hybrid solution design of a new data architecture for an enterprise.



The above is a reference from a white paper published by Hortonworks. The only argument, and this is my personal opinion, is with shifting all ETL to Hadoop. Instead, I believe we should keep it hybrid (at least for some time, running a parallel architecture): keep only the unstructured data feeds / sensor / clickstream detailed-data ETL work on the Hadoop platform, and let the traditional sources keep running on the existing infrastructure.

Reference  : http://info.hortonworks.com/rs/549-QAL-086/images/hortonworks-data-architecture-optimization.pdf?mkt_tok=3RkMMJWWfF9wsRonvKTKc%2B%2FhmjTEU5z16uQsWaeygYkz2EFye%2BLIHETpodcMTcVnMLDYDBceEJhqyQJxPr3AKNkNy9RxRhHqDg%3D%3D