Tuesday, August 23, 2016

Big Data Infrastructure Challenges and How a Service-Oriented Architecture Can Overcome Them

The Big Data (Hadoop) market is forecast to grow at a compound annual growth rate (CAGR) of 58%, surpassing $16 billion by 2020.

Most organizations are stuck at the "Go"/"No Go" decision stage. Having said that, the major hurdles/challenges are mainly the following:

1- Hadoop infrastructure (hardware, network, etc.) is cheap, but it is complicated to operate and maintain, especially as the cluster grows to 100-150 machines and beyond.

2- For non-social-media organizations (e.g. healthcare, insurance, banking or telecommunications), the use cases are often not large enough to really justify shifting from an existing MPP (Teradata, Netezza, Oracle, etc.) to a distributed system.

3-  "Real" Hadoop Experts are hard to find and also do not come cheap along with another challenge that the learning curve for Big Data technologies is very steep.

4- Data security in a distributed file system still lacks many features that traditional RDBMS or MPP systems have long since matured.

5- Corporates have already spent large budgets purchasing, implementing and maintaining expensive DWH/data integration/data analytics solutions and cannot simply scrap them to try something new.

6- Business users and data analysts are reluctant to shift from SQL-based analysis (e.g. traditional RDBMS or SAS) to a programming-oriented approach (e.g. Apache Hive, Pig, Impala, Python), and given the steep learning curve it is very hard to replace the existing workforce with new Hadoop experts or to retrain staff on the new technologies.

Building up the Big Data infrastructure in chunks, taking small steps to put your test cases under a production workload, and evaluating the ROI is the key to overcoming these challenges. The approach is to "start simple", then "expand", and finally, once the ROI is proven, "shift".

Amazon Web Services offers a comparatively cheap, quick and easy "service-oriented" architecture that takes care of the above challenges and helps you focus on your data analytics and data integration, while AWS takes responsibility for software/hardware setup, maintenance, storage and scaling. All of this can be achieved in days, not months. The following is a sample illustration:


Big Data Architecture on AWS
Amazon S3: object-level storage. All of your log files, delimited data files, etc. can be stored here, used for data processing/integration, and later archived in the cheap storage vault Amazon Glacier.
Amazon DynamoDB: a very strong, fully managed NoSQL database whose simple query interface helps close the skills gap for complex/big data analysis.
Amazon Redshift: a scalable, petabyte-scale structured storage solution that serves fast data analysis and ad hoc reporting, backed by a distributed MPP engine (comparable to Hadoop or any other MPP system).
And guess what? All these components are highly secure and highly available (e.g. inside an Amazon VPC, reachable over VPN), and they auto-scale, so no matter how small or large the workload becomes, the architecture remains the same and grows instantly and automatically with minimal effort.
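As a rough sketch of how these pieces fit together (the bucket name, Redshift endpoint, credentials and IAM role below are all hypothetical), a few lines of Python using boto3 and psycopg2 could land a delimited log file in S3, add a lifecycle rule that archives it to Glacier, and load it into Redshift for reporting:

    # Minimal sketch of the S3 -> Glacier -> Redshift flow described above.
    # All names, endpoints and credentials are placeholders.
    import boto3
    import psycopg2

    s3 = boto3.client("s3")

    # 1. Land a delimited log file in S3, the object store of the architecture.
    s3.upload_file("web_logs_2016-08-23.csv", "my-datalake-bucket",
                   "logs/web_logs_2016-08-23.csv")

    # 2. Archive older log objects to Glacier automatically via a lifecycle rule.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-datalake-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }]
        },
    )

    # 3. Load the file into Redshift for fast SQL analysis and ad hoc reporting.
    conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="admin", password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY web_logs
            FROM 's3://my-datalake-bucket/logs/web_logs_2016-08-23.csv'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
            CSV IGNOREHEADER 1;
        """)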
So, we suggest corporates start with a minimal investment, put their use cases to the test, evaluate the ROI, and with the results slowly shift to Big Data environments.
Would love to hear your feedback. Please share if you like it.
Thanks, Tahir Aziz

Tuesday, August 16, 2016

Operational Data Analytics using Big Data/ Data Lake


One of the major challenges in fast-paced reporting and analytics is producing the desired reports/dashboards from near-real-time data so that immediate decisions can be taken.

The current use case we have is an organization (a healthcare provider) that would like visibility into its healthcare facilities through current or near-real-time (no more than 30 minutes old) statistics on patients, departments, hospital equipment usage, current equipment status, etc.

The challenge: a limited budget and very limited time. In most cases, the development team ends up writing "operational reporting queries" directly against the OLTP system, and a couple of such queries/dashboards end up impacting the real production environment. This works well for a short period of time, but in the longer run it has a huge impact on the production environment.

The solution we advise is to handle such cases by shifting the "heavy processing work" to a cheap and capable candidate: Big Data. The revised workflow can look something like the one below.

(Workflow diagram: OLTP sources → scheduled extract every 30 minutes → Hadoop data lake / Hive → operational reports and dashboards)

The main idea is to offload the production workload to a cheap Hadoop data lake environment, which can store the data and refresh it from the source every 30 minutes. The ad hoc/operational reporting queries (SQL) can easily be translated to Hive (HQL), so with minimal changes you save a lot of effort and cost on production maintenance; above all, even with a limited budget you can still meet critical project deadlines without impacting any OLTP system.
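As a toy illustration (the host, database, table and partition names are hypothetical), an operational query of the kind described above translates to HiveQL almost verbatim and can be run against the data lake from Python, for example with PyHive:

    # Toy sketch: the same operational query that used to hit the OLTP system,
    # now served by Hive on a data lake refreshed every 30 minutes.
    # Host, database, table and column names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="datalake-edge-node", port=10000,
                           database="operational_lake")
    cursor = conn.cursor()

    cursor.execute("""
        SELECT department,
               COUNT(*) AS current_patients,
               SUM(CASE WHEN equipment_status = 'IN_USE' THEN 1 ELSE 0 END) AS equipment_in_use
        FROM   patient_activity
        WHERE  load_date = '2016-08-16'   -- hypothetical daily partition kept current by the 30-minute refresh
        GROUP  BY department
    """)

    for department, patients, equipment_in_use in cursor.fetchall():
        print(department, patients, equipment_in_use)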

Future perspectives on this approach:


  1. We can take advantage of the data lake for data reconciliation between different OLTP sources, e.g. claims or GL transactions.
  2. We can integrate data from multiple heterogeneous sources, e.g. business-maintained files on SharePoint, with the operational reports.
  3. The data lake can serve as a staging area for the data warehouse environment, offloading the production access window, while near-real-time data is pushed to the DWH.
  4. It helps meet the SLAs for dashboard deliveries to business stakeholders, covering both the operational dashboards and the DWH-based ad hoc/analytical ones.
  5. It can also take over some of the complex and time-consuming ETL transformations from the DWH or data marts: while pushing data to the EDW or data mart, Hadoop can apply those transformations and send an "enriched" version of the data set downstream (see the sketch after this list).
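As a minimal sketch of point 5 (all table names are hypothetical, and the target table is assumed to already exist in the lake), a Hive statement issued from Python can apply the heavy transformation inside the data lake and materialize an "enriched" table that the DWH/data mart load then picks up as-is:

    # Minimal sketch: offload a heavy ETL join/aggregation into the data lake
    # and write an "enriched" data set for the downstream DWH/data mart load.
    # Host, database and table names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="datalake-edge-node", port=10000,
                           database="operational_lake")
    cursor = conn.cursor()

    cursor.execute("""
        INSERT OVERWRITE TABLE enriched_claims
        SELECT c.claim_id,
               c.patient_id,
               d.department_name,
               SUM(c.claim_amount) AS total_claim_amount,
               COUNT(*)            AS line_items
        FROM   raw_claims c
        JOIN   dim_department d
          ON   c.department_id = d.department_id
        GROUP  BY c.claim_id, c.patient_id, d.department_name
    """)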