Tuesday, August 23, 2016

Big Data Infrastructure Challenges and How a Service-Oriented Architecture Can Overcome Them

The Big Data (Hadoop) market is forecast to grow at a compound annual growth rate (CAGR) of 58%, surpassing $16 billion by 2020.

Most organizations are stuck at the "Go" / "No Go" decision stage. That said, the major hurdles/challenges are mainly the following:

1- Hadoop infrastructure (hardware, network, etc.) is cheap, but it is complicated to operate and maintain, especially as the cluster grows beyond 100-150 machines.

2- For a non-social-media organization (e.g. healthcare, insurance, banking or telecommunications), the use cases are often not large enough to justify shifting from an existing MPP system (Teradata, Netezza, Oracle, etc.) to a distributed system.

3- "Real" Hadoop experts are hard to find and do not come cheap, and on top of that the learning curve for Big Data technologies is very steep.

4- Data security in a distributed file system still lacks many of the features that traditional RDBMS or MPP systems have matured over the years.

5- Corporations have already spent large budgets purchasing, maintaining and implementing expensive DWH / data integration / data analytics solutions and cannot simply scrap them to try something new.

6- Business users and data analysts are reluctant to shift from SQL-based analysis (e.g. traditional RDBMS or SAS) to a programming-oriented data analysis approach (e.g. Apache Hive, Pig, Impala, Python), and given the slow learning curve it is very hard to replace the existing workforce with new Hadoop experts or retrain existing staff on these technologies.

Building up the Big Data infrastructure in chunks, taking baby steps to put your test cases under a production workload and evaluating the ROI is the only way to overcome these challenges. The key is to "start simple", then "expand", and finally, once the ROI is proven, "shift".

Amazon Web Services offers a considerably cheaper, quicker and easier "service-oriented" architecture that takes care of the above challenges and lets you focus on your data analytics / data integration, while AWS takes on the responsibility of software/hardware setup, maintenance, storage and scaling; and all of that can be achieved in days, not months. The following is a sample illustration.


Big Data Architecture on AWS
Amazon S3: an object-level storage service; all of your log files, delimited data files, etc. can be stored here, used for data processing/integration, and later archived to the cheap storage vault Amazon Glacier (see the sketch below).
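To make that flow concrete, here is a minimal Python (boto3) sketch. The bucket name, prefix and file name are hypothetical placeholders for this example; it uploads a delimited log file and adds a lifecycle rule that archives that prefix to Glacier after 90 days.

```python
import boto3

s3 = boto3.client("s3")

# Upload a delimited log file into a "raw" prefix for later processing.
# Bucket, key and file name are made up for illustration.
s3.upload_file(
    "clickstream_2016-08-23.csv",
    "my-analytics-bucket",
    "raw/logs/clickstream_2016-08-23.csv",
)

# Age everything under raw/logs/ out to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-logs",
            "Filter": {"Prefix": "raw/logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```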
Amazon DynamoDB: a very strong NoSQL database whose simple query interface helps level the skills gap for complex/big data analysis (a small sketch follows).
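As a rough illustration of how little code that takes, here is a hypothetical sketch in Python (boto3); the table name, keys and attributes are made up, and the table is assumed to already exist with user_id as partition key and event_time as sort key.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("user_events")  # hypothetical table

# Store one event.
events.put_item(Item={
    "user_id": "u-1001",
    "event_time": "2016-08-23T10:15:00Z",
    "event_type": "login",
})

# Fetch all events for a user, newest first.
response = events.query(
    KeyConditionExpression=Key("user_id").eq("u-1001"),
    ScanIndexForward=False,
)
for item in response["Items"]:
    print(item["event_time"], item["event_type"])
```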
Amazon Redshift: a massively scalable, petabyte-level structured storage solution that serves fast data analysis and ad hoc reporting, backed by a distributed MPP engine (in the same spirit as Hadoop or any other MPP system); a short sketch follows.
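Here is a minimal sketch of the typical load-then-report pattern, assuming a hypothetical cluster endpoint, table, bucket and IAM role. Redshift speaks the PostgreSQL wire protocol, so plain psycopg2 and SQL are enough.

```python
import os
import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password=os.environ["REDSHIFT_PASSWORD"],
)
cur = conn.cursor()

# Bulk-load the delimited files staged in S3 into a Redshift table.
cur.execute("""
    COPY clickstream
    FROM 's3://my-analytics-bucket/raw/logs/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV;
""")
conn.commit()

# Ad hoc reporting query.
cur.execute("""
    SELECT event_type, COUNT(*) AS events
    FROM clickstream
    GROUP BY event_type
    ORDER BY events DESC;
""")
for event_type, count in cur.fetchall():
    print(event_type, count)

cur.close()
conn.close()
```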
And guess what? All these components are highly secure, highly available (accessible through an AWS VPC/VPN) and auto-scale, so no matter how small or large the workload becomes, the architecture stays the same and adapts instantly and automatically with minimal effort.
So, we suggest that corporations start with a minimal investment, put their use cases to the test, evaluate the ROI, and with results in hand gradually shift to Big Data environments.
Would love to hear your feedback, or share this post if you like it.
Thanks, Tahir Aziz
