Friday, August 7, 2020

IRI CoSort Experience with Talend

Some time ago I did an independent evaluation of IRI CoSort, benchmarked it against Talend, and demonstrated how overall performance can be improved by integrating CoSort within Talend.

Below is the outcome, published on IRI's official site.

https://www.iri.com/blog/data-transformation2/optimizing-talend-transforms-with-cosort/



Sunday, July 26, 2020

Python Data Structure Playbook- Part 1

If, like me, you have been working with Python on and off for some time, you will feel the need for a quick reference to the basic data structures supported by Python, with the most commonly used functions handy.

Here is a summary of the most commonly used data structures, with examples that can serve as a quick cheat sheet while working on projects.

1) String:

The most obvious data structure, supported by all programming languages, is the string. Like Java and C#, Python provides some very cool functions that make your life easy.

The example below defines a string and prints it, then tokenizes the string into a list using the "split" function with ":" as the delimiter.

Just one line of code gets you all the tokens you need for your string, and the resulting list can be accessed like any other list object.

Note: as in most programming languages, the index starts at 0, not 1, so the last line prints the first token and converts it to lower case.


Str = 'This:is:a:test'
print(Str)
token = Str.split(':')
print(token[0].lower())

This is the output when you run this in a Python editor.

This:is:a:test
this


2) List

The list is the most used data structure in Python; it can behave like a linked list, a stack, or a queue. It is very flexible and comes with a lot of functions.


SampleList=[12,13,45,67,18]
print(SampleList)
SampleList.append(17)
print(SampleList)
CopyLst = SampleList.copy()
CopyLst.reverse()
print(CopyLst)

cars=['ForD','BMW','VW','Toyotta']
cars.sort(reverse=False)
print(cars)
cars.sort(reverse=True)
print(cars)


Sample Output:

[12, 13, 45, 67, 18]
[12, 13, 45, 67, 18, 17]
[17, 18, 67, 45, 13, 12]
['BMW', 'ForD', 'Toyotta', 'VW']
['VW', 'Toyotta', 'ForD', 'BMW']


The example above defines a list, then adds one new element to it using the "append" function, which by default adds the element at the end. There is also "insert", which lets you add an element at a given position, and "remove" / "pop", which remove an element; a short sketch of these follows below.
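
Here is a minimal sketch of those functions, together with the stack- and queue-style usage mentioned at the start of this section (for heavy queue use, collections.deque is the more efficient choice); the variable names are just for illustration.

stack = [12, 13, 45]
stack.append(67)       # push onto the top of the stack
top = stack.pop()      # pop from the top -> 67

queue = [12, 13, 45]
queue.append(67)       # enqueue at the tail
first = queue.pop(0)   # dequeue from the head -> 12 (O(n); collections.deque does this in O(1))

nums = [10, 30]
nums.insert(1, 20)     # insert 20 at index 1 -> [10, 20, 30]
nums.remove(30)        # remove by value      -> [10, 20]
print(stack, queue, nums)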

Later we create a copy and reverse the order of the elements.

Another much-needed feature is sorting. The sort function takes two arguments: "reverse" is a boolean, and by passing True or False you get the data sorted in descending or ascending order; "key" lets you supply your own sort key function, and by default the natural ordering of the element type is used, which is why the strings in this example are sorted alphabetically.
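
A short sketch of the "key" argument using the same car list: first a case-insensitive sort, then a sort by string length.

cars = ['ForD', 'BMW', 'VW', 'Toyotta']

# Case-insensitive sort: each element is compared by its lower-cased value.
cars.sort(key=str.lower)
print(cars)        # ['BMW', 'ForD', 'Toyotta', 'VW']

# Sort by string length instead of alphabetical order.
cars.sort(key=len)
print(cars)        # ['VW', 'BMW', 'ForD', 'Toyotta']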

Loops are often used to iterate over a list, and in Python looping over a list is very easy; all you need is a simple line like the following. The list can also be sliced with a range: in this example we print the values from index 1 up to (but not including) index 3. It is a shortcut compared with standard programming loops, where you have to define an iterator, increment it, and write a terminating condition.


for element in cars[1:3]:
    print(element)
for i in range(0,len(cars)):
    print('Using Range',cars[i])

3) NumPy Arrays

One of the downsides of the standard list data structure is that it is not very efficient for mathematical operations on large data sets, or for matrix operations.

There are two key reasons why you would prefer NumPy arrays over the standard Python list data structure:

1) When you need to do matrix calculations or mathematical operations on large amounts of multidimensional data.

2) When performance is key and you are dealing with numerical data sets or database-like structures (row-column data).

One of the "secret" behind performance efficiency of NumPy array is that functions are optimized and written in C giving them the edge over standard Python functions.

Below is an example that initializes a simple NumPy array and shows how easy it is to get things done without any loop or complex logic, e.g. standard deviation, cumulative sum, square root of each element in the array, etc.

import numpy as np

numpysample = np.array([12, 11, 2, 36, 16, 25, 9])
print('Sum of all elements', numpysample.sum())
print('Multiply Array by 2 and then Sum', (numpysample * 2).sum())
print('Standard Deviation', numpysample.std())
print('Cumulative Sum', numpysample.cumsum())
print('Square Root of each element', np.sqrt(numpysample))

TwoDimArray = np.array([numpysample, numpysample * 2])
print('Two Dimensional Array', TwoDimArray)

print('Second column of first row', TwoDimArray[0, 1])


Sample Output:

Sum of all elements 111
Multiply Array by 2 and then Sum 222
Standard Deviation 10.466662333722402
Cumulative Sum [ 12  23  25  61  77 102 111]
Square Root of each element [3.46410162 3.31662479 1.41421356 6.         4.         5.
 3.        ]
Two Dimensional Array [[12 11  2 36 16 25  9]
 [24 22  4 72 32 50 18]]
Second column of first row 11


One of the basic requirements when working with a matrix is to know the rows/columns, i.e. the shape of the matrix, along with the most used operation, the transpose. In Python it is super easy; just see the sample code below.

print('Shape of Array', np.shape(TwoDimArray))
print('Transpose of Matrix', np.transpose(TwoDimArray))
print('Shape of Transposed Array', np.shape(np.transpose(TwoDimArray)))


Sample Output : 

Shape of Array (2, 7)
Transpose of Matrix [[12 24]
 [11 22]
 [ 2  4]
 [36 72]
 [16 32]
 [25 50]
 [ 9 18]]
Shape of Transposed Array (7, 2)

4) Structured NumPy Arrays

For years we have been working with relational data, so wouldn't it be cool if you could represent and process the same kind of data in Python? The answer is the structured NumPy array: it gives you a mechanism to represent data in row-column format with database-like data types.

Below is an example of an Employee-table-like schema defined as a Python structured array.

import numpy as np

customDataType = np.dtype([('Name', 'S55'), ('Age', 'i4'), ('Salary', 'f'), ('OverTimeEligible', 'b')])
SampleData = np.array([('John', 45, 156000, True),
                       ('Robert', 23, 80000, False),
                       ('Ken', 33, 100000, True),
                       ('Walker', 56, 180000, False)], dtype=customDataType)
print(SampleData)

print('Mean  Salary of Staff',SampleData['Salary'].mean())
print('Standard Deviation of  Salary of Staff',SampleData['Salary'].std())
csum = 0
for i in range(0, len(SampleData)):
    csum=csum+SampleData[i]['Salary']

print('Cum sum of all Employee Salary',csum)


Sample Output:

[(b'John', 45, 156000., 1) (b'Robert', 23,  80000., 0)
 (b'Ken', 33, 100000., 1) (b'Walker', 56, 180000., 0)]
Mean  Salary of Staff 129000.0
Standard Deviation of  Salary of Staff 40533.938
Cum sum of all Employee Salary 516000.0
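
As a side note, the explicit loop above is only for illustration; with the same SampleData array the totals come straight off the Salary column:

print('Total of all Employee Salary', SampleData['Salary'].sum())
print('Running total of Employee Salary', SampleData['Salary'].cumsum())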


Tuesday, August 23, 2016

Big Data Infrastructure Challenges and How a Service-Oriented Architecture Can Overcome Them

The Big Data (Hadoop) market is forecast to grow at a compound annual growth rate (CAGR) of 58%, surpassing $16 billion by 2020.

Most organizations are stuck at the "Go" / "No Go" decision. The major hurdles/challenges are mainly the following:

1- Hadoop infrastructure (hardware, network, etc.) is cheap but complicated to operate and maintain, especially as the cluster grows beyond 100-150 machines.

2- For non-social-media organizations (e.g. healthcare, insurance, banks or telecommunications), the use cases are often not large enough to justify shifting from an existing MPP (Teradata, Netezza, Oracle, etc.) to a distributed system.

3-  "Real" Hadoop Experts are hard to find and also do not come cheap along with another challenge that the learning curve for Big Data technologies is very steep.

4- Data security in a distributed file system lacks many features that traditional RDBMS or MPP systems have long since matured.

5- Corporations have already spent a lot of budget purchasing, maintaining and implementing expensive DWH / data integration / data analytics solutions, and cannot just scrap them to try something new.

6- Business users / data analysts are reluctant to shift from "SQL"-based analysis (e.g. traditional RDBMS or SAS) to a "programming"-oriented data analysis approach (e.g. Apache Hive, Pig, Impala, Python), and given the slow learning curve it is very hard to replace the whole workforce with new Hadoop experts or retrain existing staff on the new technologies.

Building up the Big Data infrastructure in chunks, taking baby steps to put your test cases under a production workload and evaluating the ROI is the only way to overcome these challenges. The key is to "start simple", then "expand", and finally, once the ROI is proven, "shift".

Amazon Web Services offers a considerably cheaper, quicker and easier "service-oriented" architecture that takes care of the above challenges and helps you focus on your data analytics / data integration, while taking over responsibility for the software/hardware setup, maintenance, storage and scaling; and all of that can be achieved in days, not months. Following is a sample illustration.


Big Data Architecture on AWS
Amazon S3: object-level storage. All of your log files, delimited data files, etc. can be stored here, used for data processing/integration, and later archived in the cheap storage vault "Amazon Glacier" (a minimal sketch of this follows below).
Amazon DynamoDB: a very strong NoSQL database with a SQL-like interface that narrows the skills gap for complex/big data analysis.
Amazon Redshift: a hugely scalable, petabyte-level structured storage solution serving fast data analysis and ad hoc reporting, backed by a distributed MPP engine (comparable to Hadoop or any other MPP).
And guess what? All these components are highly secure, highly available through AWS VPN, and auto-scale, so no matter how small or large the workload becomes, the architecture remains the same and expands instantly and automatically with minimal effort.
So, we suggest that corporations start with minimal investment, put their use cases to the test, evaluate the ROI, and with results in hand slowly shift to Big Data environments.
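
As a rough boto3 sketch of the S3-plus-Glacier piece described above (the bucket name, prefix and file name are hypothetical):

import boto3

s3 = boto3.client('s3')
bucket = 'my-data-lake-bucket'  # hypothetical bucket name

# Land a delimited file in S3 for downstream processing/integration.
s3.upload_file('daily_extract.csv', bucket, 'landing/daily_extract.csv')

# Age everything under the landing/ prefix out to Glacier after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-landing-to-glacier',
            'Filter': {'Prefix': 'landing/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
        }]
    },
)
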
Would love to hear your feedback; please share it if you like it.
Thanks, Tahir Aziz


Tuesday, August 16, 2016

Operational Data Analytics using Big Data/ Data Lake


One of the major challenges of fast-paced reporting and analytical needs is producing the desired reports/dashboards from near-real-time data so that immediate decisions can be taken.

The current use case is an organization (a healthcare provider) that would like visibility into its healthcare facilities: current or near-real-time (no more than 30 minutes old) statistics on patients, departments, hospital equipment usage, current equipment status, etc.

The challenge: limited budget and very limited time. In most cases the development team ends up writing "operational reporting queries" directly against the OLTP system, and a couple of queries/dashboards like this impact the real production environment. It works for a short period of time, but in the longer run it has a huge impact on the production environment.

The solution we advise for such cases is shifting the heavy processing work to a cheap and strong candidate: Big Data. The revised workflow can look something like the below.






The main idea is to offload the production workload to a cheap Hadoop data lake environment, which can store the data and refresh it from the source every 30 minutes. The ad hoc/operational reporting queries (SQL) can easily be translated to Hive (HQL), so with minimal changes you save a lot of effort and cost on production maintenance; above all, even with a limited budget you are still able to meet critical project deadlines without impacting any OLTP system.
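
As a rough sketch of what that SQL-to-HQL shift can look like from a client's point of view (the host, database, table and column names here are hypothetical, and PyHive is just one of several ways to reach HiveServer2):

from pyhive import hive

# Connect to the data lake's HiveServer2 instead of the OLTP database.
conn = hive.Connection(host='datalake-host', port=10000, database='operational')
cursor = conn.cursor()

# The same aggregate that used to run against OLTP, expressed as HQL.
cursor.execute("""
    SELECT department, COUNT(*) AS open_cases
    FROM patient_visits
    WHERE visit_status = 'OPEN'
    GROUP BY department
""")

for department, open_cases in cursor.fetchall():
    print(department, open_cases)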

Future perspectives of this approach:


  1. We can take advantage of the data lake for data reconciliation between different OLTP sources, e.g. claims or GL transactions.
  2. We can integrate data from multiple heterogeneous sources, e.g. business-maintained files on SharePoint combined with operational reports.
  3. It can serve as a staging area for the data warehouse environment, offloading the production access window, while near-real-time data is pushed to the DWH.
  4. It helps meet the SLAs for dashboard deliveries to business stakeholders, for both operational and DWH-based ad hoc/analytical dashboards.
  5. It can also be used to shift some of the complex and time-consuming ETL transformations out of the DWH or data mart into the data lake: while pushing data to the EDW or data mart, Hadoop can apply those complex transformations and send an "enriched" version of the data set to the DWH/data mart.











Thursday, June 9, 2016

Customizing Oracle ODI Knowledge Modules


Oracle ODI ships with a bundle of very useful and quite diverse knowledge modules, but a common need is to create a customized version of a knowledge module to get more control and functionality. So how can we do that?

Problem Statement : 
 
The scenario we are talking about today is a very simple one: any Integration Knowledge Module creates a flow table prefixed with I$_ and does its further processing from there onwards. But here is a flaw: if you kick off the same job, or any other job that loads the same target table, ODI will launch a new instance and disturb the currently running job.

For example, I am using "IKM Oracle Incremental Update", which creates a table prefixed with I$; instead, I wish to have it create the table with the current session number as a postfix, e.g. I$_<table name>_<session number>.

The steps are as follows:
1 - Go to the project folder --> Knowledge Modules --> find IKM Oracle Incremental Update.
2 - Right click on the selected module and click on "Duplicate Selection".
3 - Give it a suitable name, e.g. "Custom_IKM Oracle Incremental Update".
4 - Double click on the module and click on "Details" to see the list of steps for this module.
5 - You will find a step titled "Create flow table I$".
6 - Select it; the code looks as follows:

create table <%=odiRef.getTable("L", "INT_NAME", "W")%>
(
    <%=odiRef.getColList("", "[COL_NAME]\t\t[DEST_WRI_DT] NULL", ",\n\t", "", "")%>,
    IND_UPDATE        CHAR(1)
)
<%=odiRef.getUserExit("FLOW_TABLE_OPTIONS")%>

7 - Add the custom logic to append the session number to the table being created:

create table <%=odiRef.getTable("L", "INT_NAME", "W")%>_<%=odiRef.getSession("SESS_NO")%>
(
    <%=odiRef.getColList("", "[COL_NAME]\t\t[DEST_WRI_DT] NULL", ",\n\t", "", "")%>,
    IND_UPDATE        CHAR(1)
)
<%=odiRef.getUserExit("FLOW_TABLE_OPTIONS")%>

8 - Save your knowledge module, and that is it.
9 - Use the custom knowledge module in your mapping.

You will see that the I$ table is created with a session postfix, so any other instance of the same job will not disturb the running flow.

Similarly, you can add new steps, make them conditional, and add new options to customize it further as per your requirements.

Hope that helps.

Thanks



 

Thursday, March 10, 2016

The Refined and Optimized Data Architecture for Enterprise Needs

We are all by now familiar with the hype around fancy terms like "Big Data", "Hadoop" and "NoSQL databases", but there is one principle behind all that technology spend when we look at it from the enterprise perspective: "How much value is added to the enterprise?"

Believe it or not, at the end of the day IT services are a supporting platform that helps business users and stakeholders achieve business goals and forecast future projections. The main points can be shortlisted as follows:

Cost: typical EDW storage (usually an MPP architecture solution, e.g. Teradata, Netezza, Oracle Exadata, etc.) is quite expensive and is constrained when it comes to keeping historical data, unstructured data or sensor data; the alternative is to move the "cold" data onto cheap commodity hardware, e.g. Hadoop.

Return on investment: replacing all EDW infrastructure and legacy systems with a new architecture and technologies is quite an expensive idea; however, looking at long-term prospects and return on investment, the best approach is to slowly resolve the limitations of the existing EDW infrastructure and build a hybrid architecture to get the maximum value out of it.

Linear scalability and long-term solution design: no doubt, long-term solution design and a flexible architecture to handle growing data volumes and new data types (social media, sensor, clickstream data) is a challenge, so the data architecture has to be refined with long-term prospects in mind.

Below is a reasonably close hybrid solution design of a new data architecture for an enterprise.



The above is a reference from a white paper published by Hortonworks; my only argument, which is a personal opinion, is with shifting all ETL to Hadoop. Instead, I believe we should keep it hybrid (at least for some time, running a parallel architecture): keep ONLY the unstructured data feeds / sensor / clickstream detailed-data ETL work on the Hadoop platform, and let the traditional sources keep running on the existing infrastructure.

Reference  : http://info.hortonworks.com/rs/549-QAL-086/images/hortonworks-data-architecture-optimization.pdf?mkt_tok=3RkMMJWWfF9wsRonvKTKc%2B%2FhmjTEU5z16uQsWaeygYkz2EFye%2BLIHETpodcMTcVnMLDYDBceEJhqyQJxPr3AKNkNy9RxRhHqDg%3D%3D