Friday, August 7, 2020

IRI CoSort Experience with Talend

Some time ago I did an independent evaluation of IRI CoSort, benchmarked it against Talend, and demonstrated how overall performance can be improved by integrating CoSort within Talend.

Below is the outcome, published on IRI's official site.

https://www.iri.com/blog/data-transformation2/optimizing-talend-transforms-with-cosort/



Sunday, July 26, 2020

Python Data Structure Playbook - Part 1

If, like me, you have been working with Python on and off for some time, you have probably felt the need for a quick reference to the basic data structures Python supports, with the commonly used functions handy.

Here is a summary of the most commonly used data structures, with some examples, that can serve as a quick cheat sheet while working on projects.

1) String

The most obvious data structure, supported by all programming languages, is the string. Like Java and C#, Python provides some very cool functions that make your life easy.

The example below defines a string and prints it, then tokenizes the string into a list using the "split" function, with ":" as the delimiter.

A single line of code gets you all the tokens in your string. The result is an ordinary list, so you can access its elements like any other data element.

Note: As in most programming languages, the index starts at 0, not 1, so the last line prints the first element of token, converted to lower case.


Str = 'This:is:a:test'
print(Str)
# split on ':'; without a maxsplit argument every delimiter is used
token = Str.split(':')
print(token[0].lower())

This is the output when you run it in a Python editor.

This:is:a:test
this
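
As a small extension of the example above, here is a minimal sketch of what the optional second argument of split (maxsplit) does, and how join reverses a split. The variable names are illustrative only.

```python
text = 'This:is:a:test'

# Without maxsplit, every ':' is used as a split point.
all_tokens = text.split(':')
print(all_tokens)          # ['This', 'is', 'a', 'test']

# With maxsplit=1, only the first ':' is used; the rest stays intact.
first_and_rest = text.split(':', 1)
print(first_and_rest)      # ['This', 'is:a:test']

# join is the inverse operation: it glues the tokens back together.
rebuilt = ':'.join(all_tokens)
print(rebuilt == text)     # True
```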


2) List

The list is the most used data structure in Python. In practice it can stand in for a linked list, a stack, or a queue; it is very flexible and comes with a lot of functions.


SampleList=[12,13,45,67,18]
print(SampleList)
SampleList.append(17)
print(SampleList)
CopyLst = SampleList.copy()
CopyLst.reverse()
print(CopyLst)

cars=['ForD','BMW','VW','Toyotta']
cars.sort(reverse=False)
print(cars)
cars.sort(reverse=True)
print(cars)


Sample Output:

[12, 13, 45, 67, 18]
[12, 13, 45, 67, 18, 17]
[17, 18, 67, 45, 13, 12]
['BMW', 'ForD', 'Toyotta', 'VW']
['VW', 'Toyotta', 'ForD', 'BMW']


The example above defines a list, then adds one new element to it using the "append" function, which by default adds the element at the end. There is also an "insert" function, which lets you add an element at a given position, while "remove" deletes the first element matching a given value and "pop" removes (and returns) the element at a given position.
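
To make the difference between these functions concrete, here is a short sketch using a fresh list (the name numbers is just for illustration):

```python
numbers = [12, 13, 45, 67, 18]

# insert(index, value) places the value before the given index.
numbers.insert(1, 99)
print(numbers)                  # [12, 99, 13, 45, 67, 18]

# remove(value) deletes the first element equal to the value.
numbers.remove(45)
print(numbers)                  # [12, 99, 13, 67, 18]

# pop(index) removes and returns the element at that index;
# pop() with no argument removes and returns the last element.
removed = numbers.pop(0)
last = numbers.pop()
print(removed, last, numbers)   # 12 18 [99, 13, 67]
```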

Later we create a copy and reverse the order of the elements.
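
Going back to the point at the start of this section that a list can act as a stack or a queue, here is a minimal sketch. Using collections.deque for the queue is my own suggestion, not something the examples above rely on.

```python
from collections import deque

# A list used as a stack: append pushes, pop() removes the last element (LIFO).
stack = [1, 2, 3]
stack.append(4)
top = stack.pop()
print(top, stack)          # 4 [1, 2, 3]

# A queue removes from the front (FIFO). list.pop(0) works but shifts every
# remaining element, so collections.deque is the usual choice for queues.
queue = deque([1, 2, 3])
queue.append(4)
first = queue.popleft()
print(first, list(queue))  # 1 [2, 3, 4]
```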

Another much-needed feature is sorting. The sort function takes two optional keyword arguments. One is "reverse", a boolean: passing True or False sorts the data in descending or ascending order. The other is "key", which lets you define your own sort key function. By default the elements are compared directly, so in this example the strings are sorted alphabetically (with uppercase letters ordered before lowercase ones).
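
The "key" argument is not used in the example above, so here is a small sketch of it. The mixed list is made up for illustration.

```python
cars = ['ForD', 'BMW', 'VW', 'Toyotta']

# key=len sorts by string length instead of alphabetically.
cars.sort(key=len)
print(cars)                          # ['VW', 'BMW', 'ForD', 'Toyotta']

# Default string comparison puts uppercase letters before lowercase ones.
mixed = ['banana', 'apple', 'Cherry']
default_order = sorted(mixed)
print(default_order)                 # ['Cherry', 'apple', 'banana']

# key=str.lower compares the lower-cased values: a case-insensitive sort.
# sorted() returns a new list and leaves the original untouched.
ignore_case = sorted(mixed, key=str.lower)
print(ignore_case)                   # ['apple', 'banana', 'Cherry']
```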

Loops are often used to iterate over a list, and in Python looping over a list is very easy: a single line is all you need. The list can also be restricted with a slice; in the example below, cars[1:3] prints the values at index 1 and index 2 (the end index 3 is excluded). This is a shortcut for the standard programming loop, where you have to define an iterator, an increment (++i), and a terminating condition; the second loop shows that index-based style using range.


for element in cars[1:3]:
    print(element)
for i in range(0,len(cars)):
    print('Using Range',cars[i])
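
Two other looping idioms are worth having in the cheat sheet. This is only a sketch: enumerate and list comprehensions are standard Python, but they are my additions to the examples above.

```python
cars = ['ForD', 'BMW', 'VW', 'Toyotta']

# A slice [start:stop] includes start but excludes stop.
print(cars[1:3])           # ['BMW', 'VW']

# enumerate yields (index, element) pairs, so no manual counter is needed.
for i, car in enumerate(cars):
    print(i, car)

# A list comprehension builds a new list in a single expression.
lower_cars = [car.lower() for car in cars]
print(lower_cars)          # ['ford', 'bmw', 'vw', 'toyotta']
```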

3) NumPy Arrays

One of the downsides of the standard list data structure is that it is not really efficient for mathematical operations on large data sets, or for matrix operations.

There are two key reasons why you would prefer NumPy arrays over the standard Python list:

1) When you need to do matrix calculations or mathematical operations on large amounts of multidimensional data.

2) When performance is key and you are dealing with numerical data sets or database-like structures (row-column data).

One of the "secrets" behind the performance of NumPy arrays is that their functions are optimized and written in C, giving them an edge over standard Python functions.
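
To illustrate the difference in style, the sketch below doubles every value once with a plain Python list and once with a NumPy array. It does not measure speed; it only shows that the NumPy version is a single whole-array expression.

```python
import numpy as np

values = [12, 11, 2, 36, 16, 25, 9]

# Plain Python: an explicit loop (here a list comprehension) over every element.
doubled_list = [v * 2 for v in values]

# NumPy: one expression applied to the whole array, executed in compiled code.
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)           # [24, 22, 4, 72, 32, 50, 18]
print(doubled_arr.tolist())   # [24, 22, 4, 72, 32, 50, 18]

# Note the difference: multiplying a list repeats it instead of scaling it.
print(len(values * 2))        # 14
```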

Below is an example that initializes a simple NumPy array and shows how simple it is to get things done without any loops or complex logic, e.g. standard deviation, cumulative sum, square root of each element in the array, etc.

import numpy as np
numpysample=np.array([12,11,2,36,16,25,9])
print('Sum of all elements', numpysample.sum())
print('Multiply Array by 2 and then Sum',(numpysample*2).sum())
print('Standard Deviation',numpysample.std())
print('Cumulative Sum',numpysample.cumsum())
print('Square Root of each element',np.sqrt(numpysample))

TwoDimArray= np.array([numpysample,numpysample*2])
print('Two Dimensional Array ',TwoDimArray)

print('Second column of first row',TwoDimArray[0,1])


Sample Output :
Sum of all elements 111
Multiply Array by 2 and then Sum 222
Standard Deviation 10.466662333722402
Cumulative Sum [ 12  23  25  61  77 102 111]
Square Root of each element [3.46410162 3.31662479 1.41421356 6.         4.         5.
 3.        ]
Two Dimensional Array  [[12 11  2 36 16 25  9]
 [24 22  4 72 32 50 18]]
Second column of first row 11


One of the basic requirements when working with a matrix is knowing its shape (the number of rows and columns), and one of the most used operations is the transpose. In Python both are super easy; see the sample code below.

print('Shape of Array',np.shape(TwoDimArray))
print('Transpose of Matrix',np.transpose(TwoDimArray))
print('Shape of Transposed Array',np.shape(np.transpose(TwoDimArray)))


Sample Output : 

Shape of Array (2, 7)
Transpose of Matrix [[12 24]
 [11 22]
 [ 2  4]
 [36 72]
 [16 32]
 [25 50]
 [ 9 18]]
Shape of Transposed Array (7, 2)
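
Two related shortcuts, sketched on a smaller matrix made up for illustration: the .T attribute as a shorthand for the transpose, and the axis argument for per-column or per-row aggregation.

```python
import numpy as np

matrix = np.array([[12, 11, 2],
                   [24, 22, 4]])

# .T is a shorthand for the transpose of the array.
print(matrix.shape)                    # (2, 3)
print(matrix.T.shape)                  # (3, 2)

# axis=0 aggregates down the rows (one result per column),
# axis=1 aggregates across the columns (one result per row).
column_sums = matrix.sum(axis=0)
row_sums = matrix.sum(axis=1)
print(column_sums.tolist())            # [36, 33, 6]
print(row_sums.tolist())               # [25, 50]
```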

4) Structured NumPy Arrays

For years we have been working with relational data, so wouldn't it be cool if you could represent the same type of data in Python and process it? The answer is the structured NumPy array: it gives you a mechanism to represent data in row-column format with database-like data types.

Below is an example of an Employee-table-like schema definition as a Python structured array.

import numpy as np

# 'S55' = byte string (max 55 bytes), 'i4' = 32-bit int, 'f' = 32-bit float,
# 'b' = signed byte, so True/False are stored as 1/0
customDataType = np.dtype([('Name', 'S55'), ('Age', 'i4'), ('Salary', 'f'), ('OverTimeEligable', 'b')])
SampleData = np.array([('John', 45, 156000, True),
                       ('Robert', 23, 80000, False),
                       ('Ken', 33, 100000, True),
                       ('Walker', 56, 180000, False)],
                      dtype=customDataType)
print(SampleData)

print('Mean  Salary of Staff',SampleData['Salary'].mean())
print('Standard Deviation of  Salary of Staff',SampleData['Salary'].std())

# add up the Salary field row by row
csum = 0
for i in range(0, len(SampleData)):
    csum = csum + SampleData[i]['Salary']

print('Cum sum of all Employee Salary',csum)


Sample Output:

[(b'John', 45, 156000., 1) (b'Robert', 23,  80000., 0)
 (b'Ken', 33, 100000., 1) (b'Walker', 56, 180000., 0)]
Mean  Salary of Staff 129000.0
Standard Deviation of  Salary of Staff 40533.938
Cum sum of all Employee Salary 516000.0
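
To push the database analogy a little further, here is a hedged sketch of filtering and ordering a structured array. The staff_type layout is a simplified variant I made up for this illustration ('U20' for text, 'f8' for a 64-bit float); it is not the schema used above.

```python
import numpy as np

# Hypothetical, simplified schema for illustration only.
staff_type = np.dtype([('Name', 'U20'), ('Age', 'i4'), ('Salary', 'f8')])
staff = np.array([('John', 45, 156000),
                  ('Robert', 23, 80000),
                  ('Ken', 33, 100000),
                  ('Walker', 56, 180000)], dtype=staff_type)

# A boolean mask works like a WHERE clause.
high_paid = staff[staff['Salary'] > 100000]
print(high_paid['Name'].tolist())        # ['John', 'Walker']

# np.sort with order= works like ORDER BY on a field.
by_age = np.sort(staff, order='Age')
print(by_age['Name'].tolist())           # ['Robert', 'Ken', 'John', 'Walker']

# The array's sum() method replaces the manual loop for a column total.
total_salary = staff['Salary'].sum()
print(total_salary)                      # 516000.0
```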