One of the basic things that one does when dealing with numeric data sets is to add them up for some given attribute. Here is a subset of sample data of baseball statistics via http://seanlahman.com/baseball-archive/statistics/
The file used for the purpose of this post [Managers] is a list of Wins and losses by Baseball team managers from the late 1800’s to 2012. Lets try to roll up the wins and losses per manager to calculate the total wins and losses for each team manager. Do download the data file to see the raw data. (Note that .key files are just csv so named to get around a wordpress restriction and therefore can be opened with a text editor/openoffice/excel)
How can this be done?
Early 21st century method:
Use Excel to calculate totals manually(sigh), write a macro if you are more adept.
2014 method:
Write a python program Use the csv library in python to read the file, then keep a dictionary of form {category1:{attrib1:val1, attrib2:val2….attribn:valn},category2:{attrib1:val1,attrib2:val2,attrib3:val3…attribn:valn}}
Then as you pass over each row update the sums for each attribute while checking if the category you are referencing exists, if not create a entry in the dictionary and repeat till end of the file.
Lets see another way to do this right out of the 1970’s:
Say hello to Awk. Awk is an interpreted language designed specifically for extracting and manipulating text. Awk natively interprets tabular data. How cool is that? The nice part is awk is shipped with any standard linux/unix distribution. For those still in the windows world, installing cygwin will get you awk.
The anatomy of an awk program is simple: pattern {action} filename with optional BEGIN and END patterns which refer to actions preceding and after the file is read.
To roll up the wins and losses per team manager from the data file that we have, we use a concept called associative array. Wait associative what? An Associative array is a data structure which can be indexed by anything(typically a string). While this may not seem any different than a python dictionary, the magic lies in the fact that this is applied across the file without any need for iterating over the file explicitly. Lets see the actual code that will do this. Save the following script as sum_wins_and_losses.awk and apply a chmod 755 so that it can execute.
#!/bin/awk -f BEGIN{ FS=","; OFS=","; total_wins[""]=0; total_losses[""]=0; } { manager=$1 wins=$7; losses=$8; } { total_wins[manager]+=wins; total_losses[manager]+=losses; } END{ print "manager,total_wins,total_losses" for (i in total_wins){ if(i != "") { print i,total_wins[i],total_losses[i] } } }
In the Begin block we define the field separator(FS) and the output field separator(OFS) as a comma in addition to initializing arrays that we intend to use. The OFS determines how the data will be separated on output of the program.
By default awk interprets space separated files. Once the FS is established
you can refer to any column by its index ie. the first column of the data table can be referred to by $1, the second by $2 and so on. It is a good practice to assign these to variables .That enables you to make changes easily at a central point when there is a need to change the column position in the code. Typical use case would be to adapt the program for a file with additional columns, with the current columns appearing at different position’s.
The third block is where the magic begins, we index the arrays that we defined by the field that we want to roll up our data by. In this case we use manager.
total_wins[manager]+=wins;
All that this snippet of code does is that if the manager is “foo” the array bucket of total_wins indexed by “foo” will hold the total wins achieved by foo. This is so since the operation += wins is applied across the entire file and adds any wins achieved by foo to the same index. This is done for all unique managers and we are left with rolled up values of wins and losses by manager for the entire dataset.
Now for the finale , in the END block all we are doing is iterating over the indexed associative array and spewing out the rolled up data. This will be to the console.
The actual program can be executed by invoking the following snippet which redirects the output to a file.
awk -f sum_wins_and_losses.awk Managers.key >rolled_up_file.key
Open the rolled_up_file to see total Wins and Losses by the manager. Next time you are faced with manipulating tabular data, think awk!
References: