How I built a weather decision engine, or the story of wearthejacket.com

As I was stepping out of my apartment last week, it struck me that in California I really don’t care about the exact temperature outside. All I want to know is whether it is cold enough to wear a jacket or warm enough not to. I decided to build a weather-based decision engine that does just that: figure out where I am, check the temperature, and give me a decision. I also wanted it to be blazing fast and scalable.
The end result was http://wearthejacket.com/ [Update: shutting this service down today, October 9th 2014, after almost a year of 100% uptime. This was a fantastic learning experience.]

The first step was to think how I could achieve this. After some thought I came up with this sketch.

[Figure: architecture sketch]

I was aware of existing geolocation APIs that translate an IP address to a location. That led me to http://freegeoip.net/, which is a great API for a project like this: it allows 10,000 API calls per hour before throttling, which was more than sufficient for my needs.

The other component was getting the weather information. After googling for a bit, I came across forecast.io, a robust API with a free cap of 1,000 calls per day and a nominal payment after that.
The hosting was on an AWS (Amazon Web Services) small Ubuntu Linux instance.
I decided to use Tornado over Apache mainly because of its low memory consumption when idle and because I was going to write this in Python. The decision to use Redis was a simple one, as I would definitely need to cache some values as end-user requests came in.

This was the first pass I wrote:
1. The user comes to the webserver.
2. The webserver queries the geolocation API and obtains the end user’s coordinates and location.
3. Based on the coordinates, it queries the weather API.
4. Finally, it makes a decision based on the prevailing temperature. The decision algorithm is very simple: a pre-set threshold below which wearing a jacket is advised. Making it smarter is a todo for future enhancement.
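
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the project: the threshold value, the function names, and the shape of the location dictionary are all assumptions, and the two external API calls are passed in as plain functions so the flow can be read (and tried) without network access.

```python
# Assumed threshold in degrees Fahrenheit; the post does not state the real value.
JACKET_THRESHOLD_F = 60.0


def decide(apparent_temp_f, threshold_f=JACKET_THRESHOLD_F):
    """Step 4: advise a jacket when the temperature is below the threshold."""
    return apparent_temp_f < threshold_f


def jacket_decision(ip, locate, fetch_temperature, threshold_f=JACKET_THRESHOLD_F):
    """Run steps 2 to 4 for one request.

    locate(ip) -> dict with "location", "latitude", "longitude" (hypothetical shape)
    fetch_temperature(lat, lon) -> temperature in degrees Fahrenheit
    """
    place = locate(ip)                                               # step 2
    temp = fetch_temperature(place["latitude"], place["longitude"])  # step 3
    return {
        "city": place["location"],
        "temperature_f": temp,
        "wear_jacket": decide(temp, threshold_f),                    # step 4
    }
```

Passing the lookups in as functions also makes it easy to swap in cached versions later without touching the decision logic.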

Note that I had not written the caching mechanism yet. Therein lies the fun part. At this point the decision was taking well over 1 second, certainly not acceptable for a public-facing web application.

Enter Redis. Redis is an in-memory key-value data store, which makes lookups blazing fast. The first thing I needed to do was cache the location information pulled from the geolocation API. The application needed the latitude, longitude, and city name, which made it a good candidate for a Redis hash.

So the first mapping was

HMSET <ip> location <city> latitude <lat> longitude <lon>

Since this data will not change rapidly, even with dynamic IPs, we can keep these mappings forever and, as user queries come in, build our own database of IPs to locations over time.
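
In Python, this lookup-then-store pattern might look something like the sketch below. It assumes a redis-py style client (its hgetall and hset(name, mapping=...) methods); the "ip:" key prefix and the lookup function are hypothetical names chosen for illustration, not taken from the project.

```python
def get_location(client, ip, lookup):
    """Return the location for an IP, consulting the cache first.

    client: a redis-py style client (assumed), e.g. redis.Redis(decode_responses=True)
    lookup: hypothetical function that calls the geolocation API and returns
            a dict with "location", "latitude" and "longitude"
    """
    key = "ip:" + ip                    # assumed key naming scheme
    cached = client.hgetall(key)        # empty dict when the key does not exist
    if cached:
        return cached                   # note: Redis hands values back as strings
    location = lookup(ip)               # cache miss: ask the external API once
    client.hset(key, mapping=location)  # no expiry, the mapping is kept forever
    return location
```

Only the first request from a given IP pays for the external call; every later request is served from the hash.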

For the weather information, I assumed that weather conditions stay similar over 10 minutes at a given location [debatable, but it fulfills most needs].

This is a simple Redis key-value pair, location:apparentTemperature, the only caveat being that we want it to expire every 10 minutes (configurable).

This is done easily in Redis via the SETEX command, with the invocation

SETEX key <expiration_time_in_seconds> value
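
As a rough sketch of how the expiring weather cache could be wrapped in Python, assuming a redis-py style client (its get and setex(name, time, value) methods): the "weather:" key prefix and the fetch function are hypothetical, and the 600 second default simply mirrors the 10 minute window described above.

```python
def get_temperature(client, location, fetch, ttl_seconds=600):
    """Return the apparent temperature for a location, cached for ttl_seconds.

    client: a redis-py style client (assumed)
    fetch:  hypothetical function that calls the weather API for a location
    """
    key = "weather:" + location             # assumed key naming scheme
    cached = client.get(key)                # None once the key has expired
    if cached is not None:
        return float(cached)                # Redis stores the value as text
    temperature = fetch(location)           # cache miss: call the weather API
    client.setex(key, ttl_seconds, temperature)
    return temperature
```

Because Redis drops the key on its own after the timeout, the application never has to track timestamps or clean up stale entries.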

Once the cache mechanism was in place, benchmarking showed dramatic improvements: sub-30 ms response times after the first API call was made. The first call to the application, though, was still remarkably slow. So I started looking at the individual calls to the external services.

There lay the answer: the forecast.io API was spewing out an entire day’s worth of data.
The fix was to append the following to the forecast.io API call:

?exclude=minutely,hourly,daily,alerts,flags

which had the effect of only giving back the current prevailing conditions.
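
For illustration, the trimmed-down request URL could be assembled like this. The base URL and the key/latitude/longitude path layout are assumptions on my part, so check the weather provider’s documentation for the real endpoint; only the exclude parameter comes from the text above.

```python
from urllib.parse import urlencode

# Assumed endpoint layout; verify against the provider's documentation.
BASE_URL = "https://api.forecast.io/forecast"
EXCLUDED_BLOCKS = ("minutely", "hourly", "daily", "alerts", "flags")


def current_conditions_url(api_key, latitude, longitude):
    """Build a request URL that asks only for the current conditions."""
    # safe="," keeps the commas readable instead of encoding them as %2C
    query = urlencode({"exclude": ",".join(EXCLUDED_BLOCKS)}, safe=",")
    return f"{BASE_URL}/{api_key}/{latitude},{longitude}?{query}"


print(current_conditions_url("YOUR_API_KEY", 37.0, -122.0))
```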

Once this was done, it was time to write tests. This was not true test-driven development, but I wasn’t launching without baseline tests. This part took me the longest but greatly increased my confidence in the code for launch. As of now it has run for over a day, serving requests from across the globe. Always write tests, preferably before even the first line of code.

Benchmarking after that indicated a theoretical capacity of 1.5 million requests per day (roughly 17 requests per second, sustained). Not bad for a tiny server, and the best part is that it can be horizontally scaled [though I doubt I will do that, considering it takes $$$ to keep servers running].
The components are modular, so you can use them individually. I would love to know your thoughts on how this project can be enhanced and on design decisions that you would make.

One more thing
Building this has been a great learning experience, and to enable others to learn from and critique it, the source code is released under the GPLv3 license.
https://github.com/hvd/wearthejacket_oss


Rolling up data with Awk

One of the basic things one does when dealing with numeric data sets is to add up values for some given attribute. Here is a subset of sample baseball statistics data via http://seanlahman.com/baseball-archive/statistics/
The file used for this post [Managers] is a list of wins and losses by baseball team managers from the late 1800s to 2012. Let’s roll up the wins and losses per manager to calculate the total wins and losses for each team manager. Do download the data file to see the raw data. (Note that .key files are just CSV files, so named to get around a WordPress restriction, and can therefore be opened with a text editor, OpenOffice, or Excel.)

How can this be done?
Early 21st century method:
Use Excel to calculate totals manually (sigh), or write a macro if you are more adept.

2014 method:
Write a Python program. Use the csv library to read the file, then keep a dictionary of the form {category1:{attrib1:val1, attrib2:val2, …, attribn:valn}, category2:{attrib1:val1, attrib2:val2, …, attribn:valn}}
Then, as you pass over each row, check whether the category you are referencing exists; if not, create an entry in the dictionary. Update the sums for each attribute and repeat till the end of the file.
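
A minimal sketch of that Python approach is shown below. The column positions (manager in column 1, wins in column 7, losses in column 8) are assumed to match the data file, the function name is my own, and the sample rows are made up purely so the example runs on its own. A defaultdict takes care of the “create an entry if it does not exist” step.

```python
import csv
import io
from collections import defaultdict


def roll_up(rows, key_col=0, win_col=6, loss_col=7):
    """Sum wins and losses per manager; column indexes are zero-based."""
    totals = defaultdict(lambda: {"wins": 0, "losses": 0})
    for row in rows:
        try:
            wins, losses = int(row[win_col]), int(row[loss_col])
        except (ValueError, IndexError):
            continue  # skip a header line or a malformed row
        totals[row[key_col]]["wins"] += wins
        totals[row[key_col]]["losses"] += losses
    return dict(totals)


# Made-up sample rows, not real statistics.
SAMPLE = """manager,year,team,league,inseason,games,wins,losses
foo01,1901,AAA,NL,1,10,6,4
bar01,1901,BBB,NL,1,10,3,7
foo01,1902,AAA,NL,1,10,5,5
"""

totals = roll_up(csv.reader(io.StringIO(SAMPLE)))
for manager, t in sorted(totals.items()):
    print(manager, t["wins"], t["losses"])
```

To run it against the real file, pass csv.reader(open("Managers.key", newline="")) instead of the sample.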

Let’s see another way to do this, right out of the 1970s:
Say hello to Awk. Awk is an interpreted language designed specifically for extracting and manipulating text. Awk natively interprets tabular data. How cool is that? The nice part is that awk ships with any standard Linux/Unix distribution. For those still in the Windows world, installing Cygwin will get you awk.

The anatomy of an awk program is simple: pattern {action} filename, with optional BEGIN and END patterns whose actions run before and after the file is read.

To roll up the wins and losses per team manager from the data file that we have, we use a concept called an associative array. Wait, associative what? An associative array is a data structure that can be indexed by anything (typically a string). While this may not seem any different from a Python dictionary, the magic lies in the fact that it is applied across the file without any need to iterate over the file explicitly. Let’s see the actual code that will do this. Save the following script as sum_wins_and_losses.awk and apply a chmod 755 so that it can execute.

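#!/bin/awk -f
# Roll up wins and losses per manager from a comma-separated file.
BEGIN{
   FS=",";                 # input field separator
   OFS=",";                # output field separator
   total_wins[""]=0;       # initialize the arrays we intend to use
   total_losses[""]=0;
}
{
   # Name the columns once so a layout change is a one-line edit.
   manager=$1;
   wins=$7;
   losses=$8;
}
{
   # Accumulate per-manager totals; this runs for every input line.
   total_wins[manager]+=wins;
   total_losses[manager]+=losses;
}
END{
   print "manager,total_wins,total_losses"
   for (i in total_wins){
      # Skip the empty key created in BEGIN.
      if(i != "")
      {
         print i,total_wins[i],total_losses[i]
      }
   }
}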

In the BEGIN block we define the field separator (FS) and the output field separator (OFS) as a comma, in addition to initializing the arrays that we intend to use. The OFS determines how the fields are separated in the program’s output.
By default awk splits fields on whitespace. Once the FS is established, you can refer to any column by its index, i.e. the first column of the data table can be referred to by $1, the second by $2, and so on. It is good practice to assign these to variables. That lets you make changes easily at a central point when a column position changes. A typical use case would be adapting the program for a file with additional columns, where the current columns appear at different positions.

The third block is where the magic begins: we index the arrays that we defined by the field that we want to roll up our data by. In this case we use manager.

total_wins[manager]+=wins;

All this snippet does is this: if the manager is “foo”, the array bucket of total_wins indexed by “foo” will hold the total wins achieved by foo. This works because awk applies the operation += wins to every line of the file, adding any wins achieved by foo to the same index. This is done for all unique managers, and we are left with rolled-up values of wins and losses by manager for the entire dataset.

Now for the finale: in the END block all we do is iterate over the associative array and print the rolled-up data. By default this goes to the console.

The actual program can be executed by invoking the following command, which redirects the output to a file.

awk -f sum_wins_and_losses.awk Managers.key >rolled_up_file.key

Open rolled_up_file.key to see the total wins and losses by manager. Next time you are faced with manipulating tabular data, think awk!

References:

1. http://www.grymoire.com/Unix/Awk.html

2. http://en.wikipedia.org/wiki/AWK

Tools

If all you have is a hammer, everything looks like a nail -Abraham Maslow

Building quality software relies on a myriad of tools. One essential pillar of craftsmanship, software or otherwise, is mastery of the tools that you work with. Some tools that I think are important for a software craftsman:

The Mind:
The most important and essential tool that you have is your mind.
All software originates as a thought, so having a clear thought process is necessary.
Unfortunately we have become accustomed to Google, Facebook and Twitter. Being mindful of what you are trying to achieve is important. There is a quote from the TV show Sherlock that has stuck with me: “People fill their heads with all kinds of rubbish. And that makes it hard to get at the stuff that matters. Do you see?” Treat your mind as a garden and only let in thoughts that should be nurtured; getting rid of weeds is essential.

The Operating System
I use Xubuntu Linux, which to me offers simplicity and power. Using Linux forces me to learn more about the underlying machine itself.
We must remember that for all the clouds now available, ultimately a cloud is a group of computers connected to a network. Quoting Larry Ellison: “Google does not run on water vapor”. The choice of operating system will have an effect on how much of the underlying system you understand. Macs and Windows abstract away a lot of the underlying machine, which, if you are an application developer, may not be a bad thing in terms of productivity gains. However, gaining knowledge of the underlying processes will be harder.

The Programming Language:
There is a plethora of programming languages available, and to persist with the thought that all languages are made equal is a fallacy.
Using the right language for the job at hand will go a long way in building successful products. Learning a language different from the one you use daily will create new ways of thinking. For instance, try writing to a file first in Java, then in Python. While I am not getting into the argument of which is better, I do want to emphasize that languages have their own strengths. To deal with data in flat files, you are missing out if you do not use the trinity of awk, sed and grep (sed and grep being command-line tools). They provide, in a few simple lines, what would otherwise take programs of hundreds of lines in a high-level language.

The Text Editor
From Notepad to Vim to Sublime Text to even an IDE, the text editor is where you translate your thoughts into code: structured information that makes computers do what you want them to do. Mastery of the text editor will determine how long it takes you to write code, assuming that you can type.

Version Control:
Any software that is not one-time use should be maintained in a version control system. Git is my favorite; however, depending on your work environment you could end up using SVN or Perforce. Git has many subtleties that become apparent with practice. Know your Git and never worry about losing source code.

The Debugger
While best avoided, since you do want your mind to be the compiler, debuggers aid in understanding both the program and the data flow in complex projects. Mastering the debugger will go a long way in solving bugs in programs that you are unfamiliar with, and sometimes in code that is familiar.

Databases
Databases are the foundation of the applications that we build.
We have SQL and not only SQL. For structured data, SQL-based databases are still golden.
Since real data is not always structured, that has forced the move to NoSQL databases.
Redis is the Swiss Army knife for data in key-value pairs. Learn about Couchbase or MongoDB for exposure to document-based databases.

This list of tools is in no way comprehensive. Building mastery of tools is a long and arduous process. However, tiny steps taken today go a long way and add up in a few years. More often than not, the problem you are trying to solve will drive the tools that you use. So try to solve as many different problems as you can; slowly but surely you will see an expanding toolkit.

The path I have laid for myself is to always be learning and to teach what I learn. I would love to know your experiences in mastering the tools that you work with and ultimately in mastering your craft.