We had an excellent Hack/Reduce event again this weekend in Ottawa. Our knowledge has been expanded once more: among other things, we now know how desperate daters can get and at what time people do their online shopping.
To get all the tech to work and keep the clusters humming, a lot of coffee and pizza was consumed. I would be afraid to calculate how much.
Most importantly, we saw a lot of great Hack/Reductions by the participants. Actually, first and foremost it was a great weekend for the Hack/Reduce team again – the best way to spend a Saturday – for real! Thanks!
We want to thank all the participants who made it out and the sponsors of course: Hopper, Infoglutton and Shopify.
Here’s a short description of some of the presentations we saw:
Jean-Claude Batista (@jcbatista), Taswar Bhati (@taswarbhatti), Joel Sachs, Andrew Clunis (@orospakr), (Petro Verkhogliad)

The Hackify team worked on the Shopify dataset, which contains information about US Shopify orders. The team used Ruby and the streaming API.
First, the team analyzed where people shop the most. Naturally, the answer was California (the most populous state).
Second, the team wanted to find the time when people had shopped the most, which was August 19th between noon and 6pm ($185,891).
The team concluded that using Ruby with the streaming API makes it easy to do map/reduce and that Hadoop is cool!
Petro Verkhogliad (@vpetro)
Petro first worked with the Hackify team but then turned to Python, which he was more familiar with. Petro also analyzed the Shopify shopping data.
Petro found that people shop the most at 9am and 9pm. The most popular shopping day is Friday, with Saturday a close second.
Geographically, most shopping is done by Californians ordering from California.
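To give an idea of what such an hour-of-day count looks like in map/reduce terms, here is a minimal Python sketch. It is not Petro's code: the tab-separated record layout, the timestamp format and the function names `map_order`, `reduce_counts` and `orders_per_hour` are all assumptions made for illustration, and the map and reduce steps run in a single process instead of on a cluster.

```python
from collections import defaultdict
from datetime import datetime


def map_order(line):
    """Mapper: emit (hour_of_day, 1) for one order record.

    Assumes a hypothetical tab-separated record whose first field is a
    timestamp such as "2011-08-19 13:05:00".
    """
    timestamp = line.split("\t")[0]
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    return hour, 1


def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def orders_per_hour(lines):
    """Run the map step on every non-empty line, then reduce."""
    return reduce_counts(map_order(line) for line in lines if line.strip())


# Made-up sample records, not taken from the Shopify dataset.
sample = [
    "2011-08-19 09:15:00\tCA\t25.00",
    "2011-08-19 09:45:00\tNY\t10.00",
    "2011-08-19 21:05:00\tCA\t42.50",
]
print(orders_per_hour(sample))  # {9: 2, 21: 1}
```

With Hadoop streaming, the mapper and the reducer would normally live in two separate scripts that read from standard input and write to standard output.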
Rod Dunne, Marc Lepage, Mohamed Mansour (@mohamedmansour), Alexis Brunet
Code: https://github.com/mlepage/HackReduce
Team-RIM was voted the winner by the participants. They worked on the Google n-grams data and the Amazon review data. First, the team calculated the number of alliterative 2-grams by letter per year.
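As a rough illustration of the alliterative 2-gram count, here is a small Python sketch. It is not Team-RIM's implementation (their code is linked above): the `(bigram, year, occurrences)` record shape and the function names are hypothetical simplifications of the n-grams data.

```python
from collections import Counter


def alliteration_letter(bigram):
    """Return the shared initial letter of a 2-gram, or None if there is none."""
    words = bigram.lower().split()
    if len(words) != 2:
        return None
    first, second = words[0][0], words[1][0]
    return first if first.isalpha() and first == second else None


def alliterative_counts(records):
    """Count alliterative 2-grams per (year, letter).

    `records` is an iterable of (bigram, year, occurrences) tuples, a
    hypothetical stand-in for the real n-grams records.
    """
    counts = Counter()
    for bigram, year, occurrences in records:
        letter = alliteration_letter(bigram)
        if letter is not None:
            counts[(year, letter)] += occurrences
    return counts


# Made-up sample records for illustration only.
sample = [
    ("Peter Piper", 2000, 3),
    ("big dog", 2000, 5),
    ("busy bee", 2000, 2),
    ("Big Bang", 2001, 7),
]
print(alliterative_counts(sample))
```

Keying the counter on `(year, letter)` keeps the result easy to group by either dimension afterwards.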
For the Amazon review data, Team-RIM calculated the average rating given on each calendar date, averaged over all of the years in the dataset. The average rating is close to 4 on almost every date. You can also notice that the ratings drop off right after Christmas, i.e. people give worse reviews right after Christmas.

The Amazon dataset also includes data on how useful reviews are. The reviews can be voted up or down on Amazon by users. Team-RIM analyzed this data and came to the conclusion that reviews that give a higher rating to a product are considered more useful.
The team also analyzed the usefulness of reviews based on review length. According to their chart, reviews of 50-60 characters are considered most useful. As a bonus, the team calculated that products received worse reviews as time went by.

Pascal from Hopper wanted to find the most desperate daters. He analyzed the number of profile views per user. According to his analysis, some users check so many profiles that, at 30 seconds per profile, it amounts to full-time work (~17k views per month, which amounts to something like 60 profile views every working hour).
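The "full-time work" comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 17,000 views and the 30 seconds per view are the figures quoted above, while the 160-hour working month and the helper name `browsing_hours` are assumptions added here.

```python
def browsing_hours(views_per_month, seconds_per_view):
    """Convert a monthly profile-view count into hours spent browsing."""
    return views_per_month * seconds_per_view / 3600


hours = browsing_hours(17_000, 30)
full_time_hours = 160  # assumption: 40 hours a week for 4 weeks
print(round(hours, 1), round(hours / full_time_hours, 2))  # 141.7 0.89
```

Under these assumptions the heaviest users spend roughly 142 hours a month viewing profiles, which is close to a full-time job.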
Pascal also found the most visited profiles.
Lastly, Pascal analyzed how users use the Mate1 site. The results were quite surprising: 23% of users only view one other profile, and 65% never check more than 10 profiles.
JF (@jeanfrancoisim) from Hopper also analyzed the dating dataset.
First, JF mapped the birth year of users to their hotness by gender. JF calculated hotness with the following formula: (msgs received+msgs sent) * ((msgs received+10)/(msgs sent+10)).
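Written out as a small Python function, the formula looks like the sketch below. The function and parameter names are made up here; only the formula itself comes from JF's presentation.

```python
def hotness(msgs_received, msgs_sent):
    """Hotness score: total message activity, weighted by the ratio of
    messages received to messages sent (both padded by 10)."""
    total = msgs_received + msgs_sent
    ratio = (msgs_received + 10) / (msgs_sent + 10)
    return total * ratio


print(hotness(90, 10))  # 500.0, receives far more than they send
print(hotness(10, 30))  # 20.0, sends more than they receive
```

Adding 10 to both sides of the ratio keeps the score defined for users who have sent no messages and dampens the ratio for users with very little activity.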
The red dots represent women and the blue dots men. We can clearly see that the women are, all in all, younger. I’m afraid to draw any other conclusions from the results…
Next JF mapped hotness and height by gender. You could easily see that men are taller than women but it is unclear if height directly influences hotness.
As a last analysis, JF mapped income to hotness. Most people answered “rather not say” to this question, which is why most datapoints are in the second column. It’s also unclear if income influences hotness.
Steven Noble (@snoble), Chris Saunders (@chris_saunders), David Underwood (@davefp)
The XKCD team wanted to test the internet “truism” stated in an xkcd comic: “Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at ‘Philosophy’.”
The team didn’t really end up with a result for this, since it turned out to be difficult to include only the actual article content and not all of the other links on the page. They started their work in Clojure but later ended up using Java instead.
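Once the first link of every article has been extracted (the hard part the team ran into), the check itself is a simple chain walk. The Python sketch below only illustrates that idea and is not the team's code: `first_link` is a hypothetical dictionary from article title to the title of its first qualifying link, and the toy titles are invented, not real Wikipedia data.

```python
def follow_first_links(first_link, start, target="Philosophy"):
    """Follow first links from `start`.

    Returns (path, reached), where `reached` is False if the chain hits a
    dead end or loops back to an article it has already visited.
    """
    path = [start]
    seen = {start}
    current = start
    while current != target:
        nxt = first_link.get(current)
        if nxt is None or nxt in seen:
            return path, False
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path, True


# Invented toy graph for illustration only.
toy_links = {
    "Hadoop": "Software",
    "Software": "Computer",
    "Computer": "Philosophy",
    "Loop A": "Loop B",
    "Loop B": "Loop A",
}
print(follow_first_links(toy_links, "Hadoop"))
print(follow_first_links(toy_links, "Loop A"))
```

The `seen` set matters: without it, a chain that cycles before reaching the target would never terminate.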
Martin Samson (@pgdown), Edgar Acosta, David Germain
Team 8 explored the dating dataset with the goal of measuring popularity/hotness and correlating it to other factors.
The team started by analyzing some basic measures for profiles.
They created their own popularity formula.
The team made some nice graphs mapping times listed vs. messages received, times listed vs. profile views, and popularity vs. number of times viewed.
The team used streaming with Python and fought with it for a while. Lesson: Do not use the same file name for the mapper and the reducer scripts.
Richard Desmarais, Chris Camden, Philippe Savoie, Ryan McLeod, Eric Ax, Manuel Belmadani (@pragmatwit)
The weather and crime team created a web interface that could answer queries such as “What is the maximum temperature in January 2007?”. When a search query is launched, it runs through the dataset and gives you the answer. The team used Python streaming.
Chris Fournier (@cfournie), Oana Frunza, Alistair Kennedy, Russell Luo, Dominic Plouffe (@dplouffe)
Team UOttawa analyzed the sentiment of the Amazon reviews. As expected, there were clear differences in sentiment between the good and the bad reviews.
In the end, we let the participants vote for the best Hack/Reduction. Team-RIM with their n-grams and Amazon analyses took the win… Congrats!
Thanks to everyone for coming. The Hack/Reduce team had a great time and it was amazing to meet you all. We hope you keep hacking and we hope we’ll see you next time!
Hack/Reduce 4 Ottawa, Saturday 13th of August 2011, 10am-8pm at the Language Technologies Research Center.
We’ll spin up clusters with hundreds of nodes for free use.
You can use our datasets or send us your own and hack on anything you want.
We provide the clusters, Hadoop/Mapreduce experts and food and drinks.
Intense coding for 7 hours and presentations at the end of the day. What could be a better way to spend a Saturday?
Just bring your laptop and an idea of what you want to solve.
Basically you show up, pitch your idea if you have one and gather or find a team to work with. We’ll have a short presentation about the infrastructure and how to run your Hadoop jobs (if you will be using Hadoop). Then you just start coding with your team. Food and drinks are served throughout the day. At the end of the day everyone presents their results and/or shares what they did and what they learnt.
| Time | Activity |
| --- | --- |
| 10:00-10:15 | Coffee and introduction |
| 10:15-10:45 | Participants get to pitch their ideas and find or gather a team to work with |
| 10:45-11:30 | Intro to the infrastructure, tutorial and setup (optional) |
| 11:30-18:30 | Development time |
| 19:00-20:00 | Presentations by the teams, beers |
| 20:00 | Gathering stuff and closing |
You don’t have to do any special preparations for Hack/Reduce. However, one of the things we often hear from participants is that they should have RTFM. Here’s how you can prepare:
At the other events we’ve had 10-20 people stand up and give a short 30 second pitch about what they wanted to build or what dataset they were interested in. After the pitches anyone could go talk to the people that had pitched to join their team. We also had several teams that had been pre-formed. We also had some people that worked alone. Anything is possible.
You will need to bring your own laptop to the event; everything else will be provided. We have had bad experiences getting Windows environments to work with Hadoop, so we suggest you bring a Linux or OS X machine.
We’ll provide about 10 Amazon clusters of regular EC2 m1.xlarge instances running Ubuntu 10.04, which we can scale up according to need (usually up to 500 nodes). We’ll have Hadoop installed and instructions ready for how to run your Hadoop jobs. If you want to use other technologies, you’ll have to contact us well in advance (2 weeks) or install them yourself on the day of the event. You’ll have full access to the clusters.
You can read examples of what has been built on the pages of the Montreal, Toronto and Boston events. Of course, we encourage you to figure out your own great ideas using any datasets you can find (you’ll have to contact us to make sure we can make the datasets available).
You have at most 8 hours of effective coding time, so plan accordingly. It’s a good idea to gather a team beforehand and figure out what you want to build before coming to the event.
We can make any datasets that you want available for the event. However, you’ll have to contact us before the event so that we have time to upload the dataset (because it can take a long time).
Check out the existing datasets on github.
We can also upload any dataset that you provide a link for. You can also look through the public datasets on Amazon. If you want us to make any of these available for Hack/Reduce you need to contact us. We also have a post about other possible datasets. You can also look through Infochimps for interesting datasets.
The event will be held at the Language Technologies Research Center.
Language Technologies Research Center
283 Alexandre-Taché Blvd.
Gatineau, Quebec J8X 3X7
Canada
Saturday, August 13, 2011 from 10:00 AM – 8:00 PM (ET)