All videos from the presentations: Hack/Reduce Vimeo
Hi everyone, this is coming a bit late now since Hack/Reduce Toronto wrapped up over a week ago, but we were too busy setting up for Boston to write about the presentations we saw in Toronto… Anyway, we had an amazing time in Toronto. Thanks to all of the participants who worked hard the whole day! Seeing people working hard, learning and enjoying Hack/Reduce is what makes it all worthwhile for us!
The day started off with coffee, a short introduction to using the cluster and pitches by the participants. The Hopper team then gave a short tutorial on Hadoop and Map/Reduce.
People got to work really quickly, and the amount of noise in the beginning when the teams were discussing their projects was mind-blowing. A lot of buzz. Soon after 1 o’clock all the teams hunkered down to start coding and an eerie silence fell…
In the end, the teams really got a lot done. We saw some really amazing presentations. I’ll give some short descriptions of the pitches here. I’ve put up all of the presentations we had on our Vimeo channel.
There are some interesting things to learn from the videos, mostly about the technologies used and tested, so I suggest you check them out!
In the end, 10 teams out of 21 ended up presenting:
Bartek and Ash analyzed search queries that had been made during the course of one day. They grouped search queries in four categories: travel, sex, nerd and cooking. They then analyzed how the popularity of these categories in searches varied during the day.
Here are two pictures of the results, grouped by 1 minute and 30 minutes:
The source code is on github at https://github.com/ashchristopher/HackReduceToronto.
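As a rough illustration of the approach (this is not the team’s actual code — the category keywords and log format below are invented), a Hadoop Streaming-style mapper/reducer pair in Python could count category hits per minute like this:

```python
# Hypothetical keyword lists -- the team's real categories were travel,
# sex, nerd and cooking, but the keywords here are invented examples.
CATEGORIES = {
    "travel": {"flight", "hotel", "beach"},
    "cooking": {"recipe", "bake", "grill"},
}

def mapper(lines):
    """Emit ((category, minute), 1) for every query matching a category.

    Assumes tab-separated lines of the form HH:MM:SS<TAB>query -- the
    real log format is an assumption here.
    """
    for line in lines:
        timestamp, _, query = line.rstrip("\n").partition("\t")
        minute = timestamp[:5]  # "HH:MM" gives the 1-minute buckets
        words = set(query.lower().split())
        for category, keywords in CATEGORIES.items():
            if words & keywords:
                yield (category, minute), 1

def reducer(pairs):
    """Sum the counts for each (category, minute) key."""
    counts = {}
    for key, n in pairs:
        counts[key] = counts.get(key, 0) + n
    return counts

if __name__ == "__main__":
    sample = ["08:01:10\tcheap flight to paris", "08:01:45\tbest bake recipe"]
    print(reducer(mapper(sample)))
```

Swapping the `timestamp[:5]` slice for `timestamp[:4] + ("0" if timestamp[3] < "3" else "3")` would give the 30-minute buckets from the second chart.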
- The most romantic year by looking for the word love in song titles.
- The variation of words in song titles (Only 100 words are used in song titles)
- Average song tempo per year
- Song lengths per year
- Saddest tones (Turns out D is really sad)
- Recording locations.
The source code can be found on github: github.com/joyerobert/hackreduce
Gar, Nathan and Khurram first wanted to figure out whether oil prices affected flight prices, comparing oil company stock prices with the average price of all the flights in the dataset. However, the flight dataset was limited and the results ambiguous, so the team changed direction: they calculated which stocks in the NYSE data were the most volatile, creating a scoring algorithm to rank them.
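The team’s actual scoring algorithm wasn’t described in detail, but one common choice for a volatility score is the standard deviation of day-over-day returns — a minimal sketch with made-up data:

```python
import math

def volatility_score(prices):
    """Score a stock's volatility as the standard deviation of its
    day-over-day returns -- one plausible scoring choice, not
    necessarily the one the team used.
    """
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    mean = sum(returns) / len(returns)
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / len(returns))

# Rank tickers by score, most volatile first (prices are invented).
quotes = {"AAA": [10.0, 10.1, 9.9, 10.0], "BBB": [10.0, 14.0, 7.0, 12.0]}
ranking = sorted(quotes, key=lambda t: volatility_score(quotes[t]), reverse=True)
```

In a MapReduce setting the mapper would emit one (ticker, prices) group per stock and the reducer would compute the score, so the ranking parallelizes trivially.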
Team 3 used Mandy, a Ruby library that makes Hadoop easy to use. They also tried out Wukong, which they don’t recommend. Mandy worked very well, though.
Seak Pek Chhan, Nick Ursa (@nickursa), Athir Nvaimi, Gabe Sawhney
Pek, Nick, Athir and Gabe took a month and a half of Toronto Bixi bike-share data and wanted to see if ridership is affected by the weather. The answer is yes. Pek also ported the Perl code to Python for fun.
Stefan, Olivier, David and Mike had no prior experience with Hadoop. They used Python with mrjob. They analyzed DNS data for various things:
- Average number of nameservers (it’s 2.25, max is 6)
- Number of domains with a specific number of characters (11 characters is the most popular length)
- Domains with the most registered permutations of the same name (mostly used by spammers; every permutation of Yahoo and YouTube exists, for example)
The team noted that the configuration of number of mappers and reducers is very important to speed up the jobs.
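The permutation analysis can be sketched without Hadoop: grouping names by their sorted-letter signature puts every permutation of the same name into one bucket. A toy illustration (not the team’s mrjob code):

```python
from collections import defaultdict

def permutation_groups(domains):
    """Group second-level domain names that are permutations of each
    other: sorting a name's letters gives a signature shared by all
    of its permutations (e.g. 'yahoo' and 'yohao' both -> 'ahooy').
    """
    groups = defaultdict(list)
    for domain in domains:
        name = domain.split(".")[0].lower()
        groups["".join(sorted(name))].append(domain)
    return groups

def most_permuted(domains, top=3):
    """The buckets with the most registered permutations -- heavily
    permuted names are the spammer signal the team was looking for."""
    groups = permutation_groups(domains)
    return sorted(groups.values(), key=len, reverse=True)[:top]
```

In MapReduce terms the sorted signature is the mapper’s output key, and the reducer just counts the group sizes, which is why the job shards cleanly.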
Jordan Christiansen (Kobo, @thebigjc)
Jordan analyzed the correlations of every single stock pair on the NYSE. The data started at 0.5 GB and expanded to 250 GB once the pairs and prices had been created. A linear regression was then run over the dataset, ending up with 4M pairs. Some interesting correlations were found, and Jordan ended up with a huge list of correlated stocks.
You can find the code on github: github.com/thebigjc/hackreduce
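The core of the pairwise analysis can be sketched in a few lines — enumerate every symbol pair and compute a correlation for each (toy data, not Jordan’s code; the all-pairs blow-up is exactly what turned 0.5 GB into 250 GB):

```python
from itertools import combinations
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length price series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(series, threshold=0.9):
    """Every symbol pair whose |correlation| clears the threshold.
    combinations() enumerates all pairs -- the quadratic step that
    makes this a natural MapReduce job at NYSE scale."""
    return [(a, b, pearson(series[a], series[b]))
            for a, b in combinations(sorted(series), 2)
            if abs(pearson(series[a], series[b])) >= threshold]
```
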
Cleaver, Max, Shanly and Matt ran their last job while the pitches were already under way.
They analyzed the “connectedness” of various tech communities based on the Twitter social graph. They chose a couple of influencers per community, and a person was considered part of a community if they followed any of its influencers. For example, John Resig was a community leader for jQuery. You can check out the results in the video.
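The membership rule itself is just a set intersection — a minimal sketch, where the influencer lists and Twitter handles are made-up placeholders:

```python
# Hypothetical influencer handles per community -- invented examples,
# not the team's actual lists.
INFLUENCERS = {
    "jquery": {"jeresig"},
    "ruby": {"dhh", "wycats"},
}

def communities(follows):
    """The communities a user belongs to: following ANY of a
    community's influencers makes you a member, per the team's rule."""
    return {community
            for community, leaders in INFLUENCERS.items()
            if follows & leaders}
```

With membership decided per user, "connectedness" between two communities falls out of counting users who land in both sets.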
Yong worked on finding the cheapest flight combinations, computing the cheapest chained flights from Seattle. The project was limited by the dataset, which only contained flights from Seattle.
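Cheapest chained flights from a fixed origin is a shortest-path problem; a Dijkstra sketch over a price-weighted flight graph shows the idea (toy fares, not Yong’s code):

```python
import heapq

def cheapest_fares(flights, origin):
    """Cheapest total fare from origin to every reachable airport,
    chaining flights -- Dijkstra over a price-weighted graph."""
    graph = {}
    for src, dst, fare in flights:
        graph.setdefault(src, []).append((dst, fare))
    best = {origin: 0}
    heap = [(0, origin)]
    while heap:
        cost, airport = heapq.heappop(heap)
        if cost > best.get(airport, float("inf")):
            continue  # stale heap entry
        for dst, fare in graph.get(airport, ()):
            new = cost + fare
            if new < best.get(dst, float("inf")):
                best[dst] = new
                heapq.heappush(heap, (new, dst))
    return best

# Invented fares: chaining SEA->SFO->LAX (158) beats the direct 199.
fares = cheapest_fares([("SEA", "SFO", 99), ("SFO", "LAX", 59),
                        ("SEA", "LAX", 199)], "SEA")
```
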
Christophe Biocca, Akash Vaswani, Jake Nielsen, Drew Gross
Team 9 ended up working on parsing Wikipedia, and came to the conclusion that it’s painful.
In the end they just calculated which article has the most outbound links, but it was uncertain whether it actually worked correctly. The result was some error-correction page; for more details, check the video.
Jamie Wong (@jlfwong), Snady Wu, Wien Leung, Maverick Lee, Christopher Wu, Christopher Cooper
Team number 10 had members that worked on a couple of different projects:
Jamie Wong analyzed what made people from specific years notable, based on the year they were born and what they became famous for.
Snady Wu and Christopher Cooper worked on inbound links to articles but were halted by the Wikipedia parsing issues.
Christopher Wu worked on figuring out how far in advance you should buy your flight in order to get the cheapest fare.