All videos from the presentations: Hack/Reduce Vimeo
Hi everyone, this is coming a bit late, since Hack/Reduce Toronto was over a week ago, but we were too busy setting up for Boston to write about the presentations we saw in Toronto. Anyway, we had an amazing time in Toronto. Thanks to all of the participants who worked hard the whole day! Seeing people working hard, learning and enjoying Hack/Reduce is what makes it all worthwhile for us!
The day started off with coffee, a short introduction to using the cluster and pitches by the participants. The Hopper team then gave a short tutorial on Hadoop and Map/Reduce.
People got to work really quickly and the amount of noise in the beginning when the teams were discussing their projects was mind-blowing. A lot of buzz. Soon after 1 o’clock all the teams hunkered down to start coding and there was an eerie silence again…
In the end, the teams really got a lot done. We saw some really amazing presentations. I’ll give some short descriptions of the pitches here. I’ve put up all of the presentations we had on our Vimeo channel.
There are some interesting things to learn from the videos, mostly about the technologies used and tested, so I suggest you check them out!
In the end, 10 teams out of 21 ended up presenting:
Check out the presentation on Vimeo
Bartek Ciszkowski (@bartek), Ash Christopher (@ashchristopher)
Bartek and Ash analyzed search queries that had been made during the course of one day. They grouped search queries in four categories: travel, sex, nerd and cooking. They then analyzed how the popularity of these categories in searches varied during the day.
Here are two pictures of the results, grouped by 1 minute and 30 minutes:
1 minute:
30 minutes:
The source code is on github at https://github.com/ashchristopher/HackReduceToronto.
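As a rough illustration of the grouping described above (not the team's actual code), the sketch below assigns each query to a category by keyword and counts queries per time bucket. The keyword lists, the sample queries and the function names are all made up for the example; the post does not say how the team mapped queries to categories.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword lists; the team's real category rules are not in the post.
CATEGORY_KEYWORDS = {
    "travel": {"flight", "hotel"},
    "cooking": {"recipe", "bake"},
}


def categorize(query):
    """Return the first category with a keyword in the query, else None."""
    words = set(query.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return None


def bucket_counts(records, bucket_minutes):
    """Count categorized queries per (bucket start minute, category)."""
    counts = Counter()
    for timestamp, query in records:
        category = categorize(query)
        if category is None:
            continue
        minute_of_day = timestamp.hour * 60 + timestamp.minute
        # Round down to the start of the bucket this minute falls into.
        bucket_start = minute_of_day - minute_of_day % bucket_minutes
        counts[(bucket_start, category)] += 1
    return counts


# Made-up sample data: (timestamp, query text).
records = [
    (datetime(2011, 1, 1, 9, 5), "cheap flight to boston"),
    (datetime(2011, 1, 1, 9, 20), "hotel in toronto"),
    (datetime(2011, 1, 1, 9, 40), "banana bread recipe"),
]
per_minute = bucket_counts(records, 1)
per_half_hour = bucket_counts(records, 30)
```

Changing `bucket_minutes` from 1 to 30 is the only difference between the two groupings; the wider bucket smooths out the minute-to-minute noise.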
Check out the presentation on Vimeo
Joel Crocker (@joelcrocker), Johan Harjono (@jharjono), Joey Robert (@joeyrobert), Ian Stevens (@istevens)
Joel, Johan, Joey and Ian used a 10,000-song subset of the Million Song Dataset. They used the Disco distributed computing framework with Python.
They analyzed:
The source code can be found on github: Github/joyerobert/hackreduce
Check out the presentation on Vimeo
Gar Liu (@lonelydatum), Nathan Rambarran (@wibblz), Khurram Virani (@viranik)
Gar, Nathan and Khurram first wanted to figure out if oil prices affected flight prices. They took oil company stock prices and the average price from all the flights in the dataset. However, the flight dataset was limited and the results ambiguous, so the team changed direction. Next, they wanted to calculate which stocks were the most volatile in the NYSE data. They created a scoring algorithm to determine which stocks are the most volatile.
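The post does not describe the team's scoring algorithm, so the following is only one plausible way to score volatility: the standard deviation of day-over-day returns. The ticker names and prices are invented for the example.

```python
from statistics import pstdev


def volatility_score(closes):
    """Population standard deviation of day-over-day returns.

    This is an assumed scoring rule, not the one the team used.
    """
    returns = [(today - prev) / prev for prev, today in zip(closes, closes[1:])]
    return pstdev(returns)


# Made-up closing prices for two hypothetical tickers.
prices = {
    "AAA": [10.0, 10.0, 10.0, 10.0],
    "BBB": [10.0, 12.0, 9.0, 11.0],
}

# Most volatile first.
ranked = sorted(prices, key=lambda symbol: volatility_score(prices[symbol]), reverse=True)
```

Using returns rather than raw prices keeps a high-priced stock from looking more volatile just because its absolute price swings are larger.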
Team 3 used Mandy, a library that makes it easy to use Hadoop with Ruby. They also tried out Wukong and don’t recommend it. Mandy worked very well, though.
Check out the presentation on Vimeo
Seak Pek Chhan, Nick Ursa (@nickursa), Athir Nvaimi, Gabe Sawhney
Pek, Nick, Athir and Gabe took a month and a half of Toronto bixi data and wanted to see if bixi usage is affected by the weather. The answer is yes. Pek also took the Perl code and turned it into Python for fun.
Check out the presentation on Vimeo
Stefan Arentz (@satefan), Olivier Yiptong (@sayhello), David Chang, Mike Pettypiece (@mtpettyp)
Stefan, Olivier, David and Mike had no prior experience with Hadoop. They used Python with mrjob. They analyzed DNS data for various things:
Average number of nameservers (it’s 2.25, max is 6)
Number of domains with a specific number of characters. (11 is the most popular)
Domains with the largest number of registered permutations of the same name (mostly used by spammers; every permutation of Yahoo and Youtube exists, for example)
The team noted that configuring the number of mappers and reducers is very important for speeding up the jobs.
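The domain-length count can be pictured as a map step followed by a reduce step. The sketch below simulates that flow in plain Python rather than using mrjob, so it runs without a cluster; the record layout, the sample domains and the choice to measure only the first label of the domain are assumptions made for the example.

```python
from collections import defaultdict


def map_domain_length(record):
    """Mapper: emit (length of the first label, 1) for one DNS record."""
    domain, _nameservers = record
    name = domain.split(".")[0]
    yield len(name), 1


def reduce_sum(key, values):
    """Reducer: add up the ones emitted for each length."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the shuffle between map and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up records: (domain, list of nameservers).
records = [
    ("example.com", ["ns1.example.com", "ns2.example.com"]),
    ("hackreduce.org", ["ns1.host.net", "ns2.host.net", "ns3.host.net"]),
    ("testing.net", ["ns1.host.net"]),
]

length_counts = run_job(records, map_domain_length, reduce_sum)
average_nameservers = sum(len(ns) for _, ns in records) / len(records)
```

On a real cluster the grouping done by `run_job` is handled by the framework, which is where the number of mappers and reducers comes into play.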
Check out the presentation on Vimeo
Jordan Christiansen (Kobo, @thebigjc)
Jordan analyzed the correlations of every single stock pair on NYSE. The data started at 0.5 GB and expanded to 250 GB once the pairs and prices had been created. A linear regression was then run on the dataset, which ended up with 4M pairs. Some interesting correlations were found, and Jordan ended up with a huge list of correlated stocks.
You can find the code on github: github.com/thebigjc/hackreduce
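To give a feel for why the data grows so much, here is a small sketch of the pairwise idea. It is not Jordan's code: it computes a Pearson correlation per pair instead of running a linear regression, and the tickers and prices are invented. With n symbols there are n(n-1)/2 pairs, so the intermediate data grows quadratically with the number of stocks.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length price series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up closing prices for three hypothetical tickers.
prices = {
    "AAA": [1.0, 2.0, 3.0, 4.0],
    "BBB": [2.0, 4.0, 6.0, 8.0],
    "CCC": [4.0, 3.0, 2.0, 1.0],
}

# One entry per unordered pair of symbols.
correlations = {
    (a, b): pearson(prices[a], prices[b])
    for a, b in combinations(sorted(prices), 2)
}
```

In a MapReduce setting each pair can be scored independently, which is what makes the problem a good fit for a cluster despite the size of the expanded data.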
Check out the presentation on Vimeo
Cleaver Barnes (@cleaverbarnes), Max Brodie (@maxwellbrodie), Shanly Suepaul, Matt MacLean
Cleaver, Max, Shanly and Matt ran their last job while the pitches were already under way.
They analyzed the “connectedness” of various tech communities based on the Twitter social graph. They chose a couple of influencers per community and counted a person as part of the community if they followed any of that community’s influencers. For example, John Resig was a community leader in jQuery. You can check out the results in the video.
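The membership rule is simple enough to sketch with sets. The account names below are placeholders, and the overlap measure (Jaccard similarity between two communities) is only an assumption about what "connectedness" could mean; the post does not say which measure the team used.

```python
def members(follows, influencers):
    """Users who follow at least one of the community's influencers."""
    return {user for user, followed in follows.items() if followed & influencers}


def overlap(community_a, community_b):
    """Jaccard similarity of two member sets (an assumed connectedness measure)."""
    union = community_a | community_b
    return len(community_a & community_b) / len(union) if union else 0.0


# Made-up follow graph: user -> set of accounts they follow.
follows = {
    "u1": {"lead_js"},
    "u2": {"lead_js", "lead_py"},
    "u3": {"lead_py"},
    "u4": {"someone_else"},
}

js_community = members(follows, {"lead_js"})
py_community = members(follows, {"lead_py"})
connectedness = overlap(js_community, py_community)
```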
Check out the presentation on Vimeo
Yong Liang
Yong worked on finding the cheapest flight combinations. He found the cheapest chained flights from Seattle. The project was limited by the dataset, which only contained flights from Seattle.
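One common way to find the cheapest chain of flights is to treat airports as nodes and fares as edge weights, then run a shortest-path search. The sketch below uses Dijkstra's algorithm on an invented fare table; it is not Yong's implementation, and it ignores connection times and dates entirely.

```python
import heapq


def cheapest_from(origin, fares):
    """Dijkstra over a {origin: {destination: price}} fare table.

    Returns the cheapest total price to every reachable airport.
    """
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        cost, airport = heapq.heappop(queue)
        if cost > best.get(airport, float("inf")):
            continue  # Stale queue entry; a cheaper route was already found.
        for destination, price in fares.get(airport, {}).items():
            new_cost = cost + price
            if new_cost < best.get(destination, float("inf")):
                best[destination] = new_cost
                heapq.heappush(queue, (new_cost, destination))
    return best


# Made-up one-way fares.
fares = {
    "SEA": {"ORD": 150.0, "JFK": 400.0},
    "ORD": {"JFK": 120.0},
}
totals = cheapest_from("SEA", fares)
```

In this toy table the chained route via ORD beats the direct SEA to JFK fare, which is exactly the kind of result the project was looking for.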
Check out the presentation on Vimeo
Christophe Biocca, Akash Vaswani, Jake Nielsen, Drew Gross
Team 9 ended up working on parsing Wikipedia and came to the conclusion that it’s painful.
In the end they just calculated which article has the most outbound links, but it was uncertain whether it actually worked correctly. The result was some error correction page. For more details, check the video.
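A minimal sketch of counting outbound links, assuming the articles are available as raw wikitext where internal links look like `[[Target]]` or `[[Target|label]]`. The regular expression and the sample articles are illustrative only; real wikitext also has templates, namespaces and nested markup that this ignores, which is part of why parsing it is painful.

```python
import re

# Capture the link target up to the first "]", "|" or "#".
LINK_PATTERN = re.compile(r"\[\[([^\]|#]+)")


def outbound_links(wikitext):
    """Return the set of distinct link targets in one article's wikitext."""
    return {target.strip() for target in LINK_PATTERN.findall(wikitext)}


# Made-up articles: title -> wikitext.
articles = {
    "A": "See [[B]] and [[C|the C article]], then [[B]] again.",
    "B": "Back to [[A#History]].",
}

most_linked = max(articles, key=lambda title: len(outbound_links(articles[title])))
```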
Check out the presentation on Vimeo
Jamie Wong (@jlfwong), Snady Wu, Wien Leung, Maverick Lee, Christopher Wu, Christopher Cooper
Team number 10 had members that worked on a couple of different projects:
Jamie Wong analyzed what made people notable from specific years based on the year they were born and what they had become famous for.
Snady Wu and Christopher Cooper worked on inbound links to articles but were halted by the Wikipedia parsing issues.
Christopher Wu worked on figuring out how long before departure you should buy your flight in order to get the cheapest price.
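The booking-time question can be approached by grouping price quotes by the number of days between the query and the departure. The sketch below assumes each quote carries a query date, a departure date and a price; the field layout and the sample quotes are made up, and it is not Christopher's code.

```python
from collections import defaultdict
from datetime import date


def average_price_by_lead_time(quotes):
    """Average quoted price for each number of days before departure."""
    totals = defaultdict(lambda: [0.0, 0])  # lead days -> [price sum, count]
    for query_date, departure_date, price in quotes:
        lead_days = (departure_date - query_date).days
        totals[lead_days][0] += price
        totals[lead_days][1] += 1
    return {days: total / count for days, (total, count) in totals.items()}


# Made-up quotes: (query date, departure date, price).
quotes = [
    (date(2011, 5, 1), date(2011, 6, 1), 300.0),
    (date(2011, 5, 2), date(2011, 6, 2), 320.0),
    (date(2011, 5, 25), date(2011, 6, 1), 450.0),
]

averages = average_price_by_lead_time(quotes)
best_lead_days = min(averages, key=averages.get)
```

A fairer comparison would group by route first, since averaging across routes mixes cheap and expensive itineraries.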
Thanks a lot for the event, everyone. We also want to thank the sponsors: Hopper, Amazon, Kobo, Mantella Venture Partners, Chango, Attachments.me and Startupnorth.
The teams are working hard at Hack/Reduce Toronto. We have 20 teams working, 80 coders on our clusters with a total of 500 nodes.
Some images of the action:
Hack/Reduce is a free one-day big data hackathon. The goal is to extract valuable information from large datasets and learn how to work with big data.
You can sign up for Hack/Reduce in Toronto on June 18th at hackreduce2.eventbrite.com and Hack/Reduce Boston on June 25th at hackreduce3.eventbrite.com.
The event brings together Developers, Companies, Entrepreneurs and Students interested in Big Data.
Provided:
At the end of the event, participating teams and developers get to present what they have done, what they learned and what problems they faced.
It’s an opportunity to develop something great, learn Hadoop MapReduce and meet people interested in big data.
For more info, check out the event page for our last event in Montréal or discuss with us on Convore: https://convore.com/hackreduce/
Developers, researchers and students in big data or interested in working with big data. The best thing is if you have something you want to get done that requires a lot of computing power. Alternatively, you can come to learn to use Hadoop. Basically Hack/Reduce is about developers, working with new people, pizza, unlimited computing power and large data sets.
Check out the photos from Hack/Reduce in Montréal:
You have a free hand to build anything you want. You can start by looking at the datasets that we have confirmed and the example datasets. You can choose from these datasets or suggest your own. You can even provide us with your own dataset, and we’ll upload it for the event.
You’ll only have about 8 hours to build your application, so you should aim to build something that can be done in that timeframe. Maybe you can find some interesting information in a dataset or from the combination of two datasets.
You can check the projects that were built at the Montréal event. Summary:
Using our flight dataset, the team calculated which day you should choose to get the cheapest flights by comparing prices for the same routes on different weekdays.
The Lazy bixi activity team mashed up bixi data from Montréal with the Google Elevation API to analyze where the laziest bixi users were situated in Montreal.
By analyzing words from the Google n-grams (Books dataset), the team was able to pick out words that were particularly popular in specific years compared to other years. For example, Viagra was one of these words in 2005.
The Twitter influence team calculated the most influential tweeters based on the number of followers’ followers. Results are posted here.
Verifying the value of a trigger event for deciding when to buy stock.
Building a Lucene-like inverted index on all of Wikipedia.
A PageRank implementation that ranks people using Twitter followers as “links”.
Calculating the times when specific bixi spots in Montreal have the most activity.
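For readers who have not met an inverted index before, here is a small in-memory sketch of the idea behind the Wikipedia indexing project: map every word to the documents that contain it, so a query becomes a set intersection. The tokenizer and the sample documents are assumptions for the example and have nothing to do with the Montréal team's code or with Lucene's actual internals.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_inverted_index(documents):
    """Map each lowercase word to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Ids of documents that contain every word of the query."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Made-up documents: id -> text.
documents = {
    "Toronto": "Toronto is a city in Ontario",
    "Montreal": "Montreal is a city in Quebec",
}
index = build_inverted_index(documents)
hits = search(index, "city ontario")
```

In a MapReduce version the mapper would emit (word, document id) pairs and the reducer would collect the ids for each word.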
http://www.opendataday.org/wiki/Opendata_Day_Projects
We need to pre-load datasets for use at the Hackathon, otherwise we’ll waste time just loading datasets at the event. We’ll at least have the datasets that are in the confirmed list below. We’ll be updating the list as the event draws closer.
If you require other datasets than the ones listed under “Confirmed”, please send us an email (riku -at- hopper.travel) or contact us through twitter (@hackreduce). We’ll pre-load them for you.
You can also use any of the datasets that Amazon provides access to: Public Datasets on AWS. However, you’ll have to contact us about which datasets you want to use (riku – at – hopper.travel).
You can also bring us or suggest your own datasets.
| Twitter Data, Social Graph and Public User Information | http://an.kaist.ac.kr/traces/WWW2010.html |
| Google Books Data | http://ngrams.googlelabs.com/datasets |
| NYSE data 1970 to 2010 (daily high, low and volume) | http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange |
| NASDAQ data 1970 to 2010 (daily high, low and volume) | http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nasdaq-exchange |
| 100GB of search data | Chango is providing 100GB of search term data with timestamps. It’s about half a day of searches. |
| CO2 Ocean measurements | Carbon dioxide measurements done on the ocean surface over the last 50 years or so from http://cdiac.ornl.gov/ |
| Million song database (We hope we’ll have it uploaded, not certain) | http://labrosa.ee.columbia.edu/millionsong/ |
| 5M flight price data | Flight data of return flights with query time, itinerary and price. 500k different flights queried at different times |
| Wikipedia | Complete English Wikipedia articles |
| wordnet, geonames | |
| Freebase (26 GB) | (http://aws.amazon.com/datasets/2320?_encoding=UTF8&jiveRedirect=1) |
| Human Genome data (200 GB) | (http://aws.amazon.com/datasets/2315?_encoding=UTF8&jiveRedirect=1) |
| US Census data 2000 (200 GB) | (http://aws.amazon.com/datasets/Economics/2290) |
| Enron e-mail dataset (500,000 messages) | http://www.cs.cmu.edu/~enron/ |
| Whitehouse visitor records (1,000,000 records) | http://www.whitehouse.gov/briefing-room/disclosures/visitor-records |
| IMDB (1.7M filmographies) | http://www.infochimps.com/datasets/internet-movie-database |
| The first billion digits of Pi | http://www.infochimps.com/datasets/the-first-billion-digits-of-pi |
| HBGary email archive | http://www.p2pnet.net/story/48930 |
| DBpedia | http://wiki.dbpedia.org/Datasets |
You don’t have to do any special preparations for Hack/Reduce other than thinking about what you might want to build. However, we suggest you:
In Montréal we had about 10 people stand up and give a short 30 second pitch about what they wanted to build or what dataset they were interested in. After the pitches anyone could go talk to the people that had pitched to join their team. We also had several teams that had been pre-formed. We also had some people that worked alone. Anything is possible.
You will need to bring your own laptop to the event; everything else will be provided. We have had bad experiences getting Windows environments to work with Hadoop, so we suggest you bring a Linux or OS X machine.
| 10-10.15 | Introduction |
| 10.15-11.30 | Intro to technologies, tutorial and setup |
| 11.30-19 | Development time, support from our experts |
| 19-20 | Presentations of results |
| 20 | Gathering stuff and off to celebrate! |
Food, coffee and drinks will be available throughout the day.
Hack/Reduce 2 in Toronto will be held at
Mantella Venture Partner Offices
488 Wellington Street West, Suite 300
Toronto, Ontario M5V 1E3
Canada
The next Hack/Reduce is in Toronto on June 18th, 2011.
You can sign up at hackreduce2.eventbrite.com.