Toronto


 

Hack/Reduce Toronto Presentations

All videos from the presentations: Hack/Reduce Vimeo

Hi everyone, this is coming a bit late, since Hack/Reduce Toronto took place over a week ago, but we were too busy setting up for Boston to write about the presentations we saw in Toronto. Anyway, we had an amazing time in Toronto. Thanks to all of the participants who worked hard the whole day! Seeing people working hard, learning and enjoying Hack/Reduce is what makes it all worthwhile for us!

The day started off with coffee, a short introduction to using the cluster and pitches by the participants. The Hopper team then gave a short tutorial on Hadoop and Map/Reduce.

People got to work really quickly and the amount of noise in the beginning when the teams were discussing their projects was mind-blowing. A lot of buzz. Soon after 1 o’clock all the teams hunkered down to start coding and there was an eerie silence again…

In the end, the teams really got a lot done. We saw some really amazing presentations. I’ll give some short descriptions of the pitches here. I’ve put up all of the presentations we had on our Vimeo channel.

There are some interesting things to learn from the videos, mostly about the technologies used and tested, so I suggest you check them out!

In the end, 10 teams out of 21 ended up presenting:

Team 1

Check out the presentation on Vimeo

Bartek Ciszkowski (@bartek), Ash Christopher (@ashchristopher)

Bartek and Ash analyzed search queries that had been made during the course of one day. They grouped the queries into four categories: travel, sex, nerd and cooking. They then analyzed how the popularity of these categories varied during the day.

Here are two pictures of the results, grouped by 1 minute and 30 minutes:

1 minute:

30 minutes:

The source code is on github at https://github.com/ashchristopher/HackReduceToronto.
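
As a rough illustration of the bucketing idea (this is not the team's code, which lives in the repository above), the sketch below assigns each query to a category and counts queries per time bucket. The keyword lists, the record format and the sample timestamps are all made up for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword lists; the keywords the team actually used are not given here.
CATEGORY_KEYWORDS = {
    "travel": {"flight", "hotel"},
    "cooking": {"recipe", "bake"},
}


def categorize(query):
    """Return the first category with a keyword in the query, or None."""
    words = set(query.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return None


def bucket_start(ts, minutes):
    """Round a timestamp down to the start of its bucket (size in minutes)."""
    minute_of_day = ts.hour * 60 + ts.minute
    floored = minute_of_day - minute_of_day % minutes
    return ts.replace(hour=floored // 60, minute=floored % 60,
                      second=0, microsecond=0)


def count_by_bucket(records, minutes):
    """Count categorized queries per (bucket start, category)."""
    counts = Counter()
    for ts, query in records:
        category = categorize(query)
        if category is not None:
            counts[(bucket_start(ts, minutes), category)] += 1
    return counts


# Made-up sample records: (timestamp, query text).
records = [
    (datetime(2011, 6, 18, 9, 5, 12), "cheap flight to boston"),
    (datetime(2011, 6, 18, 9, 20, 40), "hotel deals"),
    (datetime(2011, 6, 18, 9, 45, 3), "pancake recipe"),
]
per_half_hour = count_by_bucket(records, 30)
per_minute = count_by_bucket(records, 1)
```

Changing only the bucket size switches between the 1 minute and 30 minute views.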

Team 2

Check out the presentation on Vimeo

Joel Crocker (@joelcrocker), Johan Harjono (@jharjono), Joey Robert (@joeyrobert), Ian Stevens (@istevens)

Joel, Johan, Joey and Ian used a 10,000-song subset of the Million Song Dataset. They used the Disco distributed computing framework with Python.

They analyzed:

  • The most romantic year by looking for the word love in song titles.
  • The variation of words in song titles (Only 100 words are used in song titles)
  • Average song tempo per year
  • Song lengths per year
  • Saddest tones (Turns out D is really sad)
  • Recording locations.
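
Several of these analyses share one map/reduce shape: emit a (year, value) pair per song, then aggregate per year. The sketch below shows that shape in plain Python rather than Disco; the record fields and sample songs are invented and do not reflect the Million Song Dataset schema or the team's code.

```python
from collections import defaultdict

# Made-up sample records; the field names are assumptions for this example.
songs = [
    {"title": "Love on Monday", "year": 1962, "tempo": 148.0},
    {"title": "Rainy Day", "year": 1965, "tempo": 96.0},
    {"title": "Endless Love", "year": 1967, "tempo": 104.0},
    {"title": "Lovely Day", "year": 1967, "tempo": 88.0},
]


def map_song(song):
    """Map step: emit (year, (has_love, tempo)) for one song."""
    # Whole-word match, so "Lovely" does not count as "love".
    has_love = 1 if "love" in song["title"].lower().split() else 0
    return song["year"], (has_love, song["tempo"])


def reduce_year(values):
    """Reduce step: count 'love' titles and average the tempo for one year."""
    values = list(values)
    love_count = sum(flag for flag, _ in values)
    avg_tempo = sum(tempo for _, tempo in values) / len(values)
    return love_count, avg_tempo


def run(song_records):
    """Group mapped values by year, then reduce each group."""
    grouped = defaultdict(list)
    for song in song_records:
        year, value = map_song(song)
        grouped[year].append(value)
    return {year: reduce_year(values) for year, values in grouped.items()}


stats = run(songs)
```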

The source code can be found on github: Github/joyerobert/hackreduce

Team 3

Check out the presentation on Vimeo

Gar Liu (@lonelydatum), Nathan Rambarran (@wibblz), Khurram Virani (@viranik)

Gar, Nathan and Khurram first wanted to figure out if oil prices affected flight prices. They took oil company stock prices and the average price from all the flights in the dataset. However, the flight dataset was limited and the results ambiguous, so the team changed direction. Next, they wanted to calculate which stocks were the most volatile in the NYSE data. They created a scoring algorithm to determine which stocks are the most volatile.
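
The team's scoring algorithm is not described here. One common stand-in for volatility is the standard deviation of daily returns, sketched below in Python. The symbols and closing prices are invented, and the use of closing prices is an assumption.

```python
from statistics import pstdev


def daily_returns(closes):
    """Relative change between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]


def volatility(closes):
    """Assumed score: population standard deviation of daily returns."""
    return pstdev(daily_returns(closes))


# Made-up closing prices for two hypothetical symbols.
prices = {
    "AAA": [10.0, 10.1, 10.0, 10.1],
    "BBB": [10.0, 12.0, 9.0, 13.0],
}

# Rank symbols from most to least volatile under this score.
ranked = sorted(prices, key=lambda symbol: volatility(prices[symbol]),
                reverse=True)
```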

Team 3 used Mandy, a library that makes it easy to use Hadoop with Ruby. They also tried out Wukong and don’t recommend it. Mandy worked very well, though.

Team 4

Check out the presentation on Vimeo

Seak Pek Chhan, Nick Ursa (@nickursa), Athir Nvaimi, Gabe Sawhney

Pek, Nick, Athir and Gabe took a month and a half of Toronto bixi data and wanted to see if bixi usage is affected by the weather. The answer is yes. Pek also took the Perl code and turned it into Python for fun.

Team 5

Check out the presentation on Vimeo

Stefan Arentz (@satefan), Olivier Yiptong (@sayhello), David Chang, Mike Pettypiece (@mtpettyp)

Stefan, Olivier, David and Mike had no prior experience with Hadoop. They used Python with mrjob. They analyzed DNS data for various things:

  • Average number of nameservers (it’s 2.25, max is 6)
  • Number of domains with a specific number of characters (11 is the most popular)
  • Domains with the largest number of permutations of the same name (mostly used by spammers; every permutation of Yahoo and Youtube exists, for example)

The team noted that configuring the number of mappers and reducers is very important for speeding up the jobs.
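
To give a feel for the domain-length count, here is a minimal sketch written as plain Python functions shaped like a mapper and a reducer. It is not an actual mrjob job and not the team's code. The input format (one domain per line) and the choice to ignore the top-level domain when counting characters are assumptions.

```python
from collections import defaultdict


def mapper(line):
    """Emit (name length, 1) for one domain; the TLD is not counted."""
    domain = line.strip().lower()
    if domain:
        name = domain.rsplit(".", 1)[0]
        yield len(name), 1


def reducer(length, counts):
    """Sum the occurrences for one name length."""
    yield length, sum(counts)


def run(lines):
    """Tiny local driver: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


# Sample input; blank lines are skipped by the mapper.
length_counts = run(["example.com", "hopper.travel", "kobo.com",
                     "chango.com\n", ""])
```

On a real cluster the grouping step is done by the framework, which is where the number of mappers and reducers comes into play.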

Team 6

Check out the presentation on Vimeo

Jordan Christiansen (Kobo, @thebigjc)

Jordan analyzed the correlations of every single stock pair on the NYSE. The data started at 0.5 GB and expanded to 250 GB once the pairs and prices had been created. A linear regression was then run on the dataset, ending up with 4M pairs. Some interesting correlations were found and Jordan ended up with a huge list of correlated stocks.
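
The blow-up in data size follows from the number of pairs: n symbols give n(n-1)/2 unordered pairs, so roughly 2,800 symbols would already yield about 4M pairs, if the 4M figure counts unordered pairs. The sketch below shows the pairing step with a Pearson correlation, which is closely related to a simple linear regression per pair. It is not Jordan's code, and the symbols and prices are invented.

```python
from itertools import combinations
from math import sqrt


def pair_count(n):
    """Number of unordered pairs among n symbols."""
    return n * (n - 1) // 2


def pearson(xs, ys):
    """Pearson correlation of two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up price series for three hypothetical symbols.
prices = {
    "AAA": [1.0, 2.0, 3.0, 4.0],
    "BBB": [2.0, 4.0, 6.0, 8.0],
    "CCC": [4.0, 3.0, 2.0, 1.0],
}

# One correlation per unordered pair of symbols.
correlations = {
    (a, b): pearson(prices[a], prices[b])
    for a, b in combinations(sorted(prices), 2)
}
```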

You can find the code on github: github.com/thebigjc/hackreduce

Team 7

Check out the presentation on Vimeo

Cleaver Barnes (@cleaverbarnes), Max Brodie (@maxwellbrodie), Shanly Suepaul, Matt MacLean

Cleaver, Max, Shanly and Matt ran their last job while the pitches were already under way.

They analyzed the “connectedness” of various tech communities based on the Twitter social graph. They chose a couple of influencers per community, and a person was considered part of a community if they followed any of that community’s influencers. For example, John Resig was a community leader in jQuery. You can check out the results in the video.
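
The membership rule can be sketched in a few lines of Python. The follow graph and account names below are invented, and the Jaccard overlap used as a measure of "connectedness" is an assumption; the team's actual metric is only shown in the video.

```python
def community_members(follows, influencers):
    """Users who follow at least one of the community's influencers."""
    return {user for user, followed in follows.items()
            if followed & influencers}


def overlap(a, b):
    """Jaccard overlap of two member sets (assumed connectedness measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up follow graph: user -> set of accounts they follow.
follows = {
    "alice": {"inf_js"},
    "bob": {"inf_js", "inf_py"},
    "carol": {"inf_py"},
    "dave": set(),
}

js_community = community_members(follows, {"inf_js"})
py_community = community_members(follows, {"inf_py"})
connectedness = overlap(js_community, py_community)
```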

Team 8

Check out the presentation on Vimeo

Yong Liang

Yong worked on finding the cheapest flight combinations. He found the cheapest chained flights from Seattle. The project was limited by the dataset, which only contained flights from Seattle.
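
One way to frame chained flights is as a shortest-path problem where the edge weights are prices. The sketch below uses Dijkstra's algorithm on invented fares; it is not Yong's implementation, and it ignores dates and connection times.

```python
import heapq


def cheapest_routes(flights, origin):
    """Cheapest total price from origin to every reachable airport."""
    graph = {}
    for src, dst, price in flights:
        graph.setdefault(src, []).append((dst, price))

    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        cost, airport = heapq.heappop(queue)
        if cost > best.get(airport, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, price in graph.get(airport, []):
            new_cost = cost + price
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return best


# Made-up one-way fares: (origin, destination, price).
flights = [
    ("SEA", "ORD", 200.0),
    ("SEA", "JFK", 450.0),
    ("ORD", "JFK", 150.0),
    ("JFK", "BOS", 80.0),
]
totals = cheapest_routes(flights, "SEA")
```

With these sample fares, chaining through ORD beats the direct SEA to JFK fare.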

Team 9

Check out the presentation on Vimeo

Christophe Biocca, Akash Vaswani, Jake Nielsen, Drew Gross

Team 9 ended up working on parsing Wikipedia and came to the conclusion that it’s painful.

In the end they just calculated which article has the most outbound links, but it was uncertain whether it actually worked correctly. The result was some error correction page. For more details, check the video.
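
A simplified view of the outbound-link count, assuming articles are available as raw wikitext: extract the target of each `[[...]]` link and count distinct targets per article. The regular expression below is a deliberately naive sketch (it also counts links such as `File:` or `Category:`), which hints at why real Wikipedia parsing is painful. The sample articles are invented.

```python
import re

# Capture the link target up to the first '|', '#' or bracket.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def outbound_links(wikitext):
    """Distinct link targets found in one article's wikitext."""
    return {m.strip() for m in LINK_RE.findall(wikitext) if m.strip()}


def most_linking_article(articles):
    """Title of the article with the most distinct outbound links."""
    return max(articles, key=lambda title: len(outbound_links(articles[title])))


# Made-up articles: title -> wikitext.
articles = {
    "A": "See [[B]] and [[C|the C page]] and [[B#History]].",
    "B": "Only [[A]] and [[#Local section]].",
}
winner = most_linking_article(articles)
```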

Team 10

Check out the presentation on Vimeo

Jamie Wong (@jlfwong), Snady Wu, Wien Leung, Maverick Lee, Christopher Wu, Christopher Cooper

Team number 10 had members that worked on a couple of different projects:

Jamie Wong analyzed what made people born in specific years notable, based on their birth year and what they had become famous for.

Snady Wu and Christopher Cooper worked on inbound links to articles but were halted by the Wikipedia parsing issues.

Christopher Wu worked on figuring out how far in advance you should buy your flight in order to get the cheapest price.

 

Thanks a lot for the event, everyone. We also want to thank the sponsors: Hopper, Amazon, Kobo, Mantella Venture Partners, Chango, Attachments.me and Startupnorth.

Hack/Reduce Toronto Update: 500 nodes, 80 coders

The teams are working hard at Hack/Reduce Toronto. We have 20 teams working, 80 coders on our clusters with a total of 500 nodes.

Some images of the action:

Greg giving the starting tutorial about Map/Reduce and Hadoop to everyone interested

What it is

Hack/Reduce is a free one-day big data hackathon. The goal is to extract valuable information from large datasets and learn how to work with big data.

You can sign up for Hack/Reduce in Toronto on June 18th at hackreduce2.eventbrite.com and Hack/Reduce Boston on June 25th at hackreduce3.eventbrite.com.

The event brings together Developers, Companies, Entrepreneurs and Students interested in Big Data.

Provided:

  • Free access to large clusters that can be scaled up according to your needs.
  • Pre-loaded datasets (participants are encouraged to suggest datasets)
  • Introduction to Hadoop and Map/Reduce and the infrastructure
  • Support from Hadoop and Map/Reduce experts
  • Food and drinks

At the end of the event, participating teams and developers get to present what they have done, what they learned and what problems they faced.
It’s an opportunity to develop something great, learn Hadoop MapReduce and meet people interested in big data.

For more info, check out the event page for our last event in Montréal or discuss with us on Convore: https://convore.com/hackreduce/

Who is it for?

Developers, researchers and students in big data or interested in working with big data. The best thing is if you have something you want to get done that requires a lot of computing power. Alternatively, you can come to learn to use Hadoop. Basically Hack/Reduce is about developers, working with new people, pizza, unlimited computing power and large data sets.

Check out the photos from Hack/Reduce in Montréal:

Hack/Reduce on Picasa

Steven’s pics on Flickr

 

What will be built?

You are free to build anything you want. You can start by looking at the datasets that we have confirmed and the example datasets. You can choose from these datasets or suggest your own. You can even provide us with your own dataset and we’ll upload it for the event.
You’ll only have about 8 hours to build your application, so you should aim to build something that can be done in that timeframe. Maybe you can find some interesting information in a dataset or from the combination of two datasets.

You can check the projects that were built at the Montréal event. Summary:

Calculating the cheapest day to fly

Using our flight dataset, the team calculated which day you should choose to get the cheapest flights by comparing prices for the same routes on different weekdays.
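
A minimal sketch of that comparison, assuming each record is a (route, departure date, price) tuple: average the price per route and weekday, then pick the cheapest weekday for each route. The route name and prices are invented, and this is not the Montréal team's code.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def cheapest_weekday_per_route(quotes):
    """Map each route to (weekday name, average price) of its cheapest weekday."""
    sums = defaultdict(lambda: [0.0, 0])  # (route, weekday) -> [total, count]
    for route, day, price in quotes:
        entry = sums[(route, day.weekday())]
        entry[0] += price
        entry[1] += 1

    best = {}
    for (route, weekday), (total, count) in sums.items():
        avg = total / count
        if route not in best or avg < best[route][1]:
            best[route] = (WEEKDAYS[weekday], avg)
    return best


# Made-up price quotes for one hypothetical route.
quotes = [
    ("YUL-BOS", date(2011, 6, 13), 300.0),  # Monday
    ("YUL-BOS", date(2011, 6, 14), 240.0),  # Tuesday
    ("YUL-BOS", date(2011, 6, 21), 260.0),  # Tuesday
    ("YUL-BOS", date(2011, 6, 17), 340.0),  # Friday
]
cheapest = cheapest_weekday_per_route(quotes)
```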

Laziest bixi users

The Lazy bixi activity team mashed up bixi data from Montréal with the Google Elevation API to analyze where the laziest bixi users were situated in Montreal.

Spiking words

By analyzing words from the Google n-grams (Books dataset), the team was able to pick out words that were particularly popular in specific years compared to other years. For example, Viagra was one of these words in 2005.
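
One possible way to score a "spike", offered only as an assumption about how such a comparison could work: divide a word's count in one year by its average count in all other years. The counts below are invented, and a real analysis would first normalize by the total number of words published per year.

```python
def spike_score(counts_by_year, year):
    """Count in `year` relative to the mean count in the other years.

    The +1.0 in the denominator avoids division by zero for words that
    never appear in other years.
    """
    others = [count for y, count in counts_by_year.items() if y != year]
    baseline = sum(others) / len(others) if others else 0.0
    return counts_by_year.get(year, 0) / (baseline + 1.0)


def spiking_year(counts_by_year):
    """Year in which the word spikes the most under this score."""
    return max(counts_by_year, key=lambda y: spike_score(counts_by_year, y))


# Made-up yearly counts for one hypothetical word.
counts = {2003: 9, 2004: 9, 2005: 100}
score_2005 = spike_score(counts, 2005)
peak = spiking_year(counts)
```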

Twitter influence

The Twitter influence team calculated the most influential tweeters based on the number of followers followers. Results are posted here.

Technical Trading Analysis

Verifying the value of a trigger event for deciding when to buy stock.

Inverted Index

Building a Lucene-like inverted index on all of Wikipedia.
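
For readers new to the idea, an inverted index maps each term to the set of documents that contain it, so a query becomes a set intersection. The sketch below is a toy in-memory version with made-up documents; it is neither Lucene nor the team's Hadoop job.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN_RE.findall(text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Ids of documents that contain every token of the query."""
    tokens = TOKEN_RE.findall(query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up documents: id -> text.
docs = {
    1: "Hadoop runs Map/Reduce jobs",
    2: "Map and reduce are functional ideas",
    3: "Hadoop clusters need many nodes",
}
index = build_index(docs)
```

In a Map/Reduce setting the mapper would emit (token, document id) pairs and the reducer would collect the ids per token.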

PeopleRank

A PageRank implementation for ranking people, using Twitter followers as “links”.
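
A compact sketch of the idea, treating "A follows B" as a link from A to B and running a fixed number of PageRank iterations in memory. The damping factor of 0.85 is the commonly used default, the follow graph is invented, and the team's Hadoop version would have been structured differently.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Iterative PageRank over a follow graph (user -> list of followed users)."""
    users = set(follows)
    for followed in follows.values():
        users |= set(followed)
    n = len(users)

    rank = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        new_rank = {user: (1.0 - damping) / n for user in users}
        for user in users:
            # Users who follow nobody spread their rank evenly over everyone.
            targets = follows.get(user) or users
            share = damping * rank[user] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Made-up follow graph: "a" and "b" follow "c", and "c" follows "a".
follows = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(follows)
top_user = max(rank, key=rank.get)
```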

Busiest Bixi stations

Calculating the times when specific bixi spots in Montreal have the most activity.

Other sources of ideas

  • Examples from the SARA Hadoop Hackathon in Amsterdam:
    • Time series analysis measured by tension sensors in a newly built bridge
    • Automatic matching of profiles on a dating site

Datasets

We need to pre-load datasets for use at the Hackathon, otherwise we’ll waste time just loading datasets at the event. We’ll at least have the datasets that are in the confirmed list below. We’ll be updating the list as the event draws closer.
If you require other datasets than the ones listed under “Confirmed”, please send us an email (riku  -at- hopper.travel) or contact us through twitter (@hackreduce). We’ll pre-load them for you.

You can also use any of the datasets that Amazon provides access to: Public Datasets on AWS. However, you’ll have to contact us about which datasets you want to use (riku – at – hopper.travel).

You can also bring us or suggest your own datasets.

Confirmed

Twitter Data, Social Graph and Public User Information: http://an.kaist.ac.kr/traces/WWW2010.html
Google Books Data: http://ngrams.googlelabs.com/datasets
NYSE data 1970 to 2010 (daily high, low and volume): http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange
NASDAQ data 1970 to 2010 (daily high, low and volume): http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nasdaq-exchange
100GB of search data: Chango is providing 100GB of search term data with timestamps. It’s about half a day of searches.
CO2 Ocean measurements: Carbon dioxide measurements done on the ocean surface over the last 50 years or so, from http://cdiac.ornl.gov/
Million song database (we hope we’ll have it uploaded, not certain): http://labrosa.ee.columbia.edu/millionsong/
5M flight price data: Flight data of return flights with query time, itinerary and price. 500k different flights queried at different times.
Wikipedia: Complete English Wikipedia articles

Other datasets that could be provided:

wordnet, geonames
Freebase (26 GB) (http://aws.amazon.com/datasets/2320?_encoding=UTF8&jiveRedirect=1)
Human Genome data (200 GB) (http://aws.amazon.com/datasets/2315?_encoding=UTF8&jiveRedirect=1)
US Census data 2000 (200 GB) (http://aws.amazon.com/datasets/Economics/2290)
Enron e-mail dataset: (500,000 messages) http://www.cs.cmu.edu/~enron/
Whitehouse visitor records: (1,000,000 records) http://www.whitehouse.gov/briefing-room/disclosures/visitor-records
IMDB (1.7M filmographies): http://www.infochimps.com/datasets/internet-movie-database
The first billion digits of Pi: http://www.infochimps.com/datasets/the-first-billion-digits-of-pi
hbgary email archive: http://www.p2pnet.net/story/48930
dbpedia: http://wiki.dbpedia.org/Datasets

 

How to prepare

You don’t have to do any special preparations for Hack/Reduce other than thinking about what you might want to build. However, we suggest you:

  • Think about what you could build, check out the datasets we have already chosen and suggest your own if you need other datasets
  • Get familiar with Hadoop and Map/Reduce and check that you can get a development environment running on your laptop. Also check out our tutorial and examples on Github.
  • Gather some friends for your team. You can also find teams at the event.
  • Prepare to pitch your idea to the other participants. It’s your chance to get a team at the event.

In Montréal we had about 10 people stand up and give a short 30 second pitch about what they wanted to build or what dataset they were interested in. After the pitches anyone could go talk to the people that had pitched to join their team. We also had several teams that had been pre-formed. We also had some people that worked alone. Anything is possible.

You will need to bring your own laptop to the event; everything else will be provided. We have had bad experiences getting Windows environments to work with Hadoop, so we suggest you bring a Linux or OS X machine.

Preliminary Schedule Toronto

Saturday 18th of June

10-10.15 Introduction
10.15-11.30 Intro to technologies, tutorial and setup
11.30-19 Development time, support from our experts
19-20 Presentations of results
20 Gathering stuff and off to celebrate!

Food, coffee and drinks will be available throughout the day.

Venue – Hack/Reduce 2 Toronto

Hack/Reduce 2 in Toronto will be held at
Mantella Venture Partner Offices
488 Wellington Street West, Suite 300
Toronto, Ontario M5V 1E3
Canada
 


Sponsors

Hopper

Kobo

Chango

Attachments.me

Mantella Venture Partners

Startupnorth

VM Farms

Toronto

The next Hack/Reduce is in Toronto on June 18th, 2011.
You can sign up at hackreduce2.eventbrite.com.