Hack/Reduce

Stay tuned for the next edition of Hack/Reduce.

Learn – Our Github

Want to learn Hadoop and Map/Reduce?

Check out Hack/Reduce on Github

Lots of resources and example code you can run. It's a good place to start when you're preparing to come to Hack/Reduce.

We're also gathering the code from projects that have been done at past events.



Twitter/News


Shopping and Dating Hacks – Ottawa

We had an excellent Hack/Reduce event again this weekend in Ottawa. Our knowledge has been expanded again: among other things, we now know how desperate daters can get and at what time people do their online shopping.

In order to get all the tech to work and keep the clusters humming, a lot of coffee and pizza was consumed. I would be afraid to calculate how much.

Most importantly, we saw a lot of great Hack/Reductions by the participants. Actually, first and foremost it was a great weekend for the Hack/Reduce team again – the best way to spend a Saturday – for real! Thanks!

We want to thank all the participants who made it out and the sponsors of course: Hopper, Infoglutton and Shopify.

Here’s a short description of some of the presentations we saw:

Hackify

Jean-Claude Batista (@jcbatista), Taswar Bhati (@taswarbhatti), Joel Sachs, Andrew Clunis (@orospakr), (Petro Verkhogliad)

The Hackify team worked on the Shopify dataset with information about US Shopify orders. The team used Ruby and the streaming API.

First, the team analyzed where people shop the most. Naturally, the answer was California (largest state).
Second, the team wanted to find the date when people had shopped the most, which was August 19th between noon and 6pm ($185,891).

The team concluded that using Ruby with the streaming API makes it easy to do map/reduce and that Hadoop is cool!

Petro

Petro Verkhogliad (@vpetro)

Petro first worked with the Hackify team but then switched to Python, which he was more familiar with. Petro also analyzed the Shopify shopping data.

Petro found that people shop the most at 9am and 9pm. The most popular shopping day is Friday, with Saturday a close second.

Geographically, most shopping is done by Californians ordering from California.

Team-RIM

Rod Dunne, Marc Lepage, Mohamed Mansour (@mohamedmansour), Alexis Brunet

Code: https://github.com/mlepage/HackReduce

Team-RIM was voted the winner by the participants. They worked on the Google n-grams data and the Amazon review data. First, the team calculated the number of alliterative 2-grams by letter per year.

For the Amazon review data, Team-RIM calculated the average rating given on specific dates, calculated over all of the years in the dataset. You can see that nearly all ratings come out at around 4. You can also notice that the ratings drop off right after Christmas, i.e. people give worse reviews right after Christmas.

The Amazon dataset also includes data on how useful reviews are. The reviews can be voted up or down on Amazon by users. Team-RIM analyzed this data and came to the conclusion that reviews that give a higher rating to a product are considered more useful.
The team also analyzed the usefulness of reviews based on the review length. From the picture we can see that 50-60 character reviews are considered most useful. As a bonus, the team calculated that products received worse reviews as time went by.

Pascal – Analysis of the most desperate daters

Pascal from Hopper wanted to find the most desperate daters. He analyzed the number of profile views per user. According to his analysis, some users check so many profiles that, at 30 seconds per profile, it amounts to full-time work (~17k views per month, which amounts to something like 60 profile views every working hour).

Pascal also found the most visited profiles.

Lastly, Pascal analyzed how users use the Mate1 site. The results were quite surprising, as 23% of users only view one other profile. 65% only ever check 10 profiles.

JF

JF (@jeanfrancoisim) from Hopper also analyzed the dating dataset.

First, JF mapped the birth year of users to their hotness by gender. JF calculated hotness with the following formula: (msgs received+msgs sent) * ((msgs received+10)/(msgs sent+10)).
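As a quick illustration, JF's formula can be written out as a small function. This is only a sketch of the formula quoted above; the function and parameter names are ours, and the comment about the +10 terms is our reading, not something JF stated.

```python
def hotness(msgs_received: int, msgs_sent: int) -> float:
    """Hotness as quoted in the post: total activity times a received/sent ratio."""
    activity = msgs_received + msgs_sent
    # The +10 on both sides means the ratio never divides by zero.
    # (Our reading: it also keeps the ratio near 1 for low-volume users.)
    ratio = (msgs_received + 10) / (msgs_sent + 10)
    return activity * ratio


# A user who receives far more than they send scores high ...
popular = hotness(msgs_received=90, msgs_sent=10)
# ... while the same activity in the other direction scores low.
eager = hotness(msgs_received=10, msgs_sent=90)
```

With the same total of 100 messages, the first user scores 25 times higher than the second, so the formula rewards being contacted over doing the contacting.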

The red dots represent women and the blue dots men. We can clearly see that the women are, all in all, younger. I’m afraid to draw any other conclusions from the results…

Next, JF mapped hotness and height by gender. You could easily see that men are taller than women but it is unclear if height directly influences hotness.

As a last analysis, JF mapped income to hotness. Most people answered “rather not say” to this question, which is why most datapoints are in the second column. It’s also unclear if income influences hotness.

Team XKCD

Steven Noble (@snoble), Chris Saunders (@chris_saunders), David Underwood (@davefp)

The XKCD team wanted to test the internet “truism” from an xkcd comic: “Wikipedia trivia: if you take any article, click on the first link in the article text not in parentheses or italics, and then repeat, you will eventually end up at “Philosophy”.”

The team didn’t really end up with a result for this since it turned out to be difficult to only include the actual article content and not all of the other links on the page. They started their work in Clojure but later ended up using Java instead.

Team Dating 8

Martin Samson (@pgdown), Edgar Acosta, David Germain

Team 8 explored the dating dataset with the goal of measuring popularity/hotness and correlating it to other factors.
The team started by analyzing some basic measures for profiles.

They created their own popularity formula.

The team made some nice graphs mapping times listed vs. messages received, times listed vs. profile views, and popularity vs. number of times viewed.

The team used streaming with Python and fought with it for a while. Lesson: Do not use the same file name for the mapper and the reducer scripts.


Team weather and crime

Richard Desmarais, Chris Camden, Philippe Savoie, Ryan McLeod, Eric Ax, Manuel Belmadani (@pragmatwit)

The weather and crime team created a web interface that could answer queries such as “What is the maximum temperature in January 2007”. When the search query is launched, it runs through the dataset and gives you the answer. The team used Python streaming.

Team UOttawaNLP + others

Chris Fournier (@cfournie), Oana Frunza, Alistair Kennedy, Russell Luo, Dominic Plouffe (@dplouffe)

Team UOttawa analyzed the sentiment of the Amazon reviews. As expected, there were clear differences in sentiment between the good and the bad reviews.

Winner

In the end, we let the participants vote for the best Hack/Reduction. Team-RIM with their n-grams and Amazon analyses took the win… Congrats!

 

Thanks to everyone for coming, the Hack/Reduce team had a great time and it was amazing to meet you all. We hope you keep hacking and we hope we’ll see you next time!

 


The Boston Hack/Reductions

On June 25th Hack/Reduce coders took over the NERD center in Cambridge. It was the biggest Hack/Reduce yet, with 90 coders hacking on big data projects on a total of 600 nodes provided by Hopper and Cloudant.

As always, many were humbled by crashing bits of code but that was quickly forgotten while frantically reloading a white screen with some percentages representing your job being processed. An admittedly weird way to spend your Saturday, but you just can’t say no to the opportunity of having the power of hundreds of machines at your fingertips to serve your hastily crafted code.

In the end we let all the participants vote for the best Hack/Reduction. The winner was team 6, whose application visualized the words most related to a chosen word. The related words were based on an algorithm that they created to calculate related words by processing all of Wikipedia. We used Poll Everywhere from Boston for the voting and it worked great!

Results for Hack/Reduce Boston

The presentations

Team 1 – Merging Twitter and music data

Team: Pete Kruskall, Cinjon Resnick, Greg Sabo, Thierry Bertin-Mahieux, Bruce Spang, Rob Speer, Thor Kell, William Dvorak, Nadav Aharony

Team 1 worked on a couple of different projects with the Million Song Dataset. They created nice looking visualizations of clusters of musicians by merging the Million Song Dataset and the Twitter social graph dataset. They also calculated the similarity between Twitter handles of musicians based on who they followed and created nice dynamic visualizations of the relationships.

Team 2 – How book mentions correlate to stock prices

Team: Ben Popp, Adam Buggia, Andrew Rollins, Michael Axiak, Manish Maheshwari

Team 2 wanted to calculate how mentions in books correlate to stock prices. They used Google n-grams and NYSE stock data.

The results? “nobody’s gona get rich” (That’s a quote from their presentation, not our evaluation.)

Team 3 – Location popularity in history based on book mentions.

Matt Veitas, Alex Harris, Grace Woo, Pablo Azar, Justin Ryan

Team 3 wanted to see how popular locations have been in literature starting way back in 1700. They did this by calculating the number of mentions in books for every location. The jobs were still running when we had the presentations so we didn’t get to see the final results… Great effort though from this team that hadn’t worked with Hadoop before.

Team 4 – How much is Tom Cruise worth for a movie?

Tuan Phan, Ekaterina Lesnaia, Jason Nochlin

Team 4 wanted to estimate the value of Tom Cruise by analyzing the IMDB dataset and the effect an actor has on the gross sales of a movie. I think the results said that Tom was worth at least 1.6 million for a movie, which means that the studios might have to rethink the salaries they’re paying.

Team 5 – Federal election donation clusters

Eric Brown-Munoz, Jacob Elder, Ben Darfler, Jim Gammill, Van Simmons

Team 5 wanted to match and cluster campaign donors from federal election data. We feared some government officials might turn up, but it all ended OK. It turned out that merging the different variations of names used in the dataset into one name was difficult, and no politically sensitive results were achieved.

Team 6 – Winner! Clustering related words from wikipedia

Satish Gopalakrishnan, Vineet Manohar

Satish and Vineet wanted to create an application that would find a list of associated words for any chosen word. They created a distance algorithm that ranked words based on how close to the original word they were mentioned in Wikipedia articles. To get the results, they scanned through the Wikipedia dataset and looked for the associated words for “McCain”, “Erlang” and “Reebok”.

(Photo: the winning team presenting.)

Team 7 – EnglishCentral – Which are the most difficult words for English learners?

JM Van Thong, Don McAllaster, Jonathon Marston

Team 7 wanted to analyze a dataset from EnglishCentral to find the words that English learners from specific countries have the most trouble with in spoken language. The dataset had 60 million recordings of learners learning English.

The most challenging word for Japanese English learners was the word “really”.

Next the team wants to find the 100 most difficult words per country for learners from around the world.

The team also discussed that it was very useful to learn how you had to break one task into small chunks in order to be able to process it with Hadoop.

Team 8 – Mappers for the Freebase dataset

Tom Morris

Tom Morris created mappers for Freebase to make it easier to use the Freebase dataset at future Hack/Reduce events. Thanks Tom!

Team 9 – Quant Finance

Dhanvi Reddy, Alban Chevignard, Ajit Padukone, Kah Keng Tay

Team 9 wanted to try MapReduce for quant finance. They basically created different portfolio strategies based on historical performance that they could then evaluate. The process they had to go through was:

  1. Calculate monthly returns from daily prices for all stocks
  2. Create a model from the monthly returns (a forecast of returns & risk)
  3. Create and test portfolio weights based on the model created
  4. Analyze the portfolio return

The portfolio strategies tested were different versions based on historical performance.
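To give an idea of what the first step of that process could look like, here is a minimal sketch in plain Python. The team's actual definition of a monthly return wasn't presented, so we assume month-end close over the previous month-end close; the function name and the input format are ours too.

```python
def monthly_returns(daily_closes):
    """Monthly returns for one stock.

    daily_closes: iterable of (date as 'YYYY-MM-DD', closing price).
    Returns a list of (month as 'YYYY-MM', return) pairs, where the return
    compares the last close of a month with the last close of the month before.
    """
    last_close = {}
    # ISO dates sort chronologically, so later days overwrite earlier ones.
    for day, close in sorted(daily_closes):
        last_close[day[:7]] = close
    months = sorted(last_close)
    return [
        (month, last_close[month] / last_close[previous] - 1.0)
        for previous, month in zip(months, months[1:])
    ]


# Made-up prices for a hypothetical stock.
prices = [
    ("2010-01-04", 10.0),
    ("2010-01-29", 20.0),
    ("2010-02-26", 30.0),
    ("2010-03-31", 15.0),
]
returns = monthly_returns(prices)
```

In a Map/Reduce setting the stock symbol would be the natural key, so that each reducer sees all the daily prices of one stock.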

Until next time…

We want to thank all of the participants for an amazing event, see you again soon! We also want to thank the sponsors, Hopper and Cloudant, and Microsoft for offering us the space!

The next Hack/Reduce will be organized right after summer, stay tuned!

 

 

Hack/Reduce Toronto Presentations

All videos from the presentations: Hack/Reduce Vimeo

Hi everyone, this is coming a bit late now since Hack/Reduce Toronto was already over a week ago, but we were too busy setting up for Boston to write about the presentations we saw in Toronto… Anyways, we had an amazing time in Toronto. Thanks to all of the participants who were working hard the whole day! Seeing people working hard, learning and enjoying Hack/Reduce is what makes it all worthwhile for us!

The day started off with coffee, a short introduction to using the cluster and pitches by the participants. The Hopper team then gave a short tutorial on Hadoop and Map/Reduce.

People got to work really quickly and the amount of noise in the beginning when the teams were discussing their projects was mind-blowing. A lot of buzz. Soon after 1 o’clock all the teams hunkered down to start coding and there was an eerie silence again…

In the end, the teams really got a lot done. We saw some really amazing presentations. I’ll give some short descriptions of the pitches here. I’ve put up all of the presentations we had on our Vimeo channel.

There are some interesting things to learn from the videos, mostly about the technologies used and tested, so I suggest you check them out!

In the end, 10 teams out of 21 ended up presenting:

Team 1

Check out the presentation on Vimeo

Bartek Ciszkowski (@bartek), Ash Christopher (@ashchristopher)

Bartek and Ash analyzed search queries that had been made during the course of one day. They grouped search queries in four categories: travel, sex, nerd and cooking. They then analyzed how the popularity of these categories in searches varied during the day.

Here are two pictures of the results, grouped by 1 minute and 30 minutes:

1 minute:

30 minutes:

The source code is on github at https://github.com/ashchristopher/HackReduceToronto.

Team 2

Check out the presentation on Vimeo

Joel Crocker (@joelcrocker), Johan Harjono (@jharjono), Joey Robert (@joeyrobert), Ian Stevens (@istevens)

Joel, Johan, Joey and Ian used a 10,000-song subset of the Million Song Dataset. They used the Disco distributed computing framework with Python.

They analyzed:

  • The most romantic year by looking for the word love in song titles.
  • The variation of words in song titles (Only 100 words are used in song titles)
  • Average song tempo per year
  • Song lengths per year
  • Saddest tones (Turns out D is really sad)
  • Recording locations.
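The first analysis in the list is a classic map/reduce count. Here is a rough sketch in plain Python (not Disco's API, and not the team's code): the map step emits a 1 for every title containing the word "love", and the reduce step sums per year. The function names and the sample titles are made up.

```python
from collections import Counter


def map_song(year, title):
    """Map step: emit (year, 1) when the title contains the word 'love'."""
    # Simple whitespace split, so "love," with punctuation would not match.
    if "love" in title.lower().split():
        yield year, 1


def most_romantic_year(songs):
    """Reduce step: sum the emitted counts per year and return the top year."""
    counts = Counter()
    for year, title in songs:
        for key, value in map_song(year, title):
            counts[key] += value
    return counts.most_common(1)[0][0] if counts else None


# Made-up sample records: (year, song title).
songs = [
    (1967, "Love Song A"),
    (1967, "More Love Please"),
    (1984, "No Romance Here"),
    (1984, "Love Again"),
]
top_year = most_romantic_year(songs)
```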

The source code can be found on github: Github/joyerobert/hackreduce

Team 3

Check out the presentation on Vimeo

Gar Liu (@lonelydatum), Nathan Rambarran (@wibblz), Khurram Virani (@viranik)

Gar, Nathan and Khurram first wanted to figure out if oil prices affected flight prices. They took oil company stock prices and the average price from all the flights in the dataset. However, the flight dataset was limited and the results ambiguous, so the team changed direction. Next, they wanted to calculate which stocks were the most volatile in the NYSE data. They created a scoring algorithm to determine which stocks are the most volatile.

Team 3 used Mandy, a library that makes it easy to use Hadoop with Ruby. They also tried out Wukong, and don’t recommend it. Mandy worked very well though.

Team 4

Check out the presentation on Vimeo

Seak Pek Chhan, Nick Ursa (@nickursa), Athir Nvaimi, Gabe Sawhney

Pek, Nick, Athir and Gabe took a month and a half of Toronto bixi data and wanted to see if bixi data is affected by the weather. The answer is yes. Pek also took the Perl code and turned it into Python for fun.

Team 5

Check out the presentation on Vimeo

Stefan Arentz (@satefan), Olivier Yiptong (@sayhello), David Chang, Mike Pettypiece (@mtpettyp)

Stefan, Olivier, David and Mike had no prior experience with Hadoop. They used Python with mrjob. They analyzed DNS data for various things:

  • Average number of nameservers (it’s 2.25, max is 6)
  • Number of domains with a specific number of characters (11 is the most popular)
  • Domains with the largest number of existing permutations of the same name (mostly used by spammers; every permutation of Yahoo and Youtube exists, for example)

The team noted that configuring the number of mappers and reducers is very important for speeding up the jobs.
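For the curious, two of these DNS analyses can be sketched in a few lines of plain Python (this is not mrjob's API and not the team's code). We assume that the "number of characters" means the part of the domain before the first dot, and that a "permutation" means a reordering of the same letters; both are our guesses, and all names below are ours.

```python
from collections import Counter, defaultdict


def name_of(domain):
    """Assumption: only the label before the first dot counts as the name."""
    return domain.lower().split(".")[0]


def length_histogram(domains):
    """Number of domains for each name length."""
    return Counter(len(name_of(d)) for d in domains)


def permutation_groups(domains):
    """Group names that use exactly the same letters in a different order."""
    groups = defaultdict(set)
    for d in domains:
        name = name_of(d)
        # Sorting the letters gives the same key for every permutation.
        groups["".join(sorted(name))].add(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Made-up sample domains.
domains = ["yahoo.com", "yaoho.com", "hooya.net", "example.org"]
lengths = length_histogram(domains)
groups = permutation_groups(domains)
```

In a real job the sorted-letters key would be what the mapper emits, so that all permutations of a name land on the same reducer.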

Team 6

Check out the presentation on Vimeo

Jordan Christiansen (Kobo, @thebigjc)

Jordan analyzed the correlations of every single stock pair on NYSE. The data started at 0.5 GB and expanded to 250 GB when the pairs and prices had been created. A linear regression was then run on the dataset, ending up with 4M pairs. Some interesting correlations were found and Jordan ended up with a huge list of correlated stocks.
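To show why the data grows so much, here is a small sketch of the pairing step in plain Python. It is not Jordan's code (that is linked below): we use the Pearson correlation coefficient as a stand-in for the linear regression, and the function names, the 0.9 threshold and the sample prices are all our own assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long price series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(prices, threshold=0.9):
    """prices: dict of symbol -> closes aligned on the same dates.

    Returns (symbol_a, symbol_b, r) for every pair with |r| >= threshold.
    """
    result = []
    # n symbols give n * (n - 1) / 2 pairs, which is why the data explodes.
    for a, b in combinations(sorted(prices), 2):
        r = pearson(prices[a], prices[b])
        if abs(r) >= threshold:
            result.append((a, b, r))
    return result


# Made-up symbols and prices.
prices = {
    "AAA": [1, 2, 3, 4],
    "BBB": [2, 4, 6, 8],
    "CCC": [4, 3, 2, 1],
    "DDD": [1, 3, 2, 3],
}
pairs = correlated_pairs(prices)
```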

You can find the code on github: github.com/thebigjc/hackreduce

Team 7

Check out the presentation on Vimeo

Cleaver Barnes (@cleaverbarnes), Max Brodie (@maxwellbrodie), Shanly Suepaul, Matt MacLean

Cleaver, Max, Shanly and Matt ran their last job while the pitches were already under way.

They analyzed the “connectedness” of various tech communities based on the Twitter social graph. It was done by choosing a couple of influencers per community; a person was determined to be part of the community if they followed any of the influencers of that community. For example, John Resig was a community leader in jQuery. You can check out the results in the video.

Team 8

Check out the presentation on Vimeo

Yong Liang

Yong worked on finding the cheapest flight combinations. He found the cheapest chained flights from Seattle. The project was limited by the dataset, which only contained flights from Seattle.

Team 9

Check out the presentation on Vimeo

Christophe Biocca, Akash Vaswani, Jake Nielsen, Drew Gross

Team 9 ended up working on parsing Wikipedia and came to the conclusion that it’s painful.

In the end they just calculated which article has the most outbound links, but it was uncertain if it actually worked correctly. The result was some error correction page. For more details, check the video.

Team 10

Check out the presentation on Vimeo

Jamie Wong (@jlfwong), Snady Wu, Wien Leung, Maverick Lee, Christopher Wu, Christopher Cooper

Team number 10 had members that worked on a couple of different projects:

Jamie Wong analyzed what made people notable from specific years based on the year they were born and what they had become famous for.

Snady Wu and Christopher Cooper worked on inbound links to articles but were halted by the Wikipedia parsing issues.

Christopher Wu worked on figuring out how far in advance you should buy your flight in order to get the cheapest price.

 

Thanks a lot for the event everyone. We also want to thank the sponsors, Hopper, Amazon, Kobo, Mantella Venture Partners, Chango, Attachments.me and Startupnorth.

Hack/Reduce Toronto Update: 500 nodes, 80 coders

The teams are working hard at Hack/Reduce Toronto. We have 20 teams working, 80 coders on our clusters with a total of 500 nodes.

Some images of the action:

Greg giving the starting tutorial about Map/Reduce and Hadoop to everyone interested

Announcing Hack/Reduce Boston

We’re happy to announce that the Hack/Reduce hackathon is coming to Boston on the 25th of June. The event will be held at the NERD center.

It’s a great opportunity for devs, researchers and students with interest in big data to get together and hack on something awesome during one day.

  • Free access to large clusters
  • Access to datasets, you can suggest your own
  • Food
  • Friends

More information on the Boston page. Registration on eventbrite.

What it is

Hack/Reduce is a free one-day big data hackathon. The goal is to extract valuable information from large datasets and learn how to work with big data.

You can sign up for Hack/Reduce in Toronto on June 18th at hackreduce2.eventbrite.com and Hack/Reduce Boston on June 25th at hackreduce3.eventbrite.com.

The event brings together Developers, Companies, Entrepreneurs and Students interested in Big Data.

Provided:

  • Free access to large clusters that can be scaled up according to your needs.
  • Pre-loaded datasets (participants are encouraged to suggest datasets)
  • Introduction to Hadoop and Map/Reduce and the infrastructure
  • Support from Hadoop and Map/Reduce experts
  • Food and drinks

At the end of the event, participating teams and developers get to present what they have done, what they learned and what problems they faced.
It’s an opportunity to develop something great, learn Hadoop MapReduce and meet people interested in big data.

For more info, check out the event page for our last event in Montréal or discuss with us on Convore: https://convore.com/hackreduce/

Who is it for?

Developers, researchers and students in big data or interested in working with big data. The best thing is if you have something you want to get done that requires a lot of computing power. Alternatively, you can come to learn to use Hadoop. Basically Hack/Reduce is about developers, working with new people, pizza, unlimited computing power and large data sets.

Check out the photos from Hack/Reduce in Montréal:

Hack/Reduce on Picasa

Steven’s pics on Flickr

 

What will be built?

You are free to build anything you want. You can start by looking at the datasets that we have confirmed and the example datasets. You can choose from these datasets or suggest your own. You can even provide us with your own dataset and we’ll upload it for the event.
You’ll only have about 8 hours to build your application, so you should aim to build something that can be done in that timeframe. Maybe you can find some interesting information in a dataset or from the combination of two datasets.

You can check the projects that were built at the Montréal event. Summary:

Calculating the cheapest day to fly

Using our flight dataset, the team calculated which day you should choose to get the cheapest flights by comparing prices for the same routes on different weekdays.
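A comparison like this boils down to averaging prices per route and weekday. The sketch below is our own illustration in plain Python, not the team's code; we assume the weekday in question is the departure day, and the record format and names are made up.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def cheapest_weekday_by_route(flights):
    """flights: iterable of (route, departure date 'YYYY-MM-DD', price).

    Returns a dict of route -> (weekday name, average price) for the weekday
    with the lowest average price on that route.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (route, weekday) -> [sum, count]
    for route, day, price in flights:
        key = (route, date.fromisoformat(day).weekday())
        totals[key][0] += price
        totals[key][1] += 1

    best = {}
    for (route, weekday), (total, count) in totals.items():
        average = total / count
        if route not in best or average < best[route][1]:
            best[route] = (WEEKDAYS[weekday], average)
    return best


# Made-up prices for one route: two Mondays and two Tuesdays in June 2011.
flights = [
    ("YUL-YYZ", "2011-06-13", 200.0),
    ("YUL-YYZ", "2011-06-20", 220.0),
    ("YUL-YYZ", "2011-06-14", 150.0),
    ("YUL-YYZ", "2011-06-21", 170.0),
]
best = cheapest_weekday_by_route(flights)
```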

Laziest bixi users

The Lazy bixi activity team mashed up bixi data from Montréal with the Google Elevation API to analyze where the laziest bixi users were situated in Montreal.

Spiking words

By analyzing words from the Google n-grams (Books dataset) the team was able to pick out words that were particularly popular in specific years compared to other years. For example, Viagra was one of these words in 2005.
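One simple way to score such a spike, assuming you already have per-year counts for a word, is to compare the count in one year with the average of all the other years. This is only our sketch; the team's actual scoring wasn't described, and the function name and numbers are made up.

```python
def spike_score(counts_by_year, year):
    """Ratio of a word's count in `year` to its mean count in all other years."""
    others = [count for y, count in counts_by_year.items() if y != year]
    if not others:
        return 0.0  # nothing to compare against
    baseline = sum(others) / len(others)
    current = counts_by_year.get(year, 0)
    if baseline == 0:
        return float("inf") if current else 0.0
    return current / baseline


# Made-up counts for a hypothetical word.
counts = {2003: 10, 2004: 10, 2005: 100}
score_2005 = spike_score(counts, 2005)
score_2003 = spike_score(counts, 2003)
```

A real version would probably want to divide the counts by the total number of words published that year first, since more recent years simply contain more text.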

Twitter influence

The Twitter influence team calculated the most influential tweeters based on the number of followers’ followers. Results are posted here.

Technical Trading Analysis

Verifying the value of a trigger event for deciding when to buy stock.

Inverted Index

Building a Lucene-like inverted index on all of Wikipedia.
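If you haven't seen an inverted index before, it maps every term to the documents that contain it. Here is a tiny in-memory sketch in plain Python; it only shows the idea, is nowhere near Lucene, and is not the team's code. The tokenizer (lowercase runs of letters and digits) and the names are our own choices.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Map step: emit (term, doc_id) for every term in the document.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    # Reduce step: collect the doc_ids of each term into one posting list.
    return {term: sorted(ids) for term, ids in index.items()}


# Made-up documents.
docs = {
    "a": "Hadoop runs MapReduce jobs",
    "b": "MapReduce maps, then reduces",
}
index = build_inverted_index(docs)
```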

PeopleRank

A PageRank implementation that ranks people using Twitter followers as “links”.

Busiest Bixi stations

Calculating the times when specific bixi spots in Montreal have the most activity.
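The idea can be sketched with a basic iterative PageRank in plain Python, where "A follows B" is treated as a link from A to B. This is our own minimal version, not the team's implementation; the damping factor of 0.85 is the conventional default, and the fixed iteration count and the names are assumptions.

```python
def people_rank(follows, damping=0.85, iterations=50):
    """follows: dict of user -> set of users they follow.

    Returns a dict of user -> rank. Ranks sum to 1.
    Expects at least one user.
    """
    users = set(follows)
    for followed in follows.values():
        users.update(followed)
    n = len(users)
    rank = {user: 1.0 / n for user in users}

    for _ in range(iterations):
        new_rank = {user: (1.0 - damping) / n for user in users}
        for user in users:
            followed = follows.get(user, ())
            if followed:
                # A user passes their rank on to everyone they follow.
                share = damping * rank[user] / len(followed)
                for target in followed:
                    new_rank[target] += share
            else:
                # Users who follow nobody spread their rank evenly.
                share = damping * rank[user] / n
                for target in users:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up follower graph: alice and bob follow carol, carol follows alice.
follows = {"alice": {"carol"}, "bob": {"carol"}, "carol": {"alice"}}
rank = people_rank(follows)
```

Unlike a plain follower count, being followed by someone who is highly ranked counts for more: alice has only one follower, but that follower is carol.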

Other sources of ideas

  • Examples from the SARA Hadoop Hackathon in Amsterdam:
    • Time series analysis measured by tension sensors in a newly built bridge
    • Automatic matching of profiles on a dating site

Datasets

We need to pre-load datasets for use at the Hackathon, otherwise we’ll waste time just loading datasets at the event. We’ll at least have the datasets that are in the confirmed list below. We’ll be updating the list as the event draws closer.
If you require other datasets than the ones listed under “Confirmed”, please send us an email (riku  -at- hopper.travel) or contact us through twitter (@hackreduce). We’ll pre-load them for you.

You can also use any of the datasets that Amazon provides access to: Public Datasets on AWS. However, you’ll have to contact us about which datasets you want to use (riku – at – hopper.travel).

You can also bring us or suggest your own datasets.

Confirmed

  • Twitter data, social graph and public user information: http://an.kaist.ac.kr/traces/WWW2010.html
  • Google Books data: http://ngrams.googlelabs.com/datasets
  • NYSE data 1970 to 2010 (daily high, low and volume): http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange
  • NASDAQ data 1970 to 2010 (daily high, low and volume): http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nasdaq-exchange
  • 100GB of search data: Chango is providing 100GB of search term data with timestamps. It’s about half a day of searches.
  • CO2 ocean measurements: carbon dioxide measurements done on the ocean surface over the last 50 years or so, from http://cdiac.ornl.gov/
  • Million Song Dataset (we hope we’ll have it uploaded, not certain): http://labrosa.ee.columbia.edu/millionsong/
  • 5M flight price data: flight data of return flights with query time, itinerary and price. 500k different flights queried at different times.
  • Wikipedia: complete English Wikipedia articles

Other datasets that could be provided:

  • wordnet, geonames
  • Freebase (26 GB): http://aws.amazon.com/datasets/2320?_encoding=UTF8&jiveRedirect=1
  • Human Genome data (200 GB): http://aws.amazon.com/datasets/2315?_encoding=UTF8&jiveRedirect=1
  • US Census data 2000 (200 GB): http://aws.amazon.com/datasets/Economics/2290
  • Enron e-mail dataset (500,000 messages): http://www.cs.cmu.edu/~enron/
  • Whitehouse visitor records (1,000,000 records): http://www.whitehouse.gov/briefing-room/disclosures/visitor-records
  • IMDB (1.7M filmographies): http://www.infochimps.com/datasets/internet-movie-database
  • The first billion digits of Pi: http://www.infochimps.com/datasets/the-first-billion-digits-of-pi
  • hbgary email archive: http://www.p2pnet.net/story/48930
  • dbpedia: http://wiki.dbpedia.org/Datasets

 

How to prepare

You don’t have to do any special preparations for Hack/Reduce other than thinking about what you might want to build. However, we suggest you:

  • Think about what you could build, check out the datasets we have already chosen and suggest your own if you need other datasets
  • Get familiar with Hadoop and Map/Reduce and check that you can get a development environment running on your laptop. Also check out our tutorial and examples on Github.
  • Gather some friends for your team. You can also find teams at the event.
  • Prepare to pitch your idea to the other participants, it’s your chance to get a team at the event.

In Montréal we had about 10 people stand up and give a short 30 second pitch about what they wanted to build or what dataset they were interested in. After the pitches anyone could go talk to the people that had pitched to join their team. We also had several teams that had been pre-formed. We also had some people that worked alone. Anything is possible.

You will need to bring your own laptop to the event; everything else will be provided. We have had bad experiences getting Windows environments to work with Hadoop, so we suggest you bring a Linux or OS X machine.