On June 25th, Hack/Reduce coders took over the NERD center in Cambridge. It was the biggest Hack/Reduce yet, with 90 coders hacking on big data projects on a total of 600 nodes provided by Hopper and Cloudant.
As always, many were humbled by crashing bits of code, but that was quickly forgotten while frantically reloading a white screen showing the percentage of your job that had been processed. An admittedly weird way to spend your Saturday, but you just can’t say no to the opportunity of having the power of hundreds of machines at your fingertips to serve your hastily crafted code.
In the end we let all the participants vote for the best Hack/Reduction. The winner was team 6, whose application visualized the words most related to a chosen word, ranked by an algorithm they created that processes all of Wikipedia. We used Poll Everywhere from Boston for the voting, and it worked great!
Team 1 – Merging Twitter and music data
Team: Pete Kruskall, Cinjon Resnick, Greg Sabo, Thierry Bertin-Mahieux, Bruce Spang, Rob Speer, Thor Kell, William Dvorak, Nadav Aharony
Team 1 worked on a couple of different projects with the Million Song Dataset. They created nice-looking visualizations of clusters of musicians by merging the Million Song Dataset and the Twitter social graph dataset. They also calculated the similarity between musicians’ Twitter handles based on who they followed and created nice dynamic visualizations of the relationships.
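The post doesn’t say which similarity measure they used; a common choice for “similarity based on who they follow” is Jaccard similarity over the two follow sets. A minimal sketch, with made-up handles:

```python
def jaccard_similarity(follows_a, follows_b):
    """Overlap of two follow sets: |A ∩ B| / |A ∪ B|."""
    if not follows_a and not follows_b:
        return 0.0
    return len(follows_a & follows_b) / len(follows_a | follows_b)

# Hypothetical follow sets for two artist accounts
a = {"bbc6music", "pitchfork", "radiohead"}
b = {"pitchfork", "radiohead", "nprmusic"}
print(jaccard_similarity(a, b))  # 2 shared / 4 total = 0.5
```

Pairwise scores like this can then feed directly into a clustering or graph-layout step for the visualizations.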
Team 2 – How book mentions correlate to stock prices
Team: Ben Popp, Adam Buggia, Andrew Rollins, Michael Axiak, Manish Maheshwari
Team 2 wanted to calculate how mentions in books correlate with stock prices. They used Google n-grams and NYSE stock data.
The results? “nobody’s gona get rich” (That’s a quote from their presentation, not our evaluation.)
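The post doesn’t detail their method; the standard tool for this kind of question is the Pearson correlation between the yearly mention counts and a price series. A sketch with invented numbers (not their data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical yearly n-gram counts for a company name vs. average stock price
mentions = [120, 150, 170, 160, 200]
prices = [10.0, 12.5, 13.0, 12.0, 15.5]
print(round(pearson(mentions, prices), 3))
```

Even a strong correlation here wouldn’t imply a tradable signal, which fits the team’s own conclusion.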
Team 3 – Location popularity in history based on book mentions.
Team: Matt Veitas, Alex Harris, Grace Woo, Pablo Azar, Justin Ryan
Team 3 wanted to see how popular locations have been in literature, going all the way back to 1700. They did this by counting the number of mentions in books for every location. The jobs were still running when we had the presentations, so we didn’t get to see the final results… Great effort, though, from a team that hadn’t worked with Hadoop before.
Team 4 – How much is Tom Cruise worth for a movie?
Team: Tuan Phan, Ekaterina Lesnaia, Jason Nochlin
Team 4 wanted to estimate the value of Tom Cruise by analyzing the IMDB dataset and the effect an actor has on a movie’s gross sales. I think the results said that Tom was worth at least $1.6 million per movie, which means the studios might have to rethink the salaries they’re paying.
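The team’s exact model isn’t described. One naive way to put a number on an actor’s value is the difference in mean gross between films with and without that actor (it ignores confounders like budget and genre). A sketch with illustrative records, not real IMDB data:

```python
def actor_premium(movies, actor):
    """Naive estimate: mean gross of movies featuring `actor`
    minus mean gross of movies without (ignores confounders)."""
    with_a = [m["gross"] for m in movies if actor in m["cast"]]
    without = [m["gross"] for m in movies if actor not in m["cast"]]
    return sum(with_a) / len(with_a) - sum(without) / len(without)

# Hypothetical records (gross in millions)
movies = [
    {"cast": {"Tom Cruise", "Ken Watanabe"}, "gross": 111.0},
    {"cast": {"Tom Cruise"}, "gross": 134.0},
    {"cast": {"Some Actor"}, "gross": 75.0},
    {"cast": {"Another Actor"}, "gross": 95.0},
]
print(actor_premium(movies, "Tom Cruise"))  # 122.5 - 85.0 = 37.5
```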
Team 5 – Federal election donation clusters
Team: Eric Brown-Munoz, Jacob Elder, Ben Darfler, Jim Gammill, Van Simmons
Team 5 wanted to match and cluster campaign donors from federal election data. We feared some government officials might turn up, but it all ended OK. It turned out that reconciling the different name formats in the dataset into a single canonical name was difficult, so no politically sensitive results were achieved.
Team 6 – Winner! Clustering related words from Wikipedia
Team: Satish Gopalakrishnan, Vineet Manohar
Satish and Vineet wanted to create an application that would find a list of associated words for any chosen word. They created a distance algorithm that ranked words based on how close to the chosen word they appeared in Wikipedia articles. To get the results, they scanned through the Wikipedia dataset and looked for the words associated with “McCain”, “Erlang” and “Reebok”.
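Their exact scoring isn’t published; a simple proximity-based scheme in the same spirit gives each word a score of 1/distance summed over every occurrence near the target word. A toy, single-document sketch (the real job would run this as a mapper over all article text and sum scores in a reducer):

```python
from collections import defaultdict

def related_word_scores(text, target, window=5):
    """Score each word by summed 1/distance to occurrences of `target`
    within `window` tokens. A toy proximity-based relatedness score."""
    tokens = text.lower().split()
    scores = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != target:
                scores[tokens[j]] += 1.0 / abs(j - i)
    return sorted(scores.items(), key=lambda kv: -kv[1])

doc = "erlang is a concurrent language and erlang runs on the beam vm"
print(related_word_scores(doc, "erlang", window=3))
```

A production version would also drop stop words (“is”, “and”, …), which otherwise dominate any proximity score.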
Team 7 – EnglishCentral – Which are the most difficult words for English learners?
Team: JM Van Thong, Don McAllaster, Jonathon Marston
Team 7 wanted to analyze a dataset from EnglishCentral to find the words that English learners from specific countries have the most trouble with in spoken language. The dataset had 60 million recordings of learners learning English.
The most challenging word for Japanese English learners was the word “really”.
Next, the team wants to find the 100 most difficult words per country for learners around the world.
The team also noted how useful it was to learn to break one task into small chunks in order to process it with Hadoop.
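That chunking is the heart of the MapReduce model: a mapper turns each record into key/value pairs, the framework shuffles and sorts them by key, and a reducer combines each key’s values. A minimal single-machine sketch of the flow, using a word-count-style job (all names here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Emit (word, 1) for each word in one input record."""
    for word in record.lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum the per-record counts for one word."""
    return word, sum(counts)

def run_job(records):
    """Simulate the map -> shuffle/sort -> reduce flow on one machine."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))                      # shuffle & sort
    return dict(reducer(k, (v for _, v in g))
                for k, g in groupby(mapped, key=itemgetter(0)))

print(run_job(["really really hard", "really hard words"]))
```

On a real cluster, Hadoop runs many mapper and reducer instances in parallel; the program itself only ever sees one small chunk at a time.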
Team 8 – Mappers for the Freebase dataset
Tom Morris created mappers for Freebase to make it easier to use the Freebase dataset at future Hack/Reduce events. Thanks Tom!
Team 9 – Quant Finance
Team: Dhanvi Reddy, Alban Chevignard, Ajit Padukone, Kah Keng Tay
Team 9 wanted to try MapReduce for quant finance. They created different portfolio strategies based on historical performance that they could then evaluate. Their pipeline was:
- Calculate monthly returns from daily prices for all stocks
- Create a model (a forecast of returns and risk) from the monthly returns
- Create and test portfolio weights based on that model
- Analyze the portfolio return
Each strategy they tested was a different variant of this pipeline.
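The first step, monthly returns from daily prices, parallelizes naturally per stock. A sketch of that step alone, assuming daily closes keyed by "YYYY-MM-DD" date strings (illustrative data, not theirs):

```python
def monthly_returns(daily_prices):
    """Compute month-over-month returns from daily closing prices.

    `daily_prices` maps "YYYY-MM-DD" date strings to closes; we take the
    last close of each month and return {month: (p_t / p_prev) - 1}.
    """
    month_close = {}
    for date in sorted(daily_prices):
        month_close[date[:7]] = daily_prices[date]   # last close wins
    months = sorted(month_close)
    return {m2: month_close[m2] / month_close[m1] - 1
            for m1, m2 in zip(months, months[1:])}

prices = {"2011-01-03": 100.0, "2011-01-31": 110.0,
          "2011-02-01": 108.0, "2011-02-28": 121.0}
print(monthly_returns(prices))  # February return ≈ +10%
```

In a MapReduce job, each mapper would handle one stock’s price series and emit its monthly return series for the modeling step downstream.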
Until next time…
The next Hack/Reduce will be organized right after the summer, so stay tuned!