Reveal – How much does the world know?

Have you ever had a shit-tonne of documents dumped into your inbox, with an impossible deadline for sucking out the hidden juicy bits? Or maybe it has been the joyful experience of discovering a dump of an MILF’s emails, diplomatic cables, or code from an evil corporation’s website? At moments like those, you might have uttered, “fcuk! … Omne Ignotum Pro Magnifico!”. Wouldn’t it be nice if the needles just magically popped out of the haystack? Meet Reveal (clickable prototype) – a software framework that aspires to achieve that, and maybe a bit more.

Background:
While sed/awk/grep-ing the Cablegate files, I stumbled upon a cable that mentioned Kofi Annan asking Robert Mugabe to step down in exchange for a handsome retirement plan during the Millennium Summit. Being an ignorant bloke, I could hardly recall what the Millennium Summit was about, and had no clue whether Mugabe was still in office or whether Kofi Annan had ever commented on this! Without the right background and context, I could not appreciate the data to its full extent. Below is my #MozNewsLab final project idea pitch, in light of this week’s three speakers: Chris Heilmann, John Resig and Jesse James Garrett.

“What is this thing for? What does it do? How is it supposed to fit into people’s lives?”, @Jesse James Garrett:

Journalists get an amazing amount of digital data every day in the form of numbers in tables. With some spreadsheet skills or help from newsroom programmers, they produce incredible revelations of the reality that hides behind those numbers. However, when the data comes as unstructured text files written in natural language, there isn’t much algorithmic help available other than full-text searches with a list of guess words. Using cutting-edge information retrieval techniques, Reveal aims to build a framework that automatically annotates names, locations, dates etc. in the unstructured text files.
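To give a feel for what “annotating” means here, below is a minimal Python sketch. It is not Reveal’s extraction code: the two regular expressions, the `PROPER_NOUN` label and the sample sentence are all made up for illustration, and a real pipeline would lean on a trained named-entity recognizer rather than patterns this naive.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Naive illustrative patterns (assumptions, not Reveal's actual rules).
PATTERNS = {
    # e.g. "12 March 2004"
    "DATE": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
    # Two or more capitalized words in a row, e.g. a person's full name.
    "PROPER_NOUN": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),
}


def annotate(text):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


# Made-up sample sentence, only to exercise the function.
sample = "A cable dated 12 March 2004 mentions Kofi Annan and Robert Mugabe."
for start, end, label, found in annotate(sample):
    print(label, found)
```

The character offsets are kept so that a UI could later highlight each span in place and hang context off it.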

“Adopting Open Source, Open standards“, @Chris Heilmann:
Having been baptized by St. IGNUcius, I let the idea of Free as in Freedom run through the core of Reveal’s technology stack: a standard LAMP stack on the server side, a UI powered by HTML5, CSS3 and jQuery plugins, and a number of open source libraries for the information extraction. A long post describing the information retrieval technology is coming soon. (Mind map above).

Using the detected names, locations, dates etc., Reveal will try to aggregate additional information in the form of images, maps, news articles, videos, Wikipedia pages, visualizations etc. via open APIs and use them as navigational elements to browse the data. Juxtaposed with the document under scrutiny, these will provide the right context to gauge the sensitivity of the information.

“User to Contributor”, @John Resig:
Additionally, by showing a relative score of “How much does the world know?”, calculated from the aggregated information published before the documents surfaced, we can nudge news readers to share the information across their own social networks. Add some game mechanics by quantifying that “sharing”, and we bust the filter bubble of ignorant blokes and turn them into responsible citizens who’ll raise their voices against the wrongdoings of totalitarian regimes, evil corporations or other bad asses. This will lead to the creation of more content, which feeds back into the background and context aggregation step described earlier.
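The exact formula for the score is not settled yet. One possible reading, sketched below in Python purely as an assumption, is the fraction of entities in a document that already had at least one aggregated item published before the document surfaced. The function name `world_knowledge_score` and the shape of its input are hypothetical.

```python
from datetime import date


def world_knowledge_score(entity_coverage, surfaced_on):
    """Return a score between 0.0 and 1.0.

    entity_coverage maps an entity name to the publication dates of the
    aggregated items (articles, videos, ...) that mention it.
    surfaced_on is the date the document became public.
    The score is the share of entities with at least one earlier item.
    """
    if not entity_coverage:
        return 0.0
    known = sum(
        1
        for dates in entity_coverage.values()
        if any(published < surfaced_on for published in dates)
    )
    return known / len(entity_coverage)


# Made-up coverage data: only "entity_a" was written about beforehand.
coverage = {
    "entity_a": [date(2000, 9, 10)],
    "entity_b": [date(2011, 1, 5)],
    "entity_c": [],
}
print(world_knowledge_score(coverage, surfaced_on=date(2010, 11, 28)))
```

A low score would flag a document whose contents the world has barely heard about, which is exactly the kind of thing worth sharing.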

Now, a similar project by the uber journalist-programmer Jonathan Stray of the AP has won this year’s Knight Mozilla news challenge. His approach, Overview, focuses solely on clustering documents based on the cosine similarity of their tf-idf scores. Using sexy visualizations, it pulls out key terms specific to the corpus under study. The night the results of the Knight Mozilla challenge were announced, in a euphoric outburst, I sent him an embarrassingly long late-night email ranting about the above. Obviously, I never heard back, but he will be releasing his code soon and I am super excited to fork it for visualizations in Reveal.
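For the curious, here is what “cosine similarity of tf-idf scores” boils down to, as a small self-contained Python sketch. This is the textbook technique, not Overview’s code: whitespace tokenization, the plain `log(N / df)` weighting and the toy documents are my own simplifying assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: tf-idf weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents, invented for the example.
docs = [
    "mugabe annan summit",
    "mugabe annan retirement",
    "globe twitter visualization",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # share two terms, so above zero
print(cosine(vectors[0], vectors[2]))  # share no terms
```

Documents whose pairwise similarity is high end up in the same cluster, and the terms with the highest tf-idf weight inside a cluster are natural candidates for its label.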

That is my final software idea pitch inspired by Chris Heilmann, John Resig and Jesse James Garrett #MozNewsLab Week 2:

Tweetsabers of News Revolution across the globe #MozNewsLab

After Amanda Cox’s lecture I was all pepped up to do some quick and sexy data visualization. In my daily life, I rely on R or gnuplot for all my plots, simply because of their scripting interfaces. I have played a bit with the Google Chart and Visualization APIs and they are absolutely brilliant. I’m planning to get my hands dirty with matplotlib (yep, yep … Python!).

So midway through re-listening to the lecture, I was overpowered by the urge to do some global data visualization. One of the best global data visualization tools, and one that has left an impression on my mind, is the WebGL Globe – an open platform from the Google Data Arts Team. I grabbed the example code from here and, with a simple Python script, collected Twitter activity during the first week of #MozNewsLab into a .json file. Some simple changes to the sample JavaScript code, and there you have colored light sabers shooting out from the globe.
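The conversion step looks roughly like the Python sketch below. It assumes the layout used by the globe’s sample data as I understand it: a list of `[series_name, [lat, lng, magnitude, lat, lng, magnitude, ...]]` entries. The function name, the series name and the coordinates are made up, and the Twitter fetching itself is left out.

```python
import json


def to_globe_series(name, tweet_counts):
    """Flatten {(lat, lng): count} into [name, [lat, lng, magnitude, ...]].

    Magnitudes are scaled so that the busiest location gets 1.0.
    """
    peak = max(tweet_counts.values(), default=0) or 1  # avoid dividing by zero
    flat = []
    for (lat, lng), count in sorted(tweet_counts.items()):
        flat.extend([lat, lng, round(count / peak, 3)])
    return [name, flat]


# Made-up tweet counts keyed by (latitude, longitude).
counts = {
    (49.28, -123.12): 40,
    (51.5, -0.13): 10,
}
payload = json.dumps([to_globe_series("week1", counts)])
print(payload)
```

Each triple becomes one saber: the position comes from the coordinates and the length from the magnitude.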

The main problem was the Twitter API limit – a meager 125 queries per hour. So I had to rely on the geo-location data that I had scraped earlier for mapping #MozNewsLab participants on the world map. That allowed me to limit the geo-location queries to those who were not participants of #MozNewsLab. Where geo-location info was not available on a Twitter profile, those homeless tweet counts were assigned to our dear Lab co-lead Phillip Smith.
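In Python, the bookkeeping amounts to something like the sketch below. It is a reconstruction of the idea rather than the original script: `lookup` stands in for whatever rate-limited profile query is used (a hypothetical callable that returns coordinates or `None`), and `fallback` is the location that collects the homeless tweets.

```python
from collections import Counter


def tally_by_location(tweet_authors, known_locations, lookup, fallback):
    """Count tweets per location while spending as few API queries as possible.

    tweet_authors   -- one user name per tweet
    known_locations -- {user: (lat, lng)} scraped earlier, costs no queries
    lookup          -- hypothetical callable: user -> (lat, lng) or None
    fallback        -- (lat, lng) used when no location can be found
    """
    cache = dict(known_locations)
    counts = Counter()
    for author in tweet_authors:
        if author not in cache:
            # Only unknown users cost a query, and each costs exactly one.
            coords = lookup(author)
            cache[author] = coords if coords is not None else fallback
        counts[cache[author]] += 1
    return counts
```

Caching the answer, including the fallback, means a chatty user without a location burns one query instead of one per tweet.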

Mozilla News Lab Schedule – Google Calendar

The Knight Foundation and Mozilla have joined forces to help the media adapt to the evolving technology landscape. After an open idea challenge, 60 hackers and journalists were selected for a month-long Learning Lab. I somehow managed to sneak into this elite club with my two cents here and here.

The Learning Lab is going to be a series of webinars by some of the most respected names in technology and journalism. Here is a list of Twitter handles of the organizers and speakers. Below is a Google Calendar showing the webinar timings (PST).

Wanna see how wacky it will get? Check out the video from the most colorful moderator ever – Jacob Caggiano.

Can’t wait till Monday …