T blogs

Data pipeline for vacation photos

I take pictures when I am on vacation. Then I throw away 90% of them, and a few make the cut to end up on Instagram. Instagram is a great platform, but without an official API to upload images, it is tough for lazy amateurs to publish their content. There is a popular unofficial API, which I intend to give a try.

But even before I get there, I need to get the pipeline ready for getting the data off the memory card of my DSLR and finding the pictures that are worthy of posting. I don’t do any editing whatsoever, and proudly tag everything with #nofilter. The real reason is that image editing is tough and time-consuming. I doubt anyone else has such a workflow, but I find the existing tooling frustrating: it makes me do boring manual jobs, and on vacation at that.

The workflow

Typically when I am on vacation, I take pictures all day, and as soon as I reach the hotel I want to get the pictures off my camera, group them into keep, discard and maybe buckets, upload them to the cloud, post them to Instagram and finally have the memory card cleaned up for the next day. If I get about an hour to do all this, I’m lucky. The most time-consuming part of the workflow is looking at the images and deciding which bucket each one belongs in. The rest of the parts are easy to automate. So I wanted to take up that part of the workflow first.

This stage of bucketing a few hundred images into keep, maybe and discard buckets needed a tool that is more flexible than Photos on the Mac. Sometimes there are multiple shots of the same subject which need to be compared next to each other.

After some digging, I found feh. It is a lightweight image browser and offers productivity and simplicity. Installing feh was fairly simple: just install it using Homebrew if you are on a Mac.

brew install feh

Feh acts like any other simple fullscreen image browser:

feh -t # thumbnails 
feh -m # montage

Other useful keyboard shortcuts

/       # auto-zoom
Shift+> # rotate clockwise
Shift+< # rotate anti-clockwise

There are tonnes of other options and good enough mouse support as well.

Extending with custom scripts

However the real power is unleashed when you bind any arbitrary unix commands to the number keys. For example:

mkdir keep maybe discard
feh --scale-down --auto-zoom --recursive --action "mv '%f' discard" --action1 "mv '%f' keep" --action2 "mv '%f' maybe" . &

Here is what is going on in the two commands above. First we create three directories. Next we bring up feh in the directory where we have copied the images from the memory card (., the current directory in this case) and use the left and right arrow keys to cycle through the images.

The recursive flag takes care of going through any subdirectories. The scale-down and auto-zoom flags handle sizing the images properly. The action flags allow you to associate arbitrary unix commands with keys 0-9. And that is incredible!

In the example above, hitting the 0 key moves the current image to the discard directory. This is for two reasons: I am right-handed, and my workflow is to aggressively discard rather than keep. Keeps are fewer in number and easy to decide, so they are bound to 1. Maybes are time sinks, so I bind them to 2. I might do a few more passes on each folder before the keep bucket is finalized.
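One thing to watch with --recursive: two subdirectories on the card can hold files with the same name, and a plain mv into keep would silently overwrite the first with the second. Below is a minimal sketch of a helper that avoids this by prefixing a counter when the name is taken. The function name sort_image and the counter-prefix naming scheme are my own assumptions for illustration, not something feh provides.

```shell
# Hypothetical helper: move one image into a bucket directory
# (keep, maybe, discard) without overwriting an existing file.
sort_image() {
    bucket_dir=$1   # destination directory, e.g. keep
    src=$2          # image path, as handed over by feh
    mkdir -p "$bucket_dir" || return 1
    base=$(basename "$src")
    dest="$bucket_dir/$base"
    n=1
    # If the name is already taken, try 1-name, 2-name, ... until one is free.
    while [ -e "$dest" ]; do
        dest="$bucket_dir/$n-$base"
        n=$((n + 1))
    done
    mv "$src" "$dest"
}
```

Saved as an executable script on the PATH that ends with a call to sort_image "$@", it could be bound with something like --action1 "sort_image.sh keep %F", where %F is feh's shell-escaped image path.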

Taking it to the next level

But to take it to the next level, let’s bind our 1 (keep) to the aws s3 mv command, which uploads the file and then removes the local copy. That way we can instantly start uploading keepers to S3 with one keystroke. Here’s the basic idea:

bucket=`date "+%Y-%m-%d-%s"`
aws s3 mb s3://${bucket}/keep --region us-west-1
aws s3 mv '%f' s3://${bucket}/'%f' &

Note the ampersand at the end of the command – this helps in putting the upload command in the background. That way the upload is not blocking and you can keep going through the images.

This is reasonably stable – even if feh crashes in the middle of your workflow, the upload commands are queued up and continue in the background.

Here is what the final command looks like. You can put this in a script and add it to your path for quick access.

feh --scale-down --auto-zoom --recursive --action "mv '%f' discard" --action1 "aws s3 mv '%f' s3://${bucket}/'%f' &" --action2 "mv '%f' maybe" . &
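For the script version, it may help to build the bucket name and the upload action in small functions, so they can be inspected before any real upload happens. This is only a sketch: the function names, the vacation prefix and the keep/ key prefix are assumptions of mine. The prefix is there because S3 bucket names are global, so a bare date has a fair chance of being taken already. %F and %N are feh's shell-escaped image path and image name.

```shell
# Build a date-stamped bucket name. The default prefix "vacation" is a
# made-up example; S3 bucket names must be lowercase and globally unique.
make_bucket_name() {
    prefix=${1:-vacation}
    printf '%s-%s\n' "$prefix" "$(date '+%Y-%m-%d-%s')"
}

# Print (do not run) the feh action string that uploads the current image
# to the given bucket under a keep/ prefix, in the background.
upload_action() {
    printf 'aws s3 mv %%F s3://%s/keep/%%N &\n' "$1"
}
```

A script could then call bucket=$(make_bucket_name) once, create the bucket, and pass --action1 "$(upload_action "$bucket")" to feh.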

This workflow is not for those who do a lot of editing with their pictures. Overall, feh is fast to load images and provides a lot of extensibility.

Next Steps

The next step would be to hook a Lambda function up to the S3 upload event and have the unofficial Instagram API post the image to Instagram. One remaining step would be adding the individual hashtags before the S3 upload. That way, getting from memory card to Instagram can be reduced to just a few keystrokes.

Beyond that, I intend to move the feh part of the pipeline to a Raspberry Pi. I can plug the Raspberry Pi into the TV of the hotel I am staying at and cut my computer out of the loop. Here’s a short dev.to post I wrote up for setting up my Raspberry Pi with a TV. It will probably be a few weeks before I get everything together. Till then, enjoy a very reticent feed from my Instagram.


Italy

A few pictures from my first Italy trip. It most certainly will not be the last. The mesmerizing beauty of the land, the layers of history at every corner, the warmth of the people and the sumptuous delicacies make you fall in love with the country the very moment you set foot in it. Hope to be back soon!

ChiPy Python Mentorship Dinner March 2015

The Chicago Python Users Group mentorship program for 2015 is officially live! It is a three-month-long program, where we pair up a new Pythonista with an experienced one to help them improve as developers. Encouraged by the success of last year, we decided to do it on a grander scale this time. Last night ChiPy and Computer Futures hosted a dinner for the mentors at Giordano’s Pizzeria to celebrate the kick-off, deep dish, Chicago style!

The Match Making:

Thanks to the brilliant work by the mentors and mentees from 2014, we got a massive response as soon as we opened the registration process this year. While the number of mentee applications grew rapidly, we were unable to get enough mentors and had to limit the mentee applications to 30. Of them, 8 were Python beginners, 5 were interested in web development, 13 in Data Science, and the rest in Advanced Python. After some interwebs lobbying and some arm-twisting mafia tactics, we finally managed to get 19 mentees hooked up with their mentors.

Based on my previous experience at pairing mentors and mentees, the relationship works out only if there is a common theme of interest between the two. To make the matching process easier, I focused on getting a full-text description of their background and end goals as well as their LinkedIn data. From what I heard last night from the mentors, the matches have clicked!

The Mentors’ Dinner:
As ChiPy organizers, we are incredibly grateful to these 19 mentors, who are devoting their time to help the Python community in Chicago. Last night’s dinner was a humble note of thanks to them. Set in the relaxed atmosphere of the pizzeria, with everyone stuffed with pizza and beer, it gave us an opportunity to discuss how we can make the process more effective for both the mentors and mentees.

Trading of ideas and skills:
The one-to-one relationship of the mentor and mentee gives the mentee enough comfort to say “I don’t get it, please help!”. It takes away the fear of being judged, which is a problem in traditional classroom-style learning. But to be fair to the mentor, it is impossible for him/her to be a master of everything Python and beyond. That is why we need to trade ideas and skills. Last time, when one of the mentor/mentee pairs needed some help designing an RDBMS schema, one of the other mentors stepped in and helped them complete it much faster. Facilitating such collaboration brings out the best resources in the community. Keeping this in mind, we have decided to use ChiPy’s meetup.com discussion threads to keep track of the progress of our mentor and mentee pairs. Here is the first thread introducing what the mentors and mentees are working on.

Some other points that came out of last night’s discussion:

We were not able to find mentors for our Advanced Python track. Based on the feedback we decided to rebrand it to Python Performance Optimization for next time.
Each mentor/mentee pair will be creating their own curriculum. Having a centralized repository of those will make them reusable.
Reaching out to Python shops in Chicago for mentors. The benefit of this is far reaching. If a company volunteers their experienced developers as mentors, it could serve like a free apprenticeship program and pave the way in recruiting interns, contractors and full time hires. Hat-tip to Catherine for this idea.

Lastly, I want to thank our sponsor, Computer Futures, for being such gracious hosts. They are focused on helping Pythonistas find the best Python jobs that are out there. Thanks for seeing the value in what we are doing, and I hope we can continue to work together to help the Python community in Chicago.

If you are interested in learning more about being a mentor or a mentee, feel free to reach out to me. Join ChiPy’s meetup.com community to learn more about what’s next for the mentor and mentees.

Chicago Python User Group Mentorship Program

If you live in Chicago and have some interest in programming, you must have heard about the Chicago Python Users Group, or ChiPy. Founded by Brian Ray, it is one of the oldest tech groups in the city and is a vibrant community that welcomes programmers of all skill levels. We meet on the second Thursday of every month at a new venue with some awesome talks, great food and a lot of enthusiasm about our favorite programming language. Other than talks on various language features and libraries, we have had language shootouts (putting Python on the line with other languages), programming puzzle nights, etc.

ChiPy meetups are great for learning about new things and meeting a lot of very smart people. Beginning this October, we are doing a one-on-one, three-month mentorship program. It’s completely free and totally driven by the community. By building these one-to-one relationships through the mentorship program, we are trying to build a stronger community of Pythonistas in Chicago.

We have kept it open on how the M&M pairs want to interact, but as an overall goal we wanted the mentors to help the mentees with the following:

1. Selection of a list of topics that is doable in this time frame (October 2014 – January 2015)
2. Help the mentee with resources (pair programming, tools, articles, books etc) when they are stuck
3. Encourage the mentee to do more hands on coding and share their work publicly

It has been really amazing to see the level of enthusiasm among the M&M-s. I have been fortunate to play the role of a matchmaker, where I look into the background, level of expertise, topics of interest and availability of all M&M-s and try to find an ideal pair. I’ve been collecting data at every juncture so that we can improve the program in later iterations.

Here are some aggregated data points till now:

Signups:
# of mentors signed up: 15
# of mentees new to programming: 2
# of mentees new to Python: 16
# of mentees with Advanced Python: 5
Total: 37

Assignment:
# of mentors with a mentee: 13
# of mentees new to programming with an assigned mentor: 1
# of mentees new to Python with an assigned mentor: 11
# of mentees with Advanced Python with an assigned mentor: 1

Outstanding:
# of mentors for newbie mentees without an assignment: 2
# of mentees unreachable: 4
# of mentees new to programming without an assigned mentor: 1 (unreachable)
# of mentees new to Python without an assigned mentor: 2 (unreachable)
# of mentees with Advanced Python without an assigned mentor: 4 (1 unreachable, 3 no advanced mentors)

Other points:
– Data analysis is the most common area of interest.
– # of female developers: 6
– # of students: 2 (1 high-school, 1 grad student)

All M&M pairs are currently busy figuring out what they want to achieve in the next three months and preparing a schedule. Advanced mentees are forming a focused hack group to peer coach on advanced topics.

We are incredibly grateful to the mentors for their time and to the mentees for the enthusiasm they have shown for the program. While this year’s mentoring program is completely full, if you are interested in getting mentored in Python, check back in December. Similarly, if you want to mentor someone with your Python knowledge, please let me know. If you have any tips you would want to share on mentoring, or on being a smart mentee, please leave them in the comments and I’ll share them with the mentors and mentees. And lastly, please feel free to leave any suggestions on what I can do to make the program beneficial for everyone.

The Third Meetup

Last Tuesday was our third meetup for Chuck Eesley’s venture-lab.org. Instead of the Michigan Street Starbucks opposite The Chicago Tribune, we pivoted to the Wormhole for this one. Any geek who has been to this place knows what a riot it is. From the “Back to the Future” time machine retrofitted on the ceiling, old Atari cartridges as showpieces on the coffee table, a super typo-friendly wifi password, stopwatch-controlled brewing and Star Wars puppets to the shopkeep.com app on an iPad instead of a cashbox, the bearded coffee masters had it all. Everything except a place to accommodate the thunderous 8 of Lake Effect Ventures.

Two hours of caffeine-drenched brainstorming spat out the following:

I sketched out how the process might flow in two steps. We are down to a pretty bare minimum concept build which is ideal both for this class and for getting something up quickly so that we can test it.
I set up a Twitter account for Lake Effect Ventures so that we can tweet about progress we are making.
Andy is going to jot up a positioning statement and beef up the business model canvas for the concept
Leandre will use these to complete our 2-slide initial submission for our deliverable for the next deadline
Leandre will also use this to start to craft a presentation deck
Benn will be working on the copy for the landing page that I started.
Benn will also be crafting a logo in Photoshop (Alex, Zak, Sidi if any one of you is good with design Benn would appreciate the assistance there)
We need to think of a name for the concept as well

We think it is a bit premature to start on the user stories right now given that we have a good idea of what we are gonna build. Charles and I are gonna start on that and look to have something complete from a Version 1.0 standpoint by mid next week, barring any setbacks. We will look to craft the user stories once we complete the MVP and use them as structure for testing features and functionality (Zak, stay tuned on this).
Benn and Andy will also be working on putting together a more formal customer survey so that we can structure the interviews we are having and start to compile the meaningful data which we will need going forward.
It’s getting exciting …

Advice: John Doerr on working in teams

Incredible Networking: Collect the names and emails of all the folks you meet. Be very careful about who your friends are and keep in touch. After all, you become the average of the five people you spend your time with. Call them up. It’s incredible what people will tell you over the phone. (This is something where I have always fallen short. I can hardly get beyond emails.)

Cary Chessick, the founder and last CEO of restaurant.com, once told me after his lecture session at UIC that networking as it is perceived is worthless. When you meet people, make sure you finish off by saying “If I can be of any help to you, please do not hesitate to get in touch”. That’s the only way that business card will actually fetch you some benefit. Some time back, at a Chicago Urban Geeks drink, I met a sales guy from Salesforce.com who sent out a mail from his phone immediately after the introduction, with one line saying who he was, where we met, and that he’ll keep an eye on tech internship notices for me. Brilliant.

360-s: If you want to find information about some company, of course you Google. So let’s say you are gathering info about Google: you’ll also want to talk to their competitors Yahoo, Bing … and find out what they are thinking. Then you triangulate all that information to get in a good position.

Coaching: Make sure there is someone who will consistently give you advice on what’s going on in your workplace.

Mentoring: Having a very trusted person outside your work who can give advice is invaluable.

Time buddy: How do you make sure that you are managing your time well? Get a time buddy and compare your calendars to see how you are spending time. Bill Gates does this with Steve Ballmer.

Another interesting practice I read about some time back on Hacker News is communicating with team members in two short points at regular intervals:
(1) What I did last week/day:
(2) What I’ll do next week/day:

As my dear friend Guru Devanla (https://github.com/gdevanla) would put it, “It’s all about setting expectations … and meeting them”!

Data loss protection for source code

Scopes of Data loss in SDLC
In a post-Wikileaks age, software engineering companies should probably start sniffing their development artifacts to protect the customer’s interest. From the requirement analysis document to the source code and beyond, the different software artifacts contain information that the clients will consider sensitive. The traditional development process has multiple points for potential data loss: external testing agencies, other software vendors, consulting agencies etc. Most software companies have security experts and/or business analysts redacting sensitive information from documents written in natural language. Source code is a bit different though.

A lot of companies do have people looking into the source code for trademark infringements and copyright statements that do not adhere to established patterns, and checking if previous copyright/credits are maintained, when applicable. Black Duck or Coverity are nice tools to help you with that.

Ambitious goal

I am trying to do a study on data loss protection in source code: sensitive information and/or quasi-identifiers that might have seeped into the code in the form of comments, variable names etc. The ambitious goal is to detect such leaks and automatically sanitize such source code (probably a replace-all is enough) while retaining code comprehensibility at the same time.

To formulate a convincing case study with motivating examples I need to mine a considerable code base and requirement specifications. But no software company would actually give you access to such artifacts. Moreover, the (academic) people who would evaluate the study are also expected to lack such access for reproducibility. So we turn towards Free/Open Source software. Sourceforge.net, Github, Bitbucket, Google Code are huge archives of robust software written by the sharpest minds all over the globe. However, there are two significant issues with using FOSS for such a study.

Sensitive information in FOSS code?

Firstly, what can be confidential in open source code? The majority of FOSS projects develop and thrive outside corporate firewalls without the need for hiding anything. So we might be looking for the needle in the wrong haystack. However, if we can define WHAT sensitive information is, we can probably get around that.

There are commercial products like Identity Finder that detect information like Social Security Numbers (SSNs), Credit/Debit Card Information (CCNs), Bank Account Information, or any custom pattern or sensitive data in documents. Some more regex foo should be good enough for detecting all such stuff …

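#!/bin/sh
# Usage: pass the source directory as the first argument.
# Reads one extended regex per line from sensitive_terms_list.txt and prints
# every matching line under SRC_DIR, prefixed with file name and line number.
SRC_DIR=$1
grep -rEHn --color=always -f sensitive_terms_list.txt "$SRC_DIR"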
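To make the idea concrete, here is a sketch of what sensitive_terms_list.txt might contain, plus a small counter for a quick summary. The two patterns (an SSN-like and a card-number-like shape) and both function names are illustrative assumptions, not a vetted detection list. They will miss other formats and will also flag harmless look-alikes such as test fixtures.

```shell
# Write a small example pattern file: one extended regex per line.
# Both patterns are illustrative assumptions, not a vetted list.
write_example_patterns() {
    cat > "$1" <<'EOF'
[0-9]{3}-[0-9]{2}-[0-9]{4}
[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}
EOF
}

# Count the lines under a directory that match any pattern in the file.
count_sensitive_lines() {
    grep -rhE -f "$1" "$2" | wc -l | tr -d ' '
}
```

Calling write_example_patterns sensitive_terms_list.txt and then count_sensitive_lines sensitive_terms_list.txt on a source directory gives a single number that can be tracked across revisions of a code base.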

Documentation in FOSS

Secondly, the ‘release early, release often’ bits of FOSS make a structured software development model somewhat redundant. Who would want to write requirements docs, design docs when you just want to scratch the itch? The nearest in terms of design or, specification documentation would be projects which have adopted the Agile model (or, Scrum, say) of development. In other words, a model that mandates extensive requirements documentation be drawn up in the form of user stories and their ilk. being a trivial example.

Still Looking
What are some of the famous Free/Open Source projects that have considerable documentation closely resembling a traditional development model (or models accepted in closed source development)? I plan to build a catalog of such software projects so that it can serve as a reference for similar work that involves traceability between source code and requirements.

Possible places to look into: (WIP)
* Repositories mentioned above
* ACM/IEEE
* NSA, NASA, CERN

Would sincerely appreciate if you leave your thoughts, comments, poison fangs in the comments section … 🙂

Hacking the newsroom

[This is part 2 of the final pitch, which talks about the newsroom and business perspective. Part 1, detailing the newsreader perspective is here.]

Before anything else, there must be a 90-second theatrical promo:

Stop laughing at my amateurish video editing! This is my first ever … even Bergman, Godard, Fellini started somewhere to be great! Jokes apart, here’s what REVEAL actually is all about:

Let’s consider a hypothetical newsroom which uses REVEAL. A journalist gets hold of a huge collection of classified documents that contain potentially sensitive information. Instead of painstakingly reading each line and jumping back to Google to search for relevant information, she uploads them to REVEAL and hits the pantry for her coffee. REVEAL goes to work and automatically parses out names of people, places, organizations etc. Using the names it detected, REVEAL attaches thumbnail images to the mappings between the named entities and the documents. The journalist now sits back, sips the coffee and flips through the images looking for someone/something/some place that’s interesting and jumps directly to the document when she finds her target.

But that’s not all. In order to make life much easier for the journalist, REVEAL uses the names and keywords from the document to aggregate semantically related content from the net (images, video, news, blogs, wiki articles) using open APIs. By making the background context readily available, it allows the journalist to focus solely on her analysis of the story.

What follows is an over the top ambitious plan for making lots of money – I mean the business plan.

Unearthing named entities involves doing tonnes of computationally intensive text analysis, and for any sizable dataset we need a cloud-based solution. While REVEAL will always be Free and Open Source Software, the business proposition is offering it as a service. Be it a startup or a news corp, whoever deploys REVEAL at their site can offer it as a service to other news agencies/organizations on a pay-by-usage model. Different packages can be offered based on when they want to share the information dug out from their documents.

Nothing like REVEAL exists today. The cohesive bond of unknown information on well-known personalities and organizations, original content (the documents), expert opinion (the journalist’s view), user-generated content (comments) and aggregated content will make REVEAL a dream product for generating ad revenues. Features for lead generation are built into the system, and the karma-points-based reader appreciation along with the 360-degree view of the world will ensure persistent traffic.

Now get me to Berlin Hackathon!
(398 words)

Most common names detected in Wikileaks cablegate files

Link to an incomplete implementation

Reveal – How much does the world know?

Have you ever had a shit-tonne of documents dumped into your inbox with an impossible deadline demanding to suck out the hidden juicy bits? Or maybe it has been a joyful experience of discovering the dump of an MILF’s emails, diplomatic cables, or code dumps of an evil corporation’s website? At moments like those, you might have uttered, “fcuk! … Omne Ignotum Pro Magnifico!”. Wouldn’t it be nice if the needles just magically popped out of the haystack? Meet Reveal (clickable prototype), a software framework that aspires to achieve that and maybe a bit more.

Background:
While sed/awk/grep-ing the cablegate files, I stumbled upon a cable that mentioned Kofi Annan asking Robert Mugabe to step down in exchange for a handsome retirement plan during the Millennium summit. Being an ignorant bloke, I could hardly recall what the Millennium summit was about, had no clue if Mugabe was still in office, or if Kofi Annan had made a comment on this! Without the right background and context I could not appreciate the data to the full extent. Below is the #MozNewsLab final project idea pitch in light of the three speakers this week: Chris Heilmann, John Resig and Jesse James Garrett.

“What is this thing for? What does it do? How is it supposed to fit into people’s lives?”, @Jesse James Garrett:
Journalists get an amazing amount of digital data every day in the form of numbers in tables. With some spreadsheet skills or help from newsroom programmers, they produce incredible revelations of the reality that hides behind those numbers. However, when the data comes in the form of unstructured text files written in natural language, there isn’t much algorithmic help available, other than full-text searches with a list of guess words. Using cutting-edge information retrieval techniques, Reveal would aim to build a framework that automatically annotates names, places, locations, dates etc. in the unstructured text files.
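To show how far plain grep gets, and where it stops, here is a deliberately naive sketch that treats any run of two or more capitalized words as a candidate name and counts how often each one occurs. This is not how Reveal detects entities. The function name top_names and the capitalization heuristic are my own assumptions, and the heuristic will happily report sentence openers like “Later Kofi Annan” as names, which is exactly the kind of noise proper named-entity recognition is meant to remove.

```shell
# Naive candidate-name counter: treats runs of two or more capitalized
# words as names. A crude heuristic, not real named-entity recognition.
# Usage: top_names FILE...   (prints the ten most frequent candidates)
top_names() {
    grep -ohE '[A-Z][a-z]+( [A-Z][a-z]+)+' "$@" |
        sort |
        uniq -c |
        sort -rn |
        head -n 10
}
```

Each output line is a count followed by the candidate name, most frequent first.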

“Adopting Open Source, Open standards“, @Chris Heilmann:
Being baptized by St. IGNUcius, the idea of Free as in Freedom runs through the core of the technology stack of Reveal. Standard LAMP stack for server side, UI powered by HTML5, CSS3 and jQuery plugins and a number of open source libraries for doing the information extraction – long post describing the information retrieval technology coming soon. (Mind map above).

Using the detected names, locations, dates etc., Reveal will try to aggregate additional information in the form of images, maps, news articles, videos, Wikipedia pages, visualizations etc. via open APIs and use them as navigational elements to browse the data. Juxtaposed with the document under scrutiny, these will provide the right context to gauge the sensitivity of the information.

“User to Contributor”, @John Resig:
Additionally, by showing a relative score of “How much does the world know?”, calculated on the basis of the aggregated information published before the documents surfaced, we can excite the newsreaders to share the information across their own social network. Add some game mechanics by quantifying that “sharing”, and we bust the filter bubble of ignorant blokes and turn them into responsible citizens who’ll raise voices against wrong doings of totalitarian regimes, evil corporations or other bad asses. This will lead to creation of more content and will act as a feedback loop to the background and context aggregation step before.

Now, a similar project by the uber journalist-programmer Jonathan Stray of the AP has won this year’s Knight Mozilla news challenge. His approach, Overview, solely focuses on clustering documents based on the cosine similarity of their tf-idf scores. Using sexy visualization, it pulls out key terms specific to the corpus under study. The night when the results of the Knight Mozilla challenge were announced, in a euphoric outburst I sent him an embarrassingly long late-night email ranting about the above. Obviously, I never heard back, but he will be releasing his code soon and I am super excited to fork it for visualizations in Reveal.
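For readers who have not met the terms, this is the textbook form of tf-idf weighting and cosine similarity. It is given only for orientation: the exact weighting Overview uses is not described in this post, so treat the formulas as the standard definitions rather than as that project's implementation. Here tf(t, d) is the count of term t in document d, df(t) is the number of documents containing t, and N is the number of documents in the corpus.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}

\cos(d_1, d_2) =
  \frac{\sum_{t} \mathrm{tfidf}(t, d_1)\,\mathrm{tfidf}(t, d_2)}
       {\sqrt{\sum_{t} \mathrm{tfidf}(t, d_1)^2}\;\sqrt{\sum_{t} \mathrm{tfidf}(t, d_2)^2}}
```

Two documents that share many rare terms score close to 1, while documents with no terms in common score 0.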

That is my final software idea pitch inspired by Chris Heilmann, John Resig and Jesse James Garrett #MozNewsLab Week 2:

Tweetsabers of News Revolution across the globe #MozNewsLab

After Amanda Cox’s lecture I was all pepped up to do some quick and sexy data visualization. In my daily life, I rely on R or GNUPlot for doing all my plots, simply because of their scripting interface. I have played a bit with the Google Chart and Visualization APIs and they are absolutely brilliant. I’m planning to get my hands dirty with matplotlib (yep, yep … Python!).

So midway through re-listening to the lecture, I was overpowered by this feeling of doing some global data visualization. One of the best global data visualization tools that has left an impression on my mind is the WebGL Globe, an open platform from the Google Data Arts Team. I grabbed the example code from here and with a simple Python script collected Twitter activities during the first week of #MozNewsLab into a .json file. Some simple changes to the sample JavaScript code and there you have colored light sabers shooting out from the globe.

The main problem was the Twitter API limit, a meager 125 queries per hour. So I had to rely on the geo-location data that I had scraped earlier for mapping #MozNewsLab participants on the world map. That allowed me to narrow down the geo-location queries to those who were not participants of #MozNewsLab. In case the geo-location info was not available on their Twitter profile, those homeless tweet counts were assigned to our dear Lab co-lead Phillip Smith.
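The data preparation step can be sketched in a few lines. As I understand the WebGL Globe sample, it loads a JSON array of named series, each holding a flat list of latitude, longitude, magnitude triples. The sketch below is not the Python script I used: it is a shell/awk stand-in that assumes a hypothetical input of one “lat lng count” line per location and scales each count by the largest one, so that magnitudes fall between 0 and 1.

```shell
# Convert "lat lng count" lines on stdin into a one-series JSON array
# of flat lat,lng,magnitude triples. The input layout is an assumption.
# Usage: to_globe_json [series-name] < counts.txt
to_globe_json() {
    awk -v series="${1:-tweets}" '
        { lat[NR] = $1; lng[NR] = $2; cnt[NR] = $3; if ($3 > max) max = $3 }
        END {
            printf "[[\"%s\",[", series
            for (i = 1; i <= NR; i++) {
                # Scale by the largest count; guard against an all-zero file.
                mag = (max > 0) ? cnt[i] / max : 0
                printf "%s%s,%s,%.3f", (i > 1 ? "," : ""), lat[i], lng[i], mag
            }
            print "]]]"
        }'
}
```

Taller sabers then simply mean more tweets from that location relative to the busiest one.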