Hacks/Hackers: Journalism meets technology. It’s a very human way of introducing people who count (as in numbers) to people who tell stories.
The Cape Town Chapter of Hacks/Hackers is convened by Raymond Joseph – a passionate hoarse-voiced journalist keen to see his profession evolve into the 21st century. At the first meeting last Friday, we had presentations from Friedrich Lindenberg from the data side and Justin Arenstein from the journalism side (not that there are ‘sides’).
Friedrich comes to us from the Open Knowledge Foundation – a great organisation that branches out into open scientific data, teaching data and open government data. OpenSpending.org was developed by a bunch of people at the OKFN and is now used to power Wheredoesmymoneygo.org (UK data), for example. If we could do similar things here, that would be cool. There is a project for that: AfricaOpenData.org (feel free to contribute if you can). Friedrich showed us some tools to extract data from web pages into a usable form, including:
- Scrape data into Google Docs using the ‘importHTML’ spreadsheet function (example/tutorial)
- Use the Scraper extension for Google’s Chrome browser, which exports into Google Docs (select a table row in a web page and choose ‘scrape similar’ in the contextual menu)
- Manually save and parse/clean the source code of a webpage using ‘Inspect Element’ or ‘View Source’ (which you can do using a menu item, or directly with a url, e.g. view-source:http://hackshackers.com/)
- Use ScraperWiki.com where you can contribute scraping code or request data
- Once a data set is scraped, it is usually messy or incomplete. For those cases, Google Refine helps clean things up.
- Other tools are useful for getting data out of, e.g. PDFs and other unfriendly formats. Examples include CometDocs.com or ABBYY FineReader (commercial)
- More links worth exploring: poppler from freedesktop.org or tesseract
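To make the 'save and parse the source' route a little more concrete, here is a minimal sketch in Python. It is not one of the tools Friedrich showed, just an illustration using only the standard library: it pulls the rows out of an HTML table, tidies the whitespace and writes them out as CSV. The names `TableScraper`, `scrape_table` and `rows_to_csv` and the sample table are made up for this example, and real pages are usually far messier than this.

```python
from html.parser import HTMLParser
import csv
import io


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; it does not handle nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Basic clean-up: collapse runs of whitespace, trim the ends.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows to CSV text, ready for a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample data standing in for a saved web page.
SAMPLE_HTML = """
<table>
  <tr><th>Department</th><th>Budget</th></tr>
  <tr><td> Health </td><td>1 200</td></tr>
  <tr><td>Education</td><td>  950 </td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

In practice you would read the saved page from a file instead of `SAMPLE_HTML`. Note that the clean-up here only fixes whitespace: a value like '1 200' is still text, and turning it into a number is exactly the kind of job the cleaning tools above are for.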
Justin gave us good examples where data helped investigative journalism find the real story, like the case where data helped figure out what kept girls out of school in Kenya after 12 years of age: lack of sanitary facilities. He talked about the Investigative Dashboard, which helps journalists research stories using data, and about a data bootcamp, where journos and techies scrape and work and learn together for three days. Justin also mentioned Tor, a project that helps safeguard anonymity (think whistle blowers). Once you have data you need to be able to do something with it, like visualize it. That’s what Overview is about. Similarly, DocumentCloud enables sharing annotated and commented data. He invited journalists to share their research materials and data with others – there might be more than one story lurking behind all that data…
He also expressed his citizen frustration at not being able to do anything in the face of news stories reporting bad things in the world. Creating spaces where citizens can go and contribute to resolve a situation after they come across the news is also a role that hacks and hackers can take on: Journalism going beyond reporting. For that, he mentioned the Public Insight Network where the public can react and contribute stories (a bit like what CNN are doing with iReport, except that it is more of a tool for citizens – hence the insight). Another initiative to help journalism make the most of the web is Mozilla Open News. Well worth a look. He finally mentioned the Knight Foundation as a source of funding to try new things out like the projects mentioned above. By the way, it is allowed to fail as long as we can all learn from it :)
In this fest of openness, there is an evolving free online resource: the Data Journalism Handbook, which proves very useful once the data has been scraped (from, for example, the Google Public Data Explorer). The role of Hacks/Hackers goes beyond building data journalism at the interface of the hacks (journos) and the hackers (techies). Data is a first step: the techies are the ‘skilled’ people who can juggle numbers, fine, but the journalistic skill of going beyond the numbers is at least as important. So while we all learn to speak a common language, let’s open our minds to each other’s imperfect worlds.
From my side: data are dirty, stats are statistical and code is buggy. Journos, what are the imperfections in your world?