skip to main bit
a man slumped on his desk, from 'The Sleep of Reason Produces
      Monsters'

Oblomovka

Currently:

Archive for February 20th, 2016

2016-02-20

the inhuman search engine

There was a time when you could parlay a decent understanding of Google search (or any search) into a journalistic career. Journalists were, on the whole, trained to collect information through contacts and telephone calls, but at that time, they didn’t yet have a consistent grip on how to piece together stories from the Net. The majority of stories were built from legwork, not basic Internet skills. The pendulum is swinging the other way now I think. Many, many articles are now written that were spun from forwarded screenshots and searches. You can still get ahead a little from having advanced knowledge: there still remains a benefit, I believe, for journalists who know a little coding or a little statistics. But with the home base of journalism moving online, here’s almost certainly an emerging premium now for people who can simultaneously talk to computers and humans in languages they understand. Or maybe can use the Internet to peer into motivations and other intimacies, rather than uncover facts.  A good example is Gwern and Andy Greenberg’s piece on the identity of Satoshi Nakamoto. There’s some serious understanding of a lot of tech in their research, but it was mostly undone by underestimating how strange human motivation can be. Why would someone try to plant a trail suggesting they were Nakamoto, with no obvious benefit? Strange motives sink plenty of research projects. But perhaps one of the conclusions of anyone who swims in the large scale view of conspiracy theories and fraud that the Net offers is that, absent a permanent cost, motivations can be truly random.

I was thinking this today, just because I got caught up in an excursion into fact-checking. Someone said something on a forum; I was mildly curious who they were. The forum didn’t publish names or emails, and the username was not unique or lead anywhere. But the forum used gravatars: those little icons that either show patterns or a user-configured image next to your post. Gravatars are based on your email address which you enter to get a confirmation note when you post to some forums. The icon image itself is served from gravatar.com, based on a MD5 hash of your email.

There’s no known mathematical way to get from the hash to the email (touch wood). But the hash still leaks information. You can generate hashes from a set of possible email addresses. You can confirm a person has used a particular email address by checking that emails hash (note there’s no guarantee someone is using their own email address — strange motivations can lead you down wrong paths). In this case, though, I was able to just search for the hash itself. I quickly found another account on a separate site using that same hashed gravatar, and where the user had used a more personal username. From the username I was able to try out an email address that matched the hash. And from that, I found a site that listed the person full name and address. All of this took me less than ten minutes.

I hadn’t really thought about using gravatars to expose identities before (others have). It would be a useful skill to have in a modern journalist’s toolkit though. I guess more intriguingly, it might be a tool that one could provide to journalists. I keep thinking about the narrow subset of all possible characters that the world’s email addresses, and indeed human names inhabit. If you were to set about compiling and de-duping the world’s known spamming lists, how many of the world’s emails could you collect? How quickly could you brute force everyone’s full name, or a reasonably high percentage? Over 90% of the US population are covered by 200,000 surnames: how quickly could we get high coverage by combining those with the  most popular first names? (I admit to first considering this when thinking about how one could independently track the extent and use of the Right to be Forgotten in the EU. Programmatically generate a significant percentage of all the possible names in the European namespace, then check the affected and unaffected search engine results for each.)

I would like journalism to be about creating new facts about the world, instead of reporting pre-existing facts or just propagating novel speculation.

                                                                                                                                                                                                                                                                                                           

petit disclaimer:
My employer has enough opinions of its own, without having to have mine too.