


the inhuman search engine

There was a time when you could parlay a decent understanding of Google search (or any search) into a journalistic career. Journalists were, on the whole, trained to collect information through contacts and telephone calls, but at that time, they didn’t yet have a consistent grip on how to piece together stories from the Net. The majority of stories were built from legwork, not basic Internet skills. The pendulum is swinging the other way now, I think. Many, many articles are now spun from forwarded screenshots and searches. You can still get ahead a little from having advanced knowledge: there still remains a benefit, I believe, for journalists who know a little coding or a little statistics. But with the home base of journalism moving online, there’s almost certainly an emerging premium now for people who can simultaneously talk to computers and humans in languages they understand. Or maybe for people who can use the Internet to peer into motivations and other intimacies, rather than uncover facts. A good example is Gwern and Andy Greenberg’s piece on the identity of Satoshi Nakamoto. There’s some serious understanding of a lot of tech in their research, but it was mostly undone by underestimating how strange human motivation can be. Why would someone try to plant a trail suggesting they were Nakamoto, with no obvious benefit? Strange motives sink plenty of research projects. But perhaps one of the conclusions of anyone who swims in the large-scale view of conspiracy theories and fraud that the Net offers is that, absent a permanent cost, motivations can be truly random.

I was thinking about this today, just because I got caught up in an excursion into fact-checking. Someone said something on a forum; I was mildly curious who they were. The forum didn’t publish names or emails, and the username was not unique and didn’t lead anywhere. But the forum used gravatars: those little icons that show either a generated pattern or a user-configured image next to your post. Gravatars are based on your email address, which you enter to get a confirmation note when you post to some forums. The icon image itself is served from a URL based on an MD5 hash of your email.

There’s no known mathematical way to get from the hash to the email (touch wood). But the hash still leaks information. You can generate hashes from a set of possible email addresses. You can confirm a person has used a particular email address by checking that email’s hash (note there’s no guarantee someone is using their own email address; strange motivations can lead you down wrong paths). In this case, though, I was able to just search for the hash itself. I quickly found another account on a separate site using that same hashed gravatar, where the user had used a more personal username. From the username I was able to try out an email address that matched the hash. And from that, I found a site that listed the person’s full name and address. All of this took me less than ten minutes.
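The hash check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the hash is the MD5 hex digest of the trimmed, lower-cased address, and the helper names (gravatar_hash, candidate_emails, find_match) and the example.com and example.org addresses are all made up for the example.

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """MD5 hex digest of the trimmed, lower-cased address (assumed normalisation)."""
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def candidate_emails(usernames, domains):
    """Yield every username@domain combination (hypothetical guess list)."""
    for user in usernames:
        for domain in domains:
            yield f"{user}@{domain}"


def find_match(target_hash, candidates):
    """Return the first candidate whose hash equals target_hash, else None."""
    target = target_hash.strip().lower()
    for email in candidates:
        if gravatar_hash(email) == target:
            return email
    return None


# Made-up data: pretend this hash was seen in an avatar URL on a forum.
known = gravatar_hash("Jane.Doe@example.org ")
guesses = candidate_emails(["jdoe", "jane.doe"], ["example.com", "example.org"])
match = find_match(known, guesses)
print(match)  # jane.doe@example.org
```

A match only shows that somebody entered that address when posting, not that the address is theirs.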

I hadn’t really thought about using gravatars to expose identities before (others have). It would be a useful skill to have in a modern journalist’s toolkit, though. I guess more intriguingly, it might be a tool that one could provide to journalists. I keep thinking about the narrow subset of all possible characters that the world’s email addresses, and indeed human names, inhabit. If you were to set about compiling and de-duping the world’s known spamming lists, how many of the world’s emails could you collect? How quickly could you brute force everyone’s full name, or a reasonably high percentage? Over 90% of the US population is covered by 200,000 surnames: how quickly could we get high coverage by combining those with the most popular first names? (I admit to first considering this when thinking about how one could independently track the extent and use of the Right to be Forgotten in the EU. Programmatically generate a significant percentage of all the possible names in the European namespace, then check the affected and unaffected search engine results for each.)
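To get a feel for the size of that namespace, here is a rough Python sketch of the combination step. The function names and the tiny name lists are hypothetical, and the figure of 1,000 popular first names is an assumption picked for the arithmetic; only the 200,000 surnames figure comes from the text above.

```python
from itertools import product


def candidate_names(first_names, surnames):
    """Lazily yield every 'First Last' pairing of the two lists."""
    for first, last in product(first_names, surnames):
        yield f"{first} {last}"


def search_space(n_first, n_surnames):
    """Number of full-name candidates for the given list sizes."""
    return n_first * n_surnames


# Tiny made-up lists, just to show the shape of the output.
first_names = ["Ada", "Grace"]
surnames = ["Lovelace", "Hopper", "Turing"]
names = list(candidate_names(first_names, surnames))
print(names[:2])  # ['Ada Lovelace', 'Ada Hopper']

# Assumed 1,000 first names against the 200,000 surnames mentioned above.
print(search_space(1_000, 200_000))  # 200000000
```

Under those assumptions the candidate list is a couple of hundred million entries, which is small enough to enumerate, although checking each one against a search engine is a different matter.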

I would like journalism to be about creating new facts about the world, instead of reporting pre-existing facts or just propagating novel speculation.

2 Responses to “the inhuman search engine”

  1. Sumana Harihareswara Says:

    I have thought several times about the advantage I have over many peers because I am completely fine with picking up the phone and initiating a phone call. Your more nuanced speculation is intriguing to me. Also, do you have a take on Pro Publica and its ilk?

  2. Danny O'Brien Says:

    Ian Betteridge (of the law, and of these parts sometimes) and I agreed at some point that the best definition of a journalist is someone who picks up the phone and calls people. Plenty of people in pre-Web journalism were shocked when they realised that they were being temporarily outjournalismed by people who did no such thing.

    I wasn’t a very good journalist, but I can and still do write things based on heavy research of what is out there online, but buried, rather than talking to smart people who had not otherwise recorded their beliefs. But the combo is definitely the best thing.

    I like Pro Publica, but I do think that there’s a big segment of American journalism that doesn’t quite realise how dull public interest journalism can get. I went through a lot of early excitement when I first came to the US at the level of detailed investigative reporting that took place in local newspapers here: months-long reporting of court cases or government corruption. Then I realised that nobody except the people involved was actually reading it. That’s usually sufficient to get a reaction and sometimes change. If you’re a government official whose name is in the papers every day, you’re feeling some sort of pressure. But I’m not sure what relationship that pressure has to the reception of a wider public. The simplistic model of how journalism achieves change is “journalist uncovers the truth” -> “public outrage” -> “response by those sensitive to democratic forces”. But really the public outrage bit is mostly a latent power. People worry about facts in the public sphere being suddenly blown up into scandal. The fact that that doesn’t often happen doesn’t remove their fear of it. What can remove the fear is better methods to prevent that from happening.

    The reason I say all this is that the clear issue with something like Pro Publica is that it could unmoor public interest reporting from incentives that require an audience. Lots of work that fits the US model of exemplary journalism is being written now, almost all of it unread. Does that matter? Only to the extent that those who might change their behavior based on the threat of exposure realise that no-one is reading it.

    My simplistic model of US journalism is that it goes through phases of being absolutely bloody and unchecked, and then is reined back into respectability, then becomes ineffective for those wishing to make change, and so becomes bloody again. The standard model of Pulitzer journalism was created in a period of respectability, but we’ve been in a moment of bloodiness for the last few years, so that Pulitzer criterion is largely irrelevant. The Wikileaks and the Greenwalds and the Milos and the Gawkers and the TMZs run the rules now. They, plus everybody in the former audience who can build up a head of steam and swiftly propagate violations of their ethical models. But you can see everyone, from the elites to the average politically-engaged person, trying to will into place constraints on that.

    I think this is more generalisable to other countries, but the place of journalism is weird and culture-specific. I base a lot of my US model on reading about early post-revolutionary journalism (“Scandal and Civility: Journalism and the Birth of American Democracy” is a great book on this), and how early 20th century muckraking was replaced with journalism schools and professionalisation.


petit disclaimer:
My employer has enough opinions of its own, without having to have mine too.