Oblomovka

Tagling

Wow, it's messy in this website's CMS. It's like I'm typing this *directly into my webserver*. Anyway.

I've always rather admired Yahoo!'s Content Analysis Term Extraction Web Service, even if it's not the most mash-up-tastic of APIs. Still, I did think that it might work bolted onto that other Web 2.0 theme, tagging.

This is what term extraction means: you feed Yahoo! a pile of text from a document via its API, and it spits out what it thinks are the significant words or phrases in that sample. This little Ruby program uses Yahoo! Term Extraction on text you supply via STDIN, then gives you (by default) a list of suggested Technorati tags for the text, each one marked up as a link. If you give the program "-txt" as its first argument, it'll just spit out the tags in human-readable plaintext.
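
The two output modes are easy to try on their own, before any network code gets involved. What follows is only a rough sketch: `format_tag` is a made-up helper name rather than anything in the script, and HTML-escaping the link text is my own assumption about what you'd want.

```ruby
require 'cgi'

# Hypothetical helper: render one extracted term either as plain text
# or as a Technorati-style tag link.
def format_tag(term, plain: false)
  return term if plain

  # CGI.escape makes the term safe for a URL; CGI.escapeHTML (an assumed
  # extra) makes it safe as link text.
  href = "http://technorati.com/tag/#{CGI.escape(term)}"
  %(<a href="#{href}" rel="tag">#{CGI.escapeHTML(term)}</a>)
end

puts format_tag('ruby program')
puts format_tag('ruby program', plain: true)
```

The keyword argument stands in for the "-txt" switch, so the formatting can be checked without touching ARGV.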

Yahoo! will generally give you more tags than you reasonably need, so the results usually require manual pruning. Here are the tags it suggested for the text on this page.

      danny% w3m -dump -T text/html tagling.php3 | tagling

phrases
technorati
modern browser
cms
ruby program
sleep of reason
stdin
spits
pruning
stylesheets
tagging
apis
webserver
mash
spit
messy

Some of those are from the secret text you see if you read this page in a text browser, and I'm not sure what depths of my subconscious it was ploughing at the end, but not bad.
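
Some of the manual pruning could probably be automated. Here's a minimal sketch, assuming a hand-maintained reject list and a deliberately crude plural rule (drop "spits" when "spit" is also there). `prune_tags` and its `reject:` parameter are hypothetical names of mine, not part of tagling.

```ruby
# Hypothetical helper: drop tags on a reject list, case-insensitive
# duplicates, and simple plurals whose singular is also present.
def prune_tags(tags, reject: [])
  rejected = reject.map(&:downcase)
  seen = {}
  kept = tags.map { |t| t.strip.downcase }.reject do |t|
    skip = t.empty? || rejected.include?(t) || seen.key?(t)
    seen[t] = true
    skip
  end
  # Crude plural folding: only a trailing "s" is considered.
  kept.reject { |t| t.end_with?('s') && kept.include?(t.chomp('s')) }
end

suggested = %w[technorati cms stdin spits pruning tagging spit messy]
puts prune_tags(suggested, reject: %w[messy pruning])
```

It won't catch anything subtler than a trailing "s", so a human eye is still needed for the likes of "sleep of reason".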

Here's the code. I may move it from here if it grows any bigger, but right now you might as well just cut and paste. Don't worry about trying to get a developer token from Yahoo! -- they dole them out on a per-application basis, so the pre-cooked appid in the code should do you fine.

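#!/usr/bin/env ruby
###
# tagling - Use Yahoo to spit out tags for a piece of text
###
# Warnings are switched on below rather than with "-w" in the shebang,
# because Linux hands env "ruby -w" as a single argument.
$VERBOSE = true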

require 'cgi'
require 'rexml/document'
require 'net/http'

appid = 'tagling1.0'
api_uri = URI.parse('http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction')

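# Read the whole document from STDIN and POST it to the term extraction service.
text = STDIN.read
response = Net::HTTP.post_form(api_uri, { 'appid' => appid, 'context' => text })

# Each Result element in the XML response holds one suggested term.
doc = REXML::Document.new(response.body)

doc.each_element('//Result') do |result|
  term = result.text
  puts case ARGV[0]
       when '-txt' then term
       else %(<a href="http://technorati.com/tag/#{CGI.escape(term)}" rel="tag">#{CGI.escapeHTML(term)}</a>)
       end
end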
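
If you'd rather poke at the parsing half without hitting Yahoo! at all, something like this should work offline. The sample XML is invented: it assumes the response is a flat list of Result elements, which is all the script's `//Result` XPath relies on. `extract_terms` is a hypothetical name as well.

```ruby
require 'rexml/document'

# Invented sample; the real response may carry extra attributes or namespaces.
SAMPLE_RESPONSE = <<~XML
  <ResultSet>
    <Result>ruby program</Result>
    <Result>stdin</Result>
  </ResultSet>
XML

# Hypothetical helper: return the text of every Result element, in order.
def extract_terms(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//Result').map(&:text)
end

puts extract_terms(SAMPLE_RESPONSE)
```

Keeping the parsing in its own method means you can feed it saved responses and see what the XPath picks up, without spending any API calls.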

This page looks very fancy in a modern browser, with "stylesheets" and "layout" and things, but frankly I prefer the way you're seeing it here. Congratulations for not crumbling to the Browser Upgrade Initiative! Support the Web Designer Downgrade Conclusion!
