How does a computer know what it’s reading?
When Tim Berners-Lee invented the web in the early 1990s, he envisioned an open system for people to share information across the world. Would he be surprised at the amount of information available sixteen years later? From its early days of academic papers, the web has grown to embrace photos, videos, up-to-the-minute news, blogs, shops, annotated maps, and a thousand other classes of information. You can spend hours browsing through this mass, consciously and subconsciously cataloguing it all.
But how does a computer know what it’s reading through?
When a football fan reads the headline Rooney start ‘will scare Swedes’, they can be fairly certain that it will be a football (soccer) news article on the England player Wayne Rooney, most likely regarding the recent World Cup match between England and Sweden. They can even make a good guess as to the date of the article (most likely in the hours leading up to the match). A computer can do none of this cataloguing; it has no understanding of a news article, Wayne Rooney, football, or the national teams involved. (Let us ignore complex linguistic parsing for now.)
This paper will introduce the Semantic Web, the next stage in the development of the web. We will explain why semantics are important, how they can help computers catalogue data, and how this will benefit us as individuals. We will also look at microformats, an ongoing project that aims to help us create a more semantic web. We assume you have a good knowledge of XHTML.
The web is a great place: we can do almost anything online, from grocery shopping to watching live World Cup footage, and we find it very easy to navigate. Computers do not.
Let’s go straight to an example of a problem we might have: finding a plumber. Most likely, the next time you’ll need a plumber will be when you’re facing an emergency — a water leak, a lack of heating, and so on. At this juncture you’re not likely to want to spend an hour searching online for a local plumber that comes recommended.
How can a computer help you find a local plumber? How can a computer know what a plumber is? How can a computer understand what it is to be a plumber local to you?
You might think you can search Google for ‘plumber edinburgh’, for example. But for all its usefulness, Google really has no idea what you’re looking for; it only returns those pages that include your search terms. How do you stop results coming back for plumbers in Edinburgh, Indiana, or for the Plumber family of Edinburgh? You want to be able to define what a plumber is, and you want to be able to add properties to a plumber, such as their location and their trustworthiness.
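To make the difference concrete, here is a minimal sketch in Python. The three sample pages and their properties (type, locality, country) are invented for illustration; they are not taken from any real search engine or vocabulary. The sketch contrasts matching on search terms alone with filtering on declared properties.

```python
# Hypothetical sample pages: the text and the properties are invented for illustration.
PAGES = [
    {"text": "Reliable plumber in Edinburgh, Scotland",
     "type": "plumber", "locality": "Edinburgh", "country": "GB"},
    {"text": "Plumber serving Edinburgh, Indiana",
     "type": "plumber", "locality": "Edinburgh", "country": "US"},
    {"text": "The Plumber family of Edinburgh: a history",
     "type": "family-history", "locality": "Edinburgh", "country": "GB"},
]


def keyword_search(pages, query):
    """Return pages whose text contains every search term (case-insensitive)."""
    terms = query.lower().split()
    return [page for page in pages
            if all(term in page["text"].lower() for term in terms)]


def structured_search(pages, **properties):
    """Return pages whose declared properties all match the requested values."""
    return [page for page in pages
            if all(page.get(key) == value for key, value in properties.items())]


keyword_hits = keyword_search(PAGES, "plumber edinburgh")
structured_hits = structured_search(
    PAGES, type="plumber", locality="Edinburgh", country="GB")

print(len(keyword_hits))           # 3: every page mentions both words
print(structured_hits[0]["text"])  # Reliable plumber in Edinburgh, Scotland
```

The keyword search cannot tell the three pages apart, while the property-based search can, but only because each page has declared what it describes.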
This is where the Semantic Web comes in.
Imagine an address book which, over time, as you browse the Web, collects and stores the contact details of various plumbers which, at the time, you barely notice being advertised. The details could come from anywhere: the Yellow Pages web site, a post on a friend’s blog where they recommend a plumber, or a site listing local services, for example. Using each plumber’s unique identifier (their URI, or Uniform Resource Identifier), it finds reviews of the plumbers so you can judge their reputation. And that’s just the start.
How is this possible? It’s all down to semantic markup and microformats.
Semantics, semantics, semantics
URIs: understanding Bob Smith is different from Bob Smith
One of the building blocks of the Semantic Web is the Uniform Resource Identifier (URI). Berners-Lee promotes the idea that URIs can be used to uniquely identify anything from our recommended local plumber to the abstract notion of world peace. Each plumber would have a URI, used to identify them uniquely on the web, and used by reviewers to indicate they are reviewing that unique plumber.
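As a rough illustration, the following Python sketch keys reviews on a URI rather than on a name. The example.org URIs, the ratings, and the comments are all hypothetical; the point is only that two plumbers called Bob Smith stay distinct when each review names the URI of the plumber it is about.

```python
from collections import defaultdict

# Hypothetical URIs, invented for illustration.
BOB_EDINBURGH = "http://example.org/plumbers/bob-smith-edinburgh"
BOB_GLASGOW = "http://example.org/plumbers/bob-smith-glasgow"

# Two different plumbers who share a name.
names = {BOB_EDINBURGH: "Bob Smith", BOB_GLASGOW: "Bob Smith"}

# Each review records the URI of the plumber being reviewed (made-up data).
reviews = [
    (BOB_EDINBURGH, 5, "Fixed the leak within the hour."),
    (BOB_GLASGOW, 2, "Turned up a day late."),
    (BOB_EDINBURGH, 4, "Tidy work, fair price."),
]


def reviews_by_uri(review_list):
    """Group (rating, comment) pairs under the URI of the plumber reviewed."""
    grouped = defaultdict(list)
    for uri, rating, comment in review_list:
        grouped[uri].append((rating, comment))
    return dict(grouped)


def average_rating(review_list, uri):
    """Average the ratings for one URI, or return None if it has no reviews."""
    ratings = [rating for review_uri, rating, _ in review_list if review_uri == uri]
    return sum(ratings) / len(ratings) if ratings else None


grouped = reviews_by_uri(reviews)
print(names[BOB_EDINBURGH] == names[BOB_GLASGOW])  # True: the names are identical
print(average_rating(reviews, BOB_EDINBURGH))      # 4.5
print(average_rating(reviews, BOB_GLASGOW))        # 2.0
```

Had the reviews been keyed on the name alone, all three would have been merged into a single, misleading reputation.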
A microformat ‘mark-over’
Because of the vast amount of information on the web, no one is entirely sure how we can create a standard for extracting all the useful bits. One group, however, is giving it a good go: enter Microformats.
The Microformats Project has devised several small formats for common data, the two finding most use being contact information and calendar events, known as hCard and hCalendar respectively. Microformats are not a new language; rather, they fit snugly into existing pages, often without the need for any visual change.
We’re not going to cover microformats in great detail (you can find that on the project’s wiki); we’ll instead provide an example. Remember Joe Bloggs from ‘The separation of structure, presentation, and behaviour as a software architecture’? Let’s format his details with the hCard microformat:
<dl class="vcard">
<dd class="fn">Joe Bloggs</dd>
<dd><a href="mailto:firstname.lastname@example.org" class="email">firstname.lastname@example.org</a></dd>
<dd class="tel">+44 (0)131 467 4602</dd>
</dl>
As you can see, this is a fragment of HTML, understood by web browsers and other software since the mid-nineties. The cleverness of microformats lies in the special classes we’ve added to the elements.
Notice the vcard class given to the dl element? Or the fn class on the first dd element? What about email on the a element? It’s these trivial additions we make to the underlying markup that allow for the information to be pulled out and categorised by a computer.
To describe what the code above does: an hCard (that is, the contact details of an individual or business) is denoted by an element whose class attribute includes vcard (hCard is based on the vCard specification). The descendant elements (not just the children) of this element can all make up part of the hCard; all they need is a class recognised by the hCard specification. (They can, of course, have any number of other classes too.)
So the hCard specification defines the fn class to mean “this element’s content is the full name of the individual or business”. The tel class defines a telephone number. Email addresses are a special case: an email address is taken from the href attribute of the a element marked with the email class.
And there is no need to worry about changing your page to use definition lists; hCard classes can be added to any elements, and the hCard data can be as far apart in a page, and in any order, as you need.
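The rules just described can be sketched in a few lines of Python using the standard library’s html.parser module. This is an illustrative sketch, not a conforming hCard parser: the HCardParser class is our own invention, it handles only the fn, tel and email classes discussed here, and it assumes every non-void element is properly closed.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class HCardParser(HTMLParser):
    """Collect fn, tel and email from every element with the class "vcard"."""

    TEXT_PROPERTIES = {"fn", "tel"}  # properties read from an element's text

    def __init__(self):
        super().__init__()
        self.cards = []       # one dict per hCard found
        self._depth = 0       # nesting depth inside the current hCard (0 = outside)
        self._capturing = []  # open properties: (name, depth, list of text parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self._depth == 0 and "vcard" not in classes:
            return  # outside any hCard
        if self._depth == 0:
            self.cards.append({})
        self._depth += 1
        for name in classes:
            if name in self.TEXT_PROPERTIES:
                self._capturing.append((name, self._depth, []))
            elif name == "email" and tag == "a":
                # Special case: the address comes from href, not the link text.
                href = attributes.get("href") or ""
                if href.startswith("mailto:"):
                    href = href[len("mailto:"):]
                self.cards[-1]["email"] = href

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or self._depth == 0:
            return
        # Close any property that was opened by the element now ending.
        while self._capturing and self._capturing[-1][1] == self._depth:
            name, _, parts = self._capturing.pop()
            self.cards[-1][name] = " ".join("".join(parts).split())
        self._depth -= 1

    def handle_data(self, data):
        for _, _, parts in self._capturing:
            parts.append(data)


PAGE = """
<dl class="vcard">
  <dt>Name</dt>
  <dd class="fn">Joe Bloggs</dd>
  <dt>Email</dt>
  <dd><a href="mailto:firstname.lastname@example.org" class="email">Email Joe</a></dd>
  <dt>Telephone</dt>
  <dd class="tel">+44 (0)131 467 4602</dd>
</dl>
"""

parser = HCardParser()
parser.feed(PAGE)
print(parser.cards)
```

Because the parser looks only at class names and nesting, the same code works whether the hCard is marked up as a definition list, a paragraph, or a table row.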
How do we use microformats?
So you’ve marked up all your pages with microformats; you may be wondering how you use these little gems, hidden away in the structure of your page. Microformats might be in their infancy (around one year old), but people are finding ways to use them faster than you can say “Do you have the name of a good plumber?”
The most popular microformats tool must be Brian Suda’s X2V, which converts any web page containing hCard items into vCard files, and hCalendar items into iCalendar files. The most interesting use of this is to grab people’s contact details from their web site and import them into Microsoft Outlook, Apple’s Address Book and iCal, among many other applications.
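To give a feel for what such a conversion involves, here is a simplified Python sketch that turns extracted hCard data into vCard 3.0 text. It is not how X2V itself works; the hcard_to_vcard function is hypothetical, it covers only the three properties used in this paper, and it emits the structured N property only for two-word names, which is a deliberately naive assumption.

```python
def escape_vcard_text(value):
    """Escape the characters that are special in vCard text values."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def hcard_to_vcard(card):
    """Build vCard 3.0 text from a dict with 'fn' and optional 'email' and 'tel'."""
    full_name = card["fn"]
    lines = ["BEGIN:VCARD", "VERSION:3.0", "FN:" + escape_vcard_text(full_name)]
    parts = full_name.split()
    if len(parts) == 2:
        # Naive assumption: "Given Family" word order.
        given, family = parts
        lines.append("N:{};{};;;".format(
            escape_vcard_text(family), escape_vcard_text(given)))
    if "email" in card:
        lines.append("EMAIL:" + card["email"])
    if "tel" in card:
        lines.append("TEL:" + card["tel"])
    lines.append("END:VCARD")
    # vCard lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"


joe = {
    "fn": "Joe Bloggs",
    "email": "firstname.lastname@example.org",
    "tel": "+44 (0)131 467 4602",
}
vcard = hcard_to_vcard(joe)
print(vcard)
```

Saved with a .vcf extension, text of this shape is what address book applications typically import.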
You might even be using this service without knowing: World Cup Kick-off is using X2V to provide a service that allows you to add World Cup kick-off times to your calendar. For those of you in the UK, here’s the England schedule. (Sorry Scotland!) And Technorati have business cards for their staff, just waiting to be collected by you with X2V.
Tails Export, a free extension for the Firefox web browser, indicates to users that there are microformats embedded in a page; so, for example, you would know when you came across a page complete with hCards, and could then save them into your address book.
Rolling our own
Of course, it’s more fun when we create our own tools. Because microformats are embedded in existing XML documents, we can use the tools already available for transforming XML. You could write XSL transformations for pulling out reviews and events, for example, or write a search engine such as Technorati’s microformats search or Rubhub’s XFN search. As a pointer, the W3C are doing some work along these lines as part of their GRDDL data views project.
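As a starting point, the sketch below treats a well-formed XHTML page as ordinary XML and pulls out hCalendar events. It swaps XSLT for Python’s xml.etree.ElementTree to keep the example short; the sample page and the extract_events helper are invented for illustration, and only the vevent, summary and dtstart classes are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: well-formed XHTML containing one hCalendar event.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="vevent">
      <span class="summary">England v Sweden</span>
      <abbr class="dtstart" title="2006-06-20">20th June</abbr>
    </div>
  </body>
</html>"""


def has_class(element, name):
    """True if the element's class attribute contains the given class name."""
    return name in element.get("class", "").split()


def text_of(element):
    """Concatenate all text inside an element and trim surrounding whitespace."""
    return "".join(element.itertext()).strip()


def extract_events(xhtml):
    """Return a list of dicts, one per element carrying the vevent class."""
    root = ET.fromstring(xhtml)
    events = []
    for vevent in (el for el in root.iter() if has_class(el, "vevent")):
        event = {}
        for element in vevent.iter():
            if has_class(element, "summary"):
                event["summary"] = text_of(element)
            elif has_class(element, "dtstart"):
                # Prefer the machine-readable title attribute when present.
                event["dtstart"] = element.get("title") or text_of(element)
        events.append(event)
    return events


events = extract_events(PAGE)
print(events)
```

An XML parser will reject pages that are not well-formed, so a tool meant for the wider web would need a more forgiving HTML parser in front of it.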
Many agree that microformats are but a means to an end (what is often called ‘the real-world semantic web’ or ‘lower-case semantic web’), and that the technology at the heart of the Semantic Web will be RDF (Resource Description Framework). This may well be the case, but microformats give you a chance to start playing with the semantic web now, learning and understanding what the web will be in the future. Once the W3C’s RDF takes a greater role outside of academia, you will be well-placed to make the most of the Semantic Web.
The next revision of the web, the Semantic Web, may be a few years off yet, but that’s no reason not to start using the ideas from which it is being formed. Adding semantics (meaning) to existing documents allows computers to categorise and handle information, allowing us to use it in much more useful and interesting ways.
Microformats are an example of how we can do this now. Contact details on a web page can be added to your address book; event schedules can be added to your calendar. And, eventually, finding a plumber online will be easy!
- Rooney start ‘will scare Swedes’, BBC, 19th June 2006.
- Microformats, The Microformats Project, June 2006.
- Give yourself a URI, Sir Tim Berners-Lee, January 2006.
- Semantic Web, Wikipedia, June 2006 (current version).
- Uniform Resource Identifier, Wikipedia, June 2006 (current version).
- Resource Description Framework, Wikipedia, June 2006 (current version).
- GRDDL data views: getting started, learning more, W3C, February 2006.
Mercurytide is a forward thinking, dynamic, and innovative Internet applications development company. We are well-established with a proven track-record within the industry of attracting blue chip clients from around the world. We produce regular white papers on a variety of technology-orientated topics. For more details see the Mercurytide web site.