
Thursday, May 27, 2010

Semantics in HTML

PDF Presentation
HTML (HyperText Markup Language) and all its derivatives is probably the most successful and widely used markup language ever. It has been used and misused over the last decades for a variety of very different tasks. This has had a destructive effect on the language and significantly influenced its semantics.
As a result, semantics is withdrawing from HTML. Current HTML code is mostly just a sequence of semantically empty div and span elements. The difference between those two is purely presentational and may be expressed as display: block versus display: inline.
When talking about semantics in current HTML, it is probably hiding in the class attribute, although that attribute is mostly used only to map elements to their CSS presentation. Expressing semantics in class is better than no semantics at all, but the model is ill. Why not express the semantic category directly in the element name? Writing sequences of semantically empty div and span elements is error-prone and redundant (this is the approach used by microformats). It is more difficult to process such documents using XML tools (XSLT, XQuery...), and it is more difficult to write and maintain schemas for such languages.

Figure 1. Get rid of unnecessary HTML elements
There is something wrong in the following code fragment (a weather box in an HTML document).
<div class="weather">
  <span class="title">Weather Forecast</span>
  <span class="city">Prague</span>
  <span class="temperature">25&deg;C</span>
  <div class="cloudy"/>
</div>
We can get rid of the redundant div and span elements and express the semantic information directly. Our code becomes nicer, more readable and easier to process. Also, in current mainstream browsers it gets rendered the same as the previous fragment; just a few very small changes need to be applied in the CSS.
  <title>Weather Forecast</title>
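
To make the "easier to process" claim concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes that the rewritten fragment uses element names mirroring the original class names (weather, title, city, temperature, cloudy); that full rewrite and the two helper functions are assumptions made for illustration only. The named entity &deg; is written as the numeric reference &#176; so that the snippet parses without an HTML DTD.

```python
import xml.etree.ElementTree as ET

# Class-based markup, following the first fragment above.
CLASS_BASED = """
<div class="weather">
  <span class="title">Weather Forecast</span>
  <span class="city">Prague</span>
  <span class="temperature">25&#176;C</span>
  <div class="cloudy"/>
</div>
""".strip()

# Hypothetical element-based rewrite: element names mirror the class names.
ELEMENT_BASED = """
<weather>
  <title>Weather Forecast</title>
  <city>Prague</city>
  <temperature>25&#176;C</temperature>
  <cloudy/>
</weather>
""".strip()


def temperature_from_classes(xml_text: str) -> str:
    """Find the temperature by filtering generic spans on their class value."""
    root = ET.fromstring(xml_text)
    return root.find(".//span[@class='temperature']").text


def temperature_from_elements(xml_text: str) -> str:
    """Find the temperature by element name; the name carries the meaning."""
    root = ET.fromstring(xml_text)
    return root.find("temperature").text


print(temperature_from_classes(CLASS_BASED))
print(temperature_from_elements(ELEMENT_BASED))
```

Both functions return the same text, but the second query names the semantic category directly instead of matching an attribute value on a generic element.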

So the lesson is: today, one language (HTML) is not able to express all the complicated domains which occur on today's Web. There are Internet shops, company presentations, newspapers, encyclopedias, chats, blogs, online community portals, webmail, online PIM with collaboration, porn sites and many more applications on the Web, and it is ill to map all their complex semantics onto two or a few more HTML tags. This leads only to an unnecessary loss of semantics in Web documents, which makes them difficult for machines to process automatically and less accessible for humans.
Rather than using HTML alone, it makes sense either to define our own semantics or, whenever possible, to use a common shared semantics which is understandable to others as well. Thanks to XML namespaces and the modularization of HTML, this is the preferred and technically feasible approach. To keep as much semantics as possible in Web documents, use your own cocktail of languages, whichever is the best fit for what you are trying to say.
I call this approach Semantical Tagging, and in fact it is a lightweight approach to the Semantic Web. Imagine you have a shared XML language for estate agencies. Expressing your offer in such a language makes it possible to run semantic queries against it. See the following example.

Figure 2. Use shared semantic languages in your documents
<r:flat xmlns:r="" xmlns:r="">
   <r:country>Czech republic</r:country>
    <r:room type="bedroom" sq="18"/>
    <r:room type="kitchen" number="25"/>
  <r:price currency="CZK" vat="19">7250000</r:price>
  <img xmlns="" src="/232/343434.jpg"/>
</r:flat>
Having data expressed in such a nice semantic way lets you ask semantic queries:
  • All flats in Prague less than 30 kilometres from the centre with 5 rooms and at least 100 square meters for less than 6000000 CZK
  • All flats with a terrace of at least 40 square meters
  • All flats in Prague with a garage and windows oriented on the east side which will be finished this year
  • All flats without annoying neighbours :)
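
As a rough illustration of how such a query could be run, the sketch below uses Python's standard xml.etree.ElementTree. The namespace URI, the r:offers wrapper, the r:city element, the id attributes, the sample offers and the find_flats helper are all hypothetical and introduced only for this example. The r:room elements with type and sq attributes and the r:price element with a currency attribute follow the fragment in Figure 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; a real shared language would define its own.
NS = {"r": "http://example.org/real-estate"}

# Hypothetical sample data: three offers in the assumed shared language.
OFFERS = """
<r:offers xmlns:r="http://example.org/real-estate">
  <r:flat id="a">
    <r:city>Prague</r:city>
    <r:room type="bedroom" sq="18"/>
    <r:room type="kitchen" sq="25"/>
    <r:price currency="CZK">7250000</r:price>
  </r:flat>
  <r:flat id="b">
    <r:city>Prague</r:city>
    <r:room type="bedroom" sq="40"/>
    <r:room type="bedroom" sq="30"/>
    <r:room type="kitchen" sq="35"/>
    <r:price currency="CZK">5900000</r:price>
  </r:flat>
  <r:flat id="c">
    <r:city>Brno</r:city>
    <r:room type="bedroom" sq="60"/>
    <r:room type="kitchen" sq="50"/>
    <r:price currency="CZK">4000000</r:price>
  </r:flat>
</r:offers>
""".strip()


def find_flats(xml_text, city, min_area, max_price, currency="CZK"):
    """Return ids of flats in `city` whose rooms total at least `min_area`
    square meters and whose price is below `max_price` in `currency`."""
    root = ET.fromstring(xml_text)
    matches = []
    for flat in root.findall("r:flat", NS):
        if flat.findtext("r:city", namespaces=NS) != city:
            continue
        # Sum the area of all rooms listed for this flat.
        area = sum(int(room.get("sq")) for room in flat.findall("r:room", NS))
        price = flat.find("r:price", NS)
        if price is None or price.get("currency") != currency:
            continue
        if area >= min_area and int(price.text) < max_price:
            matches.append(flat.get("id"))
    return matches


# Flats in Prague with at least 100 square meters for less than 6000000 CZK.
print(find_flats(OFFERS, "Prague", 100, 6000000))
```

The query works directly on the same document a browser would render, which is the point of the approach: no separate machine-readable copy is needed.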

This approach to the semantic Web is lightweight and not as feature-rich as, for example, RDF/OWL, but it has several advantages and it may be the preferable approach. Semantic tagging of data is easy, it is similar to what we have now, and it is processable with current mainstream browsers, so there is no big issue with backward compatibility. It has one source of data used for both machines and humans, and this is an advantage. Imagine the price of a property were half as high in RDF (for machines) as in HTML (the version for humans). In such a case, when sorting by price, the search engine would list the offer at the top, thus making it more accessed by humans even though the price is not such a bargain. In semantic tagging this can't happen so easily. One source of data is also easier to maintain.
Although semantic tagging doesn't allow out-of-the-box reasoning and other advanced features of RDF/OWL, it brings the most critical and most wanted features of the semantic Web, and it can be very easily implemented in the current Web environment.
For more details about this topic, please read the presentation in semanticHTML.pdf.
