Mail to the author
xavier at ultra-fluide.com

Semantic markup using XSL transformation

XSLT

Many resources on the Web cover XSLT (XSL Transformations), starting with the W3C recommendation, so we will not dwell on the technology itself and simply recap the essential points. XSLT is an XML-based language for writing transformation rules that turn an XML source tree into an XML result tree.

The XSLT processor is the tool that interprets the XSLT language and carries out the actual transformation of the XML trees.

Semark.xsl: general principles

Given an XML source document and a lexicon supplied as a parameter file, the semark.xsl transformation produces an XML result tree with the same content as the source, but with markup added around the specific terms described in the lexicon. The lexicon lists the specific vocabulary and supplies the explanation that the markup attaches to each term.
The transformation can be applied to any XML file, and in particular to web pages written in XHTML.

Suppose, for example, that we have the following fragment of the source tree:

<body>
   <h1>SEMARK.XSL</h1>
   <p>an interesting <a href="url">xsl</a> tool
      <br />based on :
   </p>
   <ul>
       <li>XML</li>
       <li>XSLT</li>
   </ul>
</body>

And let's suppose that the lexicon looks like this:

<body>
   <dl>
      <dt>XSLT</dt> <dd>XSL Transformation</dd>
      <dt>SVG</dt> <dd>Scalable Vector Graphics</dd>
      <dt>XSL</dt> <dd>eXtensible Stylesheet Language</dd>
    </dl>
</body>

With the default options we would obtain the following result:

<body>
   <h1>SEMARK.XSL</h1>
   <p>an interesting
      <a href="url">
         <acronym title="eXtensible Stylesheet Language">xsl</acronym>
      </a>
      tool
      <br />based on :
   </p>
   <ul>
       <li>XML</li>
       <li>
           <acronym title="XSL Transformation">XSLT</acronym>
       </li>
   </ul>
</body>
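
To make the effect of the transformation concrete, here is a minimal Python sketch of the markup step applied to a single run of text. It is an illustration of the idea, not the semark.xsl stylesheet itself. The names LEXICON and mark_up are hypothetical, and the matching rule (whole whitespace-delimited words, compared without regard to case) is an assumption chosen so that the sketch agrees with the example above, where "xsl" and "XSLT" are marked up but "SEMARK.XSL" is not.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Lexicon entries mirroring the example (term -> definition).
LEXICON = {
    "XSLT": "XSL Transformation",
    "SVG": "Scalable Vector Graphics",
    "XSL": "eXtensible Stylesheet Language",
}


def mark_up(text, lexicon, element="acronym", attribute="title"):
    """Wrap each whitespace-delimited word of `text` that is a lexicon entry."""
    # Assumption: comparison ignores case, as with the default options.
    folded = {term.lower(): definition for term, definition in lexicon.items()}
    pieces = []
    # The capturing group keeps the whitespace runs in the split result,
    # so the original spacing is preserved when the pieces are joined.
    for token in re.split(r"(\s+)", text):
        definition = folded.get(token.lower())
        if definition is None:
            pieces.append(escape(token))
        else:
            pieces.append(
                f"<{element} {attribute}={quoteattr(definition)}>"
                f"{escape(token)}</{element}>"
            )
    return "".join(pieces)
```

Calling mark_up("based on XSLT", LEXICON) wraps only the last word, while mark_up("SEMARK.XSL", LEXICON) returns the text unchanged because the whole token is not a lexicon entry. Multi-word expressions would need a different matching strategy than this word-by-word sketch.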

The lexicon has a fixed structure based on the dl, dt and dd elements. Each dt element holds a word or expression to be marked up in the source document, and the dd element that follows it describes that entry. During markup, the content of the dd element becomes the value of the markup attribute, if one is configured.
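
The dl/dt/dd structure can be read into a simple term-to-definition table. The following Python sketch shows one way to do it with the standard library. The names LEXICON_XML and load_lexicon are hypothetical, and the sketch assumes a lexicon without XML namespaces in which each dt is followed by its dd; a real XHTML file that declares a namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# The lexicon from the example, as a string for the sake of the sketch.
LEXICON_XML = """<body>
   <dl>
      <dt>XSLT</dt> <dd>XSL Transformation</dd>
      <dt>SVG</dt> <dd>Scalable Vector Graphics</dd>
      <dt>XSL</dt> <dd>eXtensible Stylesheet Language</dd>
   </dl>
</body>"""


def load_lexicon(xml_text):
    """Return {dt text: dd text}, pairing each dt with the dd that follows it."""
    root = ET.fromstring(xml_text)
    lexicon = {}
    for dl in root.iter("dl"):
        term = None
        for child in dl:
            if child.tag == "dt":
                # Remember the entry until its definition is found.
                term = "".join(child.itertext()).strip()
            elif child.tag == "dd" and term is not None:
                lexicon[term] = "".join(child.itertext()).strip()
                term = None
    return lexicon
```

With the example lexicon, load_lexicon(LEXICON_XML)["SVG"] gives "Scalable Vector Graphics", which is the text that would fill the title attribute under the default options.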

Semark.xsl: operating options

The semark.xsl transformation accepts parameters that adjust the way it operates.

lexicon-name: Access path to the file containing the lexicon. Default value: lexicon.html.
to-parse: List of the elements in the source document whose content is analysed for markup. To receive markup, content must have an element on this list as an ancestor. Default value: body.
not-to-parse: List of the elements in the source document that are excluded from the analysis. No content with an ancestor on this list is marked up. not-to-parse takes priority over to-parse: content with ancestors in both lists is not analysed. Default value: empty string (no elements in this list).
marker-name: Element and attribute used to produce the markup. The markup attribute, if present, takes its value from the content of the dd element in the lexicon. The markup has the form: <element attribute="dd content taken from the lexicon">vocabulary to mark up</element>. Default value: acronym title.
max-per-doc: Maximum number of markups per lexicon entry for the entire document. Default value: -1 (no limit).
element-to-limit: List of the elements for which a specific limit on the number of markups applies. Default value: empty string (no elements in this list).
max-per-element: Maximum number of markups per lexicon entry within each element listed in the element-to-limit parameter. Default value: -1 (no limit).
case-accent-sensitive: Boolean. If true, only occurrences whose case and accents match exactly are marked up. Default value: false.
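
The rules behind several of these options can be sketched in a few lines of Python. This is an illustration of the described behaviour, not code taken from semark.xsl: the function names are hypothetical, element lists are passed as Python sequences because the text does not say how the stylesheet delimits them, and accent folding by Unicode decomposition is an assumption about what "accent-insensitive" means.

```python
import unicodedata


def should_parse(ancestors, to_parse=("body",), not_to_parse=()):
    """Apply the to-parse / not-to-parse rule to a list of ancestor names."""
    # not-to-parse has priority: one excluded ancestor is enough to skip.
    if any(name in not_to_parse for name in ancestors):
        return False
    return any(name in to_parse for name in ancestors)


def fold(text):
    """Lower-case the text and strip accents (assumed folding rule)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower()


def same_word(found, entry, case_accent_sensitive=False):
    """Compare an occurrence with a lexicon entry."""
    if case_accent_sensitive:
        return found == entry
    return fold(found) == fold(entry)


def under_limit(count, maximum=-1):
    """True while another markup is allowed; -1 means no limit."""
    return maximum == -1 or count < maximum
```

For example, text inside body but also inside an element listed in not-to-parse is skipped, "xsl" matches the entry "XSL" only when case-accent-sensitive is false, and under_limit(1, 1) is false because the single allowed markup has already been used.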

