Lightweight Semantic Markup for Software Manuals

April 02, 2017 – tagged Docbook, Scala

Like most software developers, once in a while I need to write not only program code, but also text for humans to read, for example README files, tutorials, or software manuals. Lightweight markup languages (LML) such as Markdown or Textile have become popular for many use cases because they allow for some visual markup (e.g. bold or italic) while still being very easy to write and readable in plain text.

The first LML I came across was Textile, when I wrote a static site generator for my never-to-be-published website at university. Nowadays I guess Markdown has become the most widely used language, at least among technical people, as many tech-related websites (Github, Stack Overflow, ...) support some dialect of Markdown in their text input fields. Some situations (e.g. “nested list with an embedded blockquote”) always get me into trouble when using Markdown, but it is still “good enough” for issue comments on Github or simple README files.

Languages for writing software documentation have a slightly bigger set of requirements. Off the top of my head, here are a few things that I would expect from my documentation system when writing a software handbook:

  1. Multi-format publishing, e.g., in HTML, PDF, ePub.
  2. Nicely formatted source code snippets.
  3. A possibility to format and/or annotate source code snippets.
  4. Semantic markup: Even though file names, variables, and shell commands may all be typeset in the same font, there should be a possibility to mark them as “file name”, “variable” or “shell command”, so that there is a chance to change formatting later or, say, create an index of all mentioned files.
  5. Easy to write, that is, no XML, good documentation of language elements, syntax that is easy to remember.

A couple of examples to show what I mean:

  • The documentation of PostgreSQL (for example, the CREATE TABLE reference) has a block explaining the syntax of each statement, where variables and placeholders have a special highlighting:

    [Screenshot: PostgreSQL documentation]

  • The waf book has syntax highlighting for code snippets, and its command line examples carry annotations that explain the meaning of each parameter (cf. section 3.1.1):

    [Screenshot: waf book]

  • The FreeBSD handbook (for example, the section on Mounting and Unmounting File Systems) has different formatting for file names, keywords, commands, parameters:

    [Screenshot: FreeBSD handbook]

    It is available as single-page HTML, multi-page HTML, PDF, and other formats.

These are fine examples, and recently I have been looking for a system that can produce this kind of documentation. As far as the examples above are concerned, PostgreSQL uses SGML (cf. CREATE TABLE source), waf seems to use hand-crafted HTML (cf. source) – although the comments indicate that it may have been created from AsciiDoc – and FreeBSD uses Docbook XML (cf. Basics chapter source). Raw HTML is of course able to express anything you want, and Docbook was made for technical documentation, so it covers a lot of functionality, but both are cumbersome to write.

What other choices are there?

Most lightweight markup languages, including Markdown, AsciiDoc, and org-mode, easily convert to various formats – if not with a native application, then via pandoc. However, they all lack support for semantic markup, and at some point all LMLs suffer from the problem of escaping characters that have a special meaning in that language (cf. How can the backtick character be included in code?).

LaTeX supports semantic markup and appearance can be customized in a very flexible manner, but the export to HTML and ePub is not very well supported.

A very popular tool, in particular in the Python world, is Sphinx, maybe partly due to the possibility to host documentation on Read the Docs. reStructuredText, the markup language used by Sphinx, is in general able to support semantic markup by means of text roles, so it is possible to write :file:`/etc/fstab` or :command:`git`. However, from a personal point of view, I find the syntax unintuitive and very hard to remember and the error messages hardly helpful, and some things like “monospace text within link text” are very, very complicated to achieve. I dislike the syntax so much that I will dismiss Sphinx as not “easy to write”.

Then there is troff, which is used to write UNIX man pages. From a brief glance at some code examples, I guess that troff is totally capable of semantic markup, but due to its age I am not so sure about multi-format output, and the syntax looks very cryptic to me.

Whatever I researched, I always had the feeling that Docbook is what I am actually looking for, just with too many angle brackets. Therefore I looked into ways to write XML more compactly, and found Scaml (a dialect of HAML), a template language that web applications can use to fill values into an HTML document. Scaml looks like this:

%html
  %body
    The quick brown fox jumps 
    over the lazy dog

which converts into the following HTML:

<html>
  <body>
    The quick brown fox jumps 
    over the lazy dog
  </body>
</html>

However, Scaml can be used to produce XML just as well. Just as closing curly brackets in C-like languages are replaced by indentation in Python, closing tags in XML are replaced by indentation in Scaml.
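
To make the indentation idea concrete, here is a minimal sketch in Python of how indentation can stand in for closing tags. It is not how Scaml itself is implemented and it covers only a tiny subset (bare %tag lines, %tag lines with inline text, and plain text lines); attributes, the !!! XML header, the > modifier and backslash escapes are deliberately left out. The function name indent_to_xml is made up for this illustration.

```python
from xml.sax.saxutils import escape


def indent_to_xml(source):
    """Convert a tiny Scaml-like subset into XML, using indentation for nesting."""
    out = []
    stack = []  # (indent, tag) pairs for elements that are still open

    for raw in source.splitlines():
        if not raw.strip():
            continue  # blank lines carry no structure
        indent = len(raw) - len(raw.lstrip(" "))
        line = raw.strip()

        # A line indented no deeper than an open element closes that element.
        while stack and stack[-1][0] >= indent:
            open_indent, tag = stack.pop()
            out.append(" " * open_indent + f"</{tag}>")

        if line.startswith("%"):
            tag, _, text = line[1:].partition(" ")
            if text:
                # '%title Foo' becomes a complete one-line element.
                out.append(" " * indent + f"<{tag}>{escape(text.strip())}</{tag}>")
            else:
                out.append(" " * indent + f"<{tag}>")
                stack.append((indent, tag))
        else:
            out.append(" " * indent + escape(line))

    # Close whatever is still open at the end of the input.
    while stack:
        open_indent, tag = stack.pop()
        out.append(" " * open_indent + f"</{tag}>")
    return "\n".join(out)


example = """\
%html
  %body
    The quick brown fox jumps
    over the lazy dog
"""
print(indent_to_xml(example))
```

Fed with the four-line example from above, the sketch prints the same nested HTML structure that Scaml produces for it. The only state it needs is a stack of open elements, which is what makes the indentation-based notation so much shorter than XML with explicit closing tags.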

The following Scaml snippet shows how a software handbook could start:

!!! XML
%article(xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0" xml:lang="en")
  %info
    %title MySoftware Handbook
    %subtitle Make Software Great Again

    %authorgroup
      %author
        %personname Tobias Pfeiffer

    %pubdate 2017-04-01

  %chapter
    %title MySoftware User Manual

    %section
      %title Installation

      %para
        Just download the
        %filename mytool.py
        file, make it executable, rename it to
        %filename mytool
        and put it into your&#x20;
        %envar> $PATH
        \. Verify that installation was successful by running&#x20;
        %command> mytool help
        \.

Note: To remove the whitespace around the <envar>$PATH</envar> tag, for example when it is followed by a comma or a period, > must be appended to the %envar directive. However, as this also deletes the preceding space, a space is explicitly inserted before the tag as the XML character reference &#x20;.

It converts into Docbook XML that looks as follows:

<?xml version='1.0' encoding='UTF-8' ?>
<article version='5.0' xml:lang='en'
  xmlns:xlink='http://www.w3.org/1999/xlink'
  xmlns='http://docbook.org/ns/docbook'>
  <info>
    <title>MySoftware Handbook</title>
    <subtitle>Make Software Great Again</subtitle>
    <authorgroup>
      <author>
        <personname>Tobias Pfeiffer</personname>
      </author>
    </authorgroup>
    <pubdate>2017-04-01</pubdate>
  </info>
  <chapter>
    <title>MySoftware User Manual</title>
    <section>
      <title>Installation</title>
      <para>
        Just download the
        <filename>mytool.py</filename>
        file, make it executable, rename it to
        <filename>mytool</filename>
        and put it into your&#x20;<envar>$PATH</envar>. Verify that installation was successful by running&#x20;<command>mytool help</command>.
      </para>
    </section>
  </chapter>
</article>
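
One payoff of the semantic tags is requirement 4 from the list above: creating an index of all mentioned files. The following is a minimal sketch in Python using only the standard library; the helper name collect and the shortened sample document are assumptions made for this illustration, not part of any published tool.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"


def collect(xml_text, element):
    """Return the sorted, de-duplicated text of every <element> in the document."""
    root = ET.fromstring(xml_text)
    found = {
        (node.text or "").strip()
        for node in root.iter(f"{{{DOCBOOK_NS}}}{element}")
    }
    return sorted(found - {""})


# Shortened version of the Docbook output shown above.
sample = """<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <para>
    Just download the <filename>mytool.py</filename> file, rename it to
    <filename>mytool</filename> and put it into your <envar>$PATH</envar>.
    Verify by running <command>mytool help</command>.
  </para>
</article>"""

print(collect(sample, "filename"))  # ['mytool', 'mytool.py']
print(collect(sample, "command"))   # ['mytool help']
```

With purely visual markup all four items would just be “monospace text”, and there would be no reliable way to tell the file names apart from the command.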

I have published a small quick-and-dirty Scala snippet as a Github Gist; it performs this conversion and the subsequent conversion to HTML using the Docbook XSL stylesheets. Later on I would like to turn it into a proper sbt plugin.
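
For readers who only want to try the second step, the Docbook-to-HTML conversion can also be done with the xsltproc command line tool. The sketch below is a different route from the Scala snippet, not a copy of it. It assumes that xsltproc is installed and that the namespaced Docbook XSL stylesheets live at the path used by the Debian/Ubuntu package docbook-xsl-ns; both the path and the file names in the usage note are assumptions to adjust for your system.

```python
import subprocess

# Assumption: stylesheet location as installed by the Debian/Ubuntu package
# docbook-xsl-ns. Other systems will use a different path.
DOCBOOK_HTML_XSL = (
    "/usr/share/xml/docbook/stylesheet/docbook-xsl-ns/html/docbook.xsl"
)


def xsltproc_command(xml_path, html_path, stylesheet=DOCBOOK_HTML_XSL):
    """Build the xsltproc call that renders a Docbook file as single-page HTML."""
    return ["xsltproc", "--output", html_path, stylesheet, xml_path]


def render_html(xml_path, html_path, stylesheet=DOCBOOK_HTML_XSL):
    """Run xsltproc; raises CalledProcessError if the transformation fails."""
    subprocess.run(xsltproc_command(xml_path, html_path, stylesheet), check=True)
```

Calling render_html("handbook.xml", "handbook.html") with a hypothetical handbook.xml would then write the HTML file next to it.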

Now, is this the processing pipeline for software documentation that the world has been waiting for? I admit, even the Scaml source looks rather cryptic and I would not call it “lightweight” any more. However, it fulfills the five requirements I listed above (stretching the “easy to write” point a bit). Furthermore, as Docbook is an established standard it is easy to migrate to any other system if need be. For now, I will try to use the described method for a manual that I am writing at the moment and will see how that works out.