HTML 5 in Haskell
I’ve just released the first version of the xmlhtml package, which is part of the Snap framework. The purpose of the package is to be a common parser and renderer for both HTML 5 and XML. I’m writing here to talk about what the package is, its goals and design choices, and so on.
Context: Heist and Hexpat
The Snap project includes a templating engine called Heist. Since I didn’t write it, I can say that it is, in my opinion, the best model for real page template engines that exists. If you’re generating very dynamic HTML using Haskell code, there are nice options like blaze; but if you want a template system, I think Heist is the clear choice. It’s simple, easy to understand, and extremely powerful.
Under the hood, Heist works with hexpat, a wrapper around the native expat library for handling XML documents. Unfortunately, HTML isn’t really XML, and it’s sometimes difficult to build pages that are valid XML and make use of a variety of client-side web techniques. Some problems that arose:
- Special characters can occur in valid inline CSS, but are not okay unescaped in XML documents.
- The converse problem exists as well; hexpat will escape special characters in text as entities, but web browsers that don’t expect that then won’t parse the code correctly.
- Some tags like textarea and object need to have explicit close tags to be understood by many web browsers and to be valid HTML 5. Hexpat renders them as empty tags instead, with a slash in the start tag.
- HTML allows certain end tags to be omitted; for example, end tags on list items, paragraphs, etc. Hexpat is an XML parser, though, and insists on close tags.
- Hexpat insists on a single root element, as XML requires. However, Heist templates are allowed to have many root elements. The tricks to work around this in Heist had bad effects on other bits, such as DTD declarations. It is better to have a proper parser that can understand multiple roots.
There are dozens of other such incompatibilities, and they formed a constant source of annoyance for Heist users.
The Answer: xmlhtml
To address these outstanding issues in Heist, I built a new library for handling both XML and HTML 5, which is creatively named xmlhtml. Since this is a huge design space, we narrowed down the intent of the library as follows:
- The intended use is for working with documents under your own control. This includes, for example, the templates in a web application.
- We support 99% of both XML and HTML 5. We leave out a few small things on both sides (most notably, processing instructions are silently ignored in XML).
- We use the same types for both HTML and XML, so you can write code to work with both seamlessly.
- We focus on keeping the types and public API as simple and small as possible.
The first point is crucial to keeping the project manageable. The latest draft HTML 5 specification contains over a hundred pages of intricate parsing rules designed to match the quirky behavior of 20 years’ worth of web browsers. While it might be useful to implement these rules for writing web spiders, screen scraping, and such, the result would be too complex for working with clean, controlled documents.
At the same time, it was important to us not to take compatibility and standards lightly. While we don’t adhere to all of the standards all of the time, we differ in controlled ways that are important for the application we have in mind.
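To make the HTML side of that concrete, here is a minimal sketch of the kind of template fragment this is aimed at: two root elements, list items without end tags, and an empty textarea. The names fragment and roundTrip and the source name "fragment.html" are made up for illustration; toByteString comes from blaze-builder. I would expect the fragment to parse and the textarea to come back with an explicit close tag, but treat the exact rendered bytes as something to check yourself.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Blaze.ByteString.Builder (toByteString)
import qualified Data.ByteString.Char8    as B
import           Text.XmlHtml             (parseHTML, render)

-- Hypothetical template fragment: fine as HTML 5, awkward as strict XML.
-- It has two root elements, omitted </li> end tags, and an empty
-- <textarea> that browsers want closed explicitly.
fragment :: B.ByteString
fragment = B.unlines
    [ "<ul><li>one<li>two</ul>"
    , "<textarea name=\"notes\"></textarea>"
    ]

-- Parse as HTML and render straight back to bytes.
-- The first argument to parseHTML is only a source name for error messages.
roundTrip :: B.ByteString -> Either String B.ByteString
roundTrip src = toByteString . render <$> parseHTML "fragment.html" src

main :: IO ()
main = either putStrLn B.putStr (roundTrip fragment)
```

Swapping parseHTML for parseXML in roundTrip is a quick way to see how the two parsers treat the same input.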
Simplicity was also a huge design goal. There’s a tendency in Haskell for libraries to get more and more generic over time, to the point that actual honest-to-goodness types are few and far between! One goal of xmlhtml was to just go ahead and decide the types for you. So text is represented with the Text type from Bryan O’Sullivan’s text package, which is now part of the Haskell Platform. Lists are lists, attributes and tags are also text… you don’t have to track down a hundred type variables to understand the types in the package. If you want to convert to a different type, you can; but the xmlhtml Document type uses the types that it uses.
This fills a space that I think is pretty much unoccupied in Haskell so far. For parsing and rendering valid XML, there’s hexpat. For handling arbitrary and possibly invalid HTML, there is TagSoup. But for handling your own documents using both HTML and XML features, and without requiring detective work to figure out how to use it, this is the way to go.
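As a rough illustration of how plain the types are, here is a sketch that builds a small document by hand and wraps the same nodes as both HTML and XML. The helper names link, body, asHtml and asXml are invented for this example; the constructors and fields are the ones exported from Text.XmlHtml.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Blaze.ByteString.Builder (toByteString)
import qualified Data.ByteString.Char8    as B
import           Data.Text                (Text)
import           Text.XmlHtml

-- An element is a tag name, a list of attribute pairs, and a list of
-- child nodes. All of the text involved is ordinary Text.
link :: Text -> Text -> Node
link href label = Element "a" [("href", href)] [TextNode label]

-- One list of nodes, shared by both documents below.
body :: [Node]
body =
    [ Comment " built by hand "
    , Element "p" []
        [ TextNode "See ", link "/docs" "the docs", TextNode "." ]
    ]

-- HTML and XML documents use the same Node type; only the
-- Document constructor differs.
asHtml, asXml :: Document
asHtml = HtmlDocument { docEncoding = UTF8, docType = Nothing, docContent = body }
asXml  = XmlDocument  { docEncoding = UTF8, docType = Nothing, docContent = body }

main :: IO ()
main = do
    B.putStrLn (toByteString (render asHtml))
    B.putStrLn (toByteString (render asXml))
```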
A Brief Introduction
It’s dead simple to make basic use of the package. The module Text.XmlHtml exports parsing and rendering functions:
- parseXML and parseHTML: These are the starting points for parsing documents. The result is a Document value containing the detected character encoding, document type and the content of the document.
- render: This is the renderer. The result is a Builder object from the blaze-builder package. You can use blaze-builder’s toByteString if you prefer it in that form, but keep in mind that the Builder type has advantages if you’re further concatenating the result into a larger bit of output.
- Basic types and functions: The Text.XmlHtml module exports another 16 simple functions and a few types for manipulating document structure. They are all pretty obvious and simple; you can check if a node is an element, text node, or comment, get its children, get and set its attributes, and so on. You can get lists of child and descendant nodes. All of the basic things you’d expect.
- Cursor: In the module Text.XmlHtml.Cursor, there are functions for acting on document trees more imperatively using a zipper over document trees. The zipper type is Cursor, and there are a few dozen supporting functions for moving around and inspecting the nodes of the tree.
- renderHtml: The Text.Blaze.Renderer.XmlHtml module contains a renderer from blaze into this package’s document type. This is a sort of corner case, outside the expected usage, but occasionally helpful for some integration stuff if you do some of your HTML using Heist and other stuff with blaze.
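Putting a few of these pieces together, a small program might look something like the sketch below. It parses a snippet, pulls out link targets, sets an attribute, takes one step with the zipper, and renders the result. The sample markup and the names source, links, tagTopLevel and firstChildTag are made up for illustration, and toByteString is the blaze-builder function mentioned above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Blaze.ByteString.Builder (toByteString)
import qualified Data.ByteString.Char8    as B
import           Data.Maybe               (mapMaybe)
import           Data.Text                (Text)
import qualified Data.Text.IO             as T
import           Text.XmlHtml
import           Text.XmlHtml.Cursor      (current, firstChild, fromNode)

-- Hypothetical input for the example.
source :: B.ByteString
source = "<div id=\"nav\"><a href=\"/\">Home</a><a href=\"/about\">About</a></div>"

-- Collect the href of every <a> element nested below a top-level node.
links :: Document -> [Text]
links doc = mapMaybe (getAttribute "href")
          (concatMap (descendantElementsTag "a") (docContent doc))

-- Set a class attribute on each top-level element; text nodes and
-- comments are passed through unchanged.
tagTopLevel :: Document -> Document
tagTopLevel doc = doc { docContent = map mark (docContent doc) }
  where
    mark n | isElement n = setAttribute "class" "generated" n
           | otherwise   = n

-- Use the zipper to step to a node's first child and report its tag name.
-- Nothing means there is no child, or the child is not an element.
firstChildTag :: Node -> Maybe Text
firstChildTag n = firstChild (fromNode n) >>= tagName . current

main :: IO ()
main = case parseHTML "nav.html" source of
    Left err  -> putStrLn ("parse error: " ++ err)
    Right doc -> do
        mapM_ T.putStrLn (links doc)
        print (map firstChildTag (docContent doc))
        B.putStrLn (toByteString (render (tagTopLevel doc)))
```

If the rendered output is going to be spliced into a larger response, it is probably better to keep the Builder from render and skip the toByteString step.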
So that’s the new xmlhtml package. It’s a very simple and nice way to play with document trees, and that’s pretty much it!