Programs must be written for people to read, and only incidentally for machines to execute.

— Structure and Interpretation of Computer Programs

Tangler: literate programming in HTML

Author

Theresa O'Connor <tess@oconnor.cx>

URL

https://hober.github.io/tangler/

Introduction
Tangler documents
How it works
Download Tangler
Chunk index

1. Introduction

The source code of programs needs to be written in a certain order, so that the compiler or interpreter will do what the programmer intended. That order is probably not the most natural order a person might put the code in, especially if they were trying to organize it in a way most conducive to being read and comprehended by other people. According to Wikipedia,

Literate programming is a programming paradigm[…] in which a computer program is given as an explanation of how it works in a natural language, such as English, interspersed (embedded) with snippets of[…] source code, from which compilable source code can be generated.[…]
The literate programming paradigm[…] represents a move away from writing computer programs in the manner[…] imposed by the compiler, and instead gives programmers [the ability] to develop programs in the order demanded by the logic and flow of their thoughts.[…]
Literate programming (LP) tools are used to obtain two representations from a source file: one understandable by a compiler or interpreter, the "tangled" code, and another for viewing as formatted documentation, which is said to be "woven" from the literate source.

Don Knuth’s original literate programming tool, WEB, wove TeX documents out of literate source files, and many, maybe even most, LP tools written since have a bias toward TeX as the output format. But as Mark Pilgrim wrote many years ago, HTML is not an output format. HTML is The Format. So I decided to have a go at writing a literate programming tool whose input documents are simply HTML.

2. Tangler documents

Tangler documents are just HTML. It’s called Tangler because there’s no need to weave anything: the document is already woven.

The two things that make an HTML document a Tangler document are chunks and chunk references (xrefs).

2.1. Chunks are named code blocks

A chunk is just a piece of source code.

Here’s an example of one. This chunk contains one line of JavaScript which defines a Tangler global object.

Define the Tangler global

globalThis.Tangler = {};

Each chunk has a name. The <figure> element is the most natural way in HTML to associate a name with a bit of markup, and <pre><code>…</code></pre> is the usual markup pattern for code blocks.

So here’s what the source code of that chunk looks like:

How to mark up a chunk

<figure class=chunk id=tangler-global>
<figcaption>Define the Tangler global</figcaption>
<pre><code>globalThis.Tangler = {};</code></pre>
</figure>

The <figure> element has the class “chunk”, which is how Tangler knows it’s a chunk.

Tangler only extracts the text when it extracts source code from a chunk. It ignores markup inside the chunk. So you’re free to mark things up inside the source when it makes sense to do so.

For instance, I could have linked globalThis to its MDN page like so, and it wouldn’t affect what gets extracted:

Source code can contain HTML elements

<figure class=chunk id=tangler-global>
<figcaption>Define the Tangler global</figcaption>
<pre><code><a href=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/globalThis>globalThis</a>.Tangler = {};</code></pre>
</figure>

While this is pretty neat, it’s probably best to do this only sparingly. It can make maintaining the source more challenging. I tend to use the <var> element when I define variables, the <wbr> element to provide line-breaking opportunities the browser’s line layout algorithm misses, and I occasionally sprinkle in links when I use something that might be unfamiliar to the reader.

2.2. xrefs: links to chunks

Chunks can refer to other chunks; this is how Tangler knows how to tangle them together. A reference from one chunk to another is called an xref.

The most obvious way to refer to another part of a document in HTML is the <a> element, of course. For instance, here’s a chunk that pulls in two other chunks:

Put funky brackets around chunk names

Style chunk names
Style chunk references

And here’s what the markup for that looks like:

Including other chunks within a chunk

<figure class=chunk id=put-funky-brackets-around-chunk-names>
<figcaption>Put funky brackets around chunk names</figcaption>
<pre><code>
  <a class=chunk href=#style-chunk-names>Style chunk names</a>
  <a class=chunk href=#style-chunk-xrefs>Style chunk references</a>
</code></pre>
</figure>

N.B. xrefs refer to chunks by id="", not by name. This is very natural in HTML, but might surprise you if you’ve used other literate programming tools before. In other tools, chunks are referenced by name. In these tools, multiple chunks with the same name get concatenated in document order to form a composite chunk. But in Tangler, we reference chunks by id="", which are unique. No magical concatenation for you.

This is a limitation of using id="" attributes for linking to chunks, but this limitation is intentional. The small convenience of being able to break up chunks without naming the parts is outweighed, I think, by the advantage to authors, who can simply use the HTML mechanisms they’re already accustomed to, and to readers, who always know that they’re seeing the whole chunk when they are looking at a <figure class=chunk>. You can explicitly split a chunk up, but you have to make each piece a chunk in its own right.
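
To make the difference concrete, here is a minimal sketch that contrasts the two lookup styles on a plain in-memory list. The `chunks` array and both function names are hypothetical stand-ins for illustration; Tangler itself resolves xrefs against the DOM, not against a list like this.

```javascript
// Hypothetical in-memory model of a document's chunks, in document order.
const chunks = [
    { id: "setup-a", name: "Setup", code: "let a = 1;" },
    { id: "main", name: "Main", code: "run(a, b);" },
    { id: "setup-b", name: "Setup", code: "let b = 2;" },
];

// Name-based lookup, as in tools that reference chunks by name: every
// chunk with the given name is concatenated in document order.
function lookupByName(name) {
    return chunks
        .filter((chunk) => chunk.name === name)
        .map((chunk) => chunk.code)
        .join("\n");
}

// ID-based lookup, as in Tangler: an id identifies exactly one chunk.
function lookupById(id) {
    const chunk = chunks.find((candidate) => candidate.id === id);
    return chunk ? chunk.code : null;
}
```

With this model, `lookupByName("Setup")` silently merges two separate blocks, while `lookupById("setup-b")` can only ever return the one block the reader sees.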

2.3. Adding a chunk index to your document

Once you’ve written a literate program of non-trivial length, an index to all of its chunks can be quite useful.

You can easily add a chunk index to your document by calling Tangler.createChunkIndex(). The function returns a DocumentFragment you can append wherever you’d like.
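
A minimal usage sketch, assuming you want the index at the end of the page. The helper name `appendChunkIndex` and the heading ID `chunk-index` are made up for this example; the only Tangler call it relies on is `Tangler.createChunkIndex()`.

```javascript
// Appends the chunk index to the end of <body> and returns it.
// `tangler` and `doc` are parameters so the helper has no hidden globals;
// in a browser you would pass globalThis.Tangler and document.
function appendChunkIndex(tangler, doc, headingId) {
    const index = tangler.createChunkIndex(headingId);
    doc.body.appendChild(index);
    return index;
}

// Only runs where a DOM and the Tangler global are available.
if (typeof document !== "undefined" && globalThis.Tangler) {
    appendChunkIndex(globalThis.Tangler, document, "chunk-index");
}
```

The script needs to run after the body has been parsed, since `document.body` is null before then.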

2.4. Editor support: Emacs

I write HTML in Emacs, and it’d be nice if I didn’t have to laboriously type out these chunk <figure>s by hand each time. Here’s how to define a skeleton, tangler-insert-chunk, that takes care of the boilerplate for you.

Skeleton for writing chunks

(define-skeleton tangler-insert-chunk
  "Insert a new chunk."
  "Name of chunk: "
  "<figure class=chunk id="
  (skeleton-read "ID of chunk: " nil nil)
  ">" "\n"
  "<figcaption>"
  str
  "</figcaption>" "\n"
  "<pre><code>" "\n"
  _ "\n"
  "</code></pre>" "\n"
  "</figure>" "\n")

3. How it works

3.1. Tangling text

The main entry point to the program is Tangler.tangle(), which tangles the chunk passed to it at the given indent depth.

As we process the chunk, we store our in-progress tangle in tangled.

Tangler.tangle()

Tangler.tangle = function(chunk, indent = 0) {
    let tangled = "";
    Tangle a chunk
    return tangled;
}

Implementing a tangler that operates on the DOM is surprisingly straightforward. It boils down to a simple tree walk. You look for references to other chunks, and if you find any, you recurse. Otherwise, you simply collect the text of the chunk and indent it.

Tangle a chunk

Create a string of whitespace for indentation
Create a place to accumulate unindented text
Remember the <a class=chunk> we’re in, if any
const walker = document.createTreeWalker(
    chunk.querySelector("pre"));
while (walker.nextNode()) {
    switch (walker.currentNode.nodeType) {
    case Node.ELEMENT_NODE:
        Ignore most elements
        Process a chunk reference
        break;
    case Node.TEXT_NODE:
        Ignore text inside <a class=chunk>
        accumulate text for indentation
        break;
    default:
        break;
    }
}
Indent remaining text

While we walk the tree, we really only care about text nodes and <a class=chunk> elements. If we encounter any other kind of element, we can ignore it.

Ignore most elements

if (walker.currentNode.localName != "a")
    continue;
if (!walker.currentNode.classList.contains("chunk"))
    continue;

Let’s cover the processing of text nodes. For the most part, we just accumulate unindented text into text:

Create a place to accumulate unindented text

let text = "";

accumulate text for indentation

text += walker.currentNode.data;

But we have to make sure we ignore any text inside <a class=chunk> links. If we’re inside one, we remember it in xref:

Remember the <a class=chunk> we’re in, if any

let xref = null;

Remember this <a class=chunk> while we process it

xref = walker.currentNode;

And then we don’t bother to accumulate text if xref is set:

Ignore text inside <a class=chunk>

if (xref && isAncestor(walker.currentNode, xref))
    continue;
else if (xref) // We’re no longer inside xref, so forget it
    xref = null;

Next, let’s go over what we do when we encounter a reference to another chunk (an <a class=chunk> element). There are three steps, the first of which we’ve already seen:

Process a chunk reference

Remember this <a class=chunk> while we process it
Indent the text we’ve accumulated thus far
Recurse with a new indent length

The second step is to indent all of the text that we’ve accumulated so far. We remember the lines of text in lines because we’ll need to use them later to compute a new indent length.

Indent the text we’ve accumulated thus far

const lines = text.split("\n");
tangled += lines.join(prefix);
text = "";

Next, we compute a new indent length and recurse, tangling the chunk that’s referred to by this <a class=chunk>.

Recurse with a new indent length

const last = lines[lines.length - 1];
tangled += Tangler.tangle(
    document.querySelector(
        walker.currentNode.getAttribute("href")),
    indent + last.length);
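
A worked example of the arithmetic may help. This sketch uses made-up input and assumes we are tangling a chunk that is itself already indented by four columns; the variable names mirror the ones in the chunks above.

```javascript
// Stand-ins for the variables used in Tangler.tangle().
const indent = 4;
const prefix = "\n" + " ".repeat(indent);

// Text accumulated before an xref: two full lines, then the
// whitespace that precedes the xref on its own line.
const text = "if (ready) {\n    start();\n    ";

// Every line after the first gets the prefix; the first line is
// already positioned by whoever is including this chunk.
const lines = text.split("\n");
const indented = lines.join(prefix);

// The text before the xref on its line decides how much deeper
// the referenced chunk must be indented.
const last = lines[lines.length - 1];
const nestedIndent = indent + last.length;
```

Here `last` is four spaces, so the referenced chunk is tangled at an indent of 8: four columns inherited from the parent plus four from the xref's position within the line.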

Finally, when the entire tree walk is done, we indent whatever text we have left over (after the last <a class=chunk> link).

Indent remaining text

tangled += text.split("\n").join(prefix);
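
The whole walk can be tried without a DOM. Here is a minimal sketch, assuming a hypothetical model in which each chunk is an array of parts: a string stands in for a text node and a `{ ref }` object stands in for an `<a class=chunk>` link. The names `model` and `tangleModel` are invented for this example, and the real implementation walks the tree with a TreeWalker instead.

```javascript
// Hypothetical model: each chunk is a list of parts, where a string is
// literal text and { ref } stands in for an <a class=chunk href="#ref">.
const model = {
    main: ["function main() {\n    ", { ref: "body" }, "\n}"],
    body: ["setup();\nrun();"],
};

function tangleModel(chunks, id, indent = 0) {
    const prefix = "\n" + " ".repeat(indent);
    let tangled = "";
    let text = "";
    for (const part of chunks[id]) {
        if (typeof part === "string") {
            // Text node: accumulate unindented text.
            text += part;
            continue;
        }
        // Chunk reference: flush the accumulated text, then recurse with
        // an indent that accounts for the text before the xref.
        const lines = text.split("\n");
        tangled += lines.join(prefix);
        text = "";
        const last = lines[lines.length - 1];
        tangled += tangleModel(chunks, part.ref, indent + last.length);
    }
    // Indent whatever follows the last reference.
    return tangled + text.split("\n").join(prefix);
}

console.log(tangleModel(model, "main"));
```

Both lines of `body` end up indented by four spaces inside `main`, even though `body` itself is written flush left.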

3.2. Making links to tangled files

When you put a Tangler document up on the web, it’s nice to provide links to the tangled source files embedded in the document.

The easiest thing to do is to construct a Data URL to the tangled source. You’ll need to know the chunk to tangle, and the media type of the tangled document.

Create a data URL to a chunk

`data:${mediaType};charset=utf-8,${
    encodeURIComponent(Tangler.tangle(chunk))
}`;
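
To see why the source is passed through encodeURIComponent(), here is a small sketch with made-up input. The `tangledSource` string and the `text/javascript` media type are example values standing in for the output of Tangler.tangle() and for whatever media type you pass in.

```javascript
// Hypothetical tangled output and media type, for illustration only.
const tangledSource = "globalThis.Tangler = {};\n";
const mediaType = "text/javascript";

// Spaces, newlines, and punctuation such as "=", "{", and ";" are
// percent-encoded so they cannot be misread as URL syntax.
const dataUrl =
    `data:${mediaType};charset=utf-8,${encodeURIComponent(tangledSource)}`;

// Decoding the payload after the first comma recovers the source exactly.
const payload = dataUrl.slice(dataUrl.indexOf(",") + 1);
const roundTripped = decodeURIComponent(payload);

console.log(dataUrl);
```

Without the encoding step, a `#` in the source would start a fragment and a raw newline would not survive in the URL.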

Tangler.createChunkDownloadLink() is a helper method which can do this for you. You give it a chunk, a filename for the tangled document, and a media type, and it returns an <a> element you can stick in your DOM somewhere.

Tangler.createChunkDownloadLink()

Tangler.createChunkDownloadLink = function(chunk, filename, mediaType) {
    const link = document.createElement("a");
    link.download = filename;
    link.href = Create a data URL to a chunk
    let filename_el = document.createElement("code");
    filename_el.appendChild(document.createTextNode(filename));
    link.appendChild(filename_el);
    return link;
}

3.3. Tangler.createChunkIndex()

Tangler.createChunkIndex()

Tangler.createChunkIndex = function(id) {
    const index = new DocumentFragment();

    Create index heading
    Populate index

    return index;
}

If you pass an ID to createChunkIndex(), it’ll place the ID on the heading of the index, so you can link to it from your Table of Contents.

Create index heading

const heading = document.createElement("h2");
if (id)
    heading.id = id;
index.appendChild(heading);
heading.appendChild(document.createTextNode("Chunk index"));

Adding each chunk to the index is relatively straightforward.

Add chunk to index

const indexEntry = document.createElement("li");
chunkList.appendChild(indexEntry);
const link = document.createElement("a");
indexEntry.appendChild(link);

link.href = "#" + chunk.id;
link.classList.add("chunk");

const title = chunk.querySelector("figcaption").cloneNode(true);
while (title.childNodes.length > 0) {
    link.appendChild(title.childNodes[0]);
}

But the index should be in lexicographic order, not document order, so we need to sort the chunks by their names before adding index entries for them.

Populate index

const chunkList = document.createElement("ul");
index.appendChild(chunkList);

const chunkNameMap = new Map();

for (const chunk of document.querySelectorAll("figure.chunk")) {
    chunkNameMap.set(
        chunk.querySelector("figcaption").textContent,
        chunk);
}

for (const chunkName of Array.from(chunkNameMap.keys()).sort()) {
    const chunk = chunkNameMap.get(chunkName);
    Add chunk to index
}
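
Two standard behaviours are worth knowing when reading this chunk: sort() with no comparator orders strings by UTF-16 code unit, and a Map keeps a single value per key. The sketch below demonstrates both on a few chunk names from this document. The case-insensitive comparator is an assumption shown for contrast, not something Tangler does.

```javascript
// Chunk names taken from this document, in no particular order.
const names = [
    "spaces()",
    "Tangler.tangle()",
    "accumulate text for indentation",
    "Style chunks",
];

// Default sort(): compares UTF-16 code units, so "S" and "T" sort
// before "a" and "s".
const defaultOrder = [...names].sort();

// A case-insensitive alternative (an assumption, not what Tangler does).
const caseInsensitiveOrder = [...names].sort((a, b) => {
    const x = a.toLowerCase();
    const y = b.toLowerCase();
    return x < y ? -1 : x > y ? 1 : 0;
});

// A Map keeps one value per key, so a repeated name replaces the
// earlier entry instead of adding a second one.
const byName = new Map();
byName.set("Setup", "first chunk");
byName.set("Setup", "second chunk");
```

In practice this means capitalized chunk names are listed ahead of lowercase ones, and two chunks that share a caption would produce a single index entry pointing at the later chunk.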

3.4. Styling chunks

Besides providing a script that tangles documents, Tangler also includes a minimal stylesheet that makes chunks and chunk references look kind of like they do in noweb, the literate programming tool I’ve used the most over the years.

tangle.css

/* tangle.css - Tangler, a tool for literate programming
 *
 * Copyright
 */
Tighten chunk spacing
Style chunks
Put funky brackets around chunk names

Chunk names get bracketed with U+27EA (⟪) MATHEMATICAL LEFT DOUBLE ANGLE BRACKET and U+27EB (⟫) MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET.

Style chunk references

a.chunk::before {
    content: "⟪";
}
a.chunk::after {
    content: "⟫";
}

Where chunks are defined, their names are followed by U+2254 (≔) COLON EQUALS.

Style chunk names

figure.chunk > figcaption::before {
    content: "⟪";
}
figure.chunk > figcaption::after {
    content: "⟫≔";
}

I don’t want there to be any space between the chunk name and its body. So we remove any margin that might cause there to be space.

Tighten chunk spacing

figure.chunk > figcaption {
    margin-block-end: 0;
}
figure.chunk > pre {
    margin-block-start: 0;
}

Chunk names should be left-aligned. The bodies of chunks should be set in slightly from the chunk name, which should hang a bit left of the main text flow.

Style chunks

figure.chunk > figcaption {
    text-align: left;
    text-indent: -0.5ic;
}
figure.chunk > pre {
    margin-inline: 1ic;
}

3.5. Utilities

Python uses the multiplication operator to quickly create repetitive strings: 'foo'*3 returns 'foofoofoo'. This is super useful for creating runs of whitespace, which a tangler needs to do to get indentation right. Unfortunately, JavaScript’s * operator doesn’t work on strings this way.

spaces()

function spaces(length) {
    return Array(length).fill(" ").join("");
}

This is how we use it in Tangler.tangle():

Create a string of whitespace for indentation

const prefix = "\n" + spaces(indent);
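
JavaScript strings also have a built-in repeat() method, which gives the same result without the intermediate array. Here is a sketch comparing the two. The function names are made up for the comparison, and `spacesWithArray` simply repeats the body of spaces() from above.

```javascript
// The array-based approach used by spaces().
function spacesWithArray(length) {
    return Array(length).fill(" ").join("");
}

// An alternative using String.prototype.repeat().
function spacesWithRepeat(length) {
    return " ".repeat(length);
}

// Both produce identical runs of whitespace, including the empty run.
for (const length of [0, 1, 4, 12]) {
    console.log(length, spacesWithArray(length) === spacesWithRepeat(length));
}
```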

Here’s a function to test if a DOM element is an ancestor of a node. I originally implemented this by walking up node.parentElement looking for the possible ancestor, because I always forget that node.compareDocumentPosition() exists.

isAncestor()

function isAncestor(node, possibleAncestor) {
    return Node.DOCUMENT_POSITION_CONTAINS &
        node.compareDocumentPosition(possibleAncestor);
}
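
For comparison, here is a sketch of what the parent-walking approach could look like. This is a reconstruction for illustration, not the original code, and the name `isAncestorByWalking` is made up. It only relies on each node having a `parentElement` property, so it can be tried on plain objects as well as on DOM nodes.

```javascript
// Walks up the parentElement chain from node; returns true as soon as
// possibleAncestor is found, false if the chain ends first.
function isAncestorByWalking(node, possibleAncestor) {
    let current = node.parentElement;
    while (current) {
        if (current === possibleAncestor)
            return true;
        current = current.parentElement;
    }
    return false;
}

// Plain objects standing in for <pre>, <a class=chunk>, and a text node.
const pre = { parentElement: null };
const link = { parentElement: pre };
const textNode = { parentElement: link };

console.log(isAncestorByWalking(textNode, link)); // true
console.log(isAncestorByWalking(pre, textNode));  // false
```

Unlike the bitmask version, this one returns a real boolean, but both are truthy in exactly the same cases.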

3.6. Usage from the command line

It’d be nice if this thing worked from the command line, and not just from inside a browser. Fortunately, jsdom makes that a breeze.

Parsing an HTML file with jsdom is really easy.

Parse the file with jsdom

const { JSDOM } = require("jsdom");
const html = require('fs').readFileSync(process.argv[2], 'utf-8');
const dom = new JSDOM(html);

The rest of Tangler’s code assumes it’s running in a browser. Now that we’ve parsed the HTML document, we can trick the rest of the script into thinking that’s the case, even though we’re in Node.js.

The only global objects the rest of the script relies on are window.Node, window.DocumentFragment, and window.document, so we put references to them on globalThis.

Trick the rest of the file into thinking we’re in a browser

globalThis.Node = dom.window.Node;
globalThis.DocumentFragment = dom.window.DocumentFragment;
globalThis.document = dom.window.document;

All that’s left to do is to tangle the requested chunk and print it to stdout:

Print the tangled source code to stdout

console.log(
    Tangler.tangle(document.getElementById(process.argv[3])));

Some of this trickery has to happen before the rest of the script loads. In order to run some code only in Node.js, we look for globalThis.require, which is defined in Node.js but isn’t part of the web platform.

Node.js preamble

if ('require' in globalThis) {
    Bail if there aren’t enough command line arguments
    Parse the file with jsdom
    Trick the rest of the file into thinking we’re in a browser
}

We can’t print the tangled source code to stdout until the methods of the Tangler global have been defined.

Node.js postamble

if ('require' in globalThis) {
    Print the tangled source code to stdout
}

If the user didn’t pass in enough arguments, we tell them how to call this script from the command line.

Bail if there aren’t enough command line arguments

if (process.argv.length < 4) {
    console.error(
        "USAGE: node tangle.js tangler-filename chunk-id");
    process.exit(1);
}

3.7. A Makefile and a README

Tangler’s repository is on GitHub, so here’s a README.md that shows up in GitHub’s UI:

README.md

# Tangler

Tangler is a literate programming tool whose input format is HTML.

Read all about it at URL

To extract Tangler’s source code from [the HTML file](URL), you can simply [download the source files](URL#download), or you can check out this repository and extract the Makefile with the following command:

    node - index.html Makefile < bootstrap/tangle.js > Makefile

After this, a simple `make` should do the right thing.

Makefile

# -*- makefile-gmake -*-

.PHONY: all bootstrap clean distclean

SOURCE=index.html
CODE=tangle.js tangle.css tangler.el
PRODUCTS=$(CODE) Makefile README.md

all: $(PRODUCTS)

bootstrap: bootstrap/tangle.js bootstrap/tangle.css

bootstrap/tangle.js: tangle.js
	cp tangle.js bootstrap/

bootstrap/tangle.css: tangle.css
	cp tangle.css bootstrap/

clean:
	rm -f $(CODE) *~

distclean:
	rm -f $(PRODUCTS) *~

$(PRODUCTS): $(SOURCE)
	node - $(SOURCE) $@ < bootstrap/tangle.js > $@

4. Download Tangler

I’m sure you’ve noticed by now that Tangler is self-hosted: this document describing how it works is also its source code. It contains several documents that can be extracted.

The script contains the definition of the Tangler global, the methods that hang off of it, and some utility functions:

tangle.js

/* tangle.js - Tangler, a tool for literate programming
 *
 * Copyright
 */
(function() {
    Define the Tangler global
    Node.js preamble
    isAncestor()
    spaces()
    Tangler.tangle()
    Tangler.createChunkIndex()
    Tangler.createChunkDownloadLink()
    Node.js postamble
})();

You can read about the stylesheet in §3.4. Styling chunks.

In §2.4. Editor support: Emacs I define the tangler-insert-chunk skeleton. I’ve wrapped it into an Elisp library below:

tangler.el

;;; tangler.el --- Support for editing Tangler literate programs

;; Copyright

;; Author: Author
;; Keywords: convenience, docs, hypermedia

;;; Code:

(require 'skeleton)

Skeleton for writing chunks

(provide 'tangler)
;;; tangler.el ends here