Programs must be written for people to read, and only incidentally for machines to execute.
— Structure and Interpretation of Computer Programs
Author
Theresa O'Connor <tess@oconnor.cx>
URL
https://hober.github.io/tangler/
Table of Contents
- Introduction
- Tangler documents
- Chunks are named code blocks
- xrefs: links to chunks
- Adding a chunk index to your document
- Editor support: Emacs
- How it works
- Tangling text
- Making links to tangled files
- Tangler.createChunkIndex()
- Styling chunks
- Utilities
- Usage from the command line
- A Makefile and a README
- Download Tangler
- Chunk index
1. Introduction
The source code of programs needs to be written in a certain order, so that the compiler or interpreter will do what the programmer intended. That order is probably not the most natural order a person might put the code in, especially if they were trying to organize it in a way most conducive to being read and comprehended by other people. According to Wikipedia,
Literate programming is a programming paradigm[…] in which a computer program is given as an explanation of how it works in a natural language, such as English, interspersed (embedded) with snippets of[…] source code, from which compilable source code can be generated.[…]
The literate programming paradigm[…] represents a move away from writing computer programs in the manner[…] imposed by the compiler, and instead gives programmers [the ability] to develop programs in the order demanded by the logic and flow of their thoughts.[…]
Literate programming (LP) tools are used to obtain two representations from a source file: one understandable by a compiler or interpreter, the "tangled" code, and another for viewing as formatted documentation, which is said to be "woven" from the literate source.
Don Knuth’s original literate programming tool, WEB, wove TeX documents out of literate source files, and many, maybe even most, LP tools written since have a bias toward using TeX as the output format. But as Mark Pilgrim wrote many years ago, HTML is not an output format. HTML is The Format.
So I decided to have a go at writing a literate programming tool whose input documents are simply HTML.
2. Tangler documents
Tangler documents are just HTML. It’s called Tangler because there’s no need to weave anything: the document is already woven.
The two things that make an HTML document a Tangler document are chunks and chunk references (xrefs).
2.1. Chunks are named code blocks
A chunk is just a piece of source code.
Here’s an example of one. This chunk contains one line of JavaScript which defines a Tangler global object.
Define the Tangler global
globalThis.Tangler = {};
Each chunk has a name. The <figure> element is the most natural way in HTML to associate a name with a bit of markup, and <pre><code>…</code></pre> is the usual markup pattern for code blocks.
So here’s what the source code of that chunk looks like:
How to mark up a chunk
<figure class=chunk id=tangler-global>
<figcaption>Define the Tangler global</figcaption>
<pre><code>globalThis.Tangler = {};</code></pre>
</figure>
The <figure> element has the class “chunk”, which is how Tangler knows it’s a chunk.
Tangler only extracts the text when it extracts source code from a chunk. It ignores markup inside the chunk. So you’re free to mark things up inside the source when it makes sense to do so.
For instance, I could have linked globalThis to its MDN page like so, and it wouldn’t affect what gets extracted:
Source code can contain HTML elements
<figure class=chunk id=tangler-global>
<figcaption>Define the Tangler global</figcaption>
<pre><code><a href=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/globalThis>globalThis</a>.Tangler = {};</code></pre>
</figure>
While this is pretty neat, it’s probably best to do this only sparingly. It can make maintaining the source more challenging. I tend to use the <var> element when I define variables, the <wbr> element to provide line-breaking opportunities the browser’s line layout algorithm misses, and I occasionally sprinkle in links when I use something that might be unfamiliar to the reader.
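To make the "text only" extraction rule from this section concrete, here is a minimal sketch that runs without a DOM. The `code` object and the `extractText` function are hypothetical stand-ins invented for this illustration; Tangler itself walks real DOM nodes, as described in §3.1.

```javascript
// Hypothetical stand-in for a DOM subtree: "elements" have `children`,
// "text nodes" have `data`. Tangler's real input is a DOM, not this.
const code = {
  tag: "code",
  children: [
    { tag: "a", children: [{ data: "globalThis" }] },
    { data: ".Tangler = {};" },
  ],
};

// Collect only the text; element boundaries contribute nothing.
function extractText(node) {
  if ("data" in node) return node.data;
  return node.children.map(extractText).join("");
}

console.log(extractText(code)); // prints "globalThis.Tangler = {};"
```

The link wrapper around globalThis disappears from the output, which is why inline markup inside a chunk is safe.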
2.2. xrefs: links to chunks
Chunks can refer to other chunks; this is how Tangler knows how to tangle them together. A reference from one chunk to another is called an xref.
The most obvious way to refer to another part of a document in HTML is the <a> element, of course. For instance, here’s a chunk that pulls in two other chunks:
Put funky brackets around chunk names
Style chunk names
Style chunk references
And here’s what the markup for that looks like:
Including other chunks within a chunk
<figure class=chunk id=put-funky-brackets-around-chunk-names>
<figcaption>Put funky brackets around chunk names</figcaption>
<pre><code>
<a class=chunk href=#style-chunk-names>Style chunk names</a>
<a class=chunk href=#style-chunk-xrefs>Style chunk references</a>
</code></pre>
</figure>
N.B. xrefs refer to chunks by id="", not by name. This is very natural in HTML, but might surprise you if you’ve used other literate programming tools before. In other tools, chunks are referenced by name, and multiple chunks with the same name get concatenated in document order to form a composite chunk. But in Tangler, we reference chunks by id="", which are unique. No magical concatenation for you.
This is a limitation of using id="" attributes for linking to chunks, but this limitation is intentional. The small convenience of being able to break up chunks without naming the parts is outweighed, I think, by the advantage to authors, who can simply use the HTML mechanisms they’re already accustomed to, and to readers, who always know that they’re seeing the whole chunk when they are looking at a <figure class=chunk>. You can explicitly split a chunk up, but you have to make each piece a chunk in its own right.
2.3. Adding a chunk index to your document
Once you’ve written a literate program of non-trivial length, an index to all of its chunks can be quite useful.
You can easily add a chunk index to your document by calling Tangler.createChunkIndex(). The function returns a DocumentFragment you can append wherever you’d like.
2.4. Editor support: Emacs
I write HTML in Emacs, and it’d be nice if I didn’t have to laboriously type out these chunk <figure>s by hand each time. Here’s how to define a skeleton, tangler-insert-chunk, that takes care of the boilerplate for you.
Skeleton for writing chunks
(define-skeleton tangler-insert-chunk
"Insert a new chunk."
"Name of chunk: "
"<figure class=chunk id="
(skeleton-read "ID of chunk: " nil nil)
">" "\n"
"<figcaption>"
str
"</figcaption>" "\n"
"<pre><code>" "\n"
_ "\n"
"</code></pre>" "\n"
"</figure>" "\n")
3. How it works
3.1. Tangling text
The main entry point to the program is Tangler.tangle(), which tangles the chunk passed to it at the given indent depth.
As we process the chunk, we store our in-progress tangle in tangled.
Tangler.tangle()
Tangler.tangle = function(chunk, indent = 0) {
  let tangled = "";
  Tangle a chunk
  return tangled;
}
Implementing a tangler that operates on the DOM is surprisingly straightforward. It boils down to a simple tree walk. You look for references to other chunks, and if you find any, you recurse. Otherwise, you simply collect the text of the chunk and indent it.
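The recursion described above can be sketched without a DOM. In this illustration the `chunks` object, the `{ ref: id }` shape, and the standalone `tangle` function are all assumptions made up for the example, and `" ".repeat()` stands in for the spaces() helper; the real Tangler.tangle() walks DOM nodes instead.

```javascript
// Hypothetical model: each chunk is a list of parts, where a string is
// literal source text and { ref: id } stands in for an <a class=chunk>.
const chunks = {
  main: ["function f() {\n  ", { ref: "body" }, "\n}"],
  body: ["const x = 1;\nreturn x;"],
};

function tangle(chunks, id, indent = 0) {
  const prefix = "\n" + " ".repeat(indent); // newline plus current indent
  let tangled = "";
  let text = ""; // text seen since the last reference
  for (const part of chunks[id]) {
    if (typeof part === "string") {
      text += part;
      continue;
    }
    // Flush the accumulated text, then recurse at the column where the
    // reference appears.
    const lines = text.split("\n");
    tangled += lines.join(prefix);
    text = "";
    const column = indent + lines[lines.length - 1].length;
    tangled += tangle(chunks, part.ref, column);
  }
  // Indent whatever text follows the last reference.
  return tangled + text.split("\n").join(prefix);
}

console.log(tangle(chunks, "main"));
// function f() {
//   const x = 1;
//   return x;
// }
```

Both lines of the referenced chunk end up at the column of the reference, even though the chunk itself is written flush left.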
Tangle a chunk
Create a string of whitespace for indentation
Create a place to accumulate unindented text
Remember the <a class=chunk> we’re in, if any
const walker = document.createTreeWalker(
  chunk.querySelector("pre"));
while (walker.nextNode()) {
  switch (walker.currentNode.nodeType) {
  case Node.ELEMENT_NODE:
    Ignore most elements
    Process a chunk reference
    break;
  case Node.TEXT_NODE:
    Ignore text inside <a class=chunk>
    accumulate text for indentation
    break;
  default:
    break;
  }
}
Indent remaining text
While we walk the tree, we really only care about text nodes and <a class=chunk> elements. If we encounter any other kind of element, we can ignore it.
Ignore most elements
if (walker.currentNode.localName != "a")
  continue;
if (!walker.currentNode.classList.contains("chunk"))
  continue;
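The chunk above is spliced into a switch statement that sits inside the tree-walking while loop, so its continue statements skip to the next loop iteration rather than just leaving the switch. A small standalone sketch of that control flow, using a made-up `kinds` array in place of DOM nodes:

```javascript
// Hypothetical input: node "kinds" standing in for the nodes of a tree walk.
const kinds = ["text", "span", "a", "text"];
const seen = [];

let i = -1;
while (++i < kinds.length) {
  switch (kinds[i]) {
    case "text":
      seen.push("text");
      break; // leaves the switch only
    default:
      if (kinds[i] !== "a")
        continue; // jumps straight to the next loop iteration
      seen.push("link");
      break;
  }
}

console.log(seen); // [ 'text', 'link', 'text' ]
```

The "span" entry is skipped entirely, which mirrors how uninteresting elements never reach the chunk-reference handling.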
Let’s cover the processing of text nodes. For the most part, we just accumulate unindented text into text:
Create a place to accumulate unindented text
let text = "";
accumulate text for indentation
text += walker.currentNode.data;
But we have to make sure we ignore any text inside <a class=chunk> links. If we’re inside one, we remember it in xref:
Remember the <a class=chunk> we’re in, if any
let xref = null;
Remember this <a class=chunk> while we process it
xref = walker.currentNode;
And then we don’t bother to accumulate text if xref is set:
Ignore text inside <a class=chunk>
if (xref && isAncestor(walker.currentNode, xref))
  continue;
else if (xref) // We’re no longer inside xref, so forget it
  xref = null;
Next, let’s go over what we do when we encounter a reference to another chunk (an <a class=chunk> element). There are three steps, the first of which we’ve already seen:
Process a chunk reference
Remember this <a class=chunk> while we process it
Indent the text we’ve accumulated thus far
Recurse with a new indent length
The second step is to indent all of the text that we’ve accumulated so far. We remember the lines of text in lines because we’ll need to use them later to compute a new indent length.
Indent the text we’ve accumulated thus far
const lines = text.split("\n");
tangled += lines.join(prefix);
text = "";
Next, we compute a new indent length and recurse, tangling the chunk that’s referred to by this <a class=chunk>.
Recurse with a new indent length
const last = lines[lines.length - 1];
tangled += Tangler.tangle(
  document.querySelector(
    walker.currentNode.getAttribute("href")),
  indent + last.length);
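A worked example of the indent arithmetic in the last two chunks may help. The input values here (`indent` of 4 and the sample `text`) are made up for illustration, and `" ".repeat()` is used in place of the spaces() helper.

```javascript
// Assumed starting state: we are already tangling at an indent of 4,
// and this text has accumulated before an <a class=chunk> is reached.
const indent = 4;
const prefix = "\n" + " ".repeat(indent);
const text = "if (ok) {\n  run(";

// Joining with the prefix indents every line except the first, which
// continues the line the caller has already started.
const lines = text.split("\n");
const indented = lines.join(prefix);

// The referenced chunk starts where the last line ends, so its indent
// is the current indent plus the length of that last, unindented line.
const last = lines[lines.length - 1];
const newIndent = indent + last.length;

console.log(JSON.stringify(indented)); // "if (ok) {\n      run("
console.log(newIndent);                // 10
```

The last output line is ten characters wide (four of indent, two of leading spaces, and "run("), so continuation lines of the referenced chunk line up directly under the point where the reference appeared.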
Finally, when the entire tree walk is done, we indent whatever text we have left over (after the last <a class=chunk> link).
Indent remaining text
tangled += text.split("\n").join(prefix);
3.2. Making links to tangled files
When you put a Tangler document up on the web, it’s nice to provide links to the tangled source files embedded in the document.
The easiest thing to do is to construct a Data URL to the tangled source. You’ll need to know the chunk to tangle, and the media type of the tangled document.
Create a data URL to a chunk
`data:${mediaType};charset=utf-8,${
encodeURIComponent(Tangler.tangle(chunk))
}`;
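To see what such a URL looks like, here is a small sketch that applies the same template to a plain string. The `dataUrlFor` function name and the sample source text are hypothetical; in Tangler the string comes from Tangler.tangle(chunk).

```javascript
// Build a data URL for some already-tangled source text.
function dataUrlFor(source, mediaType) {
  return `data:${mediaType};charset=utf-8,${encodeURIComponent(source)}`;
}

const source = "a = 1;\nb = 2;";
const url = dataUrlFor(source, "text/javascript");

console.log(url);
// data:text/javascript;charset=utf-8,a%20%3D%201%3B%0Ab%20%3D%202%3B

// Decoding everything after the first comma recovers the source.
const roundTripped = decodeURIComponent(url.slice(url.indexOf(",") + 1));
console.log(roundTripped === source); // true
```

Percent-encoding is what keeps newlines and punctuation in the tangled source from being misread as part of the URL syntax.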
Tangler.createChunkDownloadLink() is a helper method which can do this for you. You give it a chunk, a filename for the tangled document, and a media type, and it returns an <a> element you can stick in your DOM somewhere.
Tangler.createChunkDownloadLink()
Tangler.createChunkDownloadLink = function(chunk, filename, mediaType) {
  const link = document.createElement("a");
  link.download = filename;
  link.href = Create a data URL to a chunk
  let filename_el = document.createElement("code");
  filename_el.appendChild(document.createTextNode(filename));
  link.appendChild(filename_el);
  return link;
}
3.3. Tangler.createChunkIndex()
Tangler.createChunkIndex()
Tangler.createChunkIndex = function(id) {
  const index = new DocumentFragment();
  Create index heading
  Populate index
  return index;
}
If you pass an ID to createChunkIndex(), it’ll place the ID on the heading of the index, so you can link to it from your Table of Contents.
Create index heading
const heading = document.createElement("h2");
if (id)
heading.id = id;
index.appendChild(heading);
heading.appendChild(document.createTextNode("Chunk index"));
Adding each chunk to the index is relatively straightforward.
Add chunk to index
const indexEntry = document.createElement("li");
chunkList.appendChild(indexEntry);
const link = document.createElement("a");
indexEntry.appendChild(link);
link.href = "#" + chunk.id;
link.classList.add("chunk");
const title = chunk.querySelector("figcaption").cloneNode(true);
while (title.childNodes.length > 0) {
link.appendChild(title.childNodes[0]);
}
But the index should be in lexicographic order, not document order, so we need to sort the chunks by their names before adding index entries for them.
Populate index
const chunkList = document.createElement("ul");
index.appendChild(chunkList);
const chunkNameMap = new Map();
for (const chunk of document.querySelectorAll("figure.chunk")) {
  chunkNameMap.set(
    chunk.querySelector("figcaption").textContent,
    chunk);
}
for (const chunkName of Array.from(chunkNameMap.keys()).sort()) {
  const chunk = chunkNameMap.get(chunkName);
  Add chunk to index
}
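Two properties of this approach are worth seeing in isolation: a Map keyed by name keeps only the last chunk for any given name, and Array.prototype.sort() with no comparator orders strings by UTF-16 code units, so capitalized names come before lowercase ones. The sketch below uses made-up name and id pairs in place of real <figure> elements, and the case-insensitive comparator is an alternative shown for contrast, not something Tangler does.

```javascript
// Hypothetical [name, id] pairs standing in for figure.chunk elements.
const entries = [
  ["spaces()", "spaces"],
  ["Tangler.tangle()", "tangle"],
  ["isAncestor()", "is-ancestor"],
  ["Add chunk to index", "add-chunk-to-index"],
  ["spaces()", "spaces-again"], // same name as the first entry
];

const chunkNameMap = new Map();
for (const [name, id] of entries) {
  chunkNameMap.set(name, id); // a repeated name overwrites the earlier one
}

// Default sort: by UTF-16 code unit, so "A" and "T" precede "i" and "s".
const sorted = Array.from(chunkNameMap.keys()).sort();
console.log(sorted);
// [ 'Add chunk to index', 'Tangler.tangle()', 'isAncestor()', 'spaces()' ]

// An alternative ordering that ignores case (not what Tangler uses).
const caseInsensitive = Array.from(chunkNameMap.keys()).sort((a, b) => {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  return x < y ? -1 : x > y ? 1 : 0;
});
console.log(caseInsensitive);
// [ 'Add chunk to index', 'isAncestor()', 'spaces()', 'Tangler.tangle()' ]
```

Because ids are unique but names need not be, two chunks that share a caption would produce a single index entry under this scheme.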
3.4. Styling chunks
Besides providing a script that tangles documents, Tangler also includes a minimal stylesheet that makes chunks and chunk references look kind of like they do in noweb, the literate programming tool I’ve used the most over the years.
tangle.css
/* tangle.css - Tangler, a tool for literate programming
*
* Copyright
*/
Tighten chunk spacing
Style chunks
Put funky brackets around chunk names
Chunk names get bracketed with U+27EA (⟪) MATHEMATICAL LEFT DOUBLE ANGLE BRACKET and U+27EB (⟫) MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET.
Style chunk references
a.chunk::before {
content: "⟪";
}
a.chunk::after {
content: "⟫";
}
Where chunks are defined, their names are followed by U+2254 (≔) COLON EQUALS.
Style chunk names
figure.chunk > figcaption::before {
content: "⟪";
}
figure.chunk > figcaption::after {
content: "⟫≔";
}
I don’t want there to be any space between the chunk name and its body. So we remove any margin that might cause there to be space.
Tighten chunk spacing
figure.chunk > figcaption {
margin-block-end: 0;
}
figure.chunk > pre {
margin-block-start: 0;
}
Chunk names should be left-aligned. The bodies of chunks should be set in slightly from the chunk name, which should hang a bit left of the main text flow.
Style chunks
figure.chunk > figcaption {
text-align: left;
text-indent: -0.5ic;
}
figure.chunk > pre {
margin-inline: 1ic;
}
3.5. Utilities
Python uses the multiplication operator to quickly create repetitive strings: 'foo'*3 returns 'foofoofoo'. This is super useful for creating runs of whitespace, which a tangler needs to do to get indentation right. Unfortunately, JavaScript’s * operator doesn’t work on strings this way.
spaces()
function spaces(length) {
  return Array(length).fill(" ").join("");
}
This is how we use it in Tangler.tangle():
Create a string of whitespace for indentation
const prefix = "\n" + spaces(indent);
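As a quick check of what spaces() and the prefix look like, the sketch below runs the helper on a few lengths. It also compares the result with the standard String.prototype.repeat() method, which is not what Tangler uses but produces the same strings.

```javascript
// Same helper as in the document.
function spaces(length) {
  return Array(length).fill(" ").join("");
}

console.log(JSON.stringify(spaces(0))); // ""
console.log(JSON.stringify(spaces(4))); // "    "

// The built-in repeat() method yields identical results.
console.log(spaces(4) === " ".repeat(4)); // true

// The prefix for an indent of 2: a newline followed by two spaces.
const prefix = "\n" + spaces(2);
console.log(JSON.stringify(prefix)); // "\n  "
```

Joining split lines with this prefix is what indents every line after the first.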
Here’s a function to test if a DOM element is an ancestor of a node. I originally implemented this by walking up node.parentElement looking for the possible ancestor, because I always forget that node.compareDocumentPosition() exists.
isAncestor()
function isAncestor(node, possibleAncestor) {
  return Node.DOCUMENT_POSITION_CONTAINS &
    node.compareDocumentPosition(possibleAncestor);
}
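For comparison, here is a rough sketch of the parent-walking approach mentioned above, followed by the bit test that isAncestor() relies on. The `isAncestorByWalking` name and the plain objects with a `parentElement` property are hypothetical stand-ins so the example runs without a DOM; the two constants carry the values the DOM Standard assigns to them.

```javascript
// Walk up the parent chain looking for the candidate ancestor.
function isAncestorByWalking(node, possibleAncestor) {
  for (let p = node.parentElement; p; p = p.parentElement) {
    if (p === possibleAncestor) return true;
  }
  return false;
}

// Mock tree: pre > link > textNode (plain objects, not real DOM nodes).
const pre = { parentElement: null };
const link = { parentElement: pre };
const textNode = { parentElement: link };

console.log(isAncestorByWalking(textNode, pre));  // true
console.log(isAncestorByWalking(link, textNode)); // false

// compareDocumentPosition() returns a bitmask. When the other node
// contains this one, the CONTAINS and PRECEDING bits are both set.
const DOCUMENT_POSITION_PRECEDING = 2;
const DOCUMENT_POSITION_CONTAINS = 8;
const result = DOCUMENT_POSITION_CONTAINS | DOCUMENT_POSITION_PRECEDING;
console.log(Boolean(DOCUMENT_POSITION_CONTAINS & result)); // true
```

Note that isAncestor() returns the masked number rather than a boolean, which is fine in the if condition where it is used because any nonzero value is truthy.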
3.6. Usage from the command line
It’d be nice if this thing worked from the command line, and not just from inside a browser. Fortunately jsdom makes that a breeze.
Parsing an HTML file with jsdom is really easy.
Parse the file with jsdom
const { JSDOM } = require("jsdom");
const html = require('fs').readFileSync(process.argv[2], 'utf-8');
const dom = new JSDOM(html);
The rest of Tangler’s code assumes it’s running in a browser. Now that we’ve parsed the HTML document, we can trick the rest of the script into thinking that’s the case, even though we’re in Node.js.
The only global objects the rest of the script relies on are window.Node, window.DocumentFragment, and window.document, so we put references to them on globalThis.
Trick the rest of the file into thinking we’re in a browser
globalThis.Node = dom.window.Node;
globalThis.DocumentFragment = dom.window.DocumentFragment;
globalThis.document = dom.window.document;
All that’s left to do is to tangle the requested chunk and print it to stdout:
Print the tangled source code to stdout
console.log(
Tangler.tangle(document.getElementById(process.argv[3])));
Some of this trickery has to happen before the rest of the script loads. In order to run some code only in Node.js, we look for globalThis.require, which is defined in Node.js but isn’t part of the web platform.
Node.js preamble
if ('require' in globalThis) {
  Bail if there aren’t enough command line arguments
  Parse the file with jsdom
  Trick the rest of the file into thinking we’re in a browser
}
We can’t print the tangled source code to stdout until the methods of the Tangler global have been defined.
Node.js postamble
if ('require' in globalThis) {
  Print the tangled source code to stdout
}
If the user didn’t pass in enough arguments, we tell them how to call this script from the command line.
Bail if there aren’t enough command line arguments
if (process.argv.length < 4) {
  console.error(
    "USAGE: node tangle.js tangler-filename chunk-id");
  process.exit(1);
}
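The argument handling in this section can be sketched as a pure function, which makes the index arithmetic easier to see. The `parseArgs` function and the sample arrays are hypothetical; they only assume what the chunks above assume, namely that the filename is at index 2 and the chunk id at index 3 of an argv-shaped array.

```javascript
// Return the two arguments Tangler needs, or null if any are missing.
function parseArgs(argv) {
  if (argv.length < 4) return null;
  return { filename: argv[2], chunkId: argv[3] };
}

// Sample arrays shaped like process.argv (values invented for the example).
console.log(parseArgs(["node", "-", "index.html", "Makefile"]));
// { filename: 'index.html', chunkId: 'Makefile' }
console.log(parseArgs(["node", "-", "index.html"]));
// null
```

Returning null instead of exiting keeps the check testable; the real script prints the usage message and exits with status 1 at that point.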
3.7. A Makefile and a README
Tangler’s repository is on GitHub, so here’s a README.md that shows up in GitHub’s UI:
README.md
# Tangler
Tangler is a literate programming tool whose input format is HTML.
Read all about it at URL
To extract Tangler’s source code from [the HTML file](URL), you can simply [download the source files](URL#download), or you can check out this repository and extract the Makefile with the following command:
node - index.html Makefile < bootstrap/tangle.js > Makefile
After this, a simple `make` should do the right thing.
Makefile
# -*- makefile-gmake -*-
.PHONY: all bootstrap clean distclean
SOURCE=index.html
CODE=tangle.js tangle.css tangler.el
PRODUCTS=$(CODE) Makefile README.md
all: $(PRODUCTS)
bootstrap: bootstrap/tangle.js bootstrap/tangle.css
bootstrap/tangle.js: tangle.js
	cp tangle.js bootstrap/
bootstrap/tangle.css: tangle.css
	cp tangle.css bootstrap/
clean:
	rm -f $(CODE) *~
distclean:
	rm -f $(PRODUCTS) *~
$(PRODUCTS): $(SOURCE)
	node - $(SOURCE) $@ < bootstrap/tangle.js > $@
4. Download Tangler
I’m sure you’ve noticed by now that Tangler is self-hosted: this document describing how it works is also its source code. It contains several documents that can be extracted.
The script contains the definition of the Tangler global, the methods that hang off of it, and some utility functions:
tangle.js
/* tangle.js - Tangler, a tool for literate programming
*
* Copyright
*/
(function() {
Define the Tangler global
Node.js preamble
isAncestor()
spaces()
Tangler.tangle()
Tangler.createChunkIndex()
Tangler.createChunkDownloadLink()
Node.js postamble
})();
You can read about the stylesheet in §3.4. Styling chunks.
In §2.4. Editor support: Emacs I define the tangler-insert-chunk skeleton. I’ve wrapped it into an Elisp library below:
tangler.el
;;; tangler.el --- Support for editing Tangler literate programs
;; Copyright
;; Author: Author
;; Keywords: convenience, docs, hypermedia
;;; Code:
(require 'skeleton)
Skeleton for writing chunks
(provide 'tangler)
;;; tangler.el ends here