CT Blog - Understanding the Markdown syntax

The last time we talked about Markdown was almost a month ago. Now is the time to carry on our crazy Markdown parser-building adventure!

We have managed to split a file into multiple paragraphs. But by themselves, they are not enough to make a complete document. The next logical step is to extract headers out of our Markdown file.

Headers

If you are reading this blog, or any other blog for that matter, or most books, or documents at work, you are familiar with the concept of headers.

They structure a document into a digestible outline, help our brains remember the content better, and give us a sense of the complete document without having to read it.

In Markdown, headers can take two forms or syntaxes:

Setext
atx

Because we want our parser to implement the syntax described on John Gruber’s Website in its entirety, we will have to parse the two types of headers.

Sandy path between beach vegetation — “Headers? Where we’re going, we don’t need headers!” *― Dr Emmett Brown*

Setext

Setext headers support only two levels of headers (i.e. titles and subtitles).

Setext-style headers are “underlined” using equal signs (for first-level headers) and dashes (for second-level headers). […] Any number of underlining =’s or -’s will work.

Here is an example of how Setext headers look:

Understanding the Markdown syntax
=================================

Headers
-------

Here, I matched the length of the title when “underlining” it, purely for aesthetic reasons. As the Markdown syntax mentions, a single = or - is enough to mark a title or a subtitle, respectively.

So if we rephrase what we have just learned, and considering our previous experience with blank lines and paragraphs, a Setext header is:

A normal paragraph (i.e. delimited by blank lines and not indented)
Whose last line contains only = characters or only - characters; trailing spaces and tabs can safely be ignored.

Parsing the headers

We already know how to parse a paragraph. Let us try to first understand how we can parse the last line of the Setext header by turning what we just described into a regex.

// First-level header “underline”
/^=+[ \t]*$/

// Second-level header “underline”
/^-+[ \t]*$/
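As a quick sanity check before wiring anything into a page, the sketch below runs both patterns against a handful of candidate lines. The `samples` array and the `results` shape are made up for illustration; only the two regexes come from above.

```javascript
const firstLevel = /^=+[ \t]*$/;
const secondLevel = /^-+[ \t]*$/;

// Made-up sample lines: valid underlines, a mixed one, an indented one, plain text
const samples = ['=====', '=', '---  ', '-\t', '=-=', ' ===', 'Headers'];

// For each sample, record which pattern (if any) accepts it
const results = samples.map((line) => ({
  line,
  first: firstLevel.test(line),
  second: secondLevel.test(line),
}));

console.table(results);
```

Mixed underlines such as `=-=` and indented ones such as ` ===` are rejected by both patterns, because each one only accepts a single kind of character starting at the very beginning of the line.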

To make sure our regexes work properly, we will use the same method as in the previous article of this series: a simple webpage with some JavaScript.

Here is the JavaScript code you will need to test our two regexes. As always, let me provide you with the runnable page containing that code.

/*
 * Once again, we will define some test content
 * to make sure everything works as we expect.
 */
const content = `
First header
============

Multi
line
header
------

Another header
=

Simple second level header
--------------------------`;

const blankLine = /^[\t ]*$/;

const headers = {
  setext: {
    first: /^=+[ \t]*$/,
    second: /^-+[ \t]*$/,
  },
};

let lines = content.split('\n');
let headerFound = 0;
let currentParagraph = [];

lines.forEach((line, index) => {
  // If we find a Setext header “underline”
  if (
    (line.match(headers.setext.first) || line.match(headers.setext.second)) &&
    currentParagraph.length > 0
  ) {
    // We grab the current paragraph and normalize its line breaks
    const header = currentParagraph
      .join('\n')
      .replace(/  \n/g, '<br/>')
      .replace(/\n/g, ' ')
      .trim();

    // And write what we have found to the current page
    document.write(`Header ${++headerFound}: ${header}<br/>`);

    currentParagraph = [];
    return;
  }

  if (!line.match(blankLine)) {
    currentParagraph.push(line);
  } else {
    currentParagraph = [];
  }
});

With this code, we now get our headers like so:

Header 1: First header
Header 2: Multi line header
Header 3: Another header
Header 4: Simple second level header
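If you would rather check the logic without a browser, the same idea can be sketched as a pure function that returns the headers instead of writing them to the page. `extractSetextHeaders` and `sample` are hypothetical names, and this simplified version joins header lines with single spaces without handling the two-trailing-spaces line break.

```javascript
// Hypothetical helper: collects Setext header texts from a Markdown string
function extractSetextHeaders(markdown) {
  const blankLine = /^[\t ]*$/;
  const underline = /^(=+|-+)[ \t]*$/;
  const found = [];
  let currentParagraph = [];

  for (const line of markdown.split('\n')) {
    if (underline.test(line) && currentParagraph.length > 0) {
      // The lines gathered so far form the header text
      found.push(currentParagraph.map((l) => l.trim()).join(' '));
      currentParagraph = [];
    } else if (blankLine.test(line)) {
      // A blank line ends the block without producing a header
      currentParagraph = [];
    } else {
      currentParagraph.push(line);
    }
  }

  return found;
}

const sample =
  'First header\n============\n\nMulti\nline\nheader\n------\n\nNot a header\n\nAnother header\n=';

console.log(extractSetextHeaders(sample));
// → [ 'First header', 'Multi line header', 'Another header' ]
```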

But we are missing something paramount to the structure of our document: header levels.

We managed to parse our headers, but now we also need to identify and pass along their respective level. Let us try to do that. With our Setext-style headers, we only have two different levels to parse, so that should not be too hard of a task.

Parsing the header levels

Add the following code to the previous page, at the beginning of the for-each loop, or go there, if you are still lazy.

lines.forEach((line, index) => {
  // By default, let us consider that our header is a level 1 header
  let headerLevel = 1;

  // If we find a level 2 header
  if (line.match(headers.setext.second) && currentParagraph.length > 0) {
    // We set the header level to 2
    headerLevel = 2;
  }

  // ... the rest of the loop body stays as before
});

Change the code that outputs the headers to read as follows:

// We display the headerLevel when writing out the header
document.write(`Header ${++headerFound} (h${headerLevel}): ${header}<br/>`);

Our modifications give us the following result:

Header 1 (h1): First header
Header 2 (h2): Multi line header
Header 3 (h1): Another header
Header 4 (h2): Simple second level header
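Another way to look at level detection is to work on one block of text at a time and let a single capture group tell us which character was used for the underline. The sketch below assumes a hypothetical `parseSetextBlock` helper that receives the lines of one block; it is an illustration, not the code of the runnable page.

```javascript
// Hypothetical helper: given the lines of one block of text, returns
// { level, text } if the block is a Setext header, or null otherwise.
function parseSetextBlock(blockLines) {
  // We need at least one line of text plus the underline
  if (blockLines.length < 2) return null;

  const underline = blockLines[blockLines.length - 1];
  const match = underline.match(/^(=+|-+)[ \t]*$/);
  if (!match) return null;

  // "=" means level 1, "-" means level 2
  const level = match[1].startsWith('=') ? 1 : 2;
  const text = blockLines
    .slice(0, -1)
    .map((l) => l.trim())
    .join(' ');

  return { level, text };
}

console.log(parseSetextBlock(['Another header', '=']));
// → { level: 1, text: 'Another header' }
console.log(parseSetextBlock(['Multi', 'line', 'header', '------']));
// → { level: 2, text: 'Multi line header' }
console.log(parseSetextBlock(['Just a paragraph']));
// → null
```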

Perfect! We now detect the header level as well as the header content. Let us carry on with a different task by parsing atx-style headers.

atx

atx-style headers are the most commonly used headers in Markdown. They allow 6 different levels and are preferred when building complicated outlines. For example, this document has 4 levels of headers.

They use 1 to 6 hash characters at the beginning of the line to express their depth level. They look like so:

# First level header
## Second level header
##### Fifth and so forth

They can also include any number of trailing hash characters, but only the leading hashes determine the level of the header.

# First level header #
##### Fifth and so forth ################

With all of that in mind (I once again just paraphrased the official specification), and what we just experienced with the Setext-style headers, atx-style headers are:

A single line
This line must start with 1 to 6 hash characters, and can have any number of trailing hashes.

Let us now take a look at the dreaded regex that would translate that into code:

/^(#{1,6})[ \t]*(.*?)[ \t]*#*[ \t]*$/;

This one is a little more complicated than the previous ones. Tools like regex101 explain every bit of black magic that composes that regex, so go and paste it there if you are curious.
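To see what the two capture groups give us, here is a minimal sketch that exercises an atx pattern written with the `{1,6}` range quantifier. In JavaScript the range needs a comma: without the `u` flag, a hyphenated `{1-6}` is treated as literal text rather than as a quantifier. The `parseAtxLine` helper is a hypothetical name introduced for this example.

```javascript
// Group 1 captures the leading hashes, group 2 the header text.
// Trailing hashes and surrounding spaces are left out of group 2.
const atx = /^(#{1,6})[ \t]*(.*?)[ \t]*#*[ \t]*$/;

// Hypothetical helper returning { level, text }, or null for non-headers
function parseAtxLine(line) {
  const match = line.match(atx);
  if (!match) return null;

  const [, hashes, text] = match;
  return { level: hashes.length, text };
}

console.log(parseAtxLine('# First level header #'));
// → { level: 1, text: 'First level header' }
console.log(parseAtxLine('##### Fifth and so forth ################'));
// → { level: 5, text: 'Fifth and so forth' }
console.log(parseAtxLine('No hashes here'));
// → null
```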

Otherwise, let us carry on with testing it. Let us take our previous code, and first update the test content with the following:

const content = `
## A header
# A header directly on the next line

### A header with closing hashes #####

###### Another header but this one ends with # #`;

We will also update the regex with the one a few lines above, and use the following logic in our loop:

const headers = {
  atx: /^(#{1,6})[ \t]*(.*?)[ \t]*#*[ \t]*$/,
  setext: {
    first: /^=+[ \t]*$/,
    second: /^-+[ \t]*$/,
  },
};

lines.forEach((line, index) => {
  if (line.match(headers.atx)) {
    // We grab the matches from our regex
    const [, hashes, headerContent] = line.match(headers.atx);

    // An atx header is a single line, so trimming is all we need
    const header = headerContent.trim();

    // The header level is the number of hashes
    let headerLevel = hashes.length;

    // We display the headerLevel when writing out the header
    document.write(`Header ${++headerFound} (h${headerLevel}): ${header}<br/>`);

    return;
  }
});

Now we can parse atx-style headers. Running the above code (or following this link), we get this:

Header 1 (h2): A header
Header 2 (h1): A header directly on the next line
Header 3 (h3): A header with closing hashes
Header 4 (h6): Another header but this one ends with #

Rendering headers

The last step is to render our headers. We will combine the code from the first article with the code for parsing the Setext- and atx-style headers.

Our for-each loop will look like the following:

lines.forEach((line, index) => {
  if (line.match(headers.atx)) {
    let [_, hashes, headerContent] = line.match(headers.atx);
    let headerLevel = hashes.length;

    return renderHeader(headerContent, headerLevel);
  }

  // If we find a Setext header
  if (
    (line.match(headers.setext.first) || line.match(headers.setext.second)) &&
    currentParagraph.length > 0
  ) {
    // By default, let us consider that our header is a level 1 header
    let headerLevel = 1;

    // If the underline is made of dashes, it is a level 2 header
    if (line.match(headers.setext.second)) {
      headerLevel = 2;
    }

    const headerContent = currentParagraph.join('\n');

    currentParagraph = [];
    return renderHeader(headerContent, headerLevel);
  }

  if (!line.match(blankLine)) {
    currentParagraph.push(line);
  }

  if (
    (line.match(blankLine) || index === lines.length - 1) &&
    currentParagraph.length > 0
  ) {
    let paragraph = document.createElement('p');
    paragraph.innerHTML = currentParagraph
      .join('\n')
      .replace(/  \n/g, '<br/>')
      .replace(/\n/g, ' ')
      .trim();

    body.appendChild(paragraph);

    currentParagraph = [];
    return;
  }
});

The renderHeader function contains the following code:

// `body` is assumed to be defined as in the first article
function renderHeader(content, level = 1) {
  const header = content.replace(/  \n/g, '<br/>').replace(/\n/g, ' ').trim();

  const element = document.createElement(`h${level}`);
  element.innerHTML = header;

  body.appendChild(element);
}

This code is starting to get long. In the next article of this series, we will refactor it to make it terser. But it works! You could already use this parser / renderer to process a Markdown document made up only of headers and paragraphs.

Not completely useful yet. To see what it does you can go to this page.
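If you want to experiment outside the browser, the whole pipeline can also be sketched as a pure function that returns an HTML string instead of appending elements to the page. Everything below is an assumption made for illustration: `renderToHtml`, `inline` and `flush` are hypothetical names, the atx pattern uses the `{1,6}` quantifier, an atx header is assumed to close any paragraph that precedes it, and no HTML escaping is performed.

```javascript
const blankLine = /^[\t ]*$/;
const atx = /^(#{1,6})[ \t]*(.*?)[ \t]*#*[ \t]*$/;
const setext = /^(=+|-+)[ \t]*$/;

// Turns "two trailing spaces + newline" into <br/>, other newlines into spaces
const inline = (text) =>
  text.replace(/  \n/g, '<br/>').replace(/\n/g, ' ').trim();

// Hypothetical pure-function renderer for headers and paragraphs only
function renderToHtml(markdown) {
  const out = [];
  let paragraph = [];

  // Emits the pending paragraph, if there is one
  const flush = () => {
    if (paragraph.length > 0) {
      out.push(`<p>${inline(paragraph.join('\n'))}</p>`);
      paragraph = [];
    }
  };

  for (const line of markdown.split('\n')) {
    const atxMatch = line.match(atx);
    const setextMatch = line.match(setext);

    if (atxMatch) {
      flush(); // assumption: an atx header ends the paragraph before it
      const level = atxMatch[1].length;
      out.push(`<h${level}>${inline(atxMatch[2])}</h${level}>`);
    } else if (setextMatch && paragraph.length > 0) {
      // The pending lines become the header text
      const level = setextMatch[1].startsWith('=') ? 1 : 2;
      out.push(`<h${level}>${inline(paragraph.join('\n'))}</h${level}>`);
      paragraph = [];
    } else if (blankLine.test(line)) {
      flush();
    } else {
      paragraph.push(line);
    }
  }

  flush(); // do not forget the last paragraph
  return out.join('\n');
}

const html = renderToHtml(
  'Title\n=====\n\nSome text\non two lines.\n\n## Section ##\nMore text.'
);
console.log(html);
```

Returning a string keeps the parsing logic independent from the DOM, which makes it easier to test; the result can still be assigned to an element's `innerHTML` in a page.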

If you have made it all the way here, congrats! 🎉 You have read at least 160 lines of code. From the next article on, I will try to move the parser to a repository on GitHub to make it easier for you to read the code.

Glasses on a folded scarf — Now, rest your eyes and go read something else that is not displayed on a screen!