CT Blog - Understanding the Markdown syntax

In this post, the goal is to read through the syntax definition and start thinking about how we will translate it into code.

Let us digest this content together.

Note: This is the first post with code and code blocks. I kept it very simple for now, but I plan on adding syntax highlighting and some other tweaks to the code blocks.

Edit: April 15th, 2020 – Well, I just added basic syntax highlighting and line numbers as a side effect of moving to Hugo, apparently. Yay!

Element types

From the get-go, you will notice that Markdown supports two major types of elements:

Block elements, elements that can, but do not have to, span multiple lines and that own their lines of the document (meaning that you cannot have two block elements sharing the same line)
Span elements, elements that you can add anywhere inside a block element, and that will render inline with the rest of the content.
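
To make the distinction concrete, here is a minimal sketch of how a block element and the span elements inside it could be represented once parsed. The node shape (the `type`, `children`, and `value` properties) and the `textOf` helper are assumptions made for illustration; the Markdown syntax definition does not prescribe any data structure.

```javascript
// Hypothetical node shapes: `type`, `children`, and `value` are assumptions
// for illustration, not part of the Markdown syntax definition.
const paragraph = {
  type: 'paragraph', // a block element: it owns its lines of the document
  children: [
    { type: 'text', value: 'Some ' },
    // a span element: it lives inside a block and renders inline
    { type: 'emphasis', children: [{ type: 'text', value: 'inline' }] },
    { type: 'text', value: ' content' },
  ],
};

// Collect the plain text of a node by walking its children recursively
function textOf(node) {
  if (node.type === 'text') return node.value;
  return node.children.map(textOf).join('');
}

console.log(textOf(paragraph)); // Some inline content
```

Nesting spans inside a block this way mirrors the rule above: a span never exists outside of a block element.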

Block elements

Paragraphs

As per John Gruber’s specification:

A paragraph is simply one or more consecutive lines of text, separated by one or more blank lines. (A blank line is any line that looks like a blank line — a line containing nothing but spaces or tabs is considered blank.) Normal paragraphs should not be indented with spaces or tabs.

Right there, we have the definition of two elements that will become symbols, or nodes, once our parser has identified them:

Blank lines, an empty line, or a line containing only spaces or tabs.
Paragraphs, unindented consecutive text.

With that information, we can already make two decisions:

We will treat blank lines as separators. They will split the document content into chunks that the parser will parse separately.
By default, the parser will assume that a block is a paragraph, unless detected otherwise.

If we had to represent a blank line as a regex, we would probably use something like this:

/^[\t ]*$/;
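
As a quick sanity check, the regex can be tried against a handful of strings in a browser console or in Node. The `samples` and `results` names are only illustrative.

```javascript
const blankLine = /^[\t ]*$/;

// A few hand-picked lines: empty, spaces only, tabs and spaces, and two
// lines that contain actual text
const samples = ['', '   ', '\t \t', 'text', '  text  '];

// RegExp.prototype.test returns true when the line matches the pattern
const results = samples.map((sample) => blankLine.test(sample));

console.log(results); // true, true, true, false, false
```

The `*` quantifier is what lets the completely empty string match, so an empty line counts as blank as well.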

To make sure that our regex is correct, we will put the following basic HTML code on a page and open it with our browser. If you are lazy (and as a good programmer, you should be), you can open this page.

<!DOCTYPE html>
<html>
  <head>
    <title>Regex Checker</title>
  </head>
  <body>
    <script>
      /*
       * This is the test content we are going to use to make sure that our
       * regex detects the right lines.
       * It should detect 4 blank lines.
       */
      const content = `
        \t    \t    \t     
        Some content
                    
        \t
        Some other content`;

      // This is the regex we want to test
      const regex = /^[\t ]*$/;

      // We split the content line by line
      let lines = content.split('\n');
      document.write(`Found ${lines.length} lines!<br/><br/>`);

      // And then we iterate upon those lines, and check for blank lines
      lines.forEach((line, index) => {
        document.write(`${index}: `);
        if (line.match(regex)) {
          document.write('This line is blank!<br/>');
        } else {
          document.write('This line has some content!<br/>');
        }
      });
    </script>
  </body>
</html>

If everything goes well with our regex, we should get the following on our page:

Found 6 lines!

0: This line is blank!
1: This line is blank!
2: This line has some content!
3: This line is blank!
4: This line is blank!
5: This line has some content!

Yay! We can now detect blank lines, the boundaries of each paragraph. Now onto our second task: extracting paragraphs.

We will update some of the JavaScript code like so:

const blankLine = /^[\t ]*$/;

let lines = content.split('\n');
let paragraphFound = 0;

lines.forEach((line, index) => {
  if (!line.match(blankLine)) {
    paragraphFound++;
    // The <br/> puts each paragraph on its own line in the page
    document.write(`Paragraph ${paragraphFound}: ${line}<br/>`);
  }
});

Let us run the code again (you can follow this link to do that), and make sure that everything works like we expect. Our code should output the following lines:

Paragraph 1: Some content
Paragraph 2: Some other content

Perfect! Well, before we get ahead of ourselves, let us update our test content and make sure that paragraph detection works correctly. We will extend the content variable with a few more lines:

const content = `
  \t    \t    \t     
  Some content
              
  \t
  Some other content

  A multiline paragraph
  with a bit of content

  A paragraph with a  
  line break`;

Once again, if everything goes well, we will extract four paragraphs out of our test content. But you might already see a potential problem with our code. And you would be right. Let us run the code. This is what we get:

Paragraph 1: Some content
Paragraph 2: Some other content
Paragraph 3: A multiline paragraph
Paragraph 4: with a bit of content
Paragraph 5: A paragraph with a
Paragraph 6: line break

We said earlier that, according to the syntax, blank lines separate the paragraphs. But right now, we use the newline character (\n) as a separator. Let us change that.

We will first need to store the lines of the current paragraph in a variable (a buffer), and process them when we encounter a blank line. We will also implement hard line breaks:

When you do want to insert a <br/> break tag using Markdown, you end a line with two or more spaces, then type return.
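
Taken in isolation, that rule could be sketched as follows. The `renderLineBreaks` helper is a hypothetical name, and the sketch assumes the paragraph arrives as an array of raw lines. It uses regexes with the global flag because `String.prototype.replace` with a string pattern only replaces the first occurrence.

```javascript
// Hypothetical helper, not part of the original code: turn the raw lines of
// one paragraph into inline HTML.
function renderLineBreaks(lines) {
  return lines
    .join('\n')
    // Two or more trailing spaces before a newline become a hard line break
    .replace(/ {2,}\n/g, '<br/>')
    // Every remaining newline is a soft wrap and collapses to a single space
    .replace(/\n/g, ' ');
}

console.log(renderLineBreaks(['A paragraph with a  ', 'line break']));
// A paragraph with a<br/>line break
console.log(renderLineBreaks(['one', 'two', 'three']));
// one two three
```

The order of the two replacements matters: the hard breaks have to be consumed first, otherwise their newlines would already have been turned into spaces.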

To put both changes in place, we can replace the JavaScript code with the following (or follow this link):

let paragraphFound = 0;

// Initialize our buffer
let currentParagraph = [];

lines.forEach((line, index) => {
  // If the current line is not a blank line, we push it to our buffer
  if (!line.match(blankLine)) {
    currentParagraph.push(line);
  }

  // If the current line is a blank line or the last line of the document
  if (
    (line.match(blankLine) || index === lines.length - 1) &&
    currentParagraph.length > 0
  ) {
    paragraphFound++;
    /*
     * We join the lines of the paragraph with the newline character, then
     * replace every newline preceded by two or more spaces (a hard line
     * break) with <br/>, and finally replace each remaining newline with a space
     *
     * This allows us to render the newlines according to the specification
     */
    document.write(
      `Paragraph ${paragraphFound}: ${currentParagraph
        .join('\n')
        .replace(/ {2,}\n/g, '<br/>')
        .replace(/\n/g, ' ')}<br/>`
    );
    currentParagraph = [];
    return;
  }
});

Now we get a promising result:

Paragraph 1: Some content
Paragraph 2: Some other content
Paragraph 3: A multiline paragraph with a bit of content
Paragraph 4: A paragraph with a
line break
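
For readers who prefer to experiment outside the browser, the same buffering idea can be sketched as a pure function that returns the paragraphs instead of writing them to the page. The `groupParagraphs` name and the choice to return each paragraph as an array of raw lines are assumptions made for this sketch.

```javascript
const blankLine = /^[\t ]*$/;

// Hypothetical helper: group the lines of a document into paragraphs,
// using blank lines as the only separators.
function groupParagraphs(content) {
  const paragraphs = [];
  let buffer = [];

  for (const line of content.split('\n')) {
    if (blankLine.test(line)) {
      // A blank line closes the paragraph being buffered, if there is one
      if (buffer.length > 0) {
        paragraphs.push(buffer);
        buffer = [];
      }
    } else {
      buffer.push(line);
    }
  }

  // The document may end without a trailing blank line
  if (buffer.length > 0) {
    paragraphs.push(buffer);
  }
  return paragraphs;
}

const sample =
  'Some content\n \t \nA multiline paragraph\nwith a bit of content\n\nLast one';
const grouped = groupParagraphs(sample);
console.log(grouped.length); // 3
console.log(grouped.map((lines) => lines.length)); // 1, 2 and 1 lines
```

Flushing the buffer once more after the loop plays the same role as the `index === lines.length - 1` check in the code above.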

This has been a long post, but here is the reward: we now can render paragraphs! You just made your first, very incomplete, Markdown parser and renderer.

To render your paragraphs, add the following code to your page (lazy-link):

let currentParagraph = [];

// Create a reference to the document body
let body = document.querySelector('body');

Then, inside the condition that flushes the buffer, build and append a paragraph element instead of calling document.write:

// Create a paragraph HTML element
let paragraph = document.createElement('p');
paragraph.innerHTML = currentParagraph
  .join('\n')
  // The global flag makes sure every occurrence is replaced, not just the first
  .replace(/ {2,}\n/g, '<br/>')
  .replace(/\n/g, ' ')
  // Add a trim to clean up the rendered content
  .trim();

// Append your paragraph to the document body
body.appendChild(paragraph);

currentParagraph = [];
return;
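
To check the rendered markup without a browser, the same transformation can be sketched as a function that returns an HTML string rather than touching the DOM. The `paragraphsToHtml` name and its input format (an array of paragraphs, each an array of raw lines) are assumptions made for this sketch. The content is not HTML-escaped here, which is fine for a test page but would need more thought in a real renderer.

```javascript
// Hypothetical helper: build the HTML for a list of paragraphs, where each
// paragraph is an array of raw lines (an assumption for this sketch).
function paragraphsToHtml(paragraphs) {
  return paragraphs
    .map((lines) => {
      const inner = lines
        .join('\n')
        .replace(/ {2,}\n/g, '<br/>') // hard line breaks
        .replace(/\n/g, ' ') // soft wraps
        .trim(); // drop the indentation around the paragraph
      return `<p>${inner}</p>`;
    })
    .join('\n');
}

const html = paragraphsToHtml([
  ['  Some content'],
  ['A paragraph with a  ', 'line break'],
]);
console.log(html);
// <p>Some content</p>
// <p>A paragraph with a<br/>line break</p>
```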

And we are done! You can use the last “lazy link”, download the code (with the “Save as” menu of your browser) and you will get the code we have written in this post in its entirety. Play with it, tweak it or break it.

If you do, be sure to ping me on Twitter and show me your work.

In the next article of this series, we will tackle Headers.