The last time we talked about Markdown was almost a month ago. Now is the time to carry on our crazy Markdown parser-building adventure!
We have managed to split a file into multiple paragraphs. But by themselves, they are not enough to make a complete document. The next logical step is to extract headers out of our Markdown file.
Headers
If you are reading this blog, or any other blog for that matter, or most books, or documents at work, you are familiar with the concept of headers.
They structure a document into a digestible outline, help our brain remember the content better and give us a sense of the complete document without having to read it.
In Markdown, headers can take two forms or syntaxes:
- Setext
- atx
Because we want our parser to implement the syntax described on John Gruber’s website in its entirety, we will have to parse both types of headers.
Setext
Setext headers support only two levels of headers (i.e. titles and subtitles).
Setext-style headers are “underlined” using equal signs (for first-level headers) and dashes (for second-level headers). […] Any number of underlining =’s or -’s will work.
Here is an example of how Setext headers look:
Understanding the Markdown syntax
=================================
Headers
-------
Here, I matched the length of the title when “underlining” it, for esthetic reasons. But as mentioned in the Markdown syntax, a single = or - is enough to signify a title or a subtitle, respectively.
So if we rephrase what we have just learned, and considering our previous experience with blank lines and paragraphs, a Setext header is:
- A normal paragraph (i.e. delimited by blank lines and not indented)
- The last line of the block of text contains only = or - characters; trailing spaces and tabs can safely be ignored.
Parsing the headers
We already know how to parse a paragraph. Let us try to first understand how we can parse the last line of the Setext header by turning what we just described into a regex.
// First-level header “underline”
/^=+[ \t]*$/
// Second-level header “underline”
/^-+[ \t]*$/
To make sure our regexes work properly, we will use the same method as in the previous article of this series: a simple webpage with some JavaScript.
Here is the JavaScript code you will need to test our two regexes. As always, let me provide you with the runnable page containing that code.
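As a rough sketch of what such a test could look like, the snippet below splits some content into blocks on blank lines and keeps the blocks whose last line matches one of the two "underline" regexes. The sample content and the `findSetextHeaders` helper are made up for illustration, and the "not indented" rule is left out to keep things short. It logs to the console instead of writing to a page, so it runs in a browser console or in Node.

```javascript
// Sample content, made up for this sketch.
const content = [
  "First header",
  "============",
  "",
  "Just a paragraph, not a header.",
  "",
  "Multi line",
  "header",
  "------",
  "",
  "Another header",
  "=",
  "",
  "Simple second level header",
  "-   ",
].join("\n");

// The two "underline" regexes from the article.
const FIRST_LEVEL = /^=+[ \t]*$/;
const SECOND_LEVEL = /^-+[ \t]*$/;

// Hypothetical helper: returns the titles of all Setext headers in the text.
function findSetextHeaders(text) {
  const titles = [];
  // Blocks are separated by one or more blank lines.
  for (const block of text.split(/\n(?:[ \t]*\n)+/)) {
    const lines = block.split("\n");
    const underline = lines[lines.length - 1];
    const isHeader =
      lines.length > 1 &&
      (FIRST_LEVEL.test(underline) || SECOND_LEVEL.test(underline));
    // Everything above the underline is the title, joined on one line.
    if (isHeader) titles.push(lines.slice(0, -1).join(" "));
  }
  return titles;
}

findSetextHeaders(content).forEach((title, i) => {
  console.log(`Header ${i + 1}: ${title}`);
});
```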
With this code, we now get our headers like so:
Header 1: First header
Header 2: Multi line header
Header 3: Another header
Header 4: Simple second level header
But we miss something paramount to the structure of our document: header levels.
We managed to parse our headers, but now we also need to identify and pass along their respective level. Let us try to do that. With our Setext-style headers, we only have two different levels to parse, so that should not be too hard of a task.
Parsing the header levels
Add the following code to the previous page, at the beginning of the for-each loop, or go there, if you are still lazy.
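One possible way to detect the level, assuming each block is available as a string, is to test its last line against each regex in turn. The `setextLevel` name and the convention of returning 0 for "not a header" are choices made for this sketch only.

```javascript
const FIRST_LEVEL = /^=+[ \t]*$/;
const SECOND_LEVEL = /^-+[ \t]*$/;

// Hypothetical helper: returns 1 or 2 for a Setext header block,
// and 0 when the block is not a Setext header.
function setextLevel(block) {
  const lines = block.split("\n");
  // A header needs at least a title line and an underline.
  if (lines.length < 2) return 0;
  const underline = lines[lines.length - 1];
  if (FIRST_LEVEL.test(underline)) return 1;
  if (SECOND_LEVEL.test(underline)) return 2;
  return 0;
}

console.log(setextLevel("First header\n============")); // 1
console.log(setextLevel("Multi line\nheader\n------")); // 2
console.log(setextLevel("Just a paragraph."));          // 0
```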
Change the code that outputs the headers to read as follows:
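The output code could be as small as the following sketch. It assumes each parsed header is stored as an object with a `level` and a `title`, which is a shape picked for this example rather than something the page requires.

```javascript
// Hypothetical helper: formats one parsed header for display.
function describeHeader(position, header) {
  return `Header ${position} (h${header.level}): ${header.title}`;
}

// Two headers in the assumed { level, title } shape.
const parsed = [
  { level: 1, title: "First header" },
  { level: 2, title: "Multi line header" },
];

parsed.forEach((header, i) => console.log(describeHeader(i + 1, header)));
```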
Our modifications give us the following result:
Header 1 (h1): First header
Header 2 (h2): Multi line header
Header 3 (h1): Another header
Header 4 (h2): Simple second level header
Perfect! We now detect the header level as well as the header content. Let us carry on with a different task by parsing atx-style headers.
atx
atx-style headers are the most commonly used headers in Markdown. They allow 6 different levels and are preferred when building complicated outlines. For example, this document has 4 levels of headers.
They use 1 to 6 hash characters at the beginning of their line to express their depth level. They look like so:
# First level header
## Second level header
##### Fifth and so forth
They can also include any number of trailing hash characters, but only the leading hashes determine the level of the header.
# First level header #
##### Fifth and so forth ################
With all of that in mind (I once again just paraphrased the official specification), and what we just experienced with the Setext-style headers, atx-style headers are:
- A single line
- This line must start with 1 to 6 hash characters, and can have any number of trailing hashes.
Let us now take a look at the dreaded regex that would translate that into code:
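/^(#{1,6})[ ]*(.*?)[ ]*#*[ \t]*$/;
This one is a little bit more complicated than the previous ones. Note the quantifier {1,6} (with a comma) for "one to six hashes", and that the spaces before the closing hashes are optional, so headers without closing hashes match too. Tools like regex101 explain every bit of black magic that composes that regex. So go and paste it there if you are curious.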
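If you prefer reading code to pasting regexes into a tool, here is a commented breakdown. It is only a sketch: the pattern is written with the `{1,6}` quantifier and with optional spaces before the closing hashes, and `parseAtxHeader` is a hypothetical helper name.

```javascript
// The atx regex, built piece by piece so each part can carry a comment.
const ATX_HEADER = new RegExp(
  "^(#{1,6})" + // group 1: one to six leading hashes (the level)
  "[ ]*" +      // optional spaces after the leading hashes
  "(.*?)" +     // group 2: the title, matched lazily
  "[ ]*#*" +    // optional spaces and closing hashes
  "[ \\t]*$"    // trailing spaces or tabs up to the end of the line
);

// Hypothetical helper: returns { level, title } or null for a single line.
function parseAtxHeader(line) {
  const match = ATX_HEADER.exec(line);
  return match ? { level: match[1].length, title: match[2] } : null;
}

console.log(parseAtxHeader("# First level header #"));
console.log(parseAtxHeader("##### Fifth and so forth ################"));
console.log(parseAtxHeader("Not a header")); // null
```

Because the title group is lazy, it stops as soon as the rest of the line is nothing but spaces, closing hashes and trailing whitespace. That is what keeps the closing hashes out of the title.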
Otherwise, let us carry on with testing it. Let us take our previous code, and first update the test content with the following:
We will also update the regex with the one a few lines above, and use the following logic in our loop:
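Here is a sketch of what the test content and the loop could look like. The content is a guess, put together for this example, and the loop works line by line since an atx header is always a single line. `findAtxHeaders` is a hypothetical name, and the regex uses the `{1,6}` quantifier with optional spaces before the closing hashes.

```javascript
// Test content, made up for this sketch.
const content = [
  "## A header",
  "# A header directly on the next line",
  "",
  "Some regular text that is not a header.",
  "",
  "### A header with closing hashes ###",
  "",
  "###### Another header but this one ends with # ######",
].join("\n");

const ATX_HEADER = /^(#{1,6})[ ]*(.*?)[ ]*#*[ \t]*$/;

// Hypothetical helper: collects every atx header found in the text.
function findAtxHeaders(text) {
  const headers = [];
  for (const line of text.split("\n")) {
    const match = ATX_HEADER.exec(line);
    // The number of leading hashes is the level; group 2 is the title.
    if (match) headers.push({ level: match[1].length, title: match[2] });
  }
  return headers;
}

findAtxHeaders(content).forEach((header, i) => {
  console.log(`Header ${i + 1} (h${header.level}): ${header.title}`);
});
```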
Now we can parse atx-style headers. Running the above code (or following this link), we get this:
Header 1 (h2): A header
Header 2 (h1): A header directly on the next line
Header 3 (h3): A header with closing hashes
Header 4 (h6): Another header but this one ends with #
Rendering headers
The last step is to render our headers. We will reuse the code from the first article and add the code for parsing the Setext and atx-style headers.
Our for-each loop will look like the following:
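A condensed sketch of such a loop follows. It is an approximation built on a few assumptions: blocks are split on blank lines, a block whose last line is an underline is a Setext header, and every other block is scanned line by line for atx headers. The `render` function and the two renderer callbacks passed to it are hypothetical names.

```javascript
const SETEXT_H1 = /^=+[ \t]*$/;
const SETEXT_H2 = /^-+[ \t]*$/;
const ATX = /^(#{1,6})[ ]*(.*?)[ ]*#*[ \t]*$/;

// Hypothetical entry point: the two renderers are passed in as callbacks.
function render(text, renderHeader, renderParagraph) {
  const out = [];
  for (const block of text.trim().split(/\n(?:[ \t]*\n)+/)) {
    if (block === "") continue;
    const lines = block.split("\n");
    const last = lines[lines.length - 1];
    const level = SETEXT_H1.test(last) ? 1 : SETEXT_H2.test(last) ? 2 : 0;
    if (lines.length > 1 && level > 0) {
      // Setext header: everything above the underline is the title.
      out.push(renderHeader(level, lines.slice(0, -1).join(" ")));
      continue;
    }
    // Otherwise, look for atx headers and group the other lines.
    let pending = [];
    const flush = () => {
      if (pending.length > 0) out.push(renderParagraph(pending));
      pending = [];
    };
    for (const line of lines) {
      const match = ATX.exec(line);
      if (match) {
        flush();
        out.push(renderHeader(match[1].length, match[2]));
      } else {
        pending.push(line);
      }
    }
    flush();
  }
  return out.join("\n");
}

// Usage with two minimal renderers, also made up for this sketch.
const html = render(
  "Title\n=====\n\nSome text\non two lines.\n\n## Sub ##\nMore text.",
  (level, title) => `<h${level}>${title}</h${level}>`,
  (lines) => `<p>${lines.join("\n")}</p>`
);
console.log(html);
```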
The renderHeader function contains the following code:
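A minimal take on such a function could look like the sketch below. The signature (a numeric level and a title string) and the level check are assumptions made for this example. It does not escape HTML in the title.

```javascript
// Hypothetical renderer: turns a level and a title into an HTML header.
function renderHeader(level, title) {
  // Only h1 to h6 exist in HTML, so anything else is rejected.
  if (!Number.isInteger(level) || level < 1 || level > 6) {
    throw new RangeError(`Unsupported header level: ${level}`);
  }
  // The title is inserted as-is, minus surrounding whitespace.
  return `<h${level}>${title.trim()}</h${level}>`;
}

console.log(renderHeader(1, "Understanding the Markdown syntax"));
console.log(renderHeader(2, "Headers"));
```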
This code is starting to get long. In the next article of this series we will refactor this code to make it terser. But it works! You could already use this parser / renderer to process a Markdown document that would be comprised of only headers and paragraphs.
Not completely useful yet. To see what it does you can go to this page.
If you have made it all the way here, congrats! 🎉 You have read at least 160 lines of code. From the next article on, I will try to move the parser to a repository on GitHub to make it easier for you to read the code.