CT Blog - Reinventing the Markdown parser

From the moment I laid eyes on Markdown, a plain text formatting syntax that can easily be compiled to plain old HTML, I could not help but fall in love with the simplicity and interoperability of its plain text syntax.

My whole life, I have avoided software like Microsoft Word, which, even though very powerful and useful for many, has always felt stuffy and bloated to me.

Ever since I found out about Markdown, this syntax has been my go-to method for any kind of writing task, be it technical or user documentation, meeting minutes, training notes or even this blog post.

Something about the terse and simple approach of the syntax just hits the right spot. It allows me to be more productive and more focused on the content I am producing. It also allows me to share my writing with anyone, in whatever format is necessary, even the dreaded Microsoft Word .docx format, with the help of a tool called Pandoc.

Obviously, the love I have for Markdown is what drives me to create a parser of my own. Now, listen, there are plenty of excellent, performant, well-optimized Markdown parsers out there. But the original parser and renderer was written by John Gruber, using Perl. This makes me think that, perhaps, even though it might not be trivial, it will not be too complicated a task.

But enough with this preface. I hope that you love syntax references, for-loops and regular expressions (a.k.a. regex), because where we are going, there will be quite a bit of these.

I do not know just yet what programming language we will use, so bear with me as I make things up as I go. It will be all the more fun.

Setting up a clear goal

Before we start writing any code, though, we first need to give ourselves a clear goal. This goal will define what kind of syntax our parser will support and what trade-offs we will accept.

Syntax support

I will work with the official syntax (and not with GFM or any other Markdown superset just yet), which can be found on Daring Fireball, the website of John Gruber, creator of the Markdown syntax.

We will support all of the features outlined in this syntax, which means:

Inline HTML
Auto-escaping of special characters
Paragraphs
Headers
Block quotes
Lists
Code blocks
Horizontal rules
Links
Emphasis
Code
Images
Backslash escapes
Automatic links

Supporting all of these means correctly understanding what each aspect of the syntax does and does not do.

We will have to go through the syntax, and start testing the logic used to parse each of these features.
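To give a taste of what "testing the logic" for one feature could look like, here is a minimal sketch for a single item from the list above: "#"-style headers. Since I have not picked a language yet, take Python as a placeholder assumption. The pattern and the function name render_atx_header are made up for illustration and only cover the simple case of one to six leading # characters with optional closing ones. It does not escape special characters, so it is nowhere near a complete implementation.

```python
import re
from typing import Optional

# Hypothetical pattern for "#"-style headers: 1 to 6 leading hashes,
# the header text, then optional trailing hashes that are discarded.
ATX_HEADER = re.compile(r"^(#{1,6})[ \t]*(.+?)[ \t]*#*$")


def render_atx_header(line: str) -> Optional[str]:
    """Return an HTML header for a '#'-style line, or None if it is not one."""
    match = ATX_HEADER.match(line)
    if match is None:
        return None
    level = len(match.group(1))  # number of leading hashes gives the header level
    text = match.group(2)        # header text without surrounding hashes or spaces
    return f"<h{level}>{text}</h{level}>"


if __name__ == "__main__":
    for sample in ["# Hello", "### Title ###", "Just a paragraph"]:
        print(repr(sample), "->", render_atx_header(sample))
```

Each feature in the list would get the same treatment: read the rule in the syntax reference, write down a few input lines, and check what the parsing logic makes of them.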