A bug in the Sundown and Redcarpet markdown parsers may lead to XSS

Published: 2015-04-07

Preface

In early February 2015, I reported an XSS vulnerability in HackerOne itself. After some investigation, we determined that the vulnerability was due to a bug in version 3.2.2 of the Redcarpet markdown parser … which was due to a bug in the autolink feature in version 1.16.0 of the Sundown markdown parser that Redcarpet was based off of.

In short, text like this:

_danlec_@danlec.com

… isn't handled correctly, and would be rendered as something like

<em>dan<a href="mailto:_danlec_@danlec.com">_danlec_@danlec.com</a>

… and in certain conditions that bug could allow insertion of malicious tags and event handlers.

The bug was present in several other libraries that were based off of Sundown, so while HackerOne was able to mitigate the issue quickly, we were unable to disclose the issue until new library versions were available.

Since a fixed version of Redcarpet is now available, the report was publicly disclosed today.

Curious if a markdown library is potentially vulnerable?

Not all libraries are vulnerable (obviously); it's possible you have nothing to worry about. You can check by seeing how

_nothing_@danlec.com

is rendered.

If you get something like not_nothing_@danlec.com (or no_nothing_@danlec.com) then your parser has the broken behavior that can cause the bug, and you should look into updating it.

(If you're only using your library to format markdown you've written yourself, then there isn't as much cause for concern … but many sites use affected libraries to format markdown from untrusted sources, and they're the ones that should be worried.)

Show me the bug report!

Sure, it's right here

Cast of Characters

HackerOne is a "Security Response and Bug Bounty Platform" that was "[built] from the ground up with security as [a] top priority". The team that built it includes several security engineers and penetration testers, so I'm particularly excited when I'm able to find a vulnerability in their site. They use Redcarpet to render the markdown in the bugs submitted by security researchers.

Redcarpet is a popular Ruby library for Markdown processing. Internally it uses Sundown.

Sundown is a (deprecated) C markdown parser. It hasn't been touched in several years, but it has been used as the basis for several other markdown libraries

I'm a developer at Trello and I run Trello's bug bounty program on HackerOne. Occasionally I'll try to find security issues in the applications that I use.

Background

About a week before submitting this bug, I'd submitted the first XSS vulnerability found in HackerOne itself.

Finding the bug

Based on my earlier report, I knew that I was able to find a security issue with how links were processed, so I kept playing around with other things that were rendered as links.

One thing that is commonly linkified in markdown is an email address, so I was playing around with that. I quickly realized that HackerOne didn't actually link email addresses, but it did appear to recognize them.

For example, while <foo> would render as <foo>, <daniel@example.com> would render as daniel@example.com.

While trying out different things, I noticed that something like

_danlec_@danlec.com

was being rendered as

da_danlec_@danlec.com

That was … a completely unexpected output. Somehow that da part was being repeated, and the  tag wasn't being closed properly.

Investigating the bug

I tried several more inputs to see if I could figure out what was going on

_danlec_@danlec.com -> <u>da_danlec_@danlec.com
_danle_@danlec.com -> <u>da_danle_@danlec.com
_danl_@danlec.com -> <u>da_danl_@danlec.com
_dan_@danlec.com -> <u>da_dan_@danlec.com
_da_@danlec.com -> <u>da_da_@danlec.com

Okay, weird, but so far it seems like the output is the same as the input, but with a da at the front. (Actually, there were also surrounding  tags, but I've left those out because they aren't relevant.)

_d_@danlec.com -> <u>d<_d_@danlec.com

Wha!?

That of course seemed completely wrong, plus now there was the start of a _d_@danlec.com tag on the output.

I wish I could say that at this point I knew exactly what was wrong, but in reality I just tried a bunch of different similar inputs until I observed that

_http://danlec_@.1

rendered as

<p><u><a href="http://danlec">http://danlec<danlec_@.1</p>

and

_http://danlec_@.1 foo=bar

rendered as

<p><u><a href="http://danlec">http://danlec<danlec_@.1 foo=bar</p>

So at this point I was able to inject an arbitrary tag with arbitrary attributes. I didn't know how it worked, but I knew I could do something bad.

Weaponizing the bug

So maybe you've been thinking "there's no such thing as a <danlec_@.1>" element! And you're right … however browsers will still go ahead and render unknown elements as inline elements. They'll also use any styling or event handlers they're given.

Since I could set the style attribute on this element, I could make it bigger, and I could use the onmouseover attribute to execute script.

The input I ended up with was

_http://danlec_@.1
style=background-image:url(... data uri ...);background-repeat:no-repeat;display:block;width:100%;height:100px;
onclick=alert(unescape(/Oh%20No!/.source));return(false);//

… which rendered as a big red box that said "Oh No!" and which could do an alert if you clicked on it … and happened to be using a browser that didn't have support for HackerOne's strict content security policy. (Unfortunately I didn't find a CSP bypass until later…)

I submitted the report, and HackerOne quickly responded, investigated and reported back that they'd "disabled [their] automatic linking functionality to address the issue".

I figured that a bounty and public disclosure were imminent … but it turned out I was overly optimistic, and several weeks went by without the issue being closed.

Digging deeper

After several weeks, I revisited the issue. Reading through HackerOne's response again, I realized that the issue was in the library that they were using to parse the markdown. Because of that, it wasn't as simple as fixing a bug in their own code.

Usually I treat websites I'm investigating as black boxes, and don't spend much time thinking about what's actually going on behind the scenes, but at this point I was pretty curious. I quickly determined (by googling "hackerone markdown") that HackerOne used Redcarpet to render markdown, so I downloaded the latest version of the library to play with.

I verified that Redcarpet did in fact have strange output for an input like _danlec_@danlec.com, and then I looked through the source to see if I could figure out why.

After some digging, I realized the issue came from this function:


size_t
sd_autolink__email(
    size_t *rewind_p,
    struct buf *link,
    uint8_t *data,
    size_t max_rewind,
    size_t size,
    unsigned int flags)
{
    size_t link_end, rewind;
    int nb = 0, np = 0;

    for (rewind = 0; rewind < max_rewind; ++rewind) {
        uint8_t c = data[-rewind - 1];

        if (isalnum(c))
            continue;

        if (strchr(".+-_", c) != NULL)
            continue;

        break;
    }

    if (rewind == 0)
        return 0;

    for (link_end = 0; link_end < size; ++link_end) {
        uint8_t c = data[link_end];

        if (isalnum(c))
            continue;

        if (c == '@')
            nb++;
        else if (c == '.' && link_end < size - 1)
            np++;
        else if (c != '-' && c != '_')
            break;
    }

    if (link_end < 2 || nb != 1 || np == 0 ||
        !isalpha(data[link_end - 1]))
        return 0;

    link_end = autolink_delim(data, link_end, max_rewind, size);

    if (link_end == 0)
        return 0;

    bufput(link, data - rewind, link_end + rewind);
    *rewind_p = rewind;

    return link_end;
}

and how it was used by this function


static size_t
char_autolink_www(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t offset, size_t size)
{
    struct buf *link, *link_url, *link_text;
    size_t link_len, rewind;

    if (!rndr->cb.link || rndr->in_link_body)
        return 0;

    link = rndr_newbuf(rndr, BUFFER_SPAN);

    if ((link_len = sd_autolink__www(&rewind, link, data, offset, size, 0)) > 0) {
        link_url = rndr_newbuf(rndr, BUFFER_SPAN);
        BUFPUTSL(link_url, "http://");
        bufput(link_url, link->data, link->size);

        ob->size -= rewind;
        if (rndr->cb.normal_text) {
            link_text = rndr_newbuf(rndr, BUFFER_SPAN);
            rndr->cb.normal_text(link_text, link, rndr->opaque);
            rndr->cb.link(ob, link_url, NULL, link_text, rndr->opaque);
            rndr_popbuf(rndr, BUFFER_SPAN);
        } else {
            rndr->cb.link(ob, link_url, NULL, link, rndr->opaque);
        }
        rndr_popbuf(rndr, BUFFER_SPAN);
    }

    rndr_popbuf(rndr, BUFFER_SPAN);
    return link_len;
}

… when called by this function


/* parse_inline • parses inline markdown elements */
static void
parse_inline(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t size)
{
    size_t i = 0, end = 0;
    uint8_t action = 0;
    struct buf work = { 0, 0, 0, 0 };

    if (rndr->work_bufs[BUFFER_SPAN].size +
        rndr->work_bufs[BUFFER_BLOCK].size > rndr->max_nesting)
        return;

    while (i < size) {
        /* copying inactive chars into the output */
        while (end < size && (action = rndr->active_char[data[end]]) == 0) {
            end++;
        }

        if (rndr->cb.normal_text) {
            work.data = data + i;
            work.size = end - i;
            rndr->cb.normal_text(ob, &work, rndr->opaque);
        }
        else
            bufput(ob, data + i, end - i);

        if (end >= size) break;
        i = end;

        end = markdown_char_ptrs[(int)action](ob, rndr, data + i, i, size - i);
        if (!end) /* no action from the callback */
            end = i + 1;
        else {
            i += end;
            end = i;
        }
    }
}

I've marked up some bits that were interesting, but here's basically what it tries to do:

Process any inline markdown (bold, italic, etc) or normal text and output the corresponding HTML
If an @ symbol is encountered, check to see if it's the @ in the middle of an email address. If it is, then remove the part of the email address that we had output before we got to the @ (i.e. "rewind"), and output an <a> tag with a mailto:
If there's still more input, goto step 1

So you can imagine how that would play out for an input like this:

This **is** _a_ test@danlec.com

First, we process This; it's plain text so add This to the output.
Now we handle the **is**, which is bold, so we add is to the output. So far, we've output This is.
Now the _a_ is underlined, so we add a to the output and have This is a
Now we handle the test. It looks like plain text, so we add it to the output and end up with This is a test
When we encounter the @ symbol, we realize that we're in the middle of an email address! Because the beginning of the email address is 4 characters long, we chop that off the output (leaving This is a) and then output a link to the email address, ending up with This is a <a href="mailto:test@danlec.com">test@danlec.com</a>

But what happens when we're processing one of the inputs I mentioned earlier, something like _http://danlec_@danlec.com foo=bar?

First, we process the _http://danlec_. That looks like an underlined link, and we end up outputting <a href="http://danlec">http://danlec</a>
Now we see the @. We decide that that's the middle of the email address danlec_@danlec.com. The first part of the email address, danlec_ is 7 characters long, so we chop that many characters off the output (forgetting that some of the characters are parts of HTML tags), leaving <a href="http://danlec">http://danlec< (uh oh!) and then add the link to the email address to get <a href="http://danlec">http://danlec<<a href="mailto:danlec_@danlec.com">danlec_@danlec.com</a>
The foo=bar looks like normal text, so we output it directly and get <a href="http://danlec">http://danlec<<a href="mailto:danlec_@danlec.com">danlec_@danlec.com</a> foo=bar

In HackerOne's case however, they chose not to render email addresses as links, so the output is just

<a href="http://danlec">http://danlec<danlec_@danlec.com foo=bar

So, the root cause for the issue was that the rewind performed when processing an inline element was allowed to step back into previously output elements, and an assumption was being made that the rewind would only contain plain text.

Once I understood what the bug was, I was able to write a patch that addressed the issue, and included that in the report. (I didn't want to make a public pull request, since that would effectively disclose the issue…) The fix was to limit the rewinding so it would never move into text that had already been output as an element.

After that, it still took a while for the change to make it into Redcarpet, but it eventually did (and they even made sure I got credit for it!).

HackerOne's response

While HackerOne was quick to mitigate the issue, it took a couple months for the issue to be completely resolved; mostly due to external factors. Midway through, HackerOne took the unusual step of awarding a bounty for the bug without immediately disclosing it. (We agreed that we wouldn't disclose the issue until Redcarpet had an updated version available.)

HackerOne also did the additional work of reaching out to the Redcarpet maintainers.

About the author:

I'm Daniel LeCheminant, a developer at Trello Inc.

You can follow me on Twitter or e-mail me.

Most recent post:

The most popular things I've written: