Regex match HTML content without screwing up the tags
When highlighting words in a string that contains HTML, we soon ran into problems whenever the word we were searching for also appeared inside a tag.
Imagine the example:
<a href="geekzilla.aspx">you searched for geekzilla</a>
If I wanted to bold all occurrences of geekzilla, I'd usually do this:
string html = @"<a href=""geekzilla.aspx"">you searched for geekzilla</a>";
html = Regex.Replace(html, "(geekzilla)", "<b>$1</b>");
Unfortunately, when dealing with HTML rather than just text, this will screw up my tag and produce the following:
<a href="<b>geekzilla</b>.aspx">you searched for <b>geekzilla</b></a>
We did a lot of googling and found loads of people discussing ways to ignore the tags. Suggestions ranged from SAX parsers to character-by-character loops (nasty).
Armed with an excellent regex for matching an entire HTML tag, we came up with the following solution.
Our Solution
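Use a custom Regex match evaluator to ignore any tags. The pattern matches either a whole tag (group 1) or the search word (group 2); the evaluator returns tags unchanged and wraps only the search word. This works well and is very fast. There may be a slicker way to do this; I hope someone is inspired enough to figure it out and post a comment.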
// Requires: using System.Text.RegularExpressions;
private string replaceString = "";

public string Parse(string content)
{
    // matches an entire HTML tag, e.g. <a href="..."> or </a>
    const string regTagName = @"<.[^>]*>";

    // group 1 = a whole tag, group 2 = the word to highlight
    Regex reg = new Regex(@"(" + regTagName + ")|(geekzilla)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // this is what I'd like to replace the match with
    replaceString = "<b>$1</b>";

    // do the replace
    content = reg.Replace(content, new MatchEvaluator(MatchEval));
    return content;
}

protected string MatchEval(Match match)
{
    if (match.Groups[1].Success)
    {
        // the tag: return it untouched
        return match.ToString();
    }
    if (match.Groups[2].Success)
    {
        // the text we're interested in: wrap it using replaceString
        return Regex.Replace(match.ToString(), "(.+)", replaceString);
    }
    // everything else
    return match.ToString();
}
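The same idea can be sketched for an arbitrary search term rather than the hard-coded geekzilla. This is only an illustration, not part of the original solution: the class name HtmlHighlighter and the method name Highlight are made up for the example, and it assumes the term should be matched literally, so it is passed through Regex.Escape first. It reuses the tag pattern from Parse unchanged, so tags whose attribute values contain a ">" character would still trip it up.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class HtmlHighlighter
{
    // Same tag pattern as the Parse method above.
    private const string TagPattern = @"<.[^>]*>";

    // Wraps every occurrence of term that lies outside a tag in <b>...</b>.
    public static string Highlight(string content, string term)
    {
        if (string.IsNullOrEmpty(content) || string.IsNullOrEmpty(term))
        {
            return content;
        }

        // Regex.Escape stops characters such as '.' or '+' in the term
        // from being treated as regex syntax.
        Regex reg = new Regex(
            "(" + TagPattern + ")|(" + Regex.Escape(term) + ")",
            RegexOptions.IgnoreCase);

        // Group 2 is the search term; anything else matched is a tag.
        return reg.Replace(content, match =>
            match.Groups[2].Success
                ? "<b>" + match.Value + "</b>"  // text hit: wrap it
                : match.Value);                 // tag: return untouched
    }
}
```

Calling HtmlHighlighter.Highlight(html, "geekzilla") on the example link from the top of the article leaves the href alone and bolds only the link text. Because the replacement is built from match.Value, the original casing of each hit is preserved.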
Author: Paul Hayman
Published: Monday, 26 November, 2007
Paul is the COO and co-founder of kwiboo ltd (http://www.kwiboo.com/) and is also the creator of GeekZilla. He has more than a decade of IT consultancy experience, has consulted for a number of blue chip companies, and has been exposed to the following sectors: Utilities, Telecommunications, Insurance, Media, Investment Banking, Leisure, Legal, CRM, Pharmaceuticals, Interactive Gaming, Mobile Communications, Online Services.