Regex match HTML content without screwing up the tags

When we needed to highlight words in a string containing HTML, we soon ran into problems whenever the word we were searching for also appeared inside a tag.

Imagine the example:

<a href="geekzilla.aspx">you searched for geekzilla</a>

If I wanted to bold all occurrences of geekzilla, I'd usually do this:

string html = @"<a href=""geekzilla.aspx"">you searched for geekzilla</a>";
// $1 refers to the first capture group, so the pattern needs parentheses
html = Regex.Replace(html, "(geekzilla)", "<b>$1</b>");

Unfortunately, when dealing with HTML rather than just text, this will screw up my tag and produce the following:

<a href="<b>geekzilla</b>.aspx">you searched for <b>geekzilla</b></a>

We did a lot of googling and found loads of people discussing ways to ignore the tags. Suggestions ranged from SAX parsers to character-by-character loops (nasty).

Armed with an excellent regex for matching an entire HTML tag, we came up with the following solution.

Our Solution

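Use a custom Regex match evaluator to ignore any tags. The pattern matches either a whole tag or the search term; the evaluator returns tags untouched and wraps only the search term. This works well and is very fast. There may be a slicker way to do this; I hope someone is inspired enough to figure it out and post a comment.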

// Requires: using System.Text.RegularExpressions;
private string replaceString = "";
public string Parse(string content)
{
    // matches a whole tag: "<", any one character, then everything up to ">"
    const string regTagName = @"<.[^>]*>";

    // group 1 captures a tag, group 2 captures the search term
    Regex reg = new Regex(@"(" + regTagName + ")|(geekzilla)",
                     RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // this is what I'd like to replace the match with
    replaceString = "<b>$1</b>";

    // do the replace
    content = reg.Replace(content, new MatchEvaluator(MatchEval));

    return content;
}

protected string MatchEval(Match match)
{
    if (match.Groups[1].Success) 
    {
        // the tag
        return match.ToString();
    }
    if (match.Groups[2].Success) 
    {
        // the text we're interested in
        return Regex.Replace(match.ToString(), "(.+)", replaceString);
    }
    // everything else
    return match.ToString();
}
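
For readers who want the search term passed in rather than hard-coded, here is a minimal sketch of the same idea. It is not the original code: the names HtmlHighlighter, Highlight and TagPattern are hypothetical, the term is assumed to be non-empty, and a lambda stands in for the shared replaceString field. It keeps the article's tag pattern and borrows the Regex.Escape call from the VB version in the comments.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlHighlighter
{
    // Same tag pattern as the article: "<", one character, then anything up to ">".
    private const string TagPattern = @"<.[^>]*>";

    // Wraps every occurrence of term that lies outside a tag in <b>...</b>.
    // Assumes term is non-empty.
    public static string Highlight(string content, string term)
    {
        // Regex.Escape keeps characters such as "." or "+" in the term literal.
        var regex = new Regex(
            "(" + TagPattern + ")|(" + Regex.Escape(term) + ")",
            RegexOptions.IgnoreCase);

        return regex.Replace(content, match =>
            match.Groups[2].Success
                ? "<b>" + match.Value + "</b>"   // search term in text: wrap it
                : match.Value);                  // tag: return it unchanged
    }
}
```

Calling HtmlHighlighter.Highlight(html, "geekzilla") on the example string from the top of the article should leave the href alone and bold only the word in the link text. Because the replacement is built inside the lambda, no state is shared between calls.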
Author Paul Hayman

Paul is the COO and co-founder of kwiboo ltd (http://www.kwiboo.com/) and the creator of GeekZilla. He has more than 20 years of IT consultancy experience, has consulted for a number of blue chip companies, and has worked in the following sectors: Utilities, Telecommunications, Insurance, Media, Investment Banking, Leisure, Legal, CRM, Pharmaceuticals, Interactive Gaming, Mobile Communications, Online Services.

Comments

tomblos said:

Excellent solution! Just what i was looking for. Thnx!

07/May/2008 14:03 PM

ajay said:

how to do this in javascript. i don't understand with the match evaluator.

i want to change from

<p>bandung padalarang</p>

to

<p>bandung <b>p</b>adalarang</p>

without affecting tag <p>...

thanks..

29/May/2008 17:53 PM

Dominic Turner said:

Ajay - custom match handlers are a feature of the RegEx library of .NET - which I learnt about today (thanks Paul for this excellent piece of work).

This is a VB version of the above:

Private MatchReplacement As String = ""

Private Function Highlight(ByVal inStr As String, ByVal arrTerms() As String) As String
    Dim ProcessedText As String = inStr

    For i As Integer = 0 To arrTerms.Length - 1
        'RegEx.Escape - Matches the characters literally, suppressing the meaning of special characters.
        Dim Term As String = Regex.Escape(arrTerms(i))
        Dim TagExpression As String = "<.[^>]*>"

        Select Case (i Mod 4)
            Case 0
                MatchReplacement = "<span style='background:" + HighlightColour_YELLOW + ";'>$&</span>"
            Case 1
                MatchReplacement = "<span style='background:" + HighlightColour_GREEN + ";'>$&</span>"
            Case 2
                MatchReplacement = "<span style='background:" + HighlightColour_PINK + ";'>$&</span>"
            Case 3
                MatchReplacement = "<span style='background:" + HighlightColour_CYAN + ";'>$&</span>"
        End Select

        'Highlight Search Term
        Dim reg As New Regex("(" + TagExpression + ")|(" + Term + ")", RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        'Cunningly replace only search terms where they are not within an HTML tag (Match HTML tag OR SearchTerm - but only do the replace if it is NOT a tag match)
        ProcessedText = reg.Replace(ProcessedText, New MatchEvaluator(AddressOf MatchEval))
    Next

    Return ProcessedText
End Function

Protected Function MatchEval(ByVal match As Match) As String
    If match.Groups(1).Success Then
        'A tag
        Return match.ToString()
    End If

    If match.Groups(2).Success Then
        'The search term
        Return Regex.Replace(match.ToString(), "(.+)", MatchReplacement)
    End If

    'Anything else
    Return match.ToString()
End Function

14/Aug/2008 16:25 PM

Jeff said:

I jacked with trying to get this functionality for half the day. Your elegant solution was just what I needed. Thank you for posting!

09/Feb/2009 01:05 AM
