First, note that empty HTML elements are, by definition, not nested.
Update: The solution below now applies the empty element regex recursively to remove "nested-empty-element" structures such as <p><strong></strong></p> (subject to the caveats stated below).
Simple version:
This works pretty well (see caveats below) for HTML whose start tag attributes contain no angle brackets (<>). Here it is as an (untested) VB.NET snippet:
Dim RegexObj As New Regex("<(\w+)[^>]*>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
html = RegexObj.Replace(html, "")
Loop
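For readers who want to try the loop without a .NET environment, here is a minimal PHP sketch of the same idea. The function name strip_empty_simple is made up for this illustration and is not part of the original answer; the pattern is the simple one above with its escapes written out.

```php
<?php
// Hypothetical helper (name invented for this sketch): apply the simple
// pattern repeatedly until no empty element is left.
function strip_empty_simple(string $html): string {
    $re = '%<(\w+)[^>]*>\s*</\1\s*>%';
    while (preg_match($re, $html)) {
        // Each pass removes the innermost empty elements, which may
        // leave their parents empty for the next pass.
        $html = preg_replace($re, '', $html);
    }
    return $html;
}

echo strip_empty_simple('<div><p><strong></strong></p> text</div>'), "\n";
// prints: <div> text</div>
```

The first pass removes <strong></strong>, the second removes the now-empty <p></p>, and the loop stops because the <div> still contains text.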
Enhanced version:
<(\w+)(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>
Here is the uncommented enhanced version in VB.NET (untested):
Dim RegexObj As New Regex("<(\w+)(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>")
Do While RegexObj.IsMatch(html)
html = RegexObj.Replace(html, "")
Loop
This more complex regex correctly matches a valid empty HTML 4.01 element even if it has angle brackets in its attribute values (subject, once again, to the caveats below). In other words, it correctly handles all three kinds of start tag attribute values: quoted (which can contain <>), unquoted (which can't), and empty. Here is a fully commented (and tested) PHP version:
function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        (\w+)                # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Repeatedly remove innermost empty elements until none remain.
        $text = preg_replace($re, '', $text);
    }
    return $text;
}
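To see the difference between the simple and enhanced patterns concretely, the sketch below runs both on an element whose quoted attribute value contains a ">". The helper strip_with is a hypothetical name introduced only for this comparison, and the enhanced pattern is written as a one-liner equivalent to the commented PHP version (that is, without the optional "/").

```php
<?php
// Hypothetical helper (not from the answer): apply a pattern until it no
// longer matches.
function strip_with(string $re, string $html): string {
    while (preg_match($re, $html)) {
        $html = preg_replace($re, '', $html);
    }
    return $html;
}

$simple   = '%<(\w+)[^>]*>\s*</\1\s*>%';
$enhanced = '%<(\w+)(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*>\s*</\1\s*>%';

$html = '<p title="a > b"></p><em>kept</em>';

// The simple pattern stops at the ">" inside the attribute value, so the
// empty <p> is not recognized and the input comes back unchanged.
echo strip_with($simple, $html), "\n";

// The enhanced pattern consumes the quoted value as a unit.
echo strip_with($enhanced, $html), "\n";
// prints: <em>kept</em>
```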
Caveats: This function does not parse HTML. It simply matches and removes any text sequence corresponding to a valid empty HTML 4.01 element (which, by definition, is not nested). Note that it also erroneously matches and removes the same text pattern where it occurs outside normal HTML markup, such as within SCRIPT and STYLE elements, HTML comments, and the attributes of other start tags. This regex does not work with short tags. To any bobince fan about to give this answer an automatic downvote: please show me one valid HTML 4.01 empty element that this regex fails to match correctly. This regex follows the W3C spec and really does work.
Update: This regex solution also does not work (and will erroneously remove valid markup) if you do something insanely unlikely (but perfectly valid) like this:
<div att="<p att='">stuff</div><div att="'></p>'">stuff</div>
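Both kinds of failure are easy to reproduce. The following sketch is only an illustration: strip_empty_enhanced is a made-up name wrapping a one-line form of the enhanced pattern. It shows the regex editing the inside of a comment and mangling the two-div example above.

```php
<?php
// Hypothetical wrapper (name invented here) around a one-line form of the
// enhanced pattern.
function strip_empty_enhanced(string $html): string {
    $re = '%<(\w+)(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*>\s*</\1\s*>%';
    while (preg_match($re, $html)) {
        $html = preg_replace($re, '', $html);
    }
    return $html;
}

// Caveat 1: text inside a comment is not markup, but it is removed anyway.
echo strip_empty_enhanced('<!--<b></b>--><p>text</p>'), "\n";
// prints: <!----><p>text</p>

// Caveat 2: the regex starts matching at the "<p" inside the first attribute
// value and ends at the "</p>" inside the second one, deleting everything
// in between, including the first div's content.
$tricky = '<div att="<p att=\'">stuff</div><div att="\'></p>\'">stuff</div>';
echo strip_empty_enhanced($tricky), "\n";
// prints: <div att="'">stuff</div>
```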
Summary:
On second thought, just use an HTML parser!
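For comparison, here is a rough sketch of what the parser route could look like with PHP's bundled DOM extension. It is not from the original answer: the function name strip_empty_elements_dom, the wrapper <div>, and the decision to keep the elements that HTML 4.01 declares EMPTY are all assumptions of this example. It also assumes ASCII or otherwise loadHTML-compatible input, and the serialized output may differ from the input in quoting and escaping.

```php
<?php
// Hypothetical helper (name invented here): remove empty elements via DOM.
function strip_empty_elements_dom(string $html): string {
    // Elements declared EMPTY in HTML 4.01; they never have content, so keep them.
    $void = ['area', 'base', 'basefont', 'br', 'col', 'frame', 'hr',
             'img', 'input', 'isindex', 'link', 'meta', 'param'];

    $previous = libxml_use_internal_errors(true); // silence parse warnings
    $doc = new DOMDocument();
    // Assumption: wrap the fragment so there is one known container.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);
    // Descendants whose only children (if any) are whitespace-only text nodes.
    $query = './/*[not(node()[not(self::text()) or normalize-space() != ""])]';
    do {
        $removed = 0;
        foreach ($xpath->query($query, $wrapper) as $node) {
            if (!in_array($node->nodeName, $void, true)) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    } while ($removed > 0); // parents may become empty after a pass

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_empty_elements_dom('<p><strong></strong></p><em>kept</em>'), "\n";
// prints: <em>kept</em>
```

Because the parser knows where attribute values and comments begin and end, the two-div example and the comment case from the caveats keep their content with this approach.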