After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities <, >, ', " and &
The code below fails resulting in an NSXMLParserUndeclaredEntityError
.
// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys:
[NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
[NSString stringWithFormat:@"%C", 0x00E0], @"agrave",
...
,nil];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];
// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}
Attempts to declare the entities by prepending the HTML document with ENTITY declarations will pass, however the expanded entities are not passed back to parser:foundCharacters
and the è and à characters are dropped.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
<!ENTITY agrave "à">
<!ENTITY egrave "è">
]>
In another experiment, I created a completely valid xml document with an internal DTD
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
<!ELEMENT author (#PCDATA)>
<!ENTITY js "Jo Smith">
]>
<author>< &js; ></author>
I implemented the parser:foundInternalEntityDeclarationWithName:value:;
delegate method and it is clear that the parser is getting the entity data, however the parser:foundCharacters
is only called for the pre-defined entities.
2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model:
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before:
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document
I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler
that is used by NSXMLParser
allows for a getEntity
callback to be defined. After calling getEntity
, the expansion of the entity is passed to the characters
callback.
NSXMLParser
is missing functionality here. What should happen is that the NSXMLParser
or its delegate
store the entity definitions and provide them to the xmlSAXHandler
getEntity
callback. This is clearly not happening. I will file a bug report.
In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml
parser on your own is worthwhile.
This has been fun.