Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

string - XSLT parse text node as XML?

In the middle of an XML document I'm transforming, there is a CDATA node which I know itself is composed of XML. I would like to have that "recursively parsed" as XML so that I can transform it too. Upon searching, I think my question is very similar to Handling node containing inner escaped XML.

That was a year ago: may I just clarify the following:

  1. It says this cannot be done by some XSLT in one go: rather you need a two-phase approach. I have just bought a shiny new book on XSLT 2.0. Is it still the case that there is no XSLT instruction to "re-parse" a string node as XML?
  2. In my case the XML-string node is just one node in the whole. Therefore in Phase #1 I would only be transforming a fragment of the input XML document; the rest needs passing through unchanged to Phase #2. I see several solutions for passing input to output unchanged, but often it seems they "mostly work" while skipping or mishandling some kinds of input node. Is there a reliable construct for passing the rest of the input to the output without any changes?
  3. That approach relies on me being able to apply 2 transforms separately. I am limited (existing application) to only being allowed one transform (the XML output is fixed; it is transformed by one XSLT file; the only thing I can do is put whatever I like into that XSLT file, and/or add further XSLT files, but I cannot influence the top-level call to pass the XML through one XSLT file). Is there anything I could put into an XSLT file which could cause the second XSLT transform to be invoked?

1 Answer

See update at end.

  1. is the most important question. It's possible to do; the question is whether you'd have to write an XML parser manually in XSLT, or use an extension function, or whether there's a convenient, portable solution. Update: If you can use Saxon's parse() extension function, that's by far your best bet. Do you have access to that?

  2. is easy to answer: yes, use the identity transform. This will not preserve all lexical details of the input XML, such as order of attributes, or whether <foo/> is written as <foo></foo>. However it will preserve all details that are supposed to matter to XML processors.

    But this won't help you if you can't run 2 stylesheets in a pipeline, right?

  3. Hmm... not robustly. If your output is going to be displayed by a browser, or handled by something else that understands an XML stylesheet processing instruction, you could output one of those, and hope (against the spec's recommendation!) that serialization and parsing would occur in between this stylesheet and the one you associated on output. But this would be very fragile. I say "against the spec's recommendation" because here it says

    When this or any other mechanism yields a sequence of more than one XSLT stylesheet to be applied simultaneously to a XML document, then the effect should be the same as applying a single stylesheet that imports each member of the sequence in order

    which would imply, without serialization and parsing in between. Not recommended.

Update: a new comment says that you don't know in advance which elements will contain CDATA sections. I jumped to the conclusion that this meant you didn't know which elements would contain unparsed data (since XML processors officially don't know or care what elements are in CDATA sections, per se). In that case, all bets are off. As you may know, XML processors are not supposed to know which parts of an XML input doc are in CDATA sections. CDATA is just a different way of escaping markup, an alternative to &lt; etc. Once the data is parsed (which is not properly under the XSLT processor's jurisdiction), you can't tell how it was initially expressed in markup. A left pointy bracket remains a left pointy bracket whether it's expressed as <![CDATA[ < ]]> or &lt;. Just as in C, it doesn't matter whether you specify a character as 'A' or 65 or 0x41; once the program is compiled, your code won't be able to tell the difference.
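To make that concrete, here is a small sketch outside XSLT. It uses Python's standard-library parser purely as a stand-in for "any XML parser"; the sample documents and variable names are made up for illustration. It shows that the CDATA form and the entity form arrive at the application as the same string:

```python
import xml.etree.ElementTree as ET

# Two serializations of the same content (hypothetical sample input):
# one wraps the markup in a CDATA section, the other escapes it with entities.
with_cdata = "<doc><![CDATA[<inner/>]]></doc>"
with_entity = "<doc>&lt;inner/&gt;</doc>"

text_a = ET.fromstring(with_cdata).text
text_b = ET.fromstring(with_entity).text

# After parsing, both are just the text "<inner/>"; nothing records
# which escaping mechanism the author used.
print(text_a == text_b)  # True
```

An XSLT processor sits downstream of exactly this kind of parse, which is why a stylesheet has nothing to test for.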

Therefore, if you don't have another way of determining which data in your input document needs to be parsed, then none of the above methods will help you: you can't know where to apply saxon:parse(), nor manual parsing, nor disable-output-escaping with a following XSLT transformation.

Workarounds:

  • You could guess, e.g. with test="contains(., '&lt;')", which nodes contain unparsed data. (Note this tests for the left pointy bracket, regardless of whether it's expressed as a character entity, or part of a CDATA section, or any other way.) You'd sometimes get false positives, e.g. if a text node contained the string "year < 2001". Or you could attempt to parse every text node (very inefficient), and for those that parse successfully as well-formed XML documents, output the tree instead of the text.

  • Or you could preprocess the XML with a non-XML tool (like LexEv), which therefore can "see" the CDATA markup. But you've said that you can't control anything outside the single XSLT.

  • Or, ideally, you could send the message back up the chain that the XML you're being given is unworkable: they need to flag somehow, other than by using CDATA markup, which sections contain unparsed data. Usually this would be done either by specifying certain element names, or by using attribute flags. Obviously this would depend on who's supplying the XML.
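The first workaround (guess, then attempt a parse) can be sketched as follows. This is only an illustration of the logic in Python's standard library, not something you could drop into a single-XSLT setup; the function names `try_parse` and `expand_embedded_xml` are hypothetical, and the sketch assumes the embedded string is a complete well-formed document with one root element.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def try_parse(text: Optional[str]) -> Optional[ET.Element]:
    """Return the parsed root if text is a well-formed XML document, else None."""
    # Cheap guess first, equivalent to contains(., '<') in XPath.
    if not text or "<" not in text:
        return None
    try:
        return ET.fromstring(text.strip())
    except ET.ParseError:
        # False positive of the guess, e.g. "year < 2001".
        return None


def expand_embedded_xml(root: ET.Element) -> None:
    """Replace leading text that parses as XML with the parsed tree, in place.

    Only each element's .text is examined (not .tail), to keep the sketch short.
    """
    # Snapshot the elements first so inserting children does not
    # disturb the iteration.
    for elem in list(root.iter()):
        parsed = try_parse(elem.text)
        if parsed is not None:
            elem.text = None
            elem.insert(0, parsed)


doc = ET.fromstring(
    "<root><a><![CDATA[<x><y/></x>]]></a><b>year &lt; 2001</b></root>"
)
expand_embedded_xml(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Here `a` ends up with a real `x` child element, while `b` keeps its text because "year < 2001" passes the guess but fails to parse. The cost noted above still applies: every candidate text node triggers a full parse attempt.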

Another update: OK, now I understand: so you know which element contains unparsed data (and you know it's marked up with CDATA), but you don't know which other data might be marked up with CDATA.

the idea was to transform [i.e. parse -Lars] the known CDATA node ("fred") into XML nodes while leaving the whole of the rest of the document as original input, so that it could then be piped through the "general" transformation

For this purpose, "leaving the whole of the rest of the document as original input" does not need to mean preserving any CDATA markup. (The general transformation downstream will not know or care what data is CDATA-escaped.) All that is required is that the one unparsed node gets parsed and the rest does not. The identity transform will do the latter just fine; you can ignore what that page says about CDATA sections on the output... the downstream XSLT will not know or care. (Unless you have additional (non-XML) requirements for the output that you haven't told us about.)

So if you could do a two-stylesheet transform, with serialization and parsing in between (i.e. not in a traditional SAX pipeline, for example), then the identity transform would work: you'd just need an additional template for the known unparsed node, with disable-output-escaping, as in Tomalak's answer here.

But if you can't do a two-step transform... what XSLT processor are you using? There may be other avenues specific to it.

