Load HTML from XML – Part II
Update: had to take another look at this due to Flash busting my character entities!
In the last installment, I showed you a function that walks through an XML node’s (multi-part) children and returns an HTML string. Unfortunately, that approach is flawed: the whitespace collapsing is a bit over-eager, resulting in XML such as the following:
This is a <a href="page.html">link</a>
converting to:
This is alink
(Note the missing space.)
Fortunately, I not only have a solution, but since it uses regex, it ought to be a good bit more efficient as well:
package {
    import StringUtils;
    import CharacterEntity;

    public class XmlUtil {
        // Returns the content of an XML node (or an XML string) as an HTML string.
        static public function getHTMLContent (xml:*):String {
            //trace (typeof(xml) + " " + xml.toXMLString())
            var html:String = "";
            // Remember the global XML settings so they can be restored on exit.
            var prettyPrint:Boolean = XML.prettyPrinting;
            var ignoreWhite:Boolean = XML.ignoreWhitespace;
            XML.prettyPrinting = false;
            XML.ignoreWhitespace = false;
            try {
                // ignoreWhitespace takes effect at parse time, so strings are parsed only after it is off.
                if (typeof (xml) == 'string') xml = new XML(xml);
                var children:XMLList = xml.children();
                var len:int = children.length();
                if (len)
                {
                    //trace ('Multiple Children')
                    for ( var i:int=0; i<len; i++ )
                    {
                        var decoded:String = CharacterEntity.decodeXHTML(children[i].toXMLString(), true);
                        html += decoded;
                    }
                    html = StringUtils.removeExtraWhitespace( html );
                }
                else
                {
                    //trace ('Simple Content')
                    var str:String = StringUtils.removeExtraWhitespace( CharacterEntity.decodeXHTML(xml.toXMLString(), true) );
                    html += str;
                }
            } finally {
                // Restore the global settings even if parsing or decoding throws.
                XML.prettyPrinting = prettyPrint;
                XML.ignoreWhitespace = ignoreWhite;
            }
            //logger.info ("HTML " + escape(html))
            return html;
        }
    }
}
You’ll need two libraries: StringUtils, from the worship-worthy studio of Grant Skinner, and CharacterEntity, originally written for AS2 by Jim Cheng and kindly converted to AS3 by Thirdparty Labs.
The code is a lot simpler now, but for completeness, here’s a quick run-down. If you pass in a String (accessing an attribute or text node can actually cause this), it is converted to XML. We turn ignoreWhitespace off, since it’s the source of the issue above, along with prettyPrinting, and restore both settings before returning. Then we walk through the children (if there are any), decoding the entities, and remove any extra whitespace. The “true” parameter on the decodeXHTML method is explained in this post.
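To make the whitespace clean-up step a little more concrete, here is a minimal sketch of what regex-based collapsing can look like. This is not the StringUtils source: the class name WhitespaceSketch and the function collapseWhitespace are hypothetical, and the sketch assumes that "extra whitespace" means leading and trailing whitespace plus any run of consecutive whitespace characters.

```actionscript
package {
    // Hypothetical helper class for illustration only; not part of StringUtils.
    public class WhitespaceSketch {
        // Trims both ends of the string, then collapses every run of
        // whitespace (spaces, tabs, newlines) into a single space.
        public static function collapseWhitespace(text:String):String {
            if (text == null) return "";
            return text.replace(/^\s+|\s+$/g, "").replace(/\s+/g, " ");
        }
    }
}
```

Calling WhitespaceSketch.collapseWhitespace on a string such as "  This is a \n\t <a href=\"page.html\">link</a>  " gives back This is a <a href="page.html">link</a>, with the single space before the link intact. A run of whitespace is reduced to one space rather than removed, which is exactly what keeps "a" and "link" apart.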
I was using this today and noticed some extra whitespace. Be sure to set
XML.ignoreWhitespace = false
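As a rough sketch of where that setting matters: XML.ignoreWhitespace is applied while the XML is being parsed, so it has to be off before the XML object is created, not just before the conversion to HTML. The document class GetHtmlContentDemo below is hypothetical and only shows a call site; XmlUtil is assumed to be the class from this post, compiled alongside it.

```actionscript
package {
    import flash.display.Sprite;

    // Hypothetical document class, used only to show a call site.
    public class GetHtmlContentDemo extends Sprite {
        public function GetHtmlContentDemo() {
            // Remember the global setting so it can be put back afterwards.
            var saved:Boolean = XML.ignoreWhitespace;
            // Turn it off before parsing; otherwise the space after
            // "This is a" is trimmed while the XML object is being built.
            XML.ignoreWhitespace = false;
            var node:XML;
            try {
                node = new XML('<p>This is a <a href="page.html">link</a></p>');
            } finally {
                XML.ignoreWhitespace = saved;
            }
            // XmlUtil is assumed to be the class from this post.
            trace(XmlUtil.getHTMLContent(node));
        }
    }
}
```

With the setting off during parsing, the text node keeps its trailing space, so the traced HTML should no longer run "a" and "link" together.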