Load HTML from XML – Part II

Jan 28, 2009

Update: had to take another look at this due to Flash busting my character entities!

In the last installment, I showed you a function that walks through an XML node’s (multi-part) children and returns an HTML string. This approach is unfortunately flawed: whitespace collapsing is a bit over-eager, resulting in XML such as the following:
This is a <a href="page.html">link</a>

converting to:
This is alink
(Note the missing space.)

Fortunately, I have a solution, and since it leans on regex, it ought to be a good bit more efficient too:

package {

	import StringUtils;
	import CharacterEntity;

	public class XmlUtil {

		static public function getHTMLContent (xml:*):String {
			//trace (typeof(xml) + "   " + xml.toXMLString())

			var html = ""

			// Remember the global XML settings so they can be restored below
			var prettyPrint = XML.prettyPrinting
			XML.prettyPrinting = false
			var ignoreWhite = XML.ignoreWhitespace
			XML.ignoreWhitespace = false

			// Parse strings only once ignoreWhitespace is off; the flag is applied at parse time
			if (typeof (xml) == 'string') xml = new XML(xml)

			var children = xml.children()
			var len = children.length()
			if (len) {
				//trace ('Multiple Children')
				for ( var i=0; i<len; i++ ) {
					var decoded = CharacterEntity.decodeXHTML(children[i].toXMLString() , true)
					html += decoded
				}
				html = StringUtils.removeExtraWhitespace( html )
			} else {
				//trace ('Simple Content')
				var str = StringUtils.removeExtraWhitespace( CharacterEntity.decodeXHTML(xml.toXMLString(), true) )
				html += str
			}

			// Put the global settings back the way we found them
			XML.prettyPrinting = prettyPrint
			XML.ignoreWhitespace = ignoreWhite

			//logger.info ("HTML " + escape(html))

			return html
		}
	}
}

You’ll need two libraries: StringUtils, from the worship-worthy studio of Grant Skinner, and CharacterEntity, originally written for AS2 by Jim Cheng and kindly converted to AS3 by Thirdparty Labs.

The code is a lot simpler now, but for completeness, I’ll give you a quick run-down. If you pass in a String (accessing an attribute or text node could actually cause this), it gets converted to XML before we walk it. We turn ignoreWhitespace off since it’s the source of the issue above, saving the old value so we can restore it on the way out. Then we walk through the children (if they exist), decoding the entities and removing any extra whitespace. If there are no children, the node’s own content gets the same treatment. The “true” parameter on the decodeXHTML method is explained in this post.
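If you want to see the ignoreWhitespace effect in isolation, here is a small sketch that doesn't depend on either library. The WhitespaceDemo class and its joinChildren and collapseWhitespace helpers are made up for illustration; collapseWhitespace is only my assumption about the kind of regex collapsing removeExtraWhitespace does, not its actual source.

```actionscript
package {

	import flash.display.Sprite;

	// Hypothetical demo class; none of these names come from XmlUtil or the two libraries.
	public class WhitespaceDemo extends Sprite {

		public function WhitespaceDemo () {
			var source:String = '<p>This is a <a href="page.html">link</a></p>';

			// Save the global XML settings so they can be put back afterwards.
			var prettyPrint:Boolean = XML.prettyPrinting;
			var ignoreWhite:Boolean = XML.ignoreWhitespace;
			XML.prettyPrinting = false;

			// Flag on while parsing: the space before the link is expected to go missing.
			XML.ignoreWhitespace = true;
			trace( "on:  " + collapseWhitespace( joinChildren( new XML(source) ) ) );

			// Flag off while parsing: the space should survive.
			XML.ignoreWhitespace = false;
			trace( "off: " + collapseWhitespace( joinChildren( new XML(source) ) ) );

			XML.prettyPrinting = prettyPrint;
			XML.ignoreWhitespace = ignoreWhite;
		}

		// Concatenates the markup of every child node, text nodes included.
		static private function joinChildren (xml:XML):String {
			var out:String = "";
			var children:XMLList = xml.children();
			for ( var i:int = 0; i < children.length(); i++ ) {
				out += children[i].toXMLString();
			}
			return out;
		}

		// Assumed stand-in for removeExtraWhitespace: trim the ends, then squeeze runs to one space.
		static private function collapseWhitespace (str:String):String {
			return str.replace(/^\s+|\s+$/g, "").replace(/\s+/g, " ");
		}
	}
}
```

The point to notice is that the flag is set before each `new XML(source)` call. Changing it after the XML has been parsed won't bring back whitespace that was already thrown away.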

Comments

  • jon says:

    I was using this today, and noticed some extra whitespace. Be sure to set

    XML.prettyPrinting = false
    XML.ignoreWhitespace = false
    before you go fetch any XML in the first place or it’ll be too late to get rid of it later. In my case, I went ahead and put these lines in Main.as
  • […] showed you the getHTMLContent function in a previous article. Wrapping the argument in <doc> tags ensures we get all the content regardless of whether the […]
