Monday, April 28, 2008

HTML Comments and JavaScript

So let's say that you want to get at the comments within an HTML document. Why, you ask, would that be useful? Well, for several reasons. Perhaps you have settings, variables, or other data within comments that you want to gather for processing. Perhaps you have trained your website maintenance staff to put identification information (like the day the page was last updated) within comments.

And if you're a Dreamweaver user, comments can also mark where template regions start and end. That's no small thing, so I want to say it again. Web pages built from Dreamweaver templates contain little HTML comments that identify where each editable region begins and ends -- like this:
<!-- InstanceBeginEditable name="PageContent" -->
<!-- InstanceEndEditable -->

These are just comments, like any other comment. They don't affect the layout or rendering of the page in the web browser. They're just there to provide a little meta information.

Now let's say that you want to grab the headline of a page -- but it could be styled any number of ways. If the page is based on Dreamweaver templates, one approach could be to grab the relevant editable region.

If you read the DOM documentation you'll discover that there is a node type for comments. So you might think that you could walk the DOM tree and look for nodes with the right type.
node.nodeType === Node.COMMENT_NODE
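
As a rough sketch of that idea, the walk might look something like the following. The function name collectCommentNodes is made up for illustration, and the constant 8 is the value the DOM specification assigns to Node.COMMENT_NODE. Whether your browser actually keeps the comment nodes around is a separate question.

```javascript
// Hypothetical helper (not part of any library): recursively collects the
// text of every comment node found at or below `root`.
function collectCommentNodes(root) {
  var COMMENT_NODE = 8; // value of Node.COMMENT_NODE in the DOM specification
  var found = [];

  function walk(node) {
    if (node.nodeType === COMMENT_NODE) {
      found.push(node.nodeValue);
    }
    // Text and comment nodes may have no children, so default to an empty list.
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
      walk(children[i]);
    }
  }

  walk(root);
  return found;
}
```

In a browser you would call it as collectCommentNodes(document.documentElement).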

The problem is that support for this is spotty even in the best browsers. For the most part, it looks like comments are simply removed from the page before it is processed. You can see that there are no comments within the document's HTML code -- check in your favorite browser by looking at
document.body.outerHTML

Before we can even begin to figure out how to read the comments, we have to find them. As it turns out, you can do it by using document.documentElement. The complete pre-processed HTML for the page, including comments, can be found by looking at
document.documentElement.outerHTML
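
Not every browser exposes outerHTML on the root element (older Firefox releases did not), so a defensive version might fall back to innerHTML. This is only a sketch: getPageSource is a hypothetical name, and it assumes the browser keeps comments in whichever property it does provide.

```javascript
// Hypothetical helper: returns the page markup as a string, preferring
// outerHTML and falling back to innerHTML where outerHTML is unavailable.
function getPageSource(doc) {
  var root = doc.documentElement;
  if (typeof root.outerHTML === "string") {
    return root.outerHTML;
  }
  // innerHTML leaves out the <html> tag itself but still covers
  // everything inside it, which is where the comments of interest live.
  return root.innerHTML;
}
```

In a browser you would call it as getPageSource(document).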

Now all you have to do is match the comment using a regular expression or a couple of indexOf statements, whichever method you prefer or understand better. *
var exp = "<" + "!" + "--" + "([\\s\\S]+?)" + "--" + ">";
var rexComment = new RegExp(exp);
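
To actually run the expression and pull out every comment on the page, one possible approach is to add the "g" flag and loop with exec(). The helper name findComments is my own invention for this sketch, not anything built in.

```javascript
// Hypothetical helper: returns the text inside every HTML comment in `html`.
function findComments(html) {
  // Built in pieces so the pattern itself never looks like a comment tag.
  var exp = "<" + "!" + "--" + "([\\s\\S]+?)" + "--" + ">";
  var rexComment = new RegExp(exp, "g"); // "g" lets exec() step through every match
  var comments = [];
  var match;
  while ((match = rexComment.exec(html)) !== null) {
    comments.push(match[1]); // capture group 1 is the text between the markers
  }
  return comments;
}
```

In a browser you would pass it the page source, for example findComments(document.documentElement.outerHTML).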

Moreover, the really big win is that you can identify data based on its placement within a template. That's really important. Sure, templates are built to control the display of content within a website and make the pages consistent. But they also allow flexibility, so editors can change only the parts of the page that are relevant to their work -- headlines and bylines and stories. So naturally, if we're interested in the parts of the page that are about the content, then we'll be interested in the stuff that's in the template regions.

This really expanded my way of thinking about templates. They're not just about pretty -- they're about data structure.
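
Here is a minimal sketch of grabbing one editable region by name, using the indexOf approach. The name extractEditableRegion is hypothetical, and the code assumes the markers are spaced exactly like the Dreamweaver examples shown earlier in this post; real pages may vary.

```javascript
// Hypothetical helper: returns the markup between the InstanceBeginEditable
// comment for `regionName` and the next InstanceEndEditable comment,
// or null if either marker is missing.
function extractEditableRegion(html, regionName) {
  // Markers are built in pieces so they never form a literal comment tag.
  var beginMarker = "<" + "!-- InstanceBeginEditable name=\"" + regionName + "\" --" + ">";
  var endMarker = "<" + "!-- InstanceEndEditable --" + ">";

  var start = html.indexOf(beginMarker);
  if (start === -1) {
    return null;
  }
  start += beginMarker.length; // skip past the opening marker

  var end = html.indexOf(endMarker, start);
  if (end === -1) {
    return null;
  }
  return html.substring(start, end);
}
```

For the example region above, extractEditableRegion(pageSource, "PageContent") would give back whatever markup sits between the two comments, where pageSource is the string you got from document.documentElement.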

* A Note on "exp="
You'll notice that I broke the expression apart into pieces -- that's to keep any browser from biting on the comment tag, as it might if the tag were written all together as one piece. This
exp="<!--([\\s\\S]+?)-->"
is the same as this
exp="<"+"!"+"--"+"([\\s\\S]+?)"+"--"+">"

3 comments:

acai Berry said...

Dreamweaver contain moer things where start and end of template regions.It has an important role to build a web page.
Acai Berry

Sam said...

Can you add information on how to execute this regex? basically I just want to retrieve all comments in a page.

Im still puzzled because I can't make outerHTML work.

andrei said...

Hey,
I wasn't able to access the document.documentElement.outerHTML, using firebug on firefox 3.6.
I was however able to access document.documentElement.innerHTML, which seemed to include the html comments.