Whether you’re scraping content from a website or simply dealing with the “tag soup” generated by your own site’s WYSIWYG, you probably know that reliably parsing HTML is a pain at best and extremely difficult at worst. Not only do you have to contend with unpredictable content, but there’s no guarantee that the content you try to parse will be well-formed.

In my first pass at CFGloss, I made a pretty big error in parsing the documentation from ColdFusion: I tried to do it with regular expressions. Ultimately, it worked out *okay*, but it was inefficient, unpredictable, and resulted in miles and miles of regular expression soup. Now that I’m overhauling CFGloss, I want to revisit how I parsed the scraped content and try to make it better.

Enter jsoup. This is an awesome little Java library that takes the headache out of parsing HTML. Besides turning unpredictable and potentially malformed HTML into something usable, jsoup is packed with some awesome features, most notably CSS/jQuery-esque selectors for finding and manipulating parsed HTML content.
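For a quick taste, here is a minimal sketch in plain Java, assuming the jsoup jar is on the classpath. The HTML snippet and the `navLinkTexts` helper are made up for illustration and aren't part of jsoup or CFGloss. The only jsoup calls used are `Jsoup.parse`, `select`, and `text`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupSketch {

    // Hypothetical helper: collects the text of every <a> found
    // inside an element with the class "nav".
    static List<String> navLinkTexts(String html) {
        // Jsoup.parse builds a normalized document tree, even from sloppy markup
        Document doc = Jsoup.parse(html);

        List<String> texts = new ArrayList<>();
        // select() takes a CSS-style selector, much like jQuery
        for (Element link : doc.select(".nav a")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up tag soup: unquoted attributes, unclosed <li> and <p> tags
        String html = "<ul class=nav>"
                + "<li><a href=/docs>Docs</a>"
                + "<li><a href=/blog>Blog</a>"
                + "</ul><p>Unclosed paragraph";

        System.out.println(navLinkTexts(html));
    }
}
```

The point of the sketch is that the selector does the work a pile of regular expressions would otherwise have to do, and the missing closing tags never have to be handled by hand.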

An Example

To get a feel for the tip of the iceberg of what jsoup can do, let’s take an example from the Railo documentation. Take a look at the source of this page: http://railodocs.org/index.cfm/function/each/version/current.