Whether you’re scraping content from a website or simply dealing with the “tag soup” generated by your own site’s WYSIWYG editor, you probably know that reliably parsing HTML is a pain at best and extremely difficult at worst. Not only do you have to contend with unpredictable content, but there’s no guarantee that the content you try to parse will be well-formed.

In my first pass at CFGloss, I made a pretty big error in parsing the ColdFusion documentation: I tried to do it with regular expressions. Ultimately, it worked out *okay*, but it was inefficient, unpredictable, and resulted in miles and miles of regular expression soup. Now that I’m overhauling CFGloss, I want to revisit how I parsed the scraped content and try to make it better.

Enter jsoup. This is an awesome little Java library that takes the headache out of parsing HTML. Besides turning unpredictable and potentially malformed HTML into something usable, jsoup is packed with some awesome features, most notably CSS/jQuery-esque selectors for finding and manipulating parsed HTML content.

An Example

To get a feel for the tip of the iceberg of what jsoup can do, let’s take an example from the Railo documentation. Take a look at the source of this page: http://railodocs.org/index.cfm/function/each/version/current. Our goal, ultimately, will be to retrieve the “main” content of this page.

Include the Library

First, get the library here: http://jsoup.org/download. You can either add this to your class path, or if you’re using JavaLoader, include it that way. I’m using ColdBox for my site (you’re not?), so first I’ll add the following to my ColdBox.cfc config:

settings = {
    javaloader_libpath = "path-to-custom-java-files/includes/java/"
};

Then, using ColdBox’s magic injection capabilities, I can simply inject it wherever I need it like so:

component {
    property name="jSoup" inject="javaLoader:org.jsoup.Jsoup";
    ...
}

That’s way too easy 🙂
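If you go the class path route instead, the injection can be swapped for a plain createObject() call. This is only a minimal sketch, assuming the jsoup jar is already on the server’s class path and the server has been restarted to pick it up:

```cfml
component {

    // Assumption: the jsoup jar is on the class path, so JavaLoader is not needed.
    function init() {
        // Jsoup.parse() is a static method, so a handle to the class is enough;
        // there is no need to call init() on the Java object.
        variables.jsoup = createObject( "java", "org.jsoup.Jsoup" );
        return this;
    }

}
```

Either way, the rest of the code only cares that a variable named jsoup points at org.jsoup.Jsoup.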

Get the Content

Once we’ve included the jsoup library, we can retrieve the HTML content and start manipulating it. Let’s get the content:

var httpService = new http();
httpService.setURL( arguments.url );
var html = httpService.send().getPrefix().fileContent;
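As an aside, jsoup also ships with its own HTTP client, so the fetch and the parse can be combined. The sketch below is not what CFGloss does; it assumes the jsoup reference from earlier, a hypothetical helper name (fetchDocument), and that jsoup’s default connection settings are acceptable for the target site:

```cfml
// Hypothetical helper: fetch a page with jsoup's own HTTP client and
// get back a parsed jsoup Document in one step.
function fetchDocument( required string url ) {
    // connect() builds the request; get() executes it and parses the response
    return jsoup.connect( arguments.url ).get();
}
```

Sticking with the http service, as in this post, keeps the request under ColdFusion’s control and leaves jsoup with just the parsing.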

Now that we have the raw HTML content, the fun can begin. The very first thing we need to do is to parse the content. This will give us a jsoup Document, which is simply a nice object full of all sorts of awesome methods that we can use to further manipulate our content:

var jsoupDocument = jsoup.parse( html );
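To see what “parsing” buys us with genuinely broken markup, here is a small sketch. The helper name (normalizeHTML) is made up for illustration, and it assumes the same jsoup reference as above:

```cfml
// Hypothetical helper: run markup through jsoup and return the body's
// inner HTML after jsoup has built a proper document tree from it.
function normalizeHTML( required string html ) {
    // parse() never throws on bad markup; it repairs what it can
    var doc = jsoup.parse( arguments.html );
    // body() is the <body> element jsoup creates if the input had none
    return doc.body().html();
}
```

Calling normalizeHTML( "<p>Roasted <b>carrot soup" ) should hand back the same content with the b and p tags closed, which is exactly the kind of cleanup that is painful to do with regular expressions.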

With our jsoup Document in hand, we’re ready to start manipulating the results. If we look at the source, we’ll find that the meat of the content we want to display is nested in a section tag with an id of “function_description”. Since jsoup leverages CSS/jQuery-esque selectors, we can use the same kinds of selectors on our jsoup Document that we’re already used to using in CSS and JavaScript. To extract all the content matching this selector, we simply use the select() method of our Document, passing the desired selector (the doubled ## is just how CFML escapes a literal # inside a string):

var matchedHTML = jsoupDocument.select( "section##function_description" );

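As with jQuery, the select() method gives us a collection of matches: a jsoup Elements object, which is a list of elements that is simply empty if nothing matched. Calling html() on that collection returns the combined inner HTML of every match, so all we need to do is check that we got at least one: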

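// default to an empty string when nothing matches
var returnHTML = "";
// if we have a match...
if( isArray( matchedHTML ) && arrayLen( matchedHTML ) ) {
    // html() returns the combined inner HTML of all matched elements
    returnHTML = matchedHTML.html();
}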
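Selectors can go well beyond a single id. As a rough sketch (the helper name getLinkTargets is hypothetical, and it assumes the section actually contains links), here is a descendant selector combined with an attribute selector:

```cfml
// Hypothetical helper: collect the href of every link inside the matched section.
function getLinkTargets( required any jsoupDocument ) {
    var hrefs = [];
    // descendant + attribute selector: only <a> tags that have an href,
    // and only those nested inside the section we care about
    var links = arguments.jsoupDocument.select( "section##function_description a[href]" );
    for( var link in links ) {
        // attr() reads an attribute value; text() would give the link text instead
        arrayAppend( hrefs, link.attr( "href" ) );
    }
    return hrefs;
}
```

If nothing matches, the loop simply never runs and the function returns an empty array.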

Remove Unwanted Content

That’s awesome! With only a few lines of code, we were able to parse random HTML content into a sane object that allowed us to specifically target the content–and only the content–that we wanted to retrieve from the page’s HTML.

However, if we look more closely at the content we retrieved, we’ll notice that there is still content within our matched content that we might like to get rid of. For example, we have some script tags, some buttons, and even some forms. Out of the context of the original site, this content is not wanted, so it would be nice to remove it.

As expected, jsoup makes this trivial. Once we have a Document, we can use selectors as many times as we need to target content and, if desired, remove it before spitting out our final HTML.

To do this, let’s start by defining an array of selectors that we’d like to match and remove:

// define "remove" selectors
var removeList = [ "script", "button", "##addItem", "form" ];

With our removeList in hand, we can simply iterate over the array and squash the content that we no longer want by calling the remove() method on each matched item:

// loop over remove selectors
for( var item in removeList ) {
    var removeMatches = matchedHTML.select( item );
    // loop over matches
    for( var match in removeMatches ) {
        // remove the match
        match.remove();
    }
}
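The nested loops can also be collapsed. jsoup accepts a comma-separated group of selectors, and the Elements collection has its own remove() that detaches every match at once. A minimal sketch, assuming the same removeList and matchedHTML variables as above and living in the same function:

```cfml
// Join the selectors into a single group, e.g. "script,button,#addItem,form"
var removeSelector = arrayToList( removeList );

// select() finds everything matching any selector in the group;
// remove() on the resulting Elements removes each match from the document
matchedHTML.select( removeSelector ).remove();
```

The explicit loop is easier to step through when debugging a single selector; the one-liner is handy once the list has settled down.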

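Pretty awesome, and pretty easy, right? Just remember to call html() on matchedHTML after the removals are done; the returnHTML string we captured earlier was built before anything was removed, so it won’t reflect them.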

Wrapping Up

That’s all there is to it. Using jsoup, we can easily transform HTML from any source (especially those sources not under our control) into a sane representation that we can then manipulate at will. This is definitely preferable to trying to do the same with regular expressions, and provides a mechanism for wrapping this functionality into a common service/plugin that can be used anywhere within our project.
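To make the “common service” idea concrete, here is one way the pieces from this post could be pulled together. Treat it as a sketch rather than the actual CFGloss code: the function name (extractContent) and its arguments are made up for illustration, and the injection assumes the same ColdBox/JavaLoader setup shown earlier.

```cfml
component {

    // same injection as earlier in the post
    property name="jsoup" inject="javaLoader:org.jsoup.Jsoup";

    /**
     * Hypothetical helper: returns the inner HTML of everything matching
     * `selector`, minus anything matching the selectors in `removeList`.
     * Returns an empty string when nothing matches.
     */
    public string function extractContent(
        required string html,
        required string selector,
        array removeList = arrayNew( 1 )
    ) {
        // parse the raw markup and grab the elements we care about
        var matches = jsoup.parse( arguments.html ).select( arguments.selector );

        // strip unwanted content from inside the matches
        for( var item in arguments.removeList ) {
            matches.select( item ).remove();
        }

        // read the HTML only after the removals, so they are reflected in the result
        return matches.html();
    }

}
```

For the Railo page in this post, the call would look something like extractContent( html, "section##function_description", [ "script", "button", "##addItem", "form" ] ).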