Outsharked: June 2012

Insights and discoveries
from deep in the weeds

Wednesday, June 27, 2012

CsQuery Performance vs. Html Agility Pack and Fizzler

I put together some performance tests to compare CsQuery to the only practical alternative that I know of (Fizzler, an HtmlAgilityPack extension). I tested against three different documents:

The sizzle test document (about 11 k)
The wikipedia entry for "cheese" (about 170 k)
The single-page HTML 5 spec (about 6 megabytes)

The overall results are:

HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1 to 2.6x longer to load the document. More on this below.
CsQuery is faster for almost everything else. Sometimes by factors of 10,000 or more. The one exception is the "*" selector, where sometimes Fizzler is faster. For all tests, the results are completely enumerated; this case just results in every node in the tree being enumerated. So this doesn't test the selection engine so much as the data structure.
CsQuery did a better job at returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and the numbers match those returned by CsQuery. This is probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler - it only supports simple values (not formulae).

The most dramatic results come from selecting a single ID, or a nonexistent ID, in a large document. CsQuery returns the result (an empty set, in the nonexistent case) over 100,000 times faster than Fizzler. This is almost certainly because Fizzler doesn't index on IDs; its other selectors are much faster than this (though still substantially slower than CsQuery).
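
To make the ID case concrete, here is a toy model in JavaScript. It is only a sketch of the general idea and not how either CsQuery or Fizzler is actually implemented; `buildTree`, `findByIdScan` and `buildIdIndex` are made-up names. Without an index, proving that an ID does not exist means touching every node. With an index, you pay one extra pass at load time and each lookup afterwards is a single hash probe.

```javascript
// Toy DOM: each node has an id and a list of children.
function buildTree(depth, fanout, counter = { n: 0 }) {
    const node = { id: 'n' + counter.n++, children: [] };
    if (depth > 0) {
        for (let i = 0; i < fanout; i++) {
            node.children.push(buildTree(depth - 1, fanout, counter));
        }
    }
    return node;
}

// No index: walk the whole tree, counting how many nodes we touch.
function findByIdScan(root, id) {
    let visited = 0;
    const stack = [root];
    while (stack.length > 0) {
        const node = stack.pop();
        visited++;
        if (node.id === id) {
            return { node, visited };
        }
        for (const child of node.children) stack.push(child);
    }
    return { node: null, visited };
}

// Index: one pass up front (the extra load-time cost), then constant-time lookups.
function buildIdIndex(root) {
    const index = new Map();
    const stack = [root];
    while (stack.length > 0) {
        const node = stack.pop();
        if (!index.has(node.id)) index.set(node.id, node);
        for (const child of node.children) stack.push(child);
    }
    return index;
}

const root = buildTree(6, 4);             // 5461 nodes
const index = buildIdIndex(root);

const scan = findByIdScan(root, 'no-such-id');
console.log(scan.node, scan.visited);     // null 5461 (every node was visited)
console.log(index.get('no-such-id'));     // undefined, after a single lookup
```

The same trade-off shows up in the load numbers above: building the index is work that has to happen while the document is parsed.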

Size Matters

In very small documents (the 11k sizzle test document), CsQuery still beats Fizzler, but by much less. The margin for the ID selector is still pretty substantial, at about 15x faster. For more complex selectors, CsQuery is anywhere from just over 1x to 3x faster.

On the other hand, in very large documents, the edge that Fizzler has in loading the documents seems to mostly disappear. CsQuery is only about 10% slower at loading the 6 megabyte "large" document. This could be an opportunity for optimizing CsQuery - this seems to indicate that overhead just in creating a single document is dragging performance down. Or, it could be indicative of the makeup of the respective test documents. Maybe CsQuery does better with more elements, and Fizzler with more text - or vice versa.

You can see a detailed comparison of all the tests so far in a Google doc.

"FasterRatio" is how much faster the winner was than the loser. Yellow ones are CsQuery; red ones are Fizzler.

Red in the "Same" column means the two engines returned different results.

Try It Out

This output can be created directly from the CsQuery test project under "Performance."

CsQuery is a complete CSS selector engine and jQuery port for .NET4 and C#. It's on NuGet as CsQuery. For documentation and more information please see the GitHub repository and posts about CsQuery on this blog.

Tuesday, June 26, 2012

CsQuery 1.1.2 Released

CsQuery 1.1.2 has been released. You can get it from NuGet or from the source repository on GitHub.

New features

This release includes significant revisions to the HTML parser to enhance compatibility with HTML5 parsing rules for optional opening and closing tags.

When optional closing tags are omitted, such as </p>, CsQuery's HTML parser will use the HTML5 spec rules to determine when to insert a closing tag. When opening tags for required elements such as head and tbody are omitted, the parser will generate the missing tags when parsing in document mode. This means you can expect a very high degree of compatibility between the HTML (and selections) generated by CsQuery, and the DOM rendered by web browsers, when valid HTML is passed.

The HTML5 spec also includes a set of rules for handling invalid markup. While the CsQuery parser usually makes pretty good decisions about how to handle bad HTML, and should be able to parse just about anything, it doesn't yet comply with the "bad markup" part of the spec, only the "optional" handling part. Over time, though, I intend to continue improving the parser to comply with other parts of the spec as much as possible.

API Change

Because the HTML parser will generate tags now, it needs to understand context. If you're creating a fragment that's just supposed to be a building block, you obviously don't want it adding html and body tags around your markup.

There are now three static methods for parsing HTML:

CQ.Create(..)
Create a content block.

This method is meant to be used for complete HTML blocks that are not self-contained documents. Examples of this are a piece of content retrieved from a CMS, or a template. It should be used for anything that is a complete block, but is intended to be embedded in another document. Using this method, missing tags will be handled according to the HTML5 spec EXCEPT for adding the optional html and body tags. Additionally, any text found at the root of the markup will be wrapped in span tags, making it safe to insert into nodes that cannot have text directly as children.

CQ.CreateDocument(..)
Create a document.

This method creates a complete HTML document. If the html, body or head tags are missing, they will be created. Stranded text nodes (e.g. outside of body) will be moved inside the body. If you're parsing HTML from the web or from a file that's supposed to represent a complete HTML document, use this.

CQ.CreateFragment(..)
Create a fragment.

This method interprets the content as a true fragment that you can use for any purpose. No new elements will be created. The rules for optional closing tags are still honored -- to do otherwise would just result in the default handling for any broken/unclosed tag being used instead. But no optional tags like tbody will be generated even if they are expected to be found. This method is the default handling for creating HTML from a selector, e.g.
```
var html = dom["<div></div>"];
```

Other Enhancements

The jQuery :input pseudoclass was added. It had been inadvertently omitted from prior versions.
All selectors can now include escaped characters.
The HTML parser now permits all valid characters in class and attribute names. Previously, the : and . characters were stop characters.
The CQ object's property indexer overloads now align with the Select method overloads.
Migrated all of the tests from Sizzle. (A few of the bugs fixed in this release were found as a result of implementing the Sizzle test suite).

Bug Fixes

Issue #12: CSS class names being output in lowercase
Issue #11: :hidden selector not selecting input[type=hidden]
Issue #8: allow leading + and - signs in nth-child type equations
Corrected a problem with some last-child selectors (found during Sizzle unit test migration, no bug report)

This release has also had some performance optimizations; nth-child type selectors in particular should be an order of magnitude faster as a result of caching the results of each calculation.
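
For readers who haven't worked with nth-child equations, here is a small JavaScript sketch of the two ideas mentioned above: accepting leading + and - signs in an "an+b" equation (Issue #8), and caching each result so a repeated test costs a lookup instead of a calculation. This is an illustration only, not CsQuery's actual code; `parseNth`, `matchesNth` and the cache layout are assumptions made for the example.

```javascript
// Parse an nth-child style equation into { a, b } for "an+b".
function parseNth(equation) {
    const text = equation.replace(/\s+/g, '').toLowerCase();
    if (text === 'odd') return { a: 2, b: 1 };
    if (text === 'even') return { a: 2, b: 0 };

    // Plain integer, optionally signed: "3", "+3", "-2"
    if (/^[+-]?\d+$/.test(text)) return { a: 0, b: parseInt(text, 10) };

    // "an", "an+b", "an-b" where a may be "", "+", "-" or a signed integer
    const match = /^([+-]?\d*)n([+-]\d+)?$/.exec(text);
    if (!match) throw new Error('Invalid nth-child equation: ' + equation);

    const rawA = match[1];
    const a = rawA === '' || rawA === '+' ? 1
            : rawA === '-' ? -1
            : parseInt(rawA, 10);
    const b = match[2] ? parseInt(match[2], 10) : 0;
    return { a, b };
}

const cache = new Map();   // "equation|index" -> boolean
let computed = 0;          // how many results were actually calculated

// True when the 1-based index equals a*n + b for some integer n >= 0.
function matchesNth(equation, index) {
    const key = equation + '|' + index;
    if (!cache.has(key)) {
        const { a, b } = parseNth(equation);
        const diff = index - b;
        const result = a === 0 ? diff === 0 : diff % a === 0 && diff / a >= 0;
        cache.set(key, result);
        computed++;
    }
    return cache.get(key);
}

const positions = [1, 2, 3, 4, 5, 6];
const firstThree = positions.filter(i => matchesNth('-n+3', i));   // [1, 2, 3]
const oddOnes = positions.filter(i => matchesNth('+2n+1', i));     // [1, 3, 5]
console.log(firstThree, oddOnes, computed);                        // computed is 12
```

Running either filter a second time leaves `computed` unchanged, which is the whole point of the cache.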

Tuesday, June 19, 2012

ImageMapster 1.2.5 released

After 9 months I've finally released an update to ImageMapster. Download the latest release distribution or go to github to see the source.

Since 1.2.4 much has changed. If you've been following along with development, a lot of this may be old news, but this covers most of what's changed since the last official release.

New Features

clickNavigate allows binding a URL to an area, just like a regular HTML imagemap! Seriously, this was a common request - sometimes people just wanted the map to highlight areas on mouseover, but otherwise act the same. You could always do this by capturing a click event and then setting window.location, but this option streamlines it.

It offers a few conveniences, e.g. when an area has only href='#' then it will not navigate even when this option is enabled, and if any valid href target is found on any area in a group, then it will be used no matter which area in the group is clicked.
A new keys option allows you to obtain a list of keys associated with an area or area group. That is, you can assign more than one key to an area, e.g. this area:
```
<area href="#" data-key="area1, group1" coords="...">
```
has two keys, "area1" and "group1." This lets you create different, independent groups which you can control separately. The first key on the list is always the primary key, though, and determines whether something is considered "selected". So sometimes, given a key, you want to find out other keys associated with it, so you can select or deselect associated areas in response to an action. This option gives you easy access to data on the relationships between area keys.
mouseoutDelay option lets you specify a time in milliseconds that a highlighted area will remain highlighted after the mouse leaves. (If another area is highlighted before this time elapses, the old highlight will be removed immediately.) This is useful for sparse maps, e.g. maps with large areas that aren't part of the map and only small highlighted areas. Because a user's pointer may only be over the area briefly, the effect could appear flickery or jerky. This allows you to keep it highlighted for some time after they leave to avoid this problem.
Rendering options can be passed on-the-fly with set, allowing you to have complete control over the appearance of every area without having to define area options up front.
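
To picture how multiple keys relate to each other, here is a standalone sketch of the bookkeeping involved. It does not call ImageMapster at all; `parseKeys`, `primaryKey` and `buildKeyMap` are hypothetical helpers written for this example, and the real keys option may organize its data differently.

```javascript
// Split a data-key attribute value such as "area1, group1" into its keys.
function parseKeys(dataKey) {
    return dataKey.split(',').map(k => k.trim()).filter(k => k.length > 0);
}

// The first key in the list is the primary key.
function primaryKey(dataKey) {
    return parseKeys(dataKey)[0];
}

// Map each key to the set of every key that shares an area with it.
function buildKeyMap(dataKeys) {
    const related = new Map();
    for (const dataKey of dataKeys) {
        const keys = parseKeys(dataKey);
        for (const key of keys) {
            if (!related.has(key)) related.set(key, new Set());
            for (const other of keys) related.get(key).add(other);
        }
    }
    return related;
}

// data-key values as they might appear on three <area> elements
const areas = ['area1, group1', 'area2, group1', 'area3'];
const keyMap = buildKeyMap(areas);

console.log(primaryKey(areas[0]));         // area1
console.log([...keyMap.get('group1')]);    // [ 'area1', 'group1', 'area2' ]
console.log([...keyMap.get('area3')]);     // [ 'area3' ]
```

Given "group1", the map tells you that "area1" and "area2" belong to it, which is the kind of lookup you need to select or deselect associated areas together.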

Bug Fixes, Improvements

Many compatibility and stability improvements to resolve conflicts with browser plugins (AdBlock in particular) and solve some browser issues. Fading effects now work consistently in IE 6-8 too.
More robust binding to handle situations that caused problems such as the imagemap being initially hidden or extremely slow-loading images
Tooltips can be positioned outside the boundaries of the image. A few bugs related to tooltip positioning were fixed.
rebind and snapshot have been cleaned up a lot, allowing you to chain events to create complex initial effects. For example, this code would bind a map using a set of options defined in initial_opts, then highlight "CA" using the "fill" and "fillColor" options shown, then finally take a snapshot and rebind with a different set of options basic_opts. All the effects that were rendered before the snapshot will now be part of a static backdrop. Fiddle with it.
```
    $('img').mapster(initial_opts)
        .mapster('set',true,'CA', {
            fill: true,
            fillColor: '00ff00'
        })
        .mapster('snapshot')
        .mapster('rebind',basic_opts);
```
resize has been improved to increase smoothness and performance. A bug that caused its callback to be fired at the wrong time has been fixed.

What's next?

First, I'm not going to wait 9 months to make a new release next time. The delay was a result of being dissatisfied with the state of JavaScript testing frameworks for testing complex UI tools. I never felt comfortable calling this a "release" while the tests were a mess. That was probably a mistake, since thousands of people downloaded the old version even though I knew it had many bugs that have since been fixed. I won't make that mistake again.

The next major release will include a new API as an option. That is, instead of calling mapster with mapster('method',...) you will be able to obtain an actual mapster object and call its methods directly, e.g.


    var mapster = $('img').mapster(initial_opts);
    mapster.set('CA', { fill: true, fillColor: '00ff00' })
        .snapshot()
        .rebind(basic_opts);

While sticking to the jQuery model makes sense to a point, this tool has become sufficiently built-out that it's a hindrance when doing anything beyond the basics. The old methods will still be perfectly valid.

There will be panning and zooming. I started coding some more sophisticated zoom effects that work with "resize" to let you easily zoom directly to an area. I stopped when I realized feature creep was preventing me from getting a new release finished and fixing bugs. Now it's time to get back to that.

Better tooltips. Lots of people ask about controlling the position and functionality of tooltips. I plan to add some better integrated support for tooltip manipulation.

Feature selection. I broke the source code into modules some time ago because it was becoming unwieldy as a single file. My secondary goal in doing this was to allow one to create custom builds using only the features needed. For example, if you don't care about tooltips, why include that extra code? This is really more of a web site feature than anything else; it is (almost) possible to exclude some modules now.

What else? Let me know if you have ideas, or want to contribute!

Wednesday, June 13, 2012

CsQuery 1.1 Released, and available on NuGet

CsQuery 1.1 has been released. This is a major milestone; the library now implements every CSS2 and CSS3 selector.

Additionally, CsQuery is now available on NuGet:

    PM> Install-Package CsQuery

There are two important API changes from prior versions.

The IDomElement.NodeName property now returns its value in uppercase. Formerly, it was returned in lowercase. So any code that tests for node type with a string will break, e.g.
```
    CQ results = dom["div, span"];
    foreach (IDomObject item in results) {
        // before 1.1: if (item.NodeName=="div") {
        if (item.NodeName=="DIV") {
            ...
        }
    }
```
I realize this can easily break code in ways that the compiler cannot detect, and I apologize for that, but it is important to be consistent with the browser DOM. This was a long time coming.

The CsQuery.Server object has been removed. Methods for loading a DOM from an http server have been replaced with static methods on the CQ object:

    // synchronous
    var doc = CQ.CreateFromUrl("http://www.jquery.com");
  
    // asynchronous with delegates to call upon completion
    CQ.CreateFromUrlAsync("http://www.jquery.com", responseSuccess => {
        Dom = responseSuccess.Dom;
    }, responseFail => {
        ..
    });

    // asynchronous using IPromise (similar to C#5 Task)
    var promise = CQ.CreateFromUrlAsync("http://www.jquery.com");
    var promise2 = CQ.CreateFromUrlAsync("http://www.cnn.com");

    promise.Then(successDelegate);
    promise2.Then(successDelegate,failDelegate);

    When.All(promise,promise2).Then(allFinishedDelegate);

See Creating a new DOM and Promises in the readme for more details.

New Features in 1.1

Implemented all missing CSS pseudoclass selectors:

    :nth-last-of-type(N)              :nth-last-child(N)    
    :nth-of-type(N)                   :only-child
    :only-of-type                     :empty
    :last-of-type                     :first-of-type

Implemented all missing jquery pseudoclass selectors:

    :parent                           :hidden
    :header

Added IDomObject.Name property
Added IDomObject.Type property

Bug Fixes

Don't consider html node a child when targeted by child-targeting selectors (consistent with browser behavior)
Fix checkbox lists in Forms.RestorePost
Pseudoselectors from a descendant combinator only returning direct descendant matches (e.g., div :empty)
Issue #5 - Remove enforcement of unique id attribute when parsing HTML

CsQuery is a complete port of jQuery written in C# for .NET4. For documentation and more information please see the GitHub repository and posts about CsQuery on this blog.

Thursday, June 7, 2012

Async web gets and Promises in CsQuery

More recent versions of jQuery introduced a "deferred" object for managing callbacks using a concept called Promises. Though this is less relevant for CsQuery because your work won't be interactive for the most part, there is one important situation where you will have to manage asynchronous events: loading data from a web server.

Making a request to a web server can take a substantial amount of time, and if you are using CsQuery for a real-time application, you probably won't want to make your users wait for the request to finish.

For example, I use CsQuery to provide current status information on the "What's New" section for the ImageMapster web site. I do this by scraping GitHub and parsing out the relevant information. But I certainly do not want to cause anyone to wait while the server makes a remote web request to GitHub (which could be slow or inaccessible). Rather, the code keeps track of the last time it updated its information using a static variable. If the information has become "stale", it initiates a new async request, and when that request is completed, it updates the cached data.

So, the http request that actually triggered the update will be shown the old information, but there will be no lag. Any requests coming in after the request to GitHub has finished will of course use the new information. The code looks pretty much like this:

    private static DateTime LastUpdate;
    
    if (LastUpdate.AddHours(4) < DateTime.Now) {

        // stale - start the update process. The actual code makes three
        // independent requests to obtain commit & version info

        var url = "https://github.com/jamietre/ImageMapster/commits/master";
        CQ.CreateFromUrlAsync(url)
           .Then(response => {
               LastUpdate = DateTime.Now;
               var gitHubDOM = response.Dom;
               ... 
               // use CsQuery to extract needed info from the response
           });
    }

    ...

    // render the page using the current data - code flow is never blocked even if an update
    // was requested
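
The same serve-stale-then-refresh pattern can be sketched in a self-contained way in JavaScript. Nothing here is CsQuery code: `createStatusCache`, `fetchStatus` and the injectable `now` clock are names invented for the example, and the `refreshing` flag is an addition that is not in the code above (it just prevents several overlapping refreshes from starting while one is in flight).

```javascript
const FOUR_HOURS = 4 * 60 * 60 * 1000;

// fetchStatus: function returning a promise for fresh data
// now: clock function, injectable so the behavior can be tested
function createStatusCache(fetchStatus, now = () => Date.now()) {
    let lastUpdate = 0;          // timestamp of the last successful refresh
    let cached = null;           // data served to every caller
    let refreshing = false;      // avoid starting overlapping refreshes

    return {
        get() {
            if (!refreshing && lastUpdate + FOUR_HOURS < now()) {
                refreshing = true;
                fetchStatus().then(
                    data => { cached = data; lastUpdate = now(); refreshing = false; },
                    () => { refreshing = false; }   // keep the old data on failure
                );
            }
            return cached;       // never waits for the refresh
        }
    };
}

const cache = createStatusCache(() => Promise.resolve('v1.2.5'));
const first = cache.get();                        // null: refresh started, old value returned
setTimeout(() => console.log(cache.get()), 0);    // v1.2.5, once the refresh has finished
```

The caller that triggers the refresh gets whatever was cached before, exactly as described above; only later callers see the new data.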

Though C# 5 includes some language features that greatly improve asynchronous handling such as `await`, I didn't want to "wait", and the promise API used often in JavaScript is actually extraordinarily elegant. Hence I decided to make a basic C# implementation to assist in using this method.

The `CreateFromUrlAsync` method can return an `IPromise` object. The basic promise interface (from CommonJS Promises/A) has only one method:

    then(success,failure,progress)

The basic use in JS is this:

    someAsyncAction().then(successDelegate,failureDelegate);

When the action is completed, "success" is called with an optional parameter from the caller; if it fails, "failure" is called.

I decided to skip progress for now; handling the two callbacks in C# requires a bit of overloading because function delegates can have different signatures. The CsQuery implementation can accept any delegate that has zero or one parameters, and returns either void or a value. A promise can also be generically typed, with the generic type identifying the type of parameter that is passed to the callback functions. So the signature for `CreateFromUrlAsync` is this:

    IPromise<ICsqWebResponse> CreateFromUrlAsync(string url, ServerConfig options = null)

This makes it incredibly simple to write code with success & failure handlers inline. By strongly typing the returned promise, you don't have to cast the delegates, as in the original example: the `response` parameter is implicitly typed as `ICsqWebResponse`. If I wanted to add a fail handler, I could do this:

    CQ.CreateFromUrlAsync(url)
        .Then(responseSuccess => {
            LastUpdate = DateTime.Now;
             ...
        }, responseFail => {
             // do something
        });

CsQuery provides one other useful promise-related function called `When.All`. This lets you create a new promise that resolves when every one of a set of promises has resolved. This is especially useful for this situation, since it means you can initiate several independent web requests, and have a promise that resolves only when all of them are complete. It works like this:

    var promise1 = CQ.CreateFromUrlAsync(url);
    var promise2 = CQ.CreateFromUrlAsync(url);

    CsQuery.When.All(promise1,promise2).Then(successDelegate, failDelegate);

You can also give it a timeout which will cause the promise to reject if it has not resolved by that time. This is valuable for ensuring that you get a resolution no matter what happens in the client promises:

    // Automatically reject after 5 seconds

    CsQuery.When.All(5000,promise1,promise2)
        .Then(successDelegate, failDelegate);
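
For comparison, the same "all of these, or give up after a timeout" behavior can be sketched with native JavaScript promises. This is not the CsQuery implementation and not when.js; `whenAll` and `delay` are helper names made up for the example, built on the standard `Promise.all` and `Promise.race`.

```javascript
// Resolve when every promise resolves; reject on the first failure or on timeout.
function whenAll(timeoutMs, ...promises) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(
            () => reject(new Error('timed out after ' + timeoutMs + ' ms')),
            timeoutMs
        );
    });
    return Promise.race([Promise.all(promises), timeout])
        .finally(() => clearTimeout(timer));   // don't leave the timer running
}

// Helper for the demo: a promise that resolves with value after ms milliseconds.
const delay = (ms, value) => new Promise(resolve => setTimeout(resolve, ms, value));

whenAll(500, delay(10, 'a'), delay(20, 'b'))
    .then(values => console.log(values));            // [ 'a', 'b' ]

whenAll(20, delay(10, 'a'), delay(200, 'slow'))
    .then(null, err => console.log(err.message));    // timed out after 20 ms
```

Racing the combined promise against a timer is what guarantees a resolution one way or the other, no matter what the individual requests do.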

`When` is a static object that is used to create instances of promise-related functions. You can also use it to create your own deferred entities:

    var deferred = CsQuery.When.Deferred();
    
    // a "deferred" object implements IPromise, and also has methods to resolve or reject

    deferred.Then(successDelegate, failDelegate);
    deferred.Resolve();   // causes successDelegate to run

What's interesting about promises, too, is that they can be resolved *before* the appropriate delegates have been bound and everything still works:

    var deferred = CsQuery.When.Deferred();

    deferred.Resolve();
    deferred.Then(successDelegate, failDelegate);   // successDelegate runs immediately
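
To see why binding late still works, here is a deliberately tiny deferred written in JavaScript. It is a sketch of the idea only, not CsQuery's `When.Deferred` and not a full Promises/A implementation: `createDeferred` is a made-up name, callbacks run synchronously, and `then` returns the same object instead of a new promise. The trick is that the deferred remembers its outcome, so a handler bound after the fact is simply run on the spot.

```javascript
function createDeferred() {
    let state = 'pending';       // 'pending' | 'resolved' | 'rejected'
    let value;
    const handlers = [];         // handlers bound while still pending

    function run(handler) {
        const callback = state === 'resolved' ? handler.success : handler.failure;
        if (callback) callback(value);
    }

    function settle(newState, newValue) {
        if (state !== 'pending') return;     // a deferred settles only once
        state = newState;
        value = newValue;
        handlers.splice(0).forEach(run);     // run and forget the stored handlers
    }

    return {
        resolve: v => settle('resolved', v),
        reject: v => settle('rejected', v),
        then(success, failure) {
            const handler = { success, failure };
            if (state === 'pending') handlers.push(handler);   // run later
            else run(handler);                                 // already settled: run now
            return this;
        }
    };
}

const log = [];

const early = createDeferred();
early.resolve('done');
early.then(v => log.push('bound after resolve: ' + v));    // runs immediately

const late = createDeferred();
late.then(v => log.push('bound before resolve: ' + v));
late.resolve('done');                                       // runs the stored handler

console.log(log);
// [ 'bound after resolve: done', 'bound before resolve: done' ]
```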

I may completely revisit this once VS2012 is out; the `await` keyword cleans things up a little bit, and the `Task.WhenAll` feature does the same thing as `When.All` here. By the way - the basic API and operation for "when" was 100% inspired by Brian Cavalier's excellent when.js project which I use extensively in JavaScript.

Monday, June 4, 2012

Using CsQuery with MVC views

Update 7/17/2012: The source repository now includes a complete MVC example project that implements a custom view engine using CsQuery, allowing you to simply add methods to a controller to have access to the page's markup before rendering as a CQ object, e.g.

    public class AboutController : CsQueryController
    {

        public ActionResult Index()
        {
           
            return View();
            
        }

        // runs for the "Index" action after the ActionResult is returned,
        // providing access to the final HTML before it's rendered

        public void Cq_Index()
        { 
            // add the "highlight" class to all anchors

            Doc["a"].AddClass("highlight");
        }
    }

Take a look at the MVC example for more information. The contents of this blog post are accurate but the example provides much more detail as well as a complete implementation, since it's not completely trivial to intercept the final HTML for a page in an MVC application.

Original Post

I've been neglecting CsQuery, the C# jQuery port, lately, and I feel bad about that. But I haven't forgotten it. Quite the opposite: I'm gearing up to create the first formal release, get it onto NuGet, and publish a web site with interactive demos and documentation. It's going to take a little while to move this all forward but it's in progress.

There's been a spark of outside interest in the project in the last month or so, which has inspired me to get moving again on some of this stuff. Things always slow down at work in the summer so the timing is good and I hope to have this thing in a more consumer-friendly format soon.

In the meantime, here's a nugget from Rick Strahl about rendering MVC views as strings. If you're using CsQuery with ASP.NET MVC, this is a technique you will almost certainly use to feed your MVC markup into CsQuery for further manhandling.

I described a similar technique in this question. Rick's post encapsulates this cleanly in a class. To get from there to a CsQuery object is a piece of cake:

    string message=ViewRenderer.RenderView("~/views/template/ContactSellerEmail.cshtml",
        model,ControllerContext);

    // create a CsQuery object from the HTML string
    CQ messageDom = CQ.Create(message);

    // do stuff...
    messageDom["#content-placeholder"].ReplaceWith(...);

    // render it back to a string of HTML
    message = messageDom.Render();