Insights and discoveries
from deep in the weeds
Outsharked

Wednesday, June 27, 2012

CsQuery Performance vs. Html Agility Pack and Fizzler

I put together some performance tests to compare CsQuery to the only practical alternative that I know of: Fizzler, an extension for HtmlAgilityPack (HAP). I tested against three different documents:

  • The Sizzle test document (about 11 KB)
  • The Wikipedia entry for "cheese" (about 170 KB)
  • The single-page HTML5 spec (about 6 MB)

The overall results are:

  • HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1x to 2.6x longer to load the document. More on this below.
  • CsQuery is faster for almost everything else, sometimes by factors of 10,000 or more. The one exception is the "*" selector, where Fizzler is sometimes faster. For all tests, the results are completely enumerated, and this selector simply enumerates every node in the tree. So it tests the data structure more than the selection engine.
  • CsQuery did a better job at returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and the numbers match those returned by CsQuery. This is probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler - it only supports simple values (not formulae).
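
To make the nth-child point concrete, here is a minimal sketch of what supporting formulae involves compared with simple values. It is written in Python purely for illustration and is not taken from CsQuery or Fizzler. The helper names `parse_nth` and `matches_nth` are hypothetical.

```python
import re

# Hypothetical helpers for illustration only; not CsQuery's or Fizzler's code.
_FORMULA = re.compile(r"^([+-]?\d*)n([+-]\d+)?$")

def parse_nth(expr):
    """Return (a, b) for an nth-child argument such as '3', 'odd', or '2n+1'."""
    text = expr.replace(" ", "").lower()
    if text == "odd":
        return 2, 1
    if text == "even":
        return 2, 0
    match = _FORMULA.match(text)
    if match:
        a_text, b_text = match.group(1), match.group(2)
        if a_text in ("", "+"):
            a = 1
        elif a_text == "-":
            a = -1
        else:
            a = int(a_text)
        b = int(b_text) if b_text else 0
        return a, b
    # No 'n' present: a simple value such as '3' means a = 0, b = 3.
    return 0, int(text)

def matches_nth(index, a, b):
    """True if the 1-based child index equals a*n + b for some integer n >= 0."""
    if a == 0:
        return index == b
    n, remainder = divmod(index - b, a)
    return remainder == 0 and n >= 0

if __name__ == "__main__":
    a, b = parse_nth("2n+1")
    print([i for i in range(1, 8) if matches_nth(i, a, b)])  # [1, 3, 5, 7]
```

An engine that only handles simple values effectively implements the `a == 0` branch alone, which is why a selector like `li:nth-child(2n+1)` can return different results between engines.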

The most dramatic results come from selecting a single ID, or a nonexistent ID, in a large document. For a nonexistent ID, CsQuery returns the result (an empty set) over 100,000 times faster than Fizzler. This is almost certainly because Fizzler doesn't index on IDs. Other selectors are much faster in Fizzler than this one, though still substantially slower than in CsQuery.
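
The gap between an indexed and an unindexed ID lookup can be sketched as follows. This is a simplified Python model, not the actual data structure used by CsQuery or HtmlAgilityPack. The `Node` class and all function names are assumptions made up for the example.

```python
class Node:
    """A toy DOM node (hypothetical; real libraries store far more)."""
    def __init__(self, tag, node_id=None, children=None):
        self.tag = tag
        self.node_id = node_id
        self.children = children or []

def walk(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)

def find_by_id_scan(root, wanted):
    """No index: every lookup visits every node, even when nothing matches."""
    return [n for n in walk(root) if n.node_id == wanted]

def build_id_index(root):
    """Pay one full traversal up front (at load time) to map id -> nodes."""
    index = {}
    for n in walk(root):
        if n.node_id is not None:
            index.setdefault(n.node_id, []).append(n)
    return index

def find_by_id_indexed(index, wanted):
    """With an index: a single dictionary lookup, whatever the document size."""
    return index.get(wanted, [])
```

In this model a missing ID is the worst case for the scan, because it has to walk the whole tree before it can return an empty set, while the indexed lookup returns immediately. The price is the extra traversal in `build_id_index`, which is the same kind of trade-off as a slower load in exchange for faster queries.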

Size Matters

In very small documents (the 11k sizzle test document), CsQuery still beats Fizzler, but by much less. The difference for the ID selector is still pretty substantial, at about 15x faster. For more complex selectors, the margin ranges from just over 1x to 3x faster.

On the other hand, in very large documents, the edge that Fizzler has in loading documents mostly disappears. CsQuery is only about 10% slower at loading the 6 megabyte "large" document. This could be an opportunity for optimizing CsQuery: it suggests that fixed overhead in creating a single document is dragging performance down on smaller inputs. Or it could reflect the makeup of the respective test documents. Maybe CsQuery does better with more elements and Fizzler with more text, or vice versa.

You can see a detailed comparison of all the tests so far in this Google doc:

"FasterRatio" is how much faster the winner was than the loser. Yellow ones are CsQuery; red ones are Fizzler.

Red in the "Same" column means the two engines returned different results.
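
For anyone reading the table, the "FasterRatio" column can be thought of as the sketch below. The function name and the choice of the slower time divided by the faster time are my assumptions for illustration. The test project may compute it differently (for example from iterations per second).

```python
def faster_ratio(csquery_ms, fizzler_ms):
    """Return (winner, ratio), where ratio = slower time / faster time.

    Assumed definition for illustration; both timings must be positive.
    """
    if csquery_ms <= 0 or fizzler_ms <= 0:
        raise ValueError("timings must be positive")
    if csquery_ms <= fizzler_ms:
        return "CsQuery", fizzler_ms / csquery_ms
    return "Fizzler", csquery_ms / fizzler_ms

if __name__ == "__main__":
    # Made-up timings in milliseconds, not measurements from these tests.
    print(faster_ratio(2.0, 500.0))  # ('CsQuery', 250.0)
    print(faster_ratio(3.0, 1.5))    # ('Fizzler', 2.0)
```

Under this definition the ratio is always at least 1, and the winner's name corresponds to the row color described above.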

Try It Out

This output can be created directly from the CsQuery test project under "Performance."


CsQuery is a complete CSS selector engine and jQuery port for .NET 4 and C#. It's on NuGet as CsQuery. For documentation and more information, please see the GitHub repository and posts about CsQuery on this blog.

2 comments:

  1. Excellent comparison! Do you know of a similar test comparing changing of the source, not just querying it? For example, changing all image sources and replacing a string in all text elements? A test like that would be complementary to this methinks! I would do it myself but I don't know how.

  2. I haven't done a test like that, but I suspect HTML Agility Pack would be faster, for the same reasons it's faster when parsing: it doesn't have an index. So when CsQuery makes changes, it also has to update the index to reflect the new DOM. It would certainly be worthwhile though, because I don't even have a good sense for how fast CsQuery is when making substantive DOM changes.

    At the same time, making changes to the DOM is likely to be much faster than running selectors in general (especially for HTML Agility Pack). That is, the reason HAP is so slow for selectors is that it doesn't use an index, so it basically has to seek through the entire node tree to locate matches. For small documents that's no big deal, but as you can see it makes a huge difference on larger documents. Changing the DOM, by contrast, is a targeted operation that requires only changing a few pointers in the node you're moving, and perhaps updating references among the siblings where it gets moved to. So chances are that for HAP, even substantial modifications to the DOM will take less time than typical queries. For CsQuery, it might be a closer comparison, since queries are usually pretty optimized. In any event it is definitely something I'm interested in looking at, and I will put together some tests whenever I have a chance.

    One other thing - this test is fairly outdated (I should update it) since the HTML parser in CsQuery has been completely replaced as of version 1.3, and many other changes have been made. Generally performance of CsQuery has improved since then, though. If you want to see how things fare now, you can run the tests yourself from the CsQuery.PerformanceTests project.
