Insights and discoveries
from deep in the weeds
Outsharked

Tuesday, October 16, 2012

CsQuery 1.3 Released

CsQuery 1.3 has been released. You can get it from NuGet or from the source repository on GitHub.

New HTML5 Compliant Parser

This release replaces the original HTML parser with the validator.nu HTML5 parser, a complete, standards-compliant HTML5 parser. It is the same codebase used in Gecko-based web browsers (e.g. Firefox), so you should expect excellent compatibility with the DOM that a web browser would build from the same markup. Problems that people have had in the past related to character set encoding, invalid HTML parsing, and other edge cases should simply go away.

In the process of implementing the new parser, some significant changes were made to the input and output API in order to take advantage of its capabilities. While these revisions are generally backwards compatible with 1.2.1, there are a few potentially breaking changes. These can be summarized as follows:

  • DomDocument.DomRenderingOptions has been removed. The concept of assigning output options to a Document doesn't make sense any more (if it ever did); rather, you define options for how output is rendered at the time you render it.
  • IOutputFormatter interface has changed. This wasn't really used for anything before, so I doubt this will impact anyone, but it's conceivable that someone coded against it. The interface has been revised somewhat, and it is now used extensively to define a model for rendering output.

Hopefully, these changes won't impact you much or at all. But with this small price comes a host of new options for parsing and rendering HTML.

Create Method Options

In the beginning, there was but a single way to create a new DOM from HTML: Create. And it was good. But as the original parser evolved towards HTML5 compliance, the CreateFragment and CreateDocument methods were added, to define intent. Different rules apply depending on the context: a full document must always have an html tag (among others) for example. But you wouldn't want to add any missing tags if your intent was to create a fragment that was not supposed to stand alone.
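To make the difference in intent concrete, here is a minimal sketch that parses the same markup both ways and counts a few elements. The sample markup and the class name are made up for illustration, and the counts are printed rather than stated, since the exact tree depends on the parser.

```csharp
using System;
using CsQuery;

public static class CreateIntentExample
{
    public static void Main()
    {
        const string markup = "<p>Hello</p>";

        // Fragment: the markup is not supposed to stand alone,
        // so no missing document-level tags should be added.
        CQ fragment = CQ.CreateFragment(markup);

        // Document: a full document must have an html tag (among others),
        // so the parser is expected to supply whatever is missing.
        CQ document = CQ.CreateDocument(markup);

        Console.WriteLine("html elements in fragment: " + fragment["html"].Length);
        Console.WriteLine("html elements in document: " + document["html"].Length);
        Console.WriteLine("p elements under body:     " + document["body > p"].Length);
    }
}
```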

The new parser has some more toys. It lets us define an expected document type (HTML5, HTML4 Strict, HTML4 Transitional). We can tell it the context we expect our HTML to be found in when it starts parsing. We can choose to discard comments, and decide whether to permit self-closing XML tags. All of these things went into the Create method, allowing you complete control over how your input gets processed.

New Overloads

The basic Create method has overloads to accept a number of different kinds of input:

    public static CQ Create(string html)
    public static CQ Create(char[] html)
    public static CQ Create(TextReader html)
    public static CQ Create(Stream html)
    public static CQ Create(IDomObject element)
    public static CQ Create(IEnumerable<IDomObject> elements)

Additionally, there are similar overloads with parameters that let you control each option:


    public static CQ Create(string html, 
            HtmlParsingMode parsingMode = HtmlParsingMode.Auto,
            HtmlParsingOptions parsingOptions = HtmlParsingOptions.Default,
            DocType docType = DocType.Default)

When calling the basic methods, the "default" values of each of these will be used. The default values are defined on the CsQuery.Config object (the "default defaults" are shown here -- if you change these on the config object, your new values will be used whenever a default is requested):

    CsQuery.Config.HtmlParsingOptions = HtmlParsingOptions.None;
    CsQuery.Config.DocType = DocType.HTML5;

Note that HtmlParsingOptions is a [Flags] enum. This means you can specify more than one option. So you could, for example, call Create like this (the argument is named because the second positional parameter is the parsing mode):

    var dom = CQ.Create(someHtml, parsingOptions: HtmlParsingOptions.Default | HtmlParsingOptions.IgnoreComments);

If you pass a method both Default and some other option(s), it will merge the default values with any additional options you specified. On the other hand, passing options that do not include Default will result in only the options you passed being used.
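The merge rule is easier to see side by side. The sketch below uses the `Create(string, ...)` signature shown earlier with named arguments; the sample markup and class name are invented, and `HtmlParsingMode.Fragment` is assumed to be a member of that enum (only `Auto` appears in the signature above).

```csharp
using System;
using CsQuery;

public static class ParsingOptionsExample
{
    public static void Main()
    {
        const string html = "<div><!-- note --><span>text</span></div>";

        // No options passed: every setting comes from CsQuery.Config.
        CQ withDefaults = CQ.Create(html);

        // Default plus one flag: the Config defaults are merged with IgnoreComments.
        CQ noComments = CQ.Create(html,
            parsingOptions: HtmlParsingOptions.Default | HtmlParsingOptions.IgnoreComments);

        // Default is absent here, so only the flags listed are used.
        // HtmlParsingMode.Fragment is an assumed enum member.
        CQ explicitOnly = CQ.Create(html,
            parsingMode: HtmlParsingMode.Fragment,
            parsingOptions: HtmlParsingOptions.IgnoreComments,
            docType: DocType.HTML5);

        Console.WriteLine(withDefaults.Render());
        Console.WriteLine(noComments.Render());
        Console.WriteLine(explicitOnly.Render());
    }
}
```

Naming the arguments also avoids having to spell out `parsingMode` when you only care about the options.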

The other methods remain more or less unchanged. CreateDocument and CreateFragment now simply call Create using the appropriate HtmlParsingMode to define the intended document type.

    public static CQ CreateDocument(...)
    public static CQ CreateFragment(...)
    public static CQ CreateFromFile(...)
    public static CQ CreateFromUrl(...)
    public static CQ CreateFromUrlAsync(...)

The Create method offers a wide range of options for input and parsing. These other methods were created for convenience and before an API to handle input features had been thought out. Though I don't intend to deprecate them right away, I will not likely extend them to support the various options. Anything you can do with these methods can be done about as easily with `Create` and a helper of some kind. For example, if you want to load a DOM from a file using options other than the defaults, you can just pass `File.Open(..)` to the standard `Create` method.
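As a rough sketch of such a helper, the following opens a file and hands the stream to `Create` with non-default options. The helper name is hypothetical, `HtmlParsingMode.Document` is assumed to be a member of that enum, and the `using` block assumes `Create` finishes reading the stream before it returns.

```csharp
using System.IO;
using CsQuery;

public static class FileLoadExample
{
    // Hypothetical helper: load a full document from disk, discarding comments.
    public static CQ LoadWithoutComments(string path)
    {
        // Assumes Create reads the whole stream before returning,
        // so it is safe to dispose the stream afterwards.
        using (Stream stream = File.OpenRead(path))
        {
            return CQ.Create(stream,
                parsingMode: HtmlParsingMode.Document,   // assumed enum member
                parsingOptions: HtmlParsingOptions.Default | HtmlParsingOptions.IgnoreComments);
        }
    }
}
```

Usage would then look like `var dom = FileLoadExample.LoadWithoutComments("page.html");`, where the file name is a placeholder.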

Render Method Options

The Render method signatures look pretty much the same as in 1.2.1, but a lot has changed behind the scenes. The IOutputFormatter interface, which used to be more or less a placeholder, now runs the show. All output is controlled by OutputFormatters implementing this interface. Any Render method which doesn't explicitly identify an OutputFormatter will use the default formatter provided by the service locator CsQuery.Config.GetOutputFormatter.

    public static Func<IOutputFormatter> GetOutputFormatter {get;set;}

You can replace the default locator with any delegate that returns IOutputFormatter. Alternatively, you can assign a single instance of a class to the CsQuery.Config.OutputFormatter property, which, if set, will supersede the service locator. When using this approach, the object must be thread safe, since new instances will not be created for each use.
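A minimal sketch of replacing the locator might look like this. It assumes the built-in `OutputFormatters` properties return `IOutputFormatter`; the class name is made up, and the `CsQuery.Output` import is there in case the formatter types live in that namespace.

```csharp
using System;
using CsQuery;
using CsQuery.Output;

public static class FormatterConfigExample
{
    public static void Main()
    {
        // Replace the service locator: the delegate is asked for a formatter
        // whenever a Render call does not name one explicitly.
        CsQuery.Config.GetOutputFormatter = () => OutputFormatters.HtmlEncodingMinimum;

        // Alternative (not executed here): share one instance, which then takes
        // precedence over the locator and therefore must be thread safe.
        // CsQuery.Config.OutputFormatter = someThreadSafeFormatter;

        CQ dom = CQ.CreateFragment("<p>rendered with the configured default</p>");
        Console.WriteLine(dom.Render());
    }
}
```

Both settings are global, so they are best applied once at application startup.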

There are a number of built-in IOutputFormatter objects accessible through the static OutputFormatters factory:

    OutputFormatters.HtmlEncodingBasic
    OutputFormatters.HtmlEncodingFull
    OutputFormatters.HtmlEncodingMinimum
    OutputFormatters.HtmlEncodingMinimumNbsp
    OutputFormatters.HtmlEncodingNone
    OutputFormatters.PlainText

Each of these except the last returns an OutputFormatter configured with a particular HtmlEncoder. The last strips out HTML and returns just the text contents (to the best of its ability). The factory also has Create methods that let you configure it with specific DomRenderingOptions too. Complete details of these options are in the Render method documentation.
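To compare the built-in formatters, you can pass one directly to Render, as in the sketch below. The sample markup and class name are invented, and the output is printed rather than listed because the exact encoding of each character depends on the formatter.

```csharp
using System;
using CsQuery;
using CsQuery.Output;

public static class RenderExample
{
    public static void Main()
    {
        CQ dom = CQ.CreateFragment("<p>caf\u00e9 &amp; <b>bar</b></p>");

        // Uses whatever CsQuery.Config currently provides as the default formatter.
        Console.WriteLine(dom.Render());

        // Explicit formatters override the default for this call only.
        Console.WriteLine(dom.Render(OutputFormatters.HtmlEncodingFull));
        Console.WriteLine(dom.Render(OutputFormatters.HtmlEncodingNone));

        // PlainText strips the markup and keeps the text content.
        Console.WriteLine(dom.Render(OutputFormatters.PlainText));
    }
}
```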

Bug Fixes

  • Issue #51: Fix an issue with compound subselectors whose target included CSS matches above the level of the context.
  • Fix an issue where :empty could return false when non-text or non-element nodes are present.

Other New Features

The completely new HTML parser, input and output models aren't enough for you? Well, there are a couple other minor new features.

  • CsQuery should compile under Mono now, after implementing a suggested change to `CsQuery.Utility.JsonSerializer.Deserialize` that avoids an unimplemented Mono framework feature.
  • Added a HasAttr method to test for the presence of a named attribute.
  • Added a CSS descriptor for the Paged Media Module per Pull Request #40 from @kaleb.
  • `CQ.DefaultDocType` has been marked as obsolete and will be removed in a future version. Use `Config.DocType` instead.
  • `CQ.DefaultDomRenderingOptions` has been marked as obsolete and will be removed in a future version. Use `Config.DomRenderingOptions` instead.
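The new HasAttr method from the list above can be used like this. The markup and class name are invented for the example, and the sketch assumes that HasAttr on a selection tests that selection's first element.

```csharp
using System;
using CsQuery;

public static class HasAttrExample
{
    public static void Main()
    {
        CQ dom = CQ.CreateFragment("<a href='/home'>Home</a><a>No link</a>");

        // Narrow the selection to one element before testing for the attribute.
        bool firstHasHref = dom["a"].First().HasAttr("href");
        bool lastHasHref = dom["a"].Last().HasAttr("href");

        Console.WriteLine(firstHasHref);
        Console.WriteLine(lastHasHref);
    }
}
```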

There are other changes in the complete change log; however, many of them are related to the deprecated parser and are no longer relevant.

Thanks To The Community

This is a big project, and the new parser is a huge step forward. I think you'll find this release is fast, stable, flexible, and standards-compliant. I owe a debt to a number of people who suffered through the development and beta releases for the last couple of months. Without their patience and feedback, this would not have been possible. A bug report is a gift! So thanks to all the givers. The following is a list of all the people who've contributed code or bug reports recently. (If I missed anyone, it wasn't intentional!) Thanks - please keep it coming.

Vitallium (code), kaleb (code), petterek, ilushka85, laurentlbm, martincarlsson, allroadcole, Nico1234, Uncleed, Vids, Arithmomaniac, CJCannon, muchio7, SaltyDH


CsQuery is a complete CSS selector engine and jQuery port for .NET4 and C#. It's on NuGet as CsQuery. For documentation and more information please see the GitHub repository and posts about CsQuery on this blog.