Insights and discoveries
from deep in the weeds
Outsharked

Tuesday, November 29, 2011

CsQuery 1.0 is imminent

Over the last four months I've done a lot of work on CsQuery (on GitHub), a C# port of jQuery. I have been using it extensively in a few web site projects, and it's quite solid. I've ported most of the relevant jQuery tests (DOM manipulation, traversing, selection, attributes, utility functions).

Rather than update the list of implemented methods, I've compiled a list of the methods that still remain to be implemented. There are not many. :) Everything else that's not in CsQuery already is browser-DOM specific (e.g. related to events, callbacks, etc.) or is a utility function that I don't think is useful in C#.

jQuery Methods NOT Implemented In CsQuery

        Detach
        Empty
        NextAll
        NextUntil
        End
        WrapAll
        WrapInner
        ParentsUntil
        OffsetParent
        PrevAll
        PrevUntil
        Prepend
        PrependTo
        Slice
        jQuery.Contains
        jQuery.Grep

... plus a few CSS selectors. Additionally, there is extensive support for dynamic/Expando objects using a special JsObject class, and CsQuery.Extend (which works pretty much as you would expect). Anything that implements IDictionary<string,object> can also be used as the target for object creation methods. This lets you work with objects in JSON form, or dynamic objects, almost seamlessly, e.g.:
// Create a new dom from a string of html

var myDom = CsQuery.Create(html);

// "AttrSet" and "CssSet" are the same as Attr(object) and Css(object). They have separate names
//  because C# can't overload on return type: Attr(string) and Css(string) already return the value
//  of a named item, so a method that takes a string of JSON data needs a different name. CsQuery
//  uses this convention for every method that can be passed a string of JSON data.

myDom.Select("div.sidebar")
    .AddClass("courier")
    .CssSet("{'border': '1px solid black', 'font-weight': 'bold'}");

// create a new anonymous object. You can also use any conventional object or expando object
// as a source parameter in CsQuery.Extend

var data = new { pageName="My Home Page", url="/myhomepage.html"};

// "null" below is a convention for the empty object {}. You can also pass a new expando object;
// this is just shorthand. The parameters match jQuery.extend. This merges the properties of data
// with those of the object created from the JSON string passed. There's also a CsQuery.ParseJSON
// method for explicitly creating a new expando object from JSON. Finally, CsQuery.Extend will work
// with conventional objects as the target (first parameter). In this case, it will only update
// existing properties with the new data, since you can't add properties to an existing non-expando
// object.

dynamic dataExtended = CsQuery.Extend(null,data,"{ 'access':'all' }");
myDom.Data("page",dataExtended).Hide();
myDom.RenderSelection();

// outputs: 
//   <div class="sidebar courier" style="border: 1px solid black; font-weight: bold; display: none;"
//       data-page='{"pageName": "My Home Page", "url": "/myhomepage.html", "access": "all" }'>
//   </div>
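
To make the merge semantics concrete, here is a minimal sketch of a jQuery.extend-style shallow merge in plain C#. It is not CsQuery's implementation: ExtendSketch and its Extend method are hypothetical names, and it takes a dictionary where CsQuery.Extend accepts a JSON string, so that no particular JSON parser has to be assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical helper, NOT CsQuery's code: a shallow, jQuery.extend-style merge.
public static class ExtendSketch
{
    // Copies each source's entries (or public properties) onto the target, left to right,
    // so later sources overwrite earlier ones. A null target means "start from an empty
    // expando object", mirroring the convention described above.
    public static IDictionary<string, object> Extend(
        IDictionary<string, object> target, params object[] sources)
    {
        target = target ?? new ExpandoObject();
        foreach (var source in sources)
        {
            if (source == null) continue;

            if (source is IDictionary<string, object> dict)
            {
                // Expando objects and dictionaries: copy the entries directly.
                foreach (var pair in dict) target[pair.Key] = pair.Value;
            }
            else
            {
                // Anonymous and conventional objects: read public properties by reflection.
                foreach (var prop in source.GetType().GetProperties())
                {
                    if (prop.GetIndexParameters().Length == 0) // skip indexers
                        target[prop.Name] = prop.GetValue(source, null);
                }
            }
        }
        return target;
    }
}

public static class Program
{
    public static void Main()
    {
        var data = new { pageName = "My Home Page", url = "/myhomepage.html" };
        var extra = new Dictionary<string, object> { { "access", "all" } };

        // The result is an ExpandoObject at run time, so dynamic member access works.
        dynamic merged = ExtendSketch.Extend(null, data, extra);
        Console.WriteLine(merged.pageName + " / " + merged.access);
    }
}
```

Because the target is typed as IDictionary<string,object>, the same method works for an ExpandoObject or any custom dictionary-backed class, which is the property the paragraph above relies on.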

There are still some other features I want to implement, but I am hoping to get some examples together and create a version 1.0 distribution in the next month or so. The code is solid and well tested, and it makes server-side HTML management a joy compared to WebControls, Razor/HTML helpers, and so on, where you have limited control over server-side HTML layout. And your brain can work with HTML exactly the same way on the server as it needs to on the client. Your whole browser DOM is right in front of you. It's great for scraping too.

I have not done extensive performance testing, but have done a little. It's easily fast enough for real-time HTML parsing. Of course, if you plan to use it on something serving a thousand pages a second, the parsing overhead might matter, but I suspect most people would find it plenty fast. On my laptop, it can parse a 5 megabyte HTML file with over 100,000 unique nodes (the entire HTML 5 spec) into an indexed DOM in 2.5 seconds. Selecting all the DIVs (over 3,300) takes less than 1/100th of a second. Now, 2.5 seconds is an eternity for a web server, but this is meant to be an unrealistic situation, and there would be little reason to parse a big page of static HTML that you had no intention of manipulating. A more typical web page of 20K would parse in less than 1/100th of a second. There's definitely room to make it faster, too, but it's plenty fast now, and I suspect it's a lot faster than manipulating and rendering a page with something like WebControls anyway.
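
The 20K figure is just the 5 megabyte measurement scaled down. Here is that arithmetic as a small sketch; it assumes parse time grows linearly with document size, which is a simplification rather than something I measured, and ParseEstimate is a name made up for the example.

```csharp
using System;

// Hypothetical helper: scales one measured parse time to another document size,
// ASSUMING parse time is proportional to the number of bytes parsed.
public static class ParseEstimate
{
    public static double EstimateSeconds(
        double measuredBytes, double measuredSeconds, double targetBytes)
    {
        double bytesPerSecond = measuredBytes / measuredSeconds;
        return targetBytes / bytesPerSecond;
    }

    public static void Main()
    {
        // 5 MB parsed in 2.5 s is about 2 MB per second...
        double seconds = EstimateSeconds(5 * 1024 * 1024, 2.5, 20 * 1024);

        // ...so a 20 KB page comes out at roughly 9.8 ms, just under 1/100th of a second.
        Console.WriteLine(seconds * 1000 + " ms");
    }
}
```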

Features that I still want to add:

  • Asynchronous HTTP gets - right now when using CsQuery.Server().CreateFromUrl() to load a DOM from the web, code execution is blocked while the get is performed. This is probably fine for some basic web scraping, but will slow things down a lot for any substantive real-time usage. I started coding for an async model but have not finished yet.
  • Form postback management - there's a basic tool for repopulating form elements from their postback data in the Server() module. This needs to be fleshed out and tested a bit, though, because I have not used it too much as I haven't created a lot of conventional HTML forms lately.
  • Framework and view engine - I've developed a useful, simple framework as part of one project. This includes some custom HTML tags like <csq-include src="..." />, <csq-when [conditions]>...</csq-when> to do things like server-side includes, environment-specific includes, and so on. These are not really specific to CsQuery but rather CsQuery is used to implement them, and they make working with pure HTML a lot easier.
  • Templates - something like the jQuery template plugin. Of course it's a piece of cake to write CsQuery code to do simple substitutions, but it would be nice to integrate some of that functionality into a framework.
  • Client script communication - one of the things that CsQuery makes very convenient is preconfiguring data for client-side controls. For example, say you have a grid control. A typical usage might be to initialize the control with an ajax request upon first page load. This causes the page to be rendered with no data at first, then perhaps an ajax loader shown to the user while it gets the default data. Why not pass the first batch of data directly to the control? It's easy to use CsQuery.Data() to pass data as an attribute of an HTML element, then in your javascript, just grab it with jQuery.Data(). This requires using some HTML element as a payload container. Not a big deal, but I would like to standardize this convention and create methods to abstract it.
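
The payload-container idea from the last item can be sketched without CsQuery at all. This is only an illustration under stated assumptions: DataPayloadSketch and RenderPayload are hypothetical names, and the JSON comes from System.Text.Json (which needs a recent .NET) rather than from anything in CsQuery.

```csharp
using System;
using System.Net;
using System.Text.Json;

// Hypothetical helper, NOT part of CsQuery: renders an empty container element whose
// data-<name> attribute carries a JSON payload for client-side script to pick up.
public static class DataPayloadSketch
{
    public static string RenderPayload(string elementId, string name, object payload)
    {
        string json = JsonSerializer.Serialize(payload);

        // HTML-encode the JSON so its double quotes cannot terminate the attribute value.
        return "<div id=\"" + WebUtility.HtmlEncode(elementId) + "\" data-" + name + "=\""
            + WebUtility.HtmlEncode(json) + "\"></div>";
    }

    public static void Main()
    {
        // First batch of grid data, sent with the page instead of a follow-up ajax request.
        var firstBatch = new { rows = new[] { 1, 2, 3 } };
        Console.WriteLine(RenderPayload("grid-data", "rows", firstBatch));
    }
}
```

On the client, the browser decodes the attribute back to plain JSON, and jQuery's .data("rows") should hand it over as an object, since it tries to parse data- attribute values that look like JSON.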

Anyway, it's getting close, but feel free to download the project from GitHub and give it a try. The basic usage could not be simpler:

var myDom = CsQuery.Create(htmlString);
var content = myDom["#maincontent > div.title"];
var newContent = myDom["<span>Hello world!</span>"].Css("font-weight","bold");
content.Append(newContent);
Response.Write(content.Render());
