Insights and discoveries
from deep in the weeds
Outsharked

Tuesday, July 10, 2012

CsQuery 1.1.3 Released

CsQuery 1.1.3 has been released. You can get it from NuGet or from the source repository on GitHub.

New features

This release adds an API for extending the selector engine with custom pseudo-class selectors. In jQuery, you can do this with code like James Padolsey's :regex extension. In C#, we can do a little better than this since we have classes and interfaces to make our lives easier. To that end, in the CsQuery.Engine namespace, you can now find:

    interface IPseudoSelector
        IPseudoSelectorFilter        
        IPseudoSelectorChild
   
    abstract class PseudoSelector: IPseudoSelector
        PseudoSelectorFilter: IPseudoSelectorFilter
        PseudoSelectorChild: IPseudoSelectorChild

The two different incarnations of the base IPseudoSelector interface represent two different types of pseudoclass selectors, which jQuery calls basic filters and child filters. Technically there are also content filters but these work the same way as "basic filters" in practice.

If you are only testing characteristics of the element itself, then use a filter-type selector. If an element's inclusion in a set depends on its children (such as :contains, which tests text-node children) or on its position in relation to its siblings (such as :nth-child), then you should probably use a child-type selector. In many cases you could do it either way. For example, :nth-child could be implemented by looking at each element's ElementIndex property and figuring out if it's a match. But it would be much more efficient to start from the parent and handpick each child that's at the right position.
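To make that trade-off concrete, here is a minimal sketch of the two strategies for a positional selector like :nth-child. It deliberately avoids the CsQuery types: Node, NthChildSketch, FilterStyle and ChildStyle are hypothetical names invented for this illustration, and the ElementIndex shown here is only a toy stand-in for the real property.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a DOM element; not a CsQuery type.
public class Node
{
    public string Name;
    public Node Parent;
    public List<Node> Children = new List<Node>();

    public Node(string name) { Name = name; }

    // Appends a child and returns this node so calls can be chained.
    public Node Add(Node child)
    {
        child.Parent = this;
        Children.Add(child);
        return this;
    }

    // Zero-based position among siblings. In this toy model it scans
    // the parent's child list every time it is read.
    public int ElementIndex
    {
        get { return Parent == null ? 0 : Parent.Children.IndexOf(this); }
    }
}

public static class NthChildSketch
{
    // Filter style: test every candidate element on its own.
    public static IEnumerable<Node> FilterStyle(IEnumerable<Node> selection, int n)
    {
        return selection.Where(e => e.Parent != null && e.ElementIndex == n - 1);
    }

    // Child style: start from the parent and pick the child at position n (1-based).
    public static IEnumerable<Node> ChildStyle(Node parent, int n)
    {
        if (n >= 1 && n <= parent.Children.Count)
        {
            yield return parent.Children[n - 1];
        }
    }
}
```

Both methods return the same elements for a given parent, but the child-style version touches one list slot instead of testing every element in the selection.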

The basic API

To create a new filter, implement one of the two interfaces. They both share IPseudoSelector:

  • IPseudoSelector Interfaces

        public interface IPseudoSelector
        {
            string Arguments { get; set; }
            int MinimumParameterCount { get; }
            int MaximumParameterCount { get; }
            string Name { get; }
        }
    

    In both cases, you should set the min/max values to the number of parameters you want your filter to accept (the default is 0). "Name" should be the name of this filter as it will be used in a selector. Then choose the one that works best for your filter:

        public interface IPseudoSelectorChild : IPseudoSelector
        {
            bool Matches(IDomObject element);
            IEnumerable<IDomObject> ChildMatches(IDomContainer element);
        }
    
        public interface IPseudoSelectorFilter: IPseudoSelector
        {
            IEnumerable<IDomObject> Filter(IEnumerable<IDomObject> selection);
        }
    
  • PseudoSelector Abstract Class

        /// <summary>
        /// Base class for any pseudoselector that implements validation of min/max parameter values, and
        /// argument validation. When implementing a pseudoselector, you must also implement an interface for the type
        /// of pseudoselector
        /// </summary>
    
        public abstract class PseudoSelector : IPseudoSelector
        {
            #region private properties
    
            private string _Arguments;
            
            /// <summary>
            /// Gets or sets criteria (or parameter) data passed with the pseudoselector
            /// </summary>
    
            protected virtual string[] Parameters {get;set;}
    
            /// <summary>
            /// A value to determine how to parse the string for a parameter at a specific index.
            /// </summary>
            ///
            /// <param name="index">
            /// Zero-based index of the parameter.
            /// </param>
            ///
            /// <returns>
            /// NeverQuoted to treat quotes as any other character; AlwaysQuoted to require that a quote
            /// character bounds the parameter; or OptionallyQuoted to accept a string that can (but does not
            /// have to be) quoted. The default abstract implementation returns NeverQuoted.
            /// </returns>
    
            protected virtual QuotingRule ParameterQuoted(int index)
            {
                return QuotingRule.NeverQuoted;
            }
    
            #endregion
    
            #region public properties
    
            /// <summary>
            /// This method is called before any validations are called against this selector. This gives the
            /// developer an opportunity to throw errors based on the configuration outside of the validation
            /// methods.
            /// </summary>
            ///
            /// <value>
            /// The arguments.
            /// </value>
    
            public virtual string Arguments
            {
                get
                {
                    return _Arguments;
                }
                set
                {
    
                    string[] parms=null;
                    if (!String.IsNullOrEmpty(value))
                    {
                        if (MaximumParameterCount > 1 || MaximumParameterCount < 0)
                        {
                            parms = ParseArgs(value);
                        }
                        else
                        {
                            parms = new string[] { ParseSingleArg(value) };
                        }
    
                        
                    }
                    ValidateParameters(parms);
                    _Arguments = value;
                    Parameters = parms;
                    
                }
            }
    
            /// <summary>
            /// The minimum number of parameters that this selector requires. If there are no parameters, return 0
            /// </summary>
            ///
            /// <value>
            /// An integer
            /// </value>
    
            public virtual int MinimumParameterCount { get { return 0; } }
    
            /// <summary>
            /// The maximum number of parameters that this selector can accept. If there is no limit, return -1.
            /// </summary>
            ///
            /// <value>
            /// An integer
            /// </value>
    
            public virtual int MaximumParameterCount { get { return 0; } }
    
            /// <summary>
            /// Return the properly cased name of this selector (the class name in non-camelcase)
            /// </summary>
    
            public virtual string Name
            {
                get
                {
                    return Utility.Support.FromCamelCase(this.GetType().Name);
                }
            }
    
            #endregion
    
            #region private methods
    
            /// <summary>
            /// Parse the arguments using the rules returned by the ParameterQuoted method.
            /// </summary>
            ///
            /// <param name="value">
            /// The arguments
            /// </param>
            ///
            /// <returns>
            /// An array of strings
            /// </returns>
    
            protected string[] ParseArgs(string value)
            {
                List<string> parms = new List<string>();
                int index = 0;
    
    
                IStringScanner scanner = Scanner.Create(value);
               
                while (!scanner.Finished)
                {
                    var quoting = ParameterQuoted(index);
                    switch (quoting)
                    {
                        case QuotingRule.OptionallyQuoted:
                            scanner.Expect(MatchFunctions.OptionallyQuoted(","));
                            break;
                        case QuotingRule.AlwaysQuoted:
                            scanner.Expect(MatchFunctions.Quoted());
                            break;
                        case QuotingRule.NeverQuoted:
                            scanner.Seek(',', true);
                            break;
                        default:
                            throw new NotImplementedException("Unimplemented quoting rule");
                    }
    
                    parms.Add(scanner.Match);
                    if (!scanner.Finished)
                    {
                        scanner.Next();
                        index++;
                    }
                    
                }
                return parms.ToArray();
            }
    
            /// <summary>
            /// Parse single argument passed to a pseudoselector
            /// </summary>
            ///
            /// <exception cref="ArgumentException">
            /// Thrown when one or more arguments have unsupported or illegal values.
            /// </exception>
            /// <exception cref="NotImplementedException">
            /// Thrown when the requested operation is unimplemented.
            /// </exception>
            ///
            /// <param name="value">
            /// The arguments.
            /// </param>
            ///
            /// <returns>
            /// The parsed string
            /// </returns>
    
            protected string ParseSingleArg(string value)
            {
                IStringScanner scanner = Scanner.Create(value);
    
                var quoting = ParameterQuoted(0);
                switch (quoting)
                {
                    case QuotingRule.OptionallyQuoted:
                        scanner.Expect(MatchFunctions.OptionallyQuoted());
                        if (!scanner.Finished)
                        {
                            throw new ArgumentException(InvalidArgumentsError());
                        }
                        return scanner.Match;
                    case QuotingRule.AlwaysQuoted:
    
                        scanner.Expect(MatchFunctions.Quoted());
                        if (!scanner.Finished)
                        {
                            throw new ArgumentException(InvalidArgumentsError());
                        }
                        return scanner.Match;
                    case QuotingRule.NeverQuoted:
                        return value;
                    default:
                        throw new NotImplementedException("Unimplemented quoting rule");
                }
            
            }
    
            /// <summary>
            /// Validates a parameter array against the expected number of parameters.
            /// </summary>
            ///
            /// <exception cref="ArgumentException">
            /// Thrown when the wrong number of parameters is passed.
            /// </exception>
            ///
            /// <param name="parameters">
            /// Criteria (or parameter) data passed with the pseudoselector.
            /// </param>
    
            protected virtual void ValidateParameters(string[] parameters) {
    
                if (parameters == null)
                {
                     if (MinimumParameterCount != 0) {
                         throw new ArgumentException(ParameterCountMismatchError());
                     } else {
                         return;
                     }
                }
    
                if ((parameters.Length < MinimumParameterCount ||
                        (MaximumParameterCount >= 0 &&
                            (parameters.Length > MaximumParameterCount))))
                {
                    throw new ArgumentException(ParameterCountMismatchError());
                }
    
            }
    
            /// <summary>
            /// Gets the string for a parameter count mismatch error.
            /// </summary>
            ///
            /// <returns>
            /// A string to be used as an exception message.
            /// </returns>
    
            protected string ParameterCountMismatchError()
            {
                if (MinimumParameterCount == MaximumParameterCount )
                {
                    if (MinimumParameterCount == 0)
                    {
                        return String.Format("The :{0} pseudoselector cannot have arguments.",
                            Name);
                    }
                    else
                    {
                        return String.Format("The :{0} pseudoselector must have exactly {1} arguments.",
                         Name,
                         MinimumParameterCount);
                    }
                } else if (MaximumParameterCount >= 0)
                {
                    return String.Format("The :{0} pseudoselector must have between {1} and {2} arguments.",
                        Name,
                        MinimumParameterCount,
                        MaximumParameterCount);
                }
                else
                {
                    // MaximumParameterCount < 0 means there is no upper limit
                    return String.Format("The :{0} pseudoselector must have at least {1} arguments.",
                         Name,
                         MinimumParameterCount);
                }
            }
    
            /// <summary>
            /// Get a string for an error when there are invalid arguments
            /// </summary>
            ///
            /// <returns>
            /// A string to be used as an exception message.
            /// </returns>
    
            protected string InvalidArgumentsError()
            {
                return String.Format("The :{0} pseudoselector has some invalid arguments.",
                            Name);
            }
    
            #endregion
        }
    
  • PseudoSelectorChild Abstract Class

        public abstract class PseudoSelectorChild: 
            PseudoSelector, IPseudoSelectorChild
        {
            /// <summary>
            /// Test whether an element matches this selector.
            /// </summary>
            ///
            /// <param name="element">
            /// The element to test.
            /// </param>
            ///
            /// <returns>
            /// true if it matches, false if not.
            /// </returns>
    
            public abstract bool Matches(IDomObject element);
    
            /// <summary>
            /// Basic implementation of ChildMatches, runs the Matches method 
            /// against each child. This should be overridden with something 
            /// more efficient if possible. For example, selectors that inspect
            /// the element's index could get their results more easily by 
            /// picking the correct results from the list of children rather 
            ///  than testing each one.
            /// 
            /// Also note that the default iterator for ChildMatches only 
            /// passes element (e.g. non-text node) children. If you want 
            /// to design a filter that works on other node types, you should
            /// override this to access all children instead of just the elements.
            /// </summary>
            ///
            /// <param name="element">
            /// The parent element.
            /// </param>
            ///
            /// <returns>
            /// A sequence of children that match.
            /// </returns>
    
            public virtual IEnumerable<IDomObject> ChildMatches(IDomContainer element)
            {
                return element.ChildElements.Where(item => Matches(item));
            }
        }
    
    
  • PseudoSelectorFilter Abstract Class

        public abstract class PseudoSelectorFilter: 
            PseudoSelector, IPseudoSelectorFilter
        {
            /// <summary>
            /// Test whether an element matches this selector.
            /// </summary>
            ///
            /// <param name="element">
            /// The element to test.
            /// </param>
            ///
            /// <returns>
            /// true if it matches, false if not.
            /// </returns>
    
            public abstract bool Matches(IDomObject element);
    
            /// <summary>
            /// Basic implementation of Filter, runs the Matches method 
            /// against each element in the selection.
            /// </summary>
            ///
            /// <param name="elements">
            /// The sequence of elements to filter.
            /// </param>
            ///
            /// <returns>
            /// A sequence of elements that match.
            /// </returns>
    
            public virtual IEnumerable<IDomObject> Filter(IEnumerable<IDomObject> elements)
            {
                return elements.Where(item => Matches(item));
            }
        }
    

If you implement one of the abstract classes, you get most of the functionality pre-rolled:

  • Name is the un-camel-cased name of the class itself, e.g. class MySpecialSelector would become a selector :my-special-selector
  • MinimumParameterCount and MaximumParameterCount are 0, meaning no parenthesized parameters.
  • Arguments is parsed into a protected property, string[] Parameters, using a comma as the separator and the min/max values as a guide. Additionally, you can override QuotingRule ParameterQuoted(int index) to tell the class how to parse each parameter. The index is the zero-based position of the parameter, and QuotingRule is an enum that indicates how quoting is handled at that position: NeverQuoted, AlwaysQuoted or OptionallyQuoted. NeverQuoted means single and double quotes are treated as regular characters. AlwaysQuoted means the parameter must be bounded by single or double quotes. OptionallyQuoted means quotes are treated as bounding quotes if present, but are not required.
  • The PseudoSelectorChild class implements ChildMatches by simply passing each element child to the Matches function. If you want to test other types of children (like text nodes) or have a smarter way to choose matching children, then override it.
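As a rough illustration of the naming rule in the first bullet, the sketch below converts a class name into a selector name. It is only an assumption about how Utility.Support.FromCamelCase behaves for simple names: NameSketch and its FromCamelCase method are hypothetical stand-ins written for this post, and the real helper may treat acronyms or digits differently.

```csharp
using System;
using System.Text;

public static class NameSketch
{
    // Hypothetical stand-in for Utility.Support.FromCamelCase: every
    // upper-case letter becomes lower-case, preceded by a hyphen unless
    // it is the first character of the name.
    public static string FromCamelCase(string typeName)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < typeName.Length; i++)
        {
            char c = typeName[i];
            if (char.IsUpper(c))
            {
                if (i > 0)
                {
                    sb.Append('-');
                }
                sb.Append(char.ToLowerInvariant(c));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, NameSketch.FromCamelCase("MySpecialSelector") yields "my-special-selector", which is the name you would then use as :my-special-selector.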

Adding Your Selector to CsQuery

Here's the cool part. To add your selector to CsQuery, you don't need to do anything. If you include it in a namespace called CsQuery.Extensions, it will be detected automatically. This works as long as the extension can be found in the assembly that first invokes a selector when the application starts. If that might not be the case, you can force CsQuery to register the extensions explicitly by making this call from the assembly in which they're found:

    CsQuery.Config.PseudoClassFilters.Register();

You can also pass an Assembly object to that method. Finally, you can register a filter type explicitly:

    CsQuery.Config.PseudoClassFilters.Register("my-special-selector",typeof(MySpecialSelector));

The Name property isn't used when you register an extension this way.
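Putting the pieces together, explicit registration might look something like the sketch below. It assumes the CsQuery NuGet package is referenced. HasTitle, Demo, CountTitledLinks and the MyApp.Selectors namespace are hypothetical names chosen for illustration, and the class is deliberately kept out of CsQuery.Extensions so that only the explicit Register call is involved. The null check relies on the attribute indexer returning null for a missing attribute, as the ?? "" fallback in the :regex example further down suggests.

```csharp
using CsQuery;
using CsQuery.Engine;

namespace MyApp.Selectors   // hypothetical; deliberately not CsQuery.Extensions
{
    // Hypothetical filter: matches any element that has a title attribute.
    public class HasTitle : PseudoSelectorFilter
    {
        public override bool Matches(IDomObject element)
        {
            // Assumes the indexer returns null when the attribute is absent.
            return element["title"] != null;
        }
    }

    public static class Demo
    {
        // Register once, under an explicit name, before the first selector runs.
        static Demo()
        {
            CsQuery.Config.PseudoClassFilters.Register("has-title", typeof(HasTitle));
        }

        // Counts the anchors in the markup that carry a title attribute.
        public static int CountTitledLinks(string html)
        {
            CQ doc = CQ.Create(html);
            return doc["a:has-title"].Length;
        }
    }
}
```

Because the name is passed to Register, the selector is :has-title regardless of what the Name property would return.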

Example

Here's a port of the :regex selector mentioned above. This can also be found in the test suite under CSharp\Selectors\RegexExtension.cs.

  • Regular Expression Filter Code

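        using System;
        using System.Text.RegularExpressions;
        using CsQuery;
        using CsQuery.Engine;
        using CsQuery.ExtensionMethods;
        // Alias the framework Regex so it doesn't clash with the selector class below
        using SysRegex = System.Text.RegularExpressions.Regex;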
    
        class Regex : PseudoSelectorFilter
        {
            private enum Modes
            {
                Data = 1,
                Css = 2,
                Attr = 3
            }
    
            private string Property;
            private Modes Mode;
            private SysRegex Expression;
    
            public override bool Matches(IDomObject element)
            {
                switch (Mode)
                {
                    case Modes.Attr:
                        return Expression.IsMatch(element[Property] ?? "");
                    case Modes.Css:
                        return Expression.IsMatch(element.Style[Property] ?? "");
                    case Modes.Data:
                        return Expression.IsMatch(element.Cq().DataRaw(Property) ?? "");
                    default:
                        throw new NotImplementedException();
                }
            }
    
            private void Configure()
            {
                var validLabels = new SysRegex("^(data|css):");
    
                if (validLabels.IsMatch(Parameters[0]))
                {
                    string[] subParm = Parameters[0].Split(':');
                    string methodName = subParm[0];
    
                    if (methodName == "data")
                    {
                        Mode = Modes.Data;
                    }
                    else if (methodName == "css")
                    {
                        Mode = Modes.Css;
                    }
                    else
                    {
                        throw new ArgumentException("Unknown mode for regex pseudoselector.");
                    }
                    Property = subParm[1];
                }
                else
                {
                    Mode = Modes.Attr;
                    Property = Parameters[0];
                }
    
                // The expression trims whitespace the same way as the original
                // Trim() would work just as well but left this way to demonstrate
                // the CsQuery "RegexReplace" extension method
    
                Expression = new SysRegex(Parameters[1].RegexReplace(@"^\s+|\s+$",""),
                    RegexOptions.IgnoreCase | RegexOptions.Multiline);
            }
    
    
            // We override "Arguments" to do some setup when this selector
            // is first created, rather than parse the arguments on each 
            // iteration as in the Javascript version. This technique should 
            // be used universally to do any argument setup. Selectors with no
            // arguments by definition should have no instance-specific
            // configuration to do, so there would be no point in overriding 
            // this for that kind of filter.
    
            public override string Arguments
            {
                get 
                {
                    return base.Arguments;
                }
                set
                {
                    base.Arguments = value;
                    Configure();
                }
            }
    
            // Allow either parameter to be optionally quoted since they're both
            // strings: just return OptionallyQuoted regardless of index.
    
            protected override QuotingRule ParameterQuoted(int index)
            {
                return QuotingRule.OptionallyQuoted;
            }
    
            public override int MaximumParameterCount
            {
                get { return 2; }
            }
            public override int MinimumParameterCount
            {
                get { return 2; }
            }
    
            public override string Name
            {
                get { return "regex"; }
            }
        }
    
    

This is actually a relatively complicated pseudo-selector. To see some simpler examples, just go look at the source code for the CsQuery CSS selector engine. Most of the native selectors have been implemented using this API. The exceptions are pseudoselectors that match only on indexed characteristics, e.g. all the tag and type selectors such as :input and :checkbox. These could have been set up the same way, but they wouldn't be able to take advantage of the index if they were implemented as filters.

Speaking Of Which... Selector Performance

Many of the same rules about selector performance apply here as they do in jQuery. Don't do this:

   var sel = doc[":some-filter"].Filter("div");

Do this:

   var sel = doc["div:some-filter"];

Obviously that's a pretty silly example - most people wouldn't go out of their way to do the first. But generally speaking, you should order your selectors this way:

  • ID, tag, and class selectors first;
  • attribute selectors next;
  • filters last

Unlike jQuery, it doesn't matter whether a filter or selector is "native" to CSS or not - everything is native in CsQuery. What matters is whether it's indexed. All attribute names (but not values), node names (tags), classes and ID values are indexed. It doesn't matter if you combine selectors -- the index can still be used as long as you're selecting on one of those things. But you should try to organize your selectors to choose the most specific indexed criteria first.

It's very fast for CsQuery to pull records from the index. So if you are targeting an ID, that's unique - always use that first. Classes are probably the next best, followed by tag names, and last attributes. Nodes with a certain attribute will be identified in the index just as fast as anything else, but then the engine still has to check the value of each node that has that attribute against your selection criteria.
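The difference between matching an attribute name and matching an attribute value can be pictured with a toy model. This is not CsQuery's actual index, just a sketch of the idea under the assumption that only names are keyed: Elem, ToyIndex, HasAttribute and AttributeEquals are hypothetical names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy element; hypothetical, not a CsQuery type.
public class Elem
{
    public string Id;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

// Toy index keyed by attribute NAME only: names are indexed, values are not.
public class ToyIndex
{
    private readonly Dictionary<string, List<Elem>> byAttributeName =
        new Dictionary<string, List<Elem>>();

    public void Add(Elem e)
    {
        foreach (string name in e.Attributes.Keys)
        {
            List<Elem> bucket;
            if (!byAttributeName.TryGetValue(name, out bucket))
            {
                bucket = new List<Elem>();
                byAttributeName[name] = bucket;
            }
            bucket.Add(e);
        }
    }

    // [name]: answered by the index lookup alone.
    public IEnumerable<Elem> HasAttribute(string name)
    {
        List<Elem> bucket;
        return byAttributeName.TryGetValue(name, out bucket)
            ? bucket
            : Enumerable.Empty<Elem>();
    }

    // [name=value]: index lookup first, then a value check on every hit.
    public IEnumerable<Elem> AttributeEquals(string name, string value)
    {
        return HasAttribute(name).Where(e => e.Attributes[name] == value);
    }
}
```

In this model an ID or class lookup would be a single dictionary hit, while an attribute-value test still has to walk every element that has the attribute, which is why attributes come last in the ordering above.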


CsQuery is a complete CSS selector engine and jQuery port for .NET 4 and C#. It's on NuGet as CsQuery. For documentation and more information please see the GitHub repository and posts about CsQuery on this blog.