Developer Blog
Articles about Using Microsoft Developer Tools

Converting Text to a URL "Slug"

Wednesday, May 12, 2010 10:01 AM by jonwood

By now, many of you have seen URLs that contain text similar to the page's title. For example, the URL may look like http://www.domain.com/this-is-my-best-article.aspx instead of http://www.domain.com/bestarticle.aspx. Text converted from a regular string so that it can appear within a URL this way is called a "slug."

Not only is a slug a little more human-readable, but it can also help indicate to search engines like Google and Bing what keywords are important to your page.

There's no built-in .NET function to convert a string to a slug. I found a few examples on the web but didn't find any I really liked. So I decided to roll my own in C#. Listing 1 shows my ConvertTextToSlug() method. It takes any string and makes it safe to include as part of a URL.

Initially, I started by looking for an official, comprehensive list of characters that are not valid within a URL. But after thinking about it, I decided the result looked cleaner if I got rid of all punctuation, whether or not it could appear in a URL. So my code also rejects characters that could legitimately be included in a URL.

/// <summary>
/// Creates a "slug" from text that can be used as part of a valid URL.
/// 
/// Letters and digits are kept (converted to lowercase) and whitespace
/// is converted to hyphens. All other characters, including punctuation
/// that is perfectly valid in a URL, are dropped to keep the result
/// mostly text. Steps are taken to prevent leading, trailing, and
/// consecutive hyphens.
/// </summary>
/// <param name="s">String to convert to a slug</param>
/// <returns>The slug text</returns>
public static string ConvertTextToSlug(string s)
{
  StringBuilder sb = new StringBuilder();
  bool wasHyphen = true;
  foreach (char c in s)
  {
    if (char.IsLetterOrDigit(c))
    {
      sb.Append(char.ToLower(c));
      wasHyphen = false;
    }
    else if (char.IsWhiteSpace(c) && !wasHyphen)
    {
      sb.Append('-');
      wasHyphen = true;
    }
  }
  // Avoid trailing hyphens
  if (wasHyphen && sb.Length > 0)
    sb.Length--;
  return sb.ToString();
}

Listing 1: ConvertTextToSlug() method. 

Some examples I found on the web used regular expressions. My routine is simpler. It just iterates through each character in the string, appending it to the result if it's either a letter or a digit. If I encounter whitespace, I append a hyphen (-). All other characters are skipped.
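
For comparison, a regular-expression version might look something like the sketch below. This is not the code from Listing 1; the class and method names (RegexSlugSketch, ToSlug) are hypothetical, and the sketch assumes that \s and char.IsWhiteSpace() treat the same characters as whitespace, which may not hold for every Unicode character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSlugSketch
{
  // Hypothetical regex-based counterpart to ConvertTextToSlug()
  public static string ToSlug(string s)
  {
    // Drop everything except letters, decimal digits, and whitespace
    string cleaned = Regex.Replace(s, @"[^\p{L}\p{Nd}\s]", "");
    // Collapse each run of whitespace into a single hyphen
    string hyphenated = Regex.Replace(cleaned, @"\s+", "-");
    // Remove leading and trailing hyphens, then lowercase the result
    return hyphenated.Trim('-').ToLowerInvariant();
  }
}
```

With this sketch, "This is my best article!" becomes "this-is-my-best-article". It makes three passes over the text, whereas the loop in Listing 1 makes one.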

The code takes steps to prevent consecutive hyphens, keeping the result looking cleaner. It also takes steps to prevent leading and trailing hyphens.

As you can see, it's a very simple routine. But it seems to produce good results. Of course, if you decide to name your documents this way, it'll be up to you to ensure you correctly handle different titles that resolve to the same slug.
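
One simple way to handle titles that resolve to the same slug is to append a numeric suffix. The sketch below assumes you keep a set of the slugs already in use (in practice this would more likely be a database lookup); the names SlugUniquenessSketch, MakeUnique, and usedSlugs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SlugUniquenessSketch
{
  // Returns the slug unchanged if it is not yet in use; otherwise
  // appends -2, -3, ... until an unused name is found. The returned
  // slug is added to usedSlugs.
  public static string MakeUnique(string slug, ISet<string> usedSlugs)
  {
    string candidate = slug;
    int suffix = 2;
    // ISet<T>.Add() returns false when the item is already present
    while (!usedSlugs.Add(candidate))
    {
      candidate = slug + "-" + suffix;
      suffix++;
    }
    return candidate;
  }
}
```

Calling MakeUnique("my-article", used) three times with the same set returns "my-article", "my-article-2", and "my-article-3".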

Categories:   C# .NET

Abbreviating URLs

Thursday, April 29, 2010 7:35 AM by jonwood

Recently, I had a case where an ASP.NET page displayed the user's URL in a side column. This worked fine except that I found some users had very long URLs, which didn't look right.

It occurred to me that I could simply truncate the visible URL while still keeping the underlying link the same. However, when I truncated the URL by trimming excess characters, I realized it could be done more intelligently.

For example, consider the URL http://www.domain.com/here/is/one/long/url/page.aspx. If I wanted to keep it within 40 characters, I could trim it to http://www.domain.com/here/is/one/long/u. The problem is that this abbreviation could be more informative. For example, is it a directory or a page? And, if it's a page, what kind? And what exactly does the "u" at the end stand for?

Wouldn't it be a little better if I instead abbreviated this URL to http://www.domain.com/.../url/page.aspx? We've lost a few characters due to the three dots that show information is missing. But we can still see the domain, and the page name and type.

The code in Listing 1 abbreviates a URL in this way. The UrlHelper class contains just a single, static method, LimitLength(). This method takes a URL string and a maximum length argument, and attempts to abbreviate the URL so that it will fit within the specified number of characters as described above.

public class UrlHelper
{
  public static char[] Delimiters = { '/', '\\' };
  /// <summary>
  /// Attempts to intelligently shorten the length of a URL. A maximum
  /// length of less than 5 characters is treated as 5.
  /// </summary>
  /// <param name="url">The URL to be shortened</param>
  /// <param name="maxLength">The maximum length of the result string</param>
  /// <returns>The abbreviated URL, or the original URL if it already fits</returns>
  public static string LimitLength(string url, int maxLength)
  {
    if (maxLength < 5)
      maxLength = 5;
    if (url.Length > maxLength)
    {
      // Remove protocol
      int i = url.IndexOfAny(new char[] { ':', '.' });
      if (i >= 0 && url[i] == ':')
        url = url.Remove(0, i + 1);
      // Remove leading delimiters (check each character in turn and
      // stop at the end of the string)
      i = 0;
      while (i < url.Length && (url[i] == Delimiters[0]
        || url[i] == Delimiters[1]))
        i++;
      if (i > 0)
        url = url.Remove(0, i);
      // Remove trailing delimiter
      if (url.Length > maxLength && (url.EndsWith("/") || url.EndsWith("\\")))
        url = url.Remove(url.Length - 1);
      // Remove path segments until url is short enough or no more segments:
      //
      // domain.com/abc/def/ghi/jkl.htm
      // domain.com/.../def/ghi/jkl.htm
      // domain.com/.../ghi/jkl.htm
      // domain.com/.../jkl.htm
      if (url.Length > maxLength)
      {
        i = url.IndexOfAny(Delimiters);
        if (i >= 0)
        {
          string first = url.Substring(0, i + 1);
          string last = url.Substring(i);
          bool trimmed = false;
          do
          {
            i = last.IndexOfAny(Delimiters, 1);
            if (i < 0 || i >= (last.Length - 1))
              break;
            last = last.Substring(i);
            trimmed = true;
          } while ((first.Length + 3 + last.Length) > maxLength);
          if (trimmed)
            url = String.Format("{0}...{1}", first, last);
        }
      }
    }
    return url;
  }
}

Listing 1: UrlHelper class.

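If the specified maximum length is less than five, LimitLength() simply changes it to five, as there is little point in attempting to shorten a URL to fewer characters than that. Note that the method only touches URLs longer than the maximum length. In that case it also strips the protocol (such as http://) and any leading slashes, and the result can still exceed the limit if there are no more path segments to remove.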
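
To keep the underlying link unchanged while showing only the abbreviated text, the two values can be combined into an anchor tag. The following is a minimal sketch, not part of the UrlHelper class: LinkSketch and BuildAnchor are hypothetical names, and the visible text is assumed to be whatever LimitLength() returned for the full URL.

```csharp
using System;
using System.Net;

public static class LinkSketch
{
  // Builds an anchor whose href is the full URL and whose visible
  // text is the abbreviated form. Both values are HTML-encoded.
  public static string BuildAnchor(string fullUrl, string visibleText)
  {
    return String.Format("<a href=\"{0}\">{1}</a>",
      WebUtility.HtmlEncode(fullUrl),
      WebUtility.HtmlEncode(visibleText));
  }
}
```

For example, passing the long URL from earlier along with the text "www.domain.com/.../long/url/page.aspx" produces a link that displays the short form but still points to the complete address.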

That's all there is to it. I hope some of you find this code helpful.

Categories:   C# .NET | ASP.NET

Encoding Query Arguments

Thursday, April 29, 2010 7:06 AM by jonwood

When passing variables between pages in ASP.NET, you have a few techniques you can choose from. One of the simplest is to use query arguments (e.g. http://www.domain.com/page.aspx?arg1=val1&arg2=val2). In ASP.NET, query arguments are easy to implement and use.

If you spend time browsing sites like Amazon.com, you'll see these query arguments causing the URLs to grow quite long. Long URLs don't generally cause a problem; however, there are some potential problems with query arguments.

For one thing, they are completely visible to the user. If you need to pass sensitive variables, then this could cause problems. For another thing, users can easily modify these values. For example, let's say you have a page that displays the current user's information. If a user ID is passed as a query argument, the user could easily edit that ID, possibly causing information for another user to be displayed. The potential security concerns here are pretty obvious.

Still, query arguments can be so convenient I decided to throw together a class that allows me to use them without the potential issues described above. In order to prevent the arguments from being seen by the user, the arguments are encrypted into a single argument. And in order to prevent the user from tampering with the values, the encrypted value includes a checksum that can detect if the data has been tampered with or corrupted.

Listing 1 shows my EncryptedQueryString class. By inheriting from Dictionary<string, string>, my class is a dictionary class. You can add any number of key/value items to the dictionary and then call ToString() to produce an encrypted string that contains all the values and a simple checksum. The string returned can then be passed to a page as a single query argument.

To restore the values, you can call the constructor that accepts an encrypted string. This constructor extracts the data from the encrypted string and adds it to the dictionary. Note that if this constructor finds an invalid or missing checksum, nothing is added to the dictionary. This prevents the calling code from working with questionable data.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Web;
public class EncryptedQueryString : Dictionary<string, string>
{
  // Change the following keys to ensure uniqueness
  // Must be 8 bytes
  protected byte[] _keyBytes =
    { 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18 };
  // Must be at least 8 characters
  protected string _keyString = "ABC12345";
  // Name for checksum value (unlikely to be used in user arguments)
  protected string _checksumKey = "__$$";
  /// <summary>
  /// Creates an empty dictionary
  /// </summary>
  public EncryptedQueryString()
  {
  }
  /// <summary>
  /// Creates a dictionary from the given, encrypted string
  /// </summary>
  /// <param name="encryptedData"></param>
  public EncryptedQueryString(string encryptedData)
  {
    // Decrypt string
    string data = Decrypt(encryptedData);
    // Parse out key/value pairs and add to dictionary
    string checksum = null;
    string[] args = data.Split('&');
    foreach (string arg in args)
    {
      int i = arg.IndexOf('=');
      if (i != -1)
      {
        string key = arg.Substring(0, i);
        string value = arg.Substring(i + 1);
        if (key == _checksumKey)
          checksum = value;
        else
          base.Add(HttpUtility.UrlDecode(key), HttpUtility.UrlDecode(value));
      }
    }
    // Clear contents if valid checksum not found
    if (checksum == null || checksum != ComputeChecksum())
      base.Clear();
  }
  /// <summary>
  /// Returns an encrypted string that contains the current dictionary
  /// </summary>
  /// <returns></returns>
  public override string ToString()
  {
    // Build query string from current contents
    StringBuilder content = new StringBuilder();
    foreach (string key in base.Keys)
    {
      if (content.Length > 0)
        content.Append('&');
      content.AppendFormat("{0}={1}",  HttpUtility.UrlEncode(key),
        HttpUtility.UrlEncode(base[key]));
    }
    // Add checksum
    if (content.Length > 0)
      content.Append('&');
    content.AppendFormat("{0}={1}", _checksumKey, ComputeChecksum());
    return Encrypt(content.ToString());
  }
  /// <summary>
  /// Returns a simple checksum for all keys and values in the collection
  /// </summary>
  /// <returns></returns>
  protected string ComputeChecksum()
  {
    int checksum = 0;
    foreach (KeyValuePair<string, string> pair in this)
    {
      checksum += pair.Key.Sum(c => c - '0');
      checksum += pair.Value.Sum(c => c - '0');
    }
    return checksum.ToString("X");
  }
  /// <summary>
  /// Encrypts the given text
  /// </summary>
  /// <param name="text">Text to be encrypted</param>
  /// <returns></returns>
  protected string Encrypt(string text)
  {
    try
    {
      byte[] keyData = Encoding.UTF8.GetBytes(_keyString.Substring(0, 8));
      DESCryptoServiceProvider des = new DESCryptoServiceProvider();
      byte[] textData = Encoding.UTF8.GetBytes(text);
      MemoryStream ms = new MemoryStream();
      CryptoStream cs = new CryptoStream(ms,
        des.CreateEncryptor(keyData, _keyBytes), CryptoStreamMode.Write);
      cs.Write(textData, 0, textData.Length);
      cs.FlushFinalBlock();
      return GetString(ms.ToArray());
    }
    catch (Exception)
    {
      return String.Empty;
    }
  }
  /// <summary>
  /// Decrypts the given encrypted text
  /// </summary>
  /// <param name="text">Text to be decrypted</param>
  /// <returns></returns>
  protected string Decrypt(string text)
  {
    try
    {
      byte[] keyData = Encoding.UTF8.GetBytes(_keyString.Substring(0, 8));
      DESCryptoServiceProvider des = new DESCryptoServiceProvider();
      byte[] textData = GetBytes(text);
      MemoryStream ms = new MemoryStream();
      CryptoStream cs = new CryptoStream(ms,
        des.CreateDecryptor(keyData, _keyBytes), CryptoStreamMode.Write);
      cs.Write(textData, 0, textData.Length);
      cs.FlushFinalBlock();
      return Encoding.UTF8.GetString(ms.ToArray());
    }
    catch (Exception)
    {
      return String.Empty;
    }
  }
  /// <summary>
  /// Converts a byte array to a string of hex characters
  /// </summary>
  /// <param name="data"></param>
  /// <returns></returns>
  protected string GetString(byte[] data)
  {
    StringBuilder results = new StringBuilder();
    foreach (byte b in data)
      results.Append(b.ToString("X2"));
    return results.ToString();
  }
  /// <summary>
  /// Converts a string of hex characters to a byte array
  /// </summary>
  /// <param name="data"></param>
  /// <returns></returns>
  protected byte[] GetBytes(string data)
  {
    // GetString() encodes the hex-numbers with two digits
    byte[] results = new byte[data.Length / 2];
    for (int i = 0; i < data.Length; i += 2)
      results[i / 2] = Convert.ToByte(data.Substring(i, 2), 16);
    return results;
  }
}

Listing 1: EncryptedQueryString class.
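
The checksum in Listing 1 is a plain sum of character values, so it does not change when characters are reordered (for example, "12" and "21" produce the same sum). If you want stronger tamper detection, one alternative is a keyed hash. The sketch below is not part of the EncryptedQueryString class; HmacSketch, ComputeTag, and IsValid are hypothetical names, and the secret key is assumed to be supplied by your application.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HmacSketch
{
  // Computes a keyed hash of the content and returns it as hex
  public static string ComputeTag(string content, byte[] secretKey)
  {
    using (HMACSHA256 hmac = new HMACSHA256(secretKey))
    {
      byte[] hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(content));
      return BitConverter.ToString(hash).Replace("-", "");
    }
  }

  // Returns true if the tag matches the content for the given key
  public static bool IsValid(string content, string tag, byte[] secretKey)
  {
    string expected = ComputeTag(content, secretKey);
    if (tag == null || tag.Length != expected.Length)
      return false;
    // Compare every character rather than stopping at the first mismatch
    int diff = 0;
    for (int i = 0; i < expected.Length; i++)
      diff |= expected[i] ^ tag[i];
    return diff == 0;
  }
}
```

The tag would take the place of the value returned by ComputeChecksum(), computed over the same key/value content.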

So, for example, a page that sends encrypted arguments to another page could contain code something like what is shown in Listing 2. This code constructs an empty EncryptedQueryString object, adds a couple of values to the dictionary, and then passes the resulting string as a single query argument to page.aspx.

protected void Button1_Click(object sender, EventArgs e)
{
  EncryptedQueryString args = new EncryptedQueryString();
  args["arg1"] = "val1";
  args["arg2"] = "val2";
  Response.Redirect(String.Format("page.aspx?args={0}", args.ToString()));
}

Listing 2: Code that passes encrypted query arguments.

Finally, Listing 3 shows code that could go in page.aspx to extract the encrypted values from the single argument.

protected void Page_Load(object sender, EventArgs e)
{
  EncryptedQueryString args =
    new EncryptedQueryString(Request.QueryString["args"]);
  Label1.Text = string.Format("arg1={0}, arg2={1}", args["arg1"], args["arg2"]);
}

Listing 3: Code to extract encrypted query arguments.

And that's all there is to it. Be sure to add error checking in case the dictionary objects are not there (either because they were not provided, or because an invalid checksum caused the EncryptedQueryString class to clear all items from the dictionary).
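
Because EncryptedQueryString inherits from Dictionary<string, string>, one way to do that error checking is with TryGetValue() rather than the indexer, which throws an exception when a key is missing. A minimal sketch follows; the helper name GetOrDefault and the fallback value are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class QueryArgSketch
{
  // Returns the value stored under key, or fallback if the key is
  // missing (for example, because the dictionary was cleared after
  // a failed checksum test)
  public static string GetOrDefault(IDictionary<string, string> args,
    string key, string fallback)
  {
    string value;
    return args.TryGetValue(key, out value) ? value : fallback;
  }
}
```

In Listing 3, a call such as GetOrDefault(args, "arg1", "(none)") could then replace args["arg1"].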

Also, be sure to customize the two keys near the top of Listing 1 so that people who read this article won't be able to decrypt your values.

Query arguments aren't always the best choice. You may choose to use Session variables or other techniques, depending on your requirements. But query arguments are straightforward and easy to implement. Using the class I've presented here, they can also be reasonably secure.

Categories:   C# .NET | ASP.NET

Parsing HTML Tags in C#

Sunday, February 07, 2010 10:46 AM by jonwood

The .NET framework provides a plethora of tools for generating HTML markup, and for both generating and parsing XML markup. However, it provides very little in the way of support for parsing HTML markup.

I had some pretty old code (written in classic Visual Basic) for spidering websites and I had ported it over to C#. Spidering generally involves parsing out all the links on a particular web page and then following those links and doing the same for those pages. Spidering is how companies like Google scour the Internet.

My ported code worked pretty well, but it wasn’t very forgiving. For example, I had a website that allowed users to enter a URL of a page that had a link to our site in return for a free promotion. The code would scan the given URL for a backlink. However, sometimes it would report there was no backlink when there really was.

The error occurred when the user’s web page contained syntax errors, such as an attribute value that had no closing quote. My code would skip ahead past large amounts of markup, looking for that quote.

So I rewrote the code to be more flexible—as most browsers are. In the case of attribute values missing closing quotes, my code assumes the value has terminated whenever it encounters a line break. I made other changes as well, primarily designed to make the code simpler and more robust.
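
The idea of ending a quoted value at either the closing quote or a line break, whichever comes first, can be sketched in isolation as follows. This is a simplified stand-alone illustration with hypothetical names (QuotedValueSketch, ReadQuotedValue), not the parser's own method.

```csharp
using System;

public static class QuotedValueSketch
{
  // Reads a quoted attribute value. 'start' is the index of the first
  // character after the opening quote. The value ends at the matching
  // quote, at a line break, or at the end of the text.
  public static string ReadQuotedValue(string html, int start, char quote)
  {
    int end = html.IndexOfAny(new char[] { quote, '\r', '\n' }, start);
    if (end < 0)
      end = html.Length; // Nothing found; take the rest of the text
    return html.Substring(start, end - start);
  }
}
```

Given the markup <a href="page.htm followed by a line break, the value returned is page.htm, so the missing quote costs at most the rest of that line.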

Listing 1 is the HtmlParser class I came up with. Note that there are many ways you can parse HTML. My code is only interested in tags and their attributes and does not look at text that comes between tags. This is perfect for spidering links in a page.

The ParseNext() method is called to find the next occurrence of a tag and returns an HtmlTag object that describes the tag. The caller indicates the type of tag it wants information about (or “*” if it wants information about all tags).

Parsing HTML markup is fairly simple. As I mentioned, much of my time was spent making the code handle markup errors intelligently. There were a few other special cases as well. For example, if the code finds a <script> tag, it automatically scans to the closing </script> tag, if any. This is because some scripting can include HTML markup characters that can confuse the parser, so I just jump over them. I take similar action with HTML comments and have special handling for !DOCTYPE tags as well.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace HtmlParser
{
  public class HtmlTag
  {
    /// <summary>
    /// Name of this tag
    /// </summary>
    public string Name { get; set; }
    /// <summary>
    /// Collection of attribute names and values for this tag
    /// </summary>
    public Dictionary<string, string> Attributes { get; set; }
    /// <summary>
    /// True if this tag contained a trailing forward slash
    /// </summary>
    public bool TrailingSlash { get; set; }
  };
  public class HtmlParser
  {
    protected string _html;
    protected int _pos;
    protected bool _scriptBegin;
    public HtmlParser(string html)
    {
      Reset(html);
    }
    /// <summary>
    /// Resets the current position to the start of the current document
    /// </summary>
    public void Reset()
    {
      _pos = 0;
    }
    /// <summary>
    /// Sets the current document and resets the current position to the
    /// start of it
    /// </summary>
    /// <param name="html"></param>
    public void Reset(string html)
    {
      _html = html;
      _pos = 0;
    }
    /// <summary>
    /// Indicates if the current position is at the end of the current
    /// document
    /// </summary>
    public bool EOF
    {
      get { return (_pos >= _html.Length); }
    }
    /// <summary>
    /// Parses the next tag that matches the specified tag name
    /// </summary>
    /// <param name="name">Name of the tags to parse ("*" = parse all
    /// tags)</param>
    /// <param name="tag">Returns information on the next occurrence
    /// of the specified tag or null if none found</param>
    /// <returns>True if a tag was parsed or false if the end of the
    /// document was reached</returns>
    public bool ParseNext(string name, out HtmlTag tag)
    {
      tag = null;
      // Nothing to do if no tag specified
      if (String.IsNullOrEmpty(name))
        return false;
      // Loop until match is found or there are no more tags
      while (MoveToNextTag())
      {
        // Skip opening '<'
        Move();
        // Examine first tag character
        char c = Peek();
        if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
        {
          // Skip over comments
          const string endComment = "-->";
          _pos = _html.IndexOf(endComment, _pos);
          NormalizePosition();
          Move(endComment.Length);
        }
        else if (c == '/')
        {
          // Skip over closing tags
          _pos = _html.IndexOf('>', _pos);
          NormalizePosition();
          Move();
        }
        else
        {
          // Parse tag
          bool result = ParseTag(name, ref tag);
          // Because scripts may contain tag characters,
          // we need special handling to skip over
          // script contents
          if (_scriptBegin)
          {
            const string endScript = "</script";
            _pos = _html.IndexOf(endScript, _pos,
              StringComparison.OrdinalIgnoreCase);
            NormalizePosition();
            Move(endScript.Length);
            SkipWhitespace();
            if (Peek() == '>')
              Move();
          }
          // Return true if requested tag was found
          if (result)
            return true;
        }
      }
      return false;
    }
    /// <summary>
    /// Parses the contents of an HTML tag. The current position should
    /// be at the first character following the tag's opening less-than
    /// character.
    /// 
    /// Note: We parse to the end of the tag even if this tag was not
    /// requested by the caller. This ensures subsequent parsing takes
    /// place after this tag
    /// </summary>
    /// <param name="name">Name of the tag the caller is requesting,
    /// or "*" if caller is requesting all tags</param>
    /// <param name="tag">Returns information on this tag if it's one
    /// the caller is requesting</param>
    /// <returns>True if data is being returned for a tag requested by
    /// the caller or false otherwise</returns>
    protected bool ParseTag(string name, ref HtmlTag tag)
    {
      // Get name of this tag
      string s = ParseTagName();
      // Special handling
      bool doctype = _scriptBegin = false;
      if (String.Compare(s, "!DOCTYPE", true) == 0)
        doctype = true;
      else if (String.Compare(s, "script", true) == 0)
        _scriptBegin = true;
      // Is this a tag requested by caller?
      bool requested = false;
      if (name == "*" || String.Compare(s, name, true) == 0)
      {
        // Yes, create new tag object
        tag = new HtmlTag();
        tag.Name = s;
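        // Attribute names are matched without regard to case, so
        // HREF and href refer to the same attribute
        tag.Attributes = new Dictionary<string, string>(
          StringComparer.OrdinalIgnoreCase);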
        requested = true;
      }
      // Parse attributes
      SkipWhitespace();
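      // Also stop at the end of the document so that an unterminated
      // tag cannot cause an endless loop
      while (!EOF && Peek() != '>')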
      {
        if (Peek() == '/')
        {
          // Handle trailing forward slash
          if (requested)
            tag.TrailingSlash = true;
          Move();
          SkipWhitespace();
          // If this is a script tag, it was closed
          _scriptBegin = false;
        }
        else
        {
          // Parse attribute name
          s = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
          SkipWhitespace();
          // Parse attribute value
          string value = String.Empty;
          if (Peek() == '=')
          {
            Move();
            SkipWhitespace();
            value = ParseAttributeValue();
            SkipWhitespace();
          }
          // Add attribute to collection if requested tag
          if (requested)
          {
            // This attribute replaces any existing attribute with
            // the same name
            tag.Attributes[s] = value;
          }
        }
      }
      // Skip over closing '>'
      Move();
      return requested;
    }
    /// <summary>
    /// Parses a tag name. The current position should be the first
    /// character of the name
    /// </summary>
    /// <returns>Returns the parsed name string</returns>
    protected string ParseTagName()
    {
      int start = _pos;
      while (!EOF && !Char.IsWhiteSpace(Peek()) && Peek() != '>')
        Move();
      return _html.Substring(start, _pos - start);
    }
    /// <summary>
    /// Parses an attribute name. The current position should be the
    /// first character of the name
    /// </summary>
    /// <returns>Returns the parsed name string</returns>
    protected string ParseAttributeName()
    {
      int start = _pos;
      while (!EOF && !Char.IsWhiteSpace(Peek()) && Peek() != '>'
        && Peek() != '=')
        Move();
      return _html.Substring(start, _pos - start);
    }
    /// <summary>
    /// Parses an attribute value. The current position should be the
    /// first non-whitespace character following the equal sign.
    /// 
    /// Note: We terminate the name or value if we encounter a new line.
    /// This seems to be the best way of handling errors such as values
    /// missing closing quotes, etc.
    /// </summary>
    /// <returns>Returns the parsed value string</returns>
    protected string ParseAttributeValue()
    {
      int start, end;
      char c = Peek();
      if (c == '"' || c == '\'')
      {
        // Move past opening quote
        Move();
        // Parse quoted value
        start = _pos;
        _pos = _html.IndexOfAny(new char[] { c, '\r', '\n' }, start);
        NormalizePosition();
        end = _pos;
        // Move past closing quote
        if (Peek() == c)
          Move();
      }
      else
      {
        // Parse unquoted value
        start = _pos;
        while (!EOF && !Char.IsWhiteSpace(c) && c != '>')
        {
          Move();
          c = Peek();
        }
        end = _pos;
      }
      return _html.Substring(start, end - start);
    }
    /// <summary>
    /// Moves to the start of the next tag
    /// </summary>
    /// <returns>True if another tag was found, false otherwise</returns>
    protected bool MoveToNextTag()
    {
      _pos = _html.IndexOf('<', _pos);
      NormalizePosition();
      return !EOF;
    }
    /// <summary>
    /// Returns the character at the current position, or a null
    /// character if we're at the end of the document
    /// </summary>
    /// <returns>The character at the current position</returns>
    public char Peek()
    {
      return Peek(0);
    }
    /// <summary>
    /// Returns the character at the specified number of characters
    /// beyond the current position, or a null character if the
    /// specified position is at the end of the document
    /// </summary>
    /// <param name="ahead">The number of characters beyond the
    /// current position</param>
    /// <returns>The character at the specified position</returns>
    public char Peek(int ahead)
    {
      int pos = (_pos + ahead);
      if (pos < _html.Length)
        return _html[pos];
      return (char)0;
    }
    /// <summary>
    /// Moves the current position ahead one character
    /// </summary>
    protected void Move()
    {
      Move(1);
    }
    /// <summary>
    /// Moves the current position ahead the specified number of characters
    /// </summary>
    /// <param name="ahead">The number of characters to move ahead</param>
    protected void Move(int ahead)
    {
      _pos = Math.Min(_pos + ahead, _html.Length);
    }
    /// <summary>
    /// Moves the current position to the next character that is
    /// not whitespace
    /// </summary>
    protected void SkipWhitespace()
    {
      while (!EOF && Char.IsWhiteSpace(Peek()))
        Move();
    }
    /// <summary>
    /// Normalizes the current position. This is primarily for handling
    /// conditions where IndexOf(), etc. return negative values when
    /// the item being sought was not found
    /// </summary>
    protected void NormalizePosition()
    {
      if (_pos < 0)
        _pos = _html.Length;
    }
  }
}

Listing 1: The HtmlParser class.

Using the class is very easy. Listing 2 shows sample code that scans a web page for all the HREF values in A (anchor) tags. It downloads a URL and loads the contents into an instance of the HtmlParser class. It then calls ParseNext() with a request to return information about all A tags.

When ParseNext() returns, tag is set to an instance of the HtmlTag class with information about the tag that was found. This class includes a collection of attribute values, which my code uses to locate the value of the HREF attribute.

When ParseNext() returns false, the end of the document has been reached.

  protected void ScanLinks(string url)
  {
    // Download page (WebClient requires: using System.Net;)
    WebClient client = new WebClient();
    string html = client.DownloadString(url);
    // Scan links on this page
    HtmlTag tag;
    HtmlParser parse = new HtmlParser(html);
    while (parse.ParseNext("a", out tag))
    {
      // See if this anchor links to us
      string value;
      if (tag.Attributes.TryGetValue("href", out value))
      {
        // value contains URL referenced by this link
      }
    }
  }

Listing 2: Code that demonstrates using the HtmlParser class
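
When spidering, many HREF values are relative to the page they appear on, so they usually need to be resolved before they can be followed. A minimal sketch using the framework's Uri class is shown below; LinkResolverSketch and ResolveLink are hypothetical names, and the sketch assumes the page URL itself is absolute.

```csharp
using System;

public static class LinkResolverSketch
{
  // Resolves an href found on pageUrl to an absolute URL. Returns
  // null if either value cannot be interpreted as a URL.
  public static string ResolveLink(string pageUrl, string href)
  {
    Uri baseUri;
    Uri resolved;
    if (Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri) &&
      Uri.TryCreate(baseUri, href, out resolved))
      return resolved.AbsoluteUri;
    return null;
  }
}
```

For a page at http://www.domain.com/dir/page.aspx, the href other.aspx resolves to http://www.domain.com/dir/other.aspx, while an href that is already absolute is returned as is.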

While I’ll probably find a few tweaks and fixes required to this code, it seems to work well. I found similar code on the web but didn’t like it. My code is fairly simple, does not rely on large library routines, and seems to perform well. I hope you are able to benefit from it.

Categories:   C# .NET

Implementing Word Wrap in C#

Sunday, January 10, 2010 1:44 PM by jonwood

The .NET platform makes it easy to send emails from your code. However, it was bothering me the other day that my emails had no word wrap.

In most cases, modern email readers will word wrap when email lines are too long. But there are still some email readers around that won’t. The industry standard is to wrap email lines, limiting their length to about 65-75 characters. So I decided it was worth implementing word wrap in my code.

As rich as it is, the .NET platform does not appear to have any routines for implementing word wrap. I found some sample code online but, while the code was fairly simple (which is good), I didn’t think it was very efficient.

The .NET platform provides many routines for parsing text and extracting substrings, etc. but these generally involve allocating and moving lots of memory. So my approach was to write simple C# code that would word wrap the text without unnecessarily allocating additional objects.

Of course, I will need a new string in order to save my results. And since I’ll be building that string line-by-line, I used the StringBuilder class for this. The StringBuilder class allows you to more efficiently build a string without allocating new strings each time you make a change. Listing 1 is the code I came up with.

protected const string _newline = "\r\n";
/// <summary>
/// Word wraps the given text to fit within the specified width.
/// </summary>
/// <param name="text">Text to be word wrapped</param>
/// <param name="width">Width, in characters, to which the text
/// should be word wrapped</param>
/// <returns>The modified text</returns>
public static string WordWrap(string text, int width)
{
  int pos, next;
  StringBuilder sb = new StringBuilder();
  // Lucidity check
  if (width < 1)
    return text;
  // Parse each line of text
  for (pos = 0; pos < text.Length; pos = next)
  {
    // Find end of line
    int eol = text.IndexOf(_newline, pos);
    if (eol == -1)
      next = eol = text.Length;
    else
      next = eol + _newline.Length;
    // Copy this line of text, breaking into smaller lines as needed
    if (eol > pos)
    {
      do
      {
        int len = eol - pos;
        if (len > width)
          len = BreakLine(text, pos, width);
        sb.Append(text, pos, len);
        sb.Append(_newline);
        // Trim whitespace following break
        pos += len;
        while (pos < eol && Char.IsWhiteSpace(text[pos]))
          pos++;
      } while (eol > pos);
    }
    else sb.Append(_newline); // Empty line
  }
  return sb.ToString();
}
/// <summary>
/// Locates position to break the given line so as to avoid
/// breaking words.
/// </summary>
/// <param name="text">String that contains line of text</param>
/// <param name="pos">Index where line of text starts</param>
/// <param name="max">Maximum line length</param>
/// <returns>The modified line length</returns>
public static int BreakLine(string text, int pos, int max)
{
  // Find last whitespace in line
  int i = max - 1;
  while (i >= 0 && !Char.IsWhiteSpace(text[pos + i]))
    i--;
  if (i < 0)
    return max; // No whitespace found; break at maximum length
  // Find start of whitespace
  while (i >= 0 && Char.IsWhiteSpace(text[pos + i]))
    i--;
  // Return length of text before whitespace
  return i + 1;
}

Listing 1: Word Wrap Code

The code starts by extracting each line from the original text. It does this by locating the hard-coded line breaks. Note that my code searches for carriage return, line feed pairs (“\r\n”). Some platforms may only use “\n” or other variations for new lines, but the carriage return, line feed pair works in most cases on Windows systems. You can change the _newline constant if you want the code to look for something else.
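If the input may come from a platform that uses a lone “\n” or “\r”, one option is to normalize the line endings before calling WordWrap instead of changing the constant. The following is only a sketch under that assumption; the NewlineHelper class and its NormalizeNewlines method are hypothetical names and are not part of the original listing.

```csharp
using System;

public static class NewlineHelper
{
    // Hypothetical helper (not part of Listing 1): converts lone "\r" or
    // "\n" line endings to "\r\n" so that a routine that searches only
    // for "\r\n" sees every hard line break.
    public static string NormalizeNewlines(string text)
    {
        if (String.IsNullOrEmpty(text))
            return String.Empty;

        // Collapse every variation to "\n" first, then expand to "\r\n"
        return text.Replace("\r\n", "\n")
                   .Replace("\r", "\n")
                   .Replace("\n", "\r\n");
    }
}
```

Each Replace call allocates a new string, so this trades some of the efficiency discussed above for convenience. It is probably only worth doing when the source of the text is not under your control.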

The code then copies each line to the result string. If a line is too long to fit within the specified width, then it is further broken into smaller lines. Each time through the loop, if the line needs to be broken, the BreakLine method is called to locate the last white space that fits within the maximum line length. This is done to try to break the line between words instead of in the middle of them.

While the string object provides the LastIndexOf() method, which could be used to locate the last space character, I coded the loop manually so that I could use Char.IsWhiteSpace() to support all whitespace characters (tabs and other Unicode whitespace, not just the space character). If no whitespace is found, the line is simply broken at the maximum line length.
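To make the difference concrete, here is a small sketch that compares the two searches. The WhiteSpaceSearch class and both method names are made up for this illustration; only Char.IsWhiteSpace() and String.LastIndexOf() come from the framework.

```csharp
using System;

public static class WhiteSpaceSearch
{
    // Returns the index of the last whitespace character within
    // text[start .. start + count - 1], or -1 if there is none.
    public static int LastWhiteSpace(string text, int start, int count)
    {
        for (int i = start + count - 1; i >= start; i--)
        {
            if (Char.IsWhiteSpace(text[i]))
                return i;
        }
        return -1;
    }

    // The same search using String.LastIndexOf(), which only finds
    // the space character and so skips over tabs.
    public static int LastSpace(string text, int start, int count)
    {
        return text.LastIndexOf(' ', start + count - 1, count);
    }
}
```

For the text "alpha beta\tgamma" and a width of 12, LastSpace finds the space after "alpha", while LastWhiteSpace finds the tab after "beta", which allows a longer first line.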

As each line is broken, the code removes any spaces at the break. This avoids trailing spaces on the current line or leading spaces on the next line. Although there is normally only one space between each word, the code tries to correctly handle cases where there might be more.

As each new line is created, a carriage return, line feed pair is also added to separate each line. Note the special case for handling when the line is empty, in which case we just write the carriage return, line feed pair.

There’s nothing complex about this code, but I took a little extra time to make it efficient. Note that the word wrap is based on the number of characters and not the display width. If you were, for example, word wrapping text output to the screen or printer, the code should probably test different line lengths measured on a device context in order to determine the display length.

Categories:   C# .NET

Validating Credit Card Numbers

Tuesday, May 12, 2009 2:17 AM by jonwood

When using ASP.NET to process online credit card orders, it is a good idea if you can perform some sort of validation on the credit card number before submitting it to your processor. I recently had to write some code to process credit card orders and thought I’d share a bit of my code.

Fortunately, credit card numbers are created in a way that allows for some basic verification. This verification does not tell you if funds are available on the account and it certainly doesn’t tell you whether or not the person submitting the order is committing credit card fraud. In fact, it’s possible that the card number is mistyped in such a way that it just happens to pass verification. But it does catch most typing errors and reduces bandwidth by catching those errors before trying to actually process the credit card.

To validate a credit card number, you start by adding the value of every other digit, starting from the right-most digit and working left. Next, you do the same thing with the digits skipped in the first step, but this time you double the value of each digit and add the value of each digit in the result. Finally, you add both totals together and if the result is evenly divisible by 10, then the card has passed validation.

Of course, this would be clearer with a bit of code and Listing 1 shows my IsCardNumberValid method.

public static bool IsCardNumberValid(string cardNumber)
{
  int i, checkSum = 0;
  // Compute checksum of every other digit starting from right-most digit
  for (i = cardNumber.Length - 1; i >= 0; i -= 2)
    checkSum += (cardNumber[i] - '0');
  // Now take digits not included in first checksum, multiply by two,
  // and compute checksum of resulting digits
  for (i = cardNumber.Length - 2; i >= 0; i -= 2)
  {
    int val = ((cardNumber[i] - '0') * 2);
    while (val > 0)
    {
      checkSum += (val % 10);
      val /= 10;
    }
  }
  // Number is valid if sum of both checksums MOD 10 equals 0
  return ((checkSum % 10) == 0);
}

Listing 1: Validating a credit card.

The IsCardNumberValid method assumes that all spaces and other non-digit characters have been stripped from the card number string. This is a straightforward task, but Listing 2 shows the method I use for this.

public static string NormalizeCardNumber(string cardNumber)
{
  if (cardNumber == null)
    cardNumber = String.Empty;
  StringBuilder sb = new StringBuilder();
  foreach (char c in cardNumber)
  {
    if (Char.IsDigit(c))
      sb.Append(c);
  }
  return sb.ToString();
}

Listing 2: Removing all non-digit characters from a credit card number.
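The two listings above can also be combined into a single call. The sketch below is one possible way to do that, not the code I use in production; the CardCheck class and IsCardInputValid method are hypothetical names. It makes two assumptions that differ from the listings: it rejects input that contains no digits at all (an empty string would otherwise produce a checksum of 0 and pass), and it accepts only the ASCII digits 0 through 9, because Char.IsDigit() also returns true for digits from other scripts, where subtracting '0' does not give the digit's value.

```csharp
using System;
using System.Text;

public static class CardCheck
{
    // Hypothetical wrapper: strips non-digit characters and then applies
    // the same every-other-digit checksum described above.
    public static bool IsCardInputValid(string input)
    {
        if (input == null)
            return false;

        // Keep ASCII digits only (assumption: see the text above)
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            if (c >= '0' && c <= '9')
                sb.Append(c);
        }

        // With no digits the checksum would be 0, which would "pass"
        if (sb.Length == 0)
            return false;

        int checkSum = 0;
        bool doubleIt = false; // The right-most digit is not doubled
        for (int i = sb.Length - 1; i >= 0; i--)
        {
            int val = sb[i] - '0';
            if (doubleIt)
            {
                val *= 2;
                if (val > 9)
                    val -= 9; // Same result as adding the two digits
            }
            checkSum += val;
            doubleIt = !doubleIt;
        }
        return (checkSum % 10) == 0;
    }
}
```

As a worked example, take the test number 4111 1111 1111 1111. The eight digits that are not doubled add up to 8. The other eight digits (seven 1s and the leading 4) double to seven 2s and an 8, which add up to 22. The total is 30, which is evenly divisible by 10, so the number passes.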

You will also be able to reduce bandwidth if you can avoid trying to submit a card that is not supported by the business. So another task that can be useful is determining the credit card type.

public enum CardType
{
  Unknown = 0,
  MasterCard = 1,
  VISA = 2,
  Amex = 3,
  Discover = 4,
  DinersClub = 5,
  JCB = 6,
  enRoute = 7
}
// Class to hold credit card type information
private class CardTypeInfo
{
  public CardTypeInfo(string regEx, int length, CardType type)
  {
    RegEx = regEx;
    Length = length;
    Type = type;
  }
  public string RegEx { get; set; }
  public int Length { get; set; }
  public CardType Type { get; set; }
}
// Array of CardTypeInfo objects. Used by GetCardType() to identify credit card types.
private static CardTypeInfo[] _cardTypeInfo =
{
  new CardTypeInfo("^(51|52|53|54|55)", 16, CardType.MasterCard),
  new CardTypeInfo("^(4)", 16, CardType.VISA),
  new CardTypeInfo("^(4)", 13, CardType.VISA),
  new CardTypeInfo("^(34|37)", 15, CardType.Amex),
  new CardTypeInfo("^(6011)", 16, CardType.Discover),
  new CardTypeInfo("^(300|301|302|303|304|305|36|38)", 14, CardType.DinersClub),
  new CardTypeInfo("^(3)", 16, CardType.JCB),
  new CardTypeInfo("^(2131|1800)", 15, CardType.JCB),
  new CardTypeInfo("^(2014|2149)", 15, CardType.enRoute),
};
public static CardType GetCardType(string cardNumber)
{
  foreach (CardTypeInfo info in _cardTypeInfo)
  {
    if (cardNumber.Length == info.Length && Regex.IsMatch(cardNumber, info.RegEx))
      return info.Type;
  }
  return CardType.Unknown;
}

Listing 3: Determining a credit card’s type.

Listing 3 is my code to determine a credit card’s type. I’m a big fan of table-driven code, when it makes sense, and so I created an array of CardTypeInfo objects. The GetCardType() method simply loops through this array, looking for the first description that would match the credit card number being tested. As before, this routine assumes all non-digit characters have been removed from the credit card number string.

The main reason I like table-driven code is because it makes the code simpler. This results in code that is easier to read and modify. GetCardType() returns a value from the CardType enum. CardType.Unknown is returned if the card number doesn’t match any card descriptions in the table.

Writing code to process credit cards involves a number of issues that need to be addressed. Hopefully, this code will give you a leg up on addressing a couple of them.

Categories:   C# .NET | ASP.NET

Shallow and Deep Object Copying

Sunday, April 12, 2009 1:23 PM by jonwood

In .NET, class objects are reference types. Assigning one object variable to another object variable does not copy that object; it simply causes both object variables to reference the same object.

Sometimes, a copy is required. For example, maybe two routines need to start with the same data but then change that data independently from each other. Copying the data ensures that changes made by one routine will not impact the data being used by the other routine.

When using .NET, two types of copies are possible: shallow and deep. In the case of a shallow copy, a new object is created and each member from the original object is assigned to the corresponding member of the new object. In the case of value members, this is a copy in the truest sense. However, with objects that contain reference members, this does not produce a true copy.

One example of a reference type is a string. When you assign one string variable to another, both variables will reference the same string data. The characters of the strings are not truly copied. So if a class contains reference members, a shallow copy does not create a true copy of all class members.

For many cases, a shallow copy is sufficient. Note that strings are immutable and cannot be changed. When you create a shallow copy of an object that contains strings, and then modify a string in the new object, that would create a new string and would not have any impact on the original string in the original object. Note that other data types such as arrays, class objects, and arrays of class objects can be quite a bit more complicated than strings.

A deep copy is when a copy is created that contains none of the original data. A true copy of each member is created. A deep copy doesn’t need to do anything special with members that are value types. But for reference data types, the new object must reference copies of that data instead of the original data.

There is nothing unique about how either method of copying an object is performed. Consider Listing 1. This code declares a class called MyClass, and then shows a short method called Test that performs both a shallow and a deep copy using that class object.

protected class MyClass
{
   public int i;
   public int j;
   public string message;
}
private void Test()
{
   MyClass mc1;
   MyClass mc2;
   mc1 = new MyClass();
   mc1.i = 5;
   mc1.j = 10;
   mc1.message = "Hello, World!";
   // Shallow copy
   mc2 = new MyClass();
   mc2.i = mc1.i;
   mc2.j = mc1.j;
   mc2.message = mc1.message;
   // Deep copy
   mc2 = new MyClass();
   mc2.i = mc1.i;
   mc2.j = mc1.j;
   mc2.message = String.Copy(mc1.message);
}

Listing 1: Shallow and deep copying of an object.

The shallow copy does nothing special. It simply assigns each member from one object to the other. For value members, the deep copy uses the same code. However, for the one reference member, message, the code must create a copy of the string data. (Note that additional steps would be required to perform a deep copy with objects that include reference members with references to additional objects, such as class members, arrays, etc.)
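To illustrate the case of an array member, here is a minimal sketch. The Inventory class and its members are invented for this example and do not appear in the listings. Unlike a string, an array can be modified in place, so this is where the difference between the two kinds of copy becomes visible.

```csharp
using System;

public class Inventory
{
    public int id;
    public int[] counts; // Reference member: an array

    // Shallow copy: the new object shares the same array
    public Inventory ShallowCopy()
    {
        Inventory copy = new Inventory();
        copy.id = id;
        copy.counts = counts;
        return copy;
    }

    // Deep copy: the new object gets its own copy of the array
    public Inventory DeepCopy()
    {
        Inventory copy = new Inventory();
        copy.id = id;
        if (counts != null)
            copy.counts = (int[])counts.Clone();
        return copy;
    }
}
```

If the original object's counts[0] is changed after copying, the shallow copy sees the new value because both objects reference one array, while the deep copy keeps the old value. Array.Clone() is enough here because the elements are value types; an array of class objects would need each element copied as well.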

Now that I’ve hopefully explained the difference between a shallow and a deep copy, let’s take a look at some of the tools the .NET Framework provides to perform these tasks.

protected class MyClass : ICloneable
{
   public int i;
   public int j;
   public string message;
   public object Clone()
   {
      return MemberwiseClone();
   }
}
private void Test()
{
   MyClass mc1 = new MyClass();
   mc1.i = 5;
   mc1.j = 10;
   mc1.message = "Hello, World!";
   // Shallow copy
   MyClass mc2 = (MyClass)mc1.Clone();
}

Listing 2: Using MemberwiseClone() to perform a shallow copy.

Listing 2 uses MemberwiseClone() to perform a shallow copy. MemberwiseClone() is protected and so cannot be called directly from Test. Instead, I’ve modified MyClass to implement the ICloneable interface and implemented the one ICloneable method, Clone. (Note that ICloneable does not specify whether Clone() performs a deep or a shallow copy; I use it here to implement a shallow copy.) The Test method calls this new method to perform the shallow copy. Since Clone() returns type object, a type cast is required.

To perform a deep copy, Listing 3 also implements the ICloneable interface. This listing just modifies the code in the Clone() method to perform a deep copy.

protected class MyClass : ICloneable
{
   public int i;
   public int j;
   public string message;
   public object Clone()
   {
      MyClass mc = new MyClass();
      mc.i = i;
      mc.j = j;
      if (message != null)
         mc.message = String.Copy(message);
      return mc;
   }
}
private void Test()
{
   MyClass mc1 = new MyClass();
   mc1.i = 5;
   mc1.j = 10;
   mc1.message = "Hello, World!";
   // Deep copy
   MyClass mc2 = (MyClass)mc1.Clone();
}

Listing 3: Using ICloneable to perform a deep copy.

The actual code in the Clone() method should be familiar by now. The main advantage to implementing it this way is that it is implemented as part of the class, where it can easily be modified and called from anywhere in your application.

Nothing too complex here, although the concept behind a shallow and deep copy can be confusing to some. Hopefully, I’ve shed some light on this topic and demonstrated how you might approach the issue using .NET.

Categories:   C# .NET

BackgroundWorker.ReportProgress is Asynchronous

Sunday, April 12, 2009 12:12 PM by jonwood

I never noticed this before but the BackgroundWorker.ReportProgress method returns before the control’s ProgressChanged event has completed. It may return before the ProgressChanged event has even started!

For those not familiar with the BackgroundWorker control, this control simplifies creating a worker thread, especially for the purpose of keeping the user interface responsive while the worker thread performs a lengthy process.

One issue it simplifies relates to the fact that the worker thread cannot directly access the form or its controls because those objects were created by the UI thread. Instead, code running in the worker thread can call the control’s ReportProgress method, which raises the control’s ProgressChanged event. You can pass information to ReportProgress that describes the current state of the lengthy process, and the handler for the ProgressChanged event can use that data to display it to the user in your form controls.

I had been using this approach for a lengthy operation that could run for days. A lot was going on so I was passing an instance of a custom class that contained various bits of progress information. But, at one point, I saw that the progress information being displayed was not correct. On further inspection, I could see that my worker thread was updating the progress information object before the ProgressChanged event handler had a chance to display that information.

It is very easy to get caught up with multi-threading issues as some things are just not very intuitive. When I called the ReportProgress method, I had just assumed that it would not return until the ProgressChanged event had completed. But I was wrong.

Thinking about it, the way this control works makes sense. If, instead, the worker thread was blocked until the event had returned, some of the benefits of a worker thread would be lost as one thread would be shut down during that time. Also, note that the ReportProgress method is overloaded. One version simply takes an integer argument. Since integers are passed by value, there would be no reason to suspend the worker thread when using this version of the ReportProgress method.

The other version takes an object in addition to the integer argument. That’s the version I was using. Since class objects are passed by reference, changes to this data made in the worker thread would be reflected in the same object being used in the ProgressChanged event.

At first thought, I wondered if maybe I should resolve this by blocking the worker thread somehow until the event had run to completion. But, as I’ve already pointed out, this eliminates some of the advantage of having a worker thread in the first place. A much simpler solution is to simply make a copy of my progress class object. This way, the worker thread can modify its copy as needed while the ProgressChanged event is reading its copy, perhaps both at the same time.

Note that I only required a “shallow” copy. In the case of value members, a shallow copy will create a true copy of those members. In the case of reference members, the copy is actually a reference to the same object. The only reference members in my case were strings. Since strings are immutable and cannot be changed, if my code updated one of these members, that would create a new string and not affect the original one referenced in the object passed to ReportProgress.

protected class ProgressInfo
{
   public int current;
   public int total;
   public string message;
   public object Clone()
   {
      return MemberwiseClone();
   }
}
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
   ProgressInfo info = new ProgressInfo();
   info.total = 1000;
   for (info.current = 0; info.current < info.total; info.current++)
   {
      info.message = String.Format("Processing item {0}",
         info.current + 1);
      backgroundWorker1.ReportProgress(0, info.Clone());
      //
      // Further processing on this item
      //
   }
}

Listing 1: Passing copy of object to BackgroundWorker.ReportProgress

Listing 1 shows some sample code. The ProgressInfo class declares a Clone() method, which calls MemberwiseClone(). MemberwiseClone() performs a shallow copy of the object. Note that this method is protected and, therefore, can only be called from a method of the class (or a derived class). This is why it was necessary to create the additional, public, “wrapper” method in my class, which my worker thread can call.

Using this code, my ProgressChanged event handler can take its time displaying the progress data and will not be affected by my background worker thread making changes to its copy of that data at the same time.
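The effect of passing a copy can be seen even without any threads. The sketch below is a simplified simulation, not BackgroundWorker itself: a List stands in for progress events that have been raised but not yet handled, and the ProgressSnapshot and SnapshotDemo names are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

public class ProgressSnapshot
{
    public int current;
    public string message;

    // Public wrapper around the protected MemberwiseClone() method
    public ProgressSnapshot Clone()
    {
        return (ProgressSnapshot)MemberwiseClone();
    }
}

public static class SnapshotDemo
{
    // Simulates three progress reports. The returned list stands in for
    // events that are still waiting to be handled when the loop ends.
    public static List<ProgressSnapshot> Report(bool clone)
    {
        List<ProgressSnapshot> pending = new List<ProgressSnapshot>();
        ProgressSnapshot info = new ProgressSnapshot();
        for (info.current = 0; info.current < 3; info.current++)
        {
            info.message = String.Format("Processing item {0}",
                info.current + 1);
            // Either queue a copy or queue the one shared object
            pending.Add(clone ? info.Clone() : info);
        }
        return pending;
    }
}
```

With Report(false), every entry in the list is the same object, so all three show whatever state the worker left behind when the loop ended. With Report(true), each entry keeps the values it had at the time it was reported.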

Categories:   C# .NET

Selecting a ListView Item

Thursday, January 08, 2009 5:03 AM by jonwood

Recently, I was writing a C# desktop application that processed records from a database. I wanted to provide visual feedback as to what was going on so I decided to add each record to a ListView control as the record was processed.

In order to highlight the current item, and also to scroll the ListView control as needed in order to keep the current item visible, I decided to select each item after I added it to the list.

I was surprised to find that there is no single method that will select an item as though it was clicked with the mouse. You can select an item by setting its Selected property to true. But the ListView control allows multiple items to be selected and so selecting an item does not unselect any already selected items.

Also, when you click on a ListView item, it also gets a focus rectangle drawn around it. This is a type of dotted line that shows which item is the current item. (Note that the focus rectangle is only drawn when the control itself has the focus.)

And finally, when you click on a ListView item with the mouse, the ListView will scroll, if needed, so that the newly selected item is fully visible within the ListView’s client window area.

As I mentioned, there is no single method to duplicate all the actions that occur when you click on a ListView item with the mouse. The code I ended up with to add an item to a ListView control and fully select that item is shown in Listing 1.

ListViewItem item = new ListViewItem();
lvwInfo.SelectedItems.Clear();
lvwInfo.Items.Add(item);
lvwInfo.EnsureVisible(item.Index);
item.Selected = true;
item.Focused = true;

Listing 1: Adding, and then selecting a ListView item as though it had been clicked with the mouse.

I don’t understand why a higher level method is not provided to accomplish this same task. Not only would that save the time required to find all the individual methods required to accomplish this, but it’s possible that the control could change in the future to require an additional or different step. Having this logic as part of a single method in the control allows the control developers to make changes as needed and all code that uses that control would then work correctly.
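In the meantime, the steps from Listing 1 can at least be kept in one place in your own code. The following is a minimal sketch of such a wrapper; ListViewHelper and AddAndSelect are hypothetical names and are not part of the ListView control. It assumes a Windows Forms project and simply performs the same calls, in the same order, as Listing 1.

```csharp
using System.Windows.Forms;

public static class ListViewHelper
{
    // Hypothetical helper (not part of the ListView API): adds an item
    // and selects it as though it had been clicked with the mouse.
    public static void AddAndSelect(ListView listView, ListViewItem item)
    {
        listView.SelectedItems.Clear();      // Unselect any selected items
        listView.Items.Add(item);            // Add the new item
        listView.EnsureVisible(item.Index);  // Scroll it into view
        item.Selected = true;                // Highlight it
        item.Focused = true;                 // Give it the focus rectangle
    }
}
```

With this in place, the code in Listing 1 becomes a single call such as ListViewHelper.AddAndSelect(lvwInfo, item), and any future change to the required steps only needs to be made in one method.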

Categories:   C# .NET