Bob Moore's Coding Tips

Why does the MFC CString::Tokenize method ignore blank fields?

The other weekend I was trying to use the Tokenize member function of the MFC class CString (with slash and comma as its delimiter set) to process some input data that looked a bit like this:

1.23/-1.1,,B16

I was expecting it to return something like:

"1.23" 
"-1.1" 
"" 
"B16"

i.e. every delimited token in the input, including the blank one. I was somewhat surprised, then, when the function returned:

"1.23"
"-1.1"
"B16"

The blank token got swallowed. Why? In order to understand this issue, we need a little history. Every writer of a tokenizer has to decide what to do when faced with blank fields, which just look like a series of delimiters. I think CString::Tokenize's writer decided to duplicate the behaviour of the C runtime function strtok, on the basis that this behaviour is what greybeard C programmers would expect.

Now strtok goes back a l-o-n-g way, and carries the slightly rancid whiff of academic computer science with it: it behaves as a compiler writer would expect it to behave. Forget our data processing example above: imagine it's the early seventies, you're at Berkeley Heights and you are attempting to write a lexical analyser. You need to extract the tokens from this code:

if (somevariable == someothervariable)
{
   dependentvariable = avalue;
}

In this case, as a compiler writer, I don't care if the rbracket is followed by three tabs, a space, a carriage return, a newline and three more spaces; I just want to see lbrace as the next delimited token. So the behaviour of strtok, as per its documentation, is to skip any leading delimiters before each token, which means a run of consecutive delimiters is treated as a single separator and never yields an empty token. And CString::Tokenize behaves exactly the same way.
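
To see the strtok behaviour in isolation, here is a minimal sketch in standard C++ (no MFC required). The helper name SplitWithStrtok is made up for this illustration and is not part of any library; it simply collects whatever strtok reports.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from MFC or the C runtime): split `input` with
// strtok and collect the tokens it reports.
std::vector<std::string> SplitWithStrtok(const std::string& input,
                                         const char* delimiters)
{
    // strtok writes NUL terminators into its argument, so work on a copy.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* token = std::strtok(buffer.data(), delimiters);
         token != nullptr;
         token = std::strtok(nullptr, delimiters))
    {
        // Each call skips any leading delimiters before the next token,
        // so a blank field between two delimiters never shows up here.
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling SplitWithStrtok("1.23/-1.1,,B16", "/,") gives three tokens: the field between the two commas is skipped, just as it was with CString::Tokenize.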

So, taking off our tie-dyes and flares and running the clock forward to today, how do I solve the problem of parsing the tokens out of my protocol string, including blank ones? I have found several functions online, but they commonly only deal with a single delimiter. I took the entirely sensible premise of one function I found online - to parse a CString into a CStringArray - and extended it to support multiple delimiters. This is what I ended up with:

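// Splits s on any character in Delimiters and appends every token,
// including blank ones, to OutputArray.
void ExtractTokensToArray (CString s, CStringArray& OutputArray, CString Delimiters)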
{
   int     Hit;
   int     EarliestHit;
   int     DelimiterIndex;
   CString sExtract;
   BOOL    MoreTokens = TRUE;
   BOOL    GotToken = FALSE;
   TCHAR   CurrDelimiter;

   while (MoreTokens)
   {
      GotToken = FALSE;
      EarliestHit = s.GetLength();

      // Trawl the string looking for the leftmost (earliest) hit in
      // the list of valid separators.
      for (DelimiterIndex = 0; 
           DelimiterIndex < Delimiters.GetLength();
           DelimiterIndex++)
      {
         CurrDelimiter = Delimiters [DelimiterIndex];
         Hit = s.Find (CurrDelimiter);
         if (Hit != -1)
         {
            if (Hit < EarliestHit)
            {
               EarliestHit = Hit;
            }
            GotToken = TRUE;
         }
      }

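      if (GotToken)
      {
         // Everything left of the delimiter is the token (possibly blank);
         // keep only what follows the delimiter for the next pass.
         sExtract = s.Left(EarliestHit);
         s = s.Right(s.GetLength() - EarliestHit - 1);
         OutputArray.Add(sExtract);
      }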
      else
      {
         // Trailing token.
         MoreTokens = FALSE;
         OutputArray.Add(s);
      }
   }
}

In the great tradition of this site, error handling and input sanity checking are left as an exercise for the student :)
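
For anyone who wants to try the idea without MFC to hand, the same approach can be sketched in standard C++. This is an analogue under assumptions, not the function above: the name ExtractTokens is hypothetical, std::string and std::vector stand in for CString and CStringArray, and std::string::find_first_of does the job of the loop that hunts for the earliest delimiter.

```cpp
#include <string>
#include <vector>

// Hypothetical standard C++ analogue of ExtractTokensToArray: split `s` on
// any character in `delimiters`, keeping blank fields.
std::vector<std::string> ExtractTokens(const std::string& s,
                                       const std::string& delimiters)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;

    for (;;)
    {
        // Earliest position at or after `start` holding any delimiter.
        const std::string::size_type hit = s.find_first_of(delimiters, start);
        if (hit == std::string::npos)
        {
            // Trailing token: whatever is left, which may be blank.
            tokens.push_back(s.substr(start));
            break;
        }
        // Text between the previous delimiter and this one (may be blank).
        tokens.push_back(s.substr(start, hit - start));
        start = hit + 1;
    }
    return tokens;
}
```

With the input from the top of the article, ExtractTokens("1.23/-1.1,,B16", "/,") gives four tokens, the third of which is the blank one. As with the MFC version, an empty input string yields a single blank token.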
