The other weekend I was trying to use the Tokenize member function of the MFC class CString (with slash and comma as its delimiter set), to process some data input that looked a bit like this:
1.23/-1.1,,B16
I was expecting it to return something like:
"1.23" "-1.1" "" "B16"
i.e. every delimited token in the input, including the blank one. I was somewhat surprised, then, when the function returned
"1.23" "-1.1" "B16"
The blank token got swallowed. Why? In order to understand this issue, we need a little history. Every writer of a tokenizer has to decide what to do when faced with blank fields, which just look like a series of delimiters. I think CString::Tokenize's writer decided to duplicate the behaviour of the C runtime function strtok, on the basis that this behaviour is what greybeard C programmers would expect.
Now strtok goes back a l-o-n-g way, and carries the slightly rancid whiff of academic computer science with it: it behaves as a compiler writer would expect it to behave. Forget our data processing example above: imagine it's the early seventies, you're at Berkeley Heights and you are attempting to write a lexical parser. You need to extract the tokens from this code:
if (somevariable == someothervariable)
{
dependentvariable = avalue;
}
In this case, as a compiler writer I don't care if the rbracket is followed by three tabs, a space, a carriage return, a newline and three more spaces; I just want to see lbrace as the next delimited token. So the behaviour of strtok, as per its documentation, is to skip any run of leading delimiters before each token, which means that consecutive delimiters collapse into a single separator and a blank field can never come back as a token. And CString::Tokenize behaves exactly the same way.
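If you want to see the delimiter-skipping for yourself without firing up MFC, here is a minimal sketch using plain strtok from standard C++. SplitWithStrtok is a hypothetical helper name of my own choosing, not part of the C runtime, and it copies the input because strtok writes into the buffer it is given.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not a library function): collect every token that
// strtok reports for the given input and delimiter set.
std::vector<std::string> SplitWithStrtok(const std::string& input,
                                         const char* delimiters)
{
    // strtok modifies its argument, so work on a writable,
    // null-terminated copy of the input.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* token = std::strtok(buffer.data(), delimiters);
         token != nullptr;
         token = std::strtok(nullptr, delimiters))
    {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling SplitWithStrtok("1.23/-1.1,,B16", "/,") gives back three tokens, "1.23", "-1.1" and "B16": the empty field between the two commas is skipped rather than reported.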
So, taking off our tie-dyes and flares and running the clock forward to today, how do I solve the problem of parsing the tokens out of my protocol string, including blank ones? I have found several functions online, but they commonly only deal with a single delimiter. I took the entirely sensible premise of one function I found online - to parse a CString into a CStringArray - and extended it to support multiple delimiters. This is what I ended up with:
void ExtractTokensToArray (CString s, CStringArray& OutputArray, CString Delimiters)
{
    int Hit;
    int EarliestHit;
    int DelimiterIndex;
    CString sExtract;
    BOOL MoreTokens = TRUE;
    BOOL GotToken = FALSE;
    TCHAR CurrDelimiter;

    while (MoreTokens)
    {
        GotToken = FALSE;
        EarliestHit = s.GetLength();

        // Trawl the string looking for the leftmost (earliest) hit in
        // the list of valid separators.
        for (DelimiterIndex = 0;
             DelimiterIndex < Delimiters.GetLength();
             DelimiterIndex++)
        {
            CurrDelimiter = Delimiters [DelimiterIndex];
            Hit = s.Find (CurrDelimiter);
            if (Hit != -1)
            {
                if (Hit < EarliestHit)
                {
                    EarliestHit = Hit;
                }
                GotToken = TRUE;
            }
        }

        if (GotToken)
        {
            sExtract = s.Left(EarliestHit);
            s = s.Right(s.GetLength() - EarliestHit - 1);
            OutputArray.Add(sExtract);
        }
        else
        {
            // Trailing token.
            MoreTokens = FALSE;
            OutputArray.Add(s);
        }
    }
}
In the great tradition of this site, error handling and input sanity checking are left as an exercise for the student :)
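For anyone without MFC to hand, the same idea can be sketched in standard C++. This is an approximation of the approach rather than a drop-in replacement: it swaps CString and CStringArray for std::string and std::vector, lets std::string::find_first_of do the hunt for the earliest delimiter, and uses a function name (ExtractTokens) that I have made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical standard C++ counterpart to a multi-delimiter splitter that
// keeps blank fields. Every delimiter ends a token, even an empty one.
std::vector<std::string> ExtractTokens(const std::string& s,
                                       const std::string& delimiters)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;

    for (;;)
    {
        // Find the earliest occurrence of any delimiter at or after 'start'.
        const std::string::size_type hit = s.find_first_of(delimiters, start);
        if (hit == std::string::npos)
        {
            // Trailing token: everything left, which may be empty.
            tokens.push_back(s.substr(start));
            break;
        }

        // Token between the previous delimiter and this one (may be empty).
        tokens.push_back(s.substr(start, hit - start));
        start = hit + 1;
    }
    return tokens;
}
```

With this sketch, ExtractTokens("1.23/-1.1,,B16", "/,") yields four tokens, "1.23", "-1.1", "" and "B16". A leading or trailing delimiter likewise produces a blank token at that end, and an empty input produces a single blank token, so callers who do not want those need to filter them out.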