The other weekend I was trying to use the Tokenize member function of the MFC class CString (with slash and comma as its delimiter set), to process some data input that looked a bit like this:
1.23/-1.1,,B16
I was expecting it to return something like:
"1.23" "-1.1" "" "B16"
i.e. every delimited token in the input, including the blank one. I was somewhat surprised, then, when the function returned:
"1.23" "-1.1" "B16"
The blank token got swallowed. Why? In order to understand this issue, we need a little history. Every writer of a tokenizer has to decide what to do when faced with blank fields, which just look like a series of delimiters. I think CString::Tokenize's writer decided to duplicate the behaviour of the C runtime function strtok, on the basis that this behaviour is what greybeard C programmers would expect.
Now strtok goes back a l-o-n-g way, and carries the slightly rancid whiff of academic computer science with it: it behaves as a compiler writer would expect it to behave. Forget our data processing example above: imagine it's the early seventies, you're at Berkeley Heights and you are attempting to write a lexical parser. You need to extract the tokens from this code:
if (somevariable == someothervariable) { dependentvariable = avalue; }
In this case, as a compiler writer, I don't care if the rbracket is followed by three tabs, a space, a carriage return, a newline and three more spaces; I just want to see lbrace as the next delimited token. So the behaviour of strtok, as per its documentation, is to skip any leading delimiters before each token, which means that a run of consecutive delimiters is treated as a single separator and never yields an empty token. And CString::Tokenize behaves exactly the same way.
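To see the skipping behaviour for yourself without an MFC project, here is a minimal sketch in standard C++ that feeds the sample input to strtok and collects whatever comes back. The helper names TokensFromStrtok and CheckStrtokSkipsBlankFields are my own inventions for illustration; only strtok itself comes from the C runtime.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): collect every token that
// strtok yields for the given input and delimiter set.
std::vector<std::string> TokensFromStrtok(std::string input, const char* delimiters)
{
    std::vector<std::string> tokens;

    // strtok writes NUL terminators into the buffer it scans, so it is given
    // the by-value copy of the input rather than the caller's string.
    for (char* token = std::strtok(&input[0], delimiters);
         token != nullptr;
         token = std::strtok(nullptr, delimiters))
    {
        tokens.push_back(token);
    }
    return tokens;
}

// Self-check: the empty field between the two commas never appears.
void CheckStrtokSkipsBlankFields()
{
    const std::vector<std::string> tokens = TokensFromStrtok("1.23/-1.1,,B16", "/,");
    assert(tokens.size() == 3);
    assert(tokens[0] == "1.23");
    assert(tokens[1] == "-1.1");
    assert(tokens[2] == "B16");
}
```

Note that strtok keeps its position in hidden static state, so this sketch is not safe to call from several threads at once.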
So, taking off our tie-dyes and flares and running the clock forward to today, how do I solve the problem of parsing the tokens out of my protocol string, including blank ones? I have found several functions online, but they commonly only deal with a single delimiter. I took the entirely sensible premise of one function I found online - to parse a CString into a CStringArray - and extended it to support multiple delimiters. This is what I ended up with:
    void ExtractTokensToArray (CString s, CStringArray& OutputArray, CString Delimiters)
    {
        int Hit;
        int EarliestHit;
        int DelimiterIndex;
        CString sExtract;
        BOOL MoreTokens = TRUE;
        BOOL GotToken = FALSE;
        TCHAR CurrDelimiter;

        while (MoreTokens)
        {
            GotToken = FALSE;
            EarliestHit = s.GetLength();

            // Trawl the string looking for the leftmost (earliest) hit in
            // the list of valid separators.
            for (DelimiterIndex = 0; DelimiterIndex < Delimiters.GetLength(); DelimiterIndex++)
            {
                CurrDelimiter = Delimiters [DelimiterIndex];
                Hit = s.Find (CurrDelimiter);
                if (Hit != -1)
                {
                    if (Hit < EarliestHit)
                    {
                        EarliestHit = Hit;
                    }
                    GotToken = TRUE;
                }
            }

            if (GotToken)
            {
                // Emit everything before the delimiter (possibly an empty
                // string), then drop that token and its delimiter from s.
                sExtract = s.Left(EarliestHit);
                s = s.Right(s.GetLength() - EarliestHit - 1);
                OutputArray.Add(sExtract);
            }
            else
            {
                // Trailing token.
                MoreTokens = FALSE;
                OutputArray.Add(s);
            }
        }
    }
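If you are not tied to MFC, the same idea can be sketched with std::string, whose find_first_of member locates the earliest occurrence of any character from a delimiter set in a single call. This is only a rough equivalent under that assumption, not a drop-in replacement for the function above; the names SplitKeepingBlanks and CheckSplitKeepsBlankFields are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical std::string counterpart of ExtractTokensToArray: split on any
// character in 'delimiters', keeping empty tokens between adjacent delimiters.
std::vector<std::string> SplitKeepingBlanks(const std::string& input,
                                            const std::string& delimiters)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;

    for (;;)
    {
        // Earliest delimiter at or after 'start', or npos if none remain.
        const std::string::size_type hit = input.find_first_of(delimiters, start);
        if (hit == std::string::npos)
        {
            // Trailing token (empty if the input ended with a delimiter).
            tokens.push_back(input.substr(start));
            break;
        }
        tokens.push_back(input.substr(start, hit - start));
        start = hit + 1; // step over the delimiter itself
    }
    return tokens;
}

// Self-check against the sample input from the start of the article.
void CheckSplitKeepsBlankFields()
{
    const std::vector<std::string> tokens = SplitKeepingBlanks("1.23/-1.1,,B16", "/,");
    assert(tokens.size() == 4);
    assert(tokens[0] == "1.23");
    assert(tokens[1] == "-1.1");
    assert(tokens[2].empty());
    assert(tokens[3] == "B16");
}
```

Because it walks the input with an index instead of repeatedly chopping the front off the string, this version avoids re-copying the remainder on every token. Like the MFC function, it returns a single empty token for an empty input.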
In the great tradition of this site, error handling and input sanity checking are left as an exercise for the student :)