Parsing delimited strings

June 19, 2012

While reviewing the dzen source code I came across a function called fill_ev_table, which parses a string of action and event keywords (see Events and actions). For example, the following is a default event string for dzen.

char edef[]  = "entertitle=uncollapse,grabkeys;"
    "enterslave=grabkeys;leaveslave=collapse,ungrabkeys;"
    "button1=menuexec;button2=togglestick;button3=exit:13;"
    "button4=scrollup;button5=scrolldown;"
    "key_Up=scrollup;key_Down=scrolldown;"
    "key_Escape=ungrabkeys,exit";

String token parsing

The parsing is done by calling the strtok_r function, which is a member of the C Standard Library.

char* strtok(char * restrict s1, const char *restrict s2);

The original strtok function accepts two strings, the first is the string to be parsed and the second is the delimiter. The first time that strtok is called with a new string it will created a static buffer. That static buffer is used to return successive tokens with each preceding call made to strtok. The existence of a static buffer that could be accessed by multiple threads means this function is neither reentrant, nor thread-safe. This issue is solved by the strtok_r function.

char* strtok_r(char * restrict s, const char *restrict sep, char **restrict lasts);

The reentrant version of strtok forces the caller to save the state of a string that is being parsed. It does this by adding a third argument, which is a pointer to the string that is being manipulated. The first call to strtok_r will return the leading token in the string and store address of the buffer in the third parameter.¹ Each following call will return the next token in the string, unless no more tokens exist, in which case the function returns NULL.

Dzen parsing of action string

The process of parsing a string with multiple delimiters can easily lead to terse branching code. This is clearly evidenced by the function I came across in the dzen source code.

void
fill_ev_table(char *input) {
    char *str1, *str2, *str3, *str4,
         *token, *subtoken, *kommatoken, *dptoken;
    char *saveptr1=NULL,
         *saveptr2=NULL,
         *saveptr3=NULL,
         *saveptr4=NULL;
    int j, i=0, k=0;
    long eid=0;
    handlerf *ah=0;

    for (j = 1, str1 = input; ; j++, str1 = NULL) {
        token = strtok_r(str1, ";", &saveptr1);
        if (token == NULL)
            break;

        for (str2 = token; ; str2 = NULL) {
            subtoken = strtok_r(str2, "=", &saveptr2);
            if (subtoken == NULL)
                break;
            if( (str2 == token) && ((eid = get_ev_id(subtoken)) != -1))
                ;
            else if(eid == -1)
                break;

            for (str3 = subtoken; ; str3 = NULL) {
                kommatoken = strtok_r(str3, ",", &saveptr3);
                if (kommatoken == NULL)
                    break;

                for (str4 = kommatoken; ; str4 = NULL) {
                    dptoken = strtok_r(str4, ":", &saveptr4);
                    if (dptoken == NULL) {
                        break;
                    }
                    if(str4 == kommatoken && str4 != token && eid != -1) {
                        if((ah = get_action_handler(dptoken)) != NULL) {
                            new_event(eid);
                            add_handler(eid, i, ah);
                            i++;
                        }
                    }
                    else if(str4 != token && eid != -1 && ah) {
                        add_option(eid, i-1, k, dptoken);
                        k++;
                    }
                    else if(!ah)
                        break;
                }
                k=0;
            }
            new_event(eid);
            add_handler(eid, i, NULL);
            i=0;
        }
    }
}

This is clearly written like the example source code provided in the Linux man-page STRTOK(3), which is very suspect. While searching for an alternative to strtok I came across a post by MrMichaelWill who wrote the following on LinuxQuestions.org:²

Unfortunately the manpage was updated by an amateur. The original author of the functions would certainly disagree.

This is a great example of when to check other sources. An excellent starting place is the POSIX STRTOK(3P) man-page. Although I could not find any applicable implementations of strtok I decided to design a more elegant algorithm than the one presented above.

Algorithm design

Implementing strtok_r to parse the aforementioned event string is surprisingly difficult.