json-lexer: make lexer error-recovery more deterministic

Currently when we reach an error state we effectively flush everything fed to the lexer, which can put us in a state where we keep feeding tokens into the parser at arbitrary offsets in the stream. This makes it difficult for the lexer/tokenizer/parser to get back in sync when bad input is made by the client. With these changes we emit an error state/token up to the tokenizer as soon as we reach an error state, and continue processing any data passed in rather than bailing out. The reset token will be used to reset the tokenizer and parser, such that they'll recover state as soon as the lexer begins generating valid token sequences again. We also map chr(192,193,245-255) to an error state here, since they are invalid UTF-8 characters. QMP guest proxy/agent will use chr(255) to force a flush/reset of previous input for reliable delivery of certain events, so also we document that thoroughly here. Signed-off-by: N Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: N Anthony Liguori <aliguori@us.ibm.com>

json-lexer: make lexer error-recovery more deterministic
Currently when we reach an error state we effectively flush everything fed to the lexer, which can put us in a state where we keep feeding tokens into the parser at arbitrary offsets in the stream. This makes it difficult for the lexer/tokenizer/parser to get back in sync when bad input is made by the client. With these changes we emit an error state/token up to the tokenizer as soon as we reach an error state, and continue processing any data passed in rather than bailing out. The reset token will be used to reset the tokenizer and parser, such that they'll recover state as soon as the lexer begins generating valid token sequences again. We also map chr(192,193,245-255) to an error state here, since they are invalid UTF-8 characters. QMP guest proxy/agent will use chr(255) to force a flush/reset of previous input for reliable delivery of certain events, so also we document that thoroughly here. Signed-off-by: N Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: N Anthony Liguori <aliguori@us.ibm.com>
b011f619 · Michael Roth · Anthony Liguori · bd3924a3 · b011f619 · b011f619
隐藏空白更改
内联并排

Showing with 22 addition and 4 deletion

json-lexer.c json-lexer.c +21 -4

json-lexer.h json-lexer.h +1 -0

未找到文件。
--- a/json-lexer.c
+++ b/json-lexer.c
@@ -105,7 +105,8 @@ static const uint8_t json_lexer[][256] =  {
        ['u'] = IN_DQ_UCODE0,
    },
    [IN_DQ_STRING] = {
-        [1 ... 0xFF] = IN_DQ_STRING,
+        [1 ... 0xBF] = IN_DQ_STRING,
+        [0xC2 ... 0xF4] = IN_DQ_STRING,
        ['\\'] = IN_DQ_STRING_ESCAPE,
        ['"'] = JSON_STRING,
    },
@@ -144,7 +145,8 @@ static const uint8_t json_lexer[][256] =  {
        ['u'] = IN_SQ_UCODE0,
    },
    [IN_SQ_STRING] = {
-        [1 ... 0xFF] = IN_SQ_STRING,
+        [1 ... 0xBF] = IN_SQ_STRING,
+        [0xC2 ... 0xF4] = IN_SQ_STRING,
        ['\\'] = IN_SQ_STRING_ESCAPE,
        ['\''] = JSON_STRING,
    },
@@ -305,10 +307,25 @@ static int json_lexer_feed_char(JSONLexer *lexer, char ch, bool flush)
            new_state = IN_START;
            break;
        case IN_ERROR:
+            /* XXX: To avoid having previous bad input leaving the parser in an
+             * unresponsive state where we consume unpredictable amounts of
+             * subsequent "good" input, percolate this error state up to the
+             * tokenizer/parser by forcing a NULL object to be emitted, then
+             * reset state.
+             *
+             * Also note that this handling is required for reliable channel
+             * negotiation between QMP and the guest agent, since chr(0xFF)
+             * is placed at the beginning of certain events to ensure proper
+             * delivery when the channel is in an unknown state. chr(0xFF) is
+             * never a valid ASCII/UTF-8 sequence, so this should reliably
+             * induce an error/flush state.
+             */
+            lexer->emit(lexer, lexer->token, JSON_ERROR, lexer->x, lexer->y);
            QDECREF(lexer->token);
            lexer->token = qstring_new();
            new_state = IN_START;
-            return -EINVAL;
+            lexer->state = new_state;
+            return 0;
        default:
            break;
        }
@@ -346,7 +363,7 @@ int json_lexer_feed(JSONLexer *lexer, const char *buffer, size_t size)

 int json_lexer_flush(JSONLexer *lexer)
 {
-    return lexer->state == IN_START ? 0 : json_lexer_feed_char(lexer, 0);
+    return lexer->state == IN_START ? 0 : json_lexer_feed_char(lexer, 0, true);
 }

 void json_lexer_destroy(JSONLexer *lexer)

--- a/json-lexer.h
+++ b/json-lexer.h
@@ -25,6 +25,7 @@ typedef enum json_token_type {
    JSON_STRING,
    JSON_ESCAPE,
    JSON_SKIP,
+    JSON_ERROR,
 } JSONTokenType;

 typedef struct JSONLexer JSONLexer;