    json: Leave rejecting invalid UTF-8 to parser · de930f45
    Committed by Markus Armbruster
    Both the lexer and the parser (attempt to) validate UTF-8 in JSON
    strings.
    
    The lexer rejects bytes that can't occur in valid UTF-8: \xC0..\xC1,
    \xF5..\xFF.  This rejects some, but not all invalid UTF-8.  It also
    rejects ASCII control characters \x00..\x1F, in accordance with RFC
    8259 (see recent commit "json: Reject unescaped control characters").
    
    When the lexer rejects, it ends the token right after the first bad
    byte.  Good when the bad byte is a newline.  Not so good when it's
    something like an overlong sequence in the middle of a string.  For
    instance, input
    
        {"abc\xC0\xAFijk": 1}\n
    
    produces the tokens
    
        JSON_LCURLY   {
        JSON_ERROR    "abc\xC0
        JSON_ERROR    \xAF
        JSON_KEYWORD  ijk
        JSON_ERROR   ": 1}\n
    
    The parser then reports four errors
    
        Invalid JSON syntax
        Invalid JSON syntax
        JSON parse error, invalid keyword 'ijk'
        Invalid JSON syntax
    
    before it recovers at the newline.
    
    The commit before previous made the parser reject invalid UTF-8
    sequences.  Since then, anything the lexer rejects, the parser would
    reject as well.  Thus, the lexer's rejecting is unnecessary for
    correctness, and harmful for error reporting.
    
    However, we want to keep rejecting ASCII control characters in the
    lexer, because that produces the behavior we want for unclosed
    strings.
    
    We also need to keep rejecting \xFF in the lexer, because we
    documented that as a way to reset the JSON parser
    (docs/interop/qmp-spec.txt section 2.6 QGA Synchronization), which
    means we can't change how we recover from this error now.  I wish we
    hadn't done that.
    
    I think we should treat \xFE the same as \xFF.
    
    Change the lexer to accept \xC0..\xC1 and \xF5..\xFD.  It now rejects
    only \x00..\x1F and \xFE..\xFF.  Error reporting for invalid UTF-8 in
    strings is much improved, except for \xFE and \xFF.  For the example
    above, the lexer now produces
    
        JSON_LCURLY   {
        JSON_STRING   "abc\xC0\xAFijk"
        JSON_COLON    :
        JSON_INTEGER  1
        JSON_RCURLY   }
    
    and the parser reports just
    
        JSON parse error, invalid UTF-8 sequence in string
    Signed-off-by: Markus Armbruster <armbru@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Message-Id: <20180823164025.12553-25-armbru@redhat.com>