Searched hist:de930f45cb56ccf7535cbacee3f3686d416f5283 (Results 1 – 1 of 1) sorted by relevance
/openbmc/qemu/qobject/ |
H A D | json-lexer.c | diff de930f45cb56ccf7535cbacee3f3686d416f5283 Thu Aug 23 11:39:51 CDT 2018 Markus Armbruster <armbru@redhat.com> json: Leave rejecting invalid UTF-8 to parser
Both the lexer and the parser (attempt to) validate UTF-8 in JSON strings.
The lexer rejects bytes that can't occur in valid UTF-8: \xC0..\xC1, \xF5..\xFF. This rejects some, but not all invalid UTF-8. It also rejects ASCII control characters \x00..\x1F, in accordance with RFC 8259 (see recent commit "json: Reject unescaped control characters").
When the lexer rejects, it ends the token right after the first bad byte. Good when the bad byte is a newline. Not so good when it's something like an overlong sequence in the middle of a string. For instance, input
{"abc\xC0\xAFijk": 1}\n
produces the tokens
JSON_LCURLY { JSON_ERROR "abc\xC0 JSON_ERROR \xAF JSON_KEYWORD ijk JSON_ERROR ": 1}\n
The parser then reports four errors
Invalid JSON syntax Invalid JSON syntax JSON parse error, invalid keyword 'ijk' Invalid JSON syntax
before it recovers at the newline.
The commit before previous made the parser reject invalid UTF-8 sequences. Since then, anything the lexer rejects, the parser would reject as well. Thus, the lexer's rejecting is unnecessary for correctness, and harmful for error reporting.
However, we want to keep rejecting ASCII control characters in the lexer, because that produces the behavior we want for unclosed strings.
We also need to keep rejecting \xFF in the lexer, because we documented that as a way to reset the JSON parser (docs/interop/qmp-spec.txt section 2.6 QGA Synchronization), which means we can't change how we recover from this error now. I wish we hadn't done that.
I think we should treat \xFE the same as \xFF.
Change the lexer to accept \xC0..\xC1 and \xF5..\xFD. It now rejects only \x00..\x1F and \xFE..\xFF. Error reporting for invalid UTF-8 in strings is much improved, except for \xFE and \xFF. For the example above, the lexer now produces
JSON_LCURLY { JSON_STRING "abc\xC0\xAFijk" JSON_COLON : JSON_INTEGER 1 JSON_RCURLY
and the parser reports just
JSON parse error, invalid UTF-8 sequence in string
Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20180823164025.12553-25-armbru@redhat.com>
|