Read logic must consider Unicode character lengths
The SourceTextStream type was operating under the assumption that Encoder.Convert was a non-throwing method so long as it was passed a destination buffer with at least one byte available for writing. The actual contract for Convert is that it will not throw so long as it is able to write the result of converting at least one character to the destination buffer (or there is nothing to convert). Otherwise it throws an ArgumentException indicating that it attempted to do work but was unable to do so.

The SourceTextStream type processes the characters in chunks according to the count passed into Read. This caused a bug when a character that is represented by more than one byte was at the end of a logical chunk of text. The encoder would convert all the chars except the last one, but SourceTextStream continued processing because there was at least one byte left in the destination buffer, and hence an exception was thrown.

The fix is to not check for count > 0 when processing but instead for count >= the maximum number of bytes the encoding could produce for a single character.

Note: I did consider calling GetByteCount here instead but decided against it. It essentially forces the encoder to do the work of decoding the lead byte twice on every iteration of the loop. It seemed better to keep the simple worst-case check here.

closes #1197
closes #1221
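To illustrate the difference between the two loop guards, here is a minimal model of the idea in Python. It is not the SourceTextStream code: `convert_one` and `read_chunk` are hypothetical helpers, `convert_one` only imitates an encoder that fails when a single character does not fit, and UTF-8 with a 4-byte worst case per code point is assumed.

```python
MAX_BYTES_PER_CHAR = 4  # assumed worst case for one code point in UTF-8


def convert_one(ch, remaining):
    """Hypothetical stand-in for the encoder: fails if one character does not fit."""
    encoded = ch.encode("utf-8")
    if len(encoded) > remaining:
        raise ValueError("destination buffer too small for one character")
    return encoded


def read_chunk(text, start, count, min_free):
    """Encode characters from text[start:] into at most `count` bytes.

    `min_free` is the loop guard: 1 models the old "count > 0" check,
    MAX_BYTES_PER_CHAR models the worst-case check described above.
    Returns the bytes produced and the index of the next unread character.
    """
    out = bytearray()
    i = start
    while i < len(text) and count - len(out) >= min_free:
        out += convert_one(text[i], count - len(out))
        i += 1
    return bytes(out), i
```

With `min_free=1`, a 3-byte chunk over the text "a€" encodes "a" and then fails on "€", which needs three bytes when only two remain. With `min_free=MAX_BYTES_PER_CHAR`, the loop stops before attempting a character that might not fit and leaves it for the next chunk. The cost of the worst-case guard is that a few bytes at the end of a chunk can go unused.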