1. 24 Sep, 2015 1 commit
  2. 16 Jun, 2015 1 commit
    • byte-based C locale, phase 1: multibyte character handling functions · 1507ebf8
      Committed by Rich Felker
      this patch makes the functions which work directly on multibyte
      characters treat the high bytes as individual abstract code units
      rather than as multibyte sequences when MB_CUR_MAX is 1. since
      MB_CUR_MAX is presently defined as a constant 4, all of the new code
      added is dead code, and optimizing compilers' code generation should
      not be affected at all. a future commit will activate the new code.
      
      as abstract code units, bytes 0x80 to 0xff are represented by wchar_t
      values 0xdf80 to 0xdfff, at the end of the surrogates range. this
      ensures that they will never be misinterpreted as Unicode characters,
      and that all wctype functions return false for these "characters"
      without needing locale-specific logic. a high range outside of Unicode
      such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's
      char16_t also needs to be able to represent conversions of these
      bytes, the surrogate range was the natural choice.
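The byte-to-code-unit mapping described in this commit message can be sketched roughly as follows. The helper names are hypothetical and are not taken from the commit; only the ranges 0x80 to 0xff and 0xdf80 to 0xdfff come from the message.

```c
#include <wchar.h>

/* Hypothetical helpers illustrating the mapping; not the commit's code. */

/* Map a high byte (0x80..0xff) to its abstract code unit (0xdf80..0xdfff). */
static wchar_t byte_to_codeunit(unsigned char b)
{
    return (wchar_t)(0xdf00 | b);
}

/* Nonzero if wc lies in the code-unit range 0xdf80..0xdfff. */
static int is_codeunit(wchar_t wc)
{
    /* One unsigned comparison covers both bounds. */
    return (unsigned long)wc - 0xdf80UL < 0x80UL;
}

/* Recover the original byte from a code unit. */
static unsigned char codeunit_to_byte(wchar_t wc)
{
    return (unsigned char)(wc & 0xff);
}
```

Because the whole range sits inside the surrogate block, a round trip through `byte_to_codeunit` and `codeunit_to_byte` preserves the byte, and no value in the range is a valid Unicode scalar value.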
  3. 28 Mar, 2015 1 commit
    • regex: fix character class repetitions · c498efe1
      Committed by Szabolcs Nagy
      Internally regcomp needs to copy some iteration nodes before
      translating the AST into TNFA representation.
      
      Literal nodes were not copied correctly: the class type and list
      of negated class types were not copied so classes were ignored
      (in the non-negated case an ignored char class caused the literal
      to match everything).
      
      This affects iterations when the upper bound is finite, larger
      than one or the lower bound is larger than one. So eg. the EREs
      
       [[:digit:]]{2}
       [^[:space:]ab]{1,4}
      
      were treated as
      
       .{2}
       [^ab]{1,4}
      
      The fix is done with minimal source modification to copy the
      necessary fields, but the AST preparation and node handling
      code of tre will need to be cleaned up for clarity.
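A small check for the behaviour this commit message describes might look like the sketch below. The helper `ere_matches` is hypothetical; it only uses the standard `regcomp`, `regexec`, and `regfree` calls, and the patterns are the two from the message with anchors added.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the ERE matches text, 0 if not, -1 if it fails to compile. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With a correct implementation, `ere_matches("^[[:digit:]]{2}$", "4x")` is 0. If the class were dropped during the copy, as the message describes, the pattern would behave like `^.{2}$` and the same call would report a match.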
  4. 24 Mar, 2015 1 commit
    • do not treat \0 as a backref in BRE · 32dee9b9
      Committed by Szabolcs Nagy
      The valid BRE backref tokens are \1 .. \9, and 0 is not a special
      character either so \0 is undefined by the standard.
      
      Such undefined escaped characters are treated as literal characters
      currently, following existing practice, so \0 is the same as 0.
  5. 21 Mar, 2015 2 commits
    • suppress backref processing in ERE regcomp · 7c8c86f6
      Committed by Rich Felker
      one of the features of ERE is that it's actually a regular language
      and does not admit expressions which cannot be matched in linear time.
      introduction of \n backref support into regcomp's ERE parsing was
      unintentional.
    • fix memory-corruption in regcomp with backslash followed by high byte · 39dfd584
      Committed by Rich Felker
      the regex parser handles the (undefined) case of an unexpected byte
      following a backslash as a literal. however, instead of correctly
      decoding a character, it was treating the byte value itself as a
      character. this was not only semantically unjustified, but turned out
      to be dangerous on archs where plain char is signed: bytes in the
      range 252-255 alias the internal codes -4 through -1 used for special
      types of literal nodes in the AST.
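The aliasing hazard in this commit message can be illustrated with the sketch below. The marker constants and function names are hypothetical; only the ranges 252 to 255 and -4 to -1 come from the message. Per the message the actual fix is to decode a character properly, so the unsigned conversion here only shows why the raw signed byte is dangerous.

```c
/* Hypothetical marker codes mirroring the -4..-1 range named in the message. */
enum { MARK_FIRST = -4, MARK_LAST = -1 };

/* Hazardous pattern: a raw byte widened through a signed char type. */
static int literal_from_signed(signed char c)
{
    return c;               /* bytes 0xfc..0xff arrive here as -4..-1 */
}

/* Going through unsigned char keeps the value in 0..255. */
static int literal_from_unsigned(char c)
{
    return (unsigned char)c;
}

/* Nonzero if code collides with one of the reserved marker values. */
static int is_marker(int code)
{
    return code >= MARK_FIRST && code <= MARK_LAST;
}
```

On the usual two's-complement targets, `literal_from_signed((signed char)0xfc)` yields -4 and is indistinguishable from a marker, while `literal_from_unsigned((char)0xfc)` yields 252.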
  6. 18 Dec, 2014 1 commit
  7. 13 Sep, 2014 1 commit
    • rewrite the regex pattern parser in regcomp · ec1aed0a
      Committed by Szabolcs Nagy
      The new code is a bit simpler and the generated code is about 1KB
      smaller (on i386). The basic design was kept including internal
      interfaces, TNFA generation was not touched.
      
      The old tre parser had various issues:
      
      [^aa-z]
      negated overlapping ranges in a bracket expression were handled
      incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z])
      
      a{,2}
      missing lower bound in a counted repetition should be an error,
      but it was accepted with broken semantics: a{,2} was treated as
      a{0,3}, the new parser rejects it
      
      a{999,}
      large min count was not rejected (a{5000,} failed with REG_ESPACE
      due to reaching a stack limit), the new parser enforces the
      RE_DUP_MAX limit
      
      \xff
      regcomp used to accept a pattern with illegal sequences in it
      (treated them as empty expression so p\xffq matched pq) the new
      parser rejects such patterns with REG_BADPAT or REG_ERANGE
      
      [^b-fD-H] with REG_ICASE
      old parser turned this into [^b-fB-F] because of the negated
      overlapping range issue (see above), the new parser treats it
      as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical
      implementations do case-folding first and negate the character
      set later instead of the other way around. (Supporting the posix
      way efficiently would require significant changes so it was left
      as is, it is unclear if any application actually expects the
      posix behaviour, this issue is raised on the austingroup tracker:
      http://austingroupbugs.net/view.php?id=872 ).
      
      another case-insensitive matching issue is that unicode case
      folding rules can group more than two characters together while
      towupper and towlower can only work for a pair of upper and
      lower case characters, this is a limitation of POSIX so it is
      not fixed.
      
      invalid bracket and brace expressions may return different error
      codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead
      of REG_EBRACE) otherwise the new parser should be compatible with
      the old one.
      
      regcomp should be able to handle arbitrary pattern input if the
      pattern length is limited, the only exception is the use of large
      repetition counts (eg. (a{255}){255}) which require exp amount
      of memory and there is no easy workaround.
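The first issue in the list, negated overlapping ranges, lends itself to a one-character check. The sketch below uses a hypothetical helper `char_matches` built on the standard regex calls, and it assumes the default C locale so that `a-z` covers exactly the ASCII lowercase letters.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the one-character string ch matches the ERE,
   0 if it does not, -1 if the pattern fails to compile. */
static int char_matches(const char *ere, char ch)
{
    regex_t re;
    char text[2];
    int rc;

    text[0] = ch;
    text[1] = '\0';
    if (regcomp(&re, ere, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For `^[^aa-z]$`, a parser that handles the overlap correctly rejects `'b'`. A parser that reduced the expression to `[^a]`, as the message says the old one did, would accept it.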
  8. 06 Sep, 2014 1 commit
  9. 26 Jul, 2014 1 commit
    • add support for LC_TIME and LC_MESSAGES translations · c5b8f193
      Committed by Rich Felker
      for LC_MESSAGES, translation of strerror and similar literal message
      functions is supported. for messages in other places (particularly the
      dynamic linker) that use format strings, translation is not yet
      supported. in order to make it possible and safe, such messages will
      need to be refactored to separate the textual content from the format.
      
      for LC_TIME, the day and month names and strftime-style format strings
      provided by nl_langinfo are supported for translation. however there
      may be limitations, as some of the original C-locale nl_langinfo
      strings are non-unique and thus perhaps non-suitable as keys.
      
      overall, the locale support activated by this commit should not be
      seen as complete and polished but as a basis for beginning to test
      locale functionality and implement locales.
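One likely instance of the non-unique key problem mentioned for LC_TIME is the month of May, whose full and abbreviated names are identical in the C locale. The sketch below compares two `nl_langinfo` items; the helper name is hypothetical, and whether this is the exact case the author had in mind is an assumption.

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: nonzero if two nl_langinfo items yield the same
   string in the current locale. */
static int same_langinfo_string(nl_item a, nl_item b)
{
    /* Copy the first result, since a later nl_langinfo call may
       overwrite the storage the pointer refers to. */
    char first[64];

    strncpy(first, nl_langinfo(a), sizeof first - 1);
    first[sizeof first - 1] = '\0';
    return strcmp(first, nl_langinfo(b)) == 0;
}
```

In the C locale, `same_langinfo_string(MON_5, ABMON_5)` is nonzero because both items are "May", so a translation lookup keyed only on the original string cannot tell the two apart.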
  10. 18 Jul, 2014 1 commit
  11. 12 Dec, 2013 1 commit
  12. 02 Dec, 2013 3 commits
    • implement FNM_LEADING_DIR extension flag in fnmatch · a4e10e30
      Committed by Rich Felker
      previously this flag was defined and accepted as a no-op, possibly
      breaking some software that uses it. given the choice to remove the
      definition and possibly break applications that were already working,
      or simply implement the feature, the latter turned out to be easy
      enough to make the decision easy.
      
      in the case where the FNM_PATHNAME flag is also set, this
      implementation is clean and essentially optimal. otherwise, it's an
      inefficient "brute force" implementation. at some point, when cleaning
      up and refactoring this code, I may add a more direct code path for
      handling FNM_LEADING_DIR in the non-FNM_PATHNAME case, but at this
      point my main interest is avoiding introducing new bugs in the code
      that implements the standard fnmatch features specified by POSIX.
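A short usage sketch of the flag, combined with FNM_PATHNAME as in the case the message calls clean: the helper name is hypothetical, and `_GNU_SOURCE` is defined on the assumption that the C library exposes this extension only under a feature-test macro.

```c
#define _GNU_SOURCE   /* FNM_LEADING_DIR is an extension, not part of POSIX */
#include <fnmatch.h>

/* Hypothetical helper: nonzero if pattern matches path or any leading
   run of whole directory components of path. */
static int matches_leading_dir(const char *pattern, const char *path)
{
    return fnmatch(pattern, path, FNM_PATHNAME | FNM_LEADING_DIR) == 0;
}
```

With the flag, `src/*` matches `src/regex/fnmatch.c` because the trailing `/fnmatch.c` is ignored once `src/regex` has matched. With FNM_PATHNAME alone the same call returns FNM_NOMATCH.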
    • fix fnmatch corner cases related to escaping · 6ec82a3b
      Committed by Rich Felker
      the FNM_PATHNAME logic for advancing by /-delimited components was
      incorrect when the / character was escaped (i.e. \/), and a final \ at
      the end of pattern was not handled correctly.
    • fix the end of string matching in fnmatch with FNM_PATHNAME · da0fcdb8
      Committed by Szabolcs Nagy
      a '/' in the pattern could be incorrectly matched against the
      terminating null byte in the string causing arbitrarily long
      sequence of out-of-bounds access in fnmatch("/","",FNM_PATHNAME)
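The call quoted in this commit message makes a natural regression check. A minimal sketch, with a hypothetical wrapper name:

```c
#include <fnmatch.h>

/* Hypothetical regression check: nonzero if the pattern "/" is wrongly
   reported as matching the empty string. */
static int slash_matches_empty(void)
{
    return fnmatch("/", "", FNM_PATHNAME) == 0;
}
```

A correct implementation returns FNM_NOMATCH for that call, so `slash_matches_empty()` should be 0, while `fnmatch("/", "/", FNM_PATHNAME)` still returns 0.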
  13. 07 Oct, 2013 1 commit
  14. 01 Feb, 2013 1 commit
  15. 15 Jan, 2013 1 commit
  16. 14 Jan, 2013 1 commit
  17. 07 Sep, 2012 1 commit
    • use restrict everywhere it's required by c99 and/or posix 2008 · 400c5e5c
      Committed by Rich Felker
      to deal with the fact that the public headers may be used with pre-c99
      compilers, __restrict is used in place of restrict, and defined
      appropriately for any supported compiler. we also avoid the form
      [restrict] since older versions of gcc rejected it due to a bug in the
      original c99 standard, and instead use the form *restrict.
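The approach in this commit message can be sketched as below. `MY_RESTRICT` and `copy_bytes` are hypothetical names used so the example does not redefine the reserved identifier `__restrict`; the exact conditions the real headers use are not shown in the message, so the preprocessor logic is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for __restrict: expands to something every
   supported compiler accepts. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define MY_RESTRICT restrict
#elif defined(__GNUC__)
#define MY_RESTRICT __restrict
#else
#define MY_RESTRICT
#endif

/* Pointer form (*restrict) rather than the array form dst[restrict],
   which the message says older gcc versions rejected. */
static char *copy_bytes(char *MY_RESTRICT dst, const char *MY_RESTRICT src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
    return dst;
}
```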
  18. 25 May, 2012 1 commit
    • fix regex on arm · 8b4c232e
      Committed by Rich Felker
      TRE has a broken assumption that wchar_t is signed, which is a sane
      expectation, but not required by the standard, and false on ARM's ABI.
      
      i leave tre_char_t as wchar_t for now, since a pointer to it is
      directly passed to functions that need pointer to wchar_t. it does not
      seem to break anything. and since the maximum unicode scalar value is
      0x10ffff, just use that explicitly rather than using the max value of
      any particular C type.
  19. 14 May, 2012 2 commits
    • remove some no-op end of string tests from regex parser · 13b2945a
      Committed by Rich Felker
      these are cruft from the original code which used an explicit string
      length rather than null termination. i blindly converted all the
      checks to null terminator checks, without noticing that in several
      cases, the subsequent switch statement would automatically handle the
      null byte correctly.
    • another BRE fix: in ^*, * is literal · e9cddc8e
      Committed by Rich Felker
      i don't understand why this has to be conditional on being in BRE
      mode, but enabling this code unconditionally breaks a huge number of
      ERE test cases.
  20. 08 May, 2012 4 commits
  21. 29 Apr, 2012 1 commit
    • new fnmatch implementation · 45b38550
      Committed by Rich Felker
      unlike the old one, this one's algorithm does not suffer from
      potential stack overflow issues or pathologically bad performance on
      certain patterns. instead of backtracking, it uses a matching
      algorithm which I have not seen before (unsure whether I invented or
      re-invented it) that runs in O(1) space and O(nm) time. it may be
      possible to improve the time to O(n), but not without significantly
      greater complexity.
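The commit message does not spell out its algorithm, so the sketch below is not that implementation. It is a well-known iterative wildcard matcher, restricted to `*` and `?` with no brackets, escapes, or FNM flags, that shows how matching without recursion can stay in O(1) space and O(nm) time. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified matcher: supports only '*' and '?'. */
static bool wild_match(const char *pat, const char *str)
{
    const char *star = NULL;   /* most recent '*' seen in pat */
    const char *retry = NULL;  /* position in str where that '*' stops matching */

    while (*str) {
        if (*pat == '*') {
            /* Remember the star; first try letting it match nothing. */
            star = pat++;
            retry = str;
        } else if (*pat == '?' || *pat == *str) {
            pat++;
            str++;
        } else if (star) {
            /* Mismatch: let the last star absorb one more character. */
            pat = star + 1;
            str = ++retry;
        } else {
            return false;
        }
    }
    /* Only trailing stars may remain once str is exhausted. */
    while (*pat == '*')
        pat++;
    return *pat == '\0';
}
```

Only the most recent `*` is ever revisited, so two saved pointers replace a backtracking stack; each of the n string positions can restart at most m pattern characters of work.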
  22. 27 Apr, 2012 1 commit
    • update fnmatch to POSIX 2008 semantics · 2b87a5db
      Committed by Rich Felker
      an invalid bracket expression must be treated as if the opening
      bracket were just a literal character. this is to fix a bug whereby
      POSIX left the behavior of the "[" shell command undefined due to it
      being an invalid bracket expression.
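The rule in this commit message can be checked with the lone-bracket pattern itself. A minimal sketch, with a hypothetical helper name:

```c
#include <fnmatch.h>

/* Hypothetical helper: nonzero if name matches the pattern "[", which
   is an invalid bracket expression and so is taken literally. */
static int is_literal_bracket(const char *name)
{
    return fnmatch("[", name, 0) == 0;
}
```

Under the POSIX 2008 semantics, `is_literal_bracket("[")` is nonzero and any other name is rejected, which is what lets a shell pattern find the `[` command by name.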
  23. 15 Apr, 2012 1 commit
    • fix signedness error handling invalid multibyte sequences in regexec · b9dd43db
      Committed by Rich Felker
      the "< 0" test was always false due to use of an unsigned type. this
      resulted in infinite loops on 32-bit machines (adding -1U to a pointer
      is the same as adding -1) and crashes on 64-bit machines (offsetting
      the string pointer by 4gb-1b when an illegal sequence was hit).
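The class of bug in this commit message can be reproduced in a few lines. The decoder below is a hypothetical stand-in that returns -1 for an illegal sequence; the real code and its exact types are not shown in the message.

```c
/* Hypothetical decoder result: bytes consumed, or -1 for an illegal sequence. */
static int decode_step(unsigned char lead)
{
    return lead < 0x80 ? 1 : -1;
}

/* Buggy check: the result lands in an unsigned variable, so the
   "< 0" test can never be true and the error goes unnoticed. */
static int error_seen_unsigned(unsigned char lead)
{
    unsigned int len = (unsigned int)decode_step(lead);

    return len < 0;
}

/* Fixed check: keep the result in a signed type before testing it. */
static int error_seen_signed(unsigned char lead)
{
    int len = decode_step(lead);

    return len < 0;
}
```

For the byte 0xff, `error_seen_unsigned` returns 0 and `error_seen_signed` returns 1. In the unsigned case the caller would go on to advance its string pointer by the wrapped value, which is the pointer arithmetic the message describes.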
  24. 14 Apr, 2012 2 commits
    • remove invalid code from TRE · 386b34a0
      Committed by Rich Felker
      TRE wants to treat + and ? after a +, ?, or * as special; ? means
      ungreedy and + is reserved for future use. however, this is
      non-conformant. although redundant, these redundant characters have
      well-defined (no-op) meaning for POSIX ERE, and are actually _literal_
      characters (which TRE is wrongly ignoring) in POSIX BRE mode.
      
      the simplest fix is to simply remove the unneeded nonstandard
      functionality. as a plus, this shaves off a small amount of bloat.
    • fix broken regerror (typo) and missing message · b6dbdc69
      Committed by Rich Felker
  25. 21 Mar, 2012 1 commit
    • upgrade to latest upstream TRE regex code (0.8.0) · ad47d45e
      Committed by Rich Felker
      the main practical results of this change are
      1. the regex code is no longer subject to LGPL; it's now 2-clause BSD
      2. most (all?) popular nonstandard regex extensions are supported
      
      I hesitate to call this a "sync" since both the old and new code are
      heavily modified. in one sense, the old code was "more severely"
      modified, in that it was actively hostile to non-strictly-conforming
      expressions. on the other hand, the new code has eliminated the
      useless translation of the entire regex string to wchar_t prior to
      compiling, and now only converts multibyte character literals as
      needed.
      
      in the future i may use this modified TRE as a basis for writing the
      long-planned new regex engine that will avoid multibyte-to-wide
      character conversion entirely by compiling multibyte bracket
      expressions specific to UTF-8.
  26. 24 Jan, 2012 1 commit
  27. 23 Jan, 2012 1 commit
  28. 17 Jun, 2011 1 commit
  29. 07 Jun, 2011 1 commit
    • fix handling of d_name in struct dirent · da88b16a
      Committed by Rich Felker
      basically there are 3 choices for how to implement this variable-size
      string member:
      1. C99 flexible array member: breaks using dirent.h with pre-C99 compiler.
      2. old way: length-1 string: generates array bounds warnings in caller.
      3. new way: length-NAME_MAX string. no problems, simplifies all code.
      
      of course the usable part in the pointer returned by readdir might be
      shorter than NAME_MAX+1 bytes, but that is allowed by the standard and
      doesn't hurt anything.
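The three layouts listed in this commit message can be written out side by side. The struct and member names are hypothetical simplifications of `struct dirent`, the `NAME_MAX` fallback value is an assumption for systems that do not define it, and the fixed array is sized `NAME_MAX + 1` to leave room for the terminating null byte.

```c
#include <limits.h>
#include <stddef.h>

#ifndef NAME_MAX
#define NAME_MAX 255   /* assumed fallback, not from the commit */
#endif

/* Option 1: C99 flexible array member; unusable with pre-C99 compilers. */
struct entry_flex {
    unsigned long ino;
    char name[];
};

/* Option 2: length-1 array; callers indexing past name[0] may trigger
   array bounds warnings. */
struct entry_one {
    unsigned long ino;
    char name[1];
};

/* Option 3: fixed-size array large enough for any file name. */
struct entry_fixed {
    unsigned long ino;
    char name[NAME_MAX + 1];
};
```

All three place `name` at the same offset; they differ only in how much storage the type itself declares for it.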
  30. 06 Jun, 2011 2 commits
  31. 08 Apr, 2011 1 commit