  1. 24 Sep, 2015 1 commit
  2. 28 Mar, 2015 1 commit
    • regex: fix character class repetitions · c498efe1
      Committed by Szabolcs Nagy
      Internally regcomp needs to copy some iteration nodes before
      translating the AST into TNFA representation.
      
      Literal nodes were not copied correctly: the class type and list
      of negated class types were not copied so classes were ignored
      (in the non-negated case an ignored char class caused the literal
      to match everything).
      
      This affects iterations when the upper bound is finite and larger
      than one, or the lower bound is larger than one. So, e.g., the EREs
      
       [[:digit:]]{2}
       [^[:space:]ab]{1,4}
      
      were treated as
      
       .{2}
       [^ab]{1,4}
      
      The fix is done with minimal source modification to copy the
      necessary fields, but the AST preparation and node handling
      code of tre will need to be cleaned up for clarity.
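The two EREs quoted in this commit message can be checked against whatever libc is installed. The following is a minimal sketch using only the POSIX `regcomp`/`regexec` API; the helper name `ere_matches` is hypothetical and not part of musl. The second pattern is wrapped in `^...$` here (an addition for the test, not from the commit) so that a stray space cannot be skipped over.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches somewhere in
   `text`, 0 if it does not, and -1 if the pattern fails to compile. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the bug described above, `ere_matches("[[:digit:]]{2}", "abcd")` would behave like `.{2}` and report a match, and `ere_matches("^[^[:space:]ab]{1,4}$", "x y")` would behave like `[^ab]{1,4}` and also match. A correct implementation should return 0 for both.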
  3. 24 Mar, 2015 1 commit
    • do not treat \0 as a backref in BRE · 32dee9b9
      Committed by Szabolcs Nagy
      The valid BRE backref tokens are \1 .. \9, and 0 is not a special
      character either so \0 is undefined by the standard.
      
      Such undefined escaped characters are treated as literal characters
      currently, following existing practice, so \0 is the same as 0.
  4. 21 Mar, 2015 2 commits
    • suppress backref processing in ERE regcomp · 7c8c86f6
      Committed by Rich Felker
      one of the features of ERE is that it's actually a regular language
      and does not admit expressions which cannot be matched in linear time.
      introduction of \n backref support into regcomp's ERE parsing was
      unintentional.
    • fix memory-corruption in regcomp with backslash followed by high byte · 39dfd584
      Committed by Rich Felker
      the regex parser handles the (undefined) case of an unexpected byte
      following a backslash as a literal. however, instead of correctly
      decoding a character, it was treating the byte value itself as a
      character. this was not only semantically unjustified, but turned out
      to be dangerous on archs where plain char is signed: bytes in the
      range 252-255 alias the internal codes -4 through -1 used for special
      types of literal nodes in the AST.
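The aliasing described in this message can be reproduced without any regex code. This is a small sketch, not the parser's actual code; both function names are hypothetical. It shows how a byte read through plain `char` turns into a small negative number when `char` is signed.

```c
#include <limits.h>

/* Reads a byte value through plain char. Where char is signed, converting
   a value above CHAR_MAX is implementation-defined and in practice wraps,
   so 252..255 come back as -4..-1. */
static int byte_via_plain_char(unsigned char byte)
{
    char c = (char)byte;
    return c;
}

/* Reads the same byte through unsigned char: always 0..255. */
static int byte_via_unsigned_char(unsigned char byte)
{
    return byte;
}
```

On a target where `CHAR_MIN` is negative, `byte_via_plain_char(0xFC)` yields -4, which is exactly the kind of value the message says is reserved for special literal nodes. The unsigned variant avoids the collision but still does not decode a multibyte character, which is what the commit message says the parser should do.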
  5. 13 Sep, 2014 1 commit
    • rewrite the regex pattern parser in regcomp · ec1aed0a
      Committed by Szabolcs Nagy
      The new code is a bit simpler and the generated code is about 1KB
      smaller (on i386). The basic design was kept including internal
      interfaces, TNFA generation was not touched.
      
      The old tre parser had various issues:
      
      [^aa-z]
      negated overlapping ranges in a bracket expression were handled
      incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z])
      
      a{,2}
      missing lower bound in a counted repetition should be an error,
      but it was accepted with broken semantics: a{,2} was treated as
      a{0,3}, the new parser rejects it
      
      a{999,}
      large min count was not rejected (a{5000,} failed with REG_ESPACE
      due to reaching a stack limit), the new parser enforces the
      RE_DUP_MAX limit
      
      \xff
      regcomp used to accept a pattern with illegal sequences in it
      (treated them as empty expression so p\xffq matched pq) the new
      parser rejects such patterns with REG_BADPAT or REG_ERANGE
      
      [^b-fD-H] with REG_ICASE
      old parser turned this into [^b-fB-F] because of the negated
      overlapping range issue (see above), the new parser treats it
      as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical
      implementations do case-folding first and negate the character
      set later instead of the other way around. (Supporting the posix
      way efficiently would require significant changes so it was left
      as is, it is unclear if any application actually expects the
      posix behaviour, this issue is raised on the austingroup tracker:
      http://austingroupbugs.net/view.php?id=872 ).
      
      another case-insensitive matching issue is that unicode case
      folding rules can group more than two characters together while
      towupper and towlower can only work for a pair of upper and
      lower case characters, this is a limitation of POSIX so it is
      not fixed.
      
      invalid bracket and brace expressions may return different error
      codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead
      of REG_EBRACE) otherwise the new parser should be compatible with
      the old one.
      
      regcomp should be able to handle arbitrary pattern input if the
      pattern length is limited; the only exception is the use of large
      repetition counts (e.g. (a{255}){255}), which require an exponential
      amount of memory and have no easy workaround.
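The first issue in the list above, `[^aa-z]`, is easy to probe from user code. The sketch below assumes the default "C" locale and uses a hypothetical helper, `ere_status`, that also exposes compile failures so patterns such as `a{,2}` can be inspected.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on match, 0 on no match, and the negated
   regcomp error code (for example -REG_BADBR) if compilation fails. */
static int ere_status(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc != 0)
        return -rc;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

`ere_status("^[^aa-z]$", "b")` should be 0, since `b` lies in the negated range `a-z`; the old parser, which reduced the set to `[^a]`, would have reported a match. The result of `ere_status("a{,2}", "aa")` is left unasserted here: the new parser described above rejects the pattern, but other regex implementations may accept it.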
  6. 12 Dec, 2013 1 commit
  7. 07 Oct, 2013 1 commit
  8. 15 Jan, 2013 1 commit
  9. 07 Sep, 2012 1 commit
    • use restrict everywhere it's required by c99 and/or posix 2008 · 400c5e5c
      Committed by Rich Felker
      to deal with the fact that the public headers may be used with pre-c99
      compilers, __restrict is used in place of restrict, and defined
      appropriately for any supported compiler. we also avoid the form
      [restrict] since older versions of gcc rejected it due to a bug in the
      original c99 standard, and instead use the form *restrict.
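The approach described in this message can be illustrated with a user-level macro. This is a sketch only: `MY_RESTRICT` and `copy_ints` are hypothetical names, chosen because a program must not define the reserved identifier `__restrict` itself, and the conditions shown are an assumption rather than a copy of musl's headers.

```c
#include <stddef.h>

/* Pick a spelling of the qualifier that the current compiler accepts. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define MY_RESTRICT restrict
#elif defined(__GNUC__)
#define MY_RESTRICT __restrict   /* GNU keyword, available before C99 */
#else
#define MY_RESTRICT              /* unknown pre-C99 compiler: drop it */
#endif

/* Pointer form (`*restrict`), as the message recommends, instead of the
   array form `int dst[restrict]`. */
static void copy_ints(int *MY_RESTRICT dst, const int *MY_RESTRICT src,
                      size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Callers must pass non-overlapping arrays; the qualifier is a promise to the compiler, not something it checks.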
  10. 14 May, 2012 2 commits
    • remove some no-op end of string tests from regex parser · 13b2945a
      Committed by Rich Felker
      these are cruft from the original code which used an explicit string
      length rather than null termination. i blindly converted all the
      checks to null terminator checks, without noticing that in several
      cases, the subsequent switch statement would automatically handle the
      null byte correctly.
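The kind of redundancy described here can be shown with a toy tokenizer. This is not the TRE parser; the enum and function below are hypothetical and only show why an explicit end-of-string test in front of such a switch does nothing.

```c
enum tok { TOK_END, TOK_STAR, TOK_LITERAL };

/* Classifies the first byte of a NUL-terminated pattern. */
static enum tok classify(const char *s)
{
    /* An `if (!*s) return TOK_END;` placed here would be a no-op:
       the switch below already gives the NUL byte its own case. */
    switch (*s) {
    case '\0':
        return TOK_END;
    case '*':
        return TOK_STAR;
    default:
        return TOK_LITERAL;
    }
}
```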
    • another BRE fix: in ^*, * is literal · e9cddc8e
      Committed by Rich Felker
      i don't understand why this has to be conditional on being in BRE
      mode, but enabling this code unconditionally breaks a huge number of
      ERE test cases.
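The BRE rule named in the commit title can be tried with a short test. This is a sketch under the assumption that the installed libc follows the POSIX rule that `*` is an ordinary character directly after a leading `^`; `bre_matches` is a hypothetical helper.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` as a BRE (no REG_EXTENDED) and
   returns 1 on match, 0 on no match, -1 if compilation fails. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With `*` treated as a literal, `bre_matches("^*a", "*a")` is 1 and `bre_matches("^*a", "a")` is 0, because the subject must begin with an actual asterisk.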
  11. 08 May, 2012 4 commits
  12. 14 Apr, 2012 1 commit
    • remove invalid code from TRE · 386b34a0
      Committed by Rich Felker
      TRE wants to treat + and ? after a +, ?, or * as special; ? means
      ungreedy and + is reserved for future use. however, this is
      non-conformant. although redundant, these redundant characters have
      well-defined (no-op) meaning for POSIX ERE, and are actually _literal_
      characters (which TRE is wrongly ignoring) in POSIX BRE mode.
      
      the simplest fix is to simply remove the unneeded nonstandard
      functionality. as a plus, this shaves off a small amount of bloat.
  13. 21 Mar, 2012 1 commit
    • upgrade to latest upstream TRE regex code (0.8.0) · ad47d45e
      Committed by Rich Felker
      the main practical results of this change are
      1. the regex code is no longer subject to LGPL; it's now 2-clause BSD
      2. most (all?) popular nonstandard regex extensions are supported
      
      I hesitate to call this a "sync" since both the old and new code are
      heavily modified. in one sense, the old code was "more severely"
      modified, in that it was actively hostile to non-strictly-conforming
      expressions. on the other hand, the new code has eliminated the
      useless translation of the entire regex string to wchar_t prior to
      compiling, and now only converts multibyte character literals as
      needed.
      
      in the future i may use this modified TRE as a basis for writing the
      long-planned new regex engine that will avoid multibyte-to-wide
      character conversion entirely by compiling multibyte bracket
      expressions specific to UTF-8.
  14. 17 Jun, 2011 1 commit
  15. 12 Feb, 2011 1 commit