  1. 28 Sep, 2022 1 commit
  2. 11 Jun, 2021 1 commit
  3. 11 Mar, 2021 1 commit
  4. 09 Sep, 2020 1 commit
  5. 22 Mar, 2017 1 commit
  6. 17 Dec, 2016 1 commit
    • handle ^ and $ in BRE subexpression start and end as anchors · 7a4c25d7
      Szabolcs Nagy committed
      In BRE, ^ is an anchor at the beginning of an expression, optionally
      it may be an anchor at the beginning of a subexpression and must be
      treated as a literal otherwise.
      
      Previously musl treated ^ in subexpressions as literal, but at least
      glibc and GNU sed treat it as an anchor and that's the more useful
      behaviour: it can always be escaped to get back the literal meaning.
      
      Same for $ at the end of a subexpression.
      
      Portable BRE should not rely on this, but there are sed commands in
      build scripts which do.
      
      This changes the meaning of the BREs:
      
      	\(^a\)
      	\(a\|^b\)
      	\(a$\)
      	\(a$\|b\)
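
The anchor interpretation described above can be checked with the POSIX `regcomp`/`regexec` interface. The following is a minimal sketch, not code from this commit; `bre_matches` is a hypothetical helper name, and the results depend on the libc in use, since the standard leaves this case optional.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of musl): returns 1 if the BRE `pattern`
 * matches somewhere in `text`, 0 if it does not, -1 if regcomp rejects it. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* cflags without REG_EXTENDED select basic regular expressions (BRE). */
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the anchor interpretation, `bre_matches("\\(^a\\)", "ab")` should return 1 and `bre_matches("\\(^a\\)", "x^a")` should return 0; a libc that treats `^` as a literal inside the subexpression gives the opposite results.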
  7. 23 May, 2016 1 commit
    • fix the use of uninitialized value in regcomp · 51eeb6eb
      Szabolcs Nagy committed
      the num_submatches field of some ast nodes was not initialized in
      tre_add_tag_{left,right}, but was accessed later.
      
      this was a benign bug since the uninitialized values were never used
      (these values are created during tre_add_tags and copied around during
      tre_expand_ast where they are also used in computations, but nothing
      in the final tnfa depends on them).
  8. 02 Mar, 2016 2 commits
    • fix ^* at the start of a complete BRE · 29b13575
      Szabolcs Nagy committed
      This is a workaround to treat * as literal * at the start of a BRE.
      
      Ideally ^ would be treated as an anchor at the start of any BRE
      subexpression and similarly $ would be an anchor at the end of any
      subexpression.  This is not required by the standard and hard to do
      with the current code, but it's the existing practice.  If it is
      changed, * should be treated as literal after such an anchor as well.
    • fix * at the start of a BRE subexpression · 39ea71fb
      Szabolcs Nagy committed
      commit 7eaa76fc made * invalid at the start of a BRE subexpression,
      but it should be accepted as literal * there according to the
      standard.
      
      This patch does not fix subexpressions starting with ^*.
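
The literal treatment of a leading `*` in a BRE, as discussed in these two commits, can be exercised as follows. This is an illustrative sketch under the assumption of a POSIX-conforming `<regex.h>`; `bre_matches` is a hypothetical helper, not a musl function.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the BRE matches somewhere in `text`,
 * 0 if it does not, -1 if the pattern fails to compile. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)  /* BRE: no REG_EXTENDED */
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

If `*` is literal in these positions, the patterns `*a`, `^*a` and `\(*a\)` all match the text `*a` but not the text `a`, which is a quick way to tell the literal reading apart from a repetition or an error.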
  9. 01 Feb, 2016 1 commit
    • regex: increase the stack tre uses for tnfa creation · 2810b30f
      Szabolcs Nagy committed
      the 10k element stack is increased to 1000k, otherwise tnfa creation
      fails for reasonably sized patterns: a single literal char can add 7
      elements to this stack, so regcomp of a 1500 char long pattern (with
      only literal chars) fails with REG_ESPACE. (the new limit allows
      somewhat fewer than 150k chars, this arbitrary limit allows most
      command line regex usage.)
      
      ideally there would be no upper bound: regcomp dynamically reallocates
      this buffer, every reallocation checks for allocation failure and at
      the end this stack is freed so there is no reason for a special bound.
      however that may have an unwanted effect on regcomp and regexec runtime
      so this is a conservative change.
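
A pattern of the size mentioned above is easy to generate for testing. The sketch below is an assumption-laden illustration rather than a test from the musl tree; `compile_long_literal` is a hypothetical name, and it only reports whether `regcomp` succeeds for a pattern of `len` literal characters.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compiles a BRE consisting of `len` copies of 'a'.
 * Returns regcomp's result (0 on success), or REG_ESPACE if the pattern
 * buffer itself cannot be allocated. */
static int compile_long_literal(size_t len)
{
    char *pattern;
    regex_t re;
    int rc;

    assert(len > 0);
    pattern = malloc(len + 1);
    if (pattern == NULL)
        return REG_ESPACE;
    memset(pattern, 'a', len);
    pattern[len] = '\0';

    rc = regcomp(&re, pattern, REG_NOSUB);
    if (rc == 0)
        regfree(&re);
    free(pattern);
    return rc;
}
```

Per the commit message, `compile_long_literal(1500)` failed with `REG_ESPACE` before this change and should return 0 after it.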
  10. 31 Jan, 2016 6 commits
    • regex: simplify the {,} repetition parsing logic · 831e9d9e
      Szabolcs Nagy committed
    • regex: treat \+, \? as repetitions in BRE · 25160f1c
      Szabolcs Nagy committed
      These are undefined escape sequences by the standard, but often
      used in sed scripts.
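
The extension can be observed from C roughly as follows. This is a sketch only: `bre_matches` is a hypothetical helper, and because `\+` and `\?` are undefined by the standard, other libcs may give different results.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 on match, 0 on no match, -1 on compile error. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)  /* BRE mode */
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

When `\+` is a repetition, `a\+b` matches `aaab` but not the literal text `a+b`; when `\?` is a repetition, `^a\?b$` matches `b` and `ab` but not `aab`.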
    • regex: rewrite the repetition parsing code · 03498ec2
      Szabolcs Nagy committed
      The goto logic was hard to follow and modify. This is
      in preparation for the BRE \+ and \? support.
    • regex: treat \| in BRE as alternation · da4cc13b
      Szabolcs Nagy committed
      The standard does not define semantics for \| in BRE, but some code
      depends on it meaning alternation. An empty alternative expression
      is allowed, to be consistent with ERE.
      
      Based on a patch by Rob Landley.
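
A short way to see `\|` acting as alternation in a BRE is sketched below. `bre_matches` is a hypothetical helper, and the behaviour shown is an extension, so it is only expected on libcs that implement it (musl after this commit, and glibc).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 on match, 0 on no match, -1 on compile error. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)  /* BRE mode */
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `^\(cat\|dog\)$` should match `cat` and `dog` but not `cow`, and not the literal text `cat|dog`.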
    • regex: reject repetitions in some cases with REG_BADRPT · 7eaa76fc
      Szabolcs Nagy committed
      Previously repetitions were accepted after empty expressions like
      in (*|?)|{2}, but in BRE the handling of * and \{\} were not
      consistent: they were accepted as literals in some cases and
      repetitions in others.
      
      It is better to treat repetitions after an empty expression as an
      error (this is allowed by the standard, and glibc mostly does the
      same). This is hard to do consistently with the current logic so
      the new rule is:
      
      Reject repetitions after empty expressions, except after assertions
      ^*, $? and empty groups ()+ and never treat them as literals.
      
      Empty alternation (|a) is undefined by the standard, but it can be
      useful so that should be accepted.
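
The rejection rule can be probed by looking at `regcomp`'s return value. The following sketch assumes ERE mode and uses a hypothetical helper, `ere_compile_status`; the exact error code is left unchecked because it may differ between libcs.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns regcomp's status for an ERE,
 * 0 on success or a REG_* error code on failure. */
static int ere_compile_status(const char *pattern)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc == 0)
        regfree(&re);  /* only a successfully compiled regex needs freeing */
    return rc;
}
```

Under the rule above, `a*` compiles, while `*a` and `a|*b` place a repetition after an empty expression and are expected to fail (with `REG_BADRPT` on musl).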
    • regex: clean up position accounting for literal nodes · a8cc2253
      Szabolcs Nagy committed
      This should not change the meaning of the code, just make the intent
      clearer: advancing position is tied to adding a new literal.
  11. 24 Sep, 2015 1 commit
  12. 28 Mar, 2015 1 commit
    • regex: fix character class repetitions · c498efe1
      Szabolcs Nagy committed
      Internally regcomp needs to copy some iteration nodes before
      translating the AST into TNFA representation.
      
      Literal nodes were not copied correctly: the class type and list
      of negated class types were not copied so classes were ignored
      (in the non-negated case an ignored char class caused the literal
      to match everything).
      
      This affects iterations when the upper bound is finite, larger
      than one or the lower bound is larger than one. So eg. the EREs
      
       [[:digit:]]{2}
       [^[:space:]ab]{1,4}
      
      were treated as
      
       .{2}
       [^ab]{1,4}
      
      The fix is done with minimal source modification to copy the
      necessary fields, but the AST preparation and node handling
      code of tre will need to be cleaned up for clarity.
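
A regression check for the two EREs quoted above might look like the sketch below. `ere_matches` is a hypothetical helper, and the patterns are anchored with `^` and `$` here so that a class silently widened to `.` or `[^ab]` produces a visibly different result.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the ERE matches somewhere in `text`,
 * 0 if it does not, -1 if the pattern fails to compile. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the fix, `^[[:digit:]]{2}$` matches `42` but not `ab`, and `^[^[:space:]ab]{1,4}$` matches `cdef` but not `cd e`; with the bug described above, `ab` and `cd e` would have matched as well.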
  13. 24 Mar, 2015 1 commit
    • do not treat \0 as a backref in BRE · 32dee9b9
      Szabolcs Nagy committed
      The valid BRE backref tokens are \1 .. \9, and 0 is not a special
      character either so \0 is undefined by the standard.
      
      Such undefined escaped characters are treated as literal characters
      currently, following existing practice, so \0 is the same as 0.
  14. 21 Mar, 2015 2 commits
    • suppress backref processing in ERE regcomp · 7c8c86f6
      Rich Felker committed
      one of the features of ERE is that it's actually a regular language
      and does not admit expressions which cannot be matched in linear time.
      introduction of \n backref support into regcomp's ERE parsing was
      unintentional.
    • fix memory-corruption in regcomp with backslash followed by high byte · 39dfd584
      Rich Felker committed
      the regex parser handles the (undefined) case of an unexpected byte
      following a backslash as a literal. however, instead of correctly
      decoding a character, it was treating the byte value itself as a
      character. this was not only semantically unjustified, but turned out
      to be dangerous on archs where plain char is signed: bytes in the
      range 252-255 alias the internal codes -4 through -1 used for special
      types of literal nodes in the AST.
  15. 13 Sep, 2014 1 commit
    • rewrite the regex pattern parser in regcomp · ec1aed0a
      Szabolcs Nagy committed
      The new code is a bit simpler and the generated code is about 1KB
      smaller (on i386). The basic design was kept including internal
      interfaces, TNFA generation was not touched.
      
      The old tre parser had various issues:
      
      [^aa-z]
      negated overlapping ranges in a bracket expression were handled
      incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z])
      
      a{,2}
      missing lower bound in a counted repetition should be an error,
      but it was accepted with broken semantics: a{,2} was treated as
      a{0,3}, the new parser rejects it
      
      a{999,}
      large min count was not rejected (a{5000,} failed with REG_ESPACE
      due to reaching a stack limit), the new parser enforces the
      RE_DUP_MAX limit
      
      \xff
      regcomp used to accept a pattern with illegal sequences in it
      (treated them as empty expression so p\xffq matched pq) the new
      parser rejects such patterns with REG_BADPAT or REG_ERANGE
      
      [^b-fD-H] with REG_ICASE
      old parser turned this into [^b-fB-F] because of the negated
      overlapping range issue (see above), the new parser treats it
      as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical
      implementations do case-folding first and negate the character
      set later instead of the other way around. (Supporting the posix
      way efficiently would require significant changes so it was left
      as is, it is unclear if any application actually expects the
      posix behaviour, this issue is raised on the austingroup tracker:
      http://austingroupbugs.net/view.php?id=872 ).
      
      another case-insensitive matching issue is that unicode case
      folding rules can group more than two characters together while
      towupper and towlower can only work for a pair of upper and
      lower case characters, this is a limitation of POSIX so it is
      not fixed.
      
      invalid bracket and brace expressions may return different error
      codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead
      of REG_EBRACE) otherwise the new parser should be compatible with
      the old one.
      
      regcomp should be able to handle arbitrary pattern input if the
      pattern length is limited, the only exception is the use of large
      repetition counts (eg. (a{255}){255}) which require an exponential
      amount of memory and there is no easy workaround.
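
The repetition-count limit mentioned above can be explored with a small probe. This is a sketch under assumptions: `compile_repeat` is a hypothetical helper, the value of `RE_DUP_MAX` differs between libcs, and so only a clearly small and a clearly oversized count are discussed.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compiles the ERE "a{count}" and returns regcomp's
 * status, 0 on success or a REG_* error code on failure. */
static int compile_repeat(unsigned long count)
{
    char pattern[64];
    regex_t re;
    int n, rc;

    n = snprintf(pattern, sizeof pattern, "a{%lu}", count);
    assert(n > 0 && (size_t)n < sizeof pattern);

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc == 0)
        regfree(&re);
    return rc;
}
```

`compile_repeat(2)` should succeed, while a count far above any common `RE_DUP_MAX`, such as `compile_repeat(1000000)`, is expected to be rejected up front instead of running into a memory or stack limit.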
  16. 12 Dec, 2013 1 commit
  17. 07 Oct, 2013 1 commit
  18. 15 Jan, 2013 1 commit
  19. 07 Sep, 2012 1 commit
    • use restrict everywhere it's required by c99 and/or posix 2008 · 400c5e5c
      Rich Felker committed
      to deal with the fact that the public headers may be used with pre-c99
      compilers, __restrict is used in place of restrict, and defined
      appropriately for any supported compiler. we also avoid the form
      [restrict] since older versions of gcc rejected it due to a bug in the
      original c99 standard, and instead use the form *restrict.
  20. 14 May, 2012 2 commits
    • remove some no-op end of string tests from regex parser · 13b2945a
      Rich Felker committed
      these are cruft from the original code which used an explicit string
      length rather than null termination. i blindly converted all the
      checks to null terminator checks, without noticing that in several
      cases, the subsequent switch statement would automatically handle the
      null byte correctly.
    • another BRE fix: in ^*, * is literal · e9cddc8e
      Rich Felker committed
      i don't understand why this has to be conditional on being in BRE
      mode, but enabling this code unconditionally breaks a huge number of
      ERE test cases.
  21. 08 May, 2012 4 commits
  22. 14 Apr, 2012 1 commit
    • remove invalid code from TRE · 386b34a0
      Rich Felker committed
      TRE wants to treat + and ? after a +, ?, or * as special; ? means
      ungreedy and + is reserved for future use. however, this is
      non-conformant. although redundant, these characters have a
      well-defined (no-op) meaning for POSIX ERE, and are actually _literal_
      characters (which TRE is wrongly ignoring) in POSIX BRE mode.
      
      the simplest fix is to simply remove the unneeded nonstandard
      functionality. as a plus, this shaves off a small amount of bloat.
  23. 21 Mar, 2012 1 commit
    • upgrade to latest upstream TRE regex code (0.8.0) · ad47d45e
      Rich Felker committed
      the main practical results of this change are
      1. the regex code is no longer subject to LGPL; it's now 2-clause BSD
      2. most (all?) popular nonstandard regex extensions are supported
      
      I hesitate to call this a "sync" since both the old and new code are
      heavily modified. in one sense, the old code was "more severely"
      modified, in that it was actively hostile to non-strictly-conforming
      expressions. on the other hand, the new code has eliminated the
      useless translation of the entire regex string to wchar_t prior to
      compiling, and now only converts multibyte character literals as
      needed.
      
      in the future i may use this modified TRE as a basis for writing the
      long-planned new regex engine that will avoid multibyte-to-wide
      character conversion entirely by compiling multibyte bracket
      expressions specific to UTF-8.
  24. 17 Jun, 2011 1 commit
  25. 12 Feb, 2011 1 commit