/*
* Copyright (c) 1999, 2010, Oracle and/or its affiliates. All rights reserved.
* DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
*
* This code is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License version 2 only, as
* published by the Free Software Foundation. Oracle designates this
* particular file as subject to the "Classpath" exception as provided
* by Oracle in the LICENSE file that accompanied this code.
*
* This code is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* version 2 for more details (a copy is included in the LICENSE file that
* accompanied this code).
*
* You should have received a copy of the GNU General Public License version
* 2 along with this work; if not, write to the Free Software Foundation,
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*
* Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
* or visit www.oracle.com if you need additional information or have any
* questions.
*/

package java.util.regex;

import java.security.AccessController;
import java.security.PrivilegedAction;
import java.text.CharacterIterator;
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Arrays;

/**
* A compiled representation of a regular expression.
*
* A regular expression, specified as a string, must first be compiled into
* an instance of this class. The resulting pattern can then be used to create
* a {@link Matcher} object that can match arbitrary {@link
* java.lang.CharSequence character sequences} against the regular expression.
*
* A typical invocation sequence is thus
*
* A {@link #matches matches} method is defined by this class as a
* convenience for when a regular expression is used just once. This method
* compiles an expression and matches an input sequence against it in a single
* invocation. The statement
*
* Instances of this class are immutable and are safe for use by multiple
* concurrent threads. Instances of the {@link Matcher} class are not safe for
* such use.
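*
* As a minimal sketch of the usage described above (the class and method
* names are hypothetical, chosen only for illustration), a pattern can be
* compiled once and shared, while every call creates its own matcher:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternReuseDemo {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // A fresh Matcher per call, because a Matcher holds per-match state.
    static boolean isMatch(CharSequence input) {
        Matcher m = A_STAR_B.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("aaaaab"));
        System.out.println(isMatch("aaaac"));
        // One-shot convenience form; it recompiles the expression on each call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```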
*
*
*
* The backslash character ('\') serves to introduce escaped
* constructs, as defined in the table above, as well as to quote characters
* that otherwise would be interpreted as unescaped constructs. Thus the
* expression \\ matches a single backslash and \{ matches a
* left brace.
*
* It is an error to use a backslash prior to any alphabetic character that
* does not denote an escaped construct; these are reserved for future
* extensions to the regular-expression language. A backslash may be used
* prior to a non-alphabetic character regardless of whether that character is
* part of an unescaped construct.
*
* Backslashes within string literals in Java source code are interpreted
* as required by
* The Java™ Language Specification
* as either Unicode escapes (section 3.3) or other character escapes (section 3.10.6).
* It is therefore necessary to double backslashes in string
* literals that represent regular expressions to protect them from
* interpretation by the Java bytecode compiler. The string literal
* "\b", for example, matches a single backspace character when
* interpreted as a regular expression, while "\\b" matches a
* word boundary. The string literal "\(hello\)" is illegal
* and leads to a compile-time error; in order to match the string
* (hello) the string literal "\\(hello\\)"
* must be used.
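*
* The doubling rule can be checked with a short sketch. The class and method
* names below are hypothetical; only the string literals come from the text
* above:

```java
import java.util.regex.Pattern;

final class BackslashDemo {
    // "\\bcat\\b" reaches the regex parser as \bcat\b (word boundaries).
    static boolean containsWordCat(String input) {
        return Pattern.compile("\\bcat\\b").matcher(input).find();
    }

    // "\b" is turned into a single backspace character by the Java compiler,
    // so the resulting pattern matches exactly that character.
    static boolean isBackspace(String input) {
        return Pattern.matches("\b", input);
    }

    // "\\(hello\\)" reaches the parser as \(hello\): literal parentheses.
    static boolean isParenthesizedHello(String input) {
        return Pattern.matches("\\(hello\\)", input);
    }

    public static void main(String[] args) {
        System.out.println(containsWordCat("a cat sat"));
        System.out.println(containsWordCat("concatenate"));
        System.out.println(isBackspace("\b"));
        System.out.println(isParenthesizedHello("(hello)"));
    }
}
```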
*
*
* Character classes may appear within other character classes, and
* may be composed by the union operator (implicit) and the intersection
* operator (&&).
* The union operator denotes a class that contains every character that is
* in at least one of its operand classes. The intersection operator
* denotes a class that contains every character that is in both of its
* operand classes.
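*
* A small sketch of union, intersection and subtraction, assuming a
* hypothetical helper class (the expressions themselves are the ones listed
* in the summary of constructs):

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // True if the whole input matches the given character-class expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union (implicit): a through d, or m through p.
        System.out.println(matches("[a-d[m-p]]", "n"));
        // Intersection: only d, e, or f.
        System.out.println(matches("[a-z&&[def]]", "e"));
        // Subtraction via intersection with a negated class.
        System.out.println(matches("[a-z&&[^bc]]", "b"));
    }
}
```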
*
* The precedence of character-class operators is as follows, from
* highest to lowest:
*
* Note that a different set of metacharacters is in effect inside
* a character class than outside a character class. For instance, the
* regular expression . loses its special meaning inside a
* character class, while the expression - becomes a range-forming
* metacharacter.
*
*
* A line terminator is a one- or two-character sequence that marks
* the end of a line of the input character sequence. The following are
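* recognized as line terminators:
*
*   - A newline (line feed) character ('\n'),
*   - A carriage-return character followed immediately by a newline
*     character ("\r\n"),
*   - A standalone carriage-return character ('\r'),
*   - A next-line character ('\u0085'),
*   - A line-separator character ('\u2028'), or
*   - A paragraph-separator character ('\u2029').
*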
* If {@link #UNIX_LINES} mode is activated, then the only line terminators
* recognized are newline characters.
*
* The regular expression . matches any character except a line
* terminator unless the {@link #DOTALL} flag is specified.
*
* By default, the regular expressions ^ and $ ignore
* line terminators and only match at the beginning and the end, respectively,
* of the entire input sequence. If {@link #MULTILINE} mode is activated then
* ^ matches at the beginning of input and after any line terminator
* except at the end of input. When in {@link #MULTILINE} mode $
* matches just before a line terminator or the end of the input sequence.
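*
* The effect of {@link #MULTILINE} and {@link #DOTALL} can be sketched as
* follows. The helper class and its method names are assumptions made for
* this illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineModeDemo {
    // Counts matches of ^\w+, i.e. how many line starts ^ recognizes.
    static int countLineStarts(String input, int flags) {
        Matcher m = Pattern.compile("^\\w+", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // True if .+ can span the entire input, line terminators included.
    static boolean dotSpansInput(String input, int flags) {
        return Pattern.compile(".+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        System.out.println(countLineStarts(text, 0));
        System.out.println(countLineStarts(text, Pattern.MULTILINE));
        System.out.println(dotSpansInput("a\nb", 0));
        System.out.println(dotSpansInput("a\nb", Pattern.DOTALL));
    }
}
```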
*
*
* Capturing groups are numbered by counting their opening parentheses from
* left to right. In the expression ((A)(B(C))), for example, there
* are four such groups:
*
* Group zero always stands for the entire expression.
*
* Capturing groups are so named because, during a match, each subsequence
* of the input sequence that matches such a group is saved. The captured
* subsequence may be used later in the expression, via a back reference, and
* may also be retrieved from the matcher once the match operation is complete.
*
*
* A capturing group can also be assigned a "name", a named-capturing group,
* and then be back-referenced later by the "name". Group names are composed of
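* the following characters. The first character must be a letter.
*
*   - The uppercase letters 'A' through 'Z'
*   - The lowercase letters 'a' through 'z'
*   - The digits '0' through '9'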
*
* A named-capturing group is still numbered as described in
* Group number.
*
* The captured input associated with a group is always the subsequence
* that the group most recently matched. If a group is evaluated a second time
* because of quantification then its previously-captured value, if any, will
* be retained if the second evaluation fails. Matching the string
* "aba" against the expression (a(b)?)+, for example, leaves
* group two set to "b". All captured input is discarded at the
* beginning of each match.
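*
* Group numbering, named groups and the retention rule for quantified groups
* can be observed with a sketch like the following. The class, the method
* names and the date expression are hypothetical; the other two expressions
* are the ones used in the text above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupCaptureDemo {
    // Returns group 0..groupCount() after a full match, or null if no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] captured = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            captured[i] = m.group(i);
        }
        return captured;
    }

    // Returns the text captured by a named group, or null if no match.
    static String named(String regex, String input, String name) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(name) : null;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(String.join(",", groups("((A)(B(C)))", "ABC")));
        // Group 2 keeps "b" although the second iteration did not match it.
        System.out.println(String.join(",", groups("(a(b)?)+", "aba")));
        System.out.println(named("(?<year>\\d{4})-(?<month>\\d{2})", "2010-07", "month"));
    }
}
```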
*
* Groups beginning with (? are either pure, non-capturing groups
* that do not capture text and do not count towards the group total, or
* named-capturing groups.
*
* This class is in conformance with Level 1 of Unicode Technical
* Standard #18: Unicode Regular Expression Guidelines, plus RL2.1
* Canonical Equivalents.
*
* Unicode escape sequences such as \u2014 in Java source code
* are processed as described in section 3.3 of
* The Java™ Language Specification.
* Such escape sequences are also
* implemented directly by the regular-expression parser so that Unicode
* escapes can be used in expressions that are read from files or from the
* keyboard. Thus the strings "\u2014" and "\\u2014",
* while not equal, compile into the same pattern, which matches the character
* with hexadecimal value 0x2014.
*
* A Unicode character can also be represented in a regular-expression by
* using its hexadecimal code point value directly as described in construct
* \x{...}, for example a supplementary character U+2011F
* can be specified as \x{2011F}, instead of two consecutive
* Unicode escape sequences of the surrogate pair
* \uD840\uDD1F.
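*
* A brief sketch of these equivalences, using a hypothetical demo class. It
* assumes a runtime that supports the hexadecimal code point construct
* described above:

```java
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {
    // The first literal is decoded by the Java compiler, the second by the
    // regular-expression parser; both patterns match U+2014.
    static boolean bothFormsMatchEmDash() {
        String emDash = "\u2014";
        return Pattern.matches("\u2014", emDash)
            && Pattern.matches("\\u2014", emDash);
    }

    // A supplementary character written as a code point and as the two
    // Unicode escapes of its surrogate pair.
    static boolean bothFormsMatchSupplementary() {
        String ch = new String(Character.toChars(0x2011F));
        return Pattern.matches("\\x{2011F}", ch)
            && Pattern.matches("\\uD840\\uDD1F", ch);
    }

    public static void main(String[] args) {
        System.out.println(bothFormsMatchEmDash());
        System.out.println(bothFormsMatchSupplementary());
    }
}
```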
*
*
* Unicode scripts, blocks and categories are written with the \p and
* \P constructs as in Perl. \p{prop} matches if
* the input has the property prop, while \P{prop}
* does not match if the input has that property.
*
* Scripts are specified either with the prefix {@code Is}, as in
* {@code IsHiragana}, or by using the {@code script} keyword (or its short
* form {@code sc}) as in {@code script=Hiragana} or {@code sc=Hiragana}.
*
* Blocks are specified with the prefix {@code In}, as in
* {@code InMongolian}, or by using the keyword {@code block} (or its short
* form {@code blk}) as in {@code block=Mongolian} or {@code blk=Mongolian}.
*
* Categories may be specified with the optional prefix {@code Is}:
* Both {@code \p{L}} and {@code \p{IsL}} denote the category of Unicode
* letters. As with scripts and blocks, categories can also be specified
* by using the keyword {@code general_category} (or its short form
* {@code gc}) as in {@code general_category=Lu} or {@code gc=Lu}.
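*
* The different spellings can be compared with a sketch such as this one.
* The helper class is hypothetical, and the sample code points (U+3042 for
* Hiragana, U+1820 for the Mongolian block) are chosen for illustration:

```java
import java.util.regex.Pattern;

final class UnicodePropertyDemo {
    // True if the single code point matches the given property expression.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        // Script: prefix form and keyword forms.
        System.out.println(matches("\\p{IsHiragana}", 0x3042));
        System.out.println(matches("\\p{sc=Hiragana}", 0x3042));
        // Block: prefix form and keyword form.
        System.out.println(matches("\\p{InMongolian}", 0x1820));
        System.out.println(matches("\\p{blk=Mongolian}", 0x1820));
        // Category: bare form, keyword form, and negation with \P.
        System.out.println(matches("\\p{Lu}", 'A'));
        System.out.println(matches("\\p{gc=Lu}", 'A'));
        System.out.println(matches("\\P{Lu}", 'a'));
    }
}
```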
*
* Scripts, blocks and categories can be used both inside and outside of a
* character class.
* The supported categories are those of
*
* The Unicode Standard in the version specified by the
* {@link java.lang.Character Character} class. The category names are those
* defined in the Standard, both normative and informative.
* The script names supported by
* Categories that behave like the java.lang.Character
* boolean ismethodname methods (except for the deprecated ones) are
* available through the same \p{prop} syntax where
* the specified property has the name javamethodname.
*
* The Perl constructs not supported by this class:
*
*   - The conditional constructs (?(condition)X) and
*     (?(condition)X|Y),
*   - The embedded code constructs (?{code})
*     and (??{code}),
*   - The embedded comment syntax (?#comment), and
*   - The preprocessing operations \l \u,
*     \L, and \U.
*
* Constructs supported by this class but not by Perl:
*
*   - Possessive quantifiers, which greedily match as much as they can
*     and do not back off, even when doing so would allow the overall match to
*     succeed.
*   - Character-class union and intersection as described
*     above.
*
* Notable differences from Perl:
*
*   - In Perl, \1 through \9 are always interpreted
*     as back references; a backslash-escaped number greater than 9 is
*     treated as a back reference if at least that many subexpressions exist,
*     otherwise it is interpreted, if possible, as an octal escape. In this
*     class octal escapes must always begin with a zero. In this class,
*     \1 through \9 are always interpreted as back
*     references, and a larger number is accepted as a back reference if at
*     least that many subexpressions exist at that point in the regular
*     expression, otherwise the parser will drop digits until the number is
*     smaller than or equal to the existing number of groups or it is one digit.
*   - Perl uses the g flag to request a match that resumes
*     where the last match left off. This functionality is provided implicitly
*     by the {@link Matcher} class: Repeated invocations of the {@link
*     Matcher#find find} method will resume where the last match left off,
*     unless the matcher is reset.
*   - In Perl, embedded flags at the top level of an expression affect
*     the whole expression. In this class, embedded flags always take effect
*     at the point at which they appear, whether they are at the top level or
*     within a group; in the latter case, flags are restored at the end of the
*     group just as in Perl.
*   - Perl is forgiving about malformed matching constructs, as in the
*     expression *a, as well as dangling brackets, as in the
*     expression abc], and treats them as literals. This
*     class also accepts dangling brackets but is strict about dangling
*     metacharacters like +, ? and *, and will throw a
*     {@link PatternSyntaxException} if it encounters them.
*
* For a more precise description of the behavior of regular expression
* constructs, please see
* Mastering Regular Expressions, 3rd Edition, Jeffrey E. F. Friedl,
* O'Reilly and Associates, 2006.
*/
public final class Pattern
    implements java.io.Serializable
{
/**
* In this mode, only the '\n' line terminator is recognized
* in the behavior of ., ^, and $.
*
* Unix lines mode can also be enabled via the embedded flag
* expression (?d).
*/
public static final int UNIX_LINES = 0x01;
/**
* Enables case-insensitive matching.
*
* By default, case-insensitive matching assumes that only characters
* in the US-ASCII charset are being matched. Unicode-aware
* case-insensitive matching can be enabled by specifying the {@link
* #UNICODE_CASE} flag in conjunction with this flag.
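*
* The difference between the two flags can be sketched as follows. The demo
* class is hypothetical, and the accented letters are written as Unicode
* escapes for U+00E9 and U+00C9:

```java
import java.util.regex.Pattern;

final class CaseFlagDemo {
    // True if the whole input matches the expression under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // US-ASCII letters fold with CASE_INSENSITIVE alone.
        System.out.println(matches("abc", "ABC", Pattern.CASE_INSENSITIVE));
        // A non-ASCII letter is not folded without UNICODE_CASE ...
        System.out.println(matches("\u00E9", "\u00C9", Pattern.CASE_INSENSITIVE));
        // ... but is folded when both flags are combined.
        System.out.println(matches("\u00E9", "\u00C9",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```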
*
* Case-insensitive matching can also be enabled via the embedded flag
* expression (?i).
*
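* Specifying this flag may impose a slight performance penalty.
*/
public static final int CASE_INSENSITIVE = 0x02;
/**
* Permits whitespace and comments in pattern.
*
* In this mode, whitespace is ignored, and embedded comments starting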
* with # are ignored until the end of a line.
*
* Comments mode can also be enabled via the embedded flag
* expression (?x).
*/
public static final int COMMENTS = 0x04;
/**
* Enables multiline mode.
*
* In multiline mode the expressions ^ and $ match
* just after or just before, respectively, a line terminator or the end of
* the input sequence. By default these expressions only match at the
* beginning and the end of the entire input sequence.
*
* Multiline mode can also be enabled via the embedded flag
* expression (?m).
*/
public static final int MULTILINE = 0x08;
/**
* Enables literal parsing of the pattern.
*
* When this flag is specified then the input string that specifies
* the pattern is treated as a sequence of literal characters.
* Metacharacters or escape sequences in the input sequence will be
* given no special meaning.
*
* The flags CASE_INSENSITIVE and UNICODE_CASE retain their impact on
* matching when used in conjunction with this flag. The other flags
* become superfluous.
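*
* A short sketch of literal parsing, with a hypothetical demo class and a
* sample expression chosen for illustration:

```java
import java.util.regex.Pattern;

final class LiteralFlagDemo {
    // True if the whole input matches the expression under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without LITERAL the dot is a metacharacter and matches any character.
        System.out.println(matches("a.b", "axb", 0));
        // With LITERAL the dot only matches itself.
        System.out.println(matches("a.b", "axb", Pattern.LITERAL));
        System.out.println(matches("a.b", "a.b", Pattern.LITERAL));
        // CASE_INSENSITIVE keeps its effect in combination with LITERAL.
        System.out.println(matches("a.b", "A.B",
            Pattern.LITERAL | Pattern.CASE_INSENSITIVE));
    }
}
```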
*
* There is no embedded flag character for enabling literal parsing.
* @since 1.5
*/
public static final int LITERAL = 0x10;
/**
* Enables dotall mode.
*
* In dotall mode, the expression . matches any character,
* including a line terminator. By default this expression does not match
* line terminators.
*
* Dotall mode can also be enabled via the embedded flag
* expression (?s). (The s is a mnemonic for
* "single-line" mode, which is what this is called in Perl.)
*/
public static final int DOTALL = 0x20;
/**
* Enables Unicode-aware case folding.
*
* When this flag is specified then case-insensitive matching, when
* enabled by the {@link #CASE_INSENSITIVE} flag, is done in a manner
* consistent with the Unicode Standard. By default, case-insensitive
* matching assumes that only characters in the US-ASCII charset are being
* matched.
*
* Unicode-aware case folding can also be enabled via the embedded flag
* expression (?u).
*
* Specifying this flag may impose a performance penalty.
*/
public static final int UNICODE_CASE = 0x40;
/**
* Enables canonical equivalence.
*
* When this flag is specified then two characters will be considered
* to match if, and only if, their full canonical decompositions match.
* The expression "a\u030A", for example, will match the
* string "\u00E5" when this flag is specified. By default,
* matching does not take canonical equivalence into account.
*
* There is no embedded flag character for enabling canonical
* equivalence.
*
* Specifying this flag may impose a performance penalty.
*/
public static final int CANON_EQ = 0x80;
/**
* Returns the string representation of this pattern. This
* is the regular expression from which this pattern was
* compiled.
*
* An invocation of this convenience method of the form
*
* If a pattern is to be used multiple times, compiling it once and reusing
* it will be more efficient than invoking this method each time.
*
* The array returned by this method contains each substring of the
* input sequence that is terminated by another subsequence that matches
* this pattern or is terminated by the end of the input sequence. The
* substrings in the array are in the order in which they occur in the
* input. If this pattern does not match any subsequence of the input then
* the resulting array has just one element, namely the input sequence in
* string form.
*
* The limit parameter controls the number of times the
* pattern is applied and therefore affects the length of the resulting
* array. If the limit n is greater than zero then the pattern
* will be applied at most n - 1 times, the array's
* length will be no greater than n, and the array's last entry
* will contain all input beyond the last matched delimiter. If n
* is non-positive then the pattern will be applied as many times as
* possible and the array can have any length. If n is zero then
* the pattern will be applied as many times as possible, the array can
* have any length, and trailing empty strings will be discarded.
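*
* The three cases for the limit can be reproduced with a sketch like this
* one. The wrapper class is hypothetical; the input and expressions are the
* "boo:and:foo" examples of this comment:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitLimitDemo {
    // Thin wrapper so that each case below reads as one line.
    static String[] split(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        // Positive limit: at most limit - 1 splits, remainder in last entry.
        System.out.println(Arrays.toString(split(":", input, 2)));
        // Negative limit: split everywhere, keep trailing empty strings.
        System.out.println(Arrays.toString(split("o", input, -2)));
        // Zero limit: split everywhere, discard trailing empty strings.
        System.out.println(Arrays.toString(split("o", input, 0)));
    }
}
```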
*
* The input "boo:and:foo", for example, yields the following
* results with these parameters:
*
* Regex Limit Result
*
* This method works as if by invoking the two-argument {@link
* #split(java.lang.CharSequence, int) split} method with the given input
* sequence and a limit argument of zero. Trailing empty strings are
* therefore not included in the resulting array.
*
* The input "boo:and:foo", for example, yields the following
* results with these expressions:
*
* Regex Result
*
* This method produces a
*
* The pattern is compared to the input one character at a time, from
* the rightmost character in the pattern to the left. If the characters
* all match the pattern has been found. If a character does not match,
* the pattern is shifted right a distance that is the maximum of two
* functions, the bad character shift and the good suffix shift. This
* shift moves the attempted match position through the input more
* quickly than a naive one position at a time check.
*
* The bad character shift is based on the character from the text that
* did not match. If the character does not appear in the pattern, the
* pattern can be shifted completely beyond the bad character. If the
* character does occur in the pattern, the pattern can be shifted to
* line the pattern up with the next occurrence of that character.
*
* The good suffix shift is based on the idea that some subset on the right
* side of the pattern has matched. When a bad character is found, the
* pattern can be shifted right by the pattern length if the subset does
* not occur again in the pattern, or by the distance to the
* next occurrence of the subset in the pattern.
*
* Boyer-Moore search methods adapted from code by Amy Yu.
*/
static class BnM extends Node {
    int[] buffer;
    int[] lastOcc;
    int[] optoSft;

    /**
     * Precalculates arrays needed to generate the bad character
     * shift and the good suffix shift. Only the last seven bits
     * are used to see if chars match; this keeps the tables small
     * and covers the heavily used ASCII range, but occasionally
     * results in an aliased match for the bad character shift.
     */
    static Node optimize(Node node) {
        if (!(node instanceof Slice)) {
            return node;
        }

        int[] src = ((Slice) node).buffer;
        int patternLength = src.length;
        // The BM algorithm requires a bit of overhead;
        // if the pattern is short don't use it, since
        // a shift larger than the pattern length cannot
        // be used anyway.
        if (patternLength < 4) {
            return node;
        }
        int i, j, k;
        int[] lastOcc = new int[128];
        int[] optoSft = new int[patternLength];
        // Precalculate part of the bad character shift.
        // It is a table for where in the pattern each
        // lower 7-bit value occurs.
        for (i = 0; i < patternLength; i++) {
            lastOcc[src[i]&0x7F] = i + 1;
        }
        // Precalculate the good suffix shift.
        // i is the shift amount being considered.
NEXT:   for (i = patternLength; i > 0; i--) {
            // j is the beginning index of the suffix being considered
            for (j = patternLength - 1; j >= i; j--) {
                // Testing for good suffix
                if (src[j] == src[j-i]) {
                    // src[j..len] is a good suffix
                    optoSft[j-1] = i;
                } else {
                    // No match. The array has already been
                    // filled up with correct values before.
                    continue NEXT;
                }
            }
            // This fills up the remainder of optoSft:
            // no suffix can have a larger shift amount
            // than its sub-suffix. Why???
            while (j > 0) {
                optoSft[--j] = i;
            }
        }
        // Set the guard value because of unicode compression
        optoSft[patternLength-1] = 1;
        if (node instanceof SliceS)
            return new BnMS(src, lastOcc, optoSft, node.next);
        return new BnM(src, lastOcc, optoSft, node.next);
    }

    BnM(int[] src, int[] lastOcc, int[] optoSft, Node next) {
        this.buffer = src;
        this.lastOcc = lastOcc;
        this.optoSft = optoSft;
        this.next = next;
    }

    boolean match(Matcher matcher, int i, CharSequence seq) {
        int[] src = buffer;
        int patternLength = src.length;
        int last = matcher.to - patternLength;

        // Loop over all possible match positions in text
NEXT:   while (i <= last) {
            // Loop over pattern from right to left
            for (int j = patternLength - 1; j >= 0; j--) {
                int ch = seq.charAt(i+j);
                if (ch != src[j]) {
                    // Shift search to the right by the maximum of the
                    // bad character shift and the good suffix shift
                    i += Math.max(j + 1 - lastOcc[ch&0x7F], optoSft[j]);
                    continue NEXT;
                }
            }
            // Entire pattern matched starting at i
            matcher.first = i;
            boolean ret = next.match(matcher, i + patternLength, seq);
            if (ret) {
                matcher.first = i;
                matcher.groups[0] = matcher.first;
                matcher.groups[1] = matcher.last;
                return true;
            }
            i++;
        }
        // BnM is only used as the leading node in the unanchored case,
        // and it replaced its Start() which always searches to the end
        // if it doesn't find what it's looking for, so hitEnd is true.
        matcher.hitEnd = true;
        return false;
    }

    boolean study(TreeInfo info) {
        info.minLength += buffer.length;
        info.maxValid = false;
        return next.study(info);
    }
}
/**
* Supplementary support version of BnM(). Unpaired surrogates are
* also handled by this class.
*/
static final class BnMS extends BnM {
    int lengthInChars;

    BnMS(int[] src, int[] lastOcc, int[] optoSft, Node next) {
        super(src, lastOcc, optoSft, next);
        for (int x = 0; x < buffer.length; x++) {
            lengthInChars += Character.charCount(buffer[x]);
        }
    }

    boolean match(Matcher matcher, int i, CharSequence seq) {
        int[] src = buffer;
        int patternLength = src.length;
        int last = matcher.to - lengthInChars;

        // Loop over all possible match positions in text
NEXT:   while (i <= last) {
            // Loop over pattern from right to left
            int ch;
            for (int j = countChars(seq, i, patternLength), x = patternLength - 1;
                 j > 0; j -= Character.charCount(ch), x--) {
                ch = Character.codePointBefore(seq, i+j);
                if (ch != src[x]) {
                    // Shift search to the right by the maximum of the
                    // bad character shift and the good suffix shift
                    int n = Math.max(x + 1 - lastOcc[ch&0x7F], optoSft[x]);
                    i += countChars(seq, i, n);
                    continue NEXT;
                }
            }
            // Entire pattern matched starting at i
            matcher.first = i;
            boolean ret = next.match(matcher, i + lengthInChars, seq);
            if (ret) {
                matcher.first = i;
                matcher.groups[0] = matcher.first;
                matcher.groups[1] = matcher.last;
                return true;
            }
            i += countChars(seq, i, 1);
        }
        matcher.hitEnd = true;
        return false;
    }
}
///////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////
/**
* This must be the very first initializer.
*/
static Node accept = new Node();
static Node lastAccept = new LastNode();
private static class CharPropertyNames {

    static CharProperty charPropertyFor(String name) {
        CharPropertyFactory m = map.get(name);
        return m == null ? null : m.make();
    }

    private static abstract class CharPropertyFactory {
        abstract CharProperty make();
    }

    private static void defCategory(String name,
                                    final int typeMask) {
        map.put(name, new CharPropertyFactory() {
                CharProperty make() { return new Category(typeMask);}});
    }

    private static void defRange(String name,
                                 final int lower, final int upper) {
        map.put(name, new CharPropertyFactory() {
                CharProperty make() { return rangeFor(lower, upper);}});
    }

    private static void defCtype(String name,
                                 final int ctype) {
        map.put(name, new CharPropertyFactory() {
                CharProperty make() { return new Ctype(ctype);}});
    }

    private static abstract class CloneableProperty
        extends CharProperty implements Cloneable
    {
        public CloneableProperty clone() {
            try {
                return (CloneableProperty) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    private static void defClone(String name,
                                 final CloneableProperty p) {
        map.put(name, new CharPropertyFactory() {
                CharProperty make() { return p.clone();}});
    }
private static final HashMap} against the regular
* expression. All of the state involved in performing a match resides in the
* matcher, so many matchers can share the same pattern.
*
*
*
*
* Pattern p = Pattern.{@link #compile compile}("a*b");
* Matcher m = p.{@link #matcher matcher}("aaaaab");
* boolean b = m.{@link Matcher#matches matches}();
*
* is equivalent to the three statements above, though for repeated matches it
* is less efficient since it does not allow the compiled pattern to be reused.
*
*
* boolean b = Pattern.matches("a*b", "aaaaab");
* Summary of regular-expression constructs
*
* Construct             Matches
*
* Characters
* x                     The character x
* \\                    The backslash character
* \0n                   The character with octal value 0n (0 <= n <= 7)
* \0nn                  The character with octal value 0nn (0 <= n <= 7)
* \0mnn                 The character with octal value 0mnn
*                       (0 <= m <= 3, 0 <= n <= 7)
* \xhh                  The character with hexadecimal value 0xhh
* \uhhhh                The character with hexadecimal value 0xhhhh
* \x{h...h}             The character with hexadecimal value 0xh...h
*                       ({@link java.lang.Character#MIN_CODE_POINT Character.MIN_CODE_POINT}
*                       <= 0xh...h <=
*                       {@link java.lang.Character#MAX_CODE_POINT Character.MAX_CODE_POINT})
* \t                    The tab character ('\u0009')
* \n                    The newline (line feed) character ('\u000A')
* \r                    The carriage-return character ('\u000D')
* \f                    The form-feed character ('\u000C')
* \a                    The alert (bell) character ('\u0007')
* \e                    The escape character ('\u001B')
* \cx                   The control character corresponding to x
*
* Character classes
* [abc]                 a, b, or c (simple class)
* [^abc]                Any character except a, b, or c (negation)
* [a-zA-Z]              a through z or A through Z, inclusive (range)
* [a-d[m-p]]            a through d, or m through p: [a-dm-p] (union)
* [a-z&&[def]]          d, e, or f (intersection)
* [a-z&&[^bc]]          a through z, except for b and c: [ad-z] (subtraction)
* [a-z&&[^m-p]]         a through z, and not m through p: [a-lq-z] (subtraction)
*
* Predefined character classes
* .                     Any character (may or may not match line terminators)
* \d                    A digit: [0-9]
* \D                    A non-digit: [^0-9]
* \s                    A whitespace character: [ \t\n\x0B\f\r]
* \S                    A non-whitespace character: [^\s]
* \w                    A word character: [a-zA-Z_0-9]
* \W                    A non-word character: [^\w]
*
* POSIX character classes (US-ASCII only)
* \p{Lower}             A lower-case alphabetic character: [a-z]
* \p{Upper}             An upper-case alphabetic character: [A-Z]
* \p{ASCII}             All ASCII: [\x00-\x7F]
* \p{Alpha}             An alphabetic character: [\p{Lower}\p{Upper}]
* \p{Digit}             A decimal digit: [0-9]
* \p{Alnum}             An alphanumeric character: [\p{Alpha}\p{Digit}]
* \p{Punct}             Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
* \p{Graph}             A visible character: [\p{Alnum}\p{Punct}]
* \p{Print}             A printable character: [\p{Graph}\x20]
* \p{Blank}             A space or a tab: [ \t]
* \p{Cntrl}             A control character: [\x00-\x1F\x7F]
* \p{XDigit}            A hexadecimal digit: [0-9a-fA-F]
* \p{Space}             A whitespace character: [ \t\n\x0B\f\r]
*
* java.lang.Character classes (simple java character type)
* \p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
* \p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
* \p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
* \p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()
*
* Classes for Unicode scripts, blocks and categories
* \p{IsLatin}           A Latin script character (simple script)
* \p{InGreek}           A character in the Greek block (simple block)
* \p{Lu}                An uppercase letter (simple category)
* \p{Sc}                A currency symbol
* \P{InGreek}           Any character except one in the Greek block (negation)
* [\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
*
* Boundary matchers
* ^                     The beginning of a line
* $                     The end of a line
* \b                    A word boundary
* \B                    A non-word boundary
* \A                    The beginning of the input
* \G                    The end of the previous match
* \Z                    The end of the input but for the final terminator, if any
* \z                    The end of the input
*
* Greedy quantifiers
* X?                    X, once or not at all
* X*                    X, zero or more times
* X+                    X, one or more times
* X{n}                  X, exactly n times
* X{n,}                 X, at least n times
* X{n,m}                X, at least n but not more than m times
*
* Reluctant quantifiers
* X??                   X, once or not at all
* X*?                   X, zero or more times
* X+?                   X, one or more times
* X{n}?                 X, exactly n times
* X{n,}?                X, at least n times
* X{n,m}?               X, at least n but not more than m times
*
* Possessive quantifiers
* X?+                   X, once or not at all
* X*+                   X, zero or more times
* X++                   X, one or more times
* X{n}+                 X, exactly n times
* X{n,}+                X, at least n times
* X{n,m}+               X, at least n but not more than m times
*
* Logical operators
* XY                    X followed by Y
* X|Y                   Either X or Y
* (X)                   X, as a capturing group
*
* Back references
* \n                    Whatever the nth capturing group matched
* \k<name>              Whatever the named-capturing group "name" matched
*
* Quotation
* \                     Nothing, but quotes the following character
* \Q                    Nothing, but quotes all characters until \E
* \E                    Nothing, but ends quoting started by \Q
*
* Special constructs (named-capturing and non-capturing)
* (?<name>X)            X, as a named-capturing group
* (?:X)                 X, as a non-capturing group
* (?idmsux-idmsux)      Nothing, but turns match flags i d m s u x on - off
* (?idmsux-idmsux:X)    X, as a non-capturing group with the
*                       given flags i d m s u x on - off
* (?=X)                 X, via zero-width positive lookahead
* (?!X)                 X, via zero-width negative lookahead
* (?<=X)                X, via zero-width positive lookbehind
* (?<!X)                X, via zero-width negative lookbehind
* (?>X)                 X, as an independent, non-capturing group
*
* Backslashes, escapes, and quoting
*
* Character Classes
*
* 1    Literal escape    \x
* 2    Grouping          [...]
* 3    Range             a-z
* 4    Union             [a-e][i-u]
* 5    Intersection      [a-z&&[aeiou]]
*
* Line terminators
*
* Groups and capturing
*
* Group number
*
* 1    ((A)(B(C)))
* 2    (A)
* 3    (B(C))
* 4    (C)
*
* Group name
*
* Unicode support
*
* The script names supported by Pattern are the valid script names
* accepted and defined by
* {@link java.lang.Character.UnicodeScript#forName(String) UnicodeScript.forName}.
* The block names supported by Pattern are the valid block names
* accepted and defined by
* {@link java.lang.Character.UnicodeBlock#forName(String) UnicodeBlock.forName}.
*
* Comparison to Perl 5
*
* The Pattern engine performs traditional NFA-based matching
* with ordered alternation as occurs in Perl 5.
*
* RegExp r1 = RegExp.compile("abc", Pattern.I|Pattern.M);
* RegExp r2 = RegExp.compile("(?im)abc", 0);
*
*
* The flags are duplicated so that the familiar Perl match flag
* names are available.
*/
/**
* Enables Unix lines mode.
*
*
*
* behaves in exactly the same way as the expression
*
*
* Pattern.matches(regex, input);
*
*
* Pattern.compile(regex).matcher(input).matches()
*
*
* @param input
* The character sequence to be split
*
* @param limit
* The result threshold, as described above
*
* @return The array of strings computed by splitting the input
* around matches of this pattern
*/
public String[] split(CharSequence input, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList
*
* Regex  Limit  Result
* :      2      { "boo", "and:foo" }
* :      5      { "boo", "and", "foo" }
* :      -2     { "boo", "and", "foo" }
* o      5      { "b", "", ":and:f", "", "" }
* o      -2     { "b", "", ":and:f", "", "" }
* o      0      { "b", "", ":and:f" }
*
* @param input
* The character sequence to be split
*
* @return The array of strings computed by splitting the input
* around matches of this pattern
*/
public String[] split(CharSequence input) {
return split(input, 0);
}
/**
* Regex  Result
* :      { "boo", "and", "foo" }
* o      { "b", "", ":and:f" }
*
* Returns a literal pattern String for the specified String.
*
* The returned String can be used to create a Pattern that would match
* the string s as if it were a literal pattern.
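*
* A minimal sketch of quoting, assuming a hypothetical demo class and a
* sample string that happens to contain a metacharacter:

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // Matches the input against the given text taken literally.
    static boolean literalMatch(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Quoted: the plus sign only matches itself.
        System.out.println(literalMatch("1+1=2", "1+1=2"));
        System.out.println(literalMatch("1+1=2", "11=2"));
        // Unquoted: 1+ means "one or more 1", so "11=2" matches.
        System.out.println(Pattern.matches("1+1=2", "11=2"));
    }
}
```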