Some regular-expression performance hacking
As I mentioned in connection with adding the src/test/modules/test_regex
test code, I've been fooling with some performance improvements to our
regular expression engine. Here's the first fruits of that labor.
This is mostly concerned with cutting the overhead for handling trivial
unconstrained patterns like ".*".
0001 creates the concept of a "rainbow" arc within regex NFAs. You can
read background info about this in the "Colors and colormapping" part of
regex/README, but the basic point is that right now, representing a dot
(".", match anything) within an NFA requires a separate arc for each
"color" (character equivalence class) that the regex needs. This uses
up a fair amount of storage and processing effort, especially in larger
regexes which tend to have a lot of colors. We can replace such a
"rainbow" of arcs with a single arc labeled with a special color
RAINBOW. This is worth doing on its own account, just because it saves
space and time. For example, on the reg-33.15.1 test case in
test_regex.sql (a moderately large real-world RE), I find that HEAD
requires 1377614 bytes to represent the compiled RE, and the peak space
usage during pg_regcomp() is 3124376 bytes. With this patch, that drops
to 1077166 bytes for the RE (21% savings) with peak compilation space
2800752 bytes (10% savings). Moreover, the runtime for that test case
drops from ~57ms to ~44ms, a 22% savings. (This is mostly measuring the
RE compilation time. Execution time should drop a bit too since miss()
need consider fewer arcs; but that savings is in a cold code path so it
won't matter much.) These aren't earth-shattering numbers of course,
but for the amount of code needed, it seems well worth while.
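The arc-count saving can be illustrated with a toy sketch (Python, not the engine's actual C data structures; the arc layout and the RAINBOW sentinel value are assumptions for illustration only):

```python
from collections import defaultdict

# Hypothetical colors: the real engine assigns small integers per
# character-equivalence class; RAINBOW here is an assumed sentinel.
RAINBOW = -1

def collapse_rainbow(arcs, ncolors):
    """arcs: list of (color, target) out-arcs of one NFA state.
    If some target is reachable under every color -- i.e. the arcs
    form a full 'rainbow' -- replace that fan of arcs with a single
    RAINBOW arc, keeping all other arcs unchanged."""
    colors_by_target = defaultdict(set)
    for color, target in arcs:
        colors_by_target[target].add(color)
    out, collapsed = [], set()
    for color, target in arcs:
        if colors_by_target[target] == set(range(ncolors)):
            if target not in collapsed:   # emit the RAINBOW arc once
                out.append((RAINBOW, target))
                collapsed.add(target)
        else:
            out.append((color, target))
    return out
```

With 3 colors, a state whose out-arcs are [(0, s), (1, s), (2, s), (0, t)] collapses to [(RAINBOW, s), (0, t)]: each "." costs one arc instead of one arc per color.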
A possible point of contention is that I exposed the idea of a rainbow
arc in the regexport.h APIs, which will force consumers of that API
to adapt --- see the changes to contrib/pg_trgm for an example. I'm
not too concerned about this because I kinda suspect that pg_trgm is
the only consumer of that API anywhere. (codesearch.debian.net knows
of no others, anyway.) We could in principle hide the change by
having the regexport functions expand a rainbow arc into one for
each color, but that seems like make-work. pg_trgm would certainly
not see it as an improvement, and in general users of that API should
appreciate recognizing rainbows as such, since they might be able to
apply optimizations that depend on doing so.
Which brings us to 0002, which is exactly such an optimization.
The idea here is to short-circuit character-by-character scanning
when matching a sub-NFA that is like "." or ".*" or variants of
that, ie it will match any sequence of some number of characters.
This requires the ability to recognize that a particular pair of
NFA states are linked by a rainbow, so it's a lot less painful
to do when rainbows are represented explicitly. The example that
got me interested in this is adapted from a Tcl trouble report:
select array_dims(regexp_matches(repeat('x',40) || '=' || repeat('y',50000),
'^(.*)=(.*)$'));
On my machine, this takes about 6 seconds in HEAD, because there's an
O(N^2) effect: we try to match the sub-NFA for the first "(.*)" capture
group to each possible starting string, and only after expensively
verifying that tautological match do we check to see if the next
character is "=". By not having to do any per-character work to decide
that .* matches a substring, the O(N^2) behavior is removed and the time
drops to about 7 msec.
(One could also imagine fixing this by rearranging things to check for
the "=" match before verifying the capture-group matches. That's an
idea I hope to look into in future, because it could help for cases
where the variable parts are not merely ".*". But I don't have clear
ideas about how to do that, and in any case ".*" is common enough that
the present change should still be helpful.)
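The quadratic behavior can be seen in a toy cost model (Python; a sketch of the matching strategy described above, not the real engine):

```python
def naive_ops(s):
    """Cost model for HEAD's behavior on '^(.*)=(.*)$': greedily try
    the longest candidate for the first '(.*)', verifying each
    candidate prefix character by character before testing for '='."""
    ops = 0
    for i in range(len(s), -1, -1):   # longest candidate first
        ops += i                      # re-verify '.*' against s[:i]
        ops += 1                      # is the next character '='?
        if i < len(s) and s[i] == '=':
            break
    return ops

def shortcircuit_ops(s):
    """Same search, but '.*' is known to be match-all, so accepting
    each candidate split point costs O(1) instead of O(i)."""
    ops = 0
    for i in range(len(s), -1, -1):
        ops += 2                      # accept '.*' for free, test '='
        if i < len(s) and s[i] == '=':
            break
    return ops
```

For repeat('x',40) || '=' || repeat('y',50000), the naive model performs on the order of a billion character inspections while the short-circuit model performs about a hundred thousand, which mirrors the ~6 s vs ~7 ms difference.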
There are two non-boilerplate parts of the 0002 patch. One is the
checkmatchall() function that determines whether an NFA is match-all,
and if so what the min and max match lengths are. This is actually not
very complicated once you understand what the regex engine does at the
"pre" and "post" states. (See the "Detailed semantics" part of
regex/README for some info about that, which I tried to clarify as part
of the patch.) Other than those endpoint conditions it's just a
recursive graph search. The other hard part is the changes in
rege_dfa.c to provide the actual short-circuit behavior at runtime.
That's ticklish because it's trying to emulate some overly complicated
and underly documented code, particularly in longest() and shortest().
I think that stuff is right; I've studied it and tested it. But it
could use more eyeballs.
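In toy form, the graph-search part of the idea looks roughly like this (Python; the arc layout and RAINBOW sentinel are assumptions, this sketch ignores the pre/post endpoint conditions the real checkmatchall() must handle, and it assumes dead-end states have already been removed, as the engine's optimize() pass arranges):

```python
from collections import deque

RAINBOW = "rainbow"   # assumed sentinel for the match-anything color

def check_matchall(nfa, pre, post):
    """nfa: dict state -> list of (color, target) arcs.  Return
    (min_len, max_len) if the sub-NFA matches strings of those lengths
    regardless of content, else None.  max_len can be float('inf')."""
    # Any non-rainbow arc means matching depends on actual characters.
    for arcs in nfa.values():
        for color, _ in arcs:
            if color != RAINBOW:
                return None
    # Minimum match length: BFS shortest path from pre to post.
    dist = {pre: 0}
    queue = deque([pre])
    while queue:
        s = queue.popleft()
        for _, t in nfa.get(s, []):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    if post not in dist:
        return None
    # Maximum match length: DFS longest path; revisiting a state on
    # the current path means arbitrarily long matches are possible
    # (valid given the no-dead-ends assumption above).
    def longest(s, onpath):
        if s == post:
            return 0
        if s in onpath:
            return float("inf")
        best = -1
        for _, t in nfa.get(s, []):
            d = longest(t, onpath | {s})
            if d >= 0:
                best = max(best, 1 + d)
        return best
    return dist[post], longest(pre, frozenset())
```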
Notably, I had to add some more test cases to test_regex.sql to exercise
the short-circuit part of matchuntil() properly. That's only used for
lookbehind constraints, so we won't hit the short-circuit path except
with something like '(?<=..)', which is maybe a tad silly.
I'll add this to the upcoming commitfest.
regards, tom lane
Hi Tom,
On Thu, Feb 11, 2021, at 05:39, Tom Lane wrote:
0001-invent-rainbow-arcs.patch
0002-recognize-matchall-NFAs.patch
Many thanks for working on the regex engine,
this looks like an awesome optimization.
To test the correctness of the patches,
I thought it would be good to have some real-life regexes,
and, just as importantly, some real-life text strings
to which the real-life regexes are applied.
I therefore patched Chromium's V8 regex engine
to log the actual regexes that get compiled when
visiting websites, and also the text strings
the regexes are applied to when they are executed.
I logged the regex and text strings as base64 encoded
strings to STDOUT, to make it easy to grep out the data,
so it could be imported into PostgreSQL for analytics.
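The decoding step might look like this (Python; the tag and field layout are illustrative assumptions, since the actual log format emitted by the patched V8 isn't shown here):

```python
import base64

def decode_log(lines, tag="REGEXP\t"):
    """Yield (pattern, flags, subject) tuples from log lines of the
    assumed form 'REGEXP<TAB>b64(pattern)<TAB>b64(flags)<TAB>b64(subject)',
    skipping everything else, ready for COPY into PostgreSQL."""
    for line in lines:
        if not line.startswith(tag):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        yield tuple(base64.b64decode(f).decode("utf-8", "replace")
                    for f in fields)
```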
In total, I scraped the first page of some ~50k websites,
which produced 45M test rows to import;
after a GROUP BY pattern and flags, these reduced
to 235k distinct regex patterns
and 1.5M distinct text string subjects.
Here are some statistics on the different flags used:
SELECT *, SUM(COUNT) OVER () FROM (SELECT flags, COUNT(*) FROM patterns GROUP BY flags) AS x ORDER BY COUNT DESC;
flags | count | sum
-------+--------+--------
| 150097 | 235204
i | 43537 | 235204
g | 22029 | 235204
gi | 15416 | 235204
gm | 2411 | 235204
gim | 602 | 235204
m | 548 | 235204
im | 230 | 235204
y | 193 | 235204
gy | 60 | 235204
giy | 29 | 235204
giu | 26 | 235204
u | 11 | 235204
iy | 6 | 235204
gu | 5 | 235204
gimu | 2 | 235204
iu | 1 | 235204
my | 1 | 235204
(18 rows)
As we can see, no flag at all is the most common, followed by the "i" flag.
Most of the JavaScript regexes (97%) could be understood by PostgreSQL;
only 3% produced some kind of error, which is not unexpected,
since some JavaScript regex features like \w and \W have different
syntax in PostgreSQL:
SELECT *, SUM(COUNT) OVER () FROM (SELECT is_match,error,COUNT(*) FROM subjects GROUP BY is_match,error) AS x ORDER BY count DESC;
is_match | error | count | sum
----------+---------------------------------------------------------------+--------+---------
f | | 973987 | 1489489
t | | 474225 | 1489489
| invalid regular expression: invalid escape \ sequence | 39141 | 1489489
| invalid regular expression: invalid character range | 898 | 1489489
| invalid regular expression: invalid backreference number | 816 | 1489489
| invalid regular expression: brackets [] not balanced | 327 | 1489489
| invalid regular expression: invalid repetition count(s) | 76 | 1489489
| invalid regular expression: quantifier operand invalid | 17 | 1489489
| invalid regular expression: parentheses () not balanced | 1 | 1489489
| invalid regular expression: regular expression is too complex | 1 | 1489489
(10 rows)
Having had some fun looking at statistics, let's move on to see whether there are
any observable differences between HEAD (8063d0f6f56e53edd991f53aadc8cb7f8d3fdd8f)
and the same build with these two patches applied.
To detect any differences,
for each (regex pattern, text string subject) pair,
the columns

    is_match boolean
    captured text[]
    error text

were set by a PL/pgSQL function running on HEAD:
BEGIN
_is_match := _subject ~ _pattern;
_captured := regexp_match(_subject, _pattern);
EXCEPTION WHEN OTHERS THEN
UPDATE subjects SET
error = SQLERRM
WHERE subject_id = _subject_id;
CONTINUE;
END;
UPDATE subjects SET
is_match = _is_match,
captured = _captured
WHERE subject_id = _subject_id;
The patches
0001-invent-rainbow-arcs.patch
0002-recognize-matchall-NFAs.patch
were then applied and this query was executed to spot any differences:
SELECT
is_match <> (subject ~ pattern) AS is_match_diff,
captured IS DISTINCT FROM regexp_match(subject, pattern) AS captured_diff,
COUNT(*)
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern) OR captured IS DISTINCT FROM regexp_match(subject, pattern))
GROUP BY 1,2
ORDER BY 3 DESC
;
The query was first run on unpatched HEAD to verify that it finds no differences.
Indeed it returned 0 rows, and it took this long to finish:
Time: 186077.866 ms (03:06.078)
Running the same query with the two patches applied was significantly faster:
Time: 111785.735 ms (01:51.786)
No is_match differences were detected, good!
However, there were 23 cases where what got captured differed:
-[ RECORD 1 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject | v-cloak
is_match_head | t
captured_head | {cloak,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-cloak}
-[ RECORD 2 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (?:^v-([a-z0-9-]+))?(?:(?::|^@|^#)(\[[^\]]+\]|[^\.]+))?(.+)?$
subject | v-if
is_match_head | t
captured_head | {if,NULL,NULL}
is_match_patch | t
captured_patch | {NULL,NULL,v-if}
-[ RECORD 3 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?a5oc.com).*
subject | https://a5oc.com/attachments/6b184e79-6a7f-43e0-ac59-7ed9d0a8eb7e-jpeg.179582/
is_match_head | t
captured_head | {https://,a5oc.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 4 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?allfordmustangs.com).*
subject | https://allfordmustangs.com/attachments/e463e329-0397-4e13-ad41-f30c6bc0659e-jpeg.779299/
is_match_head | t
captured_head | {https://,allfordmustangs.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 5 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?audi-forums.com).*
subject | https://audi-forums.com/attachments/screenshot_20210207-151100_ebay-jpg.11506/
is_match_head | t
captured_head | {https://,audi-forums.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 6 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?can-amforum.com).*
subject | https://can-amforum.com/attachments/resized_20201214_163325-jpeg.101395/
is_match_head | t
captured_head | {https://,can-amforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 7 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?contractortalk.com).*
subject | https://contractortalk.com/attachments/maryann-porch-roof-quote-12feb2021-jpg.508976/
is_match_head | t
captured_head | {https://,contractortalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 8 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?halloweenforum.com).*
subject | https://halloweenforum.com/attachments/dead-fred-head-before-and-after-jpg.744080/
is_match_head | t
captured_head | {https://,halloweenforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 9 ]--+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?horseforum.com).*
subject | https://horseforum.com/attachments/dd90f089-9ae9-4521-98cd-27bda9ad38e9-jpeg.1109329/
is_match_head | t
captured_head | {https://,horseforum.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 10 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?passatworld.com).*
subject | https://passatworld.com/attachments/clean-passat-jpg.102337/
is_match_head | t
captured_head | {https://,passatworld.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 11 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?plantedtank.net).*
subject | https://plantedtank.net/attachments/brendon-60p-jpg.1026075/
is_match_head | t
captured_head | {https://,plantedtank.net,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 12 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?vauxhallownersnetwork.co.uk).*
subject | https://vauxhallownersnetwork.co.uk/attachments/opelnavi-jpg.96639/
is_match_head | t
captured_head | {https://,vauxhallownersnetwork.co.uk,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 13 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?volvov40club.com).*
subject | https://volvov40club.com/attachments/img_20210204_164157-jpg.17356/
is_match_head | t
captured_head | {https://,volvov40club.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 14 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?vwidtalk.com).*
subject | https://vwidtalk.com/attachments/1613139846689-png.1469/
is_match_head | t
captured_head | {https://,vwidtalk.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 15 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^.*://)?((www.)?yellowbullet.com).*
subject | https://yellowbullet.com/attachments/20210211_133934-jpg.204604/
is_match_head | t
captured_head | {https://,yellowbullet.com,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 16 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[^\?]*)?(\?[^#]*)?(#.*$)?
subject | https://www.disneyonice.com/oneIdResponder.html
is_match_head | t
captured_head | {https://www.disneyonice.com/oneIdResponder.html,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 17 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject | /
is_match_head | t
captured_head | {/,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 18 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^[a-zA-Z0-9\/_-]+)*(\.[a-zA-Z]+)?
subject | /en.html
is_match_head | t
captured_head | {/en,.html}
is_match_patch | t
captured_patch |
-[ RECORD 19 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^https?:\/\/)?(((\[[^\]]+\])|([^:\/\?#]+))(:(\d+))?)?([^\?#]*)(.*)?
subject | https://e.echatsoft.com/mychat/visitor
is_match_head | t
captured_head | {https://,e.echatsoft.com,e.echatsoft.com,NULL,e.echatsoft.com,NULL,NULL,/mychat/visitor,""}
is_match_patch | t
captured_patch | {NULL,https,https,NULL,https,NULL,NULL,://e.echatsoft.com/mychat/visitor,""}
-[ RECORD 20 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | (^|.)41nbc.com$|(^|.)41nbc.dev$|(^|.)52.23.179.12$|(^|.)52.3.245.221$|(^|.)clipsyndicate.com$|(^|.)michaelbgiordano.com$|(^|.)syndicaster.tv$|(^|.)wdef.com$|(^|.)wdef.dev$|(^|.)wxxv.mysiteserver.net$|(^|.)wxxv25.dev$|(^|.)clipsyndicate.com$|(^|.)syndicaster.tv$
subject | wdef.com
is_match_head | t
captured_head | {NULL,NULL,NULL,NULL,NULL,NULL,NULL,"",NULL,NULL,NULL,NULL,NULL}
is_match_patch | t
captured_patch |
-[ RECORD 21 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^((^\w+:|^)\/\/)?(?:www\.)?
subject | https://www.deputy.com/
is_match_head | t
captured_head | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 22 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^((^\w+:|^)\/\/)?(?:www\.)?
subject | https://www.westernsydney.edu.au/
is_match_head | t
captured_head | {https://,https:}
is_match_patch | t
captured_patch |
-[ RECORD 23 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pattern | ^(https?:){0,1}\/\/|
subject | https://ui.powerreviews.com/api/
is_match_head | t
captured_head | {https:}
is_match_patch | t
captured_patch | {NULL}
The code to reproduce the results has been pushed here:
https://github.com/truthly/regexes-in-the-wild
Let me know if you want access to the dataset;
I could open up a port to my PostgreSQL server so you could take a dump.
SELECT
pg_size_pretty(pg_relation_size('patterns')) AS patterns,
pg_size_pretty(pg_relation_size('subjects')) AS subjects;
patterns | subjects
----------+----------
20 MB | 568 MB
(1 row)
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
In total, I scraped the first page of some ~50k websites,
which produced 45M test rows to import;
after a GROUP BY pattern and flags, these reduced
to 235k distinct regex patterns
and 1.5M distinct text string subjects.
This seems like an incredibly useful test dataset.
I'd definitely like a copy.
No is_match differences were detected, good!
Cool ...
However, there were 23 cases where what got captured differed:
I shall take a closer look at that.
Many thanks for doing this work!
regards, tom lane
"Joel Jacobson" <joel@compiler.org> writes:
No is_match differences were detected, good!
However, there were 23 cases where what got captured differed:
These all stem from the same oversight: checkmatchall() was being
too cavalier by ignoring "pseudocolor" arcs, which are arcs that
match start-of-string or end-of-string markers. I'd supposed that
pseudocolor arcs necessarily match parallel RAINBOW arcs, because
they start out that way (cf. newnfa). But it turns out that
some edge-of-string constraints can be optimized in such a way that
they only appear in the final NFA in the guise of missing or extra
pseudocolor arcs. We have to actually check that the pseudocolor arcs
match the RAINBOW arcs, otherwise our "matchall" NFA isn't one because
it acts differently at the start or end of the string than it does
elsewhere.
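Per state, the added check can be sketched like this (Python; assumed arc layout, purely illustrative of the rule just described):

```python
RAINBOW = "rainbow"   # assumed sentinel for the match-anything color

def pseudo_matches_rainbow(arcs, pseudo_colors):
    """arcs: (color, target) out-arcs of one state.  For the NFA to be
    truly match-all, each pseudocolor (the colors used for the start-
    and end-of-string markers) must lead to exactly the same target
    states as the RAINBOW arcs do -- otherwise the NFA behaves
    differently at the string boundaries than in the middle."""
    rainbow_targets = {t for c, t in arcs if c == RAINBOW}
    return all({t for c, t in arcs if c == pc} == rainbow_targets
               for pc in pseudo_colors)
```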
So here's a revised pair of patches (0001 is actually the same as
before).
Thanks again for testing!
regards, tom lane
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
0001-invent-rainbow-arcs-2.patch
0002-recognize-matchall-NFAs-2.patch
I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.
Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:
This 100577-byte regex (pattern_id = 207811)...
\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... |wmflabs\.org|yolasite\.com|za\.net|za\.org)$
...previously raised...
error invalid regular expression: regular expression is too complex
...but now goes through:
is_match <NULL> => t
captured <NULL> => {de}
error invalid regular expression: regular expression is too complex => <NULL>
Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.
The test that found the deviation tests each (pattern, text string) individually,
to catch errors. But I've also made a separate query to just test regexes
known to not cause errors, to allow testing all regexes in one big query,
which fully utilizes the CPU cores and also runs quicker.
Below is the result of the performance test query:
\timing
SELECT
tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
?column? | ?column? | count
----------+----------+---------
t | t | 1448212
(1 row)
Time: 592196.145 ms (09:52.196)
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
?column? | ?column? | count
----------+----------+---------
t | t | 1448212
(1 row)
Time: 461739.364 ms (07:41.739)
That's an impressive 22% speed-up!
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
I've successfully tested both patches against the 1.5M regexes-in-the-wild dataset.
Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:
This 100577-byte regex (pattern_id = 207811)...
...
...previously raised...
error invalid regular expression: regular expression is too complex
...but now goes through:
Nice. The patched regex engine is apparently capable of handling even more complex regexes than before.
Yeah. There are various limitations that can lead to REG_ETOOBIG, but the
main ones are "too many states" and "too many arcs". The RAINBOW change
directly reduces the number of arcs and thus makes larger regexes feasible.
I'm sure it's coincidental that the one such example you captured happens
to be fixed by this change, but hey I'll take it.
regards, tom lane
"Joel Jacobson" <joel@compiler.org> writes:
Below is the result of the performance test query:
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
Time: 592196.145 ms (09:52.196)
-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
Time: 461739.364 ms (07:41.739)
That's an impressive 22% speed-up!
I've been doing some more hacking over the weekend, and have a couple
of additional improvements to show. The point of these two additional
patches is to reduce the number of "struct subre" sub-regexps that
the regex parser creates. The subre's themselves aren't that large,
so this might seem like it would have only small benefit. However,
each subre requires its own NFA for the portion of the RE that it
matches. That adds space, and it also adds compilation time because
we run the "optimize()" pass separately for each such NFA. Maybe
there'd be a way to share some of that work, but I'm not very clear
how. In any case, not having a subre at all is clearly better where
we can manage it.
0003 is a small patch that fixes up parseqatom() so that it doesn't
emit no-op subre's for empty portions of a regexp branch that are
adjacent to a "messy" regexp atom (that is, a capture node, a
backref, or an atom with greediness different from what preceded it).
0004 is a rather larger patch whose result is to get rid of extra
subre's associated with alternation subre's. If we have a|b|c
and any of those alternation branches are messy, we end up with
*
/ \
a *
/ \
b *
/ \
c NULL
where each "*" is an alternation subre node, and all those "*"'s have
identical NFAs that match the whole a|b|c construct. This means that
for an N-way alternation we're going to need something like O(N^2)
work to optimize all those NFAs. That's embarrassing (and I think
it's my fault --- if memory serves, I put in this representation
of messy alternations years ago).
We can improve matters by having just one parent node for an
alternation:
*
\
a -> b -> c
That requires replacing the binary-tree structure of subre's
with a child-and-sibling arrangement, which is not terribly
difficult but accounts for most of the bulk of the patch.
(I'd wanted to do that for years, but up till now I did not
think it would have any real material benefit.)
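The shape change can be sketched with toy node structures (Python; not the actual C structs): since every "*" node carries its own NFA covering the whole alternation, the count of such nodes directly tracks the number of whole-alternation optimize() passes.

```python
class Subre:
    """Toy stand-in for the parser's 'struct subre' (illustrative only)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def binary_alt(branches):
    """a|b|c as HEAD builds it: *(a, *(b, *(c, NULL))) -- one
    alternation node, and hence one whole-alternation NFA to
    optimize, per branch."""
    node = None
    for b in reversed(branches):
        node = Subre("*", [Subre(b)] + ([node] if node else []))
    return node

def nary_alt(branches):
    """The patched child-and-sibling form: one parent for all branches."""
    return Subre("*", [Subre(b) for b in branches])

def count_alt_nodes(node):
    if node is None:
        return 0
    return (node.label == "*") + sum(count_alt_nodes(c)
                                     for c in node.children)
```

For an N-way alternation the binary form yields N alternation nodes (hence ~O(N^2) total optimization work over NFAs of size ~N), while the n-ary form yields exactly one.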
There might be more that can be done in this line, but that's
as far as I got so far.
I did some testing on this using your dataset (thanks for
giving me a copy) and this query:
SELECT
pattern,
subject,
is_match AS is_match_head,
captured AS captured_head,
subject ~ pattern AS is_match_patch,
regexp_match(subject, pattern) AS captured_patch
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern)
OR captured IS DISTINCT FROM regexp_match(subject, pattern));
I got these runtimes (non-cassert builds):
HEAD 313661.149 ms (05:13.661)
+0001 297397.293 ms (04:57.397) 5% better than HEAD
+0002 151995.803 ms (02:31.996) 51% better than HEAD
+0003 139843.934 ms (02:19.844) 55% better than HEAD
+0004 95034.611 ms (01:35.035) 69% better than HEAD
Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly. I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.
Anyway, I'm feeling pretty pleased with these results ...
regards, tom lane
Attachments:
0001-invent-rainbow-arcs-3.patch
0002-recognize-matchall-NFAs-3.patch
0003-remove-useless-concat-nodes.patch
0004-make-subre-trees-Nary.patch
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
I got these runtimes (non-cassert builds):
HEAD  313661.149 ms (05:13.661)
+0001 297397.293 ms (04:57.397)  5% better than HEAD
+0002 151995.803 ms (02:31.996)  51% better than HEAD
+0003 139843.934 ms (02:19.844)  55% better than HEAD
+0004 95034.611 ms (01:35.035)  69% better than HEAD
Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly. I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.
Mind blowing speed-up, wow!
I've tested all 4 patches successfully.
To eliminate the baseline cost of the join,
I first created this table:
CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;
Then I ran this query:
\timing
SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;
All patches gave the same result:
?column? | ?column? | count
----------+----------+---------
f | f | 1448212
(1 row)
I.e., no detected semantic differences.
Timing differences:
HEAD 570632.722 ms (09:30.633)
+0001 472938.857 ms (07:52.939) 17% better than HEAD
+0002 451638.049 ms (07:31.638) 20% better than HEAD
+0003 439377.813 ms (07:19.378) 23% better than HEAD
+0004 96447.038 ms (01:36.447) 83% better than HEAD
I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 running macOS Big Sur 11.1:
SELECT version();
version
----------------------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)
My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.
PostgreSQL was compiled with just ./configure, no parameters, and the only non-default postgresql.conf settings were these:
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'
Amazing work!
I hope to have a new dataset ready soon with regex flags for applied subjects as well.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
I've tested all 4 patches successfully.
Thanks!
I found one other area that could be improved with the same idea of
getting rid of unnecessary subre's: right now, every pair of capturing
parentheses gives rise to a "capture" subre with an NFA identical to
its single child subre (which is what does the actual matching work).
While this doesn't really add any runtime cost, the duplicate NFA
definitely does add to the compilation cost, since we run it through
optimization independently of the child.
I initially thought that we could just flush capture subres altogether
in favor of annotating their children with a "we need to capture this
result" marker. However, Spencer's regression tests soon exposed the
flaw in that notion. It's legal to write "((x))" or even "((((x))))",
so that there can be multiple sets of capturing parentheses with a
single child node. The solution adopted in the attached 0005 is to
handle the innermost capture with a marker on the child subre, but if
we need an additional capture on a node that's already marked, put
a capture subre on top just like before. One could instead complicate
the data structure by allowing N capture markers on a single subre
node, but I judged that not to be a good tradeoff. I don't see any
reason except spec compliance to allow such equivalent capture groups,
so I don't care if they're a bit inefficient. (If anyone knows of a
useful application for writing REs like this, we could reconsider that
choice.)
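The rule is easy to visualize with a toy model (illustrative Python only; the names below are invented for this sketch and are not the engine's actual subre API):

```python
# Toy model of the 0005 scheme: the first capture applied to a node becomes a
# direct marker on it; a second capture on an already-marked node falls back
# to a wrapper node, as the old code always produced.

class Node:
    def __init__(self, kind, child=None, capno=0):
        self.kind = kind      # 'atom' does the matching work, 'capture' is a wrapper
        self.child = child
        self.capno = capno    # 0 means "not capturing"

def add_capture(node, capno):
    """Make node's match be captured as group `capno`."""
    if node.capno == 0:
        node.capno = capno            # cheap case: mark the node directly
        return node
    # already marked for another group: wrap it, just like before
    return Node('capture', child=node, capno=capno)

# "((x))": applying captures inner-to-outer, the inner group is a free
# marker, and only the redundant outer group costs an extra node.
x = add_capture(Node('atom'), 2)   # inner parens (group 2)
top = add_capture(x, 1)            # outer parens (group 1) needs a wrapper
```

So a single set of capturing parens costs no extra node at all, and only the spec-compliance cases like "((x))" pay for a wrapper.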
One small issue with marking the child directly is that we can't get
away any longer with overlaying capture and backref subexpression
numbers, since you could theoretically write (\1) which'd result in
needing to put a capture label on a backref subre. This could again
have been handled by making the capture a separate node, but I really
don't much care for the way that subre.subno has been overloaded for
three(!) different purposes depending on node type. So I just broke
it into three separate fields. Maybe the incremental cost of the
larger subre struct was worth worrying about back in 1997 ... but
I kind of doubt that it was a useful micro-optimization even then,
considering the additional NFA baggage that every subre carries.
Also, I widened "subre.id" from short to int, since the narrower field
no longer saves anything given the new struct layout. The existing
choice was dubious already, because every other use of subre ID
numbers was int or even size_t, and there was nothing checking for
overflow of the id fields. (Although perhaps it doesn't matter,
since I'm unsure that the id fields are really used for anything
except debugging purposes.)
For me, 0005 makes a fairly perceptible difference on your test case
subject_id = 611875, which I've been paying attention to because it's
the one that failed with "regular expression is too complex" before.
I see about a 20% time savings from 0004 on that case, but not really
any noticeable difference in the total runtime for the whole suite.
So I think we're getting to the point of diminishing returns for
this concept (another reason for not chasing after optimization of
the duplicate-captures case). Still, we're clearly way ahead of
where we started.
Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements. I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.
regards, tom lane
Attachments:
0001-invent-rainbow-arcs-4.patch
0002-recognize-matchall-NFAs-4.patch
0003-remove-useless-concat-nodes-4.patch
0004-make-subre-trees-Nary-4.patch
0005-remove-separate-capture-nodes-4.patch
On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:
Attached is an updated patch series; it's rebased over 4e703d671
which took care of some not-really-related fixes, and I made a
pass of cleanup and comment improvements. I think this is pretty
much ready to commit, unless you want to do more testing or
code-reading.
I've produced a new dataset which now also includes the regex flags (if any) used for each subject applied to a pattern.
The new dataset contains 318364 patterns and 4474520 subjects.
(The old one had 235204 patterns and 1489489 subjects.)
I've tested the new dataset against PostgreSQL 10.16, 11.11, 12.6, 13.2, HEAD (4e703d671) and HEAD+patches.
I based the comparisons on the subjects that didn't cause an error on 13.2:
CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
patterns.flags,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
;
I then measured the query below for each PostgreSQL version:
\timing
SELECT version();
SELECT
is_match <> (subject ~ pattern) AS is_match_diff,
captured IS DISTINCT FROM regexp_match(subject, pattern, flags) AS captured_diff,
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;
All versions produce the same result:
is_match_diff | captured_diff | count
---------------+---------------+---------
f | f | 3254769
(1 row)
Good! Not a single differing case out of over 3 million regex pattern/subject combinations,
across five major PostgreSQL versions! That's a very stable regex engine.
To get a feeling for the standard deviation of the timings,
I executed the same query above three times for each PostgreSQL version:
PostgreSQL 10.16 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
Time: 795674.830 ms (13:15.675)
Time: 794249.704 ms (13:14.250)
Time: 771036.707 ms (12:51.037)
PostgreSQL 11.11 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 765466.191 ms (12:45.466)
Time: 787135.316 ms (13:07.135)
Time: 779582.635 ms (12:59.583)
PostgreSQL 12.6 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 785500.516 ms (13:05.501)
Time: 784511.591 ms (13:04.512)
Time: 786727.973 ms (13:06.728)
PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)
PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)
HEAD (4e703d671)+0001+0002+0003+0004+0005
Time: 141290.329 ms (02:21.290)
Time: 141849.709 ms (02:21.850)
Time: 141630.819 ms (02:21.631)
That's a mind-blowing speed-up!
I also ran the more detailed test between 13.2 and HEAD+patches,
which also checks for differences in errors.
Like before, one similar improvement was found: a case
which previously resulted in an error now goes through OK:
SELECT * FROM vdeviations;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$
flags |
subject | www.aeroexpo.online
count | 1
a_server_version | 13.2
a_duration | 00:00:00.298253
a_is_match |
a_captured |
a_error | invalid regular expression: regular expression is too complex
b_server_version | 14devel
b_duration | 00:00:00.665958
b_is_match | t
b_captured | {online}
b_error |
Very nice.
I've uploaded the new dataset to the same place as before.
The schema for it can be found at https://github.com/truthly/regexes-in-the-wild
If anyone else would like a copy of the 715MB dataset, please let me know.
/Joel
On Thu, Feb 18, 2021, at 11:30, Joel Jacobson wrote:
SELECT * FROM vdeviations;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$
Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.
I noticed it when trying to read from the vdeviations view in PostgreSQL 13.2.
Here is my little helper-function which I used to shrink patterns/subjects longer than N characters:
CREATE OR REPLACE FUNCTION shrink_text(text,integer) RETURNS text LANGUAGE sql AS $$
SELECT CASE WHEN length($1) < $2 THEN $1 ELSE
format('%s ... %s chars ... %s', m[1], length(m[2]), m[3])
END
FROM (
SELECT regexp_matches($1,format('^(.{1,%1$s})(.*?)(.{1,%1$s})$',$2/2)) AS m
) AS q
$$;
The regex aims to produce three capture groups,
where I wanted the first and third ones to be greedy
and match up to $2 characters (controlled by the second input param to the function),
and the second capture group in the middle to be non-greedy,
but match the remainder to make up a fully anchored match.
It works as expected in both 13.2 and HEAD+patches, but the speed-up is enormous:
PostgreSQL 13.2:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
QUERY PLAN
-------------------------------------------------------------------------------------------------
ProjectSet (cost=0.00..0.02 rows=1 width=32) (actual time=23600.816..23600.838 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
Planning Time: 0.432 ms
Execution Time: 23600.859 ms
(4 rows)
HEAD+0001+0002+0003+0004+0005:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
QUERY PLAN
-------------------------------------------------------------------------------------------
ProjectSet (cost=0.00..0.02 rows=1 width=32) (actual time=36.656..36.661 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.002 rows=1 loops=1)
Planning Time: 0.575 ms
Execution Time: 36.689 ms
(4 rows)
Cool stuff.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
I've produced a new dataset which now also includes the regex flags (if
any) used for each subject applied to a pattern.
Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well. Even discounting
that, I'm not getting quite the same results, and I don't understand
why not. So how was that made from the raw "patterns" and "subjects"
tables?
PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)
PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)
Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD. I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan). Still, the next drop to circa 2:21 runtime is impressive enough
by itself.
Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.
Yeah, that just happens to be a poster child for the MATCHALL idea:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one. That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.
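As a rough sketch of why each such check is cheap (illustrative Python, not the engine's C; the function names are mine, and I assume the greedy/non-greedy preferences resolve as just described): testing whether a MATCHALL sub-RE accepts a candidate substring degenerates to a length comparison, so the division points for Joel's pattern can be found arithmetically.

```python
# Once a sub-RE is known to match *any* string whose length lies in [lo, hi],
# "does this substring match?" is an O(1) length test -- no character scan.

def matchall_test(lo, hi, length):
    """hi is None for an unbounded range like 0..infinity."""
    return lo <= length and (hi is None or length <= hi)

def divide(n):
    """Division points for ^(.{1,80})(.*?)(.{1,80})$ over a string of
    length n (n >= 2): the greedy head takes its maximum, and the
    non-greedy middle yields as much as possible to the tail."""
    head = min(80, n - 1)            # must leave at least 1 char for the tail
    tail = min(80, n - head)
    middle = n - head - tail
    assert matchall_test(1, 80, head)
    assert matchall_test(0, None, middle)
    assert matchall_test(1, 80, tail)
    return head, middle, tail

# For repeat('a',100000) this yields lengths (80, 99840, 80) without ever
# scanning the 100000 characters.
```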
regards, tom lane
I thought it was worth looking a little more closely at the error
cases in this set of tests, as a form of competitive analysis versus
Javascript's regex engine. I ran through the cases that gave errors,
and pinned down exactly what was causing the error for as many cases
as I could. (These results are from your first test corpus, but
I doubt the second one would give different conclusions.)
We have these errors reported in the test corpus:
error | count
-----------------------------------+-------
invalid escape \ sequence | 39141
invalid character range | 898
invalid backreference number | 816
brackets [] not balanced | 327
invalid repetition count(s) | 76
quantifier operand invalid | 17
parentheses () not balanced | 1
regular expression is too complex | 1
The existing patchset takes care of the one "regular expression is too
complex" failure. Of the rest:
It turns out that almost 39000 of the "invalid escape \ sequence"
errors are due to use of \D, \S, or \W within a character class.
We support the positive-class shorthands \d, \s, \w there, but not
their negations. I think that this might be something that Henry
Spencer just never got around to; I don't see any fundamental reason
we can't allow it, although some refactoring might be needed in the
regex lexer. Given the apparent popularity of this notation, maybe
we should put some work into that.
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
About half of the remaining escape-sequence complaints seem to be due
to just randomly backslashing alphanumeric characters that don't need
it, as for example "i" in "\itunes\.apple\.com". Apparently
Javascript is content to take "\i" as just meaning "i". Our engine
rejects that, with a view to keeping such combinations reserved for
future definition. That's fine by me so I don't want to change it.
Of the rest, many are abbreviated numeric escapes, eg "\u45" where our
engine wants to see "\u0045". I don't think being laxer about that
would be a great idea either.
Lastly, there are some occurrences like "[\1]", which in context look
like the \1 might be intended as a back-reference? But I don't really
understand what that's supposed to do inside a bracket expression.
The "invalid character range" errors seem to be coming from constructs
like "[A-Za-z0-9-/]", which our engine rejects because it looks like
a messed-up character range.
All but 123 of the "invalid backreference number" complaints stem from
using backrefs inside lookahead constraints. Some of the rest look
like they think you can put capturing parens inside a lookahead
constraint and then backref that. I'm not really convinced that such
constructs have a well-defined meaning. (I looked at the ECMAscript
definition of regexes, and they do say it's allowed, but when trying
to define it they resort to handwaving about backtracking; at best that
is a particularly lame version of specification by implementation.)
Spencer chose to forbid these cases in our engine, and I think there
are very good implementation reasons why it won't work. Perhaps we
could provide a clearer error message about it, though.
307 of the "brackets [] not balanced" errors, as well as the one
"parentheses () not balanced" error, seem to trace to the fact that
Javascript considers "[]" to be a legal empty character class, whereas
POSIX doesn't allow empty character classes so our engine takes the
"]" literally, and then looks for a right bracket it won't find.
(That is, in POSIX "[]x]" is a character class matching ']' and 'x'.)
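Python's re module happens to follow the same "']' first is literal" convention, so it can illustrate the POSIX-style parse (Javascript, by contrast, reads "[]" as an empty class):

```python
# "[]x]" under the POSIX-style reading: the leading ']' is a literal class
# member, so the class matches ']' and 'x' and nothing else.
import re

assert re.fullmatch(r'[]x]', ']')          # ']' is in the class
assert re.fullmatch(r'[]x]', 'x')          # so is 'x'
assert re.fullmatch(r'[]x]', 'a') is None  # anything else is not
```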
Maybe I'm misinterpreting this too, because if I read the
documentation correctly, "[]" in Javascript matches nothing, making
it impossible for the regex to succeed. Why would such a construct
appear this often?
The remainder of the bracket errors happen because in POSIX, the
sequences "[:", "[=", and "[." within a bracket expression introduce
special syntax, whereas in Javascript '[' is just an ordinary data
character within a bracket expression. Not much we can do here; the
standards are just incompatible.
All but 3 of the "invalid repetition count(s)" errors come from
quantifiers larger than our implementation limit of 255. A lot of
those are exactly 256, though I saw one as high as 3000. The
remaining 3 errors are from syntax like "[0-9]{0-3}", which is a
syntax error according to our engine ("[0-9]{0,3}" is correct).
AFAICT it's not a valid quantifier according to Javascript either;
perhaps that engine is just taking the "{0-3}" as literal text?
Given this, it seems like there's a fairly strong case for increasing
our repetition-count implementation limit, at least to 256, and maybe
1000 or so. I'm hesitant to make the limit *really* large, but if
we can handle a regex containing thousands of "x"'s, it's not clear
why you shouldn't be able to write that as "x{0,1000}".
All of the "quantifier operand invalid" errors come from these
three patterns:
((?!\\)?\{0(?!\\)?\})
((?!\\)?\{1(?!\\)?\})
class="(?!(tco-hidden|tco-display|tco-ellipsis))+.*?"|data-query-source=".*?"|dir=".*?"|rel=".*?"
which are evidently trying to apply a quantifier to a lookahead
constraint, which is just silly.
In short, a lot of this is from incompatible standards, or maybe
from varying ideas about whether to throw an error for invalid
constructs. But I see a couple things we could improve.
regards, tom lane
On Thu, Feb 18, 2021, at 19:10, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
I've produced a new dataset which now also includes the regex flags (if
any) used for each subject applied to a pattern.
Again, thanks for collecting this data! I'm a little confused about
how you produced the results in the "tests" table, though. It sort
of looks like you tried to feed the Javascript flags to regexp_match(),
which unsurprisingly doesn't work all that well.
That's exactly what I did. Some of the flags work the same between Javascript and PostgreSQL, others don't.
I thought maybe something interesting would surface in just trying them blindly.
Flags that aren't supported and give errors are reported as tests where error is not null.
Most patterns have no flags, and second most popular is just the "i" flag, which should work the same.
SELECT flags, COUNT(*) FROM patterns GROUP BY 1 ORDER BY 2 DESC;
flags | count
-------+--------
| 151927
i | 120336
gi | 26057
g | 13263
gm | 4606
gim | 699
im | 491
y | 367
m | 365
gy | 105
u | 50
giy | 38
giu | 20
gimu | 14
iy | 11
iu | 6
gimy | 3
gu | 2
gmy | 2
imy | 1
my | 1
(21 rows)
This query shows which Javascript regex flags could be used as-is without errors:
SELECT
patterns.flags,
COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
GROUP BY 1
ORDER BY 2;
flags | count
-------+---------
im | 2534
m | 4460
i | 543598
| 2704177
(4 rows)
I considered filtering/converting the flags to PostgreSQL,
maybe that would be an interesting approach to try as well.
Even discounting
that, I'm not getting quite the same results, and I don't understand
why not. So how was that made from the raw "patterns" and "subjects"
tables?
The rows in the tests table were generated by the create_regexp_tests() function [1].
Each subject now has a foreign key to a specific pattern,
where the (pattern, flags) combination are unique in patterns.
The actual unique constraint is on (pattern_hash, flags) to avoid
an index directly on pattern which can be huge as we've seen.
So, for each subject, it is known via the pattern_id
exactly what flags were used when the regex was compiled
(and later executed/applied with the subject).
[1]: https://github.com/truthly/regexes-in-the-wild/blob/master/create_regexp_tests.sql
PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)
PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)
Hmmm ... we haven't yet committed any performance-relevant changes to the
regex code, so it can't take any credit for this improvement from 13.2 to
HEAD. I speculate that this is due to some change in our parallelism
stuff (since I observe that this query is producing a parallelized hash
plan). Still, the next drop to circa 2:21 runtime is impressive enough
by itself.
OK. Another factor might be that the PostgreSQL 10, 11, 12, and 13 versions were compiled elsewhere:
I used the OS X binaries from https://postgresapp.com/, whereas version 14 I of course compiled myself.
Maybe I should have compiled 10, 11, 12, and 13 myself for a better comparison,
but I mostly wanted to verify whether I could find any differences; the performance comparison was a bonus.
Heh, what a funny coincidence:
The regex I used to shrink the very-long-pattern,
actually happens to run a lot faster with the patches.
Yeah, that just happens to be a poster child for the MATCHALL idea:
EXPLAIN ANALYZE SELECT regexp_matches(repeat('a',100000),'^(.{1,80})(.*?)(.{1,80})$');
Each of the parenthesized subexpressions of the RE is successfully
recognized as being MATCHALL, with length range 1..80 for two of them and
0..infinity for the middle one. That means the engine doesn't have to
physically scan the text to determine whether a possible division point
satisfies the sub-regexp; and that means we can find the correct division
points in O(N) not O(N^2) time.
Very nice.
Like you said earlier, perhaps the regex engine has been optimized enough for this time.
If not, you want to investigate an additional idea,
that I think can be seen as a generalization of the optimization trick for (.*),
if I've understood how it works correctly.
Let's see if I can explain the idea:
One of the problems with representing regexes containing large bracket range expressions, like [a-z],
is that you get an explosion of edges if the model can only represent state transitions for single characters.
If we could instead let a single edge (for a state transition) represent a set of characters,
or, usually even more efficiently, a set of character ranges, then we could reduce the
number of edges needed to represent the graph.
The naive approach of just using the ranges as-is doesn't work.
Instead, the graph must first be created with single-character edges.
We then work out which ranges can be constructed such that no range overlaps
any other, so that every range can be treated as a character in an alphabet.
Perhaps fiddling with a few examples is the easiest way
to get a grip on the idea.
Here is a live demo of the idea:
https://compiler.org/reason-re-nfa/src/index.html
The graphs are rendered live when typing in the regex,
using a Javascript port of GraphViz.
For example, try entering the regex: t[a-z]*m
This generates this range-optimized graph for the regex:
/--[a-ln-su-z]-----------------\
|/------t--------------------\ |
|| | |
-->(0)--t-->({0,1})----m-------->({0 1 2}) | |
^---[a-ln-su-z]--/ | |
^-------t-------/ | |
^---------------------------/ |
^-----------------------------/
Notice how the [a-z] bracket expression has been split up,
and we now have 3 distinct ranges:
t
m
[a-ln-su-z]
Since no ranges are overlapping, each such range can safely be seen as a letter in an alphabet.
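The partitioning step can be sketched in a few lines of Python (a toy illustration, not the demo's actual code): cut every input range at every range boundary, so the resulting pieces never partially overlap.

```python
# Given the character ranges used by a regex, split them at every boundary so
# that each resulting piece either fully covers or misses every input range.
# Each piece can then act as one "letter" of the working alphabet.

def partition(ranges):
    """ranges: list of (lo, hi) inclusive character ranges.
    Returns disjoint sub-ranges covering the same characters."""
    # collect boundary points: each range starts at lo and ends after hi
    points = set()
    for lo, hi in ranges:
        points.add(ord(lo))
        points.add(ord(hi) + 1)
    cuts = sorted(points)
    out = []
    for a, b in zip(cuts, cuts[1:]):
        # keep a piece only if some input range covers it
        if any(ord(lo) <= a and b - 1 <= ord(hi) for lo, hi in ranges):
            out.append((chr(a), chr(b - 1)))
    return out

# t[a-z]*m: the literals t and m split [a-z] into the pieces shown above
print(partition([('a', 'z'), ('t', 't'), ('m', 'm')]))
# -> [('a', 'l'), ('m', 'm'), ('n', 's'), ('t', 't'), ('u', 'z')]
```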
Once we have our final graph, but before we proceed to generate the machine code for it,
we can shrink the graph further by merging ranges together, which eliminates some edges:
/--------------\
| |
--->(0)--t-->(1)<--[a-ln-z]--/
|^-[a-lnz]-\
\----m-->((2))<----\
| |
\---m---/
Notice how [a-ln-su-z]+t becomes [a-ln-z].
Another optimization I've come up with (or probably re-invented, since it feels quite obvious)
is to read more than one character at a time, when we know for sure that multiple
characters in a row are expected, by concatenating edges that have only one parent and one child.
In our example, we know for sure at least two characters will be read for the regex t[a-z]*m,
so with this optimization enabled, we get this graph:
/--[a-ln-z]
| |
--->(0)---t[a-ln-z]--->(1)<---+--[a-ln-z]
| | /
| \---m--->((2))<------\
\--------------tm------------^ | |
\----m----/
This doesn't make much difference for a few characters,
but if we have a long pattern with a long sentence
that is repeated, we could e.g. read in 32 bytes
and compare them all in one operation,
if our machine had 256-bit SIMD registers/instructions.
This idea has also been implemented in the online demo.
There is a level which can be adjusted
from 0 to 32 to control how many bytes to merge at most,
located in the "[+]dfa5 = merge_linear(dfa4)" step.
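For what it's worth, the chain-merging step can be sketched like this (toy Python; merge_linear is my name here, echoing the demo's step, and a real implementation would have to protect start/accept states and treat cycles more carefully):

```python
# A state with exactly one incoming and one outgoing edge is a pure
# pass-through: it can be fused away, concatenating the two labels so a
# matcher could compare several characters in one operation.

def merge_linear(edges, protected):
    """edges: list of (src, label, dst); protected: states never fused away."""
    changed = True
    while changed:
        changed = False
        states = {s for (a, _, b) in edges for s in (a, b)}
        for q in states - protected:
            ins = [e for e in edges if e[2] == q]
            outs = [e for e in edges if e[0] == q]
            # fuse q only if it is a pure pass-through (and not a self-loop)
            if len(ins) == 1 and len(outs) == 1 and ins[0][0] != q != outs[0][2]:
                (a, l1, _), (_, l2, b) = ins[0], outs[0]
                edges = [e for e in edges if e not in (ins[0], outs[0])]
                edges.append((a, l1 + l2, b))
                changed = True
                break
    return edges

# A linear automaton for "abc" collapses to one three-character edge:
print(merge_linear([(0, 'a', 1), (1, 'b', 2), (2, 'c', 3)], {0, 3}))
# -> [(0, 'abc', 3)]
```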
Anyway, I can totally understand if you've had enough of regex optimizations for this time,
but if not, I wanted to share my work in this field, in case it's interesting to look at now or in the future.
/Joel
On Thu, Feb 18, 2021, at 20:58, Joel Jacobson wrote:
Like you said earlier, perhaps the regex engine has been optimized enough for this time.
If not, you want to investigate an additional idea,
In the above sentence, I meant "you _may_ want to".
I'm not at all sure these ideas are applicable to the PostgreSQL regex engine,
so feel free to silently ignore them if you feel there's a risk of wasting time.
that I think can be seen as a generalization of the optimization trick for (.*),
if I've understood how it works correctly.
Actually not sure if it can be seen as a generalization,
I just came to think of my ideas since they also improve
the case when you have lots of (.*) or bracket expressions of large ranges.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
Let's see if I can explain the idea:
One of the problems with representing regexes containing large bracket range expressions, like [a-z],
is that you get an explosion of edges if the model can only represent state transitions for single characters.
If we could instead let a single edge (for a state transition) represent a set of characters,
or, usually even more efficiently, a set of character ranges, then we could reduce the
number of edges needed to represent the graph.
The naive approach of just using the ranges as-is doesn't work.
Instead, the graph must first be created with single-character edges.
We then work out which ranges can be constructed such that no range overlaps
any other, so that every range can be treated as a character in an alphabet.
Hmm ... I might be misunderstanding, but I think our engine already
does a version of this. See the discussion of "colors" in
src/backend/regex/README.
Another optimization I've come up with (or probably re-invented, since it feels quite obvious)
is to read more than one character at a time, when we know for sure that multiple
characters in a row are expected, by concatenating edges that have only one parent and one child.
Maybe. In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea. That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.
I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing. So thanks again for collecting it!
regards, tom lane
On Thu, Feb 18, 2021, at 21:44, Tom Lane wrote:
Hmm ... I might be misunderstanding, but I think our engine already
does a version of this. See the discussion of "colors" in
src/backend/regex/README.
Thanks, I will read it with great interest.
Maybe. In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea. That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.
Right. Actually I don't have a clear idea of how it could be implemented in an NFA engine.
I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing. So thanks again for collecting it!
My pleasure. Thanks for using it!
/Joel
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
This popular regex
^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$
is coming from jQuery:
// A simple way to check for HTML strings
// Prioritize #id over <tag> to avoid XSS via location.hash (#9521)
// Strict HTML recognition (#11290: must start with <)
// Shortcut simple #id case for speed
rquickExpr = /^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/,
From: https://code.jquery.com/jquery-3.5.1.js
I think this is a non-POSIX hack to match any character, including newlines,
which are not included unless the "s" flag is set.
Javascript test:
"foo\nbar".match(/(.+)/)[1];
"foo"
"foo\nbar".match(/(.+)/s)[1];
"foo
bar"
"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
I think this is a non-POSIX hack to match any character, including newlines,
which are not included unless the "s" flag is set.
"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"
Oooh, that's very interesting. I guess the advantage of that over using
the 's' flag is that you can have different behaviors at different places
in the same regex.
I was just wondering about this last night in fact, while hacking on
the code to get it to accept \W etc in bracket expressions. I see that
right now, our code thinks that NLSTOP mode ('n' switch, the opposite
of 's') should cause \W \D \S to not match newline. That seems a little
weird, not least because \S should probably be different from the other
two, and it isn't. And now we see it'd mean that you couldn't use the 'n'
switch to duplicate Javascript's default behavior in this area. Should we
change it? (I wonder what Perl does.)
regards, tom lane
On Fri, Feb 19, 2021, at 16:26, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
On Thu, Feb 18, 2021, at 19:53, Tom Lane wrote:
(Having said that, I can't help noticing that a very large fraction
of those usages look like, eg, "[\w\W]". It seems to me that that's
a very expensive and unwieldy way to spell ".". Am I missing
something about what that does in Javascript?)
I think this is a non-POSIX hack to match any character, including newlines,
which are not included unless the "s" flag is set.
"foo\nbar".match(/([\w\W]+)/)[1];
"foo
bar"
Oooh, that's very interesting. I guess the advantage of that over using
the 's' flag is that you can have different behaviors at different places
in the same regex.
I would guess the same thing.
I was just wondering about this last night in fact, while hacking on
the code to get it to accept \W etc in bracket expressions. I see that
right now, our code thinks that NLSTOP mode ('n' switch, the opposite
of 's') should cause \W \D \S to not match newline. That seems a little
weird, not least because \S should probably be different from the other
two, and it isn't. And now we see it'd mean that you couldn't use the 'n'
switch to duplicate Javascript's default behavior in this area. Should we
change it? (I wonder what Perl does.)
regards, tom lane
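Before the three-way comparison below, Javascript's own defaults can be sanity-checked directly in node or a browser console (a quick sketch): the "s" (dotAll) flag there only changes ".", leaving the negated classes alone.

```javascript
// In Javascript, 's' affects only '.'.  The negated classes keep plain
// set semantics: '\W' and '\D' match a newline (it is neither a word
// character nor a digit), while '\S' never does, since newline is
// whitespace.
console.log(/\W/.test("\n"));   // true
console.log(/\D/.test("\n"));   // true
console.log(/\S/.test("\n"));   // false
console.log(/./.test("\n"));    // false (true only with the 's' flag)
```

So Javascript's default already has \W and \D matching newline, which is exactly the behavior PostgreSQL's 'n' switch currently rules out.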
To allow comparing PostgreSQL vs Javascript vs Perl,
I installed three helper functions using plv8 and plperl,
and also one convenience function for PostgreSQL
that catches errors and returns the error string instead.
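(The helper functions themselves aren't included above. As a rough sketch, the core of the plv8 one presumably amounts to something like the following; the function name and shape are assumptions, not the actual definitions.)

```javascript
// Hypothetical core of a plv8 helper such as regexp_match_v8(str,
// pattern, flags).  String.prototype.match returns
// [full match, group1, ...] or null, which is why the queries below
// slice off the first array element.
function regexpMatchV8(str, pattern, flags) {
  return str.match(new RegExp(pattern, flags));
}

const r = regexpMatchV8("foo\nbar", "([\\w\\W]+)", "");
console.log(r.slice(1));  // ["foo\nbar"] -- capture groups only
```

The `[2:]` subscript in the SQL below corresponds to `slice(1)` here, since SQL arrays are 1-based; and the PostgreSQL convenience function presumably wraps regexp_match in an exception handler, which is why compile errors show up as strings in the result tables.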
The string used in this test is "foo!\n!bar",
which aims to detect differences in how newlines
and non-alphanumeric characters are handled.
To make the results comparable, the "n" flag is used for PostgreSQL
when no flags are used for Javascript/Perl, and no flag is used for
PostgreSQL when the "s" flag is used for Javascript/Perl.
In Javascript, when a regex contains capture groups, the entire match
is always returned as the first array element.
To make it easier to visually compare the results,
the first element is removed from the Javascript result,
which works in this test since all regexes contain
exactly one capture group.
Here are the results:
$ psql -e -f not_alnum.sql regex
SELECT
regexp_match_pg(E'foo!\n!bar', '(.+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(.+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(.+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo!} | {foo!} | {foo!}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(.+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(.+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(.+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{"foo! +| {"foo! +| {"foo! +
!bar"} | !bar"} | !bar"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w\W]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\w\W]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w\W]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"foo! +| {"foo! +
| !bar"} | !bar"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w\W]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\w\W]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w\W]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"foo! +| {"foo! +
| !bar"} | !bar"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\w]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\w]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\w]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\w]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\W]+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '([\W]+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '([\W]+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"! +| {"! +
| !"} | !"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '([\W]+)', ''),
(regexp_match_v8(E'foo!\n!bar', '([\W]+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '([\W]+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
------------------------------------------------------------+-----------------+-----------------
{"invalid regular expression: invalid escape \\ sequence"} | {"! +| {"! +
| !"} | !"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\w+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(\w+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(\w+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\w+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(\w+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(\w+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{foo} | {foo} | {foo}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\W+)', 'n'),
(regexp_match_v8(E'foo!\n!bar', '(\W+)', ''))[2:],
regexp_match_pl(E'foo!\n!bar', '(\W+)', '')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{!} | {"! +| {"! +
| !"} | !"}
(1 row)
SELECT
regexp_match_pg(E'foo!\n!bar', '(\W+)', ''),
(regexp_match_v8(E'foo!\n!bar', '(\W+)', 's'))[2:],
regexp_match_pl(E'foo!\n!bar', '(\W+)', 's')
;
regexp_match_pg | regexp_match_v8 | regexp_match_pl
-----------------+-----------------+-----------------
{"! +| {"! +| {"! +
!"} | !"} | !"}
(1 row)
/Joel