Regex with > 32k different chars causes a backend crash

Started by Heikki Linnakangasalmost 13 years ago8 messages
#1Heikki Linnakangas
hlinnakangas@vmware.com

While playing with Alexander's pg_trgm regexp patch, I noticed that the
regexp library trips an assertion (if enabled) or crashes, when passed
an input string that contains more than 32k different characters:

select 'foo' ~ (select string_agg(chr(x),'') from generate_series(100,
35000) x) as nastyregex;

This is because it uses 'short' as the datatype to identify colors. When
it overflows, -32768 is used as index to the colordesc array, and you
get a crash. AFAICS this can't reliably be used for anything more
sinister than crashing the backend.

A regex with that many different colors is an extreme case, so I think
it's enough to turn the assertion in newcolor() into a run-time check,
and throw a "too many colors in regexp" error. Alternatively, we could
expand 'color' from short to int, but that would double the memory usage
of sane regexps with less different characters.

Thoughts?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Regex with > 32k different chars causes a backend crash

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

A regex with that many different colors is an extreme case, so I think
it's enough to turn the assertion in newcolor() into a run-time check,
and throw a "too many colors in regexp" error. Alternatively, we could
expand 'color' from short to int, but that would double the memory usage
of sane regexps with less different characters.

Obviously Henry didn't think that far ahead. I agree that throwing
an error is the best solution, and that widening "color" is probably
not what we want to do. You want to fix that, or shall I?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#2)
Re: Regex with > 32k different chars causes a backend crash

On 03.04.2013 18:21, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

A regex with that many different colors is an extreme case, so I think
it's enough to turn the assertion in newcolor() into a run-time check,
and throw a "too many colors in regexp" error. Alternatively, we could
expand 'color' from short to int, but that would double the memory usage
of sane regexps with less different characters.

Obviously Henry didn't think that far ahead. I agree that throwing
an error is the best solution, and that widening "color" is probably
not what we want to do. You want to fix that, or shall I?

I can do it. I assume that Tcl has the same bug, so I'll submit a report
there, too.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: Regex with > 32k different chars causes a backend crash

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 03.04.2013 18:21, Tom Lane wrote:

Obviously Henry didn't think that far ahead. I agree that throwing
an error is the best solution, and that widening "color" is probably
not what we want to do. You want to fix that, or shall I?

I can do it. I assume that Tcl has the same bug, so I'll submit a report
there, too.

Yes, definitely.

It occurs to me that at some point it might be useful to convert "color"
to unsigned short, so that you could have 64K of them --- but we'd still
need the error check anyway, and there's no reason to tackle such a
change today. (This possible change is, however, another reason to not
want pg_trgm looking directly at the internals of the data structure...)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#4)
1 attachment(s)
Re: Regex with > 32k different chars causes a backend crash

On 03.04.2013 18:41, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

On 03.04.2013 18:21, Tom Lane wrote:

Obviously Henry didn't think that far ahead. I agree that throwing
an error is the best solution, and that widening "color" is probably
not what we want to do. You want to fix that, or shall I?

I can do it. I assume that Tcl has the same bug, so I'll submit a report
there, too.

Yes, definitely.

It occurs to me that at some point it might be useful to convert "color"
to unsigned short, so that you could have 64K of them ...--- but we'd still
need the error check anyway, and there's no reason to tackle such a
change today.

I was just thinking the same. In practice, expanding it to 64k doesn't
get you much farther. There is this in newdfa():

d->incarea = (struct arcp *) MALLOC(nss * cnfa->ncolors *
sizeof(struct arcp));

That's (number of states) * (number of colors) * (constant). The test
case I posted earlier would require about 40 GB of RAM for that
allocation alone, and fails with an "out of memory" error. Maybe it
would be possible to construct a regexp that has a lot of colors but few
states, but that's an even more marginal use case.

Attached is a patch to add the overflow check. I used the error message
"too many distinct characters in regex". That's not totally accurate,
because there isn't a limit on distinct characters per se, but on the
number of colors. Conceivably, you could have a regexp with more than
32k different characters, but where most of them are mapped to the same
color. In practice, it's not helpful to the user to say "too many
colors"; he will have no clue what a color is.

PS. I was mistaken when I said that this causes an assertion failure; it
segfaults even with assertions enabled.

- Heikki

Attachments:

fix-regexp-color-overflow-1.patchtext/x-diff; name=fix-regexp-color-overflow-1.patchDownload
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index 1c60566..e6aa899 100644
--- a/src/backend/regex/regc_color.c
+++ b/src/backend/regex/regc_color.c
@@ -247,7 +247,15 @@ newcolor(struct colormap * cm)
 		/* oops, must allocate more */
 		struct colordesc *newCd;
 
+		if (cm->max == MAX_COLOR)
+		{
+			CERR(REG_ECOLORS);
+			return COLORLESS;	/* too many colors */
+		}
+
 		n = cm->ncds * 2;
+		if (n > MAX_COLOR + 1)
+			n = MAX_COLOR + 1;
 		if (cm->cd == cm->cdspace)
 		{
 			newCd = (struct colordesc *) MALLOC(n * sizeof(struct colordesc));
diff --git a/src/include/regex/regerrs.h b/src/include/regex/regerrs.h
index a761371..6516d7e 100644
--- a/src/include/regex/regerrs.h
+++ b/src/include/regex/regerrs.h
@@ -77,3 +77,7 @@
 {
 	REG_ETOOBIG, "REG_ETOOBIG", "nfa has too many states"
 },
+
+{
+	REG_ECOLORS, "REG_ECOLORS", "too many distinct characters in regex"
+},
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index 616c2c6..a7fb14f 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -153,6 +153,7 @@ typedef struct
 #define REG_MIXED	17			/* character widths of regex and string differ */
 #define REG_BADOPT	18			/* invalid embedded option */
 #define REG_ETOOBIG 19			/* nfa has too many states */
+#define REG_ECOLORS 20			/* too many distinct characters in regex */
 /* two specials for debugging and testing */
 #define REG_ATOI	101			/* convert error-code name to number */
 #define REG_ITOA	102			/* convert error-code number to name */
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index e1e406f..fc7f5ae 100644
--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -148,6 +148,7 @@
 typedef short color;			/* colors of characters */
 typedef int pcolor;				/* what color promotes to */
 
+#define MAX_COLOR	32767		/* max value that fits in 'color' datatype */
 #define COLORLESS	(-1)		/* impossible color */
 #define WHITE		0			/* default color, parent of all others */
 
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: Regex with > 32k different chars causes a backend crash

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Attached is a patch to add the overflow check. I used the error message
"too many distinct characters in regex". That's not totally accurate,
because there isn't a limit on distinct characters per se, but on the
number of colors. Conceivably, you could have a regexp with more than
32k different characters, but where most of them are mapped to the same
color. In practice, it's not helpful to the user to say "too many
colors"; he will have no clue what a color is.

Patch looks good except perhaps for wordsmithing the message text.

One thought is that we don't need to identify this as a regex error
because the PG code will report it with "invalid regular expression: %s".

I think there's a good argument for saying "too many character colors"
and relying on the "invalid regular expression" context to clue in the
clueless. After all, most of them don't know what an NFA is either, but
no one has complained about the REG_ETOOBIG message. I think if you get
to the point where you're triggering this error, you probably know
something about regexes, or even if you don't the phrase "too many" will
give you a fair idea what's wrong. There is much to be said for
specifically identifying the implementation limit that's been hit, even
if most users don't know what it is. So I'd just as soon not fall back
on something imprecise.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#5)
Re: Regex with > 32k different chars causes a backend crash

On Wed, Apr 03, 2013 at 08:09:15PM +0300, Heikki Linnakangas wrote:

--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -148,6 +148,7 @@
typedef short color;			/* colors of characters */
typedef int pcolor;				/* what color promotes to */

+#define MAX_COLOR 32767 /* max value that fits in 'color' datatype */

This should use SHRT_MAX, no? (Not that any supported platform differs here.)

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Noah Misch (#7)
Re: Regex with > 32k different chars causes a backend crash

On 04.04.2013 03:32, Noah Misch wrote:

On Wed, Apr 03, 2013 at 08:09:15PM +0300, Heikki Linnakangas wrote:

--- a/src/include/regex/regguts.h
+++ b/src/include/regex/regguts.h
@@ -148,6 +148,7 @@
typedef short color;			/* colors of characters */
typedef int pcolor;				/* what color promotes to */

+#define MAX_COLOR 32767 /* max value that fits in 'color' datatype */

This should use SHRT_MAX, no? (Not that any supported platform differs here.)

I considered that, but I got all confused on whether limits.h needs to
be included and where, if we use that. So I just used a constant 32767
in the end. Committed that way.

I opened a ticket in TCL bug tracker for this:
https://sourceforge.net/tracker/?func=detail&amp;aid=3610026&amp;group_id=10894&amp;atid=110894.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers