Undocumented(?) limits on regexp functions
All the regexp functions blow up with "invalid memory alloc request"
errors when the input string exceeds 256MB in length. This restriction
does not seem to be documented anywhere that I could see.
(Also for regexp_split* and regexp_matches, there's a limit of 64M total
matches, which also doesn't seem to be documented anywhere).
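For illustration (not necessarily the exact statement I tried, and the
figure is just comfortably over the limit):

  -- a string a little over 256M characters long
  select regexp_replace(repeat('x', 257 * 1024 * 1024), 'x', 'y');
  -- ERROR:  invalid memory alloc request size ...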
Should these limits:
a) be removed
b) be documented
c) have better error messages?
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
All the regexp functions blow up with "invalid memory alloc request"
errors when the input string exceeds 256MB in length. This restriction
does not seem to be documented anywhere that I could see.
(Also for regexp_split* and regexp_matches, there's a limit of 64M total
matches, which also doesn't seem to be documented anywhere).
Should these limits:
a) be removed
Doubt it --- we could use the "huge" request variants, maybe, but
I wonder whether the engine could run fast enough that you'd want to.
c) have better error messages?
+1 for that, though.
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Should these limits:
a) be removed
Tom> Doubt it --- we could use the "huge" request variants, maybe, but
Tom> I wonder whether the engine could run fast enough that you'd want
Tom> to.
I do wonder (albeit without evidence) whether the quadratic slowdown
problem I posted a patch for earlier was ignored for so long because
people just went "meh, regexps are slow" rather than wondering why a
trivial splitting of a 40kbyte string was taking more than a second.
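(For concreteness, and not necessarily the exact test case from that patch
thread, I mean something of roughly this shape, a small string split into
many pieces:

  -- a ~40kB string split on single spaces
  select count(*) from regexp_split_to_table(repeat('foo bar ', 5000), ' ');

where the exact input doesn't matter much.)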
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Doubt it --- we could use the "huge" request variants, maybe, but
Tom> I wonder whether the engine could run fast enough that you'd want
Tom> to.
I do wonder (albeit without evidence) whether the quadratic slowdown
problem I posted a patch for earlier was ignored for so long because
people just went "meh, regexps are slow" rather than wondering why a
trivial splitting of a 40kbyte string was taking more than a second.
I have done performance measurements on the regex stuff in the past,
and not noticed any huge penalty in regexp.c. I was planning to try
to figure out what test case you were using that was different from
what I'd looked at, but not got round to it yet.
In the light of morning I'm reconsidering my initial thought of
not wanting to use MemoryContextAllocHuge. My reaction was based
on thinking that that would allow people to reach indefinitely
large regexp inputs, but really that's not so; the maximum input
length will be a 1GB text object, hence at most 1G characters.
regexp.c needs to expand that into 4-bytes-each "chr" characters,
so it could be at most 4GB of data. The fact that inputs between
256M and 1G characters fail could be seen as an implementation
rough edge that we ought to sand down, at least on 64-bit platforms.
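Concretely (figures illustrative only), something like

  -- ~300M characters is a perfectly legal text value, well under 1GB,
  -- but expanding it to 4-byte chrs needs more than the 1GB palloc limit,
  -- so today it fails with "invalid memory alloc request size" rather
  -- than merely being slow
  select regexp_matches(repeat('x', 300 * 1024 * 1024), 'does not match');

could be allowed to run on 64-bit builds if we used the huge allocation
variants for that expansion.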
regards, tom lane
Moin Andrew,
On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Should these limits:
a) be removed
Tom> Doubt it --- we could use the "huge" request variants, maybe, but
Tom> I wonder whether the engine could run fast enough that you'd want
Tom> to.
I do wonder (albeit without evidence) whether the quadratic slowdown
problem I posted a patch for earlier was ignored for so long because
people just went "meh, regexps are slow" rather than wondering why a
trivial splitting of a 40kbyte string was taking more than a second.
Pretty much this. :)
First of all, thank you for working in this area; this is very welcome.
We do use UTF-8, and we did notice that regexps are not exactly the fastest
around, though we have not (yet) run into the memory limit. That is mostly
because the regexp_match* functions appear only in places where performance
is not critical and the input/output is small (although, now that I mention
it, the quadratic behaviour might explain a few slowdowns in other cases
that I need to investigate).
Anyway, in a few places we have functions that use a lot of regexps (more
than a dozen), which are also moderately complex (e.g. spanning multiple
lines). In these cases the performance was not really up to par, so I
experimented and in the end rewrote the functions in plperl. That fixed
the performance, so we no longer had this issue.
All the best,
Tels
On Aug 14, 2018, at 10:01 AM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
Moin Andrew,
On Tue, August 14, 2018 9:16 am, Andrew Gierth wrote:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Should these limits:
a) be removed
Tom> Doubt it --- we could use the "huge" request variants, maybe, but
Tom> I wonder whether the engine could run fast enough that you'd want
Tom> to.
I do wonder (albeit without evidence) whether the quadratic slowdown
problem I posted a patch for earlier was ignored for so long because
people just went "meh, regexps are slow" rather than wondering why a
trivial splitting of a 40kbyte string was taking more than a second.
Pretty much this. :)
First of all, thank you for working in this area; this is very welcome.
We do use UTF-8, and we did notice that regexps are not exactly the fastest
around, though we have not (yet) run into the memory limit. That is mostly
because the regexp_match* functions appear only in places where performance
is not critical and the input/output is small (although, now that I mention
it, the quadratic behaviour might explain a few slowdowns in other cases
that I need to investigate).
Anyway, in a few places we have functions that use a lot of regexps (more
than a dozen), which are also moderately complex (e.g. spanning multiple
lines). In these cases the performance was not really up to par, so I
experimented and in the end rewrote the functions in plperl. That fixed
the performance, so we no longer had this issue.
+1. I have done something similar, though in C rather than plperl.
As for the length limit, I have only hit that in stress testing, not in
practice.
mark