regex cache

Started by Josh Berkus · almost 18 years ago · 7 messages · hackers
#1 Josh Berkus
josh@agliodbs.com

Folks,

I'm doing some analysis of PostgreSQL site traffic, and am being frequently
hung up by the compile-time-fixed size of our regex cache (32 regexes, per
MAX_CACHED_RES). Is there a reason why it would be hard to use work_mem
or some other dynamically changeable limit for regex caching?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#1)
Re: regex cache

Josh Berkus <josh@agliodbs.com> writes:

> I'm doing some analysis of PostgreSQL site traffic, and am being frequently
> hung up by the compile-time-fixed size of our regex cache (32 regexes, per
> MAX_CACHED_RES). Is there a reason why it would be hard to use work_mem
> or some other dynamically changeable limit for regex caching?

Hmmm ... Spencer's regex library makes a point of hiding its internal
representation of a compiled regex from the calling code. So measuring
the size of the regex cache in bytes would involve doing a lot of
violence to that API. We could certainly allow the size of the cache
measured in number-of-regexes to be controlled, though.

Having said that, I'm not sure it'd help your problem. If your query is
using more than 32 regexes concurrently, it likely is using $BIGNUM
regexes concurrently. How do we fix that?

regards, tom lane

#3 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#2)
Re: regex cache

Tom,

> Having said that, I'm not sure it'd help your problem. If your query is
> using more than 32 regexes concurrently, it likely is using $BIGNUM
> regexes concurrently. How do we fix that?

Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
where I'm hitting a wall is 300 regexes. Some quick testing on my Opteron
test machine right now shows that the execution time difference between 20rx
and 50rx is around 20x.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#3)
Re: regex cache

Josh Berkus <josh@agliodbs.com> writes:

>> Having said that, I'm not sure it'd help your problem. If your query is
>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>> regexes concurrently. How do we fix that?

> Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
> where I'm hitting a wall is 300 regexes. Some quick testing on my Opteron
> test machine right now shows that the execution time difference between 20rx
> and 50rx is around 20x.

Hmm. Well, I still don't want to tie it to work_mem; how do you feel
about a new GUC to determine the max number of cached REs?

regards, tom lane

#5 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#4)
Re: regex cache

Tom Lane wrote:

> Josh Berkus <josh@agliodbs.com> writes:

>>> Having said that, I'm not sure it'd help your problem. If your query is
>>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>>> regexes concurrently. How do we fix that?

>> Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
>> where I'm hitting a wall is 300 regexes. Some quick testing on my Opteron
>> test machine right now shows that the execution time difference between 20rx
>> and 50rx is around 20x.

> Hmm. Well, I still don't want to tie it to work_mem; how do you feel
> about a new GUC to determine the max number of cached REs?

Yeah. You know me, I was just trying to avoid having more GUCs.

--Josh

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#5)
Re: regex cache

Josh Berkus <josh@agliodbs.com> writes:

> Tom Lane wrote:

>> Hmm. Well, I still don't want to tie it to work_mem; how do you feel
>> about a new GUC to determine the max number of cached REs?

> Yeah. You know me, I was just trying to avoid having more GUCs.

I'm not excited about it either, but I think if we're going to make
this adjustable it does need its own knob. I can easily believe
that a large list of precompiled REs could be counterproductive
given a workload where you don't get much reuse, so I don't want
the list size going to the moon just because someone cranked up
work_mem for other purposes.

(I'm not real sure that that "self-organizing list" data structure
would work well beyond 1000 or so entries even if you did have
enough re-use to justify them all. Anyone want to try to do some
performance testing? In particular I think we might want to drop
the move-to-front approach in favor of move-up-one, just to avoid
O(N^2) memmove costs.)

regards, tom lane

#7 Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#6)
Re: regex cache

Tom,

> I'm not excited about it either, but I think if we're going to make
> this adjustable it does need its own knob. I can easily believe
> that a large list of precompiled REs could be counterproductive
> given a workload where you don't get much reuse, so I don't want
> the list size going to the moon just because someone cranked up
> work_mem for other purposes.

Yes. I was just trying to avoid thinking about it. ;-)

> (I'm not real sure that that "self-organizing list" data structure
> would work well beyond 1000 or so entries even if you did have
> enough re-use to justify them all. Anyone want to try to do some
> performance testing? In particular I think we might want to drop
> the move-to-front approach in favor of move-up-one, just to avoid
> O(N^2) memmove costs.)

Hmmm. Yeah, I can see that.

Well, I have a test case here (the PostgreSQL download logs), or I
wouldn't have brought up the issue. I just need to find a way to
multi-thread it so I can get the effect of multiple clients.

--Josh