regex cache
Folks,
I'm doing some analysis of PostgreSQL site traffic, and am being frequently
hung up by the compile-time-fixed size of our regex cache (32 regexes, per
MAX_CACHED_RES). Is there a reason why it would be hard to use work_mem
or some other dynamically changeable limit for regex caching?
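For reference, the cache in question is a small fixed-size array in
src/backend/utils/adt/regexp.c; abridged, it looks roughly like this:

#include "regex/regex.h"        /* for regex_t */

/* this is the maximum number of cached regular expressions */
#define MAX_CACHED_RES  32

/* one cached regular expression */
typedef struct cached_re_str
{
    char       *cre_pat;        /* original RE (not null terminated!) */
    int         cre_pat_len;    /* length of original RE, in bytes */
    int         cre_flags;      /* compile flags: extended, icase, etc. */
    regex_t     cre_re;         /* the compiled regular expression */
} cached_re_str;

static int  num_res = 0;                        /* # of cached REs */
static cached_re_str re_array[MAX_CACHED_RES];  /* cached REs */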
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> I'm doing some analysis of PostgreSQL site traffic, and am being frequently
> hung up by the compile-time-fixed size of our regex cache (32 regexes, per
> MAX_CACHED_RES). Is there a reason why it would be hard to use work_mem
> or some other dynamically changeable limit for regex caching?
Hmmm ... Spencer's regex library makes a point of hiding its internal
representation of a compiled regex from the calling code. So measuring
the size of the regex cache in bytes would involve doing a lot of
violence to that API. We could certainly allow the size of the cache
measured in number-of-regexes to be controlled, though.
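For illustration, the public struct in src/include/regex/regex.h is
roughly the following (abridged). Everything that actually consumes
memory hangs off the opaque pointers at the end, which only the
library's innards know how to walk:

#include <stddef.h>             /* size_t */

typedef struct
{
    int         re_magic;       /* magic number */
    size_t      re_nsub;        /* number of parenthesized subexpressions */
    long        re_info;        /* bits of information about the RE */
    /* ... */
    char       *re_guts;        /* opaque: the compiled innards */
    char       *re_fns;         /* opaque: method table */
} regex_t;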
Having said that, I'm not sure it'd help your problem. If your query is
using more than 32 regexes concurrently, it likely is using $BIGNUM
regexes concurrently. How do we fix that?
regards, tom lane
Tom,
> Having said that, I'm not sure it'd help your problem. If your query is
> using more than 32 regexes concurrently, it likely is using $BIGNUM
> regexes concurrently. How do we fix that?
Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
where I'm hitting a wall is at 300 regexes. Some quick testing on my Opteron
test machine right now shows that the execution time difference between 20rx
and 50rx is around 20x.
--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
>> Having said that, I'm not sure it'd help your problem. If your query is
>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>> regexes concurrently. How do we fix that?
> Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
> where I'm hitting a wall is at 300 regexes. Some quick testing on my Opteron
> test machine right now shows that the execution time difference between 20rx
> and 50rx is around 20x.
Hmm. Well, I still don't want to tie it to work_mem; how do you feel
about a new GUC to determine the max number of cached REs?
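Sketching it: the limit becomes a variable consulted at insertion time
rather than the compile-time constant. A minimal toy version; the name
regex_cache_size is purely hypothetical, and a real patch would have to
cope with the value changing while the cache is populated:

#include <stdlib.h>
#include <string.h>

/* Toy slot; the real cached_re_str also holds the compile flags and
 * the compiled regex_t. */
typedef struct
{
    char       *pattern;
} cached_re_str;

int         regex_cache_size = 32;      /* would be settable as a GUC */

static cached_re_str *re_array = NULL;
static int  num_res = 0;

/* Insert at the front, evicting from the tail once the configured
 * limit is reached; the same shape as regexp.c's current logic, minus
 * the actual regex compilation. */
static void
cache_insert(const char *pattern)
{
    if (re_array == NULL)
        re_array = malloc(regex_cache_size * sizeof(cached_re_str));

    while (num_res >= regex_cache_size)
    {
        free(re_array[num_res - 1].pattern);    /* real code pg_regfree()s too */
        num_res--;
    }

    memmove(&re_array[1], &re_array[0], num_res * sizeof(cached_re_str));
    re_array[0].pattern = strdup(pattern);
    num_res++;
}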
regards, tom lane
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>> Having said that, I'm not sure it'd help your problem. If your query is
>>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>>> regexes concurrently. How do we fix that?
>> Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
>> where I'm hitting a wall is at 300 regexes. Some quick testing on my Opteron
>> test machine right now shows that the execution time difference between 20rx
>> and 50rx is around 20x.
> Hmm. Well, I still don't want to tie it to work_mem; how do you feel
> about a new GUC to determine the max number of cached REs?
Yeah. You know me, I was just trying to avoid having more GUCs.
--Josh
Josh Berkus <josh@agliodbs.com> writes:
> Tom Lane wrote:
>> Hmm. Well, I still don't want to tie it to work_mem; how do you feel
>> about a new GUC to determine the max number of cached REs?
> Yeah. You know me, I was just trying to avoid having more GUCs.
I'm not excited about it either, but I think if we're going to make
this adjustable it does need its own knob. I can easily believe
that a large list of precompiled REs could be counterproductive
given a workload where you don't get much reuse, so I don't want
the list size going to the moon just because someone cranked up
work_mem for other purposes.
(I'm not real sure that that "self-organizing list" data structure
would work well beyond 1000 or so entries even if you did have
enough re-use to justify them all. Anyone want to try to do some
performance testing? In particular I think we might want to drop
the move-to-front approach in favor of move-up-one, just to avoid
O(N^2) memmove costs.)
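For concreteness, the two promotion policies look about like this
(toy slot type; the real cached_re_str also carries the compiled RE):

#include <string.h>

typedef struct
{
    char       *pattern;
} cached_re_str;

/* Move-to-front, as the cache does today: a hit at index i shifts
 * entries 0..i-1 down one slot.  With a big cache and poor locality,
 * hits near the tail each cost an O(N) memmove, O(N^2) over a run of
 * them. */
static void
promote_to_front(cached_re_str *arr, int i)
{
    cached_re_str hit = arr[i];

    memmove(&arr[1], &arr[0], i * sizeof(cached_re_str));
    arr[0] = hit;
}

/* Move-up-one (the classic "transpose" heuristic): a hit just swaps
 * with its predecessor.  O(1) per hit, and hot entries still migrate
 * toward the front over repeated use. */
static void
promote_up_one(cached_re_str *arr, int i)
{
    if (i > 0)
    {
        cached_re_str tmp = arr[i - 1];

        arr[i - 1] = arr[i];
        arr[i] = tmp;
    }
}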
regards, tom lane
Tom,
> I'm not excited about it either, but I think if we're going to make
> this adjustable it does need its own knob. I can easily believe
> that a large list of precompiled REs could be counterproductive
> given a workload where you don't get much reuse, so I don't want
> the list size going to the moon just because someone cranked up
> work_mem for other purposes.
Yes. I was just trying to avoid thinking about it. ;-)
> (I'm not real sure that that "self-organizing list" data structure
> would work well beyond 1000 or so entries even if you did have
> enough re-use to justify them all. Anyone want to try to do some
> performance testing? In particular I think we might want to drop
> the move-to-front approach in favor of move-up-one, just to avoid
> O(N^2) memmove costs.)
Hmmm. Yeah, I can see that.
Well, I have a test case here (the PostgreSQL download logs), or I
wouldn't have brought up the issue. I just need to find a way to
multi-thread it so I can get the effect of multiple clients.
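One way to fake it without real threads might be to fork N children
that each open their own libpq connection and hammer the query. A
rough sketch, with the connection string and query as placeholders
(pgbench with a custom script file would be the more standard route):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include <libpq-fe.h>

#define NCLIENTS 8              /* concurrent sessions to simulate */

/* Placeholder: the real test would run the regex-heavy query against
 * the download-log table. */
static const char *test_sql =
    "SELECT count(*) FROM logs WHERE path ~ 'some-pattern'";

int
main(void)
{
    for (int i = 0; i < NCLIENTS; i++)
    {
        if (fork() == 0)
        {
            /* child: open a connection of its own and run the query */
            PGconn     *conn = PQconnectdb("dbname=logtest");
            PGresult   *res;

            if (PQstatus(conn) != CONNECTION_OK)
            {
                fprintf(stderr, "%s", PQerrorMessage(conn));
                exit(1);
            }
            res = PQexec(conn, test_sql);
            if (PQresultStatus(res) != PGRES_TUPLES_OK)
                fprintf(stderr, "%s", PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);
            exit(0);
        }
    }
    while (wait(NULL) > 0)      /* reap all the children */
        ;
    return 0;
}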
--Josh