Hashjoin status report

Started by Tom Lane over 26 years ago - 13 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

I've committed fixes that deal with all of the coredump problems
I could find in nodeHash.c (there were several :-().

But the code still has a fundamental design flaw: it uses a fixed-size
overflow area to hold tuples that don't fit into the hashbuckets they
are assigned to. This means you get "hashtable out of memory" errors
if the distribution of tuples is skewed enough, or if the number of
hashbuckets is too small because the system underestimated the number
of tuples in the relation. Better than a coredump I suppose, but still
very bad, especially since the optimizer likes to use hashjoins more
than it used to.
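
To make the failure mode concrete, here is a deliberately simplified sketch
(the names DemoHashTable and demo_insert are made up, the sizes are arbitrary,
and this is not the actual nodeHash.c code): tuples that don't fit in their
assigned bucket spill into a fixed-size overflow array, and once that array is
full there is nothing left to do but raise the error.

#include <stdio.h>

#define NBUCKETS        1024   /* sized from the planner's row estimate */
#define BUCKET_CAPACITY 10     /* tuples that fit directly in a bucket  */
#define OVERFLOW_SLOTS  4096   /* fixed overflow area, never grows      */

typedef struct DemoHashTable
{
    void *buckets[NBUCKETS][BUCKET_CAPACITY];
    int   bucket_used[NBUCKETS];
    void *overflow[OVERFLOW_SLOTS];
    int   overflow_used;
} DemoHashTable;

int
demo_insert(DemoHashTable *ht, void *tuple, unsigned hashval)
{
    unsigned b = hashval % NBUCKETS;

    if (ht->bucket_used[b] < BUCKET_CAPACITY)
        ht->buckets[b][ht->bucket_used[b]++] = tuple;
    else if (ht->overflow_used < OVERFLOW_SLOTS)
        ht->overflow[ht->overflow_used++] = tuple;
    else
    {
        /* skewed data or a low row estimate exhausts the fixed area */
        fprintf(stderr, "hash table out of memory\n");
        return -1;
    }
    return 0;
}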

What I would like to do to fix this is to store the tuples in a Portal
instead of in a fixed-size palloc block. While at it, I'd rip out the
"relative vs. absolute address" cruft that is in the code now.
(Apparently there was once some thought of using a shared memory block
so that multiple processes could share the work of a hashjoin. All that
remains is ugly, bug-prone code ...)

The reason I bring all this up is that it'd be a nontrivial revision
to nodeHash.c, and I'm uncomfortable with the notion of making such a
change this late in a beta cycle. On the other hand it *is* a bug fix,
and a fairly important one IMHO.

Opinions? Should I plow ahead, or leave this to fix after 6.5 release?

regards, tom lane

#2 The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#1)
Re: [HACKERS] Hashjoin status report

On Thu, 6 May 1999, Tom Lane wrote:

I've committed fixes that deal with all of the coredump problems
I could find in nodeHash.c (there were several :-().

But the code still has a fundamental design flaw: it uses a fixed-size
overflow area to hold tuples that don't fit into the hashbuckets they
are assigned to. This means you get "hashtable out of memory" errors
if the distribution of tuples is skewed enough, or if the number of
hashbuckets is too small because the system underestimated the number
of tuples in the relation. Better than a coredump I suppose, but still
very bad, especially since the optimizer likes to use hashjoins more
than it used to.

What I would like to do to fix this is to store the tuples in a Portal
instead of in a fixed-size palloc block. While at it, I'd rip out the
"relative vs. absolute address" cruft that is in the code now.
(Apparently there was once some thought of using a shared memory block
so that multiple processes could share the work of a hashjoin. All that
remains is ugly, bug-prone code ...)

The reason I bring all this up is that it'd be a nontrivial revision
to nodeHash.c, and I'm uncomfortable with the notion of making such a
change this late in a beta cycle. On the other hand it *is* a bug fix,
and a fairly important one IMHO.

Opinions? Should I plow ahead, or leave this to fix after 6.5 release?

Estimate of time involved to fix this? vs likelihood of someone
triggering the bug in production?

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#2)
Re: [HACKERS] Hashjoin status report

The Hermit Hacker <scrappy@hub.org> writes:

Opinions? Should I plow ahead, or leave this to fix after 6.5 release?

Estimate of time involved to fix this? vs likelihood of someone
triggering the bug in production?

I could probably get the coding done this weekend, unless something else
comes up to distract me. It's the question of how much testing it'd
receive before release that worries me...

As for the likelihood, that's hard to say. It's very easy to trigger
the bug as a test case. (Arrange for a hashjoin where the inner table
has a lot of identical rows, or at least many sets of more-than-10-
rows-with-the-same-value-in-the-field-being-hashed-on.) In real life
you'd like to think that that's pretty improbable.

What started this go-round was Contzen's report of seeing the
"hash table out of memory. Use -B parameter to increase buffers"
message in what was evidently a real-life scenario. So it can happen.
Do you recall having seen many complaints about that error before?

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: [HACKERS] Hashjoin status report

I said:

What started this go-round was Contzen's report of seeing the
"hash table out of memory. Use -B parameter to increase buffers"
message in what was evidently a real-life scenario. So it can happen.
Do you recall having seen many complaints about that error before?

A little bit of poking through the mailing list archives found four
other complaints in the past six months (scattered through the sql,
admin, and novices lists). There might have been more that my search
query missed.

What do you think, does that reach your threshold of pain or not?

regards, tom lane

#5 The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#3)
Re: [HACKERS] Hashjoin status report

On Thu, 6 May 1999, Tom Lane wrote:

The Hermit Hacker <scrappy@hub.org> writes:

Opinions? Should I plow ahead, or leave this to fix after 6.5 release?

Estimate of time involved to fix this? vs likelihood of someone
triggering the bug in production?

I could probably get the coding done this weekend, unless something else
comes up to distract me. It's the question of how much testing it'd
receive before release that worries me...

We're looking at 3 weeks till release...I'll let you call it on this one.
If you feel confident about getting the bug fixed before release, with
enough time for testing, go for it. If it makes more sense to make it an
'untested patch', go for that one.

I believe you can fix this bug; I'm just kinda nervous about adding a
different one (which I believe is your worry also), given the limited
amount of testing possible :(

My preference would be a "use at own risk" patch...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#6 Vadim Mikheev
vadim@krs.ru
In reply to: Tom Lane (#1)
Re: [HACKERS] Hashjoin status report

Tom Lane wrote:

I've committed fixes that deal with all of the coredump problems
I could find in nodeHash.c (there were several :-().

But the code still has a fundamental design flaw: it uses a fixed-size
overflow area to hold tuples that don't fit into the hashbuckets they
are assigned to. This means you get "hashtable out of memory" errors
if the distribution of tuples is skewed enough, or if the number of
hashbuckets is too small because the system underestimated the number
of tuples in the relation. Better than a coredump I suppose, but still
very bad, especially since the optimizer likes to use hashjoins more
than it used to.

What I would like to do to fix this is to store the tuples in a Portal
instead of in a fixed-size palloc block. While at it, I'd rip out the
"relative vs. absolute address" cruft that is in the code now.
(Apparently there was once some thought of using a shared memory block
so that multiple processes could share the work of a hashjoin. All that
remains is ugly, bug-prone code ...)

Fix it! Testing is easy...
Though, I would use a chain of big blocks for the overflow area,
not a Portal - a Portal is too big a thing to be used directly in a join method.
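
Something like this, roughly (just a sketch with made-up names - TupleBlock,
block_chain_store - not backend code; in the backend the allocation would go
through palloc in a per-join memory context rather than malloc):

#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (64 * 1024)

typedef struct TupleBlock
{
    struct TupleBlock *next;
    size_t             used;
    char               data[BLOCK_SIZE];
} TupleBlock;

/* Copy a tuple into the chain, adding a new block when the current
 * one is full, so the overflow area grows as needed instead of
 * hitting a fixed limit. */
char *
block_chain_store(TupleBlock **chain, const void *tuple, size_t len)
{
    TupleBlock *blk = *chain;

    if (len > BLOCK_SIZE)
        return NULL;            /* oversized tuples need special handling */

    if (blk == NULL || blk->used + len > BLOCK_SIZE)
    {
        TupleBlock *newblk = malloc(sizeof(TupleBlock));

        if (newblk == NULL)
            return NULL;
        newblk->next = blk;
        newblk->used = 0;
        *chain = newblk;
        blk = newblk;
    }

    memcpy(blk->data + blk->used, tuple, len);
    blk->used += len;
    return blk->data + blk->used - len;
}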

Vadim

#7 The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#4)
Re: [HACKERS] Hashjoin status report

On Thu, 6 May 1999, Tom Lane wrote:

I said:

What started this go-round was Contzen's report of seeing the
"hash table out of memory. Use -B parameter to increase buffers"
message in what was evidently a real-life scenario. So it can happen.
Do you recall having seen many complaints about that error before?

A little bit of poking through the mailing list archives found four
other complaints in the past six months (scattered through the sql,
admin, and novices lists). There might have been more that my search
query missed.

What do you think, does that reach your threshold of pain or not?

Considering the number of instances we have out there, it sounds like a
very rare 'tweak' to the bug...I'd say keep it as an 'untested patch' for
the release, for those that wish to try it if they have a problem, and
include it for v6.6 and v6.5.1 ...

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#8 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: The Hermit Hacker (#7)
Re: [HACKERS] Hashjoin status report

The Hermit Hacker <scrappy@hub.org> writes:

Opinions? Should I plow ahead, or leave this to fix after 6.5 release?

Estimate of time involved to fix this? vs likelihood of someone
triggering the bug in production?

I could probably get the coding done this weekend, unless something else
comes up to distract me. It's the question of how much testing it'd
receive before release that worries me...

As for the likelihood, that's hard to say. It's very easy to trigger
the bug as a test case. (Arrange for a hashjoin where the inner table
has a lot of identical rows, or at least many sets of more-than-10-
rows-with-the-same-value-in-the-field-being-hashed-on.) In real life
you'd like to think that that's pretty improbable.

What started this go-round was Contzen's report of seeing the
"hash table out of memory. Use -B parameter to increase buffers"
message in what was evidently a real-life scenario. So it can happen.
Do you recall having seen many complaints about that error before?

We already have a good example of this "hash table out of memory. Use
-B parameter to increase buffers" syndrome in our source tree. Go to
src/test/bench, remove "-B 256" from the last line of runwisc.sh, then
run the test. The "-B 256" wasn't there originally; I added it while
fixing the test suite and elog() (see the included posting). I don't
see the error message in 6.4.2, so I guess this is due to the optimizer
changes.

IMHO, we should either fix this before 6.5 is out or change the default
-B setting to 256 or so, though the latter may cause a shortage of
shared memory.

P.S. At that time I misunderstood the problem, thinking that I simply
didn't have enough sort memory :-<


Message-Id: <199904160654.PAA00221@srapc451.sra.co.jp>
From: Tatsuo Ishii <t-ishii@sra.co.jp>
To: hackers@postgreSQL.org
Subject: [HACKERS] elog() and wisconsin bench test fix
Date: Fri, 16 Apr 1999 15:54:16 +0900

I have modified elog() so that it uses its own pid (obtained with getpid())
as the first parameter for kill() in some cases. It used to get its own
pid from the MyProcPid global variable. This was fine until I ran the
Wisconsin benchmark test suite (test/bench/). In the test, postgres is
run as a command and MyProcPid is set to 0. As a result, elog() called
kill() with the first parameter set to 0, and SIGQUIT was issued to the
process group, not just the postgres process itself! This was why /bin/sh
dumped core whenever I ran the bench test.
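
The difference is easy to see in isolation (a standalone sketch, not the
actual elog() code; the function names are made up): kill() with pid 0
signals every process in the caller's process group, including the /bin/sh
running the test script, while kill() with getpid() hits only the calling
process.

#include <signal.h>
#include <unistd.h>

void
quit_self_only(void)
{
    kill(getpid(), SIGQUIT);    /* signals only the calling process */
}

void
quit_whole_group(void)
{
    kill(0, SIGQUIT);           /* signals the whole process group,
                                 * which is what happened when
                                 * MyProcPid was 0 */
}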

Also, I fixed several bugs in the test queries.

One thing that still remains is that some queries fail due to
insufficient sort memory. I modified the test script to add the -B
option. But is this normal? I think not. I thought postgres should use
disk files instead of memory when there isn't enough sort buffer space.

Comments?
--
Tatsuo Ishii

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#8)
Re: [HACKERS] Hashjoin status report

The Hermit Hacker <scrappy@hub.org> writes:

On Thu, 6 May 1999, Tom Lane wrote:

What do you think, does that reach your threshold of pain or not?

Considering the number of instances we have out there, it sounds like a
very rare 'tweak' to the bug...I'd say keep it as an 'untested patch' for
the release, for those that wish to try it if they have a problem, and
include it for v6.6 and v6.5.1 ...

OK, we'll deal with it as a post-release patch then. It's not like
I haven't got anything else to do before the end of the month :-(

regards, tom lane

#10 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#3)
Re: [HACKERS] Hashjoin status report

The Hermit Hacker <scrappy@hub.org> writes:

Opinions? Should I plow ahead, or leave this to fix after 6.5 release?

Estimate of time involved to fix this? vs likelihood of someone
triggering the bug in production?

I could probably get the coding done this weekend, unless something else
comes up to distract me. It's the question of how much testing it'd
receive before release that worries me...

As for the likelihood, that's hard to say. It's very easy to trigger
the bug as a test case. (Arrange for a hashjoin where the inner table
has a lot of identical rows, or at least many sets of more-than-10-
rows-with-the-same-value-in-the-field-being-hashed-on.) In real life
you'd like to think that that's pretty improbable.

What started this go-round was Contzen's report of seeing the
"hash table out of memory. Use -B parameter to increase buffers"
message in what was evidently a real-life scenario. So it can happen.
Do you recall having seen many complaints about that error before?

The new optimizer does more hashjoins, so we will see it more often.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#11 Michael Contzen
mcontzen@dohle.com
In reply to: Bruce Momjian (#10)
Re: [HACKERS] Hashjoin status report

Hello,

(using the snapshot of May 5th)

Because we need a workaround for this hash problem, I looked into the hashing code (well, without having the background).

To reduce the probability of an overflow I increased
#define FUDGE_FAC 3
which was originally 1.5. I think it's a data-dependent constant (correct?), and for my data it works...

It does the job, but certainly this is not the solution.

Increasing -B to 256 doesn't work:
NOTICE: Buffer Leak: [248] (freeNext=0, freePrev=0, relname=, blockNum=0, flags=0x0, refcount=0 25453)
pq_flush: send() failed, errno 88
pq_recvbuf: recv() failed, errno=88

Kind regards,

Michael Contzen
mcontzen@dohle.com
Dohle Systemberatung, Germany

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Contzen (#11)
Re: [HACKERS] Hashjoin status report

Michael Contzen <mcontzen@dohle.com> writes:

(using the snapshot of May 5th)
Because we need a workaround for this hash problem, I
looked into the hashing code (well, without having the background).
To reduce the probability of an overflow I increased
#define FUDGE_FAC 3
which was originally 1.5.

For a given -B setting, that would mean that more of the hashtable space
is reserved for overflow records and less for hashbuckets, which should
reduce the probability of an overrun --- but it would also make the
system more prone to decide that it needs to divide the hash merge into
"batches", so performance will suffer. Still, it seems like a
reasonable workaround until a proper fix can be made. In fact I think
maybe I should change FUDGE_FAC to 2.0 for the 6.5 release, as a stopgap
measure...
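
Purely to illustrate the trade-off with made-up arithmetic (demo_sizing is
hypothetical and the real sizing logic in nodeHash.c is more involved): the
space fixed by -B gets split between hashbuckets and overflow, so a larger
fudge factor both shrinks the bucket share and makes it more likely that the
join has to be split into batches.

#include <stdio.h>

#define FUDGE_FAC 1.5           /* 3.0 in Michael's workaround */

void
demo_sizing(double total_bytes, double est_tuples, double tuple_width)
{
    /* hypothetical split of the fixed hashtable space */
    double bucket_bytes   = total_bytes / FUDGE_FAC;
    double overflow_bytes = total_bytes - bucket_bytes;
    int    need_batches   = est_tuples * tuple_width * FUDGE_FAC > total_bytes;

    printf("buckets: %.0f bytes, overflow: %.0f bytes, batches: %s\n",
           bucket_bytes, overflow_bytes, need_batches ? "yes" : "no");
}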

A more critical problem is that there were some severe bugs in the code
for handling batches. I fixed at least some of 'em, but I committed
those fixes on the evening of 5 May, so I suspect they are not in your
snapshot. (Check the date of src/backend/executor/nodeHash.c to see.)

Increasing -B to 256 doesn't work:
NOTICE: Buffer Leak: [248] (freeNext=0, freePrev=0, relname=, blockNum=0, flags=0x0, refcount=0 25453)
pq_flush: send() failed, errno 88

This behavior could be an artifact of one of the bugs I fixed (which
was a large-scale memory clobber). Or it could be another bug entirely.
This one actually worries me a great deal more than the "out of memory"
problem, because that one I know how and where to fix. If this is a
separate bug then I don't know where it's coming from. Please upgrade
to latest snapshot and check -B 256 again.

regards, tom lane

#13 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#12)
Re: [HACKERS] Hashjoin status report

Michael Contzen <mcontzen@dohle.com> writes:

(using the snapshot of May 5th)
Because we need a workaround for this hash problem, I
looked into the hashing code (well, without having the background).
To reduce the probability of an overflow I increased
#define FUDGE_FAC 3
which was originally 1.5.

For a given -B setting, that would mean that more of the hashtable space
is reserved for overflow records and less for hashbuckets, which should
reduce the probability of an overrun --- but it would also make the
system more prone to decide that it needs to divide the hash merge into
"batches", so performance will suffer. Still, it seems like a
reasonable workaround until a proper fix can be made. In fact I think
maybe I should change FUDGE_FAC to 2.0 for the 6.5 release, as a stopgap
measure...

A more critical problem is that there were some severe bugs in the code
for handling batches. I fixed at least some of 'em, but I committed
those fixes on the evening of 5 May, so I suspect they are not in your
snapshot. (Check the date of src/backend/executor/nodeHash.c to see.)

One thing to consider: if you decide to wait on the patch until after
6.5, but we then find during beta that the new optimizer triggers this
bug too often, we will have to fix it late in the beta cycle with even
less testing time available.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026