Client failure allows backend to continue

Started by Bruce Momjian | over 23 years ago | 7 messages | hackers

Bruce Momjian

bruce@momjian.us

over 23 years ago

As part of the training class I did, some people tested what happens
when the client allocates tons of memory to store a result and aborts.

What we found was that though elog was properly called:

elog(COMMERROR, "pq_recvbuf: recv() failed: %m");

(I think that was the message.) the backend did not exit and kept
eating CPU. I think the problem is that the elog code only exits on
ERROR, not COMMERROR. Is there some way to fix this?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: Bruce Momjian (#1)

Re: Client failure allows backend to continue

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> As part of the training class I did, some people tested what happens
> when the client allocates tons of memory to store a result and aborts.
>
> What we found was that though elog was properly called:
>
> elog(COMMERROR, "pq_recvbuf: recv() failed: %m");
>
> (I think that was the message.) the backend did not exit and kept
> eating CPU. I think the problem is that the elog code only exits on
> ERROR, not COMMERROR. Is there some way to fix this?

There's been talk of setting the QueryCancel flag after detecting a
client communication failure ... but no one has ever done the legwork
to see if that works nicely, or what downsides it might have.

regards, tom lane

Bruce Momjian

bruce@momjian.us

over 23 years ago

In reply to: Tom Lane (#2)

Re: Client failure allows backend to continue

Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > As part of the training class I did, some people tested what happens
> > when the client allocates tons of memory to store a result and aborts.
> >
> > What we found was that though elog was properly called:
> >
> > elog(COMMERROR, "pq_recvbuf: recv() failed: %m");
> >
> > (I think that was the message.) the backend did not exit and kept
> > eating CPU. I think the problem is that the elog code only exits on
> > ERROR, not COMMERROR. Is there some way to fix this?
>
> There's been talk of setting the QueryCancel flag after detecting a
> client communication failure ... but no one has ever done the legwork
> to see if that works nicely, or what downsides it might have.

Why is COMMERROR not doing the longjmp like ERROR?

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: Bruce Momjian (#3)

Re: Client failure allows backend to continue

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Why is COMMERROR not doing the longjmp like ERROR?

Because it's defined to be like LOG.

A more useful reply might be that I'm not sure it's safe to abort in the
client I/O routines.

regards, tom lane
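
The distinction drawn above can be seen in a toy model. The following is a minimal sketch, not PostgreSQL source: the toy_ names, the level enum, and the recovery point are all assumptions made for illustration. It shows a reporting routine in which only the ERROR-like level unwinds via longjmp, while the COMMERROR-like level logs and returns, so the caller keeps running.

```c
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical levels for this sketch; not the real PostgreSQL values. */
enum toy_level { TOY_LOG, TOY_COMMERROR, TOY_ERROR };

static jmp_buf toy_main_loop;   /* recovery point, armed by the caller */
static int toy_log_lines = 0;   /* number of messages written so far */

/* Write the message to stderr; only TOY_ERROR unwinds to the recovery point. */
static void toy_elog(enum toy_level level, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    toy_log_lines++;

    if (level == TOY_ERROR)
        longjmp(toy_main_loop, 1);  /* does not return */

    /* TOY_LOG and TOY_COMMERROR fall through: the caller keeps running. */
}

/*
 * Returns 1 if reporting at `level` unwound to the recovery point,
 * 0 if control came back normally to the statement after toy_elog().
 */
static int toy_report_unwinds(enum toy_level level)
{
    if (setjmp(toy_main_loop) != 0)
        return 1;
    toy_elog(level, "pq_recvbuf: recv() failed");
    return 0;
}
```

In this model, toy_report_unwinds(TOY_COMMERROR) returns 0 and toy_report_unwinds(TOY_ERROR) returns 1, which mirrors the reported behavior: the message is logged, but nothing stops the work in progress.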

Bruce Momjian

bruce@momjian.us

over 23 years ago

In reply to: Tom Lane (#4)

Re: Client failure allows backend to continue

Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Why is COMMERROR not doing the longjmp like ERROR?
>
> Because it's defined to be like LOG.
>
> A more useful reply might be that I'm not sure it's safe to abort in the
> client I/O routines.

Well, if we get an I/O error, I can't imagine why we would continue
doing anything --- are any of those recoverable? Do we need a separate
error type for I/O messages?

Tom Lane

tgl@sss.pgh.pa.us

over 23 years ago

In reply to: Bruce Momjian (#5)

Re: Client failure allows backend to continue

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Well, if we get an I/O error, I can't imagine why we would continue
> doing anything --- are any of those recoverable?

Well, that's what's not clear --- it's hard to tell if a write failure
is a hard error or just transient. If we make like elog(ERROR),
returning to the main loop, and then a read from the client *doesn't*
fail, we'll try to continue ... but we've just screwed the pooch,
because we have not sent a complete message and therefore certainly have
messed up frontend/backend synchronization. I have no idea whether it's
really possible to recover from this situation or not, but that approach
surely won't work.
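
To make the synchronization problem concrete, here is a toy framing scheme: one type byte, one length byte, then the payload. It is an assumption for illustration and not the actual frontend/backend protocol; all toy_ names are hypothetical. A reader that receives a truncated message followed by a complete one can no longer find the next message boundary.

```c
#include <stddef.h>
#include <string.h>

/* Toy frame: type byte, payload-length byte, payload. Returns bytes written. */
static size_t toy_frame(unsigned char *out, char type, const char *payload)
{
    size_t n = strlen(payload);

    out[0] = (unsigned char) type;
    out[1] = (unsigned char) n;
    memcpy(out + 2, payload, n);
    return n + 2;
}

/*
 * Walk the stream frame by frame. Returns the number of well-formed
 * frames, or -1 once the reader can no longer trust the boundaries.
 */
static int toy_count_messages(const unsigned char *buf, size_t len)
{
    size_t pos = 0;
    int count = 0;

    while (pos + 2 <= len)
    {
        unsigned char type = buf[pos];
        size_t n = buf[pos + 1];

        if ((type != 'D' && type != 'C') || pos + 2 + n > len)
            return -1;          /* unknown type or payload runs off the end */
        pos += 2 + n;
        count++;
    }
    return (pos == len) ? count : -1;   /* stray trailing bytes: out of sync */
}

/*
 * Send only the first `keep` bytes of a 'D' frame carrying "hello",
 * then a complete 'C' frame carrying "ok", and parse the result.
 */
static int toy_parse_after_partial_send(size_t keep)
{
    unsigned char first[16], stream[32];
    size_t full = toy_frame(first, 'D', "hello");   /* 7 bytes */
    size_t total;

    if (keep > full)
        keep = full;
    memcpy(stream, first, keep);
    total = keep + toy_frame(stream + keep, 'C', "ok");
    return toy_count_messages(stream, total);
}
```

With keep set to 7 the reader sees two clean frames. With keep set to 4 the reader swallows the start of the second frame as payload of the first and reports -1, even though the second send succeeded.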

If you want to take a kamikaze any-comm-error-means-we're-dead approach,
you might think about elog(FATAL). But that tries to send a message to
the client. Instant infinite loop, if the error is hard.

Complaints to the postmaster log, and abort at the next safe place
(*not* partway through message output) seem like the way to go to me.

> Do we need a separate error type for I/O messages?

Uh ... see COMMERROR.

regards, tom lane
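
The approach outlined above (log the failure, then abort at the next safe place rather than partway through message output) can be sketched as a flag that the I/O routine sets and the processing loop polls. This is a minimal sketch under assumptions: client_conn_lost, toy_send, and toy_run_query are hypothetical names, not PostgreSQL functions, and the real QueryCancel mechanism is not shown.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag: set by the I/O layer, polled at safe points. */
static volatile bool client_conn_lost = false;

/*
 * Hypothetical send routine. On failure it only logs and sets the flag;
 * it never aborts from inside the I/O path.
 */
static int toy_send(const char *buf, size_t len, bool simulate_failure)
{
    (void) buf;
    (void) len;

    if (simulate_failure)
    {
        fprintf(stderr, "LOG: could not send data to client\n");
        client_conn_lost = true;
        return -1;
    }
    return 0;
}

/*
 * Produce up to nrows rows, simulating a send failure on row fail_at_row
 * (use -1 for no failure). Returns the number of rows processed.
 */
static int toy_run_query(int nrows, int fail_at_row)
{
    int processed = 0;

    client_conn_lost = false;
    for (int row = 0; row < nrows; row++)
    {
        /* Safe point: between rows, so no message is half-written here. */
        if (client_conn_lost)
            break;

        toy_send("row", 3, row == fail_at_row);
        processed++;
    }
    return processed;
}
```

With toy_run_query(5, 2), the send for the third row fails, the flag is set, and the loop stops at the next check, so 3 rows are processed instead of 5.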

Bruce Momjian

bruce@momjian.us

over 23 years ago

In reply to: Tom Lane (#6)

Re: Client failure allows backend to continue

Well, setting QueryCancel then seems like a logical solution because it
will exit at a reasonable point, hopefully. Right now we have
statement_timeout and that exits at a given time, but I suppose it
doesn't exit while data is transferring, so it may be different.
