Notice and share memory corruption

Started by Hannu Krosing | over 25 years ago | 4 messages | hackers

Hannu Krosing

hannu@tm.ee

over 25 years ago

I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
rpm-s

NOTICE: RegisterSharedInvalid: SI buffer overflow
NOTICE: InvalidateSharedInvalid: cache state reset

Actually I get many of them ;(

I'm running a script that does a bunch of mixed INSERTS, UPDATES,
DELETES and SELECTS.

after getting that I'm unable to vacuum database until I reset the OS

Where/how should I start looking (or is it a known problem)?

Are there any simple workarounds to stop it happening?

-----------
Hannu

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Hannu Krosing (#1)

Re: Notice and share memory corruption

Hannu Krosing <hannu@tm.ee> writes:

> I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> rpm-s
>
> NOTICE: RegisterSharedInvalid: SI buffer overflow
> NOTICE: InvalidateSharedInvalid: cache state reset
>
> Actually I get many of them ;(

AFAIK, these are just noise in 7.0. The only reason you see them is
we haven't got round to removing the messages or downgrading them to
elog(DEBUG).

> I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> DELETES and SELECTS.

I'll bet you also have some backends sitting idle with open
transactions? The combination of idle and active backends is what
usually provokes SI overruns.
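
One way to picture the overrun Tom describes is a fixed-size message queue that every backend must read. The sketch below is a toy model under stated assumptions, not PostgreSQL's SI code: `InvalQueue`, `QUEUE_SLOTS`, and the function names are hypothetical. An active reader that keeps up never causes trouble, while an idle reader lets the backlog grow until the writer has to discard it and force the laggard to reset its cache.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_SLOTS 4   /* hypothetical capacity, chosen small for the demo */
#define MAX_READERS 2   /* hypothetical number of backends */

typedef struct {
    unsigned long head;                    /* total messages ever published */
    unsigned long next_read[MAX_READERS];  /* next message each reader will consume */
    bool must_reset[MAX_READERS];          /* reader must throw away its cache */
    int overflows;                         /* how many times the queue overflowed */
} InvalQueue;

void queue_init(InvalQueue *q)
{
    q->head = 0;
    q->overflows = 0;
    for (int i = 0; i < MAX_READERS; i++) {
        q->next_read[i] = 0;
        q->must_reset[i] = false;
    }
}

/* Publish one message. Returns true if the queue overflowed first. */
bool queue_publish(InvalQueue *q)
{
    unsigned long slowest = q->head;
    bool overflowed = false;

    for (int i = 0; i < MAX_READERS; i++)
        if (q->next_read[i] < slowest)
            slowest = q->next_read[i];

    if (q->head - slowest >= QUEUE_SLOTS) {
        /* No free slot: drop the backlog and flag every reader that is behind. */
        for (int i = 0; i < MAX_READERS; i++) {
            if (q->next_read[i] < q->head) {
                q->must_reset[i] = true;
                q->next_read[i] = q->head;
            }
        }
        q->overflows++;
        overflowed = true;
    }
    q->head++;
    return overflowed;
}

/* Reader catches up. Returns true if it had to reset its cache. */
bool queue_catch_up(InvalQueue *q, int reader)
{
    bool reset = q->must_reset[reader];

    assert(reader >= 0 && reader < MAX_READERS);
    q->must_reset[reader] = false;
    q->next_read[reader] = q->head;
    return reset;
}
```

In this model, a reader that calls `queue_catch_up` after every publish is never flagged; only the one that sat idle while `QUEUE_SLOTS` messages piled up is told to reset.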

> after getting that I'm unable to vacuum database until I reset the OS

Define your terms more carefully, please. What do you mean by
"unable to vacuum" --- what happens *exactly*? In any case,
surely it doesn't take an OS reboot to recover. I might believe
you need to restart the postmaster...

regards, tom lane

Hannu Krosing

hannu@tm.ee

over 25 years ago

In reply to: Hannu Krosing (#1)

Re: Notice and share memory corruption

Tom Lane wrote:

> Hannu Krosing <hannu@tm.ee> writes:
>
> > I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> > rpm-s
> >
> > NOTICE: RegisterSharedInvalid: SI buffer overflow
> > NOTICE: InvalidateSharedInvalid: cache state reset
> >
> > Actually I get many of them ;(
>
> AFAIK, these are just noise in 7.0. The only reason you see them is
> we haven't got round to removing the messages or downgrading them to
> elog(DEBUG).
>
> > I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> > DELETES and SELECTS.
>
> I'll bet you also have some backends sitting idle with open
> transactions? The combination of idle and active backends is what
> usually provokes SI overruns.
>
> > after getting that I'm unable to vacuum database until I reset the OS
>
> Define your terms more carefully, please. What do you mean by
> "unable to vacuum" --- what happens *exactly*?

NOTICE: FlushRelationBuffers(access_right, 2009): block 1944 is
referenced (private 0, global 2)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

> In any case,
> surely it doesn't take an OS reboot to recover. I might believe
> you need to restart the postmaster...

On one machine a simple restart worked.

Maybe I have to really restart it (instead of doing
/etc/rc.d/init.d/postgresql restart)
by running killall -9 /usr/bin/postgres

I was quite sure that just restarting it did not help, but maybe
it did not really restart and only claimed to.

On the other machine I still get

amphora2=# vacuum;
NOTICE: FlushRelationBuffers(item, 30): block 2 is referenced (private
0, global 1)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

after stopping postmaster (and checking it is stopped)

I could do a vacuum after restarting the whole machine...

OTOH it _may_ be that someone started another backend right after
the restart and did something,
but must this be a FATAL error?

-----------
Hannu

Tom Lane

tgl@sss.pgh.pa.us

over 25 years ago

In reply to: Hannu Krosing (#3)

Re: Notice and share memory corruption

Hannu Krosing <hannu@tm.ee> writes:

> > Define your terms more carefully, please. What do you mean by
> > "unable to vacuum" --- what happens *exactly*?
>
> NOTICE: FlushRelationBuffers(access_right, 2009): block 1944 is
> referenced (private 0, global 2)
> FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2

Oh, that's interesting. This error indicates that some prior
transaction neglected to release a reference count on a shared buffer.
We have seen sporadic reports of this problem in 7.0, but so far no
one has come up with a reproducible example. If you can boil down
your script to something that reproducibly causes the problem then
that'd be a great help in tracking it down.
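
The failure mode described here, a reference count that is never released, can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: `ToyBuffer`, `pin_buffer`, `unpin_buffer`, and `flush_relation` are hypothetical names, not the backend's buffer manager. The only detail taken from the thread is that the flush reports -2 when a block is still referenced.

```c
#include <assert.h>

#define NBUFFERS 4      /* hypothetical pool size */

typedef struct {
    int relation;       /* relation that owns the cached block, -1 if free */
    int refcount;       /* number of outstanding pins on this buffer */
} ToyBuffer;

void pool_init(ToyBuffer pool[NBUFFERS])
{
    for (int i = 0; i < NBUFFERS; i++) {
        pool[i].relation = -1;
        pool[i].refcount = 0;
    }
}

/* Take a reference on a buffer holding a block of the given relation. */
void pin_buffer(ToyBuffer *buf, int relation)
{
    buf->relation = relation;
    buf->refcount++;
}

/* Release a reference. Forgetting this call is the bug being modeled. */
void unpin_buffer(ToyBuffer *buf)
{
    assert(buf->refcount > 0);
    buf->refcount--;
}

/*
 * Drop all cached blocks of a relation.
 * Returns 0 on success, or -2 if any block is still referenced.
 */
int flush_relation(ToyBuffer pool[NBUFFERS], int relation)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (pool[i].relation == relation && pool[i].refcount > 0)
            return -2;

    for (int i = 0; i < NBUFFERS; i++)
        if (pool[i].relation == relation)
            pool[i].relation = -1;

    return 0;
}
```

In the model, a leaked pin survives for as long as the pool itself does, which is consistent with the observation in the thread that the error persists until shared memory is reinitialized.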

If you have clients that sometimes disconnect in the middle of a
transaction, it might help to apply the attached patch.

> Maybe I have to really restart it (instead of doing
> /etc/rc.d/init.d/postgresql restart)
> by running killall -9 /usr/bin/postgres

Restarting the postmaster should clear the problem (by releasing and
reinitializing shared memory). I dunno where you got the idea that
kill -9 was a recommended way of shutting down the system, but I sure
wouldn't recommend it. A plain kill on the postmaster ought to do it
(see the pg_ctl script in release 7.0.*).
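
The difference between a plain kill and kill -9 comes down to what a process is allowed to catch. The sketch below is a generic POSIX illustration, not postmaster code; `install_handler` and `shutdown_was_requested` are hypothetical helper names. A process can install a handler for SIGTERM and shut down in an orderly way, whereas SIGKILL cannot be caught at all, so no cleanup code ever runs.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>

/* Set from the signal handler; the main loop would poll this and clean up. */
static volatile sig_atomic_t shutdown_requested = 0;

static void on_shutdown_signal(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/*
 * Try to install the shutdown handler for the given signal.
 * Returns 0 on success, -1 on failure with errno set by sigaction().
 */
int install_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_shutdown_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

int shutdown_was_requested(void)
{
    return shutdown_requested != 0;
}
```

Calling `install_handler(SIGTERM)` succeeds, and a later SIGTERM sets the flag so the program can release its resources before exiting. Calling `install_handler(SIGKILL)` fails with `EINVAL`, which is why kill -9 leaves a server no chance to shut down cleanly.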

regards, tom lane

*** src/backend/tcop/postgres.c.orig	Sat May 20 22:23:30 2000
--- src/backend/tcop/postgres.c	Wed Aug 30 16:47:51 2000
***************
*** 1459,1465 ****
  	 * Initialize the deferred trigger manager
  	 */
  	if (DeferredTriggerInit() != 0)
! 		proc_exit(0);

  	SetProcessingMode(NormalProcessing);

--- 1459,1465 ----
  	 * Initialize the deferred trigger manager
  	 */
  	if (DeferredTriggerInit() != 0)
! 		goto normalexit;

  	SetProcessingMode(NormalProcessing);

***************
*** 1479,1490 ****
  			TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");

  		AbortCurrentTransaction();
! 		InError = false;
  		if (ExitAfterAbort)
! 		{
! 			ProcReleaseLocks();	/* Just to be sure... */
! 			proc_exit(0);
! 		}
  	}

  	Warn_restart_ready = true;	/* we can now handle elog(ERROR) */
--- 1479,1489 ----
  			TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");

  		AbortCurrentTransaction();
! 
  		if (ExitAfterAbort)
! 			goto errorexit;
! 
! 		InError = false;
  	}

  	Warn_restart_ready = true;	/* we can now handle elog(ERROR) */
***************
*** 1553,1560 ****
  				if (HandleFunctionRequest() == EOF)
  				{
  					/* lost frontend connection during F message input */
! 					pq_close();
! 					proc_exit(0);
  				}
  				break;

--- 1552,1558 ----
  				if (HandleFunctionRequest() == EOF)
  				{
  					/* lost frontend connection during F message input */
! 					goto normalexit;
  				}
  				break;

***************
*** 1608,1618 ****
  				 */
  			case 'X':
  			case EOF:
! 				if (!IsUnderPostmaster)
! 					ShutdownXLOG();
! 				pq_close();
! 				proc_exit(0);
! 				break;

  			default:
  				elog(ERROR, "unknown frontend message was received");
--- 1606,1612 ----
  				 */
  			case 'X':
  			case EOF:
! 				goto normalexit;

  			default:
  				elog(ERROR, "unknown frontend message was received");
***************
*** 1642,1651 ****
  			if (IsUnderPostmaster)
  				NullCommand(Remote);
  		}
! 	}	/* infinite for-loop */
! 
! 	proc_exit(0);	/* shouldn't get here... */
! 	return 1;
  }

  #ifndef HAVE_GETRUSAGE
--- 1636,1655 ----
  			if (IsUnderPostmaster)
  				NullCommand(Remote);
  		}
! 	}							/* end of main loop */
! 
! normalexit:
! 	ExitAfterAbort = true;		/* ensure we will exit if elog during abort */
! 	AbortOutOfAnyTransaction();
! 	if (!IsUnderPostmaster)
! 		ShutdownXLOG();
! 
! errorexit:
! 	pq_close();
! 	ProcReleaseLocks();			/* Just to be sure... */
! 	proc_exit(0);

! 	return 1;	/* keep compiler quiet */
  }

  #ifndef HAVE_GETRUSAGE
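
The patch replaces several scattered `proc_exit(0)` calls with jumps to two shared labels, so every way out of the main loop runs the same cleanup. The stand-alone sketch below shows that pattern in miniature. It is an illustration under stated assumptions, not backend code: `Session`, `Event`, and `run_session` are hypothetical, and the flag updates stand in for the transaction abort, connection close, and lock release that the patch performs.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { EV_BEGIN, EV_COMMIT, EV_DISCONNECT, EV_FATAL } Event;

typedef struct {
    bool connection_open;
    bool in_transaction;
    bool locks_held;
    int  aborts;          /* how many open transactions were aborted */
} Session;

void session_init(Session *s)
{
    s->connection_open = true;
    s->in_transaction = false;
    s->locks_held = false;
    s->aborts = 0;
}

/* Abort the open transaction, if any. Safe to call when none is open. */
static void abort_transaction(Session *s)
{
    if (s->in_transaction) {
        s->in_transaction = false;
        s->aborts++;
    }
}

void run_session(Session *s, const Event *events, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (events[i]) {
        case EV_BEGIN:
            s->in_transaction = true;
            s->locks_held = true;
            break;
        case EV_COMMIT:
            s->in_transaction = false;
            s->locks_held = false;
            break;
        case EV_DISCONNECT:
            /* Client vanished, possibly mid-transaction. */
            goto normalexit;
        case EV_FATAL:
            /* The error path has already aborted; skip the second abort. */
            abort_transaction(s);
            goto errorexit;
        }
    }

normalexit:
    abort_transaction(s);        /* never leave a transaction open */

errorexit:
    s->connection_open = false;  /* close the client connection */
    s->locks_held = false;       /* release locks, just to be sure */
}
```

With this shape, a client that disconnects in the middle of a transaction still goes through the abort step, which is the case the patch is aimed at.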