Shared buffer hash table corrupted

Started by Mark Fletcher | over 6 years ago | 5 messages | general

markf@corp.groups.io

over 6 years ago

Hi All,

Running 9.6.15, this morning we got a 'shared buffer hash table corrupted'
error on a query. I reran the query a couple hours later, and it completed
without error. This is running in production on a Linode instance which
hasn't seen any config changes in months.

I didn't find much on-line about this. How concerned should I be? Would you
move the instance to a different physical host?

Thanks,
Mark

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Mark Fletcher (#1)

Re: Shared buffer hash table corrupted

Mark Fletcher <markf@corp.groups.io> writes:

> Running 9.6.15, this morning we got a 'shared buffer hash table corrupted'
> error on a query. I reran the query a couple hours later, and it completed
> without error. This is running in production on a Linode instance which
> hasn't seen any config changes in months.
>
> I didn't find much on-line about this. How concerned should I be? Would you
> move the instance to a different physical host?

Personally, I'd restart the postmaster, but not do more than that unless
the error recurs.

regards, tom lane

Mark Fletcher

markf@corp.groups.io

over 6 years ago

In reply to: Tom Lane (#2)

Re: Shared buffer hash table corrupted

On Fri, Feb 21, 2020 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Personally, I'd restart the postmaster, but not do more than that unless
> the error recurs.

Thanks for the response. I did restart the postmaster yesterday. Earlier
this morning, a query that normally completes fine started to error out
with 'invalid memory alloc request size 18446744073709551613'. Needless to
say our database isn't quite that size. This query was against a table in a
different database than the one that had the corruption warning yesterday.
Restarting the postmaster again fixed the problem. For good measure I
restarted the machine as well.

I need to decide what to do next, if anything. We have a hot standby that
we also run queries against, and it hasn't shown any errors. I can switch
over to that as the primary. Or I can move the main database to a different
physical host.

Thoughts appreciated.

Thanks,
Mark

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Mark Fletcher (#3)

Re: Shared buffer hash table corrupted

Mark Fletcher <markf@corp.groups.io> writes:

> Thanks for the response. I did restart the postmaster yesterday. Earlier
> this morning, a query that normally completes fine started to error out
> with 'invalid memory alloc request size 18446744073709551613'. Needless to
> say our database isn't quite that size. This query was against a table in a
> different database than the one that had the corruption warning yesterday.
> Restarting the postmaster again fixed the problem. For good measure I
> restarted the machine as well.

Um. At that point I'd agree with your concern about developing hardware
problems. Both of these symptoms could be easily explained by dropped
bits in PG's shared memory area. Do you happen to know if the server
has ECC RAM?

regards, tom lane
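
Editor's note: the request size in the second error is worth decoding, since
it hints at why corrupted memory is a plausible explanation. The sketch below
is not from the thread. It assumes the size was computed as a signed 64-bit
quantity and then reported as an unsigned one, and it uses the value of
PostgreSQL's MaxAllocSize limit (0x3fffffff, one gigabyte minus one). The
helper name as_signed_64 is made up for illustration. Under those
assumptions the reported number is simply -3, which is what a damaged length
field could produce. This does not prove what actually went wrong on this
server.

```python
# Assumption: the bad size was a signed 64-bit value printed as an unsigned
# 64-bit size. This only decodes the number; it does not diagnose the cause.

MAX_ALLOC_SIZE = 0x3FFFFFFF  # PostgreSQL's MaxAllocSize: 1 GB - 1


def as_signed_64(u):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    return u - (1 << 64) if u >= (1 << 63) else u


reported = 18446744073709551613           # size from the error message
signed_value = as_signed_64(reported)     # same 64 bits, read as signed
exceeds_limit = reported > MAX_ALLOC_SIZE  # why the allocator rejected it

print(signed_value, exceeds_limit)
```

Read this way, the message says a length of -3 was requested, which no
healthy code path should compute. That fits the "dropped bits" theory above
but is equally consistent with other kinds of corruption.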

Mark Fletcher

markf@corp.groups.io

over 6 years ago

In reply to: Tom Lane (#4)

Re: Shared buffer hash table corrupted

On Sat, Feb 22, 2020 at 9:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Um. At that point I'd agree with your concern about developing hardware
> problems. Both of these symptoms could be easily explained by dropped
> bits in PG's shared memory area. Do you happen to know if the server
> has ECC RAM?

Yes, it appears that Linode uses ECC and other server grade hardware for
their machines.

Thanks,
Mark