Memory Errors

Started by Sam Nelson over 15 years ago · 11 messages · general
#1 Sam Nelson
samn@consistentstate.com

Hey, a client of ours has been having some data corruption in their
database. We got the data corruption fixed and we believe we've discovered
the cause (they had a script killing any waiting queries if the locks on
their database hit 1000), but they're still getting errors from one table:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: invalid memory alloc request
size 18446744073709551613
pg_dump: The command was: COPY public.foo (<columns>) TO stdout;

That seems like an incredibly large memory allocation request - it shouldn't
be possible for the table to really be that large, should it? Any idea what
may be wrong if it's actually trying to allocate that much memory for a copy
command?

#2 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Sam Nelson (#1)
Re: Memory Errors

On Wed, Sep 8, 2010 at 12:56 PM, Sam Nelson <samn@consistentstate.com> wrote:

Hey, a client of ours has been having some data corruption in their
database.  We got the data corruption fixed and we believe we've discovered
the cause (they had a script killing any waiting queries if the locks on
their database hit 1000), but they're still getting errors from one table:

Not sure that's really the underlying problem. Depending on how they
killed the processes there's a slight chance of corruption, but more
likely they've got bad hardware. Can they take the machine down for
testing? memtest86+ is a good tool to get an idea whether you've got a
good CPU / mobo / RAM combo or not.

The last bit you included definitely looks like something's corrupted
in the database.

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sam Nelson (#1)
Re: Memory Errors

Sam Nelson <samn@consistentstate.com> writes:

pg_dump: Error message from server: ERROR: invalid memory alloc request
size 18446744073709551613
pg_dump: The command was: COPY public.foo (<columns>) TO stdout;

That seems like an incredibly large memory allocation request - it shouldn't
be possible for the table to really be that large, should it? Any idea what
may be wrong if it's actually trying to allocate that much memory for a copy
command?

What that looks like is data corruption; specifically, a bogus length
word for a variable-length field.
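
For reference, the reported size is telling on its own: 18446744073709551613 is exactly 2^64 - 3, i.e. the value -3 wrapped around to an unsigned 64-bit integer, which is what a trashed varlena length word typically produces. A quick arithmetic check (illustrative only):

```sql
-- 2^64 - 3 = 18446744073709551613, matching the "impossible" request
-- size: a small negative length word read back as an unsigned 64-bit
-- value, not a genuine allocation.
SELECT 2::numeric ^ 64 - 3 AS bogus_alloc_size;
```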

regards, tom lane

#4 Sam Nelson
samn@consistentstate.com
In reply to: Tom Lane (#3)
Re: Memory Errors

It figures I'd have an idea right after posting to the mailing list.

Yeah, running COPY foo TO stdout; gets me a list of data before erroring
out, so I did a copy (select * from foo order by id asc) to stdout; to see
if I could make some kind of guess as to whether this was related to a
single row or something else.

I got the id of the last row the COPY TO command was able to grab normally
and tried to figure out the next id. The following started to make me think
along the lines of some kind of bad corruption (even before getting responses
that agreed with that):

Assuming that the last id copied was 1500:

1) select * from foo where id = (select min(id) from foo where id > 1500);
Results in 0 rows

2) select min(id) from foo where id > 1500;
Results in, for example, 200000

3) select max(id) from foo where id > 1500;
Results in, for example, 90000 (a much lower number than returned by min)

4) select id from foo where id > 1500 order by id asc limit 10;
Results in (for example):

200000
202000
210273
220980
15005
15102
15104
15110
15111
15113

So ... yes, it seems that those four ids (200000, 202000, 210273, 220980) are somehow part of the problem.
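
Incidentally, min() coming out larger than max() over the same predicate is a strong hint that the index on id is damaged too, since those aggregates are usually answered from the index. One hedged way to check, assuming a standard btree index on id:

```sql
-- Force a sequential scan and compare with the index-scan answers;
-- if they disagree, the index itself is corrupt.
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT min(id), max(id) FROM foo WHERE id > 1500;
RESET enable_indexscan;
RESET enable_bitmapscan;

-- If they do disagree, rebuilding the index is the usual remedy once
-- the heap is clean:
-- REINDEX TABLE foo;
```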

They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
either), so memtest isn't available, but no new corruption has cropped up
since they stopped killing the waiting queries (I just double checked - they
were getting corrupted rows constantly, and we haven't gotten one since that
script stopped killing queries).

We're going to have them attempt to delete the rows with those ids (even
though the rows don't exist), and if that fails, we're going to copy (select
* from foo where id not in (<list>)) to file;, drop table foo;, create table
foo;, and copy foo from file. I'll try to remember to write back with
whether or not any of those things worked.
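
For what it's worth, the equivalent of that rescue plan can also be done server-side without round-tripping through a file. A hypothetical sketch (foo_rescue and foo_damaged are made-up names; if the bad rows error even on read, the SELECT may need further narrowing):

```sql
BEGIN;
-- Copy every readable row into a fresh table, skipping the damaged ids.
CREATE TABLE foo_rescue AS
    SELECT * FROM foo
    WHERE id NOT IN (200000, 202000, 210273, 220980);
-- Keep the damaged table around as evidence rather than dropping it.
ALTER TABLE foo RENAME TO foo_damaged;
ALTER TABLE foo_rescue RENAME TO foo;
COMMIT;
-- Note: indexes, constraints, and defaults must be recreated on the
-- new foo, since CREATE TABLE AS copies only the data.
```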


#5 Merlin Moncure
mmoncure@gmail.com
In reply to: Sam Nelson (#4)
Re: Memory Errors

On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn@consistentstate.com> wrote:

They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
either), so memtest isn't available, but no new corruption has cropped up
since they stopped killing the waiting queries (I just double checked - they
were getting corrupted rows constantly, and we haven't gotten one since that
script stopped killing queries).

That's actually a startling indictment of ec2 -- how were you killing
your queries exactly? You say this is repeatable? What's your
setting of full_page_writes?

One way to identify and potentially nuke bad records of this kind is
to do something like:

select id, length(field1) as len from foo order by 2 desc limit 5;

where field1 is the first varlen field (text, bytea, etc) in left-to-
right order. Look for bogusly high values and move on to the next
field if you don't find any. Once you hit a bad value, try deleting
the record by its key.

Once you've found/deleted them all, immediately pull off a dump, then
rebuild the table. As always, take a filesystem dump before doing
this type of surgery...
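
One more trick worth having on hand: when a row is so damaged that even selecting it errors out, it can sometimes be removed by physical location rather than by key (the ctid value below is purely illustrative):

```sql
-- Find the tuple's physical address without touching the bad column:
SELECT ctid, id FROM foo WHERE id = 200000;

-- Then delete by ctid; substitute the value the query above returned.
DELETE FROM foo WHERE ctid = '(1234,5)';
```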

merlin

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#5)
Re: Memory Errors

Merlin Moncure <mmoncure@gmail.com> writes:


That's actually a startling indictment of ec2 -- how were you killing
your queries exactly? You say this is repeatable? What's your
setting of full_page_writes?

I think we'd established that they were doing kill -9 on backend
processes :-(. However, PG has a lot of track record that says that
backend crashes don't result in corrupt data. What seems more likely
to me is that the corruption is the result of some shortcut taken while
shutting down or migrating the ec2 instance, so that some writes that
Postgres thought got to disk didn't really.

regards, tom lane

#7 Sam Nelson
samn@consistentstate.com
In reply to: Tom Lane (#6)
Re: Memory Errors

My (our) complaints about EC2 aren't particularly extensive, but last time I
posted to the mailing list saying they were using EC2, the first reply was
someone saying that the corruption was the fault of EC2.

Not that we don't have complaints at all (there are some aspects that are
very frustrating), but I was just trying to stave off anyone who was going
to reply saying "Tell them to stop using EC2".

-- More detail about the script that kills queries:

Honestly, we (or, at least, I) haven't discovered which type of kill they
were doing, but it does seem to be the culprit in some way. I don't talk to
the customers (that's my boss's job), so I didn't get to ask specifics. If
my boss did ask specifics, he didn't tell me.

The previous issue involved toast corruption showing up very regularly (e.g.
once a day, in some cases), the end result being that we had to delete the
corrupted rows. Coming back the next day to see the same corruption on
different rows was not very encouraging.

We found out after that that they had a script running as a daemon that
would, every ten minutes (I believe), check the number of locks on the table
and kill all waiting queries if there were >= 1000 locks.

Even if the corruption wasn't a result of that, we weren't too excited about
the process being there to begin with. We thought there had to be a better
solution than just killing the processes. So we had a discussion about the
intent of that script, my boss put together something that solved the same
problem without killing queries, and then we had them stop that daemon; we've
been working with that database since to make sure it doesn't go screwy again.
No new corruption has shown up since stopping that daemon.

That memory allocation issue looked drastically different from the toast
value errors, though, so it seemed like a separate problem. But now it's
looking like more corruption.

---

We're requesting that they do a few things (this is their production
database, so we usually don't alter any data unless they ask us to),
including deleting those rows. My memory is insufficient, so there's a good
chance that I'll forget to post back to the mailing list with the results,
but I'll try to remember to do so.

Thank you for the help - I'm sure I'll be back soon with many more
questions.

-Sam


#8 Merlin Moncure
mmoncure@gmail.com
In reply to: Sam Nelson (#7)
Re: Memory Errors


Any information on repeatable data corruption, whether it is ec2
improperly flushing data on instance resets, postgres misbehaving
under atypical conditions, or bad interactions between ec2 and
postgres, is highly valuable. The only cases of 'understandable' data
corruption are hardware failures, sync issues (either fsync off, or
fsync not honored by the hardware), torn pages on non-journaling file
systems, etc.

Naturally people are going to be skeptical of ec2 since you are so
abstracted from the hardware. Maybe all your problems stem from a
single explainable incident -- but we definitely want to get to the
bottom of this...please keep us updated!

merlin

#9 Sam Nelson
samn@consistentstate.com
In reply to: Merlin Moncure (#8)
Re: Memory Errors

Okay, we're finally getting the last bits of corruption fixed, and I finally
remembered to ask my boss about the kill script.

The only details I have are these:

1) The script does nothing if there are fewer than 1000 locks on tables in
the database

2) If there are 1000 or more locks, it will grab the processes in
pg_stat_activity that are in a waiting state

3) for each of the previous processes, it will do a system kill $pid call

The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not
a kill -9. Just a normal kill.
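
For comparison, an in-database version of that daemon's logic would look roughly like the sketch below (hypothetical; pg_stat_activity's procpid and waiting columns are the pre-9.2 names matching this era, and pg_terminate_backend() only exists as of 8.4):

```sql
-- Terminate waiting backends via the server's own signalling function,
-- and only when the lock count crosses the threshold.
SELECT pg_terminate_backend(procpid)
FROM pg_stat_activity
WHERE waiting
  AND (SELECT count(*) FROM pg_locks) >= 1000;
```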

As far as the postgres and EC2 instances go, we're not really sure if anyone
shut down, created, or migrated them in a weird way, but Kevin (my boss)
said that it wouldn't surprise him.

All I can say is that where we were getting 1 new row of corruption every
day when the kill script was running, we haven't gotten any new corruption
since we stopped it.

As far as the table with memory errors goes, we had asked them to rebuild
the table, and they came back saying that they no longer need that table.
So they're just going to drop it.

We'll try to keep digging, but I'm not sure we'll get much more info than
that. We're quite busy and my ability to remember things is ...
questionable.

-Sam


#10 Merlin Moncure
mmoncure@gmail.com
In reply to: Sam Nelson (#9)
Re: Memory Errors

On Tue, Sep 21, 2010 at 12:57 PM, Sam Nelson <samn@consistentstate.com> wrote:

On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
Naturally people are going to be skeptical of ec2 since you are so
abstracted from the hardware. Maybe all your problems stem from a
single explainable incident -- but we definitely want to get to the
bottom of this...please keep us updated!

As far as the postgres and EC2 instances go, we're not really sure if anyone
shut down, created, or migrated them in a weird way, but Kevin (my boss)
said that it wouldn't surprise him.

<please try to avoid top-posting -- it destroys the context of the conversation>

The shutdown/migration point is key, along with fsync settings and a
description of whatever durability guarantees ec2 gives on the storage
you are using. It's the difference between this being a non-event and
something much more interesting. The correct way btw to kill backends
is with pg_ctl, but what you did is not related to data corruption.

merlin

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sam Nelson (#9)
Re: Memory Errors

Sam Nelson <samn@consistentstate.com> writes:

Okay, we're finally getting the last bits of corruption fixed, and I finally
remembered to ask my boss about the kill script.

The only details I have are these:

1) The script does nothing if there are fewer than 1000 locks on tables in
the database

2) If there are 1000 or more locks, it will grab the processes in
pg_stat_activity that are in a waiting state

3) for each of the previous processes, it will do a system kill $pid call

The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not
a kill -9. Just a normal kill.

SIGTERM then. Since (according to the other thread) this was 8.3.11,
that should in theory be safe; but it's not something I'd consider
tremendously well tested before 8.4.x.

I'd still lean to the theory of data lost during an EC2 instance
shutdown.

regards, tom lane