Help - corruption issue?

Started by Phoenix Kiulaalmost 15 years ago39 messagesgeneral
Jump to latest
#1Phoenix Kiula
phoenix.kiula@gmail.com

While doing a PG dump, I seem to have a problem:

ERROR: invalid memory alloc request size 4294967293

Upon googling, this seems to be a data corruption issue!

( Came about while doing performance tuning as being discussed on the
PG-PERFORMANCE list:
http://postgresql.1045698.n5.nabble.com/REINDEX-takes-half-a-day-and-still-not-complete-td4005943.html
)

One of the older messages suggests that I do "file level backup and
restore the data".
http://archives.postgresql.org/pgsql-admin/2008-05/msg00191.php

How does one do this -- should I copy the data folder? What are the
specific steps?

I'm on PG 8.2.9, CentOS 5, with 8GB of RAM. The disks are four SATAII
disks on RAID 1.

Thanks!

#2Filip Rembiałkowski
plk.zuber@gmail.com
In reply to: Phoenix Kiula (#1)
Re: Help - corruption issue?

Phoenix,

how large (in total) is this database)?

can you copy (cp -a) the data directory somewhere? I would do this
just in case :-)

regarding the manual recovery process:

1. you'll have to isolate corrupted table.
you can do this by dumping all tables one-by-one (pg_dump -t TABLE)
until you get the error.

2. find the record which is corupted... approach like this might work:
select count(*) from the_corrupted_table where PK_column <= some_value.

3 .you should try to dump the table by chunks - skipping the corrupted
row(s) if possible

4. if above method does not work, you can try manually hex-editing
(zeroing) some bytes (with postgres shut down) to make dump work
again.

PS. obligatory note:

8.2.9 Release Date: 2008-06-12; 8.2.21 Release Date: 2011-04-18
seems like you were running almost three years without bugfixes.
aside from fixing your current problem, I would first do the upgrade
to avoid more corruption.

2011/4/18 Phoenix Kiula <phoenix.kiula@gmail.com>

Show quoted text

While doing a PG dump, I seem to have a problem:

ERROR: invalid memory alloc request size 4294967293

Upon googling, this seems to be a data corruption issue!

( Came about while doing performance tuning as being discussed on the
PG-PERFORMANCE list:
http://postgresql.1045698.n5.nabble.com/REINDEX-takes-half-a-day-and-still-not-complete-td4005943.html
)

One of the older messages suggests that I do "file level backup and
restore the data".
http://archives.postgresql.org/pgsql-admin/2008-05/msg00191.php

How does one do this -- should I copy the data folder? What are the
specific steps?

I'm on PG 8.2.9, CentOS 5, with 8GB of RAM. The disks are four SATAII
disks on RAID 1.

Thanks!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Filip Rembiałkowski (#2)
Re: Help - corruption issue?

Thanks Filip.

I know which table it is. It's my largest table with over 125 million rows.

All the others are less than 100,000 rows. Most are in fact less than 25,000.

Now, which specific part of the table is corrupted -- if it is row
data, then can I dump specific parts of that table? How? Pg_dumpall
does not seem to have an option to have a "WHERE" clause?

If the lead index is corrupt, then issuing a reindex should work. So I
disconnected all other users. The DB was doing nothing. And then I
started a psql session and issued the command "reindex database MYDB".
After 3 hours, I see this error:

[QUOTE]
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:
terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!>
[/UNQUOTE]

What am I to do now? Even reindex is not working. I can try to drop
indexes and create them again. Will that help?

2011/4/18 Filip Rembiałkowski <plk.zuber@gmail.com>:

Show quoted text

Phoenix,

how large (in total) is this database)?

can you copy (cp -a) the data directory somewhere? I would do this
just in case :-)

regarding the manual recovery process:

1. you'll have to isolate corrupted table.
you can do this by dumping all tables one-by-one (pg_dump -t TABLE)
until you get the error.

2. find the record which is corupted... approach like this might work:
select count(*) from the_corrupted_table where PK_column <= some_value.

3 .you should try to dump the table by chunks - skipping the corrupted
row(s) if possible

4. if above method does not work, you can try manually hex-editing
(zeroing) some bytes (with postgres shut down) to make dump work
again.

PS. obligatory note:

8.2.9 Release Date: 2008-06-12; 8.2.21 Release Date: 2011-04-18
seems like you were running almost three years without bugfixes.
aside from fixing your current problem, I would first do the upgrade
to avoid more corruption.

2011/4/18 Phoenix Kiula <phoenix.kiula@gmail.com>

While doing a PG dump, I seem to have a problem:

ERROR: invalid memory alloc request size 4294967293

Upon googling, this seems to be a data corruption issue!

( Came about while doing performance tuning as being discussed on the
PG-PERFORMANCE list:
http://postgresql.1045698.n5.nabble.com/REINDEX-takes-half-a-day-and-still-not-complete-td4005943.html
)

One of the older messages suggests that I do "file level backup and
restore the data".
http://archives.postgresql.org/pgsql-admin/2008-05/msg00191.php

How does one do this -- should I copy the data folder? What are the
specific steps?

I'm on PG 8.2.9, CentOS 5, with 8GB of RAM. The disks are four SATAII
disks on RAID 1.

Thanks!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#3)
Re: Help - corruption issue?

Thanks Filip.

I know which table it is. It's my largest table with over 125 million
rows.

All the others are less than 100,000 rows. Most are in fact less than
25,000.

Now, which specific part of the table is corrupted -- if it is row
data, then can I dump specific parts of that table? How? Pg_dumpall
does not seem to have an option to have a "WHERE" clause?

If the lead index is corrupt, then issuing a reindex should work. So I
disconnected all other users. The DB was doing nothing. And then I
started a psql session and issued the command "reindex database MYDB".
After 3 hours, I see this error:

[QUOTE]
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:
terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!>
[/UNQUOTE]

What am I to do now? Even reindex is not working. I can try to drop
indexes and create them again. Will that help?

It might help, but as someone already pointed out, you're running a
version that's 3 years old. So do a hot file backup (stop the db and copy
the data directory to another machine), check the hardware (especially the
RAID controller and RAM), upgrade to the latest 8.2.x version and then try
again.

I'll post a bit more info into the other thread, as it's related to the
reindex performance and not to this issue.

regards
Tomas

#5Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#4)
Re: Help - corruption issue?

On Mon, Apr 18, 2011 at 11:02 PM, <tv@fuzzy.cz> wrote:

Thanks Filip.

I know which table it is. It's my largest table with over 125 million
rows.

All the others are less than 100,000 rows. Most are in fact less than
25,000.

Now, which specific part of the table is corrupted -- if it is row
data, then can I dump specific parts of that table? How? Pg_dumpall
does not seem to have an option to have a "WHERE" clause?

If the lead index is corrupt, then issuing a reindex should work. So I
disconnected all other users. The DB was doing nothing. And then I
started a psql session and issued the command "reindex database MYDB".
After 3 hours, I see this error:

[QUOTE]
server closed the connection unexpectedly
      This probably means the server terminated abnormally
      before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:
terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!>
[/UNQUOTE]

What am I to do now? Even reindex is not working. I can try to drop
indexes and create them again. Will that help?

It might help, but as someone already pointed out, you're running a
version that's 3 years old. So do a hot file backup (stop the db and copy
the data directory to another machine), check the hardware (especially the
RAID controller and RAM), upgrade to the latest 8.2.x version and then try
again.

I'll post a bit more info into the other thread, as it's related to the
reindex performance and not to this issue.

regards
Tomas

Thanks. For CentOS (RedHat?) the latest is 8.2.19 right? Not the
8.2.20 that's mentioned on front page of PG.org.

http://www.pgrpms.org/8.2/redhat/rhel-4-i386/repoview/

Question: will upgrading from 8.2.9 to 8.2.19 have some repercussions
in terms of huge changes or problems?

I know 9.x had some new additions including "casting" etc (or is that
irrelevant to me?) but if 8.2.19 is safe in terms of not requiring
anything new from my side, then I can do the upgrade quickly.

Welcome any advice.

Thanks!

#6Merlin Moncure
mmoncure@gmail.com
In reply to: Phoenix Kiula (#3)
Re: Help - corruption issue?

2011/4/18 Phoenix Kiula <phoenix.kiula@gmail.com>:

Thanks Filip.

I know which table it is. It's my largest table with over 125 million rows.

All the others are less than 100,000 rows. Most are in fact less than 25,000.

Now, which specific part of the table is corrupted -- if it is row
data, then can I dump specific parts of that table? How? Pg_dumpall
does not seem to have an option to have a "WHERE" clause?

If the lead index is corrupt, then issuing a reindex should work. So I
disconnected all other users. The DB was doing nothing. And then I
started a psql session and issued the command "reindex database MYDB".
After 3 hours, I see this error:

[QUOTE]
server closed the connection unexpectedly
       This probably means the server terminated abnormally
       before or while processing the request.
The connection to the server was lost. Attempting reset: WARNING:
terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!>
[/UNQUOTE]

What am I to do now? Even reindex is not working. I can try to drop
indexes and create them again. Will that help?

it might. take a full file system backup first and drop the indexes.
before recreating them, take a regular dump (with pg_dump) and if it
goes through, you're golden, rebuild the indexes, *update the
postmaster to latest 8.2*, and you can go back online.

merllin

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#5)
Re: Help - corruption issue?

Dne 18.4.2011 20:27, Phoenix Kiula napsal(a):

What am I to do now? Even reindex is not working. I can try to drop
indexes and create them again. Will that help?

It might help, but as someone already pointed out, you're running a
version that's 3 years old. So do a hot file backup (stop the db and copy
the data directory to another machine), check the hardware (especially the
RAID controller and RAM), upgrade to the latest 8.2.x version and then try
again.

I'll post a bit more info into the other thread, as it's related to the
reindex performance and not to this issue.

regards
Tomas

Thanks. For CentOS (RedHat?) the latest is 8.2.19 right? Not the
8.2.20 that's mentioned on front page of PG.org.

Centos is probably a bit delayed behind the source version. If you want
to stick with the binary version, go with the 8.2.19.

http://www.pgrpms.org/8.2/redhat/rhel-4-i386/repoview/

Question: will upgrading from 8.2.9 to 8.2.19 have some repercussions
in terms of huge changes or problems?

Those minor versions are mostly bugfixes and small improvements. So no,
I wouldn't expect huge problems.

I know 9.x had some new additions including "casting" etc (or is that
irrelevant to me?) but if 8.2.19 is safe in terms of not requiring
anything new from my side, then I can do the upgrade quickly.

Don't do that right now. When doing 'minor' upgrades, you don't need to
dump/restore the database - you can just replace the binaries and it
should work as the file format does not change between minor versions
(and 8.2.9 -> 8.2.19 is a minor upgrade).

Still, do the file backup as described in the previous posts. You could
even do an online backup using pg_backup_start/pg_backup_stop etc.

To upgrade from 8.2 to 9.0 you'd need to do pg_dump backup and then
restore the database. Which is of scope right now, I guess.

#8Scott Marlowe
scott.marlowe@gmail.com
In reply to: Tomas Vondra (#7)
Re: Help - corruption issue?

On Mon, Apr 18, 2011 at 5:44 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Still, do the file backup as described in the previous posts. You could
even do an online backup using pg_backup_start/pg_backup_stop etc.

As soon as you have a working file system backup, get the tw_cli
utility for the 3ware cards downloaded and LOOK at what it has to say
about your RAID controller, drives, and array health.

#9Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Scott Marlowe (#8)
Re: Help - corruption issue?

On Tue, Apr 19, 2011 at 8:35 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

On Mon, Apr 18, 2011 at 5:44 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Still, do the file backup as described in the previous posts. You could
even do an online backup using pg_backup_start/pg_backup_stop etc.

As soon as you have a working file system backup, get the tw_cli
utility for the 3ware cards downloaded and LOOK at what it has to say
about your RAID controller, drives, and array health.

I am with SoftLayer. They're a very professional bunch. They even
changed my BBU last night. The RAID card is working. The memory and
the hardware are also tested.

I have now upgraded to 8.2.19.

Then I restarted the server, and dropped indexes. When I recreate the
first index, the same thing happens:

------
# CREATE INDEX idx_links_userid ON links (user_id);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
------

There is nothing going on in the server other than this command. All
other users are blocked!

Logging is enabled but does not have anything!

I am now worried. What is this problem?

#10Scott Marlowe
scott.marlowe@gmail.com
In reply to: Phoenix Kiula (#9)
Re: Help - corruption issue?

On Mon, Apr 18, 2011 at 8:52 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

On Tue, Apr 19, 2011 at 8:35 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

On Mon, Apr 18, 2011 at 5:44 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Still, do the file backup as described in the previous posts. You could
even do an online backup using pg_backup_start/pg_backup_stop etc.

As soon as you have a working file system backup, get the tw_cli
utility for the 3ware cards downloaded and LOOK at what it has to say
about your RAID controller, drives, and array health.

I am with SoftLayer. They're a very professional bunch. They even
changed my BBU last night. The RAID card is working. The memory and
the hardware are also tested.

So, RAID is good for sure? As in someone logged into the machine, and
went to the tw_cli utility and asked it about the status of the
physical drives and virtual RAID array and the card said yes they're
good? No bad sectors being remapped? Hmmm. One of my old tests when
things were acting up was to see if the server could compile the linux
kernel or pgsql back when it took 1.5 hours to do. If you keep
getting sig 11s on production kernel compiles something's wrong with
the system, software or hardware.

I have now upgraded to 8.2.19.

Then I restarted the server, and dropped indexes. When I recreate the
first index, the same thing happens:

------
# CREATE INDEX idx_links_userid ON links (user_id);
server closed the connection unexpectedly
       This probably means the server terminated abnormally
       before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
------

What do the Postgresql logs say at this time? oh wait...

There is nothing going on in the server other than this command. All
other users are blocked!

Logging is enabled but does not have anything!

System logs maybe? Something about a process getting killed? Have
you tried turning up the verbosity of the pg logs?

I am now worried. What is this problem?

We gotta check one thing at a time really.

If you copy the dir off to another machine and run pgsql 8.2.latest or
thereabouts, can you then create the index?

#11Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Scott Marlowe (#10)
Re: Help - corruption issue?

System logs maybe?  Something about a process getting killed?  Have
you tried turning up the verbosity of the pg logs?

Syslog has to be compiled with PG? How do I enable it? Where should I
look for it?

The documentation, whenever it mentions "syslog", always just assumes
the expression "If syslog is enabled". Well where do I enable it? -
http://www.postgresql.org/docs/8.2/static/runtime-config-logging.html

Would appreciate some guidance on this.

We gotta check one thing at a time really.

If you copy the dir off to another machine and run pgsql 8.2.latest or
thereabouts, can you then create the index?

I will try this. Transferring 106GB of data, even zipped, is a huge
ask and just the management will take over a day or so. I was hoping
we could do without this.

#12Scott Marlowe
scott.marlowe@gmail.com
In reply to: Phoenix Kiula (#11)
Re: Help - corruption issue?

On Mon, Apr 18, 2011 at 9:23 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

System logs maybe?  Something about a process getting killed?  Have
you tried turning up the verbosity of the pg logs?

Syslog has to be compiled with PG? How do I enable it? Where should I
look for it?

The documentation, whenever it mentions "syslog", always just assumes
the expression "If syslog is enabled". Well where do I enable it?  -
http://www.postgresql.org/docs/8.2/static/runtime-config-logging.html

Would appreciate some guidance on this.

No I meant the system logs, the ones in /var/log/yadayada. Like
/var/log/message, things like that. See if any of them have anything
interesting happening when things go badly.

syslog is logging using the syslog system which puts logs from various
processes into the /var/log dir, like /var/log/pgsql. Assuming you
have a stock RHEL install I'd expect the pgsql logs to be in
/var/log/pgsql or thereabouts.

We gotta check one thing at a time really.

If you copy the dir off to another machine and run pgsql 8.2.latest or
thereabouts, can you then create the index?

I will try this. Transferring 106GB of data, even zipped, is a huge
ask and just the management will take over a day or so. I was hoping
we could do without this.

On a fast network it should only take a few minutes. Now rsyncing
live 2.4 TB databases, that takes time. :) Your raptors, if they're
working properly, should be able to transfer at around 80 to
100Megabytes a second. 10 to 15 seconds a gig. 30 minutes or so via
gig ethernet. I'd run iostat and see how well my drive array was
performing during a large, largely sequential copy.

#13Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Scott Marlowe (#12)
Re: Help - corruption issue?

On a fast network it should only take a few minutes.  Now rsyncing
live 2.4 TB databases, that takes time. :)  Your raptors, if they're
working properly, should be able to transfer at around 80 to
100Megabytes a second.  10 to 15 seconds a gig.  30 minutes or so via
gig ethernet.  I'd run iostat and see how well my drive array was
performing during a large, largely sequential copy.

OK. An update.

We have changed all the hardware except disks.

REINDEX still gave this problem:

--
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
--

So I rebooted and logged back in a single user mode. All services
stopped. All networking stopped. Only postgresql started. I tried the
REINDEX again.

Same problem :(

This means the problem is likely with data?

I do have a "pg_dumpall" dump from 1 day before. Will lose some data,
but should have most of it.

Is it worth it for me to try and restore from there? What's the best
thing to do right now?

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#13)
Re: Help - corruption issue?

Dne 20.4.2011 12:56, Phoenix Kiula napsal(a):

On a fast network it should only take a few minutes. Now rsyncing
live 2.4 TB databases, that takes time. :) Your raptors, if they're
working properly, should be able to transfer at around 80 to
100Megabytes a second. 10 to 15 seconds a gig. 30 minutes or so via
gig ethernet. I'd run iostat and see how well my drive array was
performing during a large, largely sequential copy.

OK. An update.

We have changed all the hardware except disks.

OK, so the card is working and the drives are fine. Have you run the
tw_cli tool to check the drives? Because it's probably the last thing
that might be faulty and was not replaced.

REINDEX still gave this problem:

--
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
--

Hm, have you checked if there's something else in the logs? More details
about the crash or something like that.

I'd probably try to run strace on the backend, to get more details about
where it crashes. Just find out the PID of the backend dedicated to your
psql session, do

$ strace -p PID > crash.log 2>&1

and then run the REINDEX. Once it crashes you can see the last few lines
from the logfile.

So I rebooted and logged back in a single user mode. All services
stopped. All networking stopped. Only postgresql started. I tried the
REINDEX again.

Same problem :(

This means the problem is likely with data?

Well, maybe. It might be a problem with the data, it might be a bug in
postgres ...

I do have a "pg_dumpall" dump from 1 day before. Will lose some data,
but should have most of it.

Is it worth it for me to try and restore from there? What's the best
thing to do right now?

So have you done the file backup? That's the first thing I'd do.

Anyway what's best depends on how important is the missing piece of
data. We still don't know how to fix the problem, but it sure seems like
a corrupted data.

I think you already know which table is corrupted, right? In that case
you may actually try to find the bad block and erase it (and maybe do a
copy so that we can see what's wrong with it and how it might happen).
There's a very nice guide on how to do that

http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

It sure seems like the problem you have (invalid alloc request etc.).
The really annoying part is locating the block, as you have to scan
through the table (which sucks with such big table).

And yes, if there's corruption, there might be more corrupted blocks.

regards
Tomas

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#14)
Re: Help - corruption issue?

Dne 20.4.2011 22:11, Tomas Vondra napsal(a):

There's a very nice guide on how to do that

http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

It sure seems like the problem you have (invalid alloc request etc.).
The really annoying part is locating the block, as you have to scan
through the table (which sucks with such big table).

And yes, if there's corruption, there might be more corrupted blocks.

BTW, there's a setting 'zero_damaged_pages' that might help with this

http://www.postgresql.org/docs/8.2/static/runtime-config-developer.html

see this talk for more details how to use it

http://www.casitconf.org/casitconf11/Tech_track_2_files/cascadia_postgres_rbernier.pdf

Anyway don't play with this without the file backup, as this will zero
the blocks.

Tomas

#16Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#15)
Re: Help - corruption issue?

On Thu, Apr 21, 2011 at 7:27 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

Dne 20.4.2011 22:11, Tomas Vondra napsal(a):

There's a very nice guide on how to do that

http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

It sure seems like the problem you have (invalid alloc request etc.).
The really annoying part is locating the block, as you have to scan
through the table (which sucks with such big table).

And yes, if there's corruption, there might be more corrupted blocks.

BTW, there's a setting 'zero_damaged_pages' that might help with this

http://www.postgresql.org/docs/8.2/static/runtime-config-developer.html

see this talk for more details how to use it

http://www.casitconf.org/casitconf11/Tech_track_2_files/cascadia_postgres_rbernier.pdf

Anyway don't play with this without the file backup, as this will zero
the blocks.

Tomas

Thanks Tomas. Very handy info.

FIRST: is there anyone on this list who offers PG admin support?
Please write to me directly.

Second, for the strace, which process should I use?

ps auxwww|grep ^postgres
postgres 4320 0.0 0.1 440192 10824 ? Ss 08:49 0:00
/usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 4355 0.0 0.0 11724 964 ? Ss 08:49 0:00
postgres: logger process
postgres 4365 0.0 0.0 440396 3268 ? Ss 08:49 0:00
postgres: writer process
postgres 4366 0.0 0.0 11860 1132 ? Ss 08:49 0:00
postgres: stats collector process
postgres 15795 0.0 0.0 7136 1440 pts/0 S 22:44 0:00 -bash
postgres 15900 0.0 0.0 7860 1956 pts/0 S+ 22:44 0:00 psql -h
localhost MYDOMAIN -E MYDOMAIN_MYDOMAIN
postgres 15901 0.0 0.0 441124 3072 ? Ss 22:44 0:00
postgres: MYDOMAIN_MYDOMAIN MYDOMAIN 127.0.0.1(34346) idle

Third, I have the backup in two ways:

1. I took a backup of the entire "/pgsql/data" folder. PG was shutdown
at the time.
2. I have a pg_dumpall file but it is missing one day's data (still
useful as last resort).

Will #1 have corrupt data in it?

#17Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Phoenix Kiula (#16)
Re: Help - corruption issue?

On Thu, Apr 21, 2011 at 11:49 AM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

On Thu, Apr 21, 2011 at 7:27 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

Dne 20.4.2011 22:11, Tomas Vondra napsal(a):

There's a very nice guide on how to do that

http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

It sure seems like the problem you have (invalid alloc request etc.).
The really annoying part is locating the block, as you have to scan
through the table (which sucks with such big table).

And yes, if there's corruption, there might be more corrupted blocks.

BTW, there's a setting 'zero_damaged_pages' that might help with this

http://www.postgresql.org/docs/8.2/static/runtime-config-developer.html

see this talk for more details how to use it

http://www.casitconf.org/casitconf11/Tech_track_2_files/cascadia_postgres_rbernier.pdf

Anyway don't play with this without the file backup, as this will zero
the blocks.

Tomas

Thanks Tomas. Very handy info.

FIRST: is there anyone on this list who offers PG admin support?
Please write to me directly.

Second, for the strace, which process should I use?

ps auxwww|grep ^postgres
postgres  4320  0.0  0.1 440192 10824 ?      Ss   08:49   0:00
/usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  4355  0.0  0.0 11724  964 ?        Ss   08:49   0:00
postgres: logger process
postgres  4365  0.0  0.0 440396 3268 ?       Ss   08:49   0:00
postgres: writer process
postgres  4366  0.0  0.0 11860 1132 ?        Ss   08:49   0:00
postgres: stats collector process
postgres 15795  0.0  0.0  7136 1440 pts/0    S    22:44   0:00 -bash
postgres 15900  0.0  0.0  7860 1956 pts/0    S+   22:44   0:00 psql -h
localhost MYDOMAIN -E MYDOMAIN_MYDOMAIN
postgres 15901  0.0  0.0 441124 3072 ?       Ss   22:44   0:00
postgres: MYDOMAIN_MYDOMAIN MYDOMAIN 127.0.0.1(34346) idle

Third, I have the backup in two ways:

1. I took a backup of the entire "/pgsql/data" folder. PG was shutdown
at the time.
2. I have a pg_dumpall file but it is missing one day's data (still
useful as last resort).

Will #1 have corrupt data in it?

Tomas,

I did a crash log with the strace for PID of the index command as you
suggested.

Here's the output:
http://www.heypasteit.com/clip/WNR

Also including below, but because this will wrap etc, you can look at
the link above.

Thanks for any ideas or pointers!

Process 15900 attached - interrupt to quit
read(0, "r", 1) = 1
write(1, "r", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "e", 1) = 1
write(1, "e", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "n", 1) = 1
write(1, "n", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "d", 1) = 1
write(1, "d", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "e", 1) = 1
write(1, "e", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "x", 1) = 1
write(1, "x", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "l", 1) = 1
write(1, "l", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "n", 1) = 1
write(1, "n", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(1, "\10\33[K", 4) = 4
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(2, "\7", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(2, "\7", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(2, "\7", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\177", 1) = 1
write(2, "\7", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\\", 1) = 1
write(1, "\\", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "d", 1) = 1
write(1, "d", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "l", 1) = 1
write(1, "l", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "n", 1) = 1
write(1, "n", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "k", 1) = 1
write(1, "k", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "s", 1) = 1
write(1, "s", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\r", 1) = 1
write(1, "\n", 1) = 1
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo ...}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {0x804ddd2, [], SA_RESTORER|SA_RESTART,
0xda2a08}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGALRM, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTSTP, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTTOU, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTTIN, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGWINCH, {SIG_DFL}, {0x12afd0, [], SA_RESTORER, 0xda2a08}, 8) = 0
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT c.oid,\n n.nspname,\n c.r"..., 207) = 207
write(1, "ORDER BY 2, 3;\n*****************"..., 43) = 43
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\0\342SELECT c.oid,\n n.nspname,\n"..., 227, 0) = 227
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0P\0\3oid\0\0\0\4\353\377\376\0\0\0\32\0\4\377\377\377\377\0\0nsp"...,
16384, 0) = 134
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT relhasindex, relkind, rel"..., 95) = 95
write(1, "FROM pg_catalog.pg_class WHERE o"..., 73) = 73
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\0\220SELECT relhasindex, relkind"..., 145, 0) = 145
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0\323\0\7relhasindex\0\0\0\4\353\0\f\0\0\0\20\0\1\377"...,
16384, 0) = 272
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT a.attname,\n pg_catalog.f"..., 369) = 369
write(1, "ORDER BY a.attnum\n**************"..., 46) = 46
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\1\207SELECT a.attname,\n pg_cata"..., 392, 0) = 392
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0\217\0\5attname\0\0\0\4\341\0\2\0\0\0\23\0@\377\377\377\377\0"...,
16384, 0) = 1123
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT c2.relname, i.indisprimar"..., 295) = 295
write(1, "ORDER BY i.indisprimary DESC, i."..., 89) = 89
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\1hSELECT c2.relname, i.indisp"..., 361, 0) = 361
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0\335\0\7relname\0\0\0\4\353\0\1\0\0\0\23\0@\377\377\377\377\0"...,
16384, 0) = 629
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT r.conname, pg_catalog.pg_"..., 95) = 95
write(1, "WHERE r.conrelid = \'50002\' AND r"..., 86) = 86
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\0\235SELECT r.conname, pg_catalo"..., 158, 0) = 158
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0G\0\2conname\0\0\0\n.\0\1\0\0\0\23\0@\377\377\377\377\0"...,
16384, 0) = 330
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT t.tgname, pg_catalog.pg_g"..., 82) = 82
write(1, "WHERE t.tgrelid = \'50002\' AND (n"..., 328) = 328
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\1\202SELECT t.tgname, pg_catalog"..., 387, 0) = 387
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0C\0\2tgname\0\0\0\n<\0\2\0\0\0\23\0@\377\377\377\377\0\0"...,
16384, 0) = 86
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT conname,\n pg_catalog.pg_"..., 103) = 103
write(1, "WHERE r.conrelid = \'50002\' AND r"..., 86) = 86
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\0\245SELECT conname,\n pg_catalo"..., 166, 0) = 166
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0009\0\2conname\0\0\0\n.\0\1\0\0\0\23\0@\377\377\377\377\0"...,
16384, 0) = 76
write(1, "********* QUERY **********\n", 27) = 27
write(1, "SELECT c.relname FROM pg_catalog"..., 169) = 169
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\0\221SELECT c.relname FROM pg_ca"..., 146, 0) = 146
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "T\0\0\0
\0\1relname\0\0\0\4\353\0\1\0\0\0\23\0@\377\377\377\377\0"..., 16384,
0) = 51
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=35, ws_col=128, ws_xpixel=1152, ws_ypixel=630}) = 0
write(1, " "..., 53) = 53
write(1, " Column | "..., 85) = 85
write(1, "-----------------------+--------"..., 85) = 85
write(1, " id | bigint "..., 64) = 64
write(1, " link_id | charact"..., 64) = 64
write(1, " alias | charact"..., 64) = 64
write(1, " aliasentered | charact"..., 75) = 75
write(1, " url | text "..., 64) = 64
write(1, " user_known | smallin"..., 74) = 74
write(1, " user_id | charact"..., 64) = 64
write(1, " url_encrypted | charact"..., 74) = 74
write(1, " title | charact"..., 56) = 56
write(1, " private | charact"..., 56) = 56
write(1, " private_key | charact"..., 56) = 56
write(1, " status | charact"..., 75) = 75
write(1, " create_date | timesta"..., 69) = 69
write(1, " modify_date | timesta"..., 56) = 56
write(1, " disable_in_statistics | charact"..., 84) = 84
write(1, " user_running_id | integer"..., 56) = 56
write(1, " url_host_long | integer"..., 56) = 56
write(1, "Indexes:\n", 9) = 9
write(1, " \"links2_pkey\" PRIMARY KEY, b"..., 42) = 42
write(1, " \"links2_alias_key\" UNIQUE, b"..., 66) = 66
write(1, " \"new_idx_userknown\" btree (u"..., 61) = 61
write(1, "Check constraints:\n", 19) = 19
write(1, " \"links2_id_check\" CHECK (id "..., 37) = 37
write(1, " \"links2_url_check\" CHECK (ur"..., 47) = 47
write(1, " \"links2_user_id_check\" CHECK"..., 61) = 61
write(1, " \"links_alias_check\" CHECK (a"..., 67) = 67
write(1, "\n", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
ioctl(0, TIOCGWINSZ, {ws_row=35, ws_col=128, ws_xpixel=1152, ws_ypixel=630}) = 0
ioctl(0, TIOCSWINSZ, {ws_row=35, ws_col=128, ws_xpixel=1152, ws_ypixel=630}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig -icanon -echo ...}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {0x12b4cd, [], SA_RESTORER, 0xda2a08},
{0x804ddd2, [], SA_RESTORER|SA_RESTART, 0xda2a08}, 8) = 0
rt_sigaction(SIGTERM, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGALRM, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTSTP, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTTOU, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTTIN, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGWINCH, {0x12afd0, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
write(1, "MYDOMAIN=# ", 10) = 10
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "C", 1) = 1
write(1, "C", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "R", 1) = 1
write(1, "R", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "E", 1) = 1
write(1, "E", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "A", 1) = 1
write(1, "A", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "T", 1) = 1
write(1, "T", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "E", 1) = 1
write(1, "E", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "I", 1) = 1
write(1, "I", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "N", 1) = 1
write(1, "N", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "D", 1) = 1
write(1, "D", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "E", 1) = 1
write(1, "E", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "X", 1) = 1
write(1, "X", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "d", 1) = 1
write(1, "d", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "x", 1) = 1
write(1, "x", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "_", 1) = 1
write(1, "_", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "l", 1) = 1
write(1, "l", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "n", 1) = 1
write(1, "n", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "k", 1) = 1
write(1, "k", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "s", 1) = 1
write(1, "s", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "_", 1) = 1
write(1, "_", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "u", 1) = 1
write(1, "u", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "s", 1) = 1
write(1, "s", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "e", 1) = 1
write(1, "e", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "r", 1) = 1
write(1, "r", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "d", 1) = 1
write(1, "d", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "O", 1) = 1
write(1, "O", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "N", 1) = 1
write(1, "N", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "l", 1) = 1
write(1, "l", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "n", 1) = 1
write(1, "n", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "k", 1) = 1
write(1, "k", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "s", 1) = 1
write(1, "s", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, " ", 1) = 1
write(1, " ", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "(", 1) = 1
write(1, "(", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "u", 1) = 1
write(1, "u", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "s", 1) = 1
write(1, "s", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "e", 1) = 1
write(1, "e", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "r", 1) = 1
write(1, "r", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "_", 1) = 1
write(1, "_", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "i", 1) = 1
write(1, "i", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "d", 1) = 1
write(1, "d", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, ")", 1) = 1
write(1, ")", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, ";", 1) = 1
write(1, ";", 1) = 1
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, "\r", 1) = 1
write(1, "\n", 1) = 1
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig icanon echo ...}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {0x804ddd2, [], SA_RESTORER|SA_RESTART,
0xda2a08}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGALRM, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTSTP, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTTOU, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGTTIN, {SIG_DFL}, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, 8) = 0
rt_sigaction(SIGWINCH, {SIG_DFL}, {0x12afd0, [], SA_RESTORER, 0xda2a08}, 8) = 0
gettimeofday({1303357859, 831087}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "Q\0\0\0006CREATE INDEX idx_links_user"..., 55, 0) = 55
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "", 16384, 0) = 0
time(NULL) = 1303359051
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, 0) = 1
recv(3, "", 16384, 0) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libpq.mo", O_RDONLY) =
-1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libpq.mo", O_RDONLY) =
-1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libpq.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libpq.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libpq.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libpq.mo", O_RDONLY) = -1
ENOENT (No such file or directory)
close(3) = 0
write(2, "server closed the connection une"..., 137) = 137
write(2, "The connection to the server was"..., 57) = 57
open("/etc/hosts", O_RDONLY) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=220, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7c5b000
read(3, "#71.71.71.71\t\tMYHOST pkiula"..., 4096) = 220
read(3, "", 4096) = 0
close(3) = 0
munmap(0xb7c5b000, 4096) = 0
open("/etc/hosts", O_RDONLY) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=220, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7c5b000
read(3, "#71.71.71.71\t\tMYHOST pkiula"..., 4096) = 220
close(3) = 0
munmap(0xb7c5b000, 4096) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl64(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(5432),
sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now
in progress)
poll([{fd=3, events=POLLOUT|POLLERR, revents=POLLOUT}], 1, -1) = 1
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_INET, sin_port=htons(35241),
sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
poll([{fd=3, events=POLLOUT|POLLERR, revents=POLLOUT}], 1, -1) = 1
rt_sigprocmask(SIG_BLOCK, [PIPE], [], 8) = 0
send(3, "\0\0\0/\0\3\0\0user\0MYDOMAIN_MYDOMAIN\0dat"..., 47, 0) = 47
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=3, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(3, "E\0\0\0aSFATAL\0C57P03\0Mthe database"..., 16384, 0) = 98
write(2, "Failed.\n", 8) = 8
close(3) = 0
gettimeofday({1303359052, 64835}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
ioctl(0, TIOCGWINSZ, {ws_row=35, ws_col=128, ws_xpixel=1152, ws_ypixel=630}) = 0
ioctl(0, TIOCSWINSZ, {ws_row=35, ws_col=128, ws_xpixel=1152, ws_ypixel=630}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_STOP or TCSETSW, {B38400 opost isig -icanon -echo ...}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {0x12b4cd, [], SA_RESTORER, 0xda2a08},
{0x804ddd2, [], SA_RESTORER|SA_RESTART, 0xda2a08}, 8) = 0
rt_sigaction(SIGTERM, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGALRM, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTSTP, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTTOU, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTTIN, {0x12b4cd, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGWINCH, {0x12afd0, [], SA_RESTORER, 0xda2a08}, {SIG_DFL}, 8) = 0
write(1, "!> ", 3) = 3
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
read(0, <unfinished ...>
Process 15900 detached

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#17)
Re: Help - corruption issue?

Dne 21.4.2011 07:16, Phoenix Kiula napsal(a):

Tomas,

I did a crash log with the strace for PID of the index command as you
suggested.

Here's the output:
http://www.heypasteit.com/clip/WNR

Also including below, but because this will wrap etc, you can look at
the link above.

Thanks for any ideas or pointers!

Process 15900 attached - interrupt to quit

Nope, that's the "psql" process - you need to attach to the backend
process that's created to handle the connection. Whenever you create a
connection (from a psql), a new backend process is forked to handle that
single connection - this is the process you need to strace.

You can either see that in 'ps ax' (the PID is usually +1 with respect
to the psql process), or you can do this

SELECT pg_backend_pid();

as that will give you PID of the backend for the current connection.

regards
Tomas

#19Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#18)
Re: Help - corruption issue?

On Fri, Apr 22, 2011 at 12:51 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

Dne 21.4.2011 07:16, Phoenix Kiula napsal(a):

Tomas,

I did a crash log with the strace for PID of the index command as you
suggested.

Here's the output:
http://www.heypasteit.com/clip/WNR

Also including below, but because this will wrap etc, you can look at
the link above.

Thanks for any ideas or pointers!

Process 15900 attached - interrupt to quit

Nope, that's the "psql" process - you need to attach to the backend
process that's created to handle the connection. Whenever you create a
connection (from a psql), a new backend process is forked to handle that
single connection - this is the process you need to strace.

You can either see that in 'ps ax' (the PID is usually +1 with respect
to the psql process), or you can do this

 SELECT pg_backend_pid();

as that will give you PID of the backend for the current connection.

Thanks. Did that.

The crash.log is a large-ish file, about 24KB. Here's the last 10
lines though. Does this help?

 ~ > tail -10 /root/crash.log
read(58, "`\1\0\0\230\337\0\343\1\0\0\0P\0T\r\0 \3
\374\236\2\2T\215\312\1\354\235\32\2"..., 8192) = 8192
write(97, "213.156.60\0\0 \0\0\0\37\0\364P\3\0\34@\22\0\0\000210."...,
8192) = 8192
read(58, "`\1\0\0\274\362\0\343\1\0\0\0T\0\210\r\0 \3
0\217\352\1\240\236\272\0024\235\322\2"..., 8192) = 8192
read(58, "[\1\0\0\354)c*\1\0\0\0T\0\214\r\0 \3
\254\236\242\2\340\220\342\2\\\235\232\2"..., 8192) = 8192
read(58, "\\\1\0\0\200\245\207\32\1\0\0\0\\\0\340\r\0 \3
\237\272\1\304\235\262\2\340\215\322\1"..., 8192) = 8192
read(58, "\350\0\0\0\274\311x\323\1\0\0\0\\\0000\r\0 \3
\200\236\372\2(\235\252\2\34\234\22\2"..., 8192) = 8192
read(58, ";\1\0\0|#\265\30\1\0\0\0`\0h\r\0 \3
\324\236R\2\314\235\n\2h\215\362\1"..., 8192) = 8192
read(58, "c\1\0\0000\24%u\1\0\0\0\230\0\210\r\0 \3
\240\226\32\16\260\235\252\1p\222Z\10"..., 8192) = 8192
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 17161 detached

The full crash.log file is here if needed:
https://www.yousendit.com/download/ VnBxcmxjNDJlM1JjR0E9PQ

Btw, this happens when I try to create an index on one of the columns
in my table.

Just before this, I had created another index on modify_date (a
timestamp column) and it went fine.

Does that mean anything?

Thanks

#20Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Phoenix Kiula (#19)
Re: Help - corruption issue?

On Fri, Apr 22, 2011 at 12:06 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

On Fri, Apr 22, 2011 at 12:51 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

Dne 21.4.2011 07:16, Phoenix Kiula napsal(a):

Tomas,

I did a crash log with the strace for PID of the index command as you
suggested.

Here's the output:
http://www.heypasteit.com/clip/WNR

Also including below, but because this will wrap etc, you can look at
the link above.

Thanks for any ideas or pointers!

Process 15900 attached - interrupt to quit

Nope, that's the "psql" process - you need to attach to the backend
process that's created to handle the connection. Whenever you create a
connection (from a psql), a new backend process is forked to handle that
single connection - this is the process you need to strace.

You can either see that in 'ps ax' (the PID is usually +1 with respect
to the psql process), or you can do this

 SELECT pg_backend_pid();

as that will give you PID of the backend for the current connection.

Thanks. Did that.

The crash.log is a large-ish file, about 24KB. Here's the last 10
lines though. Does this help?

 ~ > tail -10 /root/crash.log
read(58, "`\1\0\0\230\337\0\343\1\0\0\0P\0T\r\0 \3
\374\236\2\2T\215\312\1\354\235\32\2"..., 8192) = 8192
write(97, "213.156.60\0\0 \0\0\0\37\0\364P\3\0\34@\22\0\0\000210."...,
8192) = 8192
read(58, "`\1\0\0\274\362\0\343\1\0\0\0T\0\210\r\0 \3
0\217\352\1\240\236\272\0024\235\322\2"..., 8192) = 8192
read(58, "[\1\0\0\354)c*\1\0\0\0T\0\214\r\0 \3
\254\236\242\2\340\220\342\2\\\235\232\2"..., 8192) = 8192
read(58, "\\\1\0\0\200\245\207\32\1\0\0\0\\\0\340\r\0 \3
\237\272\1\304\235\262\2\340\215\322\1"..., 8192) = 8192
read(58, "\350\0\0\0\274\311x\323\1\0\0\0\\\0000\r\0 \3
\200\236\372\2(\235\252\2\34\234\22\2"..., 8192) = 8192
read(58, ";\1\0\0|#\265\30\1\0\0\0`\0h\r\0 \3
\324\236R\2\314\235\n\2h\215\362\1"..., 8192) = 8192
read(58, "c\1\0\0000\24%u\1\0\0\0\230\0\210\r\0 \3
\240\226\32\16\260\235\252\1p\222Z\10"..., 8192) = 8192
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 17161 detached

The full crash.log file is here if needed:
https://www.yousendit.com/download/ VnBxcmxjNDJlM1JjR0E9PQ

Btw, this happens when I try to create an index on one of the columns
in my table.

Just before this, I had created another index on modify_date  (a
timestamp column) and it went fine.

Does that mean anything?

Thanks

Probably a dumb and ignorant question, but should I be reseting the xlog?
http://postgresql.1045698.n5.nabble.com/SIGSEGV-when-trying-to-start-in-single-user-mode-td1924418.html

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#20)
#22Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#22)
#24Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#24)
#26Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#25)
#27Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Phoenix Kiula (#26)
#28Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Phoenix Kiula (#27)
#29Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alban Hertroys (#28)
#30Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Tomas Vondra (#29)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phoenix Kiula (#30)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#27)
#34Shashank Tripathi
shanx@shanx.com
In reply to: Tomas Vondra (#33)
#35Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Shashank Tripathi (#34)
#36Scott Marlowe
scott.marlowe@gmail.com
In reply to: Phoenix Kiula (#35)
#37Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Scott Marlowe (#36)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#35)
#39Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Phoenix Kiula (#37)