Hung Postgres Processes
Folks,
I just had this very unpleasant experience for the first time.
An overnight series of data transformations, which usually runs from
12:30am to 1:20am, hung badly. Recovering required a "fast" system
shutdown and restoring the database from backup.
Here are the details:
Platform: Hand-built Dual Athlon MP/Molex RAID 5 (UW SCSI) system.
PostgreSQL 7.2.3
SuSE Linux 7.3
Data imports started normally at 12:00am and apparently completed.
Data transformation process (16-35 UPDATEs and INSERTs affecting a
combined 1,300,000 records) started at about 12:30am after the import
ended. The data transformations are a series of functions called by a
Perl script through cron as the root user.
Sometime during the transformation process, a statement hung. The
procedure continued running for at least 2 hours, at which point
another script, set up to detect such problems, ran a "pg_ctl -m fast
stop". Instead of stopping, the postgresql server hung.
When I got to the machine in the morning, there were 3 processes, one
query, one checkpoint process and the postmaster which were frozen.
SIGHUP and SIGTERM were ignored by these; SIGKILL was able to kill
the postmaster process, but the two other processes went to "D" status
and were untouchable.
I was forced to fast-shutdown the server. While Postgres did restart OK
after restarting the machine, I did not trust the data integrity, and
restored from backup.
Has anyone else encountered this kind of situation? Is there a way to
prevent it, or a less drastic way to resolve it? What are likely
causes?
-Josh Berkus
"Josh Berkus" <josh@agliodbs.com> writes:
> When I got to the machine in the morning, there were 3 processes, one
> query, one checkpoint process and the postmaster which were frozen.
> SIGHUP and SIGTERM were ignored by these; SIGKILL was able to kill
> the postmaster process, but the two other processes went to "D" status
> and were untouchable.
You've got hardware problems. It's not Postgres' fault if the disk
stops responding, which is what a process stuck in disk-wait state
implies.
regards, tom lane
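
For anyone who wants to check for the disk-wait ("D") state described
in this thread, here is a minimal sketch in Python. It assumes a Linux
system with /proc mounted. The function names parse_state and
find_disk_wait are made up for illustration and are not part of
PostgreSQL or any standard tool. A process can sit in "D" briefly
during normal I/O, so one sighting proves little; the same PID showing
up across repeated runs is the worrying sign.

```python
import os


def parse_state(stat_text):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the LAST closing parenthesis.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    # The state letter is the first field after the command name.
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def find_disk_wait(proc_root="/proc"):
    """List (pid, command) for processes currently in state 'D'."""
    if not os.path.isdir(proc_root):
        return []  # no /proc here (assumption: Linux-style procfs)
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the file was unreadable
        if state == "D":
            stuck.append((pid, command))
    return stuck


if __name__ == "__main__":
    for pid, command in find_disk_wait():
        print(pid, command)
```

Run from cron every minute or so and logged, this would show whether
the same backend stays in "D" for long stretches, which points at the
disk or controller rather than at the database.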