help troubleshooting invalid page header error

Started by Cory Zueover 11 years ago9 messagesgeneral

czue@dimagi.com

over 11 years ago

Hi all,

Our postgres instance on one of our production machines has recently been
returning errors of the form "DatabaseError: invalid page header in block 1
of relation base/16384/76623" from normal queries. I've been reading that
these are often linked to hardware errors, but I would like to better
understand what else it could be or how to determine that for sure. I've
filled out the standard issue reporting template below. Any feedback or
troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you
expect.:

Intermittent queries are failing with the error "DatabaseError: invalid
page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting | source

It's also probably worth noting that postgres is installed on an encrypted
volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure what else
I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to consistently
happen with certain queries. However, the system was rebooted shortly
before the errors started occuring. The system was rebooted because another
database (elasticsearch) was having problems on the same machine and the
reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID
configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

- Have you *ever* set fsync=off in the postgresql config file?

- Have you had any unexpected power loss lately? Replaced a failed RAID
disk? Had an operating system crash?

Not recently, though the system did reboot normally as described above.

- Have you run a file system check? (chkdsk / fsck)

No.

- Are there any error messages in the system logs? (unix/linux: dmesg,
/var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

thanks,
Cory

chiru r

chirupg@gmail.com

over 11 years ago

In reply to: Cory Zue (#1)

Re: help troubleshooting invalid page header error

Hi Cory,

We have *zero_damaged_pages* parameter in PostgreSQL configuration,by
default it is set be *off*.
To recover data from corrupted table,we can turn *on* this parameter as a
super user and populate new table using dump or copy utility.

Note : The damaged pages we can't recover from table,it will set to 0 and
it will skip while fetching data from table.

Please follow below steps, if decided to recover data from corrupted table.

*Sample case :*

[postgres@instructor ~]$ /usr/local/pgsql/bin/psql
psql (9.4rc1)
Type "help" for help.

postgres=# select count(*) from test;
*ERROR: invalid page in block 7 of relation base/13003/16384*
postgres=# show zero_damaged_pages;
zero_damaged_pages
--------------------
off
(1 row)

postgres=# *set zero_damaged_pages=on;*
SET
postgres=# show zero_damaged_pages;
zero_damaged_pages
--------------------
on
(1 row)

postgres=# select count(*) from test;
*WARNING: invalid page in block 7 of relation base/13003/16384; zeroing
out page*
WARNING: invalid page in block 8 of relation base/13003/16384; zeroing out
page
WARNING: invalid page in block 9 of relation base/13003/16384; zeroing out
page
WARNING: invalid page in block 10 of relation base/13003/16384; zeroing
out page
WARNING: invalid page in block 11 of relation base/13003/16384; zeroing
out page
WARNING: invalid page in block 12 of relation base/13003/16384; zeroing
out page
WARNING: invalid page in block 13 of relation base/13003/16384; zeroing
out page

count
--------
979163
(1 row)

On Tue, Dec 23, 2014 at 8:47 AM, Cory Zue <czue@dimagi.com> wrote:

Show quoted text

Hi all,

Our postgres instance on one of our production machines has recently been
returning errors of the form "DatabaseError: invalid page header in block
1 of relation base/16384/76623" from normal queries. I've been reading that
these are often linked to hardware errors, but I would like to better
understand what else it could be or how to determine that for sure. I've
filled out the standard issue reporting template below. Any feedback or
troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you
expect.:

Intermittent queries are failing with the error "DatabaseError: invalid
page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |
source

------------------------------+-----------------------------+----------------------
checkpoint_completion_target | 0.9 |
configuration file
checkpoint_segments | 32 |
configuration file
checkpoint_timeout | 15min |
configuration file
DateStyle | ISO, MDY |
configuration file
default_text_search_config | pg_catalog.english |
configuration file
effective_cache_size | 1GB |
configuration file
lc_messages | en_US.UTF-8 |
configuration file
lc_monetary | en_US.UTF-8 |
configuration file
lc_numeric | en_US.UTF-8 |
configuration file
lc_time | en_US.UTF-8 |
configuration file
log_checkpoints | on |
configuration file
log_connections | off |
configuration file
log_destination | csvlog |
configuration file
log_directory | /opt/data/pgsql/data/pg_log |
configuration file
log_disconnections | off |
configuration file
log_duration | on |
configuration file
log_filename | postgres-%Y-%m-%d_%H%M%S |
configuration file
log_lock_waits | on |
configuration file
log_min_duration_statement | 250ms |
configuration file
log_rotation_age | 1d |
configuration file
log_rotation_size | 1GB |
configuration file
log_temp_files | 0 |
configuration file
log_timezone | Asia/Kolkata | command line
log_truncate_on_rotation | on |
configuration file
logging_collector | on |
configuration file
maintenance_work_mem | 768MB |
configuration file
max_connections | 500 |
configuration file
max_stack_depth | 2MB | environment
variable
port | 5432 | command line
shared_buffers | 4GB |
configuration file
ssl | on |
configuration file
TimeZone | Asia/Kolkata | command line
timezone_abbreviations | Default | command line
wal_buffers | 16MB |
configuration file
work_mem | 48MB |
configuration file

It's also probably worth noting that postgres is installed on an encrypted
volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure what
else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to consistently
happen with certain queries. However, the system was rebooted shortly
before the errors started occuring. The system was rebooted because another
database (elasticsearch) was having problems on the same machine and the
reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID
configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

- Have you *ever* set fsync=off in the postgresql config file?

No

- Have you had any unexpected power loss lately? Replaced a failed
RAID disk? Had an operating system crash?

Not recently, though the system did reboot normally as described above.

- Have you run a file system check? (chkdsk / fsck)

No.

- Are there any error messages in the system logs? (unix/linux: dmesg,
/var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

thanks,
Cory

Sameer Kumar

sameer.kumar@ashnik.com

over 11 years ago

In reply to: Cory Zue (#1)

Re: help troubleshooting invalid page header error

On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:

Hi all,

Our postgres instance on one of our production machines has recently been

returning errors of the form "DatabaseError: invalid page header in block 1
of relation base/16384/76623" from normal queries. I've been reading that
these are often linked to hardware errors, but I would like to better
understand what else it could be or how to determine that for sure. I've
filled out the standard issue reporting template below. Any feedback or
troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you

expect.:

Intermittent queries are failing with the error "DatabaseError: invalid

page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)

4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |

source

------------------------------+-----------------------------+----------------------

checkpoint_completion_target | 0.9 |

configuration file

checkpoint_segments | 32 |

configuration file

checkpoint_timeout | 15min |

configuration file

DateStyle | ISO, MDY |

configuration file

default_text_search_config | pg_catalog.english |

configuration file

effective_cache_size | 1GB |

configuration file

lc_messages | en_US.UTF-8 |

configuration file

lc_monetary | en_US.UTF-8 |

configuration file

lc_numeric | en_US.UTF-8 |

configuration file

lc_time | en_US.UTF-8 |

configuration file

log_checkpoints | on |

configuration file

log_connections | off |

configuration file

log_destination | csvlog |

configuration file

log_directory | /opt/data/pgsql/data/pg_log |

configuration file

log_disconnections | off |

configuration file

log_duration | on |

configuration file

log_filename | postgres-%Y-%m-%d_%H%M%S |

configuration file

log_lock_waits | on |

configuration file

log_min_duration_statement | 250ms |

configuration file

log_rotation_age | 1d |

configuration file

log_rotation_size | 1GB |

configuration file

log_temp_files | 0 |

configuration file

log_timezone | Asia/Kolkata | command line
log_truncate_on_rotation | on |

configuration file

logging_collector | on |

configuration file

maintenance_work_mem | 768MB |

configuration file

max_connections | 500 |

configuration file

max_stack_depth | 2MB | environment

variable

port | 5432 | command line
shared_buffers | 4GB |

configuration file

ssl | on |

configuration file

TimeZone | Asia/Kolkata | command line
timezone_abbreviations | Default | command line
wal_buffers | 16MB |

configuration file

work_mem | 48MB |

configuration file

It's also probably worth noting that postgres is installed on an

encrypted volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure what

else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to consistently

happen with certain queries. However, the system was rebooted shortly
before the errors started occuring. The system was rebooted because another
database (elasticsearch) was having problems on the same machine and the
reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID

configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

Have you ever set fsync=off in the postgresql config file?
No
Have you had any unexpected power loss lately? Replaced a failed RAID

disk? Had an operating system crash?

Not recently, though the system did reboot normally as described above.
Have you run a file system check? (chkdsk / fsck)
No.
Are there any error messages in the system logs?

(unix/linux: dmesg, /var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

I guess you missed to provide the details and kernel version (rhel version
and kernel level).
This will give you kernel patch level-

uname -a

I had once faced this issue and I was on a buggy patch of Linux kernel. I
just had to update to latest patch. That worked for me.

Cory Zue

czue@dimagi.com

over 11 years ago

In reply to: Sameer Kumar (#3)

Re: help troubleshooting invalid page header error

Hi all,

Thanks for the responses. Chiru, I'm looking into your suggestion.

Sameer, here is the kernel version info:

Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT 2014
x86_64 x86_64 x86_64 GNU/Linux

Does that seem like it could be a problematic version?

More generally - I'm still wondering whether I should chalk this failure up
to a transient/random issue, or whether I should be more worried about the
hardware on the machine. According to our diagnostic tools, disk and
memory are fine, but it's still not clear to me how it got into this state.
Any general bits of information regarding the potential causes of these
types of issues would be much appreciated.

thanks,
Cory

On Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com>
wrote:

Show quoted text

On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:

Hi all,

Our postgres instance on one of our production machines has recently

been returning errors of the form "DatabaseError: invalid page header in
block 1 of relation base/16384/76623" from normal queries. I've been
reading that these are often linked to hardware errors, but I would like to
better understand what else it could be or how to determine that for sure.
I've filled out the standard issue reporting template below. Any feedback
or troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you

expect.:

Intermittent queries are failing with the error "DatabaseError: invalid

page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)

4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |

source

------------------------------+-----------------------------+----------------------

checkpoint_completion_target | 0.9 |

configuration file

checkpoint_segments | 32 |

configuration file

checkpoint_timeout | 15min |

configuration file

DateStyle | ISO, MDY |

configuration file

default_text_search_config | pg_catalog.english |

configuration file

effective_cache_size | 1GB |

configuration file

lc_messages | en_US.UTF-8 |

configuration file

lc_monetary | en_US.UTF-8 |

configuration file

lc_numeric | en_US.UTF-8 |

configuration file

lc_time | en_US.UTF-8 |

configuration file

log_checkpoints | on |

configuration file

log_connections | off |

configuration file

log_destination | csvlog |

configuration file

log_directory | /opt/data/pgsql/data/pg_log |

configuration file

log_disconnections | off |

configuration file

log_duration | on |

configuration file

log_filename | postgres-%Y-%m-%d_%H%M%S |

configuration file

log_lock_waits | on |

configuration file

log_min_duration_statement | 250ms |

configuration file

log_rotation_age | 1d |

configuration file

log_rotation_size | 1GB |

configuration file

log_temp_files | 0 |

configuration file

log_timezone | Asia/Kolkata | command

line

log_truncate_on_rotation | on |

configuration file

logging_collector | on |

configuration file

maintenance_work_mem | 768MB |

configuration file

max_connections | 500 |

configuration file

max_stack_depth | 2MB |

environment variable

port | 5432 | command

line

shared_buffers | 4GB |

configuration file

ssl | on |

configuration file

TimeZone | Asia/Kolkata | command

line

timezone_abbreviations | Default | command

line

wal_buffers | 16MB |

configuration file

work_mem | 48MB |

configuration file

It's also probably worth noting that postgres is installed on an

encrypted volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure what

else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to consistently

happen with certain queries. However, the system was rebooted shortly
before the errors started occuring. The system was rebooted because another
database (elasticsearch) was having problems on the same machine and the
reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation

base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID

configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

Have you ever set fsync=off in the postgresql config file?
No
Have you had any unexpected power loss lately? Replaced a failed RAID

disk? Had an operating system crash?

Not recently, though the system did reboot normally as described above.
Have you run a file system check? (chkdsk / fsck)
No.
Are there any error messages in the system logs?

(unix/linux: dmesg, /var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

I guess you missed to provide the details and kernel version (rhel version
and kernel level).
This will give you kernel patch level-

uname -a

I had once faced this issue and I was on a buggy patch of Linux kernel. I
just had to update to latest patch. That worked for me.

chiru r

chirupg@gmail.com

over 11 years ago

In reply to: Cory Zue (#4)

Re: help troubleshooting invalid page header error

Hi Cory,

After recovering table turn off *zero_damaged_pages *parameter.

On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:

Show quoted text

Hi all,

Thanks for the responses. Chiru, I'm looking into your suggestion.

Sameer, here is the kernel version info:

Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT
2014 x86_64 x86_64 x86_64 GNU/Linux

Does that seem like it could be a problematic version?

More generally - I'm still wondering whether I should chalk this failure
up to a transient/random issue, or whether I should be more worried about
the hardware on the machine. According to our diagnostic tools, disk and
memory are fine, but it's still not clear to me how it got into this state.
Any general bits of information regarding the potential causes of these
types of issues would be much appreciated.

thanks,
Cory

On Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com>
wrote:

On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:

Hi all,

Our postgres instance on one of our production machines has recently

been returning errors of the form "DatabaseError: invalid page header in
block 1 of relation base/16384/76623" from normal queries. I've been
reading that these are often linked to hardware errors, but I would like to
better understand what else it could be or how to determine that for sure.
I've filled out the standard issue reporting template below. Any feedback
or troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you

expect.:

Intermittent queries are failing with the error "DatabaseError: invalid

page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)

4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |

source

------------------------------+-----------------------------+----------------------

checkpoint_completion_target | 0.9 |

configuration file

checkpoint_segments | 32 |

configuration file

checkpoint_timeout | 15min |

configuration file

DateStyle | ISO, MDY |

configuration file

default_text_search_config | pg_catalog.english |

configuration file

effective_cache_size | 1GB |

configuration file

lc_messages | en_US.UTF-8 |

configuration file

lc_monetary | en_US.UTF-8 |

configuration file

lc_numeric | en_US.UTF-8 |

configuration file

lc_time | en_US.UTF-8 |

configuration file

log_checkpoints | on |

configuration file

log_connections | off |

configuration file

log_destination | csvlog |

configuration file

log_directory | /opt/data/pgsql/data/pg_log |

configuration file

log_disconnections | off |

configuration file

log_duration | on |

configuration file

log_filename | postgres-%Y-%m-%d_%H%M%S |

configuration file

log_lock_waits | on |

configuration file

log_min_duration_statement | 250ms |

configuration file

log_rotation_age | 1d |

configuration file

log_rotation_size | 1GB |

configuration file

log_temp_files | 0 |

configuration file

log_timezone | Asia/Kolkata | command

line

log_truncate_on_rotation | on |

configuration file

logging_collector | on |

configuration file

maintenance_work_mem | 768MB |

configuration file

max_connections | 500 |

configuration file

max_stack_depth | 2MB |

environment variable

port | 5432 | command

line

shared_buffers | 4GB |

configuration file

ssl | on |

configuration file

TimeZone | Asia/Kolkata | command

line

timezone_abbreviations | Default | command

line

wal_buffers | 16MB |

configuration file

work_mem | 48MB |

configuration file

It's also probably worth noting that postgres is installed on an

encrypted volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure what

else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to

consistently happen with certain queries. However, the system was rebooted
shortly before the errors started occuring. The system was rebooted because
another database (elasticsearch) was having problems on the same machine
and the reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation

base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID

configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

Have you ever set fsync=off in the postgresql config file?
No
Have you had any unexpected power loss lately? Replaced a failed RAID

disk? Had an operating system crash?

Not recently, though the system did reboot normally as described above.
Have you run a file system check? (chkdsk / fsck)
No.
Are there any error messages in the system logs?

(unix/linux: dmesg, /var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

I guess you missed to provide the details and kernel version (rhel
version and kernel level).
This will give you kernel patch level-

uname -a

I had once faced this issue and I was on a buggy patch of Linux kernel. I
just had to update to latest patch. That worked for me.

Cory Zue

czue@dimagi.com

over 11 years ago

In reply to: chiru r (#5)

Re: help troubleshooting invalid page header error

Hi Chiru,

I am trying to pg_dump the database to have a snapshot of the current
state. I've turned on 'zero_damaged_pages' but pg_dump is still failing
with an "invalid page header" error - this time from what looks like a
sequence object that is auto-setting IDs on a table. Any advice on how to
remove this error?

Here is the full query that's failing:

SELECT sequence_name, start_value, last_value, increment_by, CASE WHEN
increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN
increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS
max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN
increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE
min_value END AS min_value, cache_value, is_cycled, is_called from
unfinishedsubmissionstub_id_seq

On Fri, Dec 26, 2014 at 2:35 PM, chiru r <chirupg@gmail.com> wrote:

Show quoted text

Hi Cory,

After recovering table turn off *zero_damaged_pages *parameter.

On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:

Hi all,

Thanks for the responses. Chiru, I'm looking into your suggestion.

Sameer, here is the kernel version info:

Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT
2014 x86_64 x86_64 x86_64 GNU/Linux

Does that seem like it could be a problematic version?

More generally - I'm still wondering whether I should chalk this failure
up to a transient/random issue, or whether I should be more worried about
the hardware on the machine. According to our diagnostic tools, disk and
memory are fine, but it's still not clear to me how it got into this state.
Any general bits of information regarding the potential causes of these
types of issues would be much appreciated.

thanks,
Cory

On Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com>
wrote:

On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:

Hi all,

Our postgres instance on one of our production machines has recently

been returning errors of the form "DatabaseError: invalid page header in
block 1 of relation base/16384/76623" from normal queries. I've been
reading that these are often linked to hardware errors, but I would like to
better understand what else it could be or how to determine that for sure.
I've filled out the standard issue reporting template below. Any feedback
or troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you

expect.:

Intermittent queries are failing with the error "DatabaseError:

invalid page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc

(GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |

source

------------------------------+-----------------------------+----------------------

checkpoint_completion_target | 0.9 |

configuration file

checkpoint_segments | 32 |

configuration file

checkpoint_timeout | 15min |

configuration file

DateStyle | ISO, MDY |

configuration file

default_text_search_config | pg_catalog.english |

configuration file

effective_cache_size | 1GB |

configuration file

lc_messages | en_US.UTF-8 |

configuration file

lc_monetary | en_US.UTF-8 |

configuration file

lc_numeric | en_US.UTF-8 |

configuration file

lc_time | en_US.UTF-8 |

configuration file

log_checkpoints | on |

configuration file

log_connections | off |

configuration file

log_destination | csvlog |

configuration file

log_directory | /opt/data/pgsql/data/pg_log |

configuration file

log_disconnections | off |

configuration file

log_duration | on |

configuration file

log_filename | postgres-%Y-%m-%d_%H%M%S |

configuration file

log_lock_waits | on |

configuration file

log_min_duration_statement | 250ms |

configuration file

log_rotation_age | 1d |

configuration file

log_rotation_size | 1GB |

configuration file

log_temp_files | 0 |

configuration file

log_timezone | Asia/Kolkata | command

line

log_truncate_on_rotation | on |

configuration file

logging_collector | on |

configuration file

maintenance_work_mem | 768MB |

configuration file

max_connections | 500 |

configuration file

max_stack_depth | 2MB |

environment variable

port | 5432 | command

line

shared_buffers | 4GB |

configuration file

ssl | on |

configuration file

TimeZone | Asia/Kolkata | command

line

timezone_abbreviations | Default | command

line

wal_buffers | 16MB |

configuration file

work_mem | 48MB |

configuration file

It's also probably worth noting that postgres is installed on an

encrypted volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure what

else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to

consistently happen with certain queries. However, the system was rebooted
shortly before the errors started occuring. The system was rebooted because
another database (elasticsearch) was having problems on the same machine
and the reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation

base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID

configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

Have you ever set fsync=off in the postgresql config file?
No
Have you had any unexpected power loss lately? Replaced a failed RAID

disk? Had an operating system crash?

Not recently, though the system did reboot normally as described above.
Have you run a file system check? (chkdsk / fsck)
No.
Are there any error messages in the system logs?

(unix/linux: dmesg, /var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

I guess you missed to provide the details and kernel version (rhel
version and kernel level).
This will give you kernel patch level-

uname -a

I had once faced this issue and I was on a buggy patch of Linux kernel.
I just had to update to latest patch. That worked for me.

Cory Zue

czue@dimagi.com

over 11 years ago

In reply to: Cory Zue (#6)

Re: help troubleshooting invalid page header error

(nevermind - it looks like the zero_damaged_pages setting only took for the
duration of the session)

On Fri, Dec 26, 2014 at 5:15 PM, Cory Zue <czue@dimagi.com> wrote:

Show quoted text

Hi Chiru,

I am trying to pg_dump the database to have a snapshot of the current
state. I've turned on 'zero_damaged_pages' but pg_dump is still failing
with an "invalid page header" error - this time from what looks like a
sequence object that is auto-setting IDs on a table. Any advice on how to
remove this error?

Here is the full query that's failing:

SELECT sequence_name, start_value, last_value, increment_by, CASE WHEN
increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN
increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS
max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN
increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE
min_value END AS min_value, cache_value, is_cycled, is_called from
unfinishedsubmissionstub_id_seq

On Fri, Dec 26, 2014 at 2:35 PM, chiru r <chirupg@gmail.com> wrote:

Hi Cory,

After recovering table turn off *zero_damaged_pages *parameter.

On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:

Hi all,

Thanks for the responses. Chiru, I'm looking into your suggestion.

Sameer, here is the kernel version info:

Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT
2014 x86_64 x86_64 x86_64 GNU/Linux

Does that seem like it could be a problematic version?

More generally - I'm still wondering whether I should chalk this failure
up to a transient/random issue, or whether I should be more worried about
the hardware on the machine. According to our diagnostic tools, disk and
memory are fine, but it's still not clear to me how it got into this state.
Any general bits of information regarding the potential causes of these
types of issues would be much appreciated.

thanks,
Cory

On Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com>
wrote:

On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:

Hi all,

Our postgres instance on one of our production machines has recently

been returning errors of the form "DatabaseError: invalid page header in
block 1 of relation base/16384/76623" from normal queries. I've been
reading that these are often linked to hardware errors, but I would like to
better understand what else it could be or how to determine that for sure.
I've filled out the standard issue reporting template below. Any feedback
or troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you

expect.:

Intermittent queries are failing with the error "DatabaseError:

invalid page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc

(GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |

source

------------------------------+-----------------------------+----------------------

checkpoint_completion_target | 0.9 |

configuration file

checkpoint_segments | 32 |

configuration file

checkpoint_timeout | 15min |

configuration file

DateStyle | ISO, MDY |

configuration file

default_text_search_config | pg_catalog.english |

configuration file

effective_cache_size | 1GB |

configuration file

lc_messages | en_US.UTF-8 |

configuration file

lc_monetary | en_US.UTF-8 |

configuration file

lc_numeric | en_US.UTF-8 |

configuration file

lc_time | en_US.UTF-8 |

configuration file

log_checkpoints | on |

configuration file

log_connections | off |

configuration file

log_destination | csvlog |

configuration file

log_directory | /opt/data/pgsql/data/pg_log |

configuration file

log_disconnections | off |

configuration file

log_duration | on |

configuration file

log_filename | postgres-%Y-%m-%d_%H%M%S |

configuration file

log_lock_waits | on |

configuration file

log_min_duration_statement | 250ms |

configuration file

log_rotation_age | 1d |

configuration file

log_rotation_size | 1GB |

configuration file

log_temp_files | 0 |

configuration file

log_timezone | Asia/Kolkata | command

line

log_truncate_on_rotation | on |

configuration file

logging_collector | on |

configuration file

maintenance_work_mem | 768MB |

configuration file

max_connections | 500 |

configuration file

max_stack_depth | 2MB |

environment variable

port | 5432 | command

line

shared_buffers | 4GB |

configuration file

ssl | on |

configuration file

TimeZone | Asia/Kolkata | command

line

timezone_abbreviations | Default | command

line

wal_buffers | 16MB |

configuration file

work_mem | 48MB |

configuration file

It's also probably worth noting that postgres is installed on an

encrypted volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure

what else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to

consistently happen with certain queries. However, the system was rebooted
shortly before the errors started occuring. The system was rebooted because
another database (elasticsearch) was having problems on the same machine
and the reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation

base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID

configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

Have you ever set fsync=off in the postgresql config file?
No
Have you had any unexpected power loss lately? Replaced a failed RAID

disk? Had an operating system crash?

Not recently, though the system did reboot normally as described

above.

Have you run a file system check? (chkdsk / fsck)
No.
Are there any error messages in the system logs?

(unix/linux: dmesg, /var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

I guess you missed to provide the details and kernel version (rhel
version and kernel level).
This will give you kernel patch level-

uname -a

I had once faced this issue and I was on a buggy patch of Linux kernel.
I just had to update to latest patch. That worked for me.

Cory Zue

czue@dimagi.com

over 11 years ago

In reply to: Cory Zue (#6)

Re: help troubleshooting invalid page header error

Hi again,

I was able to get the database back to a normal functional state using
the zero_damaged_pages
flag. However, after getting everything working and starting to use the
database again, I am again getting "invalid page header" errors on a
certain table.

Does this imply there is a hardware issue on my machine? Is there anything
else that could be causing this to come back?

thanks,
Cory

On Fri, Dec 26, 2014 at 5:15 PM, Cory Zue <czue@dimagi.com> wrote:

Show quoted text

Hi Chiru,

I am trying to pg_dump the database to have a snapshot of the current
state. I've turned on 'zero_damaged_pages' but pg_dump is still failing
with an "invalid page header" error - this time from what looks like a
sequence object that is auto-setting IDs on a table. Any advice on how to
remove this error?

Here is the full query that's failing:

SELECT sequence_name, start_value, last_value, increment_by, CASE WHEN
increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN
increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS
max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN
increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE
min_value END AS min_value, cache_value, is_cycled, is_called from
unfinishedsubmissionstub_id_seq

On Fri, Dec 26, 2014 at 2:35 PM, chiru r <chirupg@gmail.com> wrote:

Hi Cory,

After recovering table turn off *zero_damaged_pages *parameter.

On Fri, Dec 26, 2014 at 9:13 PM, Cory Zue <czue@dimagi.com> wrote:

Hi all,

Thanks for the responses. Chiru, I'm looking into your suggestion.

Sameer, here is the kernel version info:

Linux dimagi 2.6.32-431.20.5.el6.x86_64 #1 SMP Wed Jul 16 05:26:53 EDT
2014 x86_64 x86_64 x86_64 GNU/Linux

Does that seem like it could be a problematic version?

More generally - I'm still wondering whether I should chalk this failure
up to a transient/random issue, or whether I should be more worried about
the hardware on the machine. According to our diagnostic tools, disk and
memory are fine, but it's still not clear to me how it got into this state.
Any general bits of information regarding the potential causes of these
types of issues would be much appreciated.

thanks,
Cory

On Fri, Dec 26, 2014 at 6:55 AM, Sameer Kumar <sameer.kumar@ashnik.com>
wrote:

On 23 Dec 2014 12:05, "Cory Zue" <czue@dimagi.com> wrote:

Hi all,

Our postgres instance on one of our production machines has recently

been returning errors of the form "DatabaseError: invalid page header in
block 1 of relation base/16384/76623" from normal queries. I've been
reading that these are often linked to hardware errors, but I would like to
better understand what else it could be or how to determine that for sure.
I've filled out the standard issue reporting template below. Any feedback
or troubleshooting instructions would be much appreciated.

---
A description of what you are trying to achieve and what results you

expect.:

Intermittent queries are failing with the error "DatabaseError:

invalid page header in block 1 of relation base/16384/76623"

PostgreSQL version number you are running:

PostgreSQL 8.4.13 on x86_64-redhat-linux-gnu, compiled by GCC gcc

(GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

from standard package installer

Changes made to the settings in the postgresql.conf file:

name | current_setting |

source

------------------------------+-----------------------------+----------------------

checkpoint_completion_target | 0.9 |

configuration file

checkpoint_segments | 32 |

configuration file

checkpoint_timeout | 15min |

configuration file

DateStyle | ISO, MDY |

configuration file

default_text_search_config | pg_catalog.english |

configuration file

effective_cache_size | 1GB |

configuration file

lc_messages | en_US.UTF-8 |

configuration file

lc_monetary | en_US.UTF-8 |

configuration file

lc_numeric | en_US.UTF-8 |

configuration file

lc_time | en_US.UTF-8 |

configuration file

log_checkpoints | on |

configuration file

log_connections | off |

configuration file

log_destination | csvlog |

configuration file

log_directory | /opt/data/pgsql/data/pg_log |

configuration file

log_disconnections | off |

configuration file

log_duration | on |

configuration file

log_filename | postgres-%Y-%m-%d_%H%M%S |

configuration file

log_lock_waits | on |

configuration file

log_min_duration_statement | 250ms |

configuration file

log_rotation_age | 1d |

configuration file

log_rotation_size | 1GB |

configuration file

log_temp_files | 0 |

configuration file

log_timezone | Asia/Kolkata | command

line

log_truncate_on_rotation | on |

configuration file

logging_collector | on |

configuration file

maintenance_work_mem | 768MB |

configuration file

max_connections | 500 |

configuration file

max_stack_depth | 2MB |

environment variable

port | 5432 | command

line

shared_buffers | 4GB |

configuration file

ssl | on |

configuration file

TimeZone | Asia/Kolkata | command

line

timezone_abbreviations | Default | command

line

wal_buffers | 16MB |

configuration file

work_mem | 48MB |

configuration file

It's also probably worth noting that postgres is installed on an

encrypted volume which is mounted using ecryptfs.

Operating system and version:

RedHatEnterpriseServer, version 6.6

What program you're using to connect to PostgreSQL:

Python (django)

Is there anything relevant or unusual in the PostgreSQL server logs?:

I see lots of instances of this error (and similar). I'm not sure

what else I should be looking for.

What you were doing when the error happened / how to cause the error:

I haven't explicitly tried to reproduce it, but it seems to

consistently happen with certain queries. However, the system was rebooted
shortly before the errors started occuring. The system was rebooted because
another database (elasticsearch) was having problems on the same machine
and the reboot was to attempt to resolve things.

The EXACT TEXT of the error message you're getting, if there is one:

DatabaseError: invalid page header in block 1 of relation

base/16384/76623

(block and relation numbers change)

Unfortunately, I'm not completely familiar with the CPU and disk/RAID

configurations used on the server. However it is storing to a (software)
encrypted volume as mentioned above.

Have you ever set fsync=off in the postgresql config file?
No
Have you had any unexpected power loss lately? Replaced a failed RAID

disk? Had an operating system crash?

Not recently, though the system did reboot normally as described

above.

Have you run a file system check? (chkdsk / fsck)
No.
Are there any error messages in the system logs?

(unix/linux: dmesg, /var/log/syslog ;

I haven't seen anything obvious but I wasn't sure what to look for.

I guess you missed to provide the details and kernel version (rhel
version and kernel level).
This will give you kernel patch level-

uname -a

I had once faced this issue and I was on a buggy patch of Linux kernel.
I just had to update to latest patch. That worked for me.

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 11 years ago

In reply to: Cory Zue (#8)

Re: help troubleshooting invalid page header error

Cory Zue <czue@dimagi.com> wrote:

I was able to get the database back to a normal functional state
using the zero_damaged_pages flag. However, after getting
everything working and starting to use the database again, I am
again getting "invalid page header" errors on a certain table.

Does this imply there is a hardware issue on my machine? Is there
anything else that could be causing this to come back?

In my personal experience bad hardware is the most common cause,
followed by buggy device drivers (where an OS software upgrade
prevented further corruption), followed by using incorrect
procedures for backup, restore, replication setup, or node
promotion. For example, not excluding files under pg_xlog from a
base backup or deleting (or moving) the backup_label file can cause
corruption.

For a more complete discussion, see this blog page:

http://rhaas.blogspot.com/2012/03/why-is-my-database-corrupted.html

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general