broken reading on standby (PostgreSQL 16.2)
Hi,
Yesterday I had to fix a strange issue on a standby server.
A query against freshly updated data fails:
select * from seller_success_rate where create_time::date = '2024-04-23';
ERROR: 58P01: could not access status of transaction 1393466389
DETAIL: Could not open file "pg_xact/0530": No such file or directory.
LOCATION: SlruReportIOError, slru.c:947
amcheck reports:
select * from verify_heapam('seller_success_rate');
 blkno | offnum | attnum | msg
-------+--------+--------+-------------------------------------------------------------------
  5763 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5863 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5863 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5868 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5868 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5875 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5895 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5895 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
  6245 |    108 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6245 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6245 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
  6245 |    112 |        | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
  6378 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6378 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6382 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6590 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6590 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  7578 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  7581 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  8390 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
 10598 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
 10598 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
I verified the xmin values against the primary server, and they were the same. There was no replication gap.
I checked the fields from the pg_database table, and they looked the same too.
These rows were valid (and visible) on the primary.
There was no long-running session on this server (while I was connected); unfortunately, I cannot test a restart of this server. One WAL sender is running on the standby. Fortunately, it was possible to run VACUUM FULL, and it fixed the issue.
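A plausible explanation for why this helped: VACUUM FULL rewrites the whole heap into a new relfilenode, and that rewrite is WAL-logged, so the standby receives freshly written tuples instead of the ones with the problematic xmin. A minimal sketch (table name from this thread):

```sql
-- Rewrites the table; requires an ACCESS EXCLUSIVE lock on the primary
VACUUM FULL seller_success_rate;
-- afterwards, re-verify on the standby:
-- select * from verify_heapam('seller_success_rate');
```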
The customer has archived WALs.
My question: is it possible to do some diagnostics at the SQL level? I didn't find a way to get the values that amcheck uses for the comparison from SQL.
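For reference, the horizon in the amcheck output ("oldest valid transaction ID 3:1523885078") is a full transaction ID in epoch:xid form. As far as I know, the horizons that verify_heapam compares against can be inspected from SQL roughly like this (a sketch; the table name is taken from this thread, and the queries should be run on both primary and standby for comparison):

```sql
-- Relation-level freeze horizon
SELECT relfrozenxid, relminmxid
FROM pg_class
WHERE relname = 'seller_success_rate';

-- Database-level horizons
SELECT datname, datfrozenxid, datminmxid
FROM pg_database;

-- Cluster-wide counters from pg_control; next_xid is shown as epoch:xid
SELECT next_xid, oldest_xid
FROM pg_control_checkpoint();
```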
Regards
Pavel
On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote:
yesterday, I had to fix strange issue on standby server
It’s not just broken reads; if this standby is promoted in an HA cluster, this would lead to data loss.
Recently I’ve observed some lost heap updates after OOM-ing a cluster on 14.11. This is most probably unrelated, but I’ll post a link here, just in case [0]. In February and March we had 3 clusters with a similar problem, and that is an unusually big number for us in just 2 months.
Can you check the LSN of the blocks with corrupted tuples with pageinspect on the primary and on the standby? I suspect they are frozen on the primary, but have the usual xmin on the standby.
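A sketch of such a check, assuming pageinspect is installed on the primary (the extension's objects replicate to the standby, and its read-only functions work there); the block and offset numbers are taken from the amcheck output above:

```sql
-- Compare the page LSN between primary and standby for an affected block
SELECT lsn FROM page_header(get_raw_page('seller_success_rate', 5763));

-- Inspect the tuple header; HEAP_XMIN_COMMITTED (0x0100) and
-- HEAP_XMIN_INVALID (0x0200) set together (0x0300 = 768) mean xmin is frozen
SELECT lp, t_xmin, (t_infomask & 768) = 768 AS xmin_frozen
FROM heap_page_items(get_raw_page('seller_success_rate', 5763))
WHERE lp = 111;
```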
Best regards, Andrey Borodin.
[0]: /messages/by-id/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE@yandex-team.ru
On Thu, 25 Apr 2024 at 8:52, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
Can you check LSN of blocks with corrupted tuples with pageinspect on primary and on standby? I suspect they are frozen on primary, but have usual xmin on standby.
Unfortunately, I don't have direct access to the backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on the master didn't help.
On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help
If the primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change anything: the primary will not emit a WAL record to freeze the tuples again.
Best regards, Andrey Borodin.
On Thu, 25 Apr 2024 at 10:52, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
If Primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change anything. Primary will not emit WAL record to freeze tuples again.
Yes, this is possible. I only got the info about the broken xmin, so I expected broken xlog, and that's why I first tested FREEZE.
On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
yesterday, I had to fix strange issue on standby server
Did you mention what version of Postgres this is?
- Melanie
On Thu, 25 Apr 2024 at 12:53, Melanie Plageman <melanieplageman@gmail.com> wrote:
Did you mention what version of Postgres this is?
16.2, and the related rows were one day old.