broken reading on standby (PostgreSQL 16.2)

Started by Pavel Stehuleover 1 year ago7 messages
#1Pavel Stehule
pavel.stehule@gmail.com

Hi

yesterday, I had to fix strange issue on standby server

The query to freshly updated data fails

select * from seller_success_rate where create_time::date = '2024-04-23';
ERROR: 58P01: could not access status of transaction 1393466389
DETAIL: Could not open file "pg_xact/0530": No such file or directory.
LOCATION: SlruReportIOError, slru.c:947

amcheck

select * from verify_heapam('seller_success_rate');
blkno | offnum | attnum | msg

-------+--------+--------+-------------------------------------------------------------------
5763 | 111 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5863 | 109 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5863 | 110 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5868 | 110 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5868 | 111 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5875 | 111 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5895 | 109 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
5895 | 110 | | xmin 1439564642 precedes oldest valid
transaction ID 3:1523885078
6245 | 108 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
6245 | 109 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
6245 | 110 | | xmin 1439564642 precedes oldest valid
transaction ID 3:1523885078
6245 | 112 | | xmin 1424677216 precedes oldest valid
transaction ID 3:1523885078
6378 | 109 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
6378 | 110 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
6382 | 110 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
6590 | 110 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
6590 | 111 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
7578 | 112 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
7581 | 112 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
8390 | 112 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
10598 | 109 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078
10598 | 110 | | xmin 1393466389 precedes oldest valid
transaction ID 3:1523885078

I verified xmin against the primary server, and it was the same. There was
not any replication gap.

I checked the fields from pg_database table, and looks same too

These rows were valid (and visible) on primary.

On this server there was not any long session (when I was connected),
unfortunately I cannot test restart of this server. One wal sender is
executing on standby. Fortunately, there was a possibility to run VACUUM
FULL, and it fixed the issue.

The customer has archived wals.

My question - is it possible to do some diagnostics from SQL level? I
didn't find a way to get values that are used for comparison by amcheck
from SQL.

Regards

Pavel

#2Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Pavel Stehule (#1)
Re: broken reading on standby (PostgreSQL 16.2)

On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote:

yesterday, I had to fix strange issue on standby server

It’s not just broken reading, if this standby is promoted in HA cluster - this would lead to data loss.
Recently I’ve observed some lost heap updates ofter OOM-ing cluster on 14.11. This might be unrelated most probably, but I’ll post a link here, just in case [0]/messages/by-id/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE@yandex-team.ru. In February and March we had 3 clusters with similar problem, and this is unusually big number for us in just 2 months.

Can you check LSN of blocks with corrupted tuples with pageinpsect on primary and on standby? I suspect they are frozen on primary, but have usual xmin on standby.

Best regards, Andrey Borodin.

[0]: /messages/by-id/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE@yandex-team.ru

#3Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrey M. Borodin (#2)
Re: broken reading on standby (PostgreSQL 16.2)

čt 25. 4. 2024 v 8:52 odesílatel Andrey M. Borodin <x4mmm@yandex-team.ru>
napsal:

On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote:

yesterday, I had to fix strange issue on standby server

It’s not just broken reading, if this standby is promoted in HA cluster -
this would lead to data loss.
Recently I’ve observed some lost heap updates ofter OOM-ing cluster on
14.11. This might be unrelated most probably, but I’ll post a link here,
just in case [0]. In February and March we had 3 clusters with similar
problem, and this is unusually big number for us in just 2 months.

Can you check LSN of blocks with corrupted tuples with pageinpsect on
primary and on standby? I suspect they are frozen on primary, but have
usual xmin on standby.

Unfortunately, I have not direct access to backup, so I am not able to test
it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help

Show quoted text

Best regards, Andrey Borodin.

[0]
/messages/by-id/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE@yandex-team.ru

#4Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Pavel Stehule (#3)
Re: broken reading on standby (PostgreSQL 16.2)

On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help

If Primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change anything. Primary will not emit WAL record to freeze tuples again.

Best regards, Andrey Borodin.

#5Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andrey M. Borodin (#4)
Re: broken reading on standby (PostgreSQL 16.2)

čt 25. 4. 2024 v 10:52 odesílatel Andrey M. Borodin <x4mmm@yandex-team.ru>
napsal:

On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Unfortunately, I have not direct access to backup, so I am not able to

test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help

If Primary considers all freezable tuples frozen, but a standby does not,
"disable page skipping" won't change anything. Primary will not emit WAL
record to freeze tuples again.

yes, this is possible. I git just info about broken xmin, so I expected
broken xlog, and then I first tested FREEZE

Show quoted text

Best regards, Andrey Borodin.

#6Melanie Plageman
melanieplageman@gmail.com
In reply to: Pavel Stehule (#1)
Re: broken reading on standby (PostgreSQL 16.2)

On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:

Hi

yesterday, I had to fix strange issue on standby server

The query to freshly updated data fails

select * from seller_success_rate where create_time::date = '2024-04-23';
ERROR: 58P01: could not access status of transaction 1393466389
DETAIL: Could not open file "pg_xact/0530": No such file or directory.
LOCATION: SlruReportIOError, slru.c:947

amcheck

select * from verify_heapam('seller_success_rate');
blkno | offnum | attnum | msg
-------+--------+--------+-------------------------------------------------------------------
5763 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5863 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5863 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5868 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5868 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5875 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5895 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5895 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
6245 | 108 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6245 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6245 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
6245 | 112 | | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
6378 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6378 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6382 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6590 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6590 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
7578 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
7581 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
8390 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
10598 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
10598 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078

I verified xmin against the primary server, and it was the same. There was not any replication gap.

I checked the fields from pg_database table, and looks same too

These rows were valid (and visible) on primary.

On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this server. One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed the issue.

The customer has archived wals.

Did you mention what version of Postgres this is?

- Melanie

#7Pavel Stehule
pavel.stehule@gmail.com
In reply to: Melanie Plageman (#6)
Re: broken reading on standby (PostgreSQL 16.2)

čt 25. 4. 2024 v 12:53 odesílatel Melanie Plageman <
melanieplageman@gmail.com> napsal:

On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Hi

yesterday, I had to fix strange issue on standby server

The query to freshly updated data fails

select * from seller_success_rate where create_time::date = '2024-04-23';
ERROR: 58P01: could not access status of transaction 1393466389
DETAIL: Could not open file "pg_xact/0530": No such file or directory.
LOCATION: SlruReportIOError, slru.c:947

amcheck

select * from verify_heapam('seller_success_rate');
blkno | offnum | attnum | msg

-------+--------+--------+-------------------------------------------------------------------

5763 | 111 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5863 | 109 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5863 | 110 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5868 | 110 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5868 | 111 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5875 | 111 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5895 | 109 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

5895 | 110 | | xmin 1439564642 precedes oldest valid

transaction ID 3:1523885078

6245 | 108 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

6245 | 109 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

6245 | 110 | | xmin 1439564642 precedes oldest valid

transaction ID 3:1523885078

6245 | 112 | | xmin 1424677216 precedes oldest valid

transaction ID 3:1523885078

6378 | 109 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

6378 | 110 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

6382 | 110 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

6590 | 110 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

6590 | 111 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

7578 | 112 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

7581 | 112 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

8390 | 112 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

10598 | 109 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

10598 | 110 | | xmin 1393466389 precedes oldest valid

transaction ID 3:1523885078

I verified xmin against the primary server, and it was the same. There

was not any replication gap.

I checked the fields from pg_database table, and looks same too

These rows were valid (and visible) on primary.

On this server there was not any long session (when I was connected),

unfortunately I cannot test restart of this server. One wal sender is
executing on standby. Fortunately, there was a possibility to run VACUUM
FULL, and it fixed the issue.

The customer has archived wals.

Did you mention what version of Postgres this is?

16.2 and related rows was today one day old

Show quoted text

- Melanie