Broken hint bits (freeze)
Hi hackers,
We have a problem in production with hint bits and frozen tuples.
After a switchover, the following errors began to appear on the master more and more often:
ERROR: 58P01: could not access status of transaction 1952523525
DETAIL: Could not open file "pg_clog/0746": No such file or directory.
LOCATION: SlruReportIOError, slru.c:896
We investigated the problem with pageinspect and found the tuples that are the cause:
xdb311g(master)=# select * from mytable where ctid = '(4,21)';
ERROR: 58P01: could not access status of transaction 1951521353
DETAIL: Could not open file "pg_clog/0745": No such file or directory.
LOCATION: SlruReportIOError, slru.c:896
But the same query executes successfully on the replica.
We found a difference in hint bits between the master and the replica:
xdb311g(master)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM heap_page_items(get_raw_page('mytable',4)) where lp=21;
-[ RECORD 1 ]------------------------------
t_xmin | 1951521353
?column? | 00000000000000000000000000000000
old master, now replica:
xdb311e(replica)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM heap_page_items(get_raw_page('mytable',4)) where lp=21;
-[ RECORD 1 ]------------------------------
t_xmin | 1951521353
?column? | 00000000000000000000001100000000
X'0300' = HEAP_XMIN_FROZEN according to:
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
xdb311g(master)=# select relfrozenxid from pg_class where relname = 'mytable';
relfrozenxid
--------------
2266835605
(1 row)
This tuple must be frozen (its xmin precedes relfrozenxid), but neither HEAP_XMIN_COMMITTED nor HEAP_XMIN_INVALID is set on the master.
Another interesting thing is that the LSN of this page differs between master and replica:
xdb311g(master)=# select lsn from page_header(get_raw_page('mytable',4));
lsn
---------------
8092/6A26DD08
(1 row)
xdb311e(replica)=# select lsn from page_header(get_raw_page('mytable',4));
lsn
---------------
838D/C4A0D280
(1 row)
And the LSN on the replica is greater than the LSN on the master (838D/C4A0D280 > 8092/6A26DD08).
How can this be possible?
We wrote a query that returns the ctids of tuples that must be frozen (xmin precedes relfrozenxid) but are not actually frozen:
xdb311e(replica)=# select t_ctid from generate_series(0, pg_relation_size('mytable')/8192 - 1 ) s(i) left join lateral heap_page_items(get_raw_page('mytable',s.i::int)) on true where t_xmin::text::bigint < (select relfrozenxid::text::bigint from pg_class where relname = 'mytable') and t_infomask & X'0300'::int < 1;
t_ctid
-----------
(400,16)
(2837,71)
(2837,72)
(2837,73)
(2837,75)
(2837,76)
(3042,40)
(4750,80)
(4750,81)
(5214,60)
(5214,65)
(6812,31)
(6912,63)
(7329,8)
(7374,26)
(7374,27)
(16 rows)
Same query on master returns 317 rows.
Our thoughts:
1) We think that it is related to the switchover.
2) Any WAL-logged modification of this page on the master will replace the page on the replicas due to full-page writes,
and then all replicas will have broken hint bits too. It's dangerous.
Where should we dig further?
RHEL6, PostgreSQL 9.6.3, wal_log_hints=off, full_page_writes=on, fsync=on, checksums disabled.
We don't think these are hardware-related problems, because these databases started on 9.4,
survived two upgrades with pg_upgrade, and no hardware problems were ever detected.
The problem appears in other shards as well.
Each shard is around 5TB, so we can't provide the data.
Regards
Dmitriy Sarafannikov
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
We found that this problem also appears on shards with checksums enabled.
This shard is on timeline 1, which means there was no switchover after the upgrade to 9.6.
xdb11f(master)=# select pg_current_xlog_location(), pg_xlogfile_name(pg_current_xlog_location());
pg_current_xlog_location | pg_xlogfile_name
--------------------------+--------------------------
30BA/5966AD38 | 00000001000030BA00000059
(1 row)
xdb11f(master)=# select * from page_header(get_raw_page('mytable', 1787));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
1F43/8C432C60 | -3337 | 5 | 256 | 304 | 8192 | 8192 | 4 | 0
(1 row)
xdb11h(replica)=# select * from page_header(get_raw_page(‘mytable', 1787));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
1B28/45819C28 | -17617 | 5 | 256 | 304 | 8192 | 8192 | 4 | 0
(1 row)
xdb11e(replica)=# select * from page_header(get_raw_page('mytable', 1787));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
1B28/45819C28 | -17617 | 5 | 256 | 304 | 8192 | 8192 | 4 | 0
(1 row)
The master has the newer page version (higher LSN) and the freeze bits set.
xdb11f(master)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) from heap_page_items(get_raw_page('mytable', 1787)) where lp = 42;
t_xmin | ?column?
-----------+----------------------------------
516651778 | 00000000000000000000001100000000
(1 row)
xdb11h(replica)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) from heap_page_items(get_raw_page('mytable', 1787)) where lp = 42;
t_xmin | ?column?
-----------+----------------------------------
516651778 | 00000000000000000000000000000000
(1 row)
xdb11e(replica)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) from heap_page_items(get_raw_page('mytable', 1787)) where lp = 42;
t_xmin | ?column?
-----------+----------------------------------
516651778 | 00000000000000000000000000000000
(1 row)
It seems like the replicas did not replay the corresponding WAL records.
Any thoughts?
Regards,
Dmitriy Sarafannikov
On Wed, May 24, 2017 at 7:27 AM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> It seems like replica did not replayed corresponding WAL records.
> Any thoughts?
heap_xlog_freeze_page() is a pretty simple function. It's not
impossible that it could have a bug that causes it to incorrectly skip
records, but it's not clear why that wouldn't affect many other replay
routines equally, since the pattern of using the return value of
XLogReadBufferForRedo() to decide what to do is widespread.
Can you prove that other WAL records generated around the same time as
the freeze record *were* replayed on the master? If so, that proves
that this isn't just a case of the WAL never reaching the standby.
Can you look at the segment that contains the relevant freeze record
with pg_xlogdump? Maybe that record is messed up somehow.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 24 May 2017, at 15:44, Robert Haas <robertmhaas@gmail.com> wrote:
> [...]
> Can you prove that other WAL records generated around the same time as
> the freeze record *were* replayed on the master? If so, that proves
> that this isn't just a case of the WAL never reaching the standby.
> Can you look at the segment that contains the relevant freeze record
> with pg_xlogdump? Maybe that record is messed up somehow.
Not yet. Most such cases are from long before our recovery window, so the corresponding WALs have been deleted. We have already tuned the retention policy and are now looking for a fresh case.
--
May the force be with you…
https://simply.name
On Tue, May 23, 2017 at 10:50 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> [...]
> And LSN on replica is greater that LSN on master (838D/C4A0D280 > 8092/6A26DD08)
> How can this be possible?
Yeah, I think this is quite suspicious. This seems to indicate that
not all WAL records are replicated before the switchover. What is the
value of "synchronous_commit" you are using? I think you somehow need
to ensure before switchover that all the WAL is replicated to ensure
this is not a setup problem.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 26, 2017 at 2:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, I think this is quite suspicious. This seems to indicate that
> not all WAL records are replicated before the switchover. What is the
> value of "synchronous_commit" you are using? I think you somehow need
> to ensure before switchover that all the WAL is replicated to ensure
> this is not a setup problem.
It is so easy to corrupt a server because of an incorrect base backup
flow or an incorrect switchover that it would be good first to
understand how you are doing your switchover. Any corruption happening
after a promotion, a failover or a switchover may be the tip of the
iceberg of what's on the data pages, and you may just see one symptom
among other problems. Particularly, did you kill the master in any
violent way after promoting the standby? Has the former master been
able to perform at least once a clean shutdown checkpoint and has it
been rewound?
--
Michael
On 26 May 2017, at 23:04, Michael Paquier <michael.paquier@gmail.com> wrote:
> [...]
> Particularly, did you kill the master in any
> violent way after promoting the standby? Has the former master been
> able to perform at least once a clean shutdown checkpoint and has it
> been rewound?
First we cleanly stop the former master, and then we additionally check that all of its data has been received by the standby to be promoted. The old master is brought back by simply generating a recovery.conf and starting it; all other standbys are handled the same way. The WAL history on all hosts and in the archive remains linear, so no pg_rewind is needed. This procedure was well tested on 9.3 (when the ability to do such a switchover without restoring standbys from a backup appeared), is automated, and has not changed in all these years.
Actually, we have already found that the LSNs of all corrupted tuples are from somewhere around the time of the upgrade from 9.5 to 9.6. There is still no hard evidence, but it seems mostly related to the upgrade procedure. We are now extracting 9.5 and 9.6 backups of a database where we have corrupted pages in order to check this. But it is still not obvious whether it is a pg_upgrade bug or our mistake in the way we did the upgrade.
--
May the force be with you…
https://simply.name
On 26 May 2017, at 21:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> And LSN on replica is greater that LSN on master (838D/C4A0D280 > 8092/6A26DD08)
>> How can this be possible?
>
> Yeah, I think this is quite suspicious. This seems to indicate that
> not all WAL records are replicated before the switchover. What is the
> value of "synchronous_commit" you are using?
synchronous_commit = on.
> I think you somehow need
> to ensure before switchover that all the WAL is replicated to ensure
> this is not a setup problem.
Well, actually a clean shutdown of the master with exit code 0 from `pg_ctl stop -m fast` guarantees that all WAL has been replicated to the standby. But just in case we also check that "Latest checkpoint's REDO location" from the control file on the old master after shutdown is less than pg_last_xlog_replay_location() on the standby to be promoted.
And if something went wrong in the above logic, postgres would not let you attach the old master as a standby of the new master. So it is most probably not a setup problem.
--
May the force be with you…
https://simply.name
On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
> Well, actually clean shutdown of master with exit code 0 from `pg_ctl
> stop -m fast` guarantees that all WAL has been replicated to standby.
It does not. It makes it likely, but the connection to the standby
could be not up just then, you could run into walsender timeout, and a
bunch of other scenarios.
> But just in case we also check that "Latest checkpoint's REDO
> location" from control file on old master after shutdown is less than
> pg_last_xlog_replay_location() on standby to be promoted.
The *redo* location? Or the checkpoint location itself? Because the
latter is what needs to be *equal* to the replay location, not less
than it. Normally there won't be other records in between, but that's not
guaranteed.
> And if something would go wrong in above logic, postgres will not let you attach old master as a standby of new master. So it is highly probable not a setup problem.
There's no such guarantee. There's a bunch of checks that'll somewhat
likely trigger, but nothing more than that.
- Andres
On 27 May 2017, at 19:56, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
>> Well, actually clean shutdown of master with exit code 0 from `pg_ctl
>> stop -m fast` guarantees that all WAL has been replicated to standby.
>
> It does not. It makes it likely, but the connection to the standby
> could be not up just then, you could run into walsender timeout, and a
> bunch of other scenarios.
AFAIK in those cases the exit code would not be zero. Even if the archiver has not been able to archive all WALs before the shutdown timeout, the exit code will not be zero.
>> But just in case we also check that "Latest checkpoint's REDO
>> location" from control file on old master after shutdown is less than
>> pg_last_xlog_replay_location() on standby to be promoted.
>
> The *redo* location? Or the checkpoint location itself? Because the
> latter is what needs to be *equal* than the replay location not less
> than. Normally there won't be other records inbetween, but that's not
> guaranteed.
I've asked about it some time ago [1]. In that case the checkpoint location and the redo location were equal after shutdown, and the last replay location on the standby was higher by 104 bytes (the size of the shutdown checkpoint record).
But we do check exactly the redo location. Should we change it to check the checkpoint location instead?
[1]: /messages/by-id/A7683985-2EC2-40AD-AAAC-B44BD0F29723@simply.name
>> And if something would go wrong in above logic, postgres will not let you attach old master as a standby of new master. So it is highly probable not a setup problem.
>
> There's no such guarantee. There's a bunch of checks that'll somewhat
> likely trigger, but nothing more than that.
--
May the force be with you…
https://simply.name
On Sat, May 27, 2017 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
>> Well, actually clean shutdown of master with exit code 0 from `pg_ctl
>> stop -m fast` guarantees that all WAL has been replicated to standby.
>
> It does not. It makes it likely, but the connection to the standby
> could be not up just then, you could run into walsender timeout, and a
> bunch of other scenarios.
Amen.
>> And if something would go wrong in above logic, postgres will not let you attach old master as a standby of new master. So it is highly probable not a setup problem.
>
> There's no such guarantee. There's a bunch of checks that'll somewhat
> likely trigger, but nothing more than that.
Yes. Take for example the case where the host with the primary is
powered off and another host with a standby is promoted. If at the next
restart you directly add a recovery.conf to the old primary and
attempt to use it as a standby of the new primary, it may be able to
connect and begin replication. That will result in a corrupted
standby.
--
Michael
On Sat, May 27, 2017 at 10:18 PM, Vladimir Borodin <root@simply.name> wrote:
>> On 26 May 2017, at 21:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I think you somehow need
>>> to ensure before switchover that all the WAL is replicated to ensure
>>> this is not a setup problem.
>
> Well, actually clean shutdown of master with exit code 0 from `pg_ctl stop
> -m fast` guarantees that all WAL has been replicated to standby.
I don't see any such guarantee in the code or docs. Can you explain what
makes you think that for 'fast' mode exit code 0 guarantees that
all the WAL has been replicated?
> But just in
> case we also check that "Latest checkpoint's REDO location" from control
> file on old master after shutdown is less than
> pg_last_xlog_replay_location() on standby to be promoted.
>
> And if something would go wrong in above logic, postgres will not let you
> attach old master as a standby of new master.
I think it will be possible to attach old master as a standby of new
master as some new operations on the new master can increase its LSN
position to a value greater than what old master has. Your statement
will make sense if you ensure that you don't allow any new operation
on the new master till old master has attached to it as standby.
> So it is highly probable not a
> setup problem.
Yeah, it is quite possible that your setup is perfectly fine and there
is actually some code bug due to which you are facing the problem,
however, it is better to rule out all the possibilities related to the
wrong setup.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks, everyone.
We found the source of the problem. It was a mistake in our upgrade to 9.6.
We upgraded the replicas with rsync as described in the documentation:
rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir
We must provide 100% read-only availability of our shard (master + 2 replicas),
so we can't stop the master and both replicas, upgrade them one by one, and then start them.
We do it as follows:
Close master from load, stop master, upgrade it, stop 1st replica, upgrade it, start 1st replica,
stop 2nd replica, upgrade it, start 2nd replica, start master, open master.
But upgraded replicas died under load without statistics, so we decided to run
ANALYZE on the master before upgrading the replicas. That way the statistics would be copied to the replicas by rsync.
The upgrade algorithm became as follows:
Close master, stop master, close master from replicas (iptables), upgrade master,
start master, perform analyze, stop master, stop 1st replica, upgrade 1st replica,
start 1st replica, stop 2nd replica, upgrade 2nd replica, start 2nd replica,
start master, open master.
If autovacuum starts vacuuming relations while we are performing the ANALYZE, the WAL records
it generates will never be replayed on the replicas, because the next step stops the
master with a checkpoint, and a new redo location LSN (newer than those WAL records)
appears in the pg_control file, which is then copied by rsync to the replicas.
If it was a plain vacuum, we would most likely never see the consequences: it marks
tuples as deleted, new tuples will later be placed there, and due to FPW the
replicas will receive a correct page, identical to the master's.
But if it was a vacuum to prevent wraparound, we get a situation like ours: tuples on the
master are frozen, but on the replicas they are not. And nothing changes as long as nobody
updates any tuple on this page.
It's dangerous because, if we perform a switchover to a replica, the "corrupted" page
will be delivered to all replicas after the next update of any tuple on this page.
We reproduced this case in our test environment and confirmed the assumption.
Starting and stopping the master after running pg_upgrade but before the rsync, in order
to collect statistics, was a bad idea.
We know how to find such "corrupted" tuples, and we want to fix this by manually
freezing them via specially written C functions. The functions are copy-pasted
and simplified code from the vacuum functions, with an SQL interface (see attachment).
Can you look at them? Do you think it is safe to use them for fixing corrupted pages,
or is there a better way not to lose data?
Regards,
Dmitriy Sarafannikov
Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes:
> Starting and stopping master after running pg_upgrade but before rsync to collect statistics
> was a bad idea.
But, starting and stopping master after running pg_upgrade is *required*
by documentation:
https://www.postgresql.org/docs/9.6/static/pgupgrade.html
f. Start and stop the new master cluster
In the new master cluster, change wal_level to replica in the postgresql.conf file and then start and stop the cluster.
and there is no suggestion to disable autovacuum for it.
--
Sergey Burladyan
On 6 June 2017, at 23:30, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes:
>> Starting and stopping master after running pg_upgrade but before rsync to collect statistics
>> was a bad idea.
>
> But, starting and stopping master after running pg_upgrade is *required*
> by documentation:
> https://www.postgresql.org/docs/9.6/static/pgupgrade.html
>
> f. Start and stop the new master cluster
> In the new master cluster, change wal_level to replica in the postgresql.conf file and then start and stop the cluster.
>
> and there is no any suggestion to disable autovacuum for it.
Yep. This should probably be fixed in the documentation?
--
May the force be with you…
https://simply.name
Vladimir Borodin <root@simply.name> writes:
> [...]
> Yep. This should probably be fixed in the documentation?
I think so. There are some gaps in the pg_upgrade documentation; it says nothing about:
1. preventing heap changes by vacuum, analyze or anything else while the master
is started after pg_upgrade but before the rsync;
2. a log-shipping-only standby cannot shut down at the same checkpoint as the
master.
I tried to start a discussion about this: /messages/by-id/87y3ta49zp.fsf@seb.koffice.internal
but without luck :-)
PS: I CC'd Bruce here.
--
Sergey Burladyan
On Fri, Jun 2, 2017 at 4:20 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> [...]
> If it was simple vacuum, we most likely will not see the consequences. Because it marks
> tuples as deleted, and some of the next new tuples will be placed here, and due to FPW
> replicas will receive correct page, identical to master.
> But if it was vacuum to prevent wraparound, we will see situation like ours. Tuples on
> master will be frozen, but on replicas not. And it will not change if nobody will not
> update any tuple on this page.
Why didn't rsync make the copies on master and replica the same?
> [...]
> We know how to find such "corrupted" tuples. And we want to fix this by manually
> freezing tuples via calling specially written C functions. Functions are "copy-pasted"
> and simplified code from vacuum functions with SQL interface (see attachment).
> Can you look on them? Do you think it is safe to use them for fixing corrupted pages
> or is there a better way not to loose data?
I haven't looked in detail, but it sounds like a slightly risky proposition
to manipulate tuples by writing C functions of the form you have
in your code. I would have preferred some way to avoid this problem,
by ensuring that the replicas are properly synced (receive the master's complete data
via WAL) or by disabling autovacuum.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
> Why didn't rsync made the copies on master and replica same?
Because rsync was running with the --size-only flag.
> I haven't looked in detail, but it sounds slightly risky proposition
> to manipulate the tuples by writing C functions of the form you have
> in your code. I would have preferred some way to avoid this problem
> by ensuring that replicas are properly synced (complete data of master
> via WAL) or by disabling autovacuum.
Avoiding this problem would be good, but what do we do with the already corrupted data?
Can you explain in more detail what you mean?
On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
>> Why didn't rsync made the copies on master and replica same?
>
> Because rsync was running with --size-only flag.
IIUC the situation, the new WAL and the updated pg_control file were
copied, but not the updated data files, due to which the WAL has not been
replayed on the replicas? If so, why was the pg_control file copied? Its
size shouldn't have changed.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 8 June 2017, at 17:03, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
> <dsarafannikov@yandex.ru> wrote:
>>> Why didn't rsync made the copies on master and replica same?
>>
>> Because rsync was running with --size-only flag.
>
> IIUC the situation, the new WAL and updated pg_control file has been
> copied, but not updated data files due to which the WAL has not been
> replayed on replicas? If so, why the pg_control file is copied, it's
> size shouldn't have changed?
Because on the master pg_upgrade moves $prefix/9.5/data/global/pg_control to $prefix/9.5/data/global/pg_control.old and creates a new $prefix/9.6/data/global/pg_control without making a hardlink. When running rsync from master to replica, rsync sees $prefix/9.6/data/global/pg_control on the master and checks whether it is a hardlink. Since it is not a hardlink and $prefix/9.6/data/global/pg_control does not exist on the replica, rsync copies it. For the data files the logic is different: they are hardlinks, the corresponding files exist on the replica, and they are the same size.
--
May the force be with you…
https://simply.name