Collect statistics about conflicts in logical replication

smithpb2250@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#1)

Re: Collect statistics about conflicts in logical replication

Hi Hou-san. Here are some review comments for your patch v1-0001.

======
doc/src/sgml/logical-replication.sgml

nit - added a comma.

======
doc/src/sgml/monitoring.sgml

nit - use <literal> for 'apply_error_count'.
nit - added a period when there are multiple sentences.
nit - adjusted field descriptions using Chat-GPT clarification suggestions

======
src/include/pgstat.h

nit - change the param to 'type' -- ie. same as the implementation calls it

======
src/include/replication/conflict.h

nit - defined 'NUM_CONFLICT_TYPES' inside the enum (I think this way
is often used in other PG source enums)

======
src/test/subscription/t/026_stats.pl

1.
+ # Delete data from the test table on the publisher. This delete operation
+ # should be skipped on the subscriber since the table is already empty.
+ $node_publisher->safe_psql($db, qq(DELETE FROM $table_name;));
+
+ # Wait for the subscriber to report tuple missing conflict.
+ $node_subscriber->poll_query_until(
+ $db,
+ qq[
+ SELECT update_missing_count > 0 AND delete_missing_count > 0
+ FROM pg_stat_subscription_stats
+ WHERE subname = '$sub_name'
+ ])
+   or die
+   qq(Timed out while waiting for tuple missing conflict for
subscription '$sub_name');

Can you write a comment to explain why the replicated DELETE is
expected to increment both the 'update_missing_count' and the
'delete_missing_count'?

~
nit - update several "Apply and Sync errors..." comments that did not
mention conflicts
nit - tweak comments wording for update_differ and delete_differ
nit - /both > 0/> 0/
nit - /both 0/0/

======
Kind Regards,
Peter Smith.
Fujitsu Australia

[1]: /messages/by-id/202201130922.izanq4hkkqnx@alvherre.pgsql

houzj.fnst@fujitsu.com

over 1 year ago

In reply to: Peter Smith (#2)

RE: Collect statistics about conflicts in logical replication

On Monday, August 26, 2024 3:30 PM Peter Smith <smithpb2250@gmail.com> wrote:

======
src/include/replication/conflict.h

nit - defined 'NUM_CONFLICT_TYPES' inside the enum (I think this way is
often used in other PG source enums)

I think we have recently tended to avoid doing that, as it has been commented
that this style is somewhat deceptive and can cause confusion. See a previous
similar comment[1]/messages/by-id/202201130922.izanq4hkkqnx@alvherre.pgsql. The current style follows the other existing examples like:

#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
#define BACKEND_NUM_TYPES (B_LOGGER + 1)
...

======
src/test/subscription/t/026_stats.pl

1.
+ # Delete data from the test table on the publisher. This delete
+ operation # should be skipped on the subscriber since the table is already
empty.
+ $node_publisher->safe_psql($db, qq(DELETE FROM $table_name;));
+
+ # Wait for the subscriber to report tuple missing conflict.
+ $node_subscriber->poll_query_until(
+ $db,
+ qq[
+ SELECT update_missing_count > 0 AND delete_missing_count > 0 FROM
+ pg_stat_subscription_stats WHERE subname = '$sub_name'
+ ])
+   or die
+   qq(Timed out while waiting for tuple missing conflict for
subscription '$sub_name');

Can you write a comment to explain why the replicated DELETE is
expected to increment both the 'update_missing_count' and the
'delete_missing_count'?

I think the comments several lines above the wait explained the reason[2].. # Truncate test table to ensure the upcoming update operation is skipped # and the test can continue. $node_subscriber->safe_psql($db, qq(TRUNCATE $table_name));. I
slightly modified the comments to make it clear.

Other changes look good to me and have been merged, thanks!

Here is the V2 patch.

[2]: .. # Truncate test table to ensure the upcoming update operation is skipped # and the test can continue. $node_subscriber->safe_psql($db, qq(TRUNCATE $table_name));
..
# Truncate test table to ensure the upcoming update operation is skipped
# and the test can continue.
$node_subscriber->safe_psql($db, qq(TRUNCATE $table_name));

Best Regards,
Hou zj

smithpb2250@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#3)

Re: Collect statistics about conflicts in logical replication

On Mon, Aug 26, 2024 at 10:13 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Monday, August 26, 2024 3:30 PM Peter Smith <smithpb2250@gmail.com> wrote:

======
src/include/replication/conflict.h

nit - defined 'NUM_CONFLICT_TYPES' inside the enum (I think this way is
often used in other PG source enums)

I think we have recently tended to avoid doing that, as it has been commented
that this style is somewhat deceptive and can cause confusion. See a previous
similar comment[1]. The current style follows the other existing examples like:

#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
#define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
#define BACKEND_NUM_TYPES (B_LOGGER + 1)

OK.

======
src/test/subscription/t/026_stats.pl
1.
+ # Delete data from the test table on the publisher. This delete
+ operation # should be skipped on the subscriber since the table is already
empty.
+ $node_publisher->safe_psql($db, qq(DELETE FROM $table_name;));
+
+ # Wait for the subscriber to report tuple missing conflict.
+ $node_subscriber->poll_query_until(
+ $db,
+ qq[
+ SELECT update_missing_count > 0 AND delete_missing_count > 0 FROM
+ pg_stat_subscription_stats WHERE subname = '$sub_name'
+ ])
+   or die
+   qq(Timed out while waiting for tuple missing conflict for
subscription '$sub_name');
Can you write a comment to explain why the replicated DELETE is
expected to increment both the 'update_missing_count' and the
'delete_missing_count'?
I think the comments several lines above the wait explained the reason[2]. I
slightly modified the comments to make it clear.

1.
Right, but it still was not obvious to me what caused the
'update_missing_count'. On further study, I see it was a hangover from
the earlier UPDATE test case which was still stuck in an ERROR loop
attempting to do the update operation. e.g. before it was giving the
expected 'update_exists' conflicts but after the subscriber table
TRUNCATE the update conflict changes to give a 'update_missing'
conflict instead. I've updated the comment to reflect my
understanding. Please have a look to see if you agree.

~~~~

2.
I separated the tests for 'update_missing' and 'delete_missing',
putting the update_missing test *before* the DELETE. I felt the
expected results were much clearer when each test did just one thing.
Please have a look to see if you agree.

~~~

3.
+# Enable track_commit_timestamp to detect origin-differ conflicts in logical
+# replication. Reduce wal_retrieve_retry_interval to 1ms to accelerate the
+# restart of the logical replication worker after encountering a conflict.
+$node_subscriber->append_conf(
+ 'postgresql.conf', q{
+track_commit_timestamp = on
+wal_retrieve_retry_interval = 1ms
+});

Later, after CDR resolvers are implemented, it might be good to
revisit these conflict test cases and re-write them to use some
conflict resolvers like 'skip'. Then the subscriber won't give ERRORs
and restart apply workers all the time behind the scenes, so you won't
need the above configuration for accelerating the worker restarts. In
other words, running these tests might be more efficient if you can
avoid restarting workers all the time.

I suggest putting an XXX comment here as a reminder that these tests
should be revisited to make use of conflict resolvers in the future.

~~~

nit - not caused by this patch, but other comment inconsistencies
about "stats_reset timestamp" can be fixed in passing too.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

houzj.fnst@fujitsu.com

over 1 year ago

In reply to: Peter Smith (#4)

RE: Collect statistics about conflicts in logical replication

On Tuesday, August 27, 2024 10:59 AM Peter Smith <smithpb2250@gmail.com> wrote:

~~~
3.
+# Enable track_commit_timestamp to detect origin-differ conflicts in
+logical # replication. Reduce wal_retrieve_retry_interval to 1ms to
+accelerate the # restart of the logical replication worker after encountering a
conflict.
+$node_subscriber->append_conf(
+ 'postgresql.conf', q{
+track_commit_timestamp = on
+wal_retrieve_retry_interval = 1ms
+});
Later, after CDR resolvers are implemented, it might be good to revisit these
conflict test cases and re-write them to use some conflict resolvers like 'skip'.
Then the subscriber won't give ERRORs and restart apply workers all the time
behind the scenes, so you won't need the above configuration for accelerating
the worker restarts. In other words, running these tests might be more efficient
if you can avoid restarting workers all the time.

I suggest putting an XXX comment here as a reminder that these tests should
be revisited to make use of conflict resolvers in the future.

I think it would be too early to mention the resolution implementation detail
in the comments considering that the resolution is still not RFC. Also, I think
reducing wal_retrieve_retry_interval is a reasonable way to speed up the test
in this case because the test is not letting the worker to restart all the time, the
error causes the restart will be resolved immediately after the stats check. So, I
think adding XXX is not very appropriate.

Other comments look good to me.
I slightly adjusted few words and merged the changes. Thanks !

Here is V3 patch.

Best Regards,
Hou zj

[1]: https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW

shveta.malik@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#5)

Re: Collect statistics about conflicts in logical replication

On Tue, Aug 27, 2024 at 3:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, August 27, 2024 10:59 AM Peter Smith <smithpb2250@gmail.com> wrote:
~~~
3.
+# Enable track_commit_timestamp to detect origin-differ conflicts in
+logical # replication. Reduce wal_retrieve_retry_interval to 1ms to
+accelerate the # restart of the logical replication worker after encountering a
conflict.
+$node_subscriber->append_conf(
+ 'postgresql.conf', q{
+track_commit_timestamp = on
+wal_retrieve_retry_interval = 1ms
+});
Later, after CDR resolvers are implemented, it might be good to revisit these
conflict test cases and re-write them to use some conflict resolvers like 'skip'.
Then the subscriber won't give ERRORs and restart apply workers all the time
behind the scenes, so you won't need the above configuration for accelerating
the worker restarts. In other words, running these tests might be more efficient
if you can avoid restarting workers all the time.

I suggest putting an XXX comment here as a reminder that these tests should
be revisited to make use of conflict resolvers in the future.
I think it would be too early to mention the resolution implementation detail
in the comments considering that the resolution is still not RFC. Also, I think
reducing wal_retrieve_retry_interval is a reasonable way to speed up the test
in this case because the test is not letting the worker to restart all the time, the
error causes the restart will be resolved immediately after the stats check. So, I
think adding XXX is not very appropriate.

Other comments look good to me.
I slightly adjusted few words and merged the changes. Thanks !

Here is V3 patch.

Thanks for the patch. Just thinking out loud, since we have names like
'apply_error_count', 'sync_error_count' which tells that they are
actually error-count, will it be better to have something similar in
conflict-count cases, like 'insert_exists_conflict_count',
'delete_missing_conflict_count' and so on. Thoughts?

I noticed that now we do mention this (as I suggested earlier):
+ Note that any conflict resulting in an apply error will be counted
in both apply_error_count and the corresponding conflict count.

But we do not mention clearly which ones are conflict-counts. As an
example, we have this:

+ insert_exists_count bigint:
+ Number of times a row insertion violated a NOT DEFERRABLE unique
constraint during the application of changes

It does not mention that it is a conflict count. So we need to either
change names or mention clearly against each that it is a conflict
count.

thanks
sHveta

Amit Kapila

amit.kapila16@gmail.com

over 1 year ago

In reply to: shveta malik (#6)

Re: Collect statistics about conflicts in logical replication

On Wed, Aug 28, 2024 at 11:43 AM shveta malik <shveta.malik@gmail.com> wrote:

Thanks for the patch. Just thinking out loud, since we have names like
'apply_error_count', 'sync_error_count' which tells that they are
actually error-count, will it be better to have something similar in
conflict-count cases, like 'insert_exists_conflict_count',
'delete_missing_conflict_count' and so on. Thoughts?

It would be better to have conflict in the names but OTOH it will make
the names a bit longer. The other alternatives could be (a)
insert_exists_confl_count, etc. (b) confl_insert_exists_count, etc.
(c) confl_insert_exists, etc. These are based on the column names in
the existing view pg_stat_database_conflicts [1]https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW. The (c) looks better
than other options but it will make the conflict-related columns
different from error-related columns.

Yet another option is to have a different view like
pg_stat_subscription_conflicts but that sounds like going too far.

--
With Regards,
Amit Kapila.

======
[1]: https://www.postgresql.org/docs/devel/monitoring-stats.html
[2]: https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW
[3]: https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-SLOTS-VIEW
[4]: https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-TABLES-VIEW

smithpb2250@gmail.com

over 1 year ago

In reply to: Amit Kapila (#7)

Re: Collect statistics about conflicts in logical replication

On Wed, Aug 28, 2024 at 9:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 28, 2024 at 11:43 AM shveta malik <shveta.malik@gmail.com> wrote:

Thanks for the patch. Just thinking out loud, since we have names like
'apply_error_count', 'sync_error_count' which tells that they are
actually error-count, will it be better to have something similar in
conflict-count cases, like 'insert_exists_conflict_count',
'delete_missing_conflict_count' and so on. Thoughts?

It would be better to have conflict in the names but OTOH it will make
the names a bit longer. The other alternatives could be (a)
insert_exists_confl_count, etc. (b) confl_insert_exists_count, etc.
(c) confl_insert_exists, etc. These are based on the column names in
the existing view pg_stat_database_conflicts [1]. The (c) looks better
than other options but it will make the conflict-related columns
different from error-related columns.

Yet another option is to have a different view like
pg_stat_subscription_conflicts but that sounds like going too far.

[1] - https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW

Option (c) looked good to me.

Removing the suffix "_count" is OK. For example, try searching all of
Chapter 27 ("The Cumulative Statistics System") [1]https://www.postgresql.org/docs/devel/monitoring-stats.html for columns
described as "Number of ..." and you will find that a "_count" suffix
is used only rarely.

Adding the prefix "confl_" is OK. As mentioned, there is a precedent
for this. See "pg_stat_database_conflicts" [2]https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW.

Mixing column names where some have and some do not have "_count"
suffixes may not be ideal, but I see no problem because there are
precedents for that too. E.g. see "pg_stat_replication_slots" [3]https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-SLOTS-VIEW, and
"pg_stat_all_tables" [4]https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-TABLES-VIEW.

Kind Regards,
Peter Smith.
Fujitsu Australia

smithpb2250@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#3)

Re: Collect statistics about conflicts in logical replication

Hi Hou-San.

I tried an experiment where I deliberately violated a primary key
during initial table synchronization.

For example:

test_sub=# create table t1(a int primary key);
CREATE TABLE

test_sub=# insert into t1 values(1);
INSERT 0 1

test_sub=# create subscription sub1 connection 'dbname=test_pub'
publication pub1 with (enabled=false);
2024-08-29 09:53:21.172 AEST [24186] WARNING: subscriptions created
by regression test cases should have names starting with "regress_"
WARNING: subscriptions created by regression test cases should have
names starting with "regress_"
NOTICE: created replication slot "sub1" on publisher
CREATE SUBSCRIPTION

test_sub=# select * from pg_stat_subscription_stats;
subid | subname | apply_error_count | sync_error_count |
insert_exists_count | update_differ_count | update_exists_count |
update_missing_count | de
lete_differ_count | delete_missing_count | stats_reset
-------+---------+-------------------+------------------+---------------------+---------------------+---------------------+----------------------+---
------------------+----------------------+-------------
16390 | sub1 | 0 | 0 |
0 | 0 | 0 |
0 |
0 | 0 |
(1 row)

test_sub=# alter subscription sub1 enable;
ALTER SUBSCRIPTION

test_sub=# 2024-08-29 09:53:57.245 AEST [4345] LOG: logical
replication apply worker for subscription "sub1" has started
2024-08-29 09:53:57.258 AEST [4347] LOG: logical replication table
synchronization worker for subscription "sub1", table "t1" has started
2024-08-29 09:53:57.311 AEST [4347] ERROR: duplicate key value
violates unique constraint "t1_pkey"
2024-08-29 09:53:57.311 AEST [4347] DETAIL: Key (a)=(1) already exists.
2024-08-29 09:53:57.311 AEST [4347] CONTEXT: COPY t1, line 1
2024-08-29 09:53:57.312 AEST [23501] LOG: background worker "logical
replication tablesync worker" (PID 4347) exited with exit code 1
2024-08-29 09:54:02.385 AEST [4501] LOG: logical replication table
synchronization worker for subscription "sub1", table "t1" has started
2024-08-29 09:54:02.462 AEST [4501] ERROR: duplicate key value
violates unique constraint "t1_pkey"
2024-08-29 09:54:02.462 AEST [4501] DETAIL: Key (a)=(1) already exists.
2024-08-29 09:54:02.462 AEST [4501] CONTEXT: COPY t1, line 1
2024-08-29 09:54:02.463 AEST [23501] LOG: background worker "logical
replication tablesync worker" (PID 4501) exited with exit code 1
2024-08-29 09:54:07.512 AEST [4654] LOG: logical replication table
synchronization worker for subscription "sub1", table "t1" has started
2024-08-29 09:54:07.580 AEST [4654] ERROR: duplicate key value
violates unique constraint "t1_pkey"
2024-08-29 09:54:07.580 AEST [4654] DETAIL: Key (a)=(1) already exists.
2024-08-29 09:54:07.580 AEST [4654] CONTEXT: COPY t1, line 1
...

test_sub=# alter subscription sub1 disable;'
ALTER SUBSCRIPTION
2024-08-29 09:55:10.329 AEST [4345] LOG: logical replication worker
for subscription "sub1" will stop because the subscription was
disabled

~~~

Notice how after a while there were multiple (15) 'sync_error_count' recorded.

According to the docs: 'insert_exists' happens when "Inserting a row
that violates a NOT DEFERRABLE unique constraint.". So why are there
not the same number of 'insert_exists_count' recorded in
pg_stat_subscription_stats?

The 'insert_exists' is either not happening or is not being counted
during table synchronization. Either way, it was not what I was
expecting. If it is not a bug, maybe the docs need to explain more
about the rules for 'insert_exists' during the initial table sync.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#10

houzj.fnst@fujitsu.com

over 1 year ago

In reply to: Peter Smith (#9)

RE: Collect statistics about conflicts in logical replication

On Thursday, August 29, 2024 8:31 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi,

I tried an experiment where I deliberately violated a primary key during initial
table synchronization.

For example:

...

test_sub=# 2024-08-29 09:53:57.245 AEST [4345] LOG: logical replication
apply worker for subscription "sub1" has started
2024-08-29 09:53:57.258 AEST [4347] LOG: logical replication table
synchronization worker for subscription "sub1", table "t1" has started
2024-08-29 09:53:57.311 AEST [4347] ERROR: duplicate key value violates
unique constraint "t1_pkey"
2024-08-29 09:53:57.311 AEST [4347] DETAIL: Key (a)=(1) already exists.
2024-08-29 09:53:57.311 AEST [4347] CONTEXT: COPY t1, line 1
~~~

Notice how after a while there were multiple (15) 'sync_error_count' recorded.

According to the docs: 'insert_exists' happens when "Inserting a row that
violates a NOT DEFERRABLE unique constraint.". So why are there not the
same number of 'insert_exists_count' recorded in pg_stat_subscription_stats?

Because this error was caused by COPY instead of an INSERT (e.g., CONTEXT: COPY
t1, line 1), so this is as expected. The doc of conflict counts(
insert_exists_count) has already mentioned that it counts the conflict only *during the
application of changes* which is clear to me that it doesn't count the ones in
initial table synchronization. See the existing apply_error_count where we also
has similar wording(e.g. "an error occurred while applying changes").

Best Regards,
Hou zj

#11

shveta.malik@gmail.com

over 1 year ago

In reply to: Peter Smith (#8)

Re: Collect statistics about conflicts in logical replication

On Thu, Aug 29, 2024 at 4:59 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Aug 28, 2024 at 9:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Aug 28, 2024 at 11:43 AM shveta malik <shveta.malik@gmail.com> wrote:

Thanks for the patch. Just thinking out loud, since we have names like
'apply_error_count', 'sync_error_count' which tells that they are
actually error-count, will it be better to have something similar in
conflict-count cases, like 'insert_exists_conflict_count',
'delete_missing_conflict_count' and so on. Thoughts?

It would be better to have conflict in the names but OTOH it will make
the names a bit longer. The other alternatives could be (a)
insert_exists_confl_count, etc. (b) confl_insert_exists_count, etc.
(c) confl_insert_exists, etc. These are based on the column names in
the existing view pg_stat_database_conflicts [1]. The (c) looks better
than other options but it will make the conflict-related columns
different from error-related columns.

Yet another option is to have a different view like
pg_stat_subscription_conflicts but that sounds like going too far.

Yes, I think we are good with pg_stat_subscription_stats for the time being.

[1] - https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW

Option (c) looked good to me.

+1 for option c. it should be okay to not have '_count' in the name.

Show quoted text

Removing the suffix "_count" is OK. For example, try searching all of
Chapter 27 ("The Cumulative Statistics System") [1] for columns
described as "Number of ..." and you will find that a "_count" suffix
is used only rarely.

Adding the prefix "confl_" is OK. As mentioned, there is a precedent
for this. See "pg_stat_database_conflicts" [2].

Mixing column names where some have and some do not have "_count"
suffixes may not be ideal, but I see no problem because there are
precedents for that too. E.g. see "pg_stat_replication_slots" [3], and
"pg_stat_all_tables" [4].

======
[1] https://www.postgresql.org/docs/devel/monitoring-stats.html
[2] https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-CONFLICTS-VIEW
[3] https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-SLOTS-VIEW
[4] https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-TABLES-VIEW

Kind Regards,
Peter Smith.
Fujitsu Australia

#12

houzj.fnst@fujitsu.com

over 1 year ago

In reply to: shveta malik (#11)

RE: Collect statistics about conflicts in logical replication

On Thursday, August 29, 2024 11:18 AM shveta malik <shveta.malik@gmail.com> wrote:

On Thu, Aug 29, 2024 at 4:59 AM Peter Smith <smithpb2250@gmail.com>
wrote:

On Wed, Aug 28, 2024 at 9:19 PM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Wed, Aug 28, 2024 at 11:43 AM shveta malik

<shveta.malik@gmail.com> wrote:

Thanks for the patch. Just thinking out loud, since we have names
like 'apply_error_count', 'sync_error_count' which tells that they
are actually error-count, will it be better to have something
similar in conflict-count cases, like
'insert_exists_conflict_count', 'delete_missing_conflict_count' and so

on. Thoughts?

It would be better to have conflict in the names but OTOH it will
make the names a bit longer. The other alternatives could be (a)
insert_exists_confl_count, etc. (b) confl_insert_exists_count, etc.
(c) confl_insert_exists, etc. These are based on the column names in
the existing view pg_stat_database_conflicts [1]. The (c) looks
better than other options but it will make the conflict-related
columns different from error-related columns.

Yet another option is to have a different view like
pg_stat_subscription_conflicts but that sounds like going too far.

Yes, I think we are good with pg_stat_subscription_stats for the time being.

[1] -

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORI

NG-PG-STAT-DATABASE-CONFLICTS-VIEW

Option (c) looked good to me.

+1 for option c. it should be okay to not have '_count' in the name.

Agreed. Here is new version patch which change the names as suggested. I also
rebased the patch based on another renaming commit 640178c9.

Best Regards,
Hou zj

#13

shveta.malik@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#12)

Re: Collect statistics about conflicts in logical replication

On Thu, Aug 29, 2024 at 11:06 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Agreed. Here is new version patch which change the names as suggested. I also
rebased the patch based on another renaming commit 640178c9.

Thanks for the patch. Few minor things:

1)
conflict.h:
* This enum is used in statistics collection (see
* PgStat_StatSubEntry::conflict_count) as well, therefore, when adding new
* values or reordering existing ones, ensure to review and potentially adjust
* the corresponding statistics collection codes.

--We shall mention PgStat_BackendSubEntry as well.

026_stats.pl:
2)
# Now that the table is empty, the
# update conflict (update_existing) ERRORs will stop happening.

--Shall it be update_exists instead of update_existing here:

3)
This is an existing comment above insert_exists conflict capture:
# Wait for the apply error to be reported.

--Shall we change to:
# Wait for the subscriber to report apply error and insert_exists conflict.

thanks
Shveta

#14

smithpb2250@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#12)

Re: Collect statistics about conflicts in logical replication

Hi Hou-San. Here are my review comments for v4-0001.

======

1. Add links in the docs

IMO it would be good for all these confl_* descriptions (in
doc/src/sgml/monitoring.sgml) to include links back to where each of
those conflict types was defined [1]https://www.postgresql.org/docs/devel/logical-replication-conflicts.html#LOGICAL-REPLICATION-CONFLICTS.

Indeed, when links are included to the original conflict type
information then I think you should remove mentioning
"track_commit_timestamp":
+       counted only when the
+       <link linkend="guc-track-commit-timestamp"><varname>track_commit_timestamp</varname></link>
+       option is enabled on the subscriber.

It should be obvious that you cannot count a conflict if the conflict
does not happen, but I don't think we should scatter/duplicate those
rules in different places saying when certain conflicts can/can't
happen -- we should just link everywhere back to the original
description for those rules.

~~~

2. Arrange all the counts into an intuitive/natural order

There is an intuitive/natural ordering for these counts. For example,
the 'confl_*' count fields are in the order insert -> update ->
delete, which LGTM.

Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
in a good order.

IMO it makes more sense if everything is ordered as:
'sync_error_count', then 'apply_error_count', then all the 'confl_*'
counts.

This comment applies to lots of places, e.g.:
- docs (doc/src/sgml/monitoring.sgml)
- function pg_stat_get_subscription_stats (pg_proc.dat)
- view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
- TAP test SELECTs (test/subscription/t/026_stats.pl)

As all those places are already impacted by this patch, I think it
would be good if (in passing) we (if possible) also swapped the
sync/apply counts so everything is ordered intuitively top-to-bottom
or left-to-right.

======
[1]: https://www.postgresql.org/docs/devel/logical-replication-conflicts.html#LOGICAL-REPLICATION-CONFLICTS

Kind Regards,
Peter Smith.
Fujitsu Australia

#15

[1]: /messages/by-id/CAJpy0uAZpzustNOMBhxBctHHWbBA=snTAYsLpoWZg+cqegmD-A@mail.gmail.com

shveta.malik@gmail.com

over 1 year ago

In reply to: shveta malik (#13)

Re: Collect statistics about conflicts in logical replication

On Fri, Aug 30, 2024 at 9:40 AM shveta malik <shveta.malik@gmail.com> wrote:

On Thu, Aug 29, 2024 at 11:06 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Agreed. Here is new version patch which change the names as suggested. I also
rebased the patch based on another renaming commit 640178c9.

Thanks for the patch. Few minor things:

1)
conflict.h:
* This enum is used in statistics collection (see
* PgStat_StatSubEntry::conflict_count) as well, therefore, when adding new
* values or reordering existing ones, ensure to review and potentially adjust
* the corresponding statistics collection codes.

--We shall mention PgStat_BackendSubEntry as well.

026_stats.pl:
2)
# Now that the table is empty, the
# update conflict (update_existing) ERRORs will stop happening.

--Shall it be update_exists instead of update_existing here:

3)
This is an existing comment above insert_exists conflict capture:
# Wait for the apply error to be reported.

--Shall we change to:
# Wait for the subscriber to report apply error and insert_exists conflict.

1) I have tested the patch, works well.
2) Verified headers inclusions, all good
3) All my comments (very old ones when the patch was initially posted)
are now addressed.

So apart from the comments I posted in [1]/messages/by-id/CAJpy0uAZpzustNOMBhxBctHHWbBA=snTAYsLpoWZg+cqegmD-A@mail.gmail.com, I have no more comments.

thanks
Shveta

#16

shveta.malik@gmail.com

over 1 year ago

In reply to: Peter Smith (#14)

Re: Collect statistics about conflicts in logical replication

On Fri, Aug 30, 2024 at 10:53 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi Hou-San. Here are my review comments for v4-0001.

======

1. Add links in the docs

IMO it would be good for all these confl_* descriptions (in
doc/src/sgml/monitoring.sgml) to include links back to where each of
those conflict types was defined [1].
Indeed, when links are included to the original conflict type
information then I think you should remove mentioning
"track_commit_timestamp":
+       counted only when the
+       <link linkend="guc-track-commit-timestamp"><varname>track_commit_timestamp</varname></link>
+       option is enabled on the subscriber.
It should be obvious that you cannot count a conflict if the conflict
does not happen, but I don't think we should scatter/duplicate those
rules in different places saying when certain conflicts can/can't
happen -- we should just link everywhere back to the original
description for those rules.

+1, will make the doc better.

~~~

2. Arrange all the counts into an intuitive/natural order

There is an intuitive/natural ordering for these counts. For example,
the 'confl_*' count fields are in the order insert -> update ->
delete, which LGTM.

Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
in a good order.

IMO it makes more sense if everything is ordered as:
'sync_error_count', then 'apply_error_count', then all the 'confl_*'
counts.

This comment applies to lots of places, e.g.:
- docs (doc/src/sgml/monitoring.sgml)
- function pg_stat_get_subscription_stats (pg_proc.dat)
- view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
- TAP test SELECTs (test/subscription/t/026_stats.pl)

As all those places are already impacted by this patch, I think it
would be good if (in passing) we (if possible) also swapped the
sync/apply counts so everything is ordered intuitively top-to-bottom
or left-to-right.

Not sure about this though. It does not seem to belong to the current patch.

thanks
Shveta

#17

[1]: /messages/by-id/CAJpy0uAZpzustNOMBhxBctHHWbBA=snTAYsLpoWZg+cqegmD-A@mail.gmail.com

houzj.fnst@fujitsu.com

over 1 year ago

In reply to: shveta malik (#16)

RE: Collect statistics about conflicts in logical replication

On Friday, August 30, 2024 2:24 PM shveta malik <shveta.malik@gmail.com> wrote:

On Fri, Aug 30, 2024 at 10:53 AM Peter Smith <smithpb2250@gmail.com>
wrote:

Hi Hou-San. Here are my review comments for v4-0001.

Thanks Shveta and Peter for giving comments !

======

1. Add links in the docs

IMO it would be good for all these confl_* descriptions (in
doc/src/sgml/monitoring.sgml) to include links back to where each of
those conflict types was defined [1].

Indeed, when links are included to the original conflict type
information then I think you should remove mentioning
"track_commit_timestamp":
+ counted only when the
+ <link

linkend="guc-track-commit-timestamp"><varname>track_commit_timesta
mp</varname></link>

+ option is enabled on the subscriber.

It should be obvious that you cannot count a conflict if the conflict
does not happen, but I don't think we should scatter/duplicate those
rules in different places saying when certain conflicts can/can't
happen -- we should just link everywhere back to the original
description for those rules.

+1, will make the doc better.

Changed. To add link to each conflict type, I added "<varlistentry
id="conflict-xx, xreflabel=xx" to each conflict in logical-replication.sgml.

~~~

2. Arrange all the counts into an intuitive/natural order

There is an intuitive/natural ordering for these counts. For example,
the 'confl_*' count fields are in the order insert -> update ->
delete, which LGTM.

Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
in a good order.

IMO it makes more sense if everything is ordered as:
'sync_error_count', then 'apply_error_count', then all the 'confl_*'
counts.

This comment applies to lots of places, e.g.:
- docs (doc/src/sgml/monitoring.sgml)
- function pg_stat_get_subscription_stats (pg_proc.dat)
- view pg_stat_subscription_stats
(src/backend/catalog/system_views.sql)
- TAP test SELECTs (test/subscription/t/026_stats.pl)

As all those places are already impacted by this patch, I think it
would be good if (in passing) we (if possible) also swapped the
sync/apply counts so everything is ordered intuitively top-to-bottom
or left-to-right.

Not sure about this though. It does not seem to belong to the current patch.

I also don't think we should handle that in this patch.

Here is V5 patch which addressed above and Shveta's[1]/messages/by-id/CAJpy0uAZpzustNOMBhxBctHHWbBA=snTAYsLpoWZg+cqegmD-A@mail.gmail.com comments.

Best Regards,
Hou zj

#18

shveta.malik@gmail.com

over 1 year ago

In reply to: Zhijie Hou (Fujitsu) (#17)

Re: Collect statistics about conflicts in logical replication

On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Here is V5 patch which addressed above and Shveta's[1] comments.

The patch looks good to me.

thanks
Shveta

#19

smithpb2250@gmail.com

over 1 year ago

In reply to: shveta malik (#18)

Re: Collect statistics about conflicts in logical replication

On Fri, Aug 30, 2024 at 6:36 PM shveta malik <shveta.malik@gmail.com> wrote:

On Fri, Aug 30, 2024 at 12:15 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Here is V5 patch which addressed above and Shveta's[1] comments.

The patch looks good to me.

Patch v5 LGTM.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#20

smithpb2250@gmail.com

over 1 year ago

In reply to: shveta malik (#16)

Re: Collect statistics about conflicts in logical replication

On Fri, Aug 30, 2024 at 4:24 PM shveta malik <shveta.malik@gmail.com> wrote:

On Fri, Aug 30, 2024 at 10:53 AM Peter Smith <smithpb2250@gmail.com> wrote:

...

2. Arrange all the counts into an intuitive/natural order

There is an intuitive/natural ordering for these counts. For example,
the 'confl_*' count fields are in the order insert -> update ->
delete, which LGTM.

Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
in a good order.

IMO it makes more sense if everything is ordered as:
'sync_error_count', then 'apply_error_count', then all the 'confl_*'
counts.

This comment applies to lots of places, e.g.:
- docs (doc/src/sgml/monitoring.sgml)
- function pg_stat_get_subscription_stats (pg_proc.dat)
- view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
- TAP test SELECTs (test/subscription/t/026_stats.pl)

As all those places are already impacted by this patch, I think it
would be good if (in passing) we (if possible) also swapped the
sync/apply counts so everything is ordered intuitively top-to-bottom
or left-to-right.

Not sure about this though. It does not seem to belong to the current patch.

Fair enough. But, besides being inappropriate to include in the
current patch, do you think the suggestion to reorder them made sense?
If it has some merit, then I will propose it again as a separate
thread.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

#21

shveta.malik@gmail.com

over 1 year ago

In reply to: Peter Smith (#20)

#22