START_REPLICATION SLOT causing a crash in an assert build

Started by Jaime Casanovaover 3 years ago33 messageshackers
Jump to latest
#1Jaime Casanova
jcasanov@systemguards.com.ec

Hi,

I'm not sure what is causing this, but I have seen this twice. The
second time without activity after changing the set of tables in a
PUBLICATION.

gdb says that debug_query_string contains:

"""
START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')
"""

attached the backtrace.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

Attachments:

gdb_start_replication_slot_crash.txttext/plain; charset=us-asciiDownload
#2Andres Freund
andres@anarazel.de
In reply to: Jaime Casanova (#1)
Re: START_REPLICATION SLOT causing a crash in an assert build

Hi,

On 2022-09-06 18:40:49 -0500, Jaime Casanova wrote:

I'm not sure what is causing this, but I have seen this twice. The
second time without activity after changing the set of tables in a
PUBLICATION.

Can you describe the steps to reproduce?

Which git commit does this happen on?

gdb says that debug_query_string contains:

"""
START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')
"""

attached the backtrace.

#2 0x00005559bfd4f0ed in ExceptionalCondition (
conditionName=0x5559bff30e20 "namestrcmp(&statent->slotname, NameStr(slot->data.name)) == 0", errorType=0x5559bff30e0d "FailedAssertion", fileName=0x5559bff30dbb "pgstat_replslot.c",
lineNumber=89) at assert.c:69

what are statent->slotname and slot->data.name?

Greetings,

Andres Freund

#3Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Andres Freund (#2)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Wed, Sep 07, 2022 at 12:39:08PM -0700, Andres Freund wrote:

Hi,

On 2022-09-06 18:40:49 -0500, Jaime Casanova wrote:

I'm not sure what is causing this, but I have seen this twice. The
second time without activity after changing the set of tables in a
PUBLICATION.

Can you describe the steps to reproduce?

I'm still trying to determine that

Which git commit does this happen on?

6e55ea79faa56db85a2b6c5bf94cee8acf8bfdb8 (Stamp 15beta4)

gdb says that debug_query_string contains:

"""
START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')
"""

attached the backtrace.

#2 0x00005559bfd4f0ed in ExceptionalCondition (
conditionName=0x5559bff30e20 "namestrcmp(&statent->slotname, NameStr(slot->data.name)) == 0", errorType=0x5559bff30e0d "FailedAssertion", fileName=0x5559bff30dbb "pgstat_replslot.c",
lineNumber=89) at assert.c:69

what are statent->slotname and slot->data.name?

slot->data.name seems to be the replication_slot record, and
statent->slotname comes from the in shared memory stats for that slot.

And the assert happens when &statent->slotname.data comes empty, which
is not frequent but it happens from time to time

btw, while I'm looking at this I found that we can drop a publication
while there are active subscriptions pointing to it, is that something
we should allow?
anyway, that is not the cause of this because the replication slot actually
exists.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#4Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Andres Freund (#2)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Wed, Sep 07, 2022 at 12:39:08PM -0700, Andres Freund wrote:

Hi,

On 2022-09-06 18:40:49 -0500, Jaime Casanova wrote:

I'm not sure what is causing this, but I have seen this twice. The
second time without activity after changing the set of tables in a
PUBLICATION.

This crash happens after a reset of statistics for a slot replication

Can you describe the steps to reproduce?

bin/pg_ctl -D data1 initdb
bin/pg_ctl -D data1 -l logfile1 -o "-c port=54315 -c wal_level=logical" start
bin/psql -p 54315 postgres <<EOF
create table t1 (i int primary key);
create publication pub1 for table t1;
EOF

bin/pg_ctl -D data2 initdb
bin/pg_ctl -D data2 -l logfile2 -o "-c port=54316" start
bin/psql -p 54316 postgres <<EOF
create table t1 (i int primary key);
create subscription sub1 connection 'host=/tmp port=54315 dbname=postgres' publication pub1;
EOF

bin/psql -p 54315 postgres <<EOF
select pg_stat_reset_replication_slot('sub1');
insert into t1 values(1);
EOF

Which git commit does this happen on?

just tested again on f5047c1293acce3c6c3802b06825aa3a9f9aa55a

gdb says that debug_query_string contains:

"""
START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"')
"""

attached the backtrace.

#2 0x00005559bfd4f0ed in ExceptionalCondition (
conditionName=0x5559bff30e20 "namestrcmp(&statent->slotname, NameStr(slot->data.name)) == 0", errorType=0x5559bff30e0d "FailedAssertion", fileName=0x5559bff30dbb "pgstat_replslot.c",
lineNumber=89) at assert.c:69

what are statent->slotname and slot->data.name?

and the problem seems to be that after zero'ing the stats that includes
the name of the replication slot, this simple patch fixes it... not sure
if it's the right fix though...

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

Attachments:

fix_reset_stat_replslot.patchtext/x-diff; charset=us-asciiDownload+1-0
#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jaime Casanova (#4)
Re: START_REPLICATION SLOT causing a crash in an assert build

Nice finding.

At Tue, 13 Sep 2022 00:39:45 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

and the problem seems to be that after zero'ing the stats that includes
the name of the replication slot, this simple patch fixes it... not sure
if it's the right fix though...

That doesn't work. since what that function clears is not the name in
the slot struct but that in stats entry.

The cause is what pg_stat_reset_replslot wants to do does not match
what pgstat feature thinks as reset.

Unfortunately, we don't have a function to leave a portion alone after
a reset. However, fetching the stats entry in pgstat_reset_replslot is
ugly..

I'm not sure this is less uglier but it works if
pgstat_report_replslot sets the name if it is found to be wiped
out... But that hafly nullify the protction by the assertion just
after.

--- a/src/backend/utils/activity/pgstat_replslot.c
+++ b/src/backend/utils/activity/pgstat_replslot.c
@@ -83,9 +83,11 @@ pgstat_report_replslot(ReplicationSlot *slot, const PgStat_StatReplSlotEntry *re
 	statent = &shstatent->stats;
 	/*
-	 * Any mismatch should have been fixed in pgstat_create_replslot() or
-	 * pgstat_acquire_replslot().
+	 * pgstat_create_replslot() and pgstat_acquire_replslot() enters the name,
+	 * but pgstat_reset_replslot() may clear it.
 	 */
+	if (statent->slotname.data[0] == 0)
+		namestrcpy(&shstatent->stats.slotname, NameStr(slot->data.name));
 	Assert(namestrcmp(&statent->slotname, NameStr(slot->data.name)) == 0);

Another measure would be to add the region to wipe-out on reset to
PgStat_KindInfo, but it seems too much.. (attached)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

replslot_reset_fix.txttext/plain; charset=us-asciiDownload+20-2
#6Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Kyotaro Horiguchi (#5)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Tue, Sep 13, 2022 at 06:48:45PM +0900, Kyotaro Horiguchi wrote:

Nice finding.

At Tue, 13 Sep 2022 00:39:45 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

and the problem seems to be that after zero'ing the stats that includes
the name of the replication slot, this simple patch fixes it... not sure
if it's the right fix though...

That doesn't work. since what that function clears is not the name in
the slot struct but that in stats entry.

you're right... the curious thing is that I tested it and it worked, but
now it doesn't... maybe it was too late for me...

The cause is what pg_stat_reset_replslot wants to do does not match
what pgstat feature thinks as reset.

[...]

Another measure would be to add the region to wipe-out on reset to
PgStat_KindInfo, but it seems too much.. (attached)

This patch solves the problem, i didn't like the other solution because
as you say it partly nullify the protection of the assertion.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#7Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Jaime Casanova (#6)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Tue, Sep 13, 2022 at 10:07:50PM -0500, Jaime Casanova wrote:

On Tue, Sep 13, 2022 at 06:48:45PM +0900, Kyotaro Horiguchi wrote:

Another measure would be to add the region to wipe-out on reset to
PgStat_KindInfo, but it seems too much.. (attached)

This patch solves the problem, i didn't like the other solution because
as you say it partly nullify the protection of the assertion.

I talked too fast, while it solves the immediate problem the patch as is
causes other crashes.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jaime Casanova (#7)
Re: START_REPLICATION SLOT causing a crash in an assert build

At Thu, 15 Sep 2022 01:26:15 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

On Tue, Sep 13, 2022 at 10:07:50PM -0500, Jaime Casanova wrote:

On Tue, Sep 13, 2022 at 06:48:45PM +0900, Kyotaro Horiguchi wrote:

Another measure would be to add the region to wipe-out on reset to
PgStat_KindInfo, but it seems too much.. (attached)

This patch solves the problem, i didn't like the other solution because
as you say it partly nullify the protection of the assertion.

I talked too fast, while it solves the immediate problem the patch as is
causes other crashes.

Where did the crash happen? Is it a bug introduced by it? Or does it
root to other cause?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Kyotaro Horiguchi (#8)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Thu, Sep 15, 2022 at 05:30:11PM +0900, Kyotaro Horiguchi wrote:

At Thu, 15 Sep 2022 01:26:15 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

On Tue, Sep 13, 2022 at 10:07:50PM -0500, Jaime Casanova wrote:

On Tue, Sep 13, 2022 at 06:48:45PM +0900, Kyotaro Horiguchi wrote:

Another measure would be to add the region to wipe-out on reset to
PgStat_KindInfo, but it seems too much.. (attached)

This patch solves the problem, i didn't like the other solution because
as you say it partly nullify the protection of the assertion.

I talked too fast, while it solves the immediate problem the patch as is
causes other crashes.

Where did the crash happen? Is it a bug introduced by it? Or does it
root to other cause?

Just compile and run the installcheck tests.

It fails at ./src/backend/utils/activity/pgstat_shmem.c:530 inside
pgstat_release_entry_ref() because it expects a "deadbeef", it seems to
be a magic variable but cannot find what its use is.

Attached a backtrace.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

Attachments:

gdb.txttext/plain; charset=us-asciiDownload
#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jaime Casanova (#9)
Re: START_REPLICATION SLOT causing a crash in an assert build

At Thu, 15 Sep 2022 11:15:12 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

It fails at ./src/backend/utils/activity/pgstat_shmem.c:530 inside

Thanks for the info. I reproduced by make check.. stupid..

It's the thinko about the base address of reset_off.

So the attached doesn't crash..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

replslot_reset_fix_2.txttext/plain; charset=us-asciiDownload+22-2
#11Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Kyotaro Horiguchi (#10)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Fri, Sep 16, 2022 at 02:37:17PM +0900, Kyotaro Horiguchi wrote:

At Thu, 15 Sep 2022 11:15:12 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

It fails at ./src/backend/utils/activity/pgstat_shmem.c:530 inside

Thanks for the info. I reproduced by make check.. stupid..

It's the thinko about the base address of reset_off.

So the attached doesn't crash..

Hi,

Just confirming there have been no crash since this last patch.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jaime Casanova (#11)
Re: START_REPLICATION SLOT causing a crash in an assert build

At Mon, 19 Sep 2022 11:04:03 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

On Fri, Sep 16, 2022 at 02:37:17PM +0900, Kyotaro Horiguchi wrote:

At Thu, 15 Sep 2022 11:15:12 -0500, Jaime Casanova <jcasanov@systemguards.com.ec> wrote in

It fails at ./src/backend/utils/activity/pgstat_shmem.c:530 inside

Thanks for the info. I reproduced by make check.. stupid..

It's the thinko about the base address of reset_off.

So the attached doesn't crash..

Hi,

Just confirming there have been no crash since this last patch.

Thanks for confirmation.

Althouh I'm not sure whether this is the right direction, this seems
to be an open item of 15?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#10)
Re: START_REPLICATION SLOT causing a crash in an assert build

Hi,

I wonder if the correct fix here wouldn't be to move the slotname out of
PgStat_StatReplSlotEntry?

On 2022-09-16 14:37:17 +0900, Kyotaro Horiguchi wrote:

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 6224c498c2..ed3f3af4d9 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -263,6 +263,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.shared_size = sizeof(PgStatShared_Database),
.shared_data_off = offsetof(PgStatShared_Database, stats),
.shared_data_len = sizeof(((PgStatShared_Database *) 0)->stats),
+		.reset_off = 0,
+		.reset_len = sizeof(((PgStatShared_Database *) 0)->stats),
.pending_size = sizeof(PgStat_StatDBEntry),
.flush_pending_cb = pgstat_database_flush_cb,
@@ -277,6 +279,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.shared_size = sizeof(PgStatShared_Relation),
.shared_data_off = offsetof(PgStatShared_Relation, stats),
.shared_data_len = sizeof(((PgStatShared_Relation *) 0)->stats),
+		.reset_off = 0,
+		.reset_len = sizeof(((PgStatShared_Relation *) 0)->stats),
.pending_size = sizeof(PgStat_TableStatus),
.flush_pending_cb = pgstat_relation_flush_cb,
@@ -291,6 +295,8 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.shared_size = sizeof(PgStatShared_Function),
.shared_data_off = offsetof(PgStatShared_Function, stats),
.shared_data_len = sizeof(((PgStatShared_Function *) 0)->stats),
+		.reset_off = 0,
+		.reset_len = sizeof(((PgStatShared_Function *) 0)->stats),
.pending_size = sizeof(PgStat_BackendFunctionEntry),
.flush_pending_cb = pgstat_function_flush_cb,
@@ -307,6 +313,10 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.shared_size = sizeof(PgStatShared_ReplSlot),
.shared_data_off = offsetof(PgStatShared_ReplSlot, stats),
.shared_data_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
+		/* reset doesn't wipe off slot name */
+		.reset_off = offsetof(PgStat_StatReplSlotEntry, spill_txns),
+		.reset_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
+		offsetof(PgStat_StatReplSlotEntry, spill_txns),

I'm confused what this offsetof does here? It's not even assigned to a
specific field? Am I missing something?

Also, wouldn't we need to subtract something of the size?

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index ac98918688..09a8c3873c 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -915,8 +915,9 @@ shared_stat_reset_contents(PgStat_Kind kind, PgStatShared_Common *header,
{
const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
-	memset(pgstat_get_entry_data(kind, header), 0,
-		   pgstat_get_entry_len(kind));
+	memset((char *)pgstat_get_entry_data(kind, header) +
+		   kind_info->reset_off, 0,
+		   kind_info->reset_len);

if (kind_info->reset_timestamp_cb)
kind_info->reset_timestamp_cb(header, ts);

This likely doesn't quite conform to what pgindent wants...

Greetings,

Andres Freund

#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#13)
Re: START_REPLICATION SLOT causing a crash in an assert build

Thanks!

At Mon, 26 Sep 2022 19:53:02 -0700, Andres Freund <andres@anarazel.de> wrote in

I wonder if the correct fix here wouldn't be to move the slotname out of
PgStat_StatReplSlotEntry?

Ugh. Right. I thought its outer struct as purely the part for the
common header. But we can freely place anything after the header
part. I moved it to the outer struct. I didn't clear that part in
pgstat_create_relation() because it is filled in immediately.

The attached is that.

On 2022-09-16 14:37:17 +0900, Kyotaro Horiguchi wrote:

...

@@ -307,6 +313,10 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
.shared_size = sizeof(PgStatShared_ReplSlot),
.shared_data_off = offsetof(PgStatShared_ReplSlot, stats),
.shared_data_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
+		/* reset doesn't wipe off slot name */
+		.reset_off = offsetof(PgStat_StatReplSlotEntry, spill_txns),
+		.reset_len = sizeof(((PgStatShared_ReplSlot *) 0)->stats),
+		offsetof(PgStat_StatReplSlotEntry, spill_txns),

I'm confused what this offsetof does here? It's not even assigned to a
specific field? Am I missing something?

Also, wouldn't we need to subtract something of the size?

Yeah, I felt it confusing. The last line above is offset from just
after the header part (it is PgStat_, not PgStatShared_). I first
wrote that as you suggested but rewrote to shorter representation.

diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index ac98918688..09a8c3873c 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -915,8 +915,9 @@ shared_stat_reset_contents(PgStat_Kind kind, PgStatShared_Common *header,
{
const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
-	memset(pgstat_get_entry_data(kind, header), 0,
-		   pgstat_get_entry_len(kind));
+	memset((char *)pgstat_get_entry_data(kind, header) +
+		   kind_info->reset_off, 0,
+		   kind_info->reset_len);

if (kind_info->reset_timestamp_cb)
kind_info->reset_timestamp_cb(header, ts);

This likely doesn't quite conform to what pgindent wants...

In the first place, it's ugly...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v1-0001-Do-not-reset-slot-name-of-replication-slot-stats.patchtext/x-patch; charset=us-asciiDownload+7-8
#15Jonathan S. Katz
jkatz@postgresql.org
In reply to: Kyotaro Horiguchi (#14)
Re: START_REPLICATION SLOT causing a crash in an assert build

On 9/27/22 1:52 AM, Kyotaro Horiguchi wrote:

Thanks!

At Mon, 26 Sep 2022 19:53:02 -0700, Andres Freund <andres@anarazel.de> wrote in

I wonder if the correct fix here wouldn't be to move the slotname out of
PgStat_StatReplSlotEntry?

Ugh. Right. I thought its outer struct as purely the part for the
common header. But we can freely place anything after the header
part. I moved it to the outer struct. I didn't clear that part in
pgstat_create_relation() because it is filled in immediately.

The attached is that.

This is still listed as an open item[1]https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items for v15. Does this fix proposed
address the issue?

Thanks,

Jonathan

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items

#16Andres Freund
andres@anarazel.de
In reply to: Jonathan S. Katz (#15)
Re: START_REPLICATION SLOT causing a crash in an assert build

Hi,

On 2022-10-05 13:00:53 -0400, Jonathan S. Katz wrote:

On 9/27/22 1:52 AM, Kyotaro Horiguchi wrote:

Thanks!

At Mon, 26 Sep 2022 19:53:02 -0700, Andres Freund <andres@anarazel.de> wrote in

I wonder if the correct fix here wouldn't be to move the slotname out of
PgStat_StatReplSlotEntry?

Ugh. Right. I thought its outer struct as purely the part for the
common header. But we can freely place anything after the header
part. I moved it to the outer struct. I didn't clear that part in
pgstat_create_relation() because it is filled in immediately.

The attached is that.

This is still listed as an open item[1] for v15. Does this fix proposed
address the issue?

Unfortunately not - it doesn't even pass the existing tests
(test_decoding/001_repl_stats fails) :(.

The reason for that is that with the patch nothing restores the slotname when
reading stats from disk. That turns out not to cause immediate issues, but at
the next shutdown the name won't be set, and we'll serialize the stats data
with an empty string as the name.

I have two ideas how to fix it. As a design constraint, I'd be interested in
the RMTs opinion on the following:
Is a cleaner fix that changes the stats format (i.e. existing stats will be
thrown away when upgrading) or one that doesn't change the stats format
preferrable?

Greetings,

Andres Freund

#17Jonathan S. Katz
jkatz@postgresql.org
In reply to: Andres Freund (#16)
Re: START_REPLICATION SLOT causing a crash in an assert build

On 10/5/22 8:44 PM, Andres Freund wrote:

Hi,

On 2022-10-05 13:00:53 -0400, Jonathan S. Katz wrote:

On 9/27/22 1:52 AM, Kyotaro Horiguchi wrote:

Thanks!

At Mon, 26 Sep 2022 19:53:02 -0700, Andres Freund <andres@anarazel.de> wrote in

I wonder if the correct fix here wouldn't be to move the slotname out of
PgStat_StatReplSlotEntry?

Ugh. Right. I thought its outer struct as purely the part for the
common header. But we can freely place anything after the header
part. I moved it to the outer struct. I didn't clear that part in
pgstat_create_relation() because it is filled in immediately.

The attached is that.

This is still listed as an open item[1] for v15. Does this fix proposed
address the issue?

Unfortunately not - it doesn't even pass the existing tests
(test_decoding/001_repl_stats fails) :(.

Thanks for checking.

The reason for that is that with the patch nothing restores the slotname when
reading stats from disk. That turns out not to cause immediate issues, but at
the next shutdown the name won't be set, and we'll serialize the stats data
with an empty string as the name.

Ah.

I have two ideas how to fix it. As a design constraint, I'd be interested in
the RMTs opinion on the following:
Is a cleaner fix that changes the stats format (i.e. existing stats will be
thrown away when upgrading) or one that doesn't change the stats format
preferrable?

[My opinion, will let Michael/John chime in]

Ideally we would keep the stats on upgrade -- I think that's the better
user experience.

Thanks,

Jonathan

#18Michael Paquier
michael@paquier.xyz
In reply to: Jonathan S. Katz (#17)
Re: START_REPLICATION SLOT causing a crash in an assert build

On Wed, Oct 05, 2022 at 11:24:57PM -0400, Jonathan S. Katz wrote:

On 10/5/22 8:44 PM, Andres Freund wrote:

I have two ideas how to fix it. As a design constraint, I'd be interested in
the RMTs opinion on the following:
Is a cleaner fix that changes the stats format (i.e. existing stats will be
thrown away when upgrading) or one that doesn't change the stats format
preferrable?

[My opinion, will let Michael/John chime in]

Ideally we would keep the stats on upgrade -- I think that's the better user
experience.

The release has not happened yet, so I would be fine to remain
flexible and bump again PGSTAT_FILE_FORMAT_ID so as we have the
cleanest approach in place for the release and the future. At the
end, we would throw automatically the data of a file that's marked
with a version that does not match with what we expect at load time,
so there's a limited impact on the user experience, except, well,
losing these stats, of course.
--
Michael

#19Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#18)
Re: START_REPLICATION SLOT causing a crash in an assert build

At Thu, 6 Oct 2022 13:44:43 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Wed, Oct 05, 2022 at 11:24:57PM -0400, Jonathan S. Katz wrote:

On 10/5/22 8:44 PM, Andres Freund wrote:

I have two ideas how to fix it. As a design constraint, I'd be interested in
the RMTs opinion on the following:
Is a cleaner fix that changes the stats format (i.e. existing stats will be
thrown away when upgrading) or one that doesn't change the stats format
preferrable?

[My opinion, will let Michael/John chime in]

Ideally we would keep the stats on upgrade -- I think that's the better user
experience.

The release has not happened yet, so I would be fine to remain
flexible and bump again PGSTAT_FILE_FORMAT_ID so as we have the
cleanest approach in place for the release and the future. At the
end, we would throw automatically the data of a file that's marked
with a version that does not match with what we expect at load time,
so there's a limited impact on the user experience, except, well,
losing these stats, of course.

+1. FWIW, the atttached is an example of what it looks like if we
avoid file format change.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2-0001-If-we-avoid-file-format.patchtext/x-patch; charset=us-asciiDownload+22-10
#20Jonathan S. Katz
jkatz@postgresql.org
In reply to: Kyotaro Horiguchi (#19)
Re: START_REPLICATION SLOT causing a crash in an assert build

On 10/6/22 1:10 AM, Kyotaro Horiguchi wrote:

At Thu, 6 Oct 2022 13:44:43 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Wed, Oct 05, 2022 at 11:24:57PM -0400, Jonathan S. Katz wrote:

On 10/5/22 8:44 PM, Andres Freund wrote:

I have two ideas how to fix it. As a design constraint, I'd be interested in
the RMTs opinion on the following:
Is a cleaner fix that changes the stats format (i.e. existing stats will be
thrown away when upgrading) or one that doesn't change the stats format
preferrable?

[My opinion, will let Michael/John chime in]

Ideally we would keep the stats on upgrade -- I think that's the better user
experience.

The release has not happened yet, so I would be fine to remain
flexible and bump again PGSTAT_FILE_FORMAT_ID so as we have the
cleanest approach in place for the release and the future.

Yes, I agree with this.

At the

end, we would throw automatically the data of a file that's marked
with a version that does not match with what we expect at load time,
so there's a limited impact on the user experience, except, well,
losing these stats, of course.

I'm fine with this.

+1. FWIW, the atttached is an example of what it looks like if we
avoid file format change.

Thanks for the quick turnaround. I'll let others chime in on the review.

Thanks,

Jonathan

#21Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#19)
#22Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#21)
#23Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#26)
#28Jonathan S. Katz
jkatz@postgresql.org
In reply to: Andres Freund (#27)
#29Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#27)
#30Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#29)
#31Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#31)
#33Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#32)