SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

Started by Vladimir Sitnikovalmost 6 years ago44 messageshackers
Jump to latest
#1Vladimir Sitnikov
sitnikov.vladimir@gmail.com

Hi,

Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of
PostgreSQL, Ubuntu 14.04.5 LTS

Here's a call stack:
https://travis-ci.org/github/pgjdbc/pgjdbc/jobs/691794110#L7484
The crash is consistent, and it reproduces 100% of the cases so far.

The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
bad by 19 May 14:00 UTC,
so the regression was introduced somewhere in-between.

Does that ring any bells?

In case you wonder:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 XLogSendPhysical () at walsender.c:2762
2762 if (!WALRead(xlogreader,
(gdb) #0 XLogSendPhysical () at walsender.c:2762
SendRqstPtr = 133473640
startptr = 133473240
endptr = 133473640
nbytes = 400
segno = 1
errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
wre_read = -1, wre_seg = {ws_file = 4714224,
ws_segno = 140729887364688, ws_tli = 0}}
__func__ = "XLogSendPhysical"
#1 0x000000000087fa8a in WalSndLoop (send_data=0x88009d <XLogSendPhysical>)
at walsender.c:2300
__func__ = "WalSndLoop"
#2 0x000000000087d65a in StartReplication (cmd=0x299bda8) at
walsender.c:750
buf = {data = 0x0, len = 3, maxlen = 1024, cursor = 87}
FlushPtr = 133473640
__func__ = "StartReplication"
#3 0x000000000087eddc in exec_replication_command (
cmd_string=0x2916e68 "START_REPLICATION 0/7F4A3D8") at walsender.c:1643
cmd = 0x299bda8
parse_rc = 0
cmd_node = 0x299bda8
cmd_context = 0x299bc60
old_context = 0x2916d50
qc = {commandTag = 988942640, nprocessed = 140729887363520}
__func__ = "exec_replication_command"

Vladimir

#2Michael Paquier
michael@paquier.xyz
In reply to: Vladimir Sitnikov (#1)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote:

The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
bad by 19 May 14:00 UTC,
so the regression was introduced somewhere in-between.

Does that ring any bells?

It does, thanks! This would map with 1d374302 or 850196b6 that
reworked this area of the code, so it seems like we are not quite done
with this work yet. Do you still see the problem as of 55ca50d
(today's latest HEAD)?

Also, just wondering.. If I use more or less the same commands as
your travis job I should be able to reproduce the problem with a fresh
JDBC repository, right? Or do you a sub-portion of your regression
tests to run that easily?
--
Michael

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#2)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

At Thu, 28 May 2020 16:22:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote:

The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
bad by 19 May 14:00 UTC,
so the regression was introduced somewhere in-between.

Does that ring any bells?

It does, thanks! This would map with 1d374302 or 850196b6 that
reworked this area of the code, so it seems like we are not quite done
with this work yet. Do you still see the problem as of 55ca50d
(today's latest HEAD)?

I think that's not the case. I think I cause this crash with the
HEAD. I'll post a fix soon.

Also, just wondering.. If I use more or less the same commands as
your travis job I should be able to reproduce the problem with a fresh
JDBC repository, right? Or do you a sub-portion of your regression
tests to run that easily?

Pgjdbc seems initiating physical replication on a logical replication
session.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#3)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Thu, May 28, 2020 at 04:32:22PM +0900, Kyotaro Horiguchi wrote:

I think that's not the case. I think I cause this crash with the
HEAD. I'll post a fix soon.

Pgjdbc seems initiating physical replication on a logical replication
session.

Let me see... Indeed:
$ psql "replication=database dbname=postgres"
=# START_REPLICATION SLOT test_slot PHYSICAL 0/0;
[one <boom> later]

(gdb) bt
#0 XLogSendPhysical () at walsender.c:2762
#1 0x000055d5f7803318 in WalSndLoop (send_data=0x55d5f78039c7
<XLogSendPhysical>) at walsender.c:2300
#2 0x000055d5f7800d70 in StartReplication (cmd=0x55d5f919bc60) at
walsender.c:750
#3 0x000055d5f78025ad in exec_replication_command
(cmd_string=0x55d5f9119a80 "START_REPLICATION SLOT test_slot PHYSICAL
0/0;") at walsender.c:1643
#4 0x000055d5f786a7ea in PostgresMain (argc=1, argv=0x55d5f91472c8,
dbname=0x55d5f9147210 "ioltas", username=0x55d5f91471f0 "ioltas") at
postgres.c:4311
#5 0x000055d5f77b4e9c in BackendRun (port=0x55d5f913dcd0) at
postmaster.c:4523
#6 0x000055d5f77b4606 in BackendStartup (port=0x55d5f913dcd0) at
postmaster.c:4215
#7 0x000055d5f77b08ad in ServerLoop () at postmaster.c:1727
#8 0x000055d5f77b00fc in PostmasterMain (argc=3, argv=0x55d5f9113260)
at postmaster.c:1400
#9 0x000055d5f76b3736 in main (argc=3, argv=0x55d5f9113260) at
main.c:210

No need for the JDBC test suite then.
--
Michael

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Vladimir Sitnikov (#1)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

At Thu, 28 May 2020 09:07:04 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in

Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of
PostgreSQL, Ubuntu 14.04.5 LTS

Here's a call stack:
https://travis-ci.org/github/pgjdbc/pgjdbc/jobs/691794110#L7484
The crash is consistent, and it reproduces 100% of the cases so far.

The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
bad by 19 May 14:00 UTC,
so the regression was introduced somewhere in-between.

Does that ring any bells?

Thanks for the report. It is surely a bug since the server crashes,
on the other hand Pgjdbc seems doing bad, too.

It seems to me that that crash means Pgjdbc is initiating a logical
replication connection to start physical replication.

In case you wonder:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 XLogSendPhysical () at walsender.c:2762
2762 if (!WALRead(xlogreader,
(gdb) #0 XLogSendPhysical () at walsender.c:2762
SendRqstPtr = 133473640
startptr = 133473240
endptr = 133473640
nbytes = 400
segno = 1
errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
wre_read = -1, wre_seg = {ws_file = 4714224,
ws_segno = 140729887364688, ws_tli = 0}}
__func__ = "XLogSendPhysical"

I see the probably the same symptom by the following steps with the
current HEAD.

psql 'host=/tmp replication=database'
=# START_REPLICATION 0/1;
<serer crashes>

Physical replication is not assumed to be started on a logical
replication connection. The attached would fix that. The patch adds
two tests. One for this case and another for the reverse.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v1-0001-Fix-crash-when-starting-physical-replication-on-l.patchtext/x-patch; charset=us-asciiDownload+25-5
#6Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Kyotaro Horiguchi (#5)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
Kyotaro>replication connection to start physical replication.

Well, it used to work previously, so it might be a breaking change from the
client/application point of view.

Vladimir

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Vladimir Sitnikov (#6)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

Hello, Vladimir.

At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in

Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
Kyotaro>replication connection to start physical replication.

Well, it used to work previously, so it might be a breaking change from the
client/application point of view.

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

To initiate streaming replication, the frontend sends the replication
parameter in the startup message. A Boolean value of true (or on, yes,
1) tells the backend to go into physical replication walsender mode,
wherein a small set of replication commands, shown below, can be
issued instead of SQL statements.

Passing database as the value for the replication parameter instructs
the backend to go into logical replication walsender mode, connecting
to the database specified in the dbname parameter. In logical
replication walsender mode, the replication commands shown below as
well as normal SQL commands can be issued.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Dave Cramer
pg@fastcrypt.com
In reply to: Kyotaro Horiguchi (#7)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

Hello, Vladimir.

At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote in

Kyotaro>It seems to me that that crash means Pgjdbc is initiating a

logical

Kyotaro>replication connection to start physical replication.

Well, it used to work previously, so it might be a breaking change from

the

client/application point of view.

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

To initiate streaming replication, the frontend sends the replication
parameter in the startup message. A Boolean value of true (or on, yes,
1) tells the backend to go into physical replication walsender mode,
wherein a small set of replication commands, shown below, can be
issued instead of SQL statements.

Passing database as the value for the replication parameter instructs
the backend to go into logical replication walsender mode, connecting
to the database specified in the dbname parameter. In logical
replication walsender mode, the replication commands shown below as
well as normal SQL commands can be issued.

regards.

While the documentation does indeed say that there is quite a bit of

additional confusion added by:

and
START_REPLICATION [ SLOT *slot_name* ] [ PHYSICAL ] *XXX/XXX* [ TIMELINE
*tli* ]

If we already have a physical replication slot according to the startup
message why do we need to specify it in the START REPLICATION message ?

Dave

#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Dave Cramer (#8)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

At Thu, 28 May 2020 09:08:19 -0400, Dave Cramer <davecramer@postgres.rocks> wrote in

On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

...

While the documentation does indeed say that there is quite a bit of
additional confusion added by:

and
START_REPLICATION [ SLOT *slot_name* ] [ PHYSICAL ] *XXX/XXX* [ TIMELINE
*tli* ]

If we already have a physical replication slot according to the startup
message why do we need to specify it in the START REPLICATION message ?

I don't know, but physical replication has worked that way since
before the replication slots was introduced so we haven't needed to do
so. Physical replication slots are not assumed as more than
memorandum for the oldest required WAL segment (and oldest xmin).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#7)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

+       if (am_db_walsender)
+               ereport(ERROR,
+                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("cannot initiate physical
replication on a logical replication connection")));

I don't agree with this change. The only restriction that we have in
place now in walsender.c regarding MyDatabaseId not being set is to
prevent the execution of SQL commands. Note that it is possible to
start physical replication even if MyDatabaseId is set in a
replication connection, so you could break cases that have been valid
until now.

I think that we actually should be much more careful with the
initialization of the WAL reader used in the context of a WAL sender
before calling WALRead() and attempting to read a new WAL page.
--
Michael

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#10)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

+       if (am_db_walsender)
+               ereport(ERROR,
+                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("cannot initiate physical
replication on a logical replication connection")));

I don't agree with this change. The only restriction that we have in
place now in walsender.c regarding MyDatabaseId not being set is to
prevent the execution of SQL commands. Note that it is possible to
start physical replication even if MyDatabaseId is set in a
replication connection, so you could break cases that have been valid
until now.

It donesn't check MyDatabase, but whether the connection parameter
"repliation" is "true" or "database". The documentation is telling
that "replication" should be "true" for a connection that is to be
used for physical replication, and "replication" should literally be
"database" for a connection that is for logical replication. We need
to revise the documentation if we are going to allow physical
replication on a conection with "replication = database".

I think that we actually should be much more careful with the
initialization of the WAL reader used in the context of a WAL sender
before calling WALRead() and attempting to read a new WAL page.

I agree that the initialization can be improved, but the current code
is no problem if we don't allow to run both logical and physical
replication on a single session.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kyotaro Horiguchi (#11)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Fri, 29 May 2020 at 17:57, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

+       if (am_db_walsender)
+               ereport(ERROR,
+                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("cannot initiate physical
replication on a logical replication connection")));

I don't agree with this change. The only restriction that we have in
place now in walsender.c regarding MyDatabaseId not being set is to
prevent the execution of SQL commands. Note that it is possible to
start physical replication even if MyDatabaseId is set in a
replication connection, so you could break cases that have been valid
until now.

It donesn't check MyDatabase, but whether the connection parameter
"repliation" is "true" or "database". The documentation is telling
that "replication" should be "true" for a connection that is to be
used for physical replication, and "replication" should literally be
"database" for a connection that is for logical replication. We need
to revise the documentation if we are going to allow physical
replication on a conection with "replication = database".

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR: logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#12)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR: logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Looking at the code, I think that there is some confusion with the
fake WAL reader used as base reference in InitWalSender() where we
assume that it could only be used in the context of a non-database WAL
sender. However, this initialization happens when the WAL sender
connection is initialized, and what I think this misses is that we
should try to initialize a WAL reader when actually going through a
START_REPLICATION command.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached. The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.
--
Michael

Attachments:

walsender-crash-v2.patchtext/x-diff; charset=us-asciiDownload+31-26
#14Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#13)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On 2020/06/02 13:24, Michael Paquier wrote:

On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR: logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.

Looking at the code, I think that there is some confusion with the
fake WAL reader used as base reference in InitWalSender() where we
assume that it could only be used in the context of a non-database WAL
sender. However, this initialization happens when the WAL sender
connection is initialized, and what I think this misses is that we
should try to initialize a WAL reader when actually going through a
START_REPLICATION command.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached. The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.

Yes. Also we should add the test to check if physical replication can work
fine even on logical rep connection?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#13)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

At Tue, 2 Jun 2020 13:24:56 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR: logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

If we deliberately allow physical replication on a
database-replication connection, we should revise the documentation
that way. On the other hand physical replication has wider access to a
database cluster than logical replication. Thus allowing to start
physical replication on a logical replication connection could
introduce a problem related to privileges. So I think it might be
better that physical and logical replication have separate pg_hba
lines.

Once we explicitly allow physical replication on a logical replication
connection in documentation, it would be far harder to change the
behavior than now.

If we are sure that that cannot be a problem, I don't object the
change in documented behavior.

Looking at the code, I think that there is some confusion with the
fake WAL reader used as base reference in InitWalSender() where we
assume that it could only be used in the context of a non-database WAL
sender. However, this initialization happens when the WAL sender
connection is initialized, and what I think this misses is that we
should try to initialize a WAL reader when actually going through a
START_REPLICATION command.

At first fake_xlogreader was really a fake one that only provides
callback routines, but it should have been changed to a real
xlogreader at the time it began to store segment information. In that
sense moving to real xlogreader makes sense to me separately from
whether we allow physicalrep on logicalrep connections.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached. The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.

+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory")));

The same error message is accompanied by a DETAILS in some other
places. Don't we need one for this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#14)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:

On 2020/06/02 13:24, Michael Paquier wrote:

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

On top of it, the issue is actually unrelated to if we want to
restrict things more or not when starting replication in a WAL sender
because the xlogreader creation just needs to happen when starting
replication. Now we have a static "fake" one created when a WAL
sender process starts, something that it would not need in most cases
like answering to a BASE_BACKUP command for example.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached. The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.

Yes. Also we should add the test to check if physical replication can work
fine even on logical rep connection?

I found confusing the use of psql to confirm that it actually works,
because we'd just return a protocol-level error in this case with psql
bumping on COPY_BOTH and it is not reliable to do just an error
message match. Note as well that GetConnection() discards
automatically the database name for pg_basebackup and pg_receivewal as
well as libpqrcv_connect() for standbys so we cannot use that.
Perhaps using psql is better than nothing, but that makes me
uncomfortable.
--
Michael

#17Dave Cramer
pg@fastcrypt.com
In reply to: Michael Paquier (#16)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On Wed, 3 Jun 2020 at 01:19, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:

On 2020/06/02 13:24, Michael Paquier wrote:

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

I don't see this is a valid reason to keep doing something. If it is broken
then fix it.
Clients can deal with the change.

Dave Cramer
https://www.postgres.rocks

#18Fujii Masao
masao.fujii@gmail.com
In reply to: Dave Cramer (#17)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On 2020/06/03 20:33, Dave Cramer wrote:

On Wed, 3 Jun 2020 at 01:19, Michael Paquier <michael@paquier.xyz <mailto:michael@paquier.xyz>> wrote:

On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:

On 2020/06/02 13:24, Michael Paquier wrote:

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

For the back branches, I agree with you. Even if it's undocumented behavior,
basically we should not get rid of it from the back branches unless there is
very special reason.

For v13, if it has no functional merit, I don't think it's so bad to get rid of
that undocumented (and maybe not-fully tested) behavior. If there are
applications depending it, I think that they can be updated.

I don't see this is a valid reason to keep doing something. If it is broken then fix it.
Clients can deal with the change.

+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#13)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

On 2020-Jun-02, Michael Paquier wrote:

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached. The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.

I don't particularly disagree with your proposed patch -- in fact, it
seems to make things cleaner. It is a little wasteful, but I don't
really mind that. It's just some memory, and it's not a significant
amount.

That said, I would *also* apply Kyotaro's proposed patch to prohibit a
physical standby running with a logical slot, if only because that
reduces the number of combinations that we need to test and keep our
collective heads straight about. Just reject the weird case and have
one type of slot for each type of replication. I didn't even think this
was at all possible.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#14)
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

Hi,

On 2020-06-02 14:23:50 +0900, Fujii Masao wrote:

On 2020/06/02 13:24, Michael Paquier wrote:

On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR: logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.

I don't think we should prohibit this. For one, it'd probably break some
clients, without a meaningful need.

But I think it's also actually quite useful to be able to access
catalogs before streaming data. You e.g. can look up configuration of
the primary before streaming WAL. With a second connection that's
actually harder to do reliably in some cases, because you need to be
sure that you actually reached the right server (consider a pooler,
automatic failover etc).

Greetings,

Andres Freund

#21Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#21)
#23Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#22)
#24Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#22)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#23)
#26Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#25)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#26)
#28Dave Cramer
pg@fastcrypt.com
In reply to: Alvaro Herrera (#27)
#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Dave Cramer (#28)
#30Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#23)
#31Jonathan S. Katz
jkatz@postgresql.org
In reply to: Alvaro Herrera (#29)
#32Andres Freund
andres@anarazel.de
In reply to: Jonathan S. Katz (#31)
#33Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#32)
#34Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#33)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#34)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#34)
#37Stephen Frost
sfrost@snowman.net
In reply to: Alvaro Herrera (#36)
#38Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Stephen Frost (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#38)
#40Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#39)
#41Dave Cramer
pg@fastcrypt.com
In reply to: Alvaro Herrera (#40)
#42Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#40)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#40)
#44Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#42)