GSSENC'ed connection stalls while reconnection attempts.
Hello.
If psql connected using GSSAPI auth and server restarted, reconnection
sequence stalls and won't return.
I found that psql(libpq) sends startup packet via gss
encryption. conn->gssenc should be reset when encryption state is
freed.
The reason that psql doesn't notice the error is pqPacketSend returns
STATUS_OK when write error occurred. That behavior contradicts to the
comment of the function. The function is used only while making
connection so it's ok to error out immediately by write failure. I
think other usage of pqFlush while making a connection should report
write failure the same way.
Finally, It's user-friendly if psql shows error message for error on
reset attempts. (This perhaps should be arguable.)
The attached does the above. Any thoughts and/or opinions are welcome.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Fix-reconnection-failure-of-GSSENC-connections.patchtext/x-patch; charset=us-asciiDownload+17-6
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
If psql connected using GSSAPI auth and server restarted, reconnection
sequence stalls and won't return.
Yeah, reproduced here. (I wonder if there's any reasonable way to
exercise this scenario in src/test/kerberos/.)
I found that psql(libpq) sends startup packet via gss
encryption. conn->gssenc should be reset when encryption state is
freed.
Actually, it looks to me like the GSS support was wedged in by somebody
who was paying no attention to how SSL is managed, or else we forgot
to pay attention to GSS the last time we rearranged SSL support. It's
completely broken for the multiple-host-addresses scenario as well,
because try_gss is being set and cleared in the wrong places altogether.
conn->gcred is not being handled correctly either I think --- we need
to make sure that it's dropped in pqDropConnection.
The attached patch makes this all act more like the way SSL is handled,
and for me it resolves the reconnection problem.
The reason that psql doesn't notice the error is pqPacketSend returns
STATUS_OK when write error occurred. That behavior contradicts to the
comment of the function. The function is used only while making
connection so it's ok to error out immediately by write failure. I
think other usage of pqFlush while making a connection should report
write failure the same way.
I'm disinclined to mess with that, because (a) I don't think it's the
actual source of the problem, and (b) it would affect way more than
just GSS mode.
Finally, It's user-friendly if psql shows error message for error on
reset attempts. (This perhaps should be arguable.)
Meh, that's changing fairly longstanding behavior that I don't think
we've had many complaints about.
regards, tom lane
Attachments:
fix-bogus-GSS-connection-management-1.patchtext/x-diff; charset=us-ascii; name=fix-bogus-GSS-connection-management-1.patchDownload+9-28
I wrote:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
If psql connected using GSSAPI auth and server restarted, reconnection
sequence stalls and won't return.
Yeah, reproduced here. (I wonder if there's any reasonable way to
exercise this scenario in src/test/kerberos/.)
I tried writing such a test based on the IO::Pty infrastructure used
by src/bin/psql/t/010_tab_completion.pl, as attached. It works, but
it feels pretty grotty, especially seeing that so much of the patch
is copy-and-pasted from 010_tab_completion.pl. I think if we want
to have a test like this, it'd be good to work a little harder on
refactoring so that more of that code can be shared. My Perl skillz
are a bit weak for that, though.
regards, tom lane
Attachments:
test-gssapi-reconnection-wip.patchtext/x-diff; charset=us-ascii; name=test-gssapi-reconnection-wip.patchDownload+86-2
At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
If psql connected using GSSAPI auth and server restarted, reconnection
sequence stalls and won't return.Yeah, reproduced here. (I wonder if there's any reasonable way to
exercise this scenario in src/test/kerberos/.)I found that psql(libpq) sends startup packet via gss
encryption. conn->gssenc should be reset when encryption state is
freed.Actually, it looks to me like the GSS support was wedged in by somebody
who was paying no attention to how SSL is managed, or else we forgot
to pay attention to GSS the last time we rearranged SSL support. It's
completely broken for the multiple-host-addresses scenario as well,
because try_gss is being set and cleared in the wrong places altogether.
conn->gcred is not being handled correctly either I think --- we need
to make sure that it's dropped in pqDropConnection.The attached patch makes this all act more like the way SSL is handled,
and for me it resolves the reconnection problem.
It looks good to me.
The reason that psql doesn't notice the error is pqPacketSend returns
STATUS_OK when write error occurred. That behavior contradicts to the
comment of the function. The function is used only while making
connection so it's ok to error out immediately by write failure. I
think other usage of pqFlush while making a connection should report
write failure the same way.I'm disinclined to mess with that, because (a) I don't think it's the
actual source of the problem, and (b) it would affect way more than
just GSS mode.
If I did that in pqFlush your objection would be right, but
pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails"
so not doing that is just wrong. (pqSendSome reported write failure in
this case.) For other parts in authentication code, I don't think it
doesn't affect badly because authentication should proceed without any
read/write overlapping.
Finally, It's user-friendly if psql shows error message for error on
reset attempts. (This perhaps should be arguable.)Meh, that's changing fairly longstanding behavior that I don't think
we've had many complaints about.
Yeah, I haven't seen the message for any other reasons than the
absence of server. :p And, I noticed that, in the first place, I would
see that message in the next connection attempt from scratch.
I agree to you on this point.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
The attached patch makes this all act more like the way SSL is handled,
and for me it resolves the reconnection problem.
It looks good to me.
OK, thanks.
The reason that psql doesn't notice the error is pqPacketSend returns
STATUS_OK when write error occurred. That behavior contradicts to the
comment of the function. The function is used only while making
connection so it's ok to error out immediately by write failure. I
think other usage of pqFlush while making a connection should report
write failure the same way.
I'm disinclined to mess with that, because (a) I don't think it's the
actual source of the problem, and (b) it would affect way more than
just GSS mode.
If I did that in pqFlush your objection would be right, but
pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails"
so not doing that is just wrong. (pqSendSome reported write failure in
this case.) For other parts in authentication code, I don't think it
doesn't affect badly because authentication should proceed without any
read/write overlapping.
As the comment for pqSendSome says, we report a write failure immediately
only if we also cannot read. I don't really see a reason why the behavior
described there isn't fine during initial connection as well. If you feel
that the comment for pqPacketSend needs adjustment, we can do that.
In any case, I'm quite against changing pqPacketSend's behavior because
"it's only used during initial connection"; there is nothing about the
function that restricts it to that case.
Bottom line here is that I'm suspicious of changing the behavior of
the read/write code on the strength of a bug in the GSS state management
logic. If there's a reason to change the read/write code, we should be
able to demonstrate it without the GSS bug.
regards, tom lane
At Mon, 13 Jul 2020 11:08:09 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
At Fri, 10 Jul 2020 12:01:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote intgl> >> I'm disinclined to mess with that, because (a) I don't think it's the
actual source of the problem, and (b) it would affect way more than
just GSS mode.If I did that in pqFlush your objection would be right, but
pqPacketSend is defined as "RETURNS: STATUS_ERROR if the write fails"
so not doing that is just wrong. (pqSendSome reported write failure in
this case.) For other parts in authentication code, I don't think it
doesn't affect badly because authentication should proceed without any
read/write overlapping.As the comment for pqSendSome says, we report a write failure immediately
only if we also cannot read. I don't really see a reason why the behavior
described there isn't fine during initial connection as well. If you feel
that the comment for pqPacketSend needs adjustment, we can do that.
I'm fine with that.
In any case, I'm quite against changing pqPacketSend's behavior because
"it's only used during initial connection"; there is nothing about the
function that restricts it to that case.
That sounds fair enough.
Bottom line here is that I'm suspicious of changing the behavior of
the read/write code on the strength of a bug in the GSS state management
logic. If there's a reason to change the read/write code, we should be
able to demonstrate it without the GSS bug.
Agreed to separate the change from this issue. I also don't think
that change in behavior dramatically improve the situation since we
should have had a bunch of trouble when a write actually failed in the
normal case.
I'm going to post a patch to change the comment of pqPacketSend.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 14 Jul 2020 13:31:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Agreed to separate the change from this issue. I also don't think
that change in behavior dramatically improve the situation since we
should have had a bunch of trouble when a write actually failed in the
normal case.I'm going to post a patch to change the comment of pqPacketSend.
So this is a proposal to add a description about the behavior on write
failure. The last half of the addition is a copy from the comment of
pqFlush.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center