Why is citext/regress failing on hamerkop?
For example, 'i'::citext = 'İ'::citext fails to be true.
It must now be using UTF-8 (unlike, say, Drongo) and non-C ctype,
given that the test isn't skipped. This isn't the first time that
we've noticed that Windows doesn't seem to know about İ→i (see [1]),
but I don't think anyone has explained exactly why, yet. It could be
that it just doesn't know about that in any locale, or that it is
locale-dependent and would only do that for Turkish, the same reason
we skip the test for ICU, or ...
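(If anyone with a Windows box handy wants to probe what the C runtime
itself thinks, here's a rough, untested sketch; the locale name is a
guess and is spelled differently on different platforms:)

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int
    main(void)
    {
        /*
         * Ask the C library what it does with U+0130 (LATIN CAPITAL LETTER I
         * WITH DOT ABOVE).  The answer depends on platform and locale, which
         * is exactly the question here.  "tr-TR.UTF-8" is a guess for a
         * modern Windows CRT; glibc spells it "tr_TR.UTF-8".
         */
        const char *loc = setlocale(LC_CTYPE, "tr-TR.UTF-8");

        printf("locale: %s\n", loc ? loc : "(not available)");
        printf("towlower(U+0130) = U+%04X\n", (unsigned) towlower(0x0130));
        return 0;
    }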
Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green. That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.
One new development in Windows-land is that the system now does
actually support UTF-8 in the runtime libraries[2]. You can set it at
a system level, or for an application at build time, or by adding
".UTF-8" to a locale name when opening the locale (apparently much
more like Unix systems, but I don't know what exactly it does). I
wonder why we see this change now... did hamerkop have that ACP=UTF-8
setting applied on purpose, or did computers in Japan start doing
that by default instead of using Shift-JIS, or did it only start
picking UTF-8 around the time of the Meson change somehow, or the
initdb-template change? It's a little hard to tell from the logs.
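(For what it's worth, the third of those mechanisms looks like it
should be as simple as this from C; untested sketch, going purely by
that Microsoft page:)

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Per the UTF-8 documentation cited in [2], a recent UCRT accepts a
         * ".UTF-8" suffix on the locale name, switching that locale's code
         * page to UTF-8 without touching the system-wide ACP.  Untested, and
         * whether hamerkop's environment relies on this or on the system
         * setting is the open question.
         */
        char *loc = setlocale(LC_ALL, ".UTF-8");

        printf("setlocale(LC_ALL, \".UTF-8\") -> %s\n", loc ? loc : "(failed)");
        return 0;
    }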
[1]: /messages/by-id/CAC+AXB10p+mnJ6wrAEm6jb51+8=BfYzD=w6ftHRbMjMuSFN3kQ@mail.gmail.com
[2]: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green. That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.
Here's a minimal patch like that.
I don't think it's worth back-patching. Only 15 and 16 could possibly
be affected, I think, because the test wasn't enabled before that. I
think this is all just a late-appearing blow-up predicted by the
commit that enabled it:
commit c2e8bd27519f47ff56987b30eb34a01969b9a9e8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Jan 5 13:30:07 2022 -0500
Enable routine running of citext's UTF8-specific test cases.
These test cases have been commented out since citext was invented,
because at the time we had no nice way to deal with tests that
have restrictions such as requiring UTF8 encoding. But now we do
have a convention for that, ie put them into a separate test file
with an early-exit path. So let's enable these tests to run when
their prerequisites are satisfied.
(We may have to tighten the prerequisites beyond the "encoding = UTF8
and locale != C" checks made here. But let's put it on the buildfarm
and see what blows up.)
Hamerkop is already green on the 15 and 16 branches, apparently
because it's using the pre-meson test stuff that I guess just didn't
run the relevant test. In other words, nobody would notice the
difference anyway, and a master-only fix would be enough to end this
44-day red streak.
Attachments:
0001-Skip-the-citext_utf8-test-on-Windows.patch (text/x-patch)
From 1f0e1dc21d4055a0e5109bac39999b290508e2d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 12 May 2024 10:20:06 +1200
Subject: [PATCH] Skip the citext_utf8 test on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On other Windows build farm animals it is already skipped because they
don't use UTF-8 encoding. On "hamerkop", UTF-8 is used, and then the
test fails.
It is not clear to me (a non-Windows person looking only at buildfarm
evidence) whether Windows is less sophisticated than other OSes and
doesn't know how to downcase Turkish İ with the standard Unicode
database, or if it is more sophisticated than other systems and uses
locale-specific behavior like ICU does.
Whichever the reason, the result is the same: we need to skip the test
on Windows, just as we already do for ICU, at least until a
Windows-savvy developer comes up with a better idea. The technique for
detecting the OS is borrowed from collate.windows.win1252.sql.
Discussion: https://postgr.es/m/CA%2BhUKGJ1LeC3aE2qQYTK95rFVON3ZVoTQpTKJqxkHdtEyawH4A%40mail.gmail.com
---
contrib/citext/expected/citext_utf8.out | 3 +++
contrib/citext/expected/citext_utf8_1.out | 3 +++
contrib/citext/sql/citext_utf8.sql | 3 +++
3 files changed, 9 insertions(+)
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 5d988dcd485..19538db674e 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -6,8 +6,11 @@
* Turkish dotted I is not correct for many ICU locales. citext always
* uses the default collation, so it's not easy to restrict the test
* to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows. It fails similarly, at least in some locales.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
+ version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
FROM pg_database
WHERE datname=current_database())
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 7065a5da190..874ec8519e1 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -6,8 +6,11 @@
* Turkish dotted I is not correct for many ICU locales. citext always
* uses the default collation, so it's not easy to restrict the test
* to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows. It fails similarly, at least in some locales.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
+ version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
FROM pg_database
WHERE datname=current_database())
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index 34b232d64e2..ba283320797 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -6,9 +6,12 @@
* Turkish dotted I is not correct for many ICU locales. citext always
* uses the default collation, so it's not easy to restrict the test
* to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows. It fails similarly, at least in some locales.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
+ version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
FROM pg_database
WHERE datname=current_database())
--
2.44.0
Thomas Munro <thomas.munro@gmail.com> writes:
On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green. That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.
Here's a minimal patch like that.
WFM until some Windows person cares to probe more deeply.
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
I'm not planning on looking into that question myself, but really
somebody ought to. Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?
regards, tom lane
Hello Tom,
12.05.2024 08:34, Tom Lane wrote:
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
I'm not planning on looking into that question myself, but really
somebody ought to. Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?
I've reproduced the failure locally with GSS enabled, so I'll try to
figure out what's going on here in the next few days.
Best regards,
Alexander
On 2024-05-12 Su 01:34, Tom Lane wrote:
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
Possibly. It looks like this might be the issue:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already
There are several questions here, including:
1. why isn't it failing on later branches?
2. why isn't it failing on drongo (which has more modern compiler and OS)?
I think we'll need the help of the animal owner to dig into the issue.
I'm not planning on looking into that question myself, but really
somebody ought to. Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?
Well, this is more or less where I came in back in about 2002 :-) I've
been trying to help support it ever since, motivated by stubborn
persistence more than anything else. Still, I agree that the lack of support
for the Windows port from Microsoft over the years has been more than
disappointing.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-05-12 Su 08:26, Andrew Dunstan wrote:
On 2024-05-12 Su 01:34, Tom Lane wrote:
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
Possibly. It looks like this might be the issue:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already
There are several questions here, including:
1. why isn't it failing on later branches?
2. why isn't it failing on drongo (which has more modern compiler and OS)?
I think we'll need the help of the animal owner to dig into the issue.
Aha! drongo doesn't have GSSAPI enabled. Will work on that.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.
I think "state of the Windows port" would make a good discussion topic
at pgconf.dev (with write-up for those who can't be there). If there
is interest, I could organise that with a short presentation of the
issues I am aware of so far and possible improvements and
smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I
would focus on technical stuff, not who-should-be-doing-what, 'cause I
can't make anyone do anything.
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
On 2024-05-12 Su 18:05, Thomas Munro wrote:
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.
I think "state of the Windows port" would make a good discussion topic
at pgconf.dev (with write-up for those who can't be there). If there
is interest, I could organise that with a short presentation of the
issues I am aware of so far and possible improvements and
smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I
would focus on technical stuff, not who-should-be-doing-what, 'cause I
can't make anyone do anything.
+1
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
In the meantime, some off-list investigation by Alexander Lakhin
has turned up a good deal of information about why we're seeing
failures on hamerkop in the back branches. Summarizing, it
appears that
1. In a GSS-enabled Windows build without any active Kerberos server,
libpq's pg_GSS_have_cred_cache() succeeds, allowing libpq to try to
open a GSSAPI connection, but then gss_init_sec_context() fails,
leading to client-side reports like this:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already
(The first of these lines comes out during the attempted GSS
connection, the second during the only-slightly-more-successful
non-GSS connection.) So that's problem number 1: how is it that
gss_acquire_cred() succeeds but then gss_init_sec_context() disclaims
knowledge of any credentials? Can we find a way to get this failure
to be detected during pg_GSS_have_cred_cache()? It is mighty
expensive to launch a backend connection that is doomed to fail,
especially when this happens during *every single libpq connection
attempt*.
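(For the archives, the shape of what I have in mind is something like
the following, assuming the MIT Kerberos headers; just a sketch, and I
have no idea whether gss_inquire_cred() would actually report anything
different in the failing Windows environment.)

    #include <gssapi/gssapi.h>
    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical stricter version of libpq's credential-cache probe:
     * don't just check that gss_acquire_cred() succeeds, also ask whether
     * the acquired credential has any lifetime left.  Untested, and the
     * function name and placement are made up for illustration only.
     */
    static bool
    have_usable_gss_cred(void)
    {
        OM_uint32   major,
                    minor,
                    lifetime = 0;
        gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;

        major = gss_acquire_cred(&minor, GSS_C_NO_NAME, 0, GSS_C_NO_OID_SET,
                                 GSS_C_INITIATE, &cred, NULL, NULL);
        if (major != GSS_S_COMPLETE)
            return false;

        major = gss_inquire_cred(&minor, cred, NULL, &lifetime, NULL, NULL);
        gss_release_cred(&minor, &cred);

        return major == GSS_S_COMPLETE && lifetime != 0;
    }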
2. Once gss_init_sec_context() fails, libpq abandons the connection
and starts over; since it has already initiated a GSS handshake
on the connection, there's not much choice. Although libpq faithfully
issues closesocket() on the abandoned connection, Alexander found
that the connected backend doesn't reliably see that: it may just
sit there until the AuthenticationTimeout elapses (1 minute by
default). That backend is still consuming a postmaster child slot,
so if this happens on any sizable fraction of test connection
attempts, it's little surprise that we soon get "sorry, too many
clients already" failures.
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware. That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.
Alexander confirmed this theory by demonstrating that the main
regression tests in v15 would pass if he limited their speed enough
(by reducing the CPU allowed to a VM) but not at full speed. So the
buildfarm results suggesting this is only an issue in <= v13 must
be just a timing artifact; the problem is still there.
Of course, backends waiting till timeout is not a good behavior
independently of what is triggering that, so we have two problems to
solve here, not one. I have no ideas about the gss_init_sec_context()
failure, but I see a plausible theory about the failure to detect
socket closure in Microsoft's documentation about closesocket() [1]:
If the l_onoff member of the LINGER structure is zero on a stream
socket, the closesocket call will return immediately and does not
receive WSAEWOULDBLOCK whether the socket is blocking or
nonblocking. However, any data queued for transmission will be
sent, if possible, before the underlying socket is closed. This is
also called a graceful disconnect or close. In this case, the
Windows Sockets provider cannot release the socket and other
resources for an arbitrary period, thus affecting applications
that expect to use all available sockets. This is the default
behavior for a socket.
I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?). But this has the right odor for an explanation.
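(If someone with a suitable Windows environment wants to poke at that
theory, one crude experiment would be to make the client's close
abortive rather than graceful and see whether the backend then notices
the disconnect promptly. Untested sketch, not proposed for real code:)

    #include <winsock2.h>

    /*
     * Experiment only: an abortive (RST) close, so that Winsock cannot hold
     * the socket open to flush queued data as described in the passage
     * quoted above.  If the backend then sees the disconnect promptly, the
     * linger/graceful-close behavior is implicated.
     */
    static void
    abortive_close(SOCKET sock)
    {
        struct linger lin;

        lin.l_onoff = 1;        /* enable SO_LINGER ... */
        lin.l_linger = 0;       /* ... with zero timeout => hard close */
        (void) setsockopt(sock, SOL_SOCKET, SO_LINGER,
                          (const char *) &lin, sizeof(lin));
        (void) closesocket(sock);
    }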
I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually
had to be reverted. I think we need a deeper understanding of
exactly what Winsock is doing or not doing before we try to fix it.
I wonder if there are any Microsoft employees around here with
access to the relevant source code.
In the short run, it might be a good idea to deprecate using
--with-gssapi on Windows builds. A different stopgap idea
could be to drastically reduce the default AuthenticationTimeout,
perhaps only on Windows.
regards, tom lane
[1]: https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket
On Tue, May 14, 2024 at 8:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?). But this has the right odor for an explanation.
I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually
had to be reverted. I think we need a deeper understanding of
exactly what Winsock is doing or not doing before we try to fix it.
I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what
https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).
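(To spell out the race for anyone not steeped in Winsock, the per-wait
pattern is roughly the following; sketch only, error checking omitted.
The point is that an FD_CLOSE signaled against one short-lived event
handle is never re-delivered to the next one.)

    #include <winsock2.h>

    /*
     * Rough sketch of what each WaitLatchOrSocket() effectively does on
     * Windows today.  FD_CLOSE for a graceful shutdown is generated only
     * once: if it fires after we've stopped waiting on this handle,
     * associating the socket with the next wait's fresh handle does not
     * re-signal it, so the next WSAWaitForMultipleEvents() can sleep
     * forever.
     */
    static void
    wait_once_for_socket(SOCKET sock)
    {
        WSAEVENT    ev = WSACreateEvent();
        WSANETWORKEVENTS nev;

        WSAEventSelect(sock, ev, FD_READ | FD_CLOSE);
        WSAWaitForMultipleEvents(1, &ev, FALSE, 1000 /* ms */, FALSE);
        WSAEnumNetworkEvents(sock, ev, &nev);

        if (nev.lNetworkEvents & FD_CLOSE)
        {
            /* peer has shut down; caller should read until EOF */
        }

        WSAEventSelect(sock, ev, 0);    /* deassociate the socket */
        WSACloseEvent(ev);
    }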
13.05.2024 23:17, Tom Lane wrote:
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware. That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.
Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings
substantially differ from REL_13_STABLE-, say, for the check stage:
REL_14_STABLE: the oldest available test log (from 2021-10-30) shows
check (00:03:47) and the newest one (from 2024-05-12): check (00:03:18).
Whilst on REL_13_STABLE the oldest available log (from 2021-08-06) shows
check duration 00:03:00, then it decreased to 00:02:24 (2021-09-22)/
00:02:14 (2021-11-07), and now it's 1:40, as you said.
Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27
secs) and on REL_14_STABLE (33, 29, 29 secs).
14.05.2024 03:38, Thomas Munro wrote:
I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).
Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
state when that patch set can be applied).
An excerpt (all lines related to process 12500) from a failed run log
without the patch set:
2024-05-14 05:57:29.526 UTC [8228:128] DEBUG: forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG: connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG: server process (PID 12500) exited with exit code 1
# at the end of the test run
And an excerpt (all lines related to process 11736) from a successful run
log with the patch set applied:
2024-05-14 06:03:57.216 UTC [4524:130] DEBUG: forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG: connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG: exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG: forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG: server process (PID 11736) exited with exit code 0
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
13.05.2024 23:17, Tom Lane wrote:
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware.
Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings
substantially differ from REL_13_STABLE-, say, for the check stage:
As I mentioned in our off-list discussion, I have a lingering feeling
that this v14 commit could be affecting the results somehow:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400
Update the Winsock API version requested by libpq.
According to Microsoft's documentation, 2.2 has been the current
version since Windows 98 or so. Moreover, that's what the Postgres
backend has been requesting since 2004 (cf commit 4cdf51e64).
So there seems no reason for libpq to keep asking for 1.1.
I didn't believe at the time that that'd have any noticeable effect,
but maybe it somehow made Winsock play a bit nicer with the GSS
support?
regards, tom lane
14.05.2024 17:38, Tom Lane wrote:
As I mentioned in our off-list discussion, I have a lingering feeling
that this v14 commit could be affecting the results somehow:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400
Update the Winsock API version requested by libpq.
According to Microsoft's documentation, 2.2 has been the current
version since Windows 98 or so. Moreover, that's what the Postgres
backend has been requesting since 2004 (cf commit 4cdf51e64).
So there seems no reason for libpq to keep asking for 1.1.
I didn't believe at the time that that'd have any noticeable effect,
but maybe it somehow made Winsock play a bit nicer with the GSS
support?
Yes, probably, but maybe not nicer, as the test duration increased?
Still I can't see the difference locally to check that commit.
Will try other VMs/configurations, maybe I could find a missing factor...
Best regards,
Alexander
On Tue, May 14, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
14.05.2024 03:38, Thomas Munro wrote:
I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).
Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
state when that patch set can be applied).
Thanks for checking, and generally for your infinite patience with all
these horrible Windows problems.
OK, so we know what the problem is here. Here is the simplest
solution I know of for that problem. I have proposed this in the past
and received negative feedback because it's a really gross hack. But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event. Does this fix
it?
Attachments:
0001-Add-kludge-to-make-FD_CLOSE-level-triggered.patch (text/x-patch)
From cbe4680c3e561b26c0bb49fc39dc8a6f40e84134 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH] Add kludge to make FD_CLOSE level-triggered.
Winsock only signals an FD_CLOSE event once, if the other end of the
socket shuts down gracefully. Because each WaitLatchOrSocket() call
constructs and destroys a new event handle, we can miss the FD_CLOSE
event that is signaled just as we're destroying the handle. The next
WaitLatchOrSocket() will not see it.
Fix that race with some extra polling.
We wouldn't need this if we had long-lived event handles for sockets,
which has been proposed and tested and shown to work, but it's far too
complex to back-patch.
---
src/backend/storage/ipc/latch.c | 37 +++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..9a5273f17d6 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1999,6 +1999,43 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
cur_event->reset = false;
}
+ /*
+ * Because we associate the socket with different event handles at
+ * different times, and because FD_CLOSE is only generated once the
+ * other end closes gracefully, we might miss an FD_CLOSE event that
+ * was signaled already on a handle we've already closed. We close
+ * that race by synchronously polling for EOF, after adjusting the
+ * event above and before sleeping below.
+ *
+ * XXX If we arranged to have one event handle for the lifetime of a
+ * socket, we wouldn't need this.
+ */
+ if (cur_event->events & WL_SOCKET_READABLE)
+ {
+ char c;
+ WSABUF buf;
+ DWORD received;
+ DWORD flags;
+ int r;
+
+ buf.buf = &c;
+ buf.len = 1;
+
+ /*
+ * Peek to see if EOF condition is true. Don't worry about error
+ * handling or pending data, just be careful not to consume it.
+ */
+ flags = MSG_PEEK;
+ if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+ {
+ occurred_events->pos = cur_event->pos;
+ occurred_events->user_data = cur_event->user_data;
+ occurred_events->events = WL_SOCKET_READABLE;
+ occurred_events->fd = cur_event->fd;
+ return 1;
+ }
+ }
+
/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
--
2.44.0
15.05.2024 01:26, Thomas Munro wrote:
OK, so we know what the problem is here. Here is the simplest
solution I know of for that problem. I have proposed this in the past
and received negative feedback because it's a really gross hack. But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event. Does this fix
it?
Yes, I see that abandoned GSS connections are closed immediately, as
expected. I have also confirmed that `meson test` with the basic
configuration passes on REL_16_STABLE. So from the outside, the fix
looks good to me.
Thank you for working on this!
Best regards,
Alexander
On Wed, May 15, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
15.05.2024 01:26, Thomas Munro wrote:
OK, so we know what the problem is here. Here is the simplest
solution I know of for that problem. I have proposed this in the past
and received negative feedback because it's a really gross hack. But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event. Does this fix
it?
Yes, I see that abandoned GSS connections are closed immediately, as
expected. I have also confirmed that `meson test` with the basic
configuration passes on REL_16_STABLE. So from the outside, the fix
looks good to me.
Alright, unless anyone has an objection or ideas for improvements, I'm
going to go ahead and back-patch this slightly tidied up version
everywhere.
Attachments:
v2-0001-Add-kludge-to-make-FD_CLOSE-level-triggered.patch (application/octet-stream)
From 46cf55acc8a9f6b59fbac845339e87b8a9501956 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH v2] Add kludge to make FD_CLOSE level-triggered.
Winsock only signals an FD_CLOSE event once if the other end of the
socket shuts down gracefully. Because each WaitLatchOrSocket() call
constructs and destroys a new event handle every time, we can miss the
FD_CLOSE notification that is signaled after we've stopped waiting. The
next WaitLatchOrSocket()'s event handle will never see it.
Fix that race with some extra polling. It's not a beautiful code
change, but it seems to work, and is in the same spirit as the kludge
installed by commit f7819baa6.
We wouldn't need this if we had exactly one long-lived event handle per
socket, which has been proposed and tested and shown to work, but that's
too complex to back-patch.
It's plausible that commits 6051857fc and ed52c3707, later reverted by
29992a6a5, could be re-instated, with this fix in place.
Back-patch to all supported releases. This should hopefully clear up
build farm animal hamerkop's recent failures.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/176008.1715492071%40sss.pgh.pa.us
---
src/backend/storage/ipc/latch.c | 37 +++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..4e28410e4c8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1999,6 +1999,43 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
cur_event->reset = false;
}
+ /*
+ * We associate the socket with a new event handle for each
+ * WaitEventSet. FD_CLOSE is only generated once if the other end
+ * closes gracefully. Therefore we might miss the FD_CLOSE
+ * notification, if it was delivered to another event after we stopped
+ * waiting for it. Close that race by polling for EOF after setting
+ * up this handle to receive notifications, and before entering the
+ * sleep.
+ *
+ * XXX If we had one event handle for the lifetime of a socket, we
+ * wouldn't need this.
+ */
+ if (cur_event->events & WL_SOCKET_READABLE)
+ {
+ char c;
+ WSABUF buf;
+ DWORD received;
+ DWORD flags;
+
+ /*
+ * Peek to see if EOF has been reached. Don't worry about error
+ * handling or pending data here, as those will cause a wakeup
+ * below and be discovered by a later non-peek recv() call.
+ */
+ buf.buf = &c;
+ buf.len = 1;
+ flags = MSG_PEEK;
+ if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+ {
+ occurred_events->pos = cur_event->pos;
+ occurred_events->user_data = cur_event->user_data;
+ occurred_events->events = WL_SOCKET_READABLE;
+ occurred_events->fd = cur_event->fd;
+ return 1;
+ }
+ }
+
/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
--
2.44.0
On Thu, May 16, 2024 at 9:46 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Alright, unless anyone has an objection or ideas for improvements, I'm
going to go ahead and back-patch this slightly tidied up version
everywhere.
Of course as soon as I wrote that I thought of a useful improvement
myself: as far as I can tell, you only need to do the extra poll on
the first wait for WL_SOCKET_READABLE for any given WaitEventSet. I
don't think it's needed for later waits done by long-lived
WaitEventSet objects, so we can track that with a flag. That avoids
adding new overhead for regular backend socket waits after
authentication, it's just in these code paths that do a bunch of
WaitLatchOrSocket() calls that we need to consider FD_CLOSE events
lost between the cracks.
I also don't know if the condition should include "&& received == 0".
It probably doesn't make much difference, but by leaving that off we
don't have to wonder how peeking interacts with events, ie if it's a
problem that we didn't do the "reset" step. Thinking about that, I
realised that I should probably set reset = true in this new return
path, just like the "normal" WL_SOCKET_READABLE notification path,
just to be paranoid. (Programming computers you don't have requires
extra paranoia.)
Any chance you could test this version please Alexander?
Attachments:
v3-0001-Fix-FD_CLOSE-socket-event-hangs-on-Windows.patch (application/octet-stream)
From c24fc3bc7309a22f6e07e6375abe584700df4552 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH v3] Fix FD_CLOSE socket event hangs on Windows.
Winsock only signals an FD_CLOSE event once if the other end of the
socket shuts down gracefully. Because each WaitLatchOrSocket() call
constructs and destroys a new event handle every time, we can miss the
FD_CLOSE notification that is signaled after we've stopped waiting. The
next WaitLatchOrSocket()'s event handle will never see it.
Fix that race with some extra polling. It's not a beautiful code
change, but it seems to work, and is in the same spirit as the kludge
installed by commit f7819baa6.
We wouldn't need this if we had exactly one long-lived event handle per
socket, which has been proposed and tested and shown to work, but that's
too complex to back-patch.
It's plausible that commits 6051857fc and ed52c3707, later reverted by
29992a6a5, could be re-instated, with this fix in place.
Back-patch to all supported releases. This should hopefully clear up
build farm animal hamerkop's recent failures.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/176008.1715492071%40sss.pgh.pa.us
---
src/backend/storage/ipc/latch.c | 43 +++++++++++++++++++++++++++++++++
src/include/storage/latch.h | 1 +
2 files changed, 44 insertions(+)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..cad360c1ada 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1000,6 +1000,7 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
event->user_data = user_data;
#ifdef WIN32
event->reset = false;
+ event->peek = true;
#endif
if (events == WL_LATCH_SET)
@@ -1999,6 +2000,48 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
cur_event->reset = false;
}
+ /*
+ * We associate the socket with a new event handle for each
+ * WaitEventSet. FD_CLOSE is only generated once if the other end
+ * closes gracefully. Therefore we might miss the FD_CLOSE
+ * notification, if it was delivered to another event after we stopped
+ * waiting for it. Close that race by polling for a readable socket
+ * after setting up this handle to receive notifications, and before
+ * entering the sleep, on the first time we wait for this socket.
+ */
+ if ((cur_event->events & WL_SOCKET_READABLE) != 0 && cur_event->peek)
+ {
+ char c;
+ WSABUF buf;
+ DWORD received;
+ DWORD flags;
+
+ /*
+ * Don't do this next time, it's only needed once to smooth over
+ * the transition from one WaitEventSet's event handle to
+ * another's.
+ */
+ cur_event->peek = false;
+
+ /*
+ * Peek to see if EOF has been reached. Don't worry about errors,
+ * as those will cause a wakeup below and be discovered by a later
+ * non-peek recv() call.
+ */
+ buf.buf = &c;
+ buf.len = 1;
+ flags = MSG_PEEK;
+ if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+ {
+ occurred_events->pos = cur_event->pos;
+ occurred_events->user_data = cur_event->user_data;
+ occurred_events->events = WL_SOCKET_READABLE;
+ occurred_events->fd = cur_event->fd;
+ cur_event->reset = true;
+ return 1;
+ }
+ }
+
/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7e194d536f0..fb5cb2864db 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -157,6 +157,7 @@ typedef struct WaitEvent
void *user_data; /* pointer provided in AddWaitEventToSet */
#ifdef WIN32
bool reset; /* Is reset of the event required? */
+ bool peek; /* Is peek required? */
#endif
} WaitEvent;
--
2.44.0
On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Any chance you could test this version please Alexander?
Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI
thing and is superficially better, but it doesn't handle code that
calls twice in a row and ignores the first result (I know that
PostgreSQL does that occasionally in a few places), and it's also
broken if someone gets recv() = 0 (EOF), and then decides to wait
anyway. The only ways I can think of to get full reliable poll()-like
semantics is to do that peek every time, OR the complicated patch
(per-socket-workspace + intercepting recv etc). So I'm back to v2.
Hello Thomas,
16.05.2024 04:32, Thomas Munro wrote:
On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Any chance you could test this version please Alexander?
Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI
thing and is superficially better, but it doesn't handle code that
calls twice in a row and ignores the first result (I know that
PostgreSQL does that occasionally in a few places), and it's also
broken if someone gets recv() = 0 (EOF), and then decides to wait
anyway. The only ways I can think of to get full reliable poll()-like
semantics is to do that peek every time, OR the complicated patch
(per-socket-workspace + intercepting recv etc). So I'm back to v2.
I've tested v2 and can confirm that it works as v1, `vcregress check`
passes with no failures on REL_16_STABLE, `meson test` with the basic
configuration too.
By the way, hamerkop is not configured to enable gssapi for HEAD [1] and
I could not enable gss locally yet (just passing extra_lib_dirs,
extra_include_dirs doesn't work for me).
It looks like we need to find a way to enable it for meson to continue
testing v17+ with GSS on Windows.
[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamerkop&dt=2024-05-12%2011%3A00%3A28&stg=configure
Best regards,
Alexander
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
Meanwhile, back at the ranch, it doesn't seem that changed anything:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD. All its other branches are equally stuck at the
end of March. So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
regards, tom lane
On 2024-05-16 Th 16:18, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
Meanwhile, back at the ranch, it doesn't seem that changed anything:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD. All its other branches are equally stuck at the
end of March. So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
Yeah, that should be doable. Since we have the git ref these days we
should be able to mark it as old, or maybe just reject builds for very
old commits (the latter would be easier).
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 16:18, Tom Lane wrote:
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
Yeah, that should be doable. Since we have the git ref these days we
should be able to mark it as old, or maybe just reject builds for very
old commits (the latter would be easier).
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
regards, tom lane
On 2024-05-16 Th 17:15, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 16:18, Tom Lane wrote:
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
Yeah, that should be doable. Since we have the git ref these days we
should be able to mark it as old, or maybe just reject builds for very
old commits (the latter would be easier).
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
Fair enough. That will mean some database changes and other stuff, so it
will take a bit longer.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 17:15, Tom Lane wrote:
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
Fair enough. That will mean some database changes and other stuff, so it
will take a bit longer.
Sure, I don't think it's urgent.
regards, tom lane
Hello,
I'm a hamerkop maintainer.
Sorry I have missed the scm error for so long.
Today I switched scmrepo from git.postgresql.org/git/postgresql.git
to github.com/postgres/postgres.git and successfully modernized
the build target code.
with best regards, Haruka Takatsuka
On Thu, 16 May 2024 16:18:23 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
Meanwhile, back at the ranch, it doesn't seem that changed anything:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD. All its other branches are equally stuck at the
end of March. So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
regards, tom lane
TAKATSUKA Haruka <harukat@sraoss.co.jp> writes:
I'm a hamerkop maintainer.
Sorry I have missed the scm error for so long.
Today I switched scmrepo from git.postgresql.org/git/postgresql.git
to github.com/postgres/postgres.git and successfully modernized
the build target code.
Thanks very much! I see hamerkop has gone green in HEAD.
It looks like it succeeded in v13 too but failed in v12,
which suggests that the isolationcheck problem is intermittent,
which is not too surprising given our current theory about
what's causing that.
At this point I think we are too close to the 17beta1 release
freeze to mess with it, but I'd support pushing Thomas'
proposed patch after the freeze is over.
regards, tom lane
On 2024-05-16 Th 17:34, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 17:15, Tom Lane wrote:
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
Fair enough. That will mean some database changes and other stuff, so it
will take a bit longer.
Sure, I don't think it's urgent.
I've pushed a small change, that should just mark with an asterisk any
gitref that is more than 2 days older than the tip of the branch at the
time of reporting.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
I've pushed a small change, that should just mark with an asterisk any
gitref that is more than 2 days older than the tip of the branch at the
time of reporting.
Thanks!
regards, tom lane
On Fri, May 17, 2024 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I've tested v2 and can confirm that it works as v1, `vcregress check`
passes with no failures on REL_16_STABLE, `meson test` with the basic
configuration too.
Pushed, including back-branches.
This is all not very nice code and I hope we can delete it all some
day. Ideas include: (1) Thinking small: change over to the
WAIT_USE_POLL implementation of latch.c on this OS (Windows has poll()
these days), using a socket pair for latch wakeup (i.e. give up trying
to multiplex with native Windows event handles, even though they are a
great fit for our latch abstraction, as the sockets are too different
from Unix). (2) Thinking big: use native completion-based
asynchronous socket APIs, as part of a much larger cross-platform AIO
socket reengineering project that would deliver higher performance
networking on all OSes. The thought of (2) puts me off investing time
into (1), but on the other hand it would be nice if Windows could
almost completely share code with some Unixen. I may be more inclined
to actually try it if/when we can rip out the fake signal support,
because it is tangled up with this stuff and does not spark joy.
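(For anyone wondering what the WAIT_USE_POLL shape in idea (1) would
look like, the Unix version of the idea is roughly this; sketch only,
hand-waving over the detail that Windows would need a loopback socket
pair instead of a pipe:)

    #include <poll.h>
    #include <unistd.h>

    /*
     * Minimal sketch of the self-pipe latch idea: one fd for wakeups plus
     * the client socket, multiplexed with plain poll().  No per-wait
     * Windows event handles, so no handle-lifetime races.  Assumes the
     * wakeup fd is non-blocking.
     */
    static void
    wait_for_latch_or_socket(int latch_read_fd, int client_sock, int timeout_ms)
    {
        struct pollfd fds[2];

        fds[0].fd = latch_read_fd;  /* read end of the wakeup pipe/socket pair */
        fds[0].events = POLLIN;
        fds[1].fd = client_sock;
        fds[1].events = POLLIN;

        if (poll(fds, 2, timeout_ms) > 0)
        {
            if (fds[0].revents & POLLIN)
            {
                char        buf[16];

                /* latch was set: drain the wakeup bytes and return */
                while (read(latch_read_fd, buf, sizeof(buf)) > 0)
                    ;
            }
            if (fds[1].revents & (POLLIN | POLLHUP | POLLERR))
            {
                /* socket is readable, at EOF, or in error: caller recv()s */
            }
        }
    }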
Thomas Munro wrote 2024-05-12 06:31:
Hamerkop is already green on the 15 and 16 branches, apparently
because it's using the pre-meson test stuff that I guess just didn't
run the relevant test. In other words, nobody would notice the
difference anyway, and a master-only fix would be enough to end this
44-day red streak.
Sorry for necroposting, but in our automated testing system we have
found some failures of this test. The most recent one was a couple of
days ago (see attached files) on PostgreSQL 15.7. Also I've reported
this bug some time ago [1], but provided an example only for
PostgreSQL 17. Back then the bug was actually found on 15 or 16
branches (no logs remain from a couple of months back), but I wanted
to show that it was reproducible on 17.
I would appreciate it if you would backpatch this change to 15 and 16
branches.
[1]: /messages/by-id/6885a0b52d06f7e5910d2b6276bbb4e8@postgrespro.ru
Oleg Tselebrovskiy, Postgres Pro
Attachments:
regression.diffs (text/x-diff)
diff -w -U3 C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/expected/citext_utf8.out C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/results/citext_utf8.out
--- C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/expected/citext_utf8.out 2024-07-29 13:53:45.259126600 +0300
+++ C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/results/citext_utf8.out 2024-07-29 14:43:38.772857200 +0300
@@ -54,7 +54,7 @@
SELECT 'i'::citext = 'İ'::citext AS t;
t
---
- t
+ f
(1 row)
-- Regression.
On Fri, Aug 2, 2024 at 1:37 AM Oleg Tselebrovskiy
<o.tselebrovskiy@postgrespro.ru> wrote:
I would appreciate if you would backpatch this change to 15 and 16
branches.
Done (e52a44b8, 91f498fd).
Any elucidation on how and why Windows machines have started using
UTF-8 would be welcome.
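(The minimal probe, for anyone who has access to one of the machines in
question; sketch:)

    #include <windows.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * GetACP() reports the process's active code page.  65001 (CP_UTF8)
         * would mean the system setting or an application manifest has opted
         * this process into the UTF-8 ACP, which could in turn steer initdb
         * towards a UTF-8 encoding and locale by default.
         */
        UINT        acp = GetACP();

        printf("ACP = %u%s\n", acp, acp == CP_UTF8 ? " (UTF-8)" : "");
        return 0;
    }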
On Aug 1, 2024, at 18:54, Thomas Munro <thomas.munro@gmail.com> wrote:
Done (e52a44b8, 91f498fd).
Any elucidation on how and why Windows machines have started using
UTF-8 would be welcome.
Haven’t been following this thread, but this post reminded me of an issue I saw with locales on Windows[1]. Could it be that the introduction of Universal CRT[2] in Windows 10 has improved UTF-8 support?
Bit of a wild guess, but I assume worth bringing up at least.
D
[1]: https://github.com/shogo82148/actions-setup-perl/issues/1713
[2]: https://learn.microsoft.com/en-us/cpp/porting/upgrade-your-code-to-the-universal-crt?view=msvc-170
On Sat, Aug 3, 2024 at 2:11 AM David E. Wheeler <david@justatheory.com> wrote:
Haven’t been following this thread, but this post reminded me of an issue I saw with locales on Windows[1]. Could it be that the introduction of Universal CRT[2] in Windows 10 has improved UTF-8 support?
Yeah. We have a few places that claim that Windows APIs can't do
UTF-8 and they have to do extra wchar_t conversions, but that doesn't
seem to be true on modern Windows. Example:
I suspect that, at least when the locale name is "en-US.UTF-8", the
the regular POSIXoid strcoll_l() function should just work™ and we
could delete all that stuff and save Windows users a lot of wasted CPU
cycles.
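(Roughly what I have in mind, completely untested, using the CRT's
underscore-prefixed spellings and assuming a UCRT new enough to accept
the ".UTF-8" suffix:)

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Sketch: open an "en-US.UTF-8" CRT locale and compare two UTF-8
         * strings directly with _strcoll_l(), no wchar_t detour.  If this
         * gives sane answers on a modern system, the extra conversions could
         * perhaps be skipped there.
         */
        _locale_t   loc = _create_locale(LC_COLLATE, "en-US.UTF-8");

        if (loc == NULL)
        {
            printf("locale not supported\n");
            return 1;
        }
        printf("_strcoll_l -> %d\n",
               _strcoll_l("caf\xc3\xa9", "cafe", loc));     /* UTF-8 bytes */
        _free_locale(loc);
        return 0;
    }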