Why is citext/regress failing on hamerkop?
For example, 'i'::citext = 'İ'::citext fails to be true.
It must now be using UTF-8 (unlike, say, Drongo) and non-C ctype,
given that the test isn't skipped. This isn't the first time that
we've noticed that Windows doesn't seem to know about İ→i (see [1]),
but I don't think anyone has explained exactly why, yet. It could be
that it just doesn't know about that in any locale, or that it is
locale-dependent and would only do that for Turkish, the same reason
we skip the test for ICU, or ...
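(If anyone with a Windows box handy wants to probe what the C runtime
itself thinks, here's a rough, untested sketch; the locale name is a
guess and is spelled differently on different platforms:)

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int
    main(void)
    {
        /*
         * Ask the C library what it does with U+0130 (LATIN CAPITAL LETTER I
         * WITH DOT ABOVE).  The answer depends on platform and locale, which
         * is exactly the question here.  "tr-TR.UTF-8" is a guess for a
         * modern Windows CRT; glibc spells it "tr_TR.UTF-8".
         */
        const char *loc = setlocale(LC_CTYPE, "tr-TR.UTF-8");

        printf("locale: %s\n", loc ? loc : "(not available)");
        printf("towlower(U+0130) = U+%04X\n", (unsigned) towlower(0x0130));
        return 0;
    }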
Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green. That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.
One new development in Windows-land is that the system now does
actually support UTF-8 in the runtime libraries[2]. You can set it at
a system level, or for an application at build time, or by adding
".UTF-8" to a locale name when opening the locale (apparently much
more like Unix systems, but I don't know what exactly it does). I
wonder why we see this change now... did hamerkop have that ACP=UTF-8
setting applied on purpose, or did computers in Japan start doing
that by default instead of using Shift-JIS, or did it only start
picking UTF-8 around the time of the Meson change somehow, or the
initdb-template change? It's a little hard to tell from the logs.
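(For what it's worth, the third of those mechanisms looks like it
should be as simple as this from C; untested sketch, going purely by
that Microsoft page:)

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Per the UTF-8 documentation cited in [2], a recent UCRT accepts a
         * ".UTF-8" suffix on the locale name, switching that locale's code
         * page to UTF-8 without touching the system-wide ACP.  Untested, and
         * whether hamerkop's environment relies on this or on the system
         * setting is the open question.
         */
        char *loc = setlocale(LC_ALL, ".UTF-8");

        printf("setlocale(LC_ALL, \".UTF-8\") -> %s\n", loc ? loc : "(failed)");
        return 0;
    }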
[1]: /messages/by-id/CAC+AXB10p+mnJ6wrAEm6jb51+8=BfYzD=w6ftHRbMjMuSFN3kQ@mail.gmail.com
[2]: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green. That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.
Here's a minimal patch like that.
I don't think it's worth back-patching. Only 15 and 16 could possibly
be affected, I think, because the test wasn't enabled before that. I
think this is all just a late-appearing blow-up predicted by the
commit that enabled it:
commit c2e8bd27519f47ff56987b30eb34a01969b9a9e8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Jan 5 13:30:07 2022 -0500
Enable routine running of citext's UTF8-specific test cases.
These test cases have been commented out since citext was invented,
because at the time we had no nice way to deal with tests that
have restrictions such as requiring UTF8 encoding. But now we do
have a convention for that, ie put them into a separate test file
with an early-exit path. So let's enable these tests to run when
their prerequisites are satisfied.
(We may have to tighten the prerequisites beyond the "encoding = UTF8
and locale != C" checks made here. But let's put it on the buildfarm
and see what blows up.)
Hamerkop is already green on the 15 and 16 branches, apparently
because it's using the pre-meson test stuff that I guess just didn't
run the relevant test. In other words, nobody would notice the
difference anyway, and a master-only fix would be enough to end this
44-day red streak.
Attachments:
0001-Skip-the-citext_utf8-test-on-Windows.patch (text/x-patch)
From 1f0e1dc21d4055a0e5109bac39999b290508e2d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 12 May 2024 10:20:06 +1200
Subject: [PATCH] Skip the citext_utf8 test on Windows.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On other Windows build farm animals it is already skipped because they
don't use UTF-8 encoding. On "hamerkop", UTF-8 is used, and then the
test fails.
It is not clear to me (a non-Windows person looking only at buildfarm
evidence) whether Windows is less sophisticated than other OSes and
doesn't know how to downcase Turkish İ with the standard Unicode
database, or if it is more sophisticated than other systems and uses
locale-specific behavior like ICU does.
Whichever the reason, the result is the same: we need to skip the test
on Windows, just as we already do for ICU, at least until a
Windows-savvy developer comes up with a better idea. The technique for
detecting the OS is borrowed from collate.windows.win1252.sql.
Discussion: https://postgr.es/m/CA%2BhUKGJ1LeC3aE2qQYTK95rFVON3ZVoTQpTKJqxkHdtEyawH4A%40mail.gmail.com
---
contrib/citext/expected/citext_utf8.out | 3 +++
contrib/citext/expected/citext_utf8_1.out | 3 +++
contrib/citext/sql/citext_utf8.sql | 3 +++
3 files changed, 9 insertions(+)
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 5d988dcd485..19538db674e 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -6,8 +6,11 @@
* Turkish dotted I is not correct for many ICU locales. citext always
* uses the default collation, so it's not easy to restrict the test
* to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows. It fails similarly, at least in some locales.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
+ version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
FROM pg_database
WHERE datname=current_database())
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 7065a5da190..874ec8519e1 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -6,8 +6,11 @@
* Turkish dotted I is not correct for many ICU locales. citext always
* uses the default collation, so it's not easy to restrict the test
* to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows. It fails similarly, at least in some locales.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
+ version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
FROM pg_database
WHERE datname=current_database())
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index 34b232d64e2..ba283320797 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -6,9 +6,12 @@
* Turkish dotted I is not correct for many ICU locales. citext always
* uses the default collation, so it's not easy to restrict the test
* to the "tr-TR-x-icu" collation where it will succeed.
+ *
+ * Also disable for Windows. It fails similarly, at least in some locales.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
+ version() ~ '(Visual C\+\+|mingw32|windows)' OR
(SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
FROM pg_database
WHERE datname=current_database())
--
2.44.0
Thomas Munro <thomas.munro@gmail.com> writes:
On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green. That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.
Here's a minimal patch like that.
WFM until some Windows person cares to probe more deeply.
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
I'm not planning on looking into that question myself, but really
somebody ought to. Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?
regards, tom lane
Hello Tom,
12.05.2024 08:34, Tom Lane wrote:
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
I'm not planning on looking into that question myself, but really
somebody ought to. Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?
I've reproduced the failure locally with GSS enabled, so I'll try to
figure out what's going on here in the next few days.
Best regards,
Alexander
On 2024-05-12 Su 01:34, Tom Lane wrote:
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
Possibly. It looks like this might be the issue:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already
There are several questions here, including:
1. why isn't it failing on later branches?
2. why isn't it failing on drongo (which has more modern compiler and OS)?
I think we'll need the help of the animal owner to dig into the issue.
I'm not planning on looking into that question myself, but really
somebody ought to. Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?
Well, this is more or less where I came in back in about 2002 :-) I've
been trying to help support it ever since, motivated by stubborn
persistence more than anything else. Still, I agree that the lack of support
for the Windows port from Microsoft over the years has been more than
disappointing.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-05-12 Su 08:26, Andrew Dunstan wrote:
On 2024-05-12 Su 01:34, Tom Lane wrote:
BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so. It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.
Possibly. It looks like this might be the issue:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already
There are several questions here, including:
1. why isn't it failing on later branches?
2. why isn't it failing on drongo (which has more modern compiler and OS)?
I think we'll need the help of the animal owner to dig into the issue.
Aha! drongo doesn't have GSSAPI enabled. Will work on that.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.
I think "state of the Windows port" would make a good discussion topic
at pgconf.dev (with write-up for those who can't be there). If there
is interest, I could organise that with a short presentation of the
issues I am aware of so far and possible improvements and
smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I
would focus on technical stuff, not who-should-be-doing-what, 'cause I
can't make anyone do anything.
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
On 2024-05-12 Su 18:05, Thomas Munro wrote:
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.
I think "state of the Windows port" would make a good discussion topic
at pgconf.dev (with write-up for those who can't be there). If there
is interest, I could organise that with a short presentation of the
issues I am aware of so far and possible improvements and
smaller-things-we-could-drop-instead-of-dropping-the-whole-port. I
would focus on technical stuff, not who-should-be-doing-what, 'cause I
can't make anyone do anything.
+1
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
In the meantime, some off-list investigation by Alexander Lakhin
has turned up a good deal of information about why we're seeing
failures on hamerkop in the back branches. Summarizing, it
appears that
1. In a GSS-enabled Windows build without any active Kerberos server,
libpq's pg_GSS_have_cred_cache() succeeds, allowing libpq to try to
open a GSSAPI connection, but then gss_init_sec_context() fails,
leading to client-side reports like this:
+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure. Minor code may provide more information: Credential cache is empty
+FATAL: sorry, too many clients already
(The first of these lines comes out during the attempted GSS
connection, the second during the only-slightly-more-successful
non-GSS connection.) So that's problem number 1: how is it that
gss_acquire_cred() succeeds but then gss_init_sec_context() disclaims
knowledge of any credentials? Can we find a way to get this failure
to be detected during pg_GSS_have_cred_cache()? It is mighty
expensive to launch a backend connection that is doomed to fail,
especially when this happens during *every single libpq connection
attempt*.
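(For the archives, the shape of what I have in mind is something like
the following, assuming the MIT Kerberos headers; just a sketch, and I
have no idea whether gss_inquire_cred() would actually report anything
different in the failing Windows environment.)

    #include <gssapi/gssapi.h>
    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical stricter version of libpq's credential-cache probe:
     * don't just check that gss_acquire_cred() succeeds, also ask whether
     * the acquired credential has any lifetime left.  Untested, and the
     * function name and placement are made up for illustration only.
     */
    static bool
    have_usable_gss_cred(void)
    {
        OM_uint32   major,
                    minor,
                    lifetime = 0;
        gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;

        major = gss_acquire_cred(&minor, GSS_C_NO_NAME, 0, GSS_C_NO_OID_SET,
                                 GSS_C_INITIATE, &cred, NULL, NULL);
        if (major != GSS_S_COMPLETE)
            return false;

        major = gss_inquire_cred(&minor, cred, NULL, &lifetime, NULL, NULL);
        gss_release_cred(&minor, &cred);

        return major == GSS_S_COMPLETE && lifetime != 0;
    }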
2. Once gss_init_sec_context() fails, libpq abandons the connection
and starts over; since it has already initiated a GSS handshake
on the connection, there's not much choice. Although libpq faithfully
issues closesocket() on the abandoned connection, Alexander found
that the connected backend doesn't reliably see that: it may just
sit there until the AuthenticationTimeout elapses (1 minute by
default). That backend is still consuming a postmaster child slot,
so if this happens on any sizable fraction of test connection
attempts, it's little surprise that we soon get "sorry, too many
clients already" failures.
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware. That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.
Alexander confirmed this theory by demonstrating that the main
regression tests in v15 would pass if he limited their speed enough
(by reducing the CPU allowed to a VM) but not at full speed. So the
buildfarm results suggesting this is only an issue in <= v13 must
be just a timing artifact; the problem is still there.
Of course, backends waiting till timeout is not a good behavior
independently of what is triggering that, so we have two problems to
solve here, not one. I have no ideas about the gss_init_sec_context()
failure, but I see a plausible theory about the failure to detect
socket closure in Microsoft's documentation about closesocket() [1]:
If the l_onoff member of the LINGER structure is zero on a stream
socket, the closesocket call will return immediately and does not
receive WSAEWOULDBLOCK whether the socket is blocking or
nonblocking. However, any data queued for transmission will be
sent, if possible, before the underlying socket is closed. This is
also called a graceful disconnect or close. In this case, the
Windows Sockets provider cannot release the socket and other
resources for an arbitrary period, thus affecting applications
that expect to use all available sockets. This is the default
behavior for a socket.
I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?). But this has the right odor for an explanation.
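(If someone with a suitable Windows environment wants to poke at that
theory, one crude experiment would be to make the client's close
abortive rather than graceful and see whether the backend then notices
the disconnect promptly. Untested sketch, not proposed for real code:)

    #include <winsock2.h>

    /*
     * Experiment only: an abortive (RST) close, so that Winsock cannot hold
     * the socket open to flush queued data as described in the passage
     * quoted above.  If the backend then sees the disconnect promptly, the
     * linger/graceful-close behavior is implicated.
     */
    static void
    abortive_close(SOCKET sock)
    {
        struct linger lin;

        lin.l_onoff = 1;        /* enable SO_LINGER ... */
        lin.l_linger = 0;       /* ... with zero timeout => hard close */
        (void) setsockopt(sock, SOL_SOCKET, SO_LINGER,
                          (const char *) &lin, sizeof(lin));
        (void) closesocket(sock);
    }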
I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually
had to be reverted. I think we need a deeper understanding of
exactly what Winsock is doing or not doing before we try to fix it.
I wonder if there are any Microsoft employees around here with
access to the relevant source code.
In the short run, it might be a good idea to deprecate using
--with-gssapi on Windows builds. A different stopgap idea
could be to drastically reduce the default AuthenticationTimeout,
perhaps only on Windows.
regards, tom lane
[1]: https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket
On Tue, May 14, 2024 at 8:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?). But this has the right odor for an explanation.
I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually
had to be reverted. I think we need a deeper understanding of
exactly what Winsock is doing or not doing before we try to fix it.
I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what
https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).
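(To spell out the race for anyone not steeped in Winsock, the per-wait
pattern is roughly the following; sketch only, error checking omitted.
The point is that an FD_CLOSE signaled against one short-lived event
handle is never re-delivered to the next one.)

    #include <winsock2.h>

    /*
     * Rough sketch of what each WaitLatchOrSocket() effectively does on
     * Windows today.  FD_CLOSE for a graceful shutdown is generated only
     * once: if it fires after we've stopped waiting on this handle,
     * associating the socket with the next wait's fresh handle does not
     * re-signal it, so the next WSAWaitForMultipleEvents() can sleep
     * forever.
     */
    static void
    wait_once_for_socket(SOCKET sock)
    {
        WSAEVENT    ev = WSACreateEvent();
        WSANETWORKEVENTS nev;

        WSAEventSelect(sock, ev, FD_READ | FD_CLOSE);
        WSAWaitForMultipleEvents(1, &ev, FALSE, 1000 /* ms */, FALSE);
        WSAEnumNetworkEvents(sock, ev, &nev);

        if (nev.lNetworkEvents & FD_CLOSE)
        {
            /* peer has shut down; caller should read until EOF */
        }

        WSAEventSelect(sock, ev, 0);    /* deassociate the socket */
        WSACloseEvent(ev);
    }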
13.05.2024 23:17, Tom Lane wrote:
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware. That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.
Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings
substantially differ from REL_13_STABLE-, say, for the check stage:
REL_14_STABLE: the oldest available test log (from 2021-10-30) shows
check (00:03:47) and the newest one (from 2024-05-12): check (00:03:18).
Whilst on REL_13_STABLE the oldest available log (from 2021-08-06) shows
check duration 00:03:00, then it decreased to 00:02:24 (2021-09-22)/
00:02:14 (2021-11-07), and now it's 1:40, as you said.
Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27
secs) and on REL_14_STABLE (33, 29, 29 secs).
14.05.2024 03:38, Thomas Munro wrote:
I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).
Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
state when that patch set can be applied).
An excerpt (all lines related to process 12500) from a failed run log
without the patch set:
2024-05-14 05:57:29.526 UTC [8228:128] DEBUG: forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG: connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG: server process (PID 12500) exited with exit code 1
# at the end of the test run
And an excerpt (all lines related to process 11736) from a successful run
log with the patch set applied:
2024-05-14 06:03:57.216 UTC [4524:130] DEBUG: forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG: connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG: exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG: forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG: server process (PID 11736) exited with exit code 0
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
13.05.2024 23:17, Tom Lane wrote:
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started. In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after. So it
looks like the animal was moved onto faster hardware.
Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings
substantially differ from REL_13_STABLE-, say, for the check stage:
As I mentioned in our off-list discussion, I have a lingering feeling
that this v14 commit could be affecting the results somehow:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400
Update the Winsock API version requested by libpq.
According to Microsoft's documentation, 2.2 has been the current
version since Windows 98 or so. Moreover, that's what the Postgres
backend has been requesting since 2004 (cf commit 4cdf51e64).
So there seems no reason for libpq to keep asking for 1.1.
I didn't believe at the time that that'd have any noticeable effect,
but maybe it somehow made Winsock play a bit nicer with the GSS
support?
regards, tom lane
14.05.2024 17:38, Tom Lane wrote:
As I mentioned in our off-list discussion, I have a lingering feeling
that this v14 commit could be affecting the results somehow:
Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400
Update the Winsock API version requested by libpq.
According to Microsoft's documentation, 2.2 has been the current
version since Windows 98 or so. Moreover, that's what the Postgres
backend has been requesting since 2004 (cf commit 4cdf51e64).
So there seems no reason for libpq to keep asking for 1.1.
I didn't believe at the time that that'd have any noticeable effect,
but maybe it somehow made Winsock play a bit nicer with the GSS
support?
Yes, probably, but maybe not nicer, as the test duration increased?
Still I can't see the difference locally to check that commit.
Will try other VMs/configurations, maybe I could find a missing factor...
Best regards,
Alexander
On Tue, May 14, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
14.05.2024 03:38, Thomas Munro wrote:
I was beginning to suspect that lingering odour myself. I haven't
looked at the GSS code, but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect. I wonder
if the GSS code is somehow more prone to brokenness. One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again. That's what https://commitfest.postgresql.org/46/3523/
is intended to fix. Does it help here? Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).
Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
state when that patch set can be applied).
Thanks for checking, and generally for your infinite patience with all
these horrible Windows problems.
OK, so we know what the problem is here. Here is the simplest
solution I know of for that problem. I have proposed this in the past
and received negative feedback because it's a really gross hack. But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event. Does this fix
it?
Attachments:
0001-Add-kludge-to-make-FD_CLOSE-level-triggered.patch (text/x-patch)
From cbe4680c3e561b26c0bb49fc39dc8a6f40e84134 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH] Add kludge to make FD_CLOSE level-triggered.
Winsock only signals an FD_CLOSE event once, if the other end of the
socket shuts down gracefully. Because each WaitLatchOrSocket() call
constructs and destroys a new event handle, we can miss the FD_CLOSE
event that is signaled just as we're destroying the handle. The next
WaitLatchOrSocket() will not see it.
Fix that race with some extra polling.
We wouldn't need this if we had long-lived event handles for sockets,
which has been proposed and tested and shown to work, but it's far too
complex to back-patch.
---
src/backend/storage/ipc/latch.c | 37 +++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..9a5273f17d6 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1999,6 +1999,43 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
cur_event->reset = false;
}
+ /*
+ * Because we associate the socket with different event handles at
+ * different times, and because FD_CLOSE is only generated once the
+ * other end closes gracefully, we might miss an FD_CLOSE event that
+ * was signaled already on a handle we've already closed. We close
+ * that race by synchronously polling for EOF, after adjusting the
+ * event above and before sleeping below.
+ *
+ * XXX If we arranged to have one event handle for the lifetime of a
+ * socket, we wouldn't need this.
+ */
+ if (cur_event->events & WL_SOCKET_READABLE)
+ {
+ char c;
+ WSABUF buf;
+ DWORD received;
+ DWORD flags;
+ int r;
+
+ buf.buf = &c;
+ buf.len = 1;
+
+ /*
+ * Peek to see if EOF condition is true. Don't worry about error
+ * handling or pending data, just be careful not to consume it.
+ */
+ flags = MSG_PEEK;
+ if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+ {
+ occurred_events->pos = cur_event->pos;
+ occurred_events->user_data = cur_event->user_data;
+ occurred_events->events = WL_SOCKET_READABLE;
+ occurred_events->fd = cur_event->fd;
+ return 1;
+ }
+ }
+
/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
--
2.44.0
15.05.2024 01:26, Thomas Munro wrote:
OK, so we know what the problem is here. Here is the simplest
solution I know of for that problem. I have proposed this in the past
and received negative feedback because it's a really gross hack. But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event. Does this fix
it?
Yes, I see that abandoned GSS connections are closed immediately, as
expected. I have also confirmed that `meson test` with the basic
configuration passes on REL_16_STABLE. So from the outside, the fix
looks good to me.
Thank you for working on this!
Best regards,
Alexander
On Wed, May 15, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
15.05.2024 01:26, Thomas Munro wrote:
OK, so we know what the problem is here. Here is the simplest
solution I know of for that problem. I have proposed this in the past
and received negative feedback because it's a really gross hack. But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event. Does this fix
it?
Yes, I see that abandoned GSS connections are closed immediately, as
expected. I have also confirmed that `meson test` with the basic
configuration passes on REL_16_STABLE. So from the outside, the fix
looks good to me.
Alright, unless anyone has an objection or ideas for improvements, I'm
going to go ahead and back-patch this slightly tidied up version
everywhere.
Attachments:
v2-0001-Add-kludge-to-make-FD_CLOSE-level-triggered.patch (application/octet-stream)
From 46cf55acc8a9f6b59fbac845339e87b8a9501956 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH v2] Add kludge to make FD_CLOSE level-triggered.
Winsock only signals an FD_CLOSE event once if the other end of the
socket shuts down gracefully. Because each WaitLatchOrSocket() call
constructs and destroys a new event handle every time, we can miss the
FD_CLOSE notification that is signaled after we've stopped waiting. The
next WaitLatchOrSocket()'s event handle will never see it.
Fix that race with some extra polling. It's not a beautiful code
change, but it seems to work, and is in the same spirit as the kludge
installed by commit f7819baa6.
We wouldn't need this if we had exactly one long-lived event handle per
socket, which has been proposed and tested and shown to work, but that's
too complex to back-patch.
It's plausible that commits 6051857fc and ed52c3707, later reverted by
29992a6a5, could be re-instated, with this fix in place.
Back-patch to all supported releases. This should hopefully clear up
build farm animal hamerkop's recent failures.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/176008.1715492071%40sss.pgh.pa.us
---
src/backend/storage/ipc/latch.c | 37 +++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..4e28410e4c8 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1999,6 +1999,43 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
cur_event->reset = false;
}
+ /*
+ * We associate the socket with a new event handle for each
+ * WaitEventSet. FD_CLOSE is only generated once if the other end
+ * closes gracefully. Therefore we might miss the FD_CLOSE
+ * notification, if it was delivered to another event after we stopped
+ * waiting for it. Close that race by polling for EOF after setting
+ * up this handle to receive notifications, and before entering the
+ * sleep.
+ *
+ * XXX If we had one event handle for the lifetime of a socket, we
+ * wouldn't need this.
+ */
+ if (cur_event->events & WL_SOCKET_READABLE)
+ {
+ char c;
+ WSABUF buf;
+ DWORD received;
+ DWORD flags;
+
+ /*
+ * Peek to see if EOF has been reached. Don't worry about error
+ * handling or pending data here, as those will cause a wakeup
+ * below and be discovered by a later non-peek recv() call.
+ */
+ buf.buf = &c;
+ buf.len = 1;
+ flags = MSG_PEEK;
+ if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+ {
+ occurred_events->pos = cur_event->pos;
+ occurred_events->user_data = cur_event->user_data;
+ occurred_events->events = WL_SOCKET_READABLE;
+ occurred_events->fd = cur_event->fd;
+ return 1;
+ }
+ }
+
/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
--
2.44.0
On Thu, May 16, 2024 at 9:46 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Alright, unless anyone has an objection or ideas for improvements, I'm
going to go ahead and back-patch this slightly tidied up version
everywhere.
Of course as soon as I wrote that I thought of a useful improvement
myself: as far as I can tell, you only need to do the extra poll on
the first wait for WL_SOCKET_READABLE for any given WaitEventSet. I
don't think it's needed for later waits done by long-lived
WaitEventSet objects, so we can track that with a flag. That avoids
adding new overhead for regular backend socket waits after
authentication, it's just in these code paths that do a bunch of
WaitLatchOrSocket() calls that we need to consider FD_CLOSE events
lost between the cracks.
I also don't know if the condition should include "&& received == 0".
It probably doesn't make much difference, but by leaving that off we
don't have to wonder how peeking interacts with events, ie if it's a
problem that we didn't do the "reset" step. Thinking about that, I
realised that I should probably set reset = true in this new return
path, just like the "normal" WL_SOCKET_READABLE notification path,
just to be paranoid. (Programming computers you don't have requires
extra paranoia.)
Any chance you could test this version please Alexander?
Attachments:
v3-0001-Fix-FD_CLOSE-socket-event-hangs-on-Windows.patch (application/octet-stream)
From c24fc3bc7309a22f6e07e6375abe584700df4552 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 May 2024 10:01:19 +1200
Subject: [PATCH v3] Fix FD_CLOSE socket event hangs on Windows.
Winsock only signals an FD_CLOSE event once if the other end of the
socket shuts down gracefully. Because each WaitLatchOrSocket() call
constructs and destroys a new event handle every time, we can miss the
FD_CLOSE notification that is signaled after we've stopped waiting. The
next WaitLatchOrSocket()'s event handle will never see it.
Fix that race with some extra polling. It's not a beautiful code
change, but it seems to work, and is in the same spirit as the kludge
installed by commit f7819baa6.
We wouldn't need this if we had exactly one long-lived event handle per
socket, which has been proposed and tested and shown to work, but that's
too complex to back-patch.
It's plausible that commits 6051857fc and ed52c3707, later reverted by
29992a6a5, could be re-instated, with this fix in place.
Back-patch to all supported releases. This should hopefully clear up
build farm animal hamerkop's recent failures.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/176008.1715492071%40sss.pgh.pa.us
---
src/backend/storage/ipc/latch.c | 43 +++++++++++++++++++++++++++++++++
src/include/storage/latch.h | 1 +
2 files changed, 44 insertions(+)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index a7d88ebb048..cad360c1ada 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1000,6 +1000,7 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
event->user_data = user_data;
#ifdef WIN32
event->reset = false;
+ event->peek = true;
#endif
if (events == WL_LATCH_SET)
@@ -1999,6 +2000,48 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
cur_event->reset = false;
}
+ /*
+ * We associate the socket with a new event handle for each
+ * WaitEventSet. FD_CLOSE is only generated once if the other end
+ * closes gracefully. Therefore we might miss the FD_CLOSE
+ * notification, if it was delivered to another event after we stopped
+ * waiting for it. Close that race by polling for a readable socket
+ * after setting up this handle to receive notifications, and before
+ * entering the sleep, on the first time we wait for this socket.
+ */
+ if ((cur_event->events & WL_SOCKET_READABLE) != 0 && cur_event->peek)
+ {
+ char c;
+ WSABUF buf;
+ DWORD received;
+ DWORD flags;
+
+ /*
+ * Don't do this next time, it's only needed once to smooth over
+ * the transition from one WaitEventSet's event handle to
+ * another's.
+ */
+ cur_event->peek = false;
+
+ /*
+ * Peek to see if EOF has been reached. Don't worry about errors,
+ * as those will cause a wakeup below and be discovered by a later
+ * non-peek recv() call.
+ */
+ buf.buf = &c;
+ buf.len = 1;
+ flags = MSG_PEEK;
+ if (WSARecv(cur_event->fd, &buf, 1, &received, &flags, NULL, NULL) == 0)
+ {
+ occurred_events->pos = cur_event->pos;
+ occurred_events->user_data = cur_event->user_data;
+ occurred_events->events = WL_SOCKET_READABLE;
+ occurred_events->fd = cur_event->fd;
+ cur_event->reset = true;
+ return 1;
+ }
+ }
+
/*
* Windows does not guarantee to log an FD_WRITE network event
* indicating that more data can be sent unless the previous send()
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7e194d536f0..fb5cb2864db 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -157,6 +157,7 @@ typedef struct WaitEvent
void *user_data; /* pointer provided in AddWaitEventToSet */
#ifdef WIN32
bool reset; /* Is reset of the event required? */
+ bool peek; /* Is peek required? */
#endif
} WaitEvent;
--
2.44.0
On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Any chance you could test this version please Alexander?
Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI
thing and is superficially better, but it doesn't handle code that
calls twice in a row and ignores the first result (I know that
PostgreSQL does that occasionally in a few places), and it's also
broken if someone gets recv() = 0 (EOF), and then decides to wait
anyway. The only ways I can think of to get full reliable poll()-like
semantics is to do that peek every time, OR the complicated patch
(per-socket-workspace + intercepting recv etc). So I'm back to v2.
Hello Thomas,
16.05.2024 04:32, Thomas Munro wrote:
On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Any chance you could test this version please Alexander?
Sorry, cancel that. v3 is not good. I assume it fixes the GSSAPI
thing and is superficially better, but it doesn't handle code that
calls twice in a row and ignores the first result (I know that
PostgreSQL does that occasionally in a few places), and it's also
broken if someone gets recv() = 0 (EOF), and then decides to wait
anyway. The only ways I can think of to get full reliable poll()-like
semantics is to do that peek every time, OR the complicated patch
(per-socket-workspace + intercepting recv etc). So I'm back to v2.
I've tested v2 and can confirm that it works as v1, `vcregress check`
passes with no failures on REL_16_STABLE, `meson test` with the basic
configuration too.
By the way, hamerkop is not configured to enable gssapi for HEAD [1] and
I could not enable gss locally yet (just passing extra_lib_dirs,
extra_include_dirs doesn't work for me).
It looks like we need to find a way to enable it for meson to continue
testing v17+ with GSS on Windows.
[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamerkop&dt=2024-05-12%2011%3A00%3A28&stg=configure
Best regards,
Alexander
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
Meanwhile, back at the ranch, it doesn't seem that changed anything:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD. All its other branches are equally stuck at the
end of March. So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
regards, tom lane
On 2024-05-16 Th 16:18, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
Meanwhile, back at the ranch, it doesn't seem that changed anything:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD. All its other branches are equally stuck at the
end of March. So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
Yeah, that should be doable. Since we have the git ref these days we
should be able to mark it as old, or maybe just reject builds for very
old commits (the latter would be easier).
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 16:18, Tom Lane wrote:
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
Yeah, that should be doable. Since we have the git ref these days we
should be able to mark it as old, or maybe just reject builds for very
old commits (the latter would be easier).
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
regards, tom lane
On 2024-05-16 Th 17:15, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 16:18, Tom Lane wrote:
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
Yeah, that should be doable. Since we have the git ref these days we
should be able to mark it as old, or maybe just reject builds for very
old commits (the latter would be easier).
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
Fair enough. That will mean some database changes and other stuff, so it
will take a bit longer.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 17:15, Tom Lane wrote:
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
Fair enough. That will mean some database changes and other stuff, so it
will take a bit longer.
Sure, I don't think it's urgent.
regards, tom lane
Hello,
I'm a hamerkop maintainer.
Sorry I have missed the scm error for so long.
Today I switched scmrepo from git.postgresql.org/git/postgresql.git
to github.com/postgres/postgres.git and successfully modernized
the build target code.
with best regards, Haruka Takatsuka
On Thu, 16 May 2024 16:18:23 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
For citext_utf8, I pushed cff4e5a3. Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.
Meanwhile, back at the ranch, it doesn't seem that changed anything:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD. All its other branches are equally stuck at the
end of March. So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)
Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits? This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.
regards, tom lane
TAKATSUKA Haruka <harukat@sraoss.co.jp> writes:
I'm a hamerkop maintainer.
Sorry I have missed the scm error for so long.
Today I switched scmrepo from git.postgresql.org/git/postgresql.git
to github.com/postgres/postgres.git and successfully modernized
the build target code.
Thanks very much! I see hamerkop has gone green in HEAD.
It looks like it succeeded in v13 too but failed in v12,
which suggests that the isolationcheck problem is intermittent,
which is not too surprising given our current theory about
what's causing that.
At this point I think we are too close to the 17beta1 release
freeze to mess with it, but I'd support pushing Thomas'
proposed patch after the freeze is over.
regards, tom lane
On 2024-05-16 Th 17:34, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2024-05-16 Th 17:15, Tom Lane wrote:
I'd rather have some visible status on the BF dashboard. Invariably,
with a problem like this, the animal's owner is unaware there's a
problem. If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).
Fair enough. That will mean some database changes and other stuff, so it
will take a bit longer.
Sure, I don't think it's urgent.
I've pushed a small change, that should just mark with an asterisk any
gitref that is more than 2 days older than the tip of the branch at the
time of reporting.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
I've pushed a small change, that should just mark with an asterisk any
gitref that is more than 2 days older than the tip of the branch at the
time of reporting.
Thanks!
regards, tom lane
On Fri, May 17, 2024 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I've tested v2 and can confirm that it works as v1, `vcregress check`
passes with no failures on REL_16_STABLE, `meson test` with the basic
configuration too.
Pushed, including back-branches.
This is all not very nice code and I hope we can delete it all some
day. Ideas include: (1) Thinking small: change over to the
WAIT_USE_POLL implementation of latch.c on this OS (Windows has poll()
these days), using a socket pair for latch wakeup (i.e. give up trying
to multiplex with native Windows event handles, even though they are a
great fit for our latch abstraction, as the sockets are too different
from Unix). (2) Thinking big: use native completion-based
asynchronous socket APIs, as part of a much larger cross-platform AIO
socket reengineering project that would deliver higher performance
networking on all OSes. The thought of (2) puts me off investing time
into (1), but on the other hand it would be nice if Windows could
almost completely share code with some Unixen. I may be more inclined
to actually try it if/when we can rip out the fake signal support,
because it is tangled up with this stuff and does not spark joy.
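(For anyone wondering what the WAIT_USE_POLL shape in idea (1) would
look like, the Unix version of the idea is roughly this; sketch only,
hand-waving over the detail that Windows would need a loopback socket
pair instead of a pipe:)

    #include <poll.h>
    #include <unistd.h>

    /*
     * Minimal sketch of the self-pipe latch idea: one fd for wakeups plus
     * the client socket, multiplexed with plain poll().  No per-wait
     * Windows event handles, so no handle-lifetime races.  Assumes the
     * wakeup fd is non-blocking.
     */
    static void
    wait_for_latch_or_socket(int latch_read_fd, int client_sock, int timeout_ms)
    {
        struct pollfd fds[2];

        fds[0].fd = latch_read_fd;  /* read end of the wakeup pipe/socket pair */
        fds[0].events = POLLIN;
        fds[1].fd = client_sock;
        fds[1].events = POLLIN;

        if (poll(fds, 2, timeout_ms) > 0)
        {
            if (fds[0].revents & POLLIN)
            {
                char        buf[16];

                /* latch was set: drain the wakeup bytes and return */
                while (read(latch_read_fd, buf, sizeof(buf)) > 0)
                    ;
            }
            if (fds[1].revents & (POLLIN | POLLHUP | POLLERR))
            {
                /* socket is readable, at EOF, or in error: caller recv()s */
            }
        }
    }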
Thomas Munro wrote 2024-05-12 06:31:
Hamerkop is already green on the 15 and 16 branches, apparently
because it's using the pre-meson test stuff that I guess just didn't
run the relevant test. In other words, nobody would notice the
difference anyway, and a master-only fix would be enough to end this
44-day red streak.
Sorry for necroposting, but in our automated testing system we have
found some failures of this test. The most recent one was a couple of
days ago (see attached files) on PostgreSQL 15.7. Also I've reported
this bug some time ago [1], but provided an example only for
PostgreSQL 17. Back then the bug was actually found on 15 or 16
branches (no logs remain from a couple of months back), but I wanted
to show that it was reproducible on 17.
I would appreciate it if you would backpatch this change to 15 and 16
branches.
[1]: /messages/by-id/6885a0b52d06f7e5910d2b6276bbb4e8@postgrespro.ru
Oleg Tselebrovskiy, Postgres Pro
Attachments:
regression.diffs (text/x-diff)
diff -w -U3 C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/expected/citext_utf8.out C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/results/citext_utf8.out
--- C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/expected/citext_utf8.out 2024-07-29 13:53:45.259126600 +0300
+++ C:/gr-builds/TaKFe3FF/2/pgpro-dev/postgrespro/contrib/citext/results/citext_utf8.out 2024-07-29 14:43:38.772857200 +0300
@@ -54,7 +54,7 @@
SELECT 'i'::citext = 'İ'::citext AS t;
t
---
- t
+ f
(1 row)
-- Regression.
On Fri, Aug 2, 2024 at 1:37 AM Oleg Tselebrovskiy
<o.tselebrovskiy@postgrespro.ru> wrote:
I would appreciate if you would backpatch this change to 15 and 16
branches.
Done (e52a44b8, 91f498fd).
Any elucidation on how and why Windows machines have started using
UTF-8 would be welcome.
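(The minimal probe, for anyone who has access to one of the machines in
question; sketch:)

    #include <windows.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * GetACP() reports the process's active code page.  65001 (CP_UTF8)
         * would mean the system setting or an application manifest has opted
         * this process into the UTF-8 ACP, which could in turn steer initdb
         * towards a UTF-8 encoding and locale by default.
         */
        UINT        acp = GetACP();

        printf("ACP = %u%s\n", acp, acp == CP_UTF8 ? " (UTF-8)" : "");
        return 0;
    }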
On Aug 1, 2024, at 18:54, Thomas Munro <thomas.munro@gmail.com> wrote:
Done (e52a44b8, 91f498fd).
Any elucidation on how and why Windows machines have started using
UTF-8 would be welcome.
Haven’t been following this thread, but this post reminded me of an issue I saw with locales on Windows[1]. Could it be that the introduction of Universal CRT[2] in Windows 10 has improved UTF-8 support?
Bit of a wild guess, but I assume worth bringing up at least.
D
[1]: https://github.com/shogo82148/actions-setup-perl/issues/1713
[2]: https://learn.microsoft.com/en-us/cpp/porting/upgrade-your-code-to-the-universal-crt?view=msvc-170
On Sat, Aug 3, 2024 at 2:11 AM David E. Wheeler <david@justatheory.com> wrote:
Haven’t been following this thread, but this post reminded me of an issue I saw with locales on Windows[1]. Could it be that the introduction of Universal CRT[2] in Windows 10 has improved UTF-8 support?
Yeah. We have a few places that claim that Windows APIs can't do
UTF-8 and they have to do extra wchar_t conversions, but that doesn't
seem to be true on modern Windows. Example:
I suspect that, at least when the locale name is "en-US.UTF-8", the
the regular POSIXoid strcoll_l() function should just work™ and we
could delete all that stuff and save Windows users a lot of wasted CPU
cycles.
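(Roughly what I have in mind, completely untested, using the CRT's
underscore-prefixed spellings and assuming a UCRT new enough to accept
the ".UTF-8" suffix:)

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Sketch: open an "en-US.UTF-8" CRT locale and compare two UTF-8
         * strings directly with _strcoll_l(), no wchar_t detour.  If this
         * gives sane answers on a modern system, the extra conversions could
         * perhaps be skipped there.
         */
        _locale_t   loc = _create_locale(LC_COLLATE, "en-US.UTF-8");

        if (loc == NULL)
        {
            printf("locale not supported\n");
            return 1;
        }
        printf("_strcoll_l -> %d\n",
               _strcoll_l("caf\xc3\xa9", "cafe", loc));     /* UTF-8 bytes */
        _free_locale(loc);
        return 0;
    }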