pgsql: Add kqueue(2) support to the WaitEventSet API.
Add kqueue(2) support to the WaitEventSet API.
Use kevent(2) to wait for events on the BSD family of operating
systems and macOS. This is similar to the epoll(2) support added
for Linux by commit 98a64d0bd.
Author: Thomas Munro
Reviewed-by: Andres Freund, Marko Tiikkaja, Tom Lane
Tested-by: Mateusz Guzik, Matteo Beccati, Keith Fiske, Heikki Linnakangas, Michael Paquier, Peter Eisentraut, Rui DeSousa, Tom Lane, Mark Wong
Discussion: /messages/by-id/CAEepm=37oF84-iXDTQ9MrGjENwVGds+5zTr38ca73kWR7ez_tA@mail.gmail.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/815c2f0972c8386aba7c606f1ee6690d13b04db2
Modified Files
--------------
configure | 4 +-
configure.in | 2 +
src/backend/storage/ipc/latch.c | 300 +++++++++++++++++++++++++++++++++++++++-
src/include/pg_config.h.in | 6 +
src/tools/msvc/Solution.pm | 2 +
5 files changed, 311 insertions(+), 3 deletions(-)
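For readers unfamiliar with the API, the pattern latch.c wraps is the kernel readiness-notification loop: create one long-lived kernel object (an epoll fd on Linux, a kqueue fd on BSD/macOS after this commit), register the fds of interest, and then wait. The following standalone sketch shows the epoll variant from commit 98a64d0bd, which the new kevent() path mirrors; it is a simplified illustration, not PostgreSQL source (kevent() plays the combined role of epoll_ctl() and epoll_wait() on BSD).

```c
/* Minimal sketch of the epoll-style wait that WaitEventSet wraps on
 * Linux; the kqueue path does the same job with kqueue()/kevent().
 * Hypothetical simplified example, not PostgreSQL code. */
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until fd is readable; returns 1 when ready, 0 on timeout,
 * -1 on error.  Note that epoll_create1() itself consumes an fd and
 * can fail with EMFILE -- the failure mode discussed downthread. */
int wait_readable(int fd, int timeout_ms)
{
    int epfd = epoll_create1(0);        /* one kernel fd per wait set */
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
    {
        close(epfd);
        return -1;
    }

    struct epoll_event out;
    int rc = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);                        /* a real WaitEventSet keeps it */
    return rc;
}
```

The real WaitEventSet keeps the epoll/kqueue fd for its whole lifetime, which is exactly why each set permanently occupies one file descriptor.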
Hi Thomas,
On Wed, Feb 05, 2020 at 04:59:10AM +0000, Thomas Munro wrote:
Add kqueue(2) support to the WaitEventSet API.
Use kevent(2) to wait for events on the BSD family of operating
systems and macOS. This is similar to the epoll(2) support added
for Linux by commit 98a64d0bd.
Worth noting this issue with the test suite of postgres_fdw for
buildfarm animal coypu, running on NetBSD:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2020-02-19%2023%3A01%3A01
+ERROR: kqueue failed: Too many open files
--
Michael
On Thu, Feb 20, 2020 at 8:24 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Feb 05, 2020 at 04:59:10AM +0000, Thomas Munro wrote:
Add kqueue(2) support to the WaitEventSet API.
Use kevent(2) to wait for events on the BSD family of operating
systems and macOS. This is similar to the epoll(2) support added
for Linux by commit 98a64d0bd.
Worth noting this issue with the test suite of postgres_fdw for
buildfarm animal coypu, running on NetBSD:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2020-02-19%2023%3A01%3A01
+ERROR: kqueue failed: Too many open files
Hmm. So coypu just came back after 48 days, and the new kqueue() code
fails for process 19829 after successfully running 265 log lines'
worth of postgres_fdw tests, because it's run out of file
descriptors. I can see that WaitLatchOrSocket() actually could leak
an epoll/kqueue socket if WaitEventSetWait() raises an error, which is
interesting, but apparently not the explanation here because we don't
see a preceding error report. Another theory would be that this
machine has a low max_safe_fds, and NUM_RESERVED_FDS is only just
enough to handle the various sockets that postgres_fdw.sql creates and
at some point kqueue()'s demand for just one more pushed it over the
edge. From the error text and a look at the man page for errno, this
error is EMFILE (per process limit, which could be as low as 64)
rather than ENFILE (system limit).
Remi, any chance you could run gmake installcheck under
contrib/postgres_fdw on that host, to see if this is repeatable? Can
you tell us about the relevant limits? Maybe ulimit -n (for the user
that runs the build farm), and also sysctl -a | grep descriptors,
sysctl -a | grep maxfiles?
On 20 Feb 2020, at 12:15, Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Feb 20, 2020 at 8:24 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Feb 05, 2020 at 04:59:10AM +0000, Thomas Munro wrote:
Add kqueue(2) support to the WaitEventSet API.
Use kevent(2) to wait for events on the BSD family of operating
systems and macOS. This is similar to the epoll(2) support added
for Linux by commit 98a64d0bd.
Worth noting this issue with the test suite of postgres_fdw for
buildfarm animal coypu, running on NetBSD:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2020-02-19%2023%3A01%3A01
+ERROR: kqueue failed: Too many open files
Hmm. So coypu just came back after 48 days, and the new kqueue() code
fails for process 19829 after successfully running 265 log lines'
worth of postgres_fdw tests, because it's run out of file
descriptors. I can see that WaitLatchOrSocket() actually could leak
an epoll/kqueue socket if WaitEventSetWait() raises an error, which is
interesting, but apparently not the explanation here because we don't
see a preceding error report. Another theory would be that this
machine has a low max_safe_fds, and NUM_RESERVED_FDS is only just
enough to handle the various sockets that postgres_fdw.sql creates and
at some point kqueue()'s demand for just one more pushed it over the
edge. From the error text and a look at the man page for errno, this
error is EMFILE (per process limit, which could be as low as 64)
rather than ENFILE (system limit).
Remi, any chance you could run gmake installcheck under
contrib/postgres_fdw on that host, to see if this is repeatable? Can
you tell us about the relevant limits? Maybe ulimit -n (for the user
that runs the build farm), and also sysctl -a | grep descriptors,
sysctl -a | grep maxfiles?
Hi,
Unfortunately, coypu went offline again. I will run tests as soon as I can bring it back up.
Rémi
Rémi Zara <remi_zara@mac.com> writes:
On 20 Feb 2020, at 12:15, Thomas Munro <thomas.munro@gmail.com> wrote:
Remi, any chance you could run gmake installcheck under
contrib/postgres_fdw on that host, to see if this is repeatable? Can
you tell us about the relevant limits? Maybe ulimit -n (for the user
that runs the build farm), and also sysctl -a | grep descriptors,
sysctl -a | grep maxfiles?
Unfortunately, coypu went offline again. I will run tests as soon as I can bring it back up.
I have a working NetBSD 8/ppc installation, will try to reproduce there.
regards, tom lane
[ redirecting to -hackers ]
I wrote:
Rémi Zara <remi_zara@mac.com> writes:
On 20 Feb 2020, at 12:15, Thomas Munro <thomas.munro@gmail.com> wrote:
Remi, any chance you could run gmake installcheck under
contrib/postgres_fdw on that host, to see if this is repeatable? Can
you tell us about the relevant limits? Maybe ulimit -n (for the user
that runs the build farm), and also sysctl -a | grep descriptors,
sysctl -a | grep maxfiles?
I have a working NetBSD 8/ppc installation, will try to reproduce there.
Yup, it reproduces fine here. I see
$ ulimit -a
...
nofiles (-n descriptors) 128
which squares with the sysctl values:
proc.curproc.rlimit.descriptors.soft = 128
proc.curproc.rlimit.descriptors.hard = 1772
kern.maxfiles = 1772
and also with set_max_safe_fds' results:
2020-02-20 14:29:38.610 EST [2218] DEBUG: max_safe_fds = 115, usable_fds = 125, already_open = 3
It seems fairly obvious now that I look at it, but: the epoll and kqueue
variants of CreateWaitEventSet are both *fundamentally* unsafe, because
they assume that they can always get a FD when they want one, which is
not a property that we generally want backend code to have. The only
reason we've not seen this before with epoll is a lack of testing
under lots-of-FDs stress.
The fact that they'll likely leak those FDs on subsequent failures is
another not-very-nice property.
I think we ought to redesign this so that those FDs are handed out by
fd.c, which can ReleaseLruFile() and retry if it gets EMFILE or ENFILE.
fd.c could also be responsible for the resource tracking needed to
prevent leakage.
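The release-and-retry idea can be sketched as follows. This is a hypothetical illustration of the pattern, not fd.c code: release_lru_fd() is a stand-in stub for fd.c's ReleaseLruFile(), and open("/dev/null") stands in for the epoll_create1()/kqueue() call that can hit EMFILE.

```c
/* Sketch: acquire a kernel fd, releasing an LRU-cached virtual fd and
 * retrying when the process fd table is full.  Stand-in names only. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int release_lru_fd(void)
{
    /* In fd.c this would close the least-recently-used VFD and report
     * whether one was available to close; here, nothing is cached. */
    return 0;
}

int acquire_fd_with_retry(void)
{
    for (;;)
    {
        int fd = open("/dev/null", O_RDONLY);   /* proxy for kqueue() */
        if (fd >= 0)
            return fd;
        if ((errno == EMFILE || errno == ENFILE) && release_lru_fd())
            continue;       /* freed one cached fd, try again */
        return -1;          /* genuinely out of descriptors */
    }
}
```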
regards, tom lane
I wrote:
It seems fairly obvious now that I look at it, but: the epoll and kqueue
variants of CreateWaitEventSet are both *fundamentally* unsafe, because
they assume that they can always get a FD when they want one, which is
not a property that we generally want backend code to have. The only
reason we've not seen this before with epoll is a lack of testing
under lots-of-FDs stress.
The fact that they'll likely leak those FDs on subsequent failures is
another not-very-nice property.
Hmmm ... actually, there's a third problem, which is the implicit
assumption that we can have as many concurrently-active WaitEventSets
as we like. We can't, if they depend on FDs --- that's a precious
resource. It doesn't look like we actually ever have more than about
two per process right now, but I'm concerned about what will happen
as the usage of the feature increases.
regards, tom lane
On Fri, Feb 21, 2020 at 8:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
It seems fairly obvious now that I look at it, but: the epoll and kqueue
variants of CreateWaitEventSet are both *fundamentally* unsafe, because
they assume that they can always get a FD when they want one, which is
not a property that we generally want backend code to have. The only
reason we've not seen this before with epoll is a lack of testing
under lots-of-FDs stress.
The fact that they'll likely leak those FDs on subsequent failures is
another not-very-nice property.
Hmmm ... actually, there's a third problem, which is the implicit
assumption that we can have as many concurrently-active WaitEventSets
as we like. We can't, if they depend on FDs --- that's a precious
resource. It doesn't look like we actually ever have more than about
two per process right now, but I'm concerned about what will happen
as the usage of the feature increases.
One thing I've been planning to do for 13 is to get rid of all the
temporary create/destroy WaitEventSets from the main backend loops.
My goal was cutting down on stupid system calls, but this puts a new
spin on it. I have a patch set to do a bunch of that[1], but now I'm
thinking that perhaps I need to be even more aggressive about it and
set up the 'common' long lived WES up front at backend startup, rather
than doing it on demand, so that there is no chance of failure due to
lack of fds once you've started up. I also recently figured out how
to handle some more places with the common WES. I'll post a new patch
set over on that thread shortly.
That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
= 128 system, though, it might just mean that it's postgres_fdw's
socket() call that hits EMFILE rather than WES's kqueue() call while
running that test. I suppose there are two kinds of system: those
where ulimit -n is higher than max_files_per_process (defaults, on
Linux: 1024 vs 1000) so you have more allowance for sockets and the
like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
the only thing ensuring we have some spare fds. I don't know the
history but it looks like NUM_RESERVED_FDS was deliberately or
accidentally tuned to be just enough to be able to run the regression
tests (the interesting ones being the ones that use sockets, like
postgres_fdw.sql), but with a new long lived kqueue() fd in the
picture, it might have to be increased to cover it, no?
About the potential for leaks, Horiguchi-san realised this hazard and
posted a patch[2] to allow WaitEventSets to be cleaned up by the
resource owner machinery. That's useful for the temporary
WaitEventSet objects that we'd genuinely need in the patch set that's
part of: that's for creating a query-lifetime WES to manage N
connections to remote shards, and it needs to be cleaned up on
failure. For the temporary ones created by WaitLatch(), I suspect
they don't really belong in a resource owner: instead we should get
rid of it using my WaitMyLatch() patch set, and if there are any
places where we can't for some reason (I hope not), perhaps a
try/catch block should be used to fix that.
[1]: /messages/by-id/CA+hUKGJAC4Oqao=qforhNey20J8CiG2R=oBPqvfR0vOJrFysGw@mail.gmail.com
[2]: /messages/by-id/20191206.171211.1119526746053895900.horikyota.ntt@gmail.com
Thomas Munro <thomas.munro@gmail.com> writes:
One thing I've been planning to do for 13 is to get rid of all the
temporary create/destroy WaitEventSets from the main backend loops.
My goal was cutting down on stupid system calls, but this puts a new
spin on it. I have a patch set to do a bunch of that[1], but now I'm
thinking that perhaps I need to be even more aggressive about it and
set up the 'common' long lived WES up front at backend startup, rather
than doing it on demand, so that there is no chance of failure due to
lack of fds once you've started up.
+1
That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
= 128 system, though, it might just mean that it's postgres_fdw's
socket() call that hits EMFILE rather than WES's kqueue() call while
running that test.
Good point.
I suppose there are two kinds of system: those
where ulimit -n is higher than max_files_per_process (defaults, on
Linux: 1024 vs 1000) so you have more allowance for sockets and the
like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
the only thing ensuring we have some spare fds. I don't know the
history but it looks like NUM_RESERVED_FDS was deliberately or
accidentally tuned to be just enough to be able to run the regression
tests (the interesting ones being the ones that use sockets, like
postgres_fdw.sql), but with a new long lived kqueue() fd in the
picture, it might have to be increased to cover it, no?
No. NUM_RESERVED_FDS was set decades ago, long before any of those tests
existed, and it has never been changed AFAIK. It is a bit striking that
we just started seeing it be insufficient with this patch. Maybe that's
just happenstance, but I wonder whether there is a plain old FD leak
involved in addition to the design issue? I'll take a closer look at
exactly what's open when we hit the error.
The point about possibly hitting EMFILE in libpq's socket() call is
an interesting one. libpq of course can't do anything to recover
from that (and even if it could, there are lower levels such as a
possible DNS lookup that we're not going to be able to modify).
I'm speculating about having postgres_fdw ask fd.c to forcibly
free one LRU file before it attempts to open a new libpq connection.
That would prevent EMFILE (process-level exhaustion) and it would
also provide some small protection against ENFILE (system-wide
exhaustion), though of course there's no guarantee that someone
else doesn't snap up the FD you so graciously relinquished.
regards, tom lane
I wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
... like coypu, where NUM_RESERVED_FDS is
the only thing ensuring we have some spare fds. I don't know the
history but it looks like NUM_RESERVED_FDS was deliberately or
accidentally tuned to be just enough to be able to run the regression
tests (the interesting ones being the ones that use sockets, like
postgres_fdw.sql), but with a new long lived kqueue() fd in the
picture, it might have to be increased to cover it, no?
No. NUM_RESERVED_FDS was set decades ago, long before any of those tests
existed, and it has never been changed AFAIK. It is a bit striking that
we just started seeing it be insufficient with this patch. Maybe that's
just happenstance, but I wonder whether there is a plain old FD leak
involved in addition to the design issue? I'll take a closer look at
exactly what's open when we hit the error.
Hmm ... looks like I'm wrong: we've been skating way too close to the edge
for awhile. Here's a breakdown of the open FDs in the backend at the time
of the failure, excluding stdin/stdout/stderr (which set_max_safe_fds
accounted for) and FDs pointing to database files:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
postmaste 2657 postgres 3r FIFO 0,8 0t0 20902158 pipe postmaster_alive_fds[0]
postmaste 2657 postgres 4u REG 0,9 0 3877 [eventpoll] FeBeWaitSet's epoll_fd
postmaste 2657 postgres 7u unix 0xffff880878e21880 0t0 20902664 socket socket for a PGconn owned by postgres_fdw
postmaste 2657 postgres 8u IPv6 20902171 0t0 UDP localhost:40795->localhost:40795 pgStatSock
postmaste 2657 postgres 9u unix 0xffff880876903c00 0t0 20902605 /tmp/.s.PGSQL.5432 MyProcPort->sock
postmaste 2657 postgres 10r FIFO 0,8 0t0 20902606 pipe selfpipe_readfd
postmaste 2657 postgres 11w FIFO 0,8 0t0 20902606 pipe selfpipe_writefd
postmaste 2657 postgres 105u unix 0xffff880878e21180 0t0 20902647 socket socket for a PGconn owned by postgres_fdw
postmaste 2657 postgres 118u unix 0xffff8804772c88c0 0t0 20902650 socket socket for a PGconn owned by postgres_fdw
One thing to notice is that there are only nine, though NUM_RESERVED_FDS
should have allowed ten. That's because there are 116 open FDs pointing
at database files, though max_safe_fds is 115. I'm not sure whether
that's OK or an off-by-one error in fd.c's accounting. One of the 116
is pointing at a WAL segment, and I think we might not be sending that
through the normal VFD path, so it might be "expected".
But anyway, what this shows is that over time we've eaten enough of
the "reserved" FDs that only three are available for random usage like
postgres_fdw's, if the process's back is against the wall FD-wise.
The postgres_fdw regression test is using all three, meaning there's
exactly no daylight in that test.
Clearly, we gotta do something about that too. Maybe bumping up
NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
accounting for permanently-eaten FDs would be a better idea. And
in any case we can't suppose that there's a fixed upper limit on
the number of postgres_fdw connections. I'm liking the idea I floated
earlier of letting postgres_fdw forcibly close the oldest LRU entry.
BTW, you don't need anything very exotic to provoke this error.
The results above are from a Linux box; I just did "ulimit -n 128"
before starting the postmaster.
regards, tom lane
I wrote:
Clearly, we gotta do something about that too. Maybe bumping up
NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
accounting for permanently-eaten FDs would be a better idea. And
in any case we can't suppose that there's a fixed upper limit on
the number of postgres_fdw connections. I'm liking the idea I floated
earlier of letting postgres_fdw forcibly close the oldest LRU entry.
A late-night glimmer of an idea: suppose we make fd.c keep count,
not just of the open FDs under its control, but also of open FDs
not under its control. The latter count would include the initial
FDs (stdin/stdout/stderr), and FDs allocated by OpenTransientFile
et al, and we could provide functions for other callers to report
that they just allocated or released a FD. So postgres_fdw could
report, when it opens or closes a PGconn, that the count of external
FDs should be increased or decreased. fd.c would then forcibly
close VFDs as needed to keep NUM_RESERVED_FDS worth of slop. We
still need slop, to provide some daylight for code that isn't aware
of this mechanism. But we could certainly get all these known
long-lived FDs to be counted, and that would allow fd.c to reduce
the number of open VFDs enough to ensure that some slop remains.
This idea doesn't fix every possible problem. For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.
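The accounting idea reduces to a pair of counters. The sketch below is an illustration under the thread's assumptions, not fd.c source: the names and the max_safe_fds / 3 cap follow the discussion, but the stub values (e.g. max_safe_fds = 115, from coypu's DEBUG output) are fixed here only to make the example self-contained.

```c
/* Sketch of external-FD accounting: fd.c counts descriptors it does
 * not manage itself, and refuses new ones once they would eat into
 * the reserved slop.  Illustration only. */
#include <assert.h>

static int max_safe_fds = 115;   /* as computed by set_max_safe_fds */
static int numExternalFDs = 0;

/* Returns 1 if the caller may open one more non-VFD descriptor. */
int reserve_external_fd(void)
{
    if (numExternalFDs < max_safe_fds / 3)
    {
        numExternalFDs++;
        return 1;
    }
    return 0;   /* caller should fail cleanly or close something first */
}

/* Callers report each such descriptor they close. */
void release_external_fd(void)
{
    if (numExternalFDs > 0)
        numExternalFDs--;
}
```

A caller like postgres_fdw would bracket each PGconn's lifetime with a reserve/release pair, so fd.c always knows how much slop actually remains.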
regards, tom lane
I wrote:
Clearly, we gotta do something about that too. Maybe bumping up
NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
accounting for permanently-eaten FDs would be a better idea. And
in any case we can't suppose that there's a fixed upper limit on
the number of postgres_fdw connections. I'm liking the idea I floated
earlier of letting postgres_fdw forcibly close the oldest LRU entry.
A late-night glimmer of an idea: suppose we make fd.c keep count,
not just of the open FDs under its control, but also of open FDs
not under its control.
Here's a draft patch that does it like that. There are undoubtedly
more places that need to be taught to report their FD consumption;
one obvious candidate that I didn't touch is dblink. But this is
enough to account for all the long-lived "extra" FDs that are currently
open in a default build, and it passes check-world even with ulimit -n
set to the minimum that set_max_safe_fds will allow.
One thing I noticed is that if you open enough postgres_fdw connections
to cause a failure, that tends to come out as an unintelligible-to-
the-layman "epoll_create1 failed: Too many open files" error. That's
because after postgres_fdw has consumed the last available "external
FD" slot, it tries to use CreateWaitEventSet to wait for input from
the remote server, and now that needs to get another external FD.
I left that alone for the moment, because if you do rejigger the
WaitEventSet code to avoid dynamically creating/destroying epoll sockets,
it will stop being a problem. If that doesn't happen, another
possibility is to reclassify CreateWaitEventSet as a caller that should
ignore "failure" from ReserveExternalFD --- but that would open us up
to problems if a lot of WaitEventSets get created, so it's not a great
answer. It'd be okay perhaps if we added a distinction between
long-lived and short-lived WaitEventSets (ignoring "failure" only for
the latter). But I didn't want to go there in this patch.
Thoughts?
regards, tom lane
Attachments:
account-honestly-for-external-FD-usage-1.patch (text/x-diff; +234 -21)
I wrote:
Here's a draft patch that does it like that.
On reflection, trying to make ReserveExternalFD serve two different
use-cases was pretty messy. Here's a version that splits it into two
functions. I also took the trouble to fix dblink.
regards, tom lane
Attachments:
account-honestly-for-external-FD-usage-2.patch (text/x-diff; +284 -21)
On Mon, Feb 24, 2020 at 12:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
On reflection, trying to make ReserveExternalFD serve two different
use-cases was pretty messy. Here's a version that splits it into two
functions. I also took the trouble to fix dblink.
+ /*
+ * We don't want more than max_safe_fds / 3 FDs to be consumed for
+ * "external" FDs.
+ */
+ if (numExternalFDs < max_safe_fds / 3)
This looks pretty reasonable to me.
I'll have a new patch set to create a common WES at startup over on
that other thread in a day or two.
On Mon, Feb 24, 2020 at 7:42 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Feb 24, 2020 at 12:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
On reflection, trying to make ReserveExternalFD serve two different
use-cases was pretty messy. Here's a version that splits it into two
functions. I also took the trouble to fix dblink.
+ /*
+ * We don't want more than max_safe_fds / 3 FDs to be consumed for
+ * "external" FDs.
+ */
+ if (numExternalFDs < max_safe_fds / 3)
I suppose there may be users who have set ulimit -n high enough to
support an FDW workload that connects to very many hosts, who will now
need to set max_files_per_process higher to avoid the new error now
that we're doing this accounting. That doesn't seem to be a problem
in itself, but I wonder if the error message should make it clearer
that it's our limit they hit here.
Thomas Munro <thomas.munro@gmail.com> writes:
I suppose there may be users who have set ulimit -n high enough to
support an FDW workload that connects to very many hosts, who will now
need to set max_files_per_process higher to avoid the new error now
that we're doing this accounting. That doesn't seem to be a problem
in itself, but I wonder if the error message should make it clearer
that it's our limit they hit here.
I struggled with the wording of that message, actually. The patch
proposes
+ ereport(ERROR,
+ (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+ errmsg("could not connect to server \"%s\"",
+ server->servername),
+ errdetail("There are too many open files.")));
I wanted to say "The server has too many open files." but in context
it would seem to be talking about the remote server, so I'm not sure
how to fix that.
We could also consider a HINT, along the lines of "Raise the server's
max_files_per_process and/or \"ulimit -n\" limits." This again has
the ambiguity about which server, and it also seems dangerously
platform-specific.
regards, tom lane
On Feb 20, 2020, at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This idea doesn't fix every possible problem. For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.
I understand that you were using plperlu just as an example, but it got me thinking. Could we ship a wrapper using perl's tie() mechanism to call a new spi function to report when a file handle is opened and when it is closed? Most plperlu functions would not participate, since developers will not necessarily know to use the wrapper, but at least they could learn about the wrapper and use it as a work-around if this becomes a problem for them. Perhaps the same spi function could be used by other procedural languages.
I can't see this solution working unless the backend can cleanup properly under exceptional conditions, and decrement the counter of used file handles appropriately. But that's the same requirement that postgres_fdw would also have, right? Would the same mechanism work for both?
I imagine something like <PgPerluSafe>::IO::File and <PgPerluSafe>::File::Temp which could be thin wrappers around IO::File and File::Temp that automatically do the tie()ing for you. (Replace <PgPerluSafe> with whichever name seems best.)
Is this too convoluted to be worth the bother?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-02-24 10:29:51 -0800, Mark Dilger wrote:
On Feb 20, 2020, at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This idea doesn't fix every possible problem. For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.
I understand that you were using plperlu just as an example, but it
got me thinking. Could we ship a wrapper using perl's tie() mechanism
to call a new spi function to report when a file handle is opened and
when it is closed? Most plperlu functions would not participate,
since developers will not necessarily know to use the wrapper, but at
least they could learn about the wrapper and use it as a work-around
if this becomes a problem for them. Perhaps the same spi function
could be used by other procedural languages.
While we're thinking a bit outside of the box: We could just dup() a
bunch of fds within fd.c to protect fd.c's fd "quota". And then close
them when actually needing the fds.
Not really suggesting that we go for that, but it does have some appeal.
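For concreteness, the dup() trick could look like this. A hypothetical sketch, not a proposal for fd.c: park a few duplicates of an already-open descriptor to hold quota, and surrender one when a real allocation needs the slot.

```c
/* Sketch of reserving fd quota via dup(): the parked duplicates keep
 * the slots occupied until some caller actually needs one. */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define RESERVE_COUNT 4
static int reserve_fds[RESERVE_COUNT];
static int reserve_used = 0;

/* Fill the reserve by dup()ing an open fd (e.g. one on /dev/null). */
int fill_reserve(int base_fd)
{
    for (reserve_used = 0; reserve_used < RESERVE_COUNT; reserve_used++)
    {
        reserve_fds[reserve_used] = dup(base_fd);
        if (reserve_fds[reserve_used] < 0)
            return -1;      /* couldn't even park the reserve */
    }
    return 0;
}

/* Give one parked fd back to the kernel; returns 1 if one was freed. */
int surrender_reserve_fd(void)
{
    if (reserve_used == 0)
        return 0;
    close(reserve_fds[--reserve_used]);
    return 1;
}
```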
I can't see this solution working unless the backend can cleanup properly under exceptional conditions, and decrement the counter of used file handles appropriately. But that's the same requirement that postgres_fdw would also have, right? Would the same mechanism work for both?
We can just throw an error, and all fdw state should get cleaned up. We
can't generally rely on that for plperl, as IIRC it can have global state. So
I don't think they're in the same boat.
Greetings,
Andres Freund
Mark Dilger <mark.dilger@enterprisedb.com> writes:
On Feb 20, 2020, at 8:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This idea doesn't fix every possible problem. For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.
I understand that you were using plperlu just as an example, but it got
me thinking. Could we ship a wrapper using perl's tie() mechanism to
call a new spi function to report when a file handle is opened and when
it is closed? Most plperlu functions would not participate, since
developers will not necessarily know to use the wrapper, but at least
they could learn about the wrapper and use it as a work-around if this
becomes a problem for them. Perhaps the same spi function could be used
by other procedural languages.
Hmm. I had thought briefly about getting plperl to do that automatically
and had concluded that I didn't see a way (though there might be one;
I'm not much of a Perl expert). But if we accept that changes in the
plperl function's source code might be needed, it gets a lot easier,
for sure.
Anyway, the point of the current patch is to provide the mechanism and
use it in a couple of places where we know there's an issue. Improving
the PLs is something that could be added later.
I can't see this solution working unless the backend can cleanup
properly under exceptional conditions, and decrement the counter of used
file handles appropriately. But that's the same requirement that
postgres_fdw would also have, right? Would the same mechanism work for
both?
The hard part is to tie into whatever is responsible for closing the
kernel FD. If you can ensure that the FD gets closed, you can put
the ReleaseExternalFD() call at the same place(s).
Is this too convoluted to be worth the bother?
So far we've not seen field reports of PL functions running out of FDs;
and there's always the ad-hoc solution of making sure the server's
ulimit -n limit is sufficiently larger than max_files_per_process.
So I wouldn't put a lot of effort into it right now. But it's nice
to have an idea about what to do if it does become a hot issue for
somebody.
regards, tom lane
On 2020-Feb-24, Tom Lane wrote:
We could also consider a HINT, along the lines of "Raise the server's
max_files_per_process and/or \"ulimit -n\" limits." This again has
the ambiguity about which server, and it also seems dangerously
platform-specific.
Maybe talk about "the local server" to distinguish from the other one.
As for the platform dependencies, I see two main options: make the hint
platform-specific (i.e have #ifdef branches per platform) or make it
even more generic, such as "file descriptor limits".
A quick search suggests that current Windows itself doesn't typically
have such problems:
https://stackoverflow.com/questions/31108693/increasing-no-of-file-handles-in-windows-7-64-bit
https://docs.microsoft.com/es-es/archive/blogs/markrussinovich/pushing-the-limits-of-windows-handles
But the C runtime does:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/file-handling?view=vs-2019
I suppose we do use the C runtime. There's a reference to
_setmaxstdio() being able to raise the limit from the default of 512 to
up to 8192 open files. We don't currently invoke that.
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services