Connection limits/permissions, slotsync workers, etc

Started by Tom Lane · over 1 year ago · 6 messages · hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

In connection with the discussion at [1], I started to look at
exactly which server processes ought to be subject to connection
limits (datconnlimit, ACL_CONNECT, and related checks). The
current situation seems to be an inconsistent mess.

Looking at InitPostgres, InitializeSessionUserId, and CheckMyDatabase,
we have several different permissions and resource-limiting checks
that may be enforced against an incoming process:
ReservedConnections/SuperuserReservedConnections
rolcanlogin
rolconnlimit
datallowconn
datconnlimit
database's ACL_CONNECT privilege

I want to argue that ReservedConnections, rolconnlimit, and
datconnlimit should only be applied to regular backends. It
makes no particular sense to enforce them against autovac workers,
background workers, or wal senders, because each of those types of
processes has its own resource-limiting PGPROC pool. It's especially
bad to enforce them against parallel workers, since that creates
edge-case failures that make the use of parallelism not transparent
to applications. We hacked around that at 73c9f91a1, but I think
that should be reverted in favor of not applying the check at all.

I further argue that rolcanlogin, datallowconn, and ACL_CONNECT
should not be checked in a parallel worker, again primarily on the
grounds that this creates parallelism-specific failures (cf [1]).
The two scenarios where this occurs are (1) permissions were
revoked since the leader process connected, or (2) leader is
currently running as a role that wouldn't have permission to
connect on its own. We don't attempt to kick out the leader
process when either of those things happen, so why should we
prevent it from using parallelism?

The current situation for each of these checks is:

ReservedConnections is enforced if !am_superuser && !am_walsender,
so it is enforced against non-superuser background workers,
which is silly because BG workers have their own PGPROC pool;
moreover, what's the argument for letting walsenders but not other
kinds of background processes escape this? I propose changing it to
apply only to regular backends.

rolcanlogin is enforced if IsUnderPostmaster and we reach
InitializeSessionUserId, which basically reduces to regular backends,
parallel workers, logrep workers, and walsenders. Seems reasonable
for logrep workers and walsenders which represent fresh logins, but
not for parallel workers. I propose fixing this by making
ParallelWorkerMain pass BGWORKER_BYPASS_ROLELOGINCHECK.

rolconnlimit is enforced if IsUnderPostmaster and we reach
InitializeSessionUserId and it's not a superuser. So that applies
to non-superuser parallel workers, logrep workers, and walsenders,
and I don't think it's reasonable to apply it to any of them since
those all come out of other PGPROC pools. I propose switching that
to apply only to regular backends.

BTW, I kind of wonder why rolconnlimit is ineffectual for superusers,
especially when rolcanlogin does apply to them. Not a bug exactly,
but it sure seems inconsistent. If you've taken the trouble to set it
you'd expect it to work. Shall we take out the !is_superuser check?

datallowconn is enforced against all non-standalone, non-AV-worker
processes that connect to a specific database, except bgworkers that
pass BGWORKER_BYPASS_ALLOWCONN (nothing in core except test module).
So again that includes parallel workers, logrep workers, and
walsenders. Again this seems reasonable for logrep workers and
walsenders but not for parallel workers. I propose fixing this by
making ParallelWorkerMain pass BGWORKER_BYPASS_ALLOWCONN.

datconnlimit is enforced against all non-superuser processes,
including per-DB walsenders and BG workers (see above).
This is fairly dubious given that they have their own PGPROC pools.
I propose switching that to apply only to regular backends, too.

ACL_CONNECT is enforced against all non-superuser processes,
including per-DB walsenders and BG workers (includes
parallel workers, subscription apply workers, logrep workers).
Perhaps that's OK for most, but I argue not for parallel workers;
maybe skip it if BGWORKER_BYPASS_ALLOWCONN?

Also, the enforcement of datconnlimit and rolconnlimit is inconsistent
in another way: our counting of the pre-existing processes is pretty
random. CountDBConnections is not consistent with either the current
set of processes that datconnlimit is enforced against, or my proposal
to enforce it only against regular backends. It counts anything that
is not AmBackgroundWorkerProcess, including AV workers and per-DB
walsenders. I think it should count only regular backends, because
anything else leads to weird inconsistencies in whether a rejection
occurs.

The same applies to CountUserBackends (used for rolconnlimit check).
I argue these two functions should count only regular backends,
and the enforcement should likewise be only against regular backends.

Another recently-created problem is that the "slotsync worker"
process type we added in v17 hasn't been considered in any of this.
In particular, unlike every other process type that can obtain
a PGPROC, we did not consider it in the MaxBackends calculation:

    /* the extra unit accounts for the autovacuum launcher */
    MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
        max_worker_processes + max_wal_senders;

This means AFAICS that it effectively competes for one of the
MaxConnections PGPROC slots, meaning that it could fail for lack of a
slot or could lock out a client that should have been able to connect.
Shouldn't we have added another dedicated PGPROC slot for it?
(proc.c would require some additional work to make that happen.)
I wonder if the AV launcher and slotsync worker could be reclassified
as "auxiliary processes" instead of being their own weird animal.

regards, tom lane

[1]: /messages/by-id/8befc845430ba1ae3748af900af298788e579c89.camel@cybertec.at

#2 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Tom Lane (#1)
RE: Connection limits/permissions, slotsync workers, etc

On Thursday, December 26, 2024 3:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hi,

> In connection with the discussion at [1], I started to look at exactly which
> server processes ought to be subject to connection limits (datconnlimit,
> ACL_CONNECT, and related checks). The current situation seems to be an
> inconsistent mess.
>
> [...]
>
> The same applies to CountUserBackends (used for rolconnlimit check).
> I argue these two functions should count only regular backends, and the
> enforcement should likewise be only against regular backends.
Personally, I find these proposals to be reasonable.

> Another recently-created problem is that the "slotsync worker"
> process type we added in v17 hasn't been considered in any of this.
> In particular, unlike every other process type that can obtain a PGPROC,
> we did not consider it in the MaxBackends calculation:
>
>     /* the extra unit accounts for the autovacuum launcher */
>     MaxBackends = MaxConnections + autovacuum_max_workers + 1 +
>         max_worker_processes + max_wal_senders;
>
> This means AFAICS that it effectively competes for one of the MaxConnections
> PGPROC slots, meaning that it could fail for lack of a slot or could lock out
> a client that should have been able to connect.
> Shouldn't we have added another dedicated PGPROC slot for it?
> (proc.c would require some additional work to make that happen.)

Thanks for reporting the issue! I can reproduce it: for example, the slotsync
worker cannot start when user backends have reached the max_connections limit.
I agree that it would be better to add another dedicated PGPROC slot for it,
and have prepared the attached patch to address this.

> I wonder if the AV launcher and slotsync worker could be reclassified as
> "auxiliary processes" instead of being their own weird animal.

It appears that the current aux processes do not run transactions, as stated
in the comments[1], so we may need to somehow relax this restriction to
achieve the goal.

[1] /messages/by-id/8befc845430ba1ae3748af900af298788e579c89.camel@cybertec.at

Best Regards,
Hou zj

Attachments:

v1-0001-Reserve-a-dedicated-PGPROC-slot-for-slotsync-work.patch (+27, -8)
#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zhijie Hou (Fujitsu) (#2)
Re: Connection limits/permissions, slotsync workers, etc

"Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com> writes:
> On Thursday, December 26, 2024 3:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if the AV launcher and slotsync worker could be reclassified as
>> "auxiliary processes" instead of being their own weird animal.
> It appears that the current aux processes do not run transactions, as stated
> in the comments[1], so we may need to somehow relax this restriction to
> achieve the goal.

Ah, right, I'd forgotten about that restriction. I agree that
removing it wouldn't be very reasonable. However, I still would
rather avoid making the slotsync worker be its very own special
snowflake, because that offers no support for the next person
who wants to invent a new sort of specialized transaction-capable
process.

Attached is an alternative proposal that groups the autovac launcher
and slotsync worker into a new category of "special workers" (better
name welcome). I chose to put them into the existing autovacFreeProcs
freelist, partly because the autovac launcher lives there already
but mostly because I don't want to add another freelist in a patch
we need to put into v17. (As written, your patch is an ABI break.
It'd probably be safe to add a new freelist at the end of the struct
in v17, but I'm a little shy about that in view of recent bugs. In
any case, a freelist having at most two members seems rather silly.)

I was amused but not terribly surprised to notice that the comments
in InitProcGlobal were *already* out of date, in that they didn't
account for the walsender PGPROC pool. We have a remarkably bad
track record for updating comments that are more than about two
lines away from the code they describe :-(

Thoughts?

regards, tom lane

Attachments:

v2-reserve-a-PGPROC-for-slotsync-worker.patch (+44, -24)
#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: Connection limits/permissions, slotsync workers, etc

Also, here's a patch for the rest of what I was talking about.

We'll need to back-patch this given that the CVE-2024-10978 changes
caused these sorts of problems in all branches, but I've not yet
attempted to back-patch. It looks like it might be a bit painful
thanks to past code churn in these areas.

I didn't do anything about the idea of making rolconnlimit applicable
to superusers. If we do that at all, it should only be in HEAD.
Also, I got a shade less enthusiastic about it after noting that this
logic is parallel to that for datconnlimit, and it does seem sensible
to allow superusers to ignore datconnlimit. Maybe it's fine for the
two limits to operate differently, but I'm unsure.

Also, it probably would make sense to rename PGPROC.isBackgroundWorker
to isRegularBackend (inverting the sense of the boolean), but that
doesn't seem like back-patch material either, so I didn't include it
here. I think we can get away with a subtle adjustment of which
processes that flag is set for in the back branches, but not with
renaming it.

regards, tom lane

Attachments:

v1-0001-Exclude-parallel-workers-from-connection-privileg.patch (+37, -49)
#5 Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Tom Lane (#3)
RE: Connection limits/permissions, slotsync workers, etc

On Saturday, December 28, 2024 1:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com> writes:
>> On Thursday, December 26, 2024 3:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder if the AV launcher and slotsync worker could be reclassified
>>> as "auxiliary processes" instead of being their own weird animal.
>> It appears that the current aux processes do not run transactions, as
>> stated in the comments[1], so we may need to somehow relax this
>> restriction to achieve the goal.
>
> Ah, right, I'd forgotten about that restriction. I agree that removing it
> wouldn't be very reasonable. However, I still would rather avoid making the
> slotsync worker be its very own special snowflake, because that offers no
> support for the next person who wants to invent a new sort of specialized
> transaction-capable process.
>
> Attached is an alternative proposal that groups the autovac launcher and
> slotsync worker into a new category of "special workers" (better name
> welcome). I chose to put them into the existing autovacFreeProcs freelist,
> partly because the autovac launcher lives there already but mostly because
> I don't want to add another freelist in a patch we need to put into v17.
> (As written, your patch is an ABI break.)

Right, thanks for pointing it out. The new version patch looks good to me.

Best Regards,
Hou zj

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zhijie Hou (Fujitsu) (#5)
Re: Connection limits/permissions, slotsync workers, etc

"Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com> writes:
> On Saturday, December 28, 2024 1:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Attached is an alternative proposal that groups the autovac launcher and
>> slotsync worker into a new category of "special workers" (better name
>> welcome). I chose to put them into the existing autovacFreeProcs freelist,
>> partly because the autovac launcher lives there already but mostly because
>> I don't want to add another freelist in a patch we need to put into v17.
>> (As written, your patch is an ABI break.)
> Right, thanks for pointing it out. The new version patch looks good to me.

Pushed, thanks for looking at it.

regards, tom lane