Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs
Hello,
we run multiple versions of PostgreSQL in production. Some time ago we added
new physical servers and decided to go with the latest GA release from the
pgdg APT repository, that is, PostgreSQL 16.
We encounter slow `GRANT ROLE` statements only on the PostgreSQL 16 instances,
taking up to 42 seconds in production, with the client's PostgreSQL backend
process using 100% of a CPU. This was a surprise compared to our other
instances running older PostgreSQL releases. In production we have a *LOT* of
ROLEs, which is unfortunately a case we did not test before switching the new
servers into production.
The Application & ROLEs
-----------------------
Our application makes use of ROLEs. We create group ROLEs for each tenant of
our application; these ROLEs are named with a `d_` or `a_` prefix.
A special ROLE, called `acc`, is a member of each of these `d_` and `a_`
ROLEs.
The application has a concept of a "session", which it maintains itself and
which I think is outside the scope of this e-mail. In relation to PostgreSQL,
the application creates a PostgreSQL ROLE that corresponds to its own
(application) session. It names these ROLEs with an `s_` prefix; they are
CREATEd and GRANTed their permissions on every application "session".
When an application "session" starts, a user with the `acc` ROLE grants
membership in a `d_` ROLE to the `s_` ROLE (i.e. GRANT ROLE `d_xxxx` TO `s_xxxx`;)
To make this clear with an example: say role `d_202402` already exists, and
the application creates a new ROLE `s_0000001` corresponding to an application
"session". The application, connected as the special ROLE `acc`, would then
GRANT the ROLE `d_202402` to `s_0000001`, like so:
GRANT d_202402 TO s_0000001;
In production we have around 13 thousand of each of these ROLEs:
$ sudo -u postgres psql -p 5531
psql (16.2 (Debian 16.2-1.pgdg120+2))
Type "help" for help.
postgres=# select count(*) s_roles_count from pg_catalog.pg_authid
where rolname like 's_%';
s_roles_count
---------------
13299
(1 row)
postgres=# select count(*) a_roles_count from pg_catalog.pg_authid
where rolname like 'a_%';
a_roles_count
---------------
12776
(1 row)
postgres=# select count(*) d_roles_count from pg_catalog.pg_authid
where rolname like 'd_%';
d_roles_count
---------------
13984
(1 row)
The Setup
---------
Investigating this slow `GRANT ROLE`, we started a VM running Debian 11
and created a lot of roles.
create the special `acc` role and write it to a file:
$ echo -e "CREATE ROLE acc WITH LOGIN NOSUPERUSER INHERIT CREATEDB
CREATEROLE NOREPLICATION;\n\n" > create_acc.sql
create a lot of `a_` roles and make sure `acc` is a member of each one of them:
$ for idx1 in $(seq -w 1 100); do
    for idx2 in $(seq -w 1 12); do
      for idx3 in $(seq -w 1 10); do
        echo "CREATE ROLE a_${idx1}${idx2}${idx3} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN;"
        echo "GRANT a_${idx1}${idx2}${idx3} TO acc WITH ADMIN OPTION;"
      done
    done
  done > create_a.sql
create a lot of `d_` roles and make sure `acc` is a member of each one of them:
$ for idx1 in $(seq -w 1 100); do
    for idx2 in $(seq -w 1 12); do
      for idx3 in $(seq -w 1 10); do
        echo "CREATE ROLE d_${idx1}${idx2}${idx3} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN;"
        echo "GRANT d_${idx1}${idx2}${idx3} TO acc WITH ADMIN OPTION;"
      done
    done
  done > create_d.sql
create a lot of `s_` roles:
$ for idx1 in $(seq -w 1 100); do
    for idx2 in $(seq -w 1 12); do
      for idx3 in $(seq -w 1 10); do
        echo "CREATE ROLE s_${idx1}${idx2}${idx3} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN;"
      done
    done
  done > create_s.sql
merge ROLE creation into one file:
$ cat create_acc.sql create_a.sql create_d.sql create_s.sql >
/tmp/create-roles.sql
PostgreSQL 16
-------------
Install PostgreSQL 16:
--
$ sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt
$(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
$ sudo apt install gnupg2
$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc
| sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get -y install postgresql-16 postgresql-client-16
Create PostgreSQL 16 instance:
--
$ sudo pg_dropcluster --stop 16 main # drop default Debian cluster
$ sudo pg_createcluster 16 pg16
$ echo "local all acc trust" | sudo tee
/etc/postgresql/16/pg16/pg_hba.conf
$ echo "local all postgres peer" | sudo tee -a
/etc/postgresql/16/pg16/pg_hba.conf
$ sudo systemctl start postgresql@16-pg16.service
Import lots of roles:
--
$ sudo -u postgres /usr/lib/postgresql/16/bin/psql -f
/tmp/create-roles.sql -p 5432 -d postgres
Using ROLE `acc`, grant `d_` ROLE to a session ROLE:
--
$ time sudo -u postgres /usr/lib/postgresql/16/bin/psql -U acc
postgres -c 'GRANT d_0010109 TO s_0010109;'
GRANT ROLE
real 0m7.579s
user 0m0.054s
sys 0m0.020s
This is the surprising behavior on PostgreSQL 16. It seems there is new logic
in PostgreSQL 16 that checks against each role, and it takes 100% of a CPU.
At this point we know `acc` is just another ROLE that happens to be a member
of the `d_0010109` group ROLE with the ADMIN OPTION.
But what happens when `acc` is a SUPERUSER?
Alter role `acc` as SUPERUSER:
--
$ sudo -u postgres /usr/lib/postgresql/16/bin/psql -c 'ALTER ROLE acc
WITH SUPERUSER'
ALTER ROLE
This is a workaround to make GRANT ROLE bearable.
Using ROLE `acc`, grant `d_` ROLE to a session ROLE, again:
--
$ time sudo -u postgres /usr/lib/postgresql/16/bin/psql -U acc
postgres -c 'GRANT d_0010108 TO s_0010108;'
GRANT ROLE
real 0m0.079s
user 0m0.054s
sys 0m0.019s
OK this is fast.
But what happens when `acc` goes back to not being a SUPERUSER?
Alter role `acc` to stop being SUPERUSER:
--
$ sudo -u postgres psql -c 'ALTER ROLE acc WITH NOSUPERUSER'
ALTER ROLE
Using ROLE `acc`, grant `d_` ROLE to a session ROLE with `acc` not a SUPERUSER:
--
$ time sudo -u postgres /usr/lib/postgresql/16/bin/psql -U acc
postgres -c 'GRANT d_0010107 TO s_0010107;'
GRANT ROLE
real 0m7.741s
user 0m0.055s
sys 0m0.021s
As expected, slow `GRANT ROLE` again.
At this point, we tried PostgreSQL 15, just to make sure that this
behavior is new to PostgreSQL 16.
$ sudo systemctl stop postgresql@16-pg16
PostgreSQL 15
-------------
Install PostgreSQL 15:
--
$ sudo apt-get update
$ sudo apt-get -y install postgresql-15 postgresql-client-15
$ sudo pg_dropcluster --stop 15 main # drop default Debian cluster
$ sudo pg_createcluster 15 pg15
$ echo "local all acc trust" | sudo tee
/etc/postgresql/15/pg15/pg_hba.conf
$ echo "local all postgres peer" | sudo tee -a
/etc/postgresql/15/pg15/pg_hba.conf
$ sudo systemctl start postgresql@15-pg15.service
Import lots of roles:
--
$ sudo -u postgres /usr/lib/postgresql/15/bin/psql -f
/tmp/create-roles.sql -p 5433 -d postgres
Using ROLE `acc`, grant `d_` ROLE to a session ROLE:
--
$ time sudo -u postgres /usr/lib/postgresql/15/bin/psql -U acc -p 5433
postgres -c 'GRANT d_0010109 TO s_0010109;'
GRANT ROLE
real 0m0.077s
user 0m0.054s
sys 0m0.017s
Seems OK with the same number of ROLEs. The `acc` ROLE is not a SUPERUSER here.
Alter role `acc` as SUPERUSER:
--
$ sudo -u postgres /usr/lib/postgresql/15/bin/psql -p 5433 -c 'ALTER
ROLE acc WITH SUPERUSER'
ALTER ROLE
Using ROLE `acc`, grant `d_` ROLE to a session ROLE, again:
--
$ time sudo -u postgres /usr/lib/postgresql/15/bin/psql -p 5433 -U acc
postgres -c 'GRANT d_0010108 TO s_0010108;'
GRANT ROLE
real 0m0.084s
user 0m0.057s
sys 0m0.021s
It doesn't matter; GRANT ROLE is still just as fast.
Alter role `acc` to stop being a SUPERUSER:
--
$ sudo -u postgres /usr/lib/postgresql/15/bin/psql -p 5433 -c 'ALTER
ROLE acc WITH NOSUPERUSER'
ALTER ROLE
Using ROLE `acc`, grant `d_` ROLE to a session ROLE with `acc` not a SUPERUSER:
--
$ time sudo -u postgres /usr/lib/postgresql/15/bin/psql -p 5433 -U acc
postgres -c 'GRANT d_0010107 TO s_0010107;'
GRANT ROLE
real 0m0.077s
user 0m0.054s
sys 0m0.017s
Again, it doesn't matter; GRANT ROLE is still just as fast.
Looking At The Source Code
--------------------------
Looking at the git diff of `REL_15_6` against `REL_16_0`, it seems the
`roles_is_member_of` function, called by `check_role_membership_authorization`
(new in PostgreSQL 16), is expensive for our use case.
REL_16_0: src/backend/commands/user.c:1562
---8<------
(errcode(ERRCODE_INVALID_GRANT_OPERATION),
errmsg("column names cannot be included in GRANT/REVOKE ROLE")));
roleid = get_role_oid(rolename, false);
check_role_membership_authorization(currentUserId,
roleid, stmt->is_grant);
if (stmt->is_grant)
--->8------
While I can see the value in the improvements to how ROLEs are handled from
PostgreSQL 16 onward, I'm curious what would help setups like ours that have
thousands of ROLEs, short of patching the source code?
On Thu, Mar 21, 2024 at 8:10 AM alex work <alexwork033@gmail.com> wrote:
We encounter slow `GRANT ROLE` only on PostgreSQL 16 instances, up to 42
seconds in production; the client process at PostgreSQL would use 100% of
the CPU. [...]
Using ROLE `acc`, grant `d_` ROLE to a session ROLE:
real 0m7.579s [...]
PostgreSQL 15
Using ROLE `acc`, grant `d_` ROLE to a session ROLE:
real 0m0.077s
Ouch, that's a ~100x regression. Thanks for the write-up.
We don't have as many ROLEs, but we do have plenty, so this is worrying,
on top of the v16 ROLE changes breaking our ROLE logic, which was fine on
prior releases (v12-v15).
We've paused our planned v16 upgrade for now, until we have more time to
adapt.
Like you, I welcome the changes, but they are turning out to be expensive
to adapt to. And your report certainly makes me wonder whether we should
hold off until that perf regression is addressed.
Thanks, --DD
[ redirecting to -hackers ]
alex work <alexwork033@gmail.com> writes:
We encounter slow `GRANT ROLE` only on PostgreSQL 16 instances, up to 42 seconds
in production; the client process at PostgreSQL would use 100% of the CPU.
Which is a surprise compared to other instances running older PostgreSQL
releases. On production we have a *LOT* of ROLEs, which is unfortunately a case
that we did not test before switching the new servers into production mode.
I poked into this a bit. It seems the problem is that as of v16, we
try to search for the "best" role membership path from the current
user to the target role, and that's done in a very brute-force way,
as a side effect of computing the set of *all* role memberships the
current role has. In the given case, we could have skipped all that
if we simply tested whether the current role is directly a member
of the target: it is, so there can't be any shorter path. But in
any case roles_is_member_of has horrid performance when the current
role is a member of a lot of roles.
It looks like part of the blame might be ascribable to catcache.c,
as if you look at the problem microscopically you find that
roles_is_member_of is causing catcache to make a ton of AUTHMEMMEMROLE
catcache lists, and SearchSysCacheList is just iterating linearly
through the cache's list-of-lists, so that search is where the O(N^2)
time is actually getting taken. Up to now that code has assumed that
any one catcache would not have very many catcache lists. Maybe it's
time to make that smarter; but since we've gotten away with this
implementation for decades, I can't help feeling that the real issue
is with roles_is_member_of's usage pattern.
For self-containedness, attached is a directly usable shell script
to reproduce the problem. The complaint is that the last GRANT
takes multiple seconds (about 5s on my machine), rather than
milliseconds.
regards, tom lane
Attachments:
grant_with_many_roles.sh (text/x-shellscript)
I wrote:
I poked into this a bit. It seems the problem is that as of v16, we
try to search for the "best" role membership path from the current
user to the target role, and that's done in a very brute-force way,
as a side effect of computing the set of *all* role memberships the
current role has.
Actually, roles_is_member_of sucks before v16 too; the new thing
is only that it's being invoked during GRANT ROLE. Using the
roles created by the given test case, I see in v15:
$ psql
psql (15.6)
Type "help" for help.
regression=# drop table at;
DROP TABLE
regression=# set role a_0010308;
SET
regression=> create table at(f1 int);
CREATE TABLE
regression=> \timing
Timing is on.
regression=> set role acc;
SET
Time: 0.493 ms
regression=> insert into at values(1);
INSERT 0 1
Time: 3565.029 ms (00:03.565)
regression=> insert into at values(1);
INSERT 0 1
Time: 2.308 ms
So it takes ~3.5s to populate the roles_is_member_of cache for "acc"
given this membership set. This is actually considerably worse than
in v16 or HEAD, where the same test takes about 1.6s for me.
Apparently the OP has designed their use-case so that they dodge these
implementation problems in v15 and earlier, but that's a far cry from
saying that there were no problems with lots-o-roles before v16.
regards, tom lane
Tom Lane:
Actually, roles_is_member_of sucks before v16 too; the new thing
is only that it's being invoked during GRANT ROLE. Using the
roles created by the given test case, I see in v15:[...]
So it takes ~3.5s to populate the roles_is_member_of cache for "acc"
given this membership set. This is actually considerably worse than
in v16 or HEAD, where the same test takes about 1.6s for me.
Ah, this reminds me that I hit the same problem about a year ago, but
haven't had the time to put together a test case yet. In my case, it's
like this:
- I have one role "authenticator" with which the application (PostgREST)
connects to the database.
- This role has been granted all of the actual user roles and will then
do a SET ROLE for each authenticated request it handles.
- In my case that's currently about 120k roles granted to
"authenticator", back then it was probably around 60k.
- The first request (SET ROLE) for each session took between 5 and 10
*minutes* to succeed - subsequent requests were instant.
- When the "authenticator" role is made SUPERUSER, the first request is
instant, too.
I guess this matches exactly what you are observing.
There is one more thing that is actually even worse, though: when you
try to cancel the query or terminate the backend while the SET ROLE is
still running, it will not work. It not only fails to cancel the query,
but somehow leaves the process for that backend in some kind of zombie
state that is impossible to recover from.
All of this was v15.
Best,
Wolfgang
I wrote:
It looks like part of the blame might be ascribable to catcache.c,
as if you look at the problem microscopically you find that
roles_is_member_of is causing catcache to make a ton of AUTHMEMMEMROLE
catcache lists, and SearchSysCacheList is just iterating linearly
through the cache's list-of-lists, so that search is where the O(N^2)
time is actually getting taken. Up to now that code has assumed that
any one catcache would not have very many catcache lists. Maybe it's
time to make that smarter; but since we've gotten away with this
implementation for decades, I can't help feeling that the real issue
is with roles_is_member_of's usage pattern.
I wrote a quick finger exercise to make catcache.c use a hash table
instead of a single list for CatCLists, modeling it closely on the
existing hash logic for simple catcache entries. This helps a good
deal, but I still see the problematic GRANT taking ~250ms, compared
to 5ms in v15. roles_is_member_of is clearly on the hook for that.
regards, tom lane
Attachments:
v1-use-hash-table-for-CatCLists.patch (text/x-diff)
I wrote:
... I still see the problematic GRANT taking ~250ms, compared
to 5ms in v15. roles_is_member_of is clearly on the hook for that.
Ah: looks like that is mainly the fault of the list_append_unique_oid
calls in roles_is_member_of. That's also an O(N^2) cost of course,
though with a much smaller constant factor.
I don't think we have any really cheap way to de-duplicate the role
OIDs, especially seeing that it has to be done on-the-fly within the
collection loop, and the order of roles_list is at least potentially
interesting. Not sure how to make further progress without a lot of
work.
regards, tom lane
On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
I wrote:
... I still see the problematic GRANT taking ~250ms, compared
to 5ms in v15. roles_is_member_of is clearly on the hook for that.

Ah: looks like that is mainly the fault of the list_append_unique_oid
calls in roles_is_member_of. That's also an O(N^2) cost of course,
though with a much smaller constant factor.

I don't think we have any really cheap way to de-duplicate the role
OIDs, especially seeing that it has to be done on-the-fly within the
collection loop, and the order of roles_list is at least potentially
interesting. Not sure how to make further progress without a lot of
work.
Assuming these are larger lists, this might benefit from optimizations
involving SIMD intrinsics. I looked into that a while ago [0], but the
effort was abandoned because we didn't have any concrete use-cases at the
time. (I'm looking into some additional optimizations in a separate thread
[1] that would likely apply here, too.)
[0]: /messages/by-id/20230308002502.GA3378449@nathanxps13
[1]: /messages/by-id/20240321183823.GA1800896@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Mar 21, 2024 at 03:40:12PM -0500, Nathan Bossart wrote:
On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
I don't think we have any really cheap way to de-duplicate the role
OIDs, especially seeing that it has to be done on-the-fly within the
collection loop, and the order of roles_list is at least potentially
interesting. Not sure how to make further progress without a lot of
work.

Assuming these are larger lists, this might benefit from optimizations
involving SIMD intrinsics. I looked into that a while ago [0], but the
effort was abandoned because we didn't have any concrete use-cases at the
time. (I'm looking into some additional optimizations in a separate thread
[1] that would likely apply here, too.)
Never mind. With the reproduction script, I'm only seeing a ~2%
improvement with my patches.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Thu, Mar 21, 2024 at 03:40:12PM -0500, Nathan Bossart wrote:
On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
I don't think we have any really cheap way to de-duplicate the role
OIDs, especially seeing that it has to be done on-the-fly within the
collection loop, and the order of roles_list is at least potentially
interesting. Not sure how to make further progress without a lot of
work.
Assuming these are larger lists, this might benefit from optimizations
involving SIMD intrinsics.
Never mind. With the reproduction script, I'm only seeing a ~2%
improvement with my patches.
Yeah, you cannot beat an O(N^2) problem by throwing SIMD at it.
However ... I just remembered that we have a Bloom filter implementation
in core now (src/backend/lib/bloomfilter.c). How about using that
to quickly reject (hopefully) most role OIDs, and only do the
list_member_oid check if the filter passes?
regards, tom lane
On Thu, Mar 21, 2024 at 08:59:54PM -0400, Tom Lane wrote:
However ... I just remembered that we have a Bloom filter implementation
in core now (src/backend/lib/bloomfilter.c). How about using that
to quickly reject (hopefully) most role OIDs, and only do the
list_member_oid check if the filter passes?
Seems worth a try. I've been looking for an excuse to use that...
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Mar 21, 2024 at 08:03:32PM -0500, Nathan Bossart wrote:
On Thu, Mar 21, 2024 at 08:59:54PM -0400, Tom Lane wrote:
However ... I just remembered that we have a Bloom filter implementation
in core now (src/backend/lib/bloomfilter.c). How about using that
to quickly reject (hopefully) most role OIDs, and only do the
list_member_oid check if the filter passes?

Seems worth a try. I've been looking for an excuse to use that...
The Bloom filter appears to help a bit, although it regresses the
create-roles.sql portion of the test. I'm assuming that's thanks to all
the extra pallocs and pfrees, which are probably avoidable if we store the
filter in a long-lived context and just clear it at the beginning of each
call to roles_is_member_of().
         HEAD   hash   hash+bloom
create   2.02   2.06   2.92
grant    4.63   0.27   0.08
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
bloom.patch (text/x-diff)
First of all, thank you for looking into this.
For the moment we are working around the problem by altering the `acc` ROLE
into a SUPERUSER on the PostgreSQL 16 instances. It sidesteps the problem and
has the lowest implementation cost for us. While at first this felt like
opening a security hole, it does not introduce side effects for **our use
case**, given the way our application uses the `acc` ROLE.
Of course we cannot recommend this workaround to others in a similar
situation.
Nathan Bossart <nathandbossart@gmail.com> writes:
The Bloom filter appears to help a bit, although it regresses the
create-roles.sql portion of the test. I'm assuming that's thanks to all
the extra pallocs and pfrees, which are probably avoidable if we store the
filter in a long-lived context and just clear it at the beginning of each
call to roles_is_member_of().
The zero-fill to reinitialize the filter probably costs a good deal
all by itself, considering we're talking about at least a megabyte.
Maybe a better idea is to not enable the filter till we're dealing
with at least 1000 or so entries in roles_list, though obviously that
will complicate the logic.
+ if (bloom_lacks_element(bf, (unsigned char *) &otherid, sizeof(Oid)))
+ roles_list = lappend_oid(roles_list, otherid);
+ else
+ roles_list = list_append_unique_oid(roles_list, otherid);
+ bloom_add_element(bf, (unsigned char *) &otherid, sizeof(Oid));
Hmm, I was imagining something more like
if (bloom_lacks_element(bf, (unsigned char *) &otherid, sizeof(Oid)) ||
!list_member_oid(roles_list, otherid))
{
roles_list = lappend_oid(roles_list, otherid);
bloom_add_element(bf, (unsigned char *) &otherid, sizeof(Oid));
}
to avoid duplicate bloom_add_element calls. Probably wouldn't move
the needle much in this specific test case, but this formulation
would simplify letting the filter kick in later. Very roughly,
if ((bf && bloom_lacks_element(bf, (unsigned char *) &otherid, sizeof(Oid))) ||
!list_member_oid(roles_list, otherid))
{
if (bf == NULL && list_length(roles_list) > 1000)
{
... create bf and populate with existing list entries
}
roles_list = lappend_oid(roles_list, otherid);
if (bf)
bloom_add_element(bf, (unsigned char *) &otherid, sizeof(Oid));
}
regards, tom lane
On Thu, Mar 21, 2024 at 08:59:54PM -0400, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
On Thu, Mar 21, 2024 at 03:40:12PM -0500, Nathan Bossart wrote:
On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
I don't think we have any really cheap way to de-duplicate the role
OIDs, especially seeing that it has to be done on-the-fly within the
collection loop, and the order of roles_list is at least potentially
interesting. Not sure how to make further progress without a lot of
work.

Assuming these are larger lists, this might benefit from optimizations
involving SIMD intrinsics.

Never mind. With the reproduction script, I'm only seeing a ~2%
improvement with my patches.

Yeah, you cannot beat an O(N^2) problem by throwing SIMD at it.
I apparently had some sort of major brain fade when I did this because I
didn't apply your hashing patch when I ran this SIMD test. With it
applied, I see a speedup of ~39%, which makes a whole lot more sense to me.
If I add the Bloom patch (updated with your suggestions), I get another
~73% improvement from there, and a much smaller regression in the role
creation portion.
         hash   hash+simd   hash+simd+bloom
create   1.27   1.27        1.28
grant    0.18   0.11        0.03
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
bloom_v2.patch (text/x-diff)
On Fri, Mar 22, 2024 at 09:47:39AM -0500, Nathan Bossart wrote:
         hash   hash+simd   hash+simd+bloom
create   1.27   1.27        1.28
grant    0.18   0.11        0.03
For just hash+bloom, I'm seeing 1.29 and 0.04.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Fri, Mar 22, 2024 at 09:47:39AM -0500, Nathan Bossart wrote:
         hash   hash+simd   hash+simd+bloom
create   1.27   1.27        1.28
grant    0.18   0.11        0.03
For just hash+bloom, I'm seeing 1.29 and 0.04.
Yeah, that's about what I'd expect: hash+bloom ought to remove
most (not quite all) of the opportunity for simd to shine, because
the bloom filter should eliminate most of the list_member_oid calls.
Possibly we could fix that small regression in the create phase
with more careful tuning of the magic constants in the bloom
logic? Although I'd kind of expect that the create step doesn't
ever invoke the bloom filter, else it would have been showing a
performance problem before; so this might not be a good test case
for helping us tune those.
I think remaining questions are:
* Is there any case where the new hash catcache logic could lose
measurably? I kind of doubt it, because we were already computing
the hash value for list searches; so basically the only overhead
is one more palloc per cache during the first list search. (If
you accumulate enough lists to cause a rehash, you're almost
surely well into winning territory.)
* Same question for the bloom logic, but here I think it's mostly
a matter of tuning those constants.
* Do we want to risk back-patching any of this, to fix the performance
regression in v16? I think that the OP's situation is a pretty
narrow one, but maybe he's not the only person who managed to dodge
roles_is_member_of's performance issues in most other cases.
regards, tom lane
On Fri, Mar 22, 2024 at 11:27:46AM -0400, Tom Lane wrote:
Yeah, that's about what I'd expect: hash+bloom ought to remove
most (not quite all) of the opportunity for simd to shine, because
the bloom filter should eliminate most of the list_member_oid calls.
Right. IMHO the SIMD work is still worth considering because there are
probably even more extreme cases where it'll make a decent amount of
difference. Plus, that stuff is pretty low overhead for what you get in
return. That being said, the hash table and Bloom filter should definitely
be the higher priority.
* Same question for the bloom logic, but here I think it's mostly
a matter of tuning those constants.
I suspect there might be some regressions just after the point where we
construct the filter, but I'd still expect that to be a reasonable
trade-off. We could probably pretty easily construct some benchmarks to
understand the impact with a given number of roles. (I'm not sure I'll be
able to get to that today.)
* Do we want to risk back-patching any of this, to fix the performance
regression in v16? I think that the OP's situation is a pretty
narrow one, but maybe he's not the only person who managed to dodge
roles_is_member_of's performance issues in most other cases.
I've heard complaints about performance with many roles before, so I
certainly think this area is worth optimizing. As far as back-patching
goes, my current feeling is that the hash table is probably pretty safe and
provides the majority of the benefit, but anything fancier should probably
be reserved for v17 or v18.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
On Fri, Mar 22, 2024 at 11:27:46AM -0400, Tom Lane wrote:
* Do we want to risk back-patching any of this, to fix the performance
regression in v16? I think that the OP's situation is a pretty
narrow one, but maybe he's not the only person who managed to dodge
roles_is_member_of's performance issues in most other cases.
I've heard complaints about performance with many roles before, so I
certainly think this area is worth optimizing. As far as back-patching
goes, my current feeling is that the hash table is probably pretty safe and
provides the majority of the benefit, but anything fancier should probably
be reserved for v17 or v18.
Yeah. Although both the catcache and list_append_unique_oid bits
are O(N^2), the catcache seems to have a much bigger constant
factor --- when I did a "perf" check on the unpatched code,
I saw catcache eating over 90% of the runtime and list_member_oid
about 2%. So let's fix that part in v16 and call it a day.
It should be safe to back-patch the catcache changes as long as
we put the new fields at the end of the struct and leave cc_lists
present but empty.
Would you like to review the catcache patch further, or do you
think it's good to go?
regards, tom lane
On Fri, Mar 22, 2024 at 12:53:15PM -0400, Tom Lane wrote:
Would you like to review the catcache patch further, or do you
think it's good to go?
I'll take another look this afternoon.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com