Can we get rid of repeated queries from pg_dump?
Hi,
I have the following case: local pg_dump (v15) connecting to a remote
PostgreSQL (v12).
I'm trying to get just schema (pg_dump -s). It's taking very long, which
is kinda OK given that there is long distance and latency, but I got
curious and checked queries that the pg_dump was running (select * from
pg_stat_activity where application_name = 'pg_dump').
And I noticed that many of these queries repeat many times.
The ones that I noticed were:
SELECT pg_catalog.format_type('2281'::pg_catalog.oid, NULL)
around the time that
SELECT
proretset,
prosrc,
probin,
provolatile,
proisstrict,
prosecdef,
lanname,
proconfig,
procost,
prorows,
pg_catalog.pg_get_function_arguments(p.oid) AS funcargs,
pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs,
pg_catalog.pg_get_function_result(p.oid) AS funcresult,
proleakproof,
array_to_string(protrftypes, ' ') AS protrftypes,
proparallel,
prokind,
prosupport,
NULL AS prosqlbody
FROM
pg_catalog.pg_proc p,
pg_catalog.pg_language l
WHERE
p.oid = '60188'::pg_catalog.oid
AND l.oid = p.prolang
was called too.
It seems that for every function, pg_dump is getting its data, and then
runs format_type on each parameter/output type? I'm mostly guessing
here, as I didn't read the code.
Wouldn't it be possible to get all type formats at once, and cache them
in pg_dump? Or at the very least reuse already received information?
Unfortunately it seems I can't run pg_dump closer to the db server, and
the latency of queries is killing me.
It's been 15 minutes, and pg_dump (called: pg_dump -v -s -f schema.dump,
with env variables configuring the db connection) hasn't written even a single
byte to schema.dump.
depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
It seems that for every function, pg_dump is getting its data, and then
runs format_type on each parameter/output type? I'm mostly guessing
here, as I didn't read the code.
Wouldn't it be possible to get all type formats at once, and cache them
in pg_dump? Or at the very least reuse already received information?
Send a patch ...
regards, tom lane
On Thu, Aug 26, 2021 at 10:02:07AM -0400, Tom Lane wrote:
hubert depesz lubaczewski <depesz@depesz.com> writes:
It seems that for every function, pg_dump is getting its data, and then
runs format_type on each parameter/output type? I'm mostly guessing
here, as I didn't read the code.
Wouldn't it be possible to get all type formats at once, and cache them
in pg_dump? Or at the very least reuse already received information?
Send a patch ...
Yeah, that's not going to work, my C skills are next-to-none :(
I guess I'll have to wait till someone else decides it's a problem,
someone with the skills to do something about it.
Best regards,
depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
On Thu, Aug 26, 2021 at 10:02:07AM -0400, Tom Lane wrote:
hubert depesz lubaczewski <depesz@depesz.com> writes:
Wouldn't it be possible to get all type formats at once, and cache them
in pg_dump? Or at the very least reuse already received information?
Send a patch ...
Yeah, that's not going to work, my C skills are next-to-none :(
I guess I'll have to wait till someone else decides it's a problem,
someone with the skills to do something about it.
Well, you could move it forward by doing the legwork to identify which
queries are worth merging. Is it really sane to do a global "select
format_type() from pg_type" query and save all the results on the client
side? I wonder whether there are cases where that'd be a net loss.
You could do the experimentation to figure that out without necessarily
having the C skills to make pg_dump actually do it.
regards, tom lane
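[Editor's note: the experiment Tom proposes - one global query whose results
pg_dump would keep client-side - is easy to try by hand. A minimal sketch
(the psql invocation is an assumption; pg_dump itself issues its queries
from C):]

```shell
# The single round-trip replacement for thousands of per-OID calls:
# fetch every type's formatted name once, and keep the oid -> name
# pairs client-side. With a reachable server this would be run as:
#   psql -At -c "$Q" > type_names.tsv
Q="SELECT oid, pg_catalog.format_type(oid, NULL) FROM pg_catalog.pg_type"
echo "$Q"
```

The result set is one row per pg_type entry, small enough to hold in full on
the client for the duration of the dump.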
On Thu, Aug 26, 2021 at 10:20:29AM -0400, Tom Lane wrote:
hubert depesz lubaczewski <depesz@depesz.com> writes:
On Thu, Aug 26, 2021 at 10:02:07AM -0400, Tom Lane wrote:
hubert depesz lubaczewski <depesz@depesz.com> writes:
Wouldn't it be possible to get all type formats at once, and cache them
in pg_dump? Or at the very least reuse already received information?
Send a patch ...
Yeah, that's not going to work, my C skills are next-to-none :(
I guess I'll have to wait till someone else will assume it's a problem,
someone with skills to do something about it.
Well, you could move it forward by doing the legwork to identify which
queries are worth merging. Is it really sane to do a global "select
format_type() from pg_type" query and save all the results on the client
side?
Sure. On it. Will report back when I have more info.
Best regards,
depesz
On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
Hi,
I have the following case: local pg_dump (v15) connecting to a remote
PostgreSQL (v12).
So you are using a dev version of pg_dump or is that a typo?
It's been 15 minutes, and pg_dump (called: pg_dump -v -s -f schema.dump,
with env variables configuring the db connection) hasn't written even a single
byte to schema.dump.
What happens if you run without the -v?
depesz
--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, Aug 26, 2021 at 07:34:26AM -0700, Adrian Klaver wrote:
On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
Hi,
I have the following case: local pg_dump (v15) connecting to a remote
PostgreSQL (v12).
So you are using a dev version of pg_dump or is that a typo?
Yes. I'm running pg_dump from my computer to a (very) remote db server.
It's been 15 minutes, and pg_dump (called: pg_dump -v -s -f schema.dump,
with env variables configuring the db connection) hasn't written even a single
byte to schema.dump.
What happens if you run without the -v?
Well, I guess it works, but with no output I can't judge how fast.
Definitely doesn't seem to be going any faster.
Best regards,
depesz
On 8/26/21 7:35 AM, hubert depesz lubaczewski wrote:
On Thu, Aug 26, 2021 at 07:34:26AM -0700, Adrian Klaver wrote:
On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
Hi,
I have the following case: local pg_dump (v15) connecting to a remote
PostgreSQL (v12).
So you are using a dev version of pg_dump or is that a typo?
Yes. I'm running pg_dump from my computer to a (very) remote db server.
SSHing and dumping on the remote is out as a short term solution?
Well, I guess it works, but with no output I can't judge how fast.
Definitely doesn't seem to be going any faster.
Unknown slow, that didn't help.
Best regards,
depesz
--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, Aug 26, 2021 at 07:46:46AM -0700, Adrian Klaver wrote:
On 8/26/21 7:35 AM, hubert depesz lubaczewski wrote:
On Thu, Aug 26, 2021 at 07:34:26AM -0700, Adrian Klaver wrote:
On 8/26/21 1:44 AM, hubert depesz lubaczewski wrote:
Hi,
I have the following case: local pg_dump (v15) connecting to a remote
PostgreSQL (v12).
So you are using a dev version of pg_dump or is that a typo?
Yes. I'm running pg_dump from my computer to (very) remote db server.
SSHing and dumping on the remote is out as a short term solution?
As I mentioned in the original post - I can't run pg_dump closer to the server.
SSH is not available, at least for me.
Anyway - I got the dump, so I am good for now, but I think that this
could be improved, so I'll work on getting some stats on queries.
Best regards,
depesz
On Thu, Aug 26, 2021 at 10:20:29AM -0400, Tom Lane wrote:
Well, you could move it forward by doing the legwork to identify which
queries are worth merging. Is it really sane to do a global "select
format_type() from pg_type" query and save all the results on the client
side? I wonder whether there are cases where that'd be a net loss.
You could do the experimentation to figure that out without necessarily
having the C skills to make pg_dump actually do it.
So, I got some info.
First, some stats. The DB contains:
- 14 extensions
- 1 aggregate
- 107 functions
- 5 schemas
- 5 sequences
- 188 logged tables
- 1 unlogged table
- 206 "normal" indexes
- 30 unique indexes
- 15 materialized views
- 16 triggers
- 87 types
- 26 views
pg_dump -s of it is ~ 670kB.
Interestingly, while dumping (pg_dump -s -v), we can see progress going on, and then, after:
====
...
pg_dump: reading publications
pg_dump: reading publication membership
pg_dump: reading subscriptions
pg_dump: reading dependency data
pg_dump: saving encoding = UTF8
pg_dump: saving standard_conforming_strings = on
pg_dump: saving search_path =
====
It stops (the progress output in the console halts). And then, in the pg logs I see queries like:
#v+
SELECT
proretset,
prosrc,
probin,
provolatile,
proisstrict,
prosecdef,
lanname,
proconfig,
procost,
prorows,
pg_catalog.pg_get_function_arguments(p.oid) AS funcargs,
pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs,
pg_catalog.pg_get_function_result(p.oid) AS funcresult,
proleakproof,
array_to_string(protrftypes, ' ') AS protrftypes,
proparallel,
prokind,
prosupport,
NULL AS prosqlbody
FROM pg_catalog.pg_proc p, pg_catalog.pg_language l
WHERE p.oid = '43875'::pg_catalog.oid AND l.oid = p.prolang
#v-
Now for query stats.
To dump it all, pg_dump needed 9173 queries (logged by
log_min_duration_statement = 0 for this user).
I extracted all queries to separate files, and made stats. In total there were
only 4257 unique queries.
Then I checked for repeated queries. Top 10 most repeated offenders were:
615 times : SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)
599 times : SELECT pg_catalog.format_type('23'::pg_catalog.oid, NULL)
579 times : SELECT pg_catalog.format_type('2281'::pg_catalog.oid, NULL)
578 times : SELECT pg_catalog.format_type('41946'::pg_catalog.oid, NULL)
523 times : SELECT pg_catalog.format_type('701'::pg_catalog.oid, NULL)
459 times : SELECT pg_catalog.format_type('42923'::pg_catalog.oid, NULL)
258 times : SELECT pg_catalog.format_type('16'::pg_catalog.oid, NULL)
176 times : SELECT pg_catalog.format_type('19'::pg_catalog.oid, NULL)
110 times : SELECT pg_catalog.format_type('21'::pg_catalog.oid, NULL)
106 times : SELECT pg_catalog.format_type('42604'::pg_catalog.oid, NULL)
In total, there were 5000 queries:
SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
But there were only 83 separate oids that were scanned.
The only other repeated command was:
SELECT pg_catalog.set_config('search_path', '', false);
and it was called only twice.
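[Editor's note: counts like the ones above can be reproduced from a file of
logged statements with a standard pipeline. A sketch - real log lines carry a
log_line_prefix and timestamp that would need stripping first, and the sample
file below is made up:]

```shell
# One query per line in; "count  query" out, most frequent first --
# the same shape as the top-10 list above.
cat > /tmp/queries.txt <<'EOF'
SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)
SELECT pg_catalog.format_type('23'::pg_catalog.oid, NULL)
SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)
EOF
sort /tmp/queries.txt | uniq -c | sort -rn
```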
Based on my reading of the queries in order, they seem to follow this pattern:
One call for:
SELECT proretset, prosrc, probin, provolatile, proisstrict, prosecdef, lanname, proconfig, procost, prorows, pg_catalog.pg_get_function_arguments(p.oid) AS funcargs, pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs, pg_catalog.pg_get_function_result(p.oid) AS funcresult, proleakproof, array_to_string(protrftypes, ' ') AS protrftypes, proparallel, prokind, prosupport, NULL AS prosqlbody FROM pg_catalog.pg_proc p, pg_catalog.pg_language l WHERE p.oid = 'SOME_NUMBER'::pg_catalog.oid AND l.oid = p.prolang
and then one or more:
SELECT pg_catalog.format_type('SOME_NUMBER'::pg_catalog.oid, NULL)
In one case, after the proc query, there were 94 consecutive
pg_catalog.format_type queries.
I hope it helps.
Best regards,
depesz
On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
Now for query stats.
To dump it all, pg_dump needed 9173 queries (logged by
log_min_duration_statement = 0 for this user).
I extracted all queries to separate files, and made stats. In total there were
only 4257 unique queries.
Then I checked for repeated queries. Top 10 most repeated offenders were:
615 times : SELECT pg_catalog.format_type('25'::pg_catalog.oid, NULL)
599 times : SELECT pg_catalog.format_type('23'::pg_catalog.oid, NULL)
579 times : SELECT pg_catalog.format_type('2281'::pg_catalog.oid, NULL)
578 times : SELECT pg_catalog.format_type('41946'::pg_catalog.oid, NULL)
523 times : SELECT pg_catalog.format_type('701'::pg_catalog.oid, NULL)
459 times : SELECT pg_catalog.format_type('42923'::pg_catalog.oid, NULL)
258 times : SELECT pg_catalog.format_type('16'::pg_catalog.oid, NULL)
176 times : SELECT pg_catalog.format_type('19'::pg_catalog.oid, NULL)
110 times : SELECT pg_catalog.format_type('21'::pg_catalog.oid, NULL)
106 times : SELECT pg_catalog.format_type('42604'::pg_catalog.oid, NULL)
In total, there were 5000 queries:
SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
But there were only 83 separate oids that were scanned.
That is a strong argument for using a hash table to cache the types.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
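[Editor's note: the effect of the cache Laurenz describes can be illustrated
without any C - a quick simulation where each cache miss stands in for one
server round trip; the oids and counter are made up for the sketch:]

```shell
# Memoized lookups: with Hubert's numbers (5000 format_type queries
# over 83 distinct oids), a cache would cut the round trips to 83.
# Here: 6 lookups over 3 distinct oids -> 3 simulated round trips.
seen=""
round_trips=0
for oid in 25 23 25 2281 23 25; do
    case " $seen " in
        *" $oid "*) : ;;                        # cache hit: no query sent
        *) seen="$seen $oid"                    # cache miss: remember oid...
           round_trips=$((round_trips + 1)) ;;  # ...and pay one round trip
    esac
done
echo "round trips: $round_trips"
```

This prints "round trips: 3"; over a high-latency link, every avoided round
trip is wall-clock time saved.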
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
In total, there were 5000 queries:
SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
But there were only 83 separate oids that were scanned.
That is a strong argument for using a hash table to cache the types.
Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like. I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?
I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead. On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.
(This is not meant for commit as-is; notably, I didn't bother to fix
getTypes' code paths for pre-9.6 servers. It should be fine for
performance testing though.)
regards, tom lane
Attachments:
avoid-repeated-format_type-queries-0.1.patch (text/x-diff, +19 -4)
On 8/27/21 2:23 PM, Tom Lane wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
In total, there were 5000 queries:
SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
But there were only 83 separate oids that were scanned.
That is a strong argument for using a hash table to cache the types.
Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like. I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?
Could be.
From the stats post:
"Based on my reading of queries in order it seems to follow the pattern of:
One call for:
SELECT proretset, prosrc, probin, provolatile, proisstrict,
prosecdef, lanname, proconfig, procost, prorows,
pg_catalog.pg_get_function_arguments(p.oid) AS funcargs,
pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs,
pg_catalog.pg_get_function_result(p.oid) AS funcresult,
proleakproof, array_to_string(protrftypes,
' ') AS protrftypes, proparallel, prokind, prosupport, NULL AS
prosqlbody FROM pg_catalog.pg_proc p, pg_catalog.pg_language l WHERE
p.oid = 'SOME_NUMBER'::pg_catalog.oid AND l.oid = p.prolang
and then one or more:
SELECT pg_catalog.format_type('SOME_NUMBER'::pg_catalog.oid, NULL)
In one case, after the proc query, there were 94 consecutive
pg_catalog.format_type queries.
"
I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead. On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.
(This is not meant for commit as-is; notably, I didn't bother to fix
getTypes' code paths for pre-9.6 servers. It should be fine for
performance testing though.)
regards, tom lane
--
Adrian Klaver
adrian.klaver@aklaver.com
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Thu, 2021-08-26 at 18:06 +0200, hubert depesz lubaczewski wrote:
In total, there were 5000 queries:
SELECT pg_catalog.format_type('[0-9]+'::pg_catalog.oid, NULL)
But there were only 83 separate oids that were scanned.
That is a strong argument for using a hash table to cache the types.
Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like. I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?
I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead. On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.
Seems like the issue here is mainly just the latency of each query being
rather high compared to most use-cases, so local testing where there's
basically zero latency wouldn't see any change in timing, but throw a
trans-atlantic or worse amount of latency between the system running
pg_dump and the PG server and you'd see notable wall-clock savings in
time.
Only took a quick look but generally +1 on reducing the number of
queries that pg_dump is doing and the changes suggested looked good to
me.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead. On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.
Seems like the issue here is mainly just the latency of each query being
rather high compared to most use-cases, so local testing where there's
basically zero latency wouldn't see any change in timing, but throw a
trans-atlantic or worse amount of latency between the system running
pg_dump and the PG server and you'd see notable wall-clock savings in
time.
Yeah. What I was more concerned about was the potential downside
of running format_type() for each pg_type row, even though we might
use only a few of those results. The fact that I'm *not* seeing
a performance hit with a local server is encouraging from that
standpoint.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead. On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.
Seems like the issue here is mainly just the latency of each query being
rather high compared to most use-cases, so local testing where there's
basically zero latency wouldn't see any change in timing, but throw a
trans-atlantic or worse amount of latency between the system running
pg_dump and the PG server and you'd see notable wall-clock savings in
time.
Yeah. What I was more concerned about was the potential downside
of running format_type() for each pg_type row, even though we might
use only a few of those results. The fact that I'm *not* seeing
a performance hit with a local server is encouraging from that
standpoint.
Ah, yes, agreed.
Thanks!
Stephen
Adrian Klaver <adrian.klaver@aklaver.com> writes:
On 8/27/21 2:23 PM, Tom Lane wrote:
Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like. I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?
Could be.
Maybe. I'm disturbed by the discrepancy between my result (about
10% of pg_dump's queries are these) and Hubert's (over 50% are).
I'd like to know the reason for that before we push forward.
regards, tom lane
On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
I experimented with the attached, very quick-n-dirty patch to collect
format_type results during the initial scan of pg_type, instead. On the
regression database in HEAD, it reduces the number of queries pg_dump
issues from 3260 to 2905; but I'm having a hard time detecting any net
performance change.
(This is not meant for commit as-is; notably, I didn't bother to fix
getTypes' code paths for pre-9.6 servers. It should be fine for
performance testing though.)
Hi,
thanks a lot for this. Will test and report back, most likely on Monday,
though.
Best regards,
depesz
On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like. I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?
Oh, I missed that part.
So I checked. In the meantime I got a -Fc dump, so:
#v+
=$ pg_restore -l schema.dump | \
grep -P '^\d*; \d+ \d+ FUNCTION ' |
sed 's/^[^(]*(//; s/)[^)]*$//' |
awk -F, '{print NF}' |
sort -n |
uniq -c
23 0
52 1
21 2
8 3
1 4
2 5
#v-
23 functions with 0 arguments, 52 with 1, and the max is 5 arguments - two
functions have these.
Not sure if it matters, but there are a lot of enums - 83 of them. And they
have up to 250 elements (2 such types).
Best regards,
depesz
hubert depesz lubaczewski <depesz@depesz.com> writes:
On Fri, Aug 27, 2021 at 05:23:23PM -0400, Tom Lane wrote:
Those queries are coming from getFormattedTypeName(), which is used
for function arguments and the like. I'm not quite sure why Hubert
is seeing 5000 such calls in a database with only ~100 functions;
surely they don't all have an average of 50 arguments?
23 functions with 0 arguments, 52 with 1, and the max is 5 arguments - two
functions have these.
Not sure if it matters, but there are a lot of enums - 83 of them. And they
have up to 250 elements (2 such types).
Hmm, no, I don't see any getFormattedTypeName calls in dumpEnumType.
There are two of 'em in dumpCast though. Does this DB by chance
have a ton of user-defined casts?
regards, tom lane