Improve pg_dump dumping publication tables

Started by Hsu, John · over 5 years ago · 4 messages · pgsql-hackers
#1 Hsu, John
hsuchen@amazon.com

Hi hackers,

I was wondering if there's a good reason for pg_dump's getPublicationTables()
to iterate through all tables one by one, querying each to see whether it has
a corresponding publication, other than memory concerns?

Attached is a patch that queries for all publication tables at once and then
looks up the corresponding TableInfo entries in tblinfoindex.

Users are likely to have far more tables than publications, so this should be
considerably faster, especially when there are millions of tables.
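The consolidated approach can be sketched as a single catalog query (a rough sketch for illustration; the exact query in the patch may differ):

```sql
-- One round trip: fetch every published table across all publications,
-- instead of querying pg_publication_rel once per table.
SELECT p.pubname,
       pr.prrelid AS relid
FROM pg_catalog.pg_publication_rel pr
     JOIN pg_catalog.pg_publication p ON p.oid = pr.prpubid
ORDER BY p.pubname;
```

pg_dump can then match each returned relid against the tables it has already collected, rather than issuing one query per table.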

Setup:
time pg_dump --schema-only -d postgres -f dump.sql -v 2> dump.log
10k tables, each table having its own publication.

Before patch:
real 0m6.409s
user 0m0.234s
sys 0m0.584s

With patch:
real 0m5.483s
user 0m0.399s
sys 0m0.443s

10k tables, no publications.

Before patch:
real 0m6.193s
user 0m0.276s
sys 0m0.531s

With patch:
real 0m5.261s
user 0m0.254s
sys 0m0.377s

Thanks,

John H

Attachments:

0001-pg_dump-Iterate-through-all-publication-tables-in-pg.patch (+37, -62)
#2 Cary Huang
cary.huang@highgo.ca
In reply to: Hsu, John (#1)
Re: Improve pg_dump dumping publication tables

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

Hi

I applied the patch to the PG master branch successfully, and it appears to do as described. If there are many tables in Postgres and only a few of them have publications, getPublicationTables() loops through all the tables and checks each one to see whether it is the desired relation type and whether it has one or more publications, so potentially a lot of looping and checking happens. John's patch replaces all of that with a single query returning every table that has a publication.

One concern: if, as you say, the server has a million tables and all of them have publications, the query can return a million rows, which requires a lot of memory on the client side, and pg_dump may need to handle this extreme case with some form of pagination, which adds complexity to the function's logic. The existing logic does not have this problem, because it simply queries one table's publications at a time and does so a million times.
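If client-side result size ever did become a concern, one option (not part of the patch, just a sketch of the pagination Cary mentions) would be fetching the rows in batches with a server-side cursor:

```sql
-- Cursors require an explicit transaction unless declared WITH HOLD.
BEGIN;
DECLARE pubtab_cur CURSOR FOR
    SELECT p.pubname, pr.prrelid
    FROM pg_catalog.pg_publication_rel pr
         JOIN pg_catalog.pg_publication p ON p.oid = pr.prpubid;
-- Fetch 1000 rows per round trip; repeat until no rows are returned.
FETCH FORWARD 1000 FROM pubtab_cur;
CLOSE pubtab_cur;
COMMIT;
```

This keeps client memory bounded at the cost of extra round trips, which is the complexity trade-off Cary describes.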

thank you
Cary Huang
HighGo Software

#3 John Hsu
chenhao.john.hsu@gmail.com
In reply to: Cary Huang (#2)
Re: Improve pg_dump dumping publication tables

Hi Cary,

Thanks for taking a look. I agree there's a risk since there's more memory usage on the client side, and if you're dumping, say, millions of publication tables, that could be problematic.
I'd like to think this isn't any riskier than existing pg_dump code such as getTables(...), where presumably we would run into similar problems.

Cheers,
John H

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hsu, John (#1)
Re: Improve pg_dump dumping publication tables

"Hsu, John" <hsuchen@amazon.com> writes:

I was wondering if there's a good reason for pg_dump's getPublicationTables()
to iterate through all tables one by one, querying each to see whether it has
a corresponding publication, other than memory concerns?

I just came across this entry in the CommitFest, and I see that it's
practically the same as something I did in passing in 8e396a773.
The main difference is that I got rid of the server-side join, too,
in favor of having getPublicationTables locate the PublicationInfo
that should have been created already by getPublications. (The
immediate rationale for that was to get the publication owner name
from the PublicationInfo; but we'd have had to do that eventually
anyway if we ever want to allow selective dumping of publications.)
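For reference, the publication owner Tom mentions is available directly from the catalog; a sketch of how one might fetch it when collecting publications (not necessarily the exact query used in 8e396a773):

```sql
-- Fetch each publication together with its owner, resolved to a role name
-- via the regrole cast, so the owner travels with the PublicationInfo.
SELECT p.oid,
       p.pubname,
       p.pubowner::pg_catalog.regrole AS owner
FROM pg_catalog.pg_publication p;
```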

Anyway, I apologize for treading on your toes. If I'd noticed this
thread earlier I would certainly have given you credit for being the
first to have the idea.

As far as the memory argument goes, I'm not too concerned about it
because both the PublicationRelInfo structs and the tuples of the
transient PGresult are pretty small. In principle if you had very
many entries in pg_publication_rel, but a selective dump was only
interested in a few of them, there might be an interesting amount of
space wasted here. But that argument fails because even a selective
dump collects data about all tables, for reasons that are hard to get
around. The incremental space usage for PublicationRelInfos seems
unlikely to be significant compared to the per-table data we'd have
anyway.

I'll mark this CF entry "withdrawn", since it wasn't rejected
exactly. Too bad we don't have a classification of "superseded
by events", or something like that.

regards, tom lane