Add support for specifying tables in pg_createsubscriber.
Hi hackers,
Currently, pg_createsubscriber supports converting streaming
replication to logical replication for selected databases or all
databases. However, there is no provision to replicate only a few
selected tables. For such cases, users are forced to manually set up
logical replication using individual SQL commands (CREATE PUBLICATION,
CREATE SUBSCRIPTION, etc.), which can be time-consuming and
error-prone. Extending pg_createsubscriber to support table-level
replication would significantly improve the time taken to perform the
setup.
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument. It allows users to
selectively replicate specific tables within a database instead of
defaulting to all tables. The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.
Example usage:
./pg_createsubscriber \ --database db1 \ --table 'public.t1' \ --table
'public.t2(a,b) WHERE a > 100' \ --database db2 \ --table 'public.t3'
I conducted tests comparing the patched pg_createsubscriber with
standard logical replication under various scenarios to assess
performance and flexibility. All test results represent the average of
five runs.
Scenario pg_createsubscriber Logical Replication Improvement
Two databases
(postgres and
db1 each
having 100
tables), replicate
all 100 in
postgres, 50
tables in db1
(100MB/table)
total 15GB data 2m4.823s 7m23.294s 71.85%
One DB, 100
tables, replicate
50 tables
(200 MB/table)
total 10GB data 2m47.703s 4m58.003s 43.73%
One DB, 200
tables, replicate
100 tables
(100 MB/table)
total 10GB data 3m6.476s 4m35.130s 32.22%
One DB, 100
tables, replicate
50 tables
(100MB/table)
total 5GB data 1m54.384s 2m23.719s 20.42%
These results demonstrate that pg_createsubscriber consistently
outperforms standard logical replication by 20.42% for 5GB data to
71.85% for 15GB data, the time taken reduces as the data increases.
The attached test scripts were used for all experiments.
Scenario 1 (Logical replication setup involving 50 tables across 2
databases, each containing 100 tables with 100 MB of data per table):
pg_createsubscriber_setup_multi_db.sh was used for setup, followed by
pg_createsubscriber_test_multi_db.sh to measure performance. For
logical replication, the setup was done using
logical_replication_setup_multi_db.sh, with performance measured via
logical_replication_test_multi_db.sh.
Scenario 2 and 3:
The pg_createsubscriber_setup_single_db.sh (uncomment appropriate
scenario mentioned in comments) script was used, with configuration
changes specific to Scenario 2 and Scenario 3. In both cases,
pg_createsubscriber_test_single_db.sh (uncomment appropriate scenario
mentioned in comments) was used for measuring performance. Logical
replication followed the same pattern, using
logical_replication_setup_single_db.sh (uncomment appropriate scenario
mentioned in comments) and logical_replication_test_single_db.sh
(uncomment appropriate scenario mentioned in comments) for
measurement.
Scenario 4 (Logical replication setup on 50 tables from a database
containing 100 tables, each with 100 MB of data):
pg_createsubscriber_setup_single_db.sh (without modifications) was
used for setup, and pg_createsubscriber_test_single_db.sh (without
modifications) was used for performance measurement. Logical
replication used logical_replication_setup_single_db.sh (without
modifications) for setup and logical_replication_test_single_db.sh
(without modifications) for measurement.
Thoughts?
Thanks and regards,
Shubham Khanna.
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.
Do we have another example which we consider the ordering of options? I'm unsure
for it. Does getopt_long() always return parsed options with the specified order?
The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.
Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.
Per document:
```
+ <term><option>-f <replaceable class="parameter">table</replaceable></option></term>
+ <term><option>--table=<replaceable class="parameter">table</replaceable></option></term>
```
I feel using "-f" is not suitable. Let's remove the shorten option now.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Monday, July 21, 2025 1:31 PM Shubham Khanna <khannashubham1197@gmail.com> wrote:
Hi hackers,
Currently, pg_createsubscriber supports converting streaming
replication to logical replication for selected databases or all
databases. However, there is no provision to replicate only a few
selected tables. For such cases, users are forced to manually set up
logical replication using individual SQL commands (CREATE PUBLICATION,
CREATE SUBSCRIPTION, etc.), which can be time-consuming and
error-prone. Extending pg_createsubscriber to support table-level
replication would significantly improve the time taken to perform the
setup.
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument. It allows users to
selectively replicate specific tables within a database instead of
defaulting to all tables. The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.
Example usage:
./pg_createsubscriber \ --database db1 \ --table 'public.t1' \ --table
'public.t2(a,b) WHERE a > 100' \ --database db2 \ --table 'public.t3'I conducted tests comparing the patched pg_createsubscriber with
standard logical replication under various scenarios to assess
performance and flexibility. All test results represent the average of
five runs.Thoughts?
Aside from the interface discussion, I think this is an interesting feature in
general. I got some feedback from PGConf.dev this year that many people favor
using pg_createsubscriber to accelerate the initial table synchronization phase.
However, some users prefer subscribing to a subset of tables from the publisher,
whereas pg_createsubscriber currently subscribes all tables by default. This
necessitates additional adjustments to publications and subscriptions afterward.
So, this feature could streamline the process, reducing the steps users need to
take.
Best Regards,
Hou zj
On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.Do we have another example which we consider the ordering of options? I'm
unsure
for it. Does getopt_long() always return parsed options with the specified
order?The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.
We have another example to consider: pg_amcheck, which allows users to specify
multiple databases. Following this precedent, it may be beneficial to adopt a
similar style in pg_createsubscriber. E.g., Users could specify tables using
database-qualified names, such as:
./pg_createsubscriber --database db1 --table 'db1.public.t1' --table
'db1.public.t2(a,b) WHERE a > 100' --database db2 --table 'db2.public.t3'
This approach enables the tool to internally categorize specified tables by
database and create publications accordingly.
Best Regards,
Hou zj
On 2025-08-01 Fr 4:03 AM, Zhijie Hou (Fujitsu) wrote:
On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu)<kuroda.hayato@fujitsu.com> wrote:
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.Do we have another example which we consider the ordering of options? I'm
unsure
for it. Does getopt_long() always return parsed options with the specified
order?The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.We have another example to consider: pg_amcheck, which allows users to specify
multiple databases.
I don't think that's quite the point, as I understand it. pg_amcheck
might allow you to have multiple --database arguments, but I don't think
it depends on the order of arguments. You didn't answer his question
about what getopt_long() does. I don't recall if it is free to mangle
the argument order.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Friday, August 1, 2025 8:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-08-01 Fr 4:03 AM, Zhijie Hou (Fujitsu) wrote:
On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu)
mailto:kuroda.hayato@fujitsu.com wrote:Dear Shubham,
The attached patch introduces a new '--table' option that can be specified
after each '--database' argument.Do we have another example which we consider the ordering of options? I'm
unsure for it. Does getopt_long() always return parsed options with the
specified order?The syntax is like that used in 'vacuumdb' and supports multiple '--table'
arguments per database, including optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right? I'm afraid
that current API has too complex.We have another example to consider: pg_amcheck, which allows users to specify
multiple databases.I don't think that's quite the point, as I understand it. pg_amcheck might
allow you to have multiple --database arguments, but I don't think it depends
on the order of arguments. You didn't answer his question about what
getopt_long() does. I don't recall if it is free to mangle the argument order.
I think you might misunderstand my proposal. I am suggesting an alternative
interface style that employs database-qualified table names, which doesn't
depend on the order of options. This style is already used by pg_amcheck when
dealing with multiple database specifications. I referenced pg_amcheck as an
example. Please see below:
--
Following this precedent, it may be beneficial to adopt a similar style in
pg_createsubscriber. E.g., Users could specify tables using database-qualified
names, such as:
./pg_createsubscriber --database db1 --table 'db1.public.t1' --table
'db1.public.t2(a,b) WHERE a > 100' --database db2 --table 'db2.public.t3'
This approach enables the tool to internally categorize specified tables by
database and create publications accordingly.
--
Best Regards,
Hou zj
On 2025-08-01 Fr 11:03 AM, Zhijie Hou (Fujitsu) wrote:
On Friday, August 1, 2025 8:56 PM Andrew Dunstan<andrew@dunslane.net> wrote:
On 2025-08-01 Fr 4:03 AM, Zhijie Hou (Fujitsu) wrote:
On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu)
mailto:kuroda.hayato@fujitsu.com wrote:Dear Shubham,
The attached patch introduces a new '--table' option that can be specified
after each '--database' argument.Do we have another example which we consider the ordering of options? I'm
unsure for it. Does getopt_long() always return parsed options with the
specified order?The syntax is like that used in 'vacuumdb' and supports multiple '--table'
arguments per database, including optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right? I'm afraid
that current API has too complex.We have another example to consider: pg_amcheck, which allows users to specify
multiple databases.I don't think that's quite the point, as I understand it. pg_amcheck might
allow you to have multiple --database arguments, but I don't think it depends
on the order of arguments. You didn't answer his question about what
getopt_long() does. I don't recall if it is free to mangle the argument order.I think you might misunderstand my proposal. I am suggesting an alternative
interface style that employs database-qualified table names, which doesn't
depend on the order of options. This style is already used by pg_amcheck when
dealing with multiple database specifications. I referenced pg_amcheck as an
example.
I simple took your own description:
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.
Maybe I need some remedial English, but to me that "after" says that
argument order is significant.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On Saturday, August 2, 2025 12:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-08-01 Fr 11:03 AM, Zhijie Hou (Fujitsu) wrote:
On Friday, August 1, 2025 8:56 PM Andrew Dunstan mailto:andrew@dunslane.net
wrote:
We have another example to consider: pg_amcheck, which allows users to
specify multiple databases.I don't think that's quite the point, as I understand it. pg_amcheck might
allow you to have multiple --database arguments, but I don't think it depends
on the order of arguments. You didn't answer his question about what
getopt_long() does. I don't recall if it is free to mangle the argument order.I think you might misunderstand my proposal. I am suggesting an alternative
interface style that employs database-qualified table names, which doesn't
depend on the order of options. This style is already used by pg_amcheck when
dealing with multiple database specifications. I referenced pg_amcheck as an
example.I simple took your own description: The attached patch introduces a new
'--table' option that can be specified after each '--database' argument. Maybe I
need some remedial English, but to me that "after" says that argument order is
significant.
Allow me to clarify the situation. The description you referenced is the
original interface proposed by the author in the initial email. However, it was
found to be unstable due to its reliance on the argument order. In response to
the discussion, instead of supporting the original interface, I suggested an
alternative interface to consider, which is the one that does not depend on
argument order, as I mentioned in my previous email.
Best Regards,
Hou zj
On 2025-08-01 Fr 8:24 PM, Zhijie Hou (Fujitsu) wrote:
On Saturday, August 2, 2025 12:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-08-01 Fr 11:03 AM, Zhijie Hou (Fujitsu) wrote:
On Friday, August 1, 2025 8:56 PM Andrew Dunstan mailto:andrew@dunslane.net
wrote:
We have another example to consider: pg_amcheck, which allows users to
specify multiple databases.I don't think that's quite the point, as I understand it. pg_amcheck might
allow you to have multiple --database arguments, but I don't think it depends
on the order of arguments. You didn't answer his question about what
getopt_long() does. I don't recall if it is free to mangle the argument order.I think you might misunderstand my proposal. I am suggesting an alternative
interface style that employs database-qualified table names, which doesn't
depend on the order of options. This style is already used by pg_amcheck when
dealing with multiple database specifications. I referenced pg_amcheck as an
example.I simple took your own description: The attached patch introduces a new
'--table' option that can be specified after each '--database' argument. Maybe I
need some remedial English, but to me that "after" says that argument order is
significant.Allow me to clarify the situation. The description you referenced is the
original interface proposed by the author in the initial email. However, it was
found to be unstable due to its reliance on the argument order. In response to
the discussion, instead of supporting the original interface, I suggested an
alternative interface to consider, which is the one that does not depend on
argument order, as I mentioned in my previous email.
Apologies, then, I misread the thread.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Fri, 1 Aug 2025 at 13:33, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.Do we have another example which we consider the ordering of options? I'm
unsure
for it. Does getopt_long() always return parsed options with the specified
order?The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.We have another example to consider: pg_amcheck, which allows users to specify
multiple databases. Following this precedent, it may be beneficial to adopt a
similar style in pg_createsubscriber. E.g., Users could specify tables using
database-qualified names, such as:./pg_createsubscriber --database db1 --table 'db1.public.t1' --table
'db1.public.t2(a,b) WHERE a > 100' --database db2 --table 'db2.public.t3'
pg_amcheck allows specifying tables as a pattern, the below note is from [1]https://www.postgresql.org/docs/devel/app-pgamcheck.html:
Patterns may be unqualified, e.g. myrel*, or they may be
schema-qualified, e.g. myschema*.myrel* or database-qualified and
schema-qualified, e.g. mydb*.myschema*.myrel*. A database-qualified
pattern will add matching databases to the list of databases to be
checked.
In pg_createsubscriber will it be using the exact spec of pg_amcheck
or will the user have to give fully qualified names?
[1]: https://www.postgresql.org/docs/devel/app-pgamcheck.html
Regards,
Vignesh
On Wednesday, August 6, 2025 7:23 PM vignesh C <vignesh21@gmail.com> wrote:
On Fri, 1 Aug 2025 at 13:33, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.Do we have another example which we consider the ordering of
options? I'm unsure for it. Does getopt_long() always return parsed
options with the specified order?The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.We have another example to consider: pg_amcheck, which allows users to
specify multiple databases. Following this precedent, it may be
beneficial to adopt a similar style in pg_createsubscriber. E.g.,
Users could specify tables using database-qualified names, such as:./pg_createsubscriber --database db1 --table 'db1.public.t1' --table
'db1.public.t2(a,b) WHERE a > 100' --database db2 --table 'db2.public.t3'pg_amcheck allows specifying tables as a pattern, the below note is from [1]:
Patterns may be unqualified, e.g. myrel*, or they may be schema-qualified, e.g.
myschema*.myrel* or database-qualified and schema-qualified, e.g.
mydb*.myschema*.myrel*. A database-qualified pattern will add matching
databases to the list of databases to be checked.In pg_createsubscriber will it be using the exact spec of pg_amcheck or will the
user have to give fully qualified names?
Both options are acceptable to me. Fully qualified names might be more familiar
to users of publication DDLs, given that regex is not supported for these
statements. So, I personally think that if we want to start with something
simple, using fully qualified names is sensible, with the possibility to extend
this functionality later if needed.
Best Regards,
Hou zj
On Fri, 8 Aug 2025 at 13:47, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Wednesday, August 6, 2025 7:23 PM vignesh C <vignesh21@gmail.com> wrote:
On Fri, 1 Aug 2025 at 13:33, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.Do we have another example which we consider the ordering of
options? I'm unsure for it. Does getopt_long() always return parsed
options with the specified order?The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.We have another example to consider: pg_amcheck, which allows users to
specify multiple databases. Following this precedent, it may be
beneficial to adopt a similar style in pg_createsubscriber. E.g.,
Users could specify tables using database-qualified names, such as:./pg_createsubscriber --database db1 --table 'db1.public.t1' --table
'db1.public.t2(a,b) WHERE a > 100' --database db2 --table 'db2.public.t3'pg_amcheck allows specifying tables as a pattern, the below note is from [1]:
Patterns may be unqualified, e.g. myrel*, or they may be schema-qualified, e.g.
myschema*.myrel* or database-qualified and schema-qualified, e.g.
mydb*.myschema*.myrel*. A database-qualified pattern will add matching
databases to the list of databases to be checked.In pg_createsubscriber will it be using the exact spec of pg_amcheck or will the
user have to give fully qualified names?Both options are acceptable to me. Fully qualified names might be more familiar
to users of publication DDLs, given that regex is not supported for these
statements. So, I personally think that if we want to start with something
simple, using fully qualified names is sensible, with the possibility to extend
this functionality later if needed.
+1 for implementing this way.
Regards,
Vignesh
On Fri, Aug 8, 2025 at 1:47 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
On Wednesday, August 6, 2025 7:23 PM vignesh C <vignesh21@gmail.com> wrote:
On Fri, 1 Aug 2025 at 13:33, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:On Monday, July 28, 2025 1:07 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Shubham,
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument.Do we have another example which we consider the ordering of
options? I'm unsure for it. Does getopt_long() always return parsed
options with the specified order?The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.Vacuumdb nor pg_restore do not accept multiple --database, right?
I'm afraid that current API has too complex.We have another example to consider: pg_amcheck, which allows users to
specify multiple databases. Following this precedent, it may be
beneficial to adopt a similar style in pg_createsubscriber. E.g.,
Users could specify tables using database-qualified names, such as:./pg_createsubscriber --database db1 --table 'db1.public.t1' --table
'db1.public.t2(a,b) WHERE a > 100' --database db2 --table 'db2.public.t3'pg_amcheck allows specifying tables as a pattern, the below note is from [1]:
Patterns may be unqualified, e.g. myrel*, or they may be schema-qualified, e.g.
myschema*.myrel* or database-qualified and schema-qualified, e.g.
mydb*.myschema*.myrel*. A database-qualified pattern will add matching
databases to the list of databases to be checked.In pg_createsubscriber will it be using the exact spec of pg_amcheck or will the
user have to give fully qualified names?Both options are acceptable to me. Fully qualified names might be more familiar
to users of publication DDLs, given that regex is not supported for these
statements. So, I personally think that if we want to start with something
simple, using fully qualified names is sensible, with the possibility to extend
this functionality later if needed.
Thanks for the suggestion. I have implemented your approach and
incorporated the required changes into the attached patch.
Thanks and regards,
Shubham Khanna.
Attachments:
v2-0001-Support-tables-via-pg_createsubscriber.patchapplication/octet-stream; name=v2-0001-Support-tables-via-pg_createsubscriber.patchDownload+262-3
Hi Shubham,
Some review comments for patch v2-0001
======
1. General - compile errors!
Patch applies OK, but I cannot build pg_createsubscriber. e.g.
pg_createsubscriber.c: In function ‘main’:
pg_createsubscriber.c:2237:5: error: a label can only be part of a
statement and a declaration is not a statement
TableListPerDB * newdb;
^
pg_createsubscriber.c:2324:5: error: a label can only be part of a
statement and a declaration is not a statement
TableSpec * ts = pg_malloc0(sizeof(TableSpec));
^
pg_createsubscriber.c:2325:5: error: expected expression before ‘PQExpBuffer’
PQExpBuffer dbbuf;
^
pg_createsubscriber.c:2326:5: warning: ISO C90 forbids mixed
declarations and code [-Wdeclaration-after-statement]
PQExpBuffer schemabuf;
^
pg_createsubscriber.c:2335:5: error: ‘dbbuf’ undeclared (first use in
this function)
dbbuf = createPQExpBuffer();
I'm not sure how this can be working for you (???). You cannot declare
variables without introducing a code block in the scope of the switch
case.
======
doc/src/sgml/ref/pg_createsubscriber.sgml
2.
--table is missing from synopsis
~~~
3.
+ <term><option>--table=<replaceable
class="parameter">table</replaceable></option></term>
+ <listitem>
+ <para>
+ Adds a table to be included in the publication for the most recently
+ specified database. Can be repeated multiple times. The syntax
+ supports optional column lists and WHERE clauses.
+ </para>
+ </listitem>
+ </varlistentry>
Lacks detail, so I can't tell how to use this. e.g:
- What does the syntax actually look like?
- code suggests you can specify DB, but this doc says it only applies
to the most recent DB
- how supports column lists and WHERE clause (needs examples)
- needs rules FOR ALL TABLES, etc.
- allowed in combination with --all?
- etc.
======
src/bin/pg_basebackup/pg_createsubscriber.c
4.
+typedef struct TableListPerDB
+{
+ char *dbname;
+ TableSpec *tables;
+ struct TableListPerDB *next;
+} TableListPerDB;
I didn't understand the need for this "per-DB" structure.
Later, you declare "TableSpec *tables;" within the "LogicalRepInfo"
structure (which is per-DB) so you already have the "per-DB" table
list right there. Even if you need to maintain some global static
list, then I imagine you could just put a 'dbname' member in
TableSpec. You don't need a whole new structure to do it.
~~~
create_publication:
5.
+ if (dbinfo->tables == NULL)
+ appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES", ipubname_esc);
+ else
+ {
+ bool first = true;
+
+ appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR TABLE ", ipubname_esc);
+ for (TableSpec * tbl = dbinfo->tables; tbl != NULL; tbl = tbl->next)
What if '*' is specified for the table name? Should that cause a
"CREATE PUBLICATION ... FOR TABLES IN SCHEMA ..." instead of making a
publication with 100s or more tables in a FOR TABLES?
~~~
6.
+ for (int i = 0; i < PQntuples(tres); i++)
+ {
+ char *escaped_schema = PQescapeIdentifier(conn, PQgetvalue(tres, i, 0),
+ strlen(PQgetvalue(tres, i, 0)));
+ char *escaped_table = PQescapeIdentifier(conn, PQgetvalue(tres, i, 1),
+ strlen(PQgetvalue(tres, i, 1)));
+
+ appendPQExpBuffer(str, "%s%s.%s", first ? "" : ", ",
+ escaped_schema, escaped_table);
+
+ PQfreemem(escaped_schema);
+ PQfreemem(escaped_table);
+ first = false;
+ }
6a.
How about some other simple variables to avoid all the repeated PQgetvalue?
e.g.
char *sch = PQgetvalue(tres, i, 0);
char *tbl = PQgetvalue(tres, i, 1);
char *escaped_schema = PQescapeIdentifier(conn, sch, strlen(sch));
char *escaped_table = PQescapeIdentifier(conn, tbl, strlen(tbl));
~
6b.
Variable 'first' is redundant. Same as checking 'i == 0'.
~~~
7.
+ if (dry_run)
+ {
+ res = PQexec(conn, "BEGIN");
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
Would it be better to use if/else instead of:
if (dry_run)
if (!dry_run)
~~~
main:
8.
+ TableListPerDB * newdb;
if (!simple_string_list_member(&opt.database_names, optarg))
{
simple_string_list_append(&opt.database_names, optarg);
8a.
Compile error. Need {} scope to declare that variable!
~
8b.
This seems a strange muddle of 'd' which represents DATABASE and
TableListPerDB, which is a list of tables (not a database). I have
doubts that most of this linked list code is even necessary for the
'd' case.
~~~
9.
+ case 6:
+ TableSpec * ts = pg_malloc0(sizeof(TableSpec));
+ PQExpBuffer dbbuf;
Compile error. Need {} scope to declare that variable!
~~~
10.
+ if (dotcnt == 2)
+ {
+ ts->pattern_db_regex = NULL;
+ ts->pattern_schema_regex = pg_strdup(schemabuf->data);
+ ts->pattern_table_regex = pg_strdup(namebuf->data);
+ }
+ else if (dotcnt == 1)
+ {
+ ts->pattern_db_regex = NULL;
+ ts->pattern_schema_regex = pg_strdup(dbbuf->data);
+ ts->pattern_table_regex = pg_strdup(schemabuf->data);
+ }
+ else
+ pg_fatal("invalid --table specification: %s", optarg);
10a.
This code seems quite odd to me:
- The DB becomes the schema?
- The schema becomes the table?
Rather than fudging all the names of the --table parts, if you don't
know what they represent up-fron,t probably it is better to call them
just part1,part2,part3.
~~~
10b.
Why is pattern_db_regex always NULL? If it is always NULL why have it at all?
======
.../t/040_pg_createsubscriber.pl
11.
+# Test: Table-level publication creation
+$node_p->safe_psql($db3, "CREATE TABLE public.t1 (id int, val text)");
+$node_p->safe_psql($db3, "CREATE TABLE public.t2 (id int, val text)");
+$node_p->safe_psql($db4,
+ "CREATE TABLE public.t3 (id int, val text, extra int)");
+
IIUC, the schema name is part of the table syntax. So, you should
include test cases for different schemas.
~~~
12.
+# Run pg_createsubscriber with table-level options
+command_ok(
+ [
+ 'pg_createsubscriber',
+ '--verbose',
+ '--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
+ '--pgdata' => $node_s2->data_dir,
+ '--publisher-server' => $node_p->connstr($db3),
+ '--socketdir' => $node_s2->host,
+ '--subscriber-port' => $node_s2->port,
+ '--database' => $db3,
+ '--table' => "$db3.public.t1",
+ '--table' => "$db3.public.t2",
+ '--database' => $db4,
+ '--table' => "$db4.public.t3",
+ ],
+ 'pg_createsubscriber runs with table-level publication (existing nodes)');
12a.
This is not really testing the same as what the commit message
describes. e.g. what about a test case where --table does not mention
the db explicitly, so relies on the most recent.
~
12b.
What should happen if the explicitly named DB in --table is not the
same as the most recent --database, even though it is otherwise
correct?
e.g.
'--database' => $db3,
'--table' => "$db3.public.t1",
'--database' => $db4,
'--table' => "$db3.public.t2",
'--table' => "$db4.public.t3",
I quickly tried it and AFAICT this was silently accepted and then the
test failed because it gave unexpected results. It doesn't seem good
behaviour.
~
12c.
(related to 12b/12c).
I became suspicious that the DB name part of the --table option is
completely bogus. And it is. The test still passes OK even after I
write junk in the --table database part, like below.
command_ok(
[
'pg_createsubscriber',
'--verbose',
'--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
'--pgdata' => $node_s2->data_dir,
'--publisher-server' => $node_p->connstr($db3),
'--socketdir' => $node_s2->host,
'--subscriber-port' => $node_s2->port,
'--database' => $db3,
'--table' => "$db3.public.t1",
'--table' => "REALLY???.public.t2",
'--database' => $db4,
'--table' => "$db4.public.t3",
],
'pg_createsubscriber runs with table-level publication (existing nodes)');
~~~
13.
+# Get the publication name created by pg_createsubscriber for db3
+my $pubname1 = $node_p->safe_psql(
+ $db3, qq(
+ SELECT pubname FROM pg_publication
+ WHERE pubname LIKE 'pg_createsubscriber_%'
+ ORDER BY pubname LIMIT 1
+));
+
Why don't you just name the publication explicitly in the command?
Then you don't need any of this code to discover the publication name
here.
~~~
14.
+# Check publication tables for db3
+my $actual1 = $node_p->safe_psql(
+ $db3, qq(
+ SELECT pubname || '|public|' || tablename
+ FROM pg_publication_tables
+ WHERE pubname = '$pubname1'
+ ORDER BY tablename
+));
+is($actual1, "$pubname1|public|t1\n$pubname1|public|t2",
+ 'single publication for both tables created successfully on database db3'
+);
What is the point of hardwiring the 'public' in the concatenated
string, and then verifying that it is still there in the result? Why
not hardwire 'banana' instead of 'public' -- it passes the test just
the same.
~~~
15.
+# Get the publication name created by pg_createsubscriber for db4
+my $pubname2 = $node_p->safe_psql(
+ $db4, qq(
+ SELECT pubname FROM pg_publication
+ WHERE pubname LIKE 'pg_createsubscriber_%'
+ ORDER BY pubname LIMIT 1
+));
+
(same as #13 before)
Why don't you simply name the publication explicitly in the command?
Then you don't need any of this code to discover the publication name
here.
~~~
16.
+# Check publication tables for db4
+my $actual2 = $node_p->safe_psql(
+ $db4, qq(
+ SELECT pubname || '|public|' || tablename
+ FROM pg_publication_tables
+ WHERE pubname = '$pubname2'
+ ORDER BY tablename
+));
+is($actual2, "$pubname2|public|t3",
+ 'single publication for t3 created successfully on database db4');
+
(same as #14 before)
What is the point of the hardwired 'public'?
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi Shubham,
The patch claims (e.g. in the PG docs and in the commit message) that
"Column lists" and "WHERE clause" are possible, but I don't see how
they can work. AFAICT the patch assumes everything to the right of the
rightmost dot (.) must be the relation name.
~~~
WHERE Clause
------------
e.g.
If I say something like this:
'--table' => "$db3.public.t1 WHERE (id != 1234)",
Gives error like:
2025-08-18 09:41:50.295 AEST client backend[17727]
040_pg_createsubscriber.pl LOG: execute <unnamed>: SELECT n.nspname,
c.relname FROM pg_class c JOIN pg_namespace n ON n.oid =
c.relnamespace WHERE n.nspname ~ $1 AND c.relname ~ $2 AND c.relkind
IN ('r','p') ORDER BY 1, 2
2025-08-18 09:41:50.295 AEST client backend[17727]
040_pg_createsubscriber.pl DETAIL: Parameters: $1 = '^(public)$', $2
= '^(t1 where (id != 1234))$'
~~~
Column Lists
------------
Same. These don't work either...
e.g.
--table' => "$db3.public.t1(id,val)",
Gives error like:
2025-08-18 09:53:20.338 AEST client backend[19785]
040_pg_createsubscriber.pl LOG: execute <unnamed>: SELECT n.nspname,
c.relname FROM pg_class c JOIN pg_namespace n ON n.oid =
c.relnamespace WHERE n.nspname ~ $1 AND c.relname ~ $2 AND c.relkind
IN ('r','p') ORDER BY 1, 2
2025-08-18 09:53:20.338 AEST client backend[19785]
040_pg_createsubscriber.pl DETAIL: Parameters: $1 = '^(public)$', $2
= '^(t1(id,val))$'
======
Kind Regards
Peter Smith.
Fujitsu Australia
On Fri, Aug 15, 2025 at 12:46 PM Peter Smith <smithpb2250@gmail.com> wrote:
Hi Shubham,
Some review comments for patch v2-0001
======
1. General - compile errors!Patch applies OK, but I cannot build pg_createsubscriber. e.g.
pg_createsubscriber.c: In function ‘main’:
pg_createsubscriber.c:2237:5: error: a label can only be part of a
statement and a declaration is not a statement
TableListPerDB * newdb;
^
pg_createsubscriber.c:2324:5: error: a label can only be part of a
statement and a declaration is not a statement
TableSpec * ts = pg_malloc0(sizeof(TableSpec));
^
pg_createsubscriber.c:2325:5: error: expected expression before ‘PQExpBuffer’
PQExpBuffer dbbuf;
^
pg_createsubscriber.c:2326:5: warning: ISO C90 forbids mixed
declarations and code [-Wdeclaration-after-statement]
PQExpBuffer schemabuf;
^
pg_createsubscriber.c:2335:5: error: ‘dbbuf’ undeclared (first use in
this function)
dbbuf = createPQExpBuffer();I'm not sure how this can be working for you (???). You cannot declare
variables without introducing a code block in the scope of the switch
case.
Fixed.
======
doc/src/sgml/ref/pg_createsubscriber.sgml2.
--table is missing from synopsis~~~
Fixed.
3. + <term><option>--table=<replaceable class="parameter">table</replaceable></option></term> + <listitem> + <para> + Adds a table to be included in the publication for the most recently + specified database. Can be repeated multiple times. The syntax + supports optional column lists and WHERE clauses. + </para> + </listitem> + </varlistentry>Lacks detail, so I can't tell how to use this. e.g:
- What does the syntax actually look like?
- code suggests you can specify DB, but this doc says it only applies
to the most recent DB
- how supports column lists and WHERE clause (needs examples)
- needs rules FOR ALL TABLES, etc.
- allowed in combination with --all?
- etc.
Fixed.
======
src/bin/pg_basebackup/pg_createsubscriber.c4. +typedef struct TableListPerDB +{ + char *dbname; + TableSpec *tables; + struct TableListPerDB *next; +} TableListPerDB;I didn't understand the need for this "per-DB" structure.
Later, you declare "TableSpec *tables;" within the "LogicalRepInfo"
structure (which is per-DB) so you already have the "per-DB" table
list right there. Even if you need to maintain some global static
list, then I imagine you could just put a 'dbname' member in
TableSpec. You don't need a whole new structure to do it.~~~
Removed the structure TableListPerDB as suggested.
create_publication:
5. + if (dbinfo->tables == NULL) + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR ALL TABLES", ipubname_esc); + else + { + bool first = true; + + appendPQExpBuffer(str, "CREATE PUBLICATION %s FOR TABLE ", ipubname_esc); + for (TableSpec * tbl = dbinfo->tables; tbl != NULL; tbl = tbl->next)What if '*' is specified for the table name? Should that cause a
"CREATE PUBLICATION ... FOR TABLES IN SCHEMA ..." instead of making a
publication with 100s or more tables in a FOR TABLES?~~~
When the user specifies '*' for the table name in pg_createsubscriber,
it makes sense to generate a SQL publication command using FOR TABLES
IN SCHEMA instead of enumerating all tables explicitly. This approach
is more efficient and scalable, especially when dealing with schemas
containing hundreds or more tables, and it ensures future tables added
to the schema are automatically included without needing to recreate
the publication.
Currently, the provided patches do not implement this behavior and
always enumerate tables in the FOR TABLE clause regardless of
wildcards. I acknowledge this and plan to address the handling of the
'*' wildcard and generate FOR TABLES IN SCHEMA publications in a
future update.
6. + for (int i = 0; i < PQntuples(tres); i++) + { + char *escaped_schema = PQescapeIdentifier(conn, PQgetvalue(tres, i, 0), + strlen(PQgetvalue(tres, i, 0))); + char *escaped_table = PQescapeIdentifier(conn, PQgetvalue(tres, i, 1), + strlen(PQgetvalue(tres, i, 1))); + + appendPQExpBuffer(str, "%s%s.%s", first ? "" : ", ", + escaped_schema, escaped_table); + + PQfreemem(escaped_schema); + PQfreemem(escaped_table); + first = false; + }6a.
How about some other simple variables to avoid all the repeated PQgetvalue?e.g.
char *sch = PQgetvalue(tres, i, 0);
char *tbl = PQgetvalue(tres, i, 1);
char *escaped_schema = PQescapeIdentifier(conn, sch, strlen(sch));
char *escaped_table = PQescapeIdentifier(conn, tbl, strlen(tbl));~
Fixed.
6b.
Variable 'first' is redundant. Same as checking 'i == 0'.~~~
Fixed.
7. + if (dry_run) + { + res = PQexec(conn, "BEGIN"); + if (PQresultStatus(res) != PGRES_COMMAND_OK) + {Would it be better to use if/else instead of:
if (dry_run)
if (!dry_run)~~~
Fixed.
main:
8.
+ TableListPerDB * newdb;
if (!simple_string_list_member(&opt.database_names, optarg))
{
simple_string_list_append(&opt.database_names, optarg);8a.
Compile error. Need {} scope to declare that variable!~
Fixed.
8b.
This seems a strange muddle of 'd' which represents DATABASE and
TableListPerDB, which is a list of tables (not a database). I have
doubts that most of this linked list code is even necessary for the
'd' case.~~~
Fixed.
9. + case 6: + TableSpec * ts = pg_malloc0(sizeof(TableSpec)); + PQExpBuffer dbbuf;Compile error. Need {} scope to declare that variable!
~~~
Fixed.
10. + if (dotcnt == 2) + { + ts->pattern_db_regex = NULL; + ts->pattern_schema_regex = pg_strdup(schemabuf->data); + ts->pattern_table_regex = pg_strdup(namebuf->data); + } + else if (dotcnt == 1) + { + ts->pattern_db_regex = NULL; + ts->pattern_schema_regex = pg_strdup(dbbuf->data); + ts->pattern_table_regex = pg_strdup(schemabuf->data); + } + else + pg_fatal("invalid --table specification: %s", optarg);10a.
This code seems quite odd to me:
- The DB becomes the schema?
- The schema becomes the table?Rather than fudging all the names of the --table parts, if you don't
know what they represent up-fron,t probably it is better to call them
just part1,part2,part3.~~~
Fixed.
10b.
Why is pattern_db_regex always NULL? If it is always NULL why have it at all?
Removed.
======
.../t/040_pg_createsubscriber.pl11. +# Test: Table-level publication creation +$node_p->safe_psql($db3, "CREATE TABLE public.t1 (id int, val text)"); +$node_p->safe_psql($db3, "CREATE TABLE public.t2 (id int, val text)"); +$node_p->safe_psql($db4, + "CREATE TABLE public.t3 (id int, val text, extra int)"); +IIUC, the schema name is part of the table syntax. So, you should
include test cases for different schemas.~~~
Fixed.
12. +# Run pg_createsubscriber with table-level options +command_ok( + [ + 'pg_createsubscriber', + '--verbose', + '--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default, + '--pgdata' => $node_s2->data_dir, + '--publisher-server' => $node_p->connstr($db3), + '--socketdir' => $node_s2->host, + '--subscriber-port' => $node_s2->port, + '--database' => $db3, + '--table' => "$db3.public.t1", + '--table' => "$db3.public.t2", + '--database' => $db4, + '--table' => "$db4.public.t3", + ], + 'pg_createsubscriber runs with table-level publication (existing nodes)');12a.
This is not really testing the same as what the commit message
describes. e.g. what about a test case where --table does not mention
the db explicitly, so relies on the most recent.~
12b.
What should happen if the explicitly named DB in --table is not the
same as the most recent --database, even though it is otherwise
correct?e.g.
'--database' => $db3,
'--table' => "$db3.public.t1",
'--database' => $db4,
'--table' => "$db3.public.t2",
'--table' => "$db4.public.t3",
I quickly tried it and AFAICT this was silently accepted and then the
test failed because it gave unexpected results. It doesn't seem good
behaviour.~
12c.
(related to 12b/12c).
I became suspicious that the DB name part of the --table option is
completely bogus. And it is. The test still passes OK even after I
write junk in the --table database part, like below.command_ok(
[
'pg_createsubscriber',
'--verbose',
'--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
'--pgdata' => $node_s2->data_dir,
'--publisher-server' => $node_p->connstr($db3),
'--socketdir' => $node_s2->host,
'--subscriber-port' => $node_s2->port,
'--database' => $db3,
'--table' => "$db3.public.t1",
'--table' => "REALLY???.public.t2",
'--database' => $db4,
'--table' => "$db4.public.t3",
],
'pg_createsubscriber runs with table-level publication (existing nodes)');
~~~
v3-0002 patch handles the comment 12.
13. +# Get the publication name created by pg_createsubscriber for db3 +my $pubname1 = $node_p->safe_psql( + $db3, qq( + SELECT pubname FROM pg_publication + WHERE pubname LIKE 'pg_createsubscriber_%' + ORDER BY pubname LIMIT 1 +)); +Why don't you just name the publication explicitly in the command?
Then you don't need any of this code to discover the publication name
here.~~~
Fixed.
14. +# Check publication tables for db3 +my $actual1 = $node_p->safe_psql( + $db3, qq( + SELECT pubname || '|public|' || tablename + FROM pg_publication_tables + WHERE pubname = '$pubname1' + ORDER BY tablename +)); +is($actual1, "$pubname1|public|t1\n$pubname1|public|t2", + 'single publication for both tables created successfully on database db3' +);What is the point of hardwiring the 'public' in the concatenated
string, and then verifying that it is still there in the result? Why
not hardwire 'banana' instead of 'public' -- it passes the test just
the same.~~~
Fixed.
15. +# Get the publication name created by pg_createsubscriber for db4 +my $pubname2 = $node_p->safe_psql( + $db4, qq( + SELECT pubname FROM pg_publication + WHERE pubname LIKE 'pg_createsubscriber_%' + ORDER BY pubname LIMIT 1 +)); +(same as #13 before)
Why don't you simply name the publication explicitly in the command?
Then you don't need any of this code to discover the publication name
here.~~~
Fixed.
16. +# Check publication tables for db4 +my $actual2 = $node_p->safe_psql( + $db4, qq( + SELECT pubname || '|public|' || tablename + FROM pg_publication_tables + WHERE pubname = '$pubname2' + ORDER BY tablename +)); +is($actual2, "$pubname2|public|t3", + 'single publication for t3 created successfully on database db4'); +(same as #14 before)
What is the point of the hardwired 'public'?
Fixed.
The attached patches contain the suggested changes. They also address
the comments given by Peter at [1]/messages/by-id/CAHut+PuG8Vd=MNbQyN-3D1nsfEatmcd5bG6+L-GnOyoR9tC_6w@mail.gmail.com.
[1]: /messages/by-id/CAHut+PuG8Vd=MNbQyN-3D1nsfEatmcd5bG6+L-GnOyoR9tC_6w@mail.gmail.com
Thanks and regards,
Shubham Khanna.
Attachments:
v3-0001-Support-tables-via-pg_createsubscriber.patchapplication/octet-stream; name=v3-0001-Support-tables-via-pg_createsubscriber.patchDownload+366-11
v3-0002-Support-WHERE-clause-and-COLUMN-list-in-table-arg.patchapplication/octet-stream; name=v3-0002-Support-WHERE-clause-and-COLUMN-list-in-table-arg.patchDownload+49-1
On Thu, Aug 21, 2025, at 6:08 AM, Shubham Khanna wrote:
Attachments:
* v3-0001-Support-tables-via-pg_createsubscriber.patch
* v3-0002-Support-WHERE-clause-and-COLUMN-list-in-table-arg.patch
+ <term><option>--table=<replaceable class="parameter">table</replaceable></option></term>
+ <listitem>
+ <para>
+ Adds one or more specific tables to the publication for the most recently
+ specified <option>--database</option>. This option can be given multiple
+ times to include additional tables.
+ </para>
I think it is a really bad idea to rely on the order of options to infer which
tables are from which database. The current design about publication and
subscription names imply order but it doesn't mean subscription name must be
the next option after the publication name.
Your explanation from the initial email:
The attached patch introduces a new '--table' option that can be
specified after each '--database' argument. It allows users to
selectively replicate specific tables within a database instead of
defaulting to all tables. The syntax is like that used in 'vacuumdb'
and supports multiple '--table' arguments per database, including
optional column lists and row filters.
vacuumdb doesn't accept multiple --table options if you specify multiple
databases.
$ vacuumdb -a -t public.pgbench_branches -t public.pgbench_tellers -d db1 -d db2
vacuumdb: error: cannot vacuum all databases and a specific one at the same time
However, it accepts multiple --table options if you specify --all options.
(Although, I think it needs better error handling.)
$ vacuumdb -a -t public.pgbench_branches
vacuumdb: vacuuming database "contrib_regression"
vacuumdb: error: query failed: ERROR: relation "public.pgbench_branches" does not exist
LINE 2: VALUES ('public.pgbench_branches'::pg_catalog.regclass, NU...
^
vacuumdb: detail: Query was: WITH listed_objects (object_oid, column_list) AS (
VALUES ('public.pgbench_branches'::pg_catalog.regclass, NULL::pg_catalog.text)
)
Let's recap the initial goal: pg_createsubscriber creates a new logical replica
from a physical standby server. Your proposal is extending the tool to create a
partial logical replica but doesn't mention what you would do with the other
part; that is garbage after the conversion. I'm not convinced that the current
proposal is solid as-is.
+ <para>
+ The argument must be a fully qualified table name in one of the
+ following forms:
+ <itemizedlist><listitem><para><literal>schema.table</literal></para></listitem>
+ <listitem><para><literal>db.schema.table</literal></para></listitem></itemizedlist>
+ If the database name is provided, it must match the most recent
+ <option>--database</option> argument.
+ </para>
Why do you want to include the database if you already specified it?
+ <para>
+ A table specification may also include an optional column list and/or
+ row filter:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>schema.table(col1, col2, ...)</literal> — publishes
+ only the specified columns.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>schema.table WHERE (predicate)</literal> — publishes
+ only rows that satisfy the given condition.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Both forms can be combined, e.g.
+ <literal>schema.table(col1, col2) WHERE (id > 100)</literal>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
This is a bad idea for some cases. Let's say your setup involves a filter that
uses only 10% of the rows from a certain table. It is better to do a manual
setup. Besides that, expressions in options can open a can of worms. In the
column list case, there might be corner cases like if you have a constraint in
a certain column and that column was not included in the column list, the setup
will fail; there isn't a cheap way to detect such cases.
It seems this proposal doesn't serve a general purpose. It is copying a *whole*
cluster to use only a subset of tables. Your task with pg_createsubscriber is
more expensive than doing a manual logical replication setup. If you have 500
tables and want to replicate only 400 tables, it doesn't seem productive to
specify 400 -t options. There are some cases like a small set of big tables
that this feature makes sense. However, I'm wondering if a post script should
be used to adjust your setup. There might be cases that involves only dozens of
tables but my experience says it is rare. My expectation is that this feature
is useful for avoiding some specific tables. Hence, the copy of the whole
cluster is worthwhile.
--
Euler Taveira
EDB https://www.enterprisedb.com/
Hi Shubham.
Some review comments for patch v3-0001.
======
Commit message
1.
Allows optional column lists and row filters.
~
Is this right? Aren't the column lists and row filters now separated
in patch 0002?
======
doc/src/sgml/ref/pg_createsubscriber.sgml
2. synopsis
pg_createsubscriber [option...] { -d | --database }dbname { -D |
--pgdata }datadir { -P | --publisher-server }connstr { --table
}table-name
2a.
The --table should follow the --database instead of being last.
~
2b.
The brackets {--table} don't seem useful
~
2c.
Since there can be multiple tables per database, I feel an ellipsis
(...) is needed too
~
3.
+ <varlistentry>
+ <term><option>--table=<replaceable
class="parameter">table</replaceable></option></term>
+ <listitem>
This belongs more naturally immediately following --database.
~~~
4.
+ <para>
+ The argument must be a fully qualified table name in one of the
+ following forms:
+ <itemizedlist><listitem><para><literal>schema.table</literal></para></listitem>
+ <listitem><para><literal>db.schema.table</literal></para></listitem></itemizedlist>
+ If the database name is provided, it must match the most recent
+ <option>--database</option> argument.
+ </para>
4a.
It would be nicer if the SGML were less compacted.
e.g.
<itemizedlist>
<listitem><para><literal>schema.table</literal></para></listitem>
<listitem><para><literal>db.schema.table</literal></para></listitem>
</itemizedlist>
~
4b.
Why do we insist that the user must explicitly give a schema name?
Can't a missing schema just imply "public" so user doesn't have to
type so much?
~
4c.
FUNDAMENTAL SPEC QUESTION....
I am confused why this patch allows specifying the 'db' name part of
--table when you also insist it must be identical to the most recent
--database name. With this rule, it seems unnecessary and merely means
more validation code is required.
I can understand having the 'db' name part might be useful if you
would also permit the --table to be specified anywhere on the command,
but insisting it must match the most recent --database name seems to
cancel out any reason for allowing it in the first place.
~~~
5.
+ <para>
+ A table specification may also include an optional column list and/or
+ row filter:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>schema.table(col1, col2, ...)</literal> — publishes
+ only the specified columns.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>schema.table WHERE (predicate)</literal> — publishes
+ only rows that satisfy the given condition.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Both forms can be combined, e.g.
+ <literal>schema.table(col1, col2) WHERE (id > 100)</literal>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
AFAIK, support for this was separated into patch 0002. So these docs
should be in patch 0002, not here.
======
src/bin/pg_basebackup/pg_createsubscriber.c
6.
+typedef struct TableSpec
+{
+ char *spec;
+ char *dbname;
+ char *pattern_regex;
+ char *pattern_part1_regex;
+ char *pattern_part2_regex;
+ char *pattern_part3_regex;
+ struct TableSpec *next;
+} TableSpec;
+
6a.
Add a typedefs.list entry for this new typedef, and run pg_indent.
~
6b.
pattern_regex is unused?
~
6c.
pattern_part1_regex seems assigned, but it is otherwise unused (??). Needed?
~
6d.
I had previously given a review comment suggesting names like
part1/2/3 because previously, you were using members with different
meanings depending on the tablespec. But, now it seems all the dot
parse logic is re-written, so AFAICT you are always using part1 means
dbregex; part2 means schemaregex; part3 means tableregex ... So, the
"part" names are strange in the current impl - if you really do know
what they represent, then call them by their proper names.
~~~
7.
+static TableSpec * table_list_head = NULL;
+static TableSpec * table_list_tail = NULL;
+static char *current_dbname = NULL;
7a.
Why isn't 'current_dbname' just a local variable of main()?
~
7b.
I doubt the 'tail' var is needed. See a later review comment for details.
~~~
8.
-
/*
* Cleanup objects that were created by pg_createsubscriber if there is an
* error.
Whitespace change unrelated to this patch?
~~~
usage:
9.
printf(_(" --subscription=NAME subscription name\n"));
+ printf(_(" --table table to subscribe to;
can be specified multiple times\n"));
printf(_(" -V, --version output version
information, then exit\n"));
Or, should this say "table to publish". Maybe it amounts to the same
thing, but this patch modifies CREATE PUBLICATION, not CREATE
SUBSCRIPTION.
~~~
store_pub_sub_info:
10.
+ for (i = 0; i < num_dbs; i++)
+ {
+ TableSpec *prev = NULL;
+ TableSpec *cur = table_list_head;
+ TableSpec *filtered_head = NULL;
+ TableSpec *filtered_tail = NULL;
+
+ while (cur != NULL)
+ {
+ TableSpec *next = cur->next;
+
+ if (strcmp(cur->dbname, dbinfo[i].dbname) == 0)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ table_list_head = next;
+
+ cur->next = NULL;
+ if (!filtered_head)
+ filtered_head = filtered_tail = cur;
+ else
+ {
+ filtered_tail->next = cur;
+ filtered_tail = cur;
+ }
+ }
+ else
+ prev = cur;
+ cur = next;
+ }
+ dbinfo[i].tables = filtered_head;
+ }
+
10a.
Is there a bug? I have not debugged this, but when you are assigning
filtered_tail = cur, who is ensuring that "filtered" list is
NULL-terminated? e.g. I suspect the cur->next might still point to
something inappropriate.
~
10b.
I doubt all that 'filered_head/tail' logic is needed at all. Can't you
just assign directly to the head of dbinfo[i].tables list as you
encounter appropriate tables for that db? The order may become
reversed, but does that matter?
e.g. something like this
cur->next = dbinfo[i].tables ? dbinfo[i].tables.next : NULL;
dbinfo[i].tables = cur;
~~~
create_publication:
11.
PGresult *res;
char *ipubname_esc;
char *spubname_esc;
+ bool first_table = true;
This 'first_table' variable is redundant. Just check i == 0 in the loop.
~~~
12.
+ appendPQExpBuffer(str, "%s%s.%s",
+ first_table ? "" : ", ",
+ escaped_schema, escaped_table);
SUGGESTION
if (i > 0)
appendPQExpBuffer(str, ", ");
appendPQExpBuffer(str, "%s.%s", escaped_schema, escaped_table);
~~~
13.
- if (!simple_string_list_member(&opt.database_names, optarg))
{
- simple_string_list_append(&opt.database_names, optarg);
- num_dbs++;
+ if (current_dbname)
+ pg_free(current_dbname);
+ current_dbname = pg_strdup(optarg);
+
+ if (!simple_string_list_member(&opt.database_names, optarg))
+ {
+ simple_string_list_append(&opt.database_names, optarg);
+ num_dbs++;
+ }
+ else
+ pg_fatal("database \"%s\" specified more than once for
-d/--database", optarg);
+ break;
}
- else
- pg_fatal("database \"%s\" specified more than once for
-d/--database", optarg);
- break;
13a.
The scope {} is not needed. You removed the previously declared variable.
~
13b.
The pfree might be overkill. I think a few little database name
strdups leaks are hardly going to be a problem. It's up to you.
~~~
14.
+ char *copy_arg;
+ char *first_dot;
+ char *second_dot;
+ char *dbname_arg = NULL;
+ char *schema_table_part;
+ TableSpec *ts;
Does /dbname_arg/dbname_part/make more sense here?
~~~
15.
+ if (first_dot != NULL)
+ second_dot = strchr(first_dot + 1, '.');
+ else
+ second_dot = NULL;
+
+
The 'else' is not needed if you had assigned NULL at the declaration.
~~~
16.
The logic seems quite brittle. e.g. Maybe you should only parse
looking for dots to the end of the copy_arg or the first space.
Otherwise, other "dots" in the table spec will mess up your logic.
e.g.
"db.sch.tbl" -> 2 dots
"sch.tbl WHERE (x > 1.0)" -> also 2 dots
Similarly, what you called 'schema_table_part' cannot be allowed to
extend beyond any space; otherwise, it can easily return an unexpected
dotcnt.
"db.sch.tbl WHERE (x > 1.0)" -> 3 dots
~~~
17.
+ patternToSQLRegex(encoding, dbbuf, schemabuf, namebuf,
+ schema_table_part, false, false, &dotcnt);
+
+ if (dbname_arg != NULL)
+ dotcnt++;
+
+ if (dotcnt == 2)
+ {
+ ts->pattern_part1_regex = pg_strdup(dbbuf->data);
+ ts->pattern_part2_regex = pg_strdup(schemabuf->data);
+ ts->pattern_part3_regex = namebuf->len > 0 ? pg_strdup(namebuf->data) : NULL;
+ }
+ else if (dotcnt == 1)
+ {
+ ts->pattern_part1_regex = NULL;
+ ts->pattern_part2_regex = pg_strdup(dbbuf->data);
+ ts->pattern_part3_regex = NULL;
+ }
+ else
+ pg_fatal("invalid table specification \"%s\"", optarg);
Something seems very strange here...
I haven't debugged this, but I imagine if "--table sch.tab" then
dotcnt will be 1.
But then
+ ts->pattern_part1_regex = NULL;
+ ts->pattern_part2_regex = pg_strdup(dbbuf->data);
+ ts->pattern_part3_regex = NULL;
The schema regex is assigned the dbbuf->data (???) I don't trust this
logic. I saw that there are no test cases where dbpart is omitted, so
maybe this is a lurking bug?
~~~
18.
+
+ if (!table_list_head)
+ table_list_head = table_list_tail = ts;
+ else
+ table_list_tail->next = ts;
+
+ table_list_tail = ts;
+
All this 'tail' stuff looks potentially unnecessary. e.g. I think you
could simplify all this just by adding to the front of the list.
SUGGESTION
ts->next = table_list_head;
table_list_head = ts;
======
.../t/040_pg_createsubscriber.pl
19.
+# Declare database names
+my $db3 = 'db3';
+my $db4 = 'db4';
+
What do these variables achieve? e.g. Why not just use hardcoded
strings 'db1' and 'db2'.
~~~
20.
+# Test: Table-level publication creation
+$node_p->safe_psql($db3, "CREATE TABLE public.t1 (id int, val text)");
+$node_p->safe_psql($db3, "CREATE TABLE public.t2 (id int, val text)");
+$node_p->safe_psql($db3, "CREATE TABLE myschema.t4 (id int, val text)");
You could combine all these.
~~~
21.
+$node_p->safe_psql($db4,
+ "CREATE TABLE public.t3 (id int, val text, extra int)");
+$node_p->safe_psql($db4,
+ "CREATE TABLE otherschema.t5 (id serial primary key, info text)");
You could combine these.
~~~
22.
+# Create explicit publications
+my $pubname1 = 'pub1';
+my $pubname2 = 'pub2';
+
What does creating these variables achieve? e.g. Why not just use
hardcoded strings 'pub1' and 'pub2'.
~~~
23.
+# Run pg_createsubscriber with table-level options
+command_ok(
+ [
+ 'pg_createsubscriber',
+ '--verbose',
+ '--recovery-timeout' => $PostgreSQL::Test::Utils::timeout_default,
+ '--pgdata' => $node_s2->data_dir,
+ '--publisher-server' => $node_p->connstr($db3),
+ '--socketdir' => $node_s2->host,
+ '--subscriber-port' => $node_s2->port,
+ '--publication' => $pubname1,
+ '--publication' => $pubname2,
+ '--database' => $db3,
+ '--table' => "$db3.public.t1",
+ '--table' => "$db3.public.t2",
+ '--table' => "$db3.myschema.t4",
+ '--database' => $db4,
+ '--table' => "$db4.public.t3",
+ '--table' => "$db4.otherschema.t5",
+ ],
+ 'pg_createsubscriber runs with table-level publication (existing nodes)');
+
There is no test for a --table spec where the dbpart is omitted.
~~~
24.
+# Check publication tables for db3 with public schema first
What is the significance "with public schema first" in this comment?
~~
25.
+# Check publication tables for db4, with public schema first
+my $actual2 = $node_p->safe_psql(
+ $db4, qq(
+ SELECT pubname || '|' || schemaname || '|' || tablename
+ FROM pg_publication_tables
+ WHERE pubname = '$pubname2'
+ ORDER BY schemaname, tablename
+ )
+);
+is( $actual2,
+ "$pubname2|otherschema|t5\n$pubname2|public|t3",
+ 'publication includes tables in public and otherschema schemas on db4');
+
Ditto. What is the significance "with public schema first" in this comment?
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi Shubham,
Some brief review comments about patch v3-0002.
======
Commit message
1.
This patch support the specification of both a WHERE clause (row filter) and a
column list in the table specification via pg_createsubscriber, and modify the
utility's name (for example, to a more descriptive or aligned name).
For eg:-
CREATE PUBLICATION pub FOR TABLE schema.table (col1, col2) WHERE
(predicate);
1a.
/support/supports/
~
1b.
"and modify the utility's name (for example, to a more descriptive or
aligned name)."
/modify/modifies/ but more importantly, what does this sentence mean?
======
GENERAL
2.
Should be some docs moved to this patch
~~~
3.
Missing test cases
======
src/bin/pg_basebackup/pg_createsubscriber.c
typedef struct TableSpec:
4.
char *dbname;
+ char *att_names;
+ char *row_filter;
char *pattern_regex;
4a.
FUNDAMENTAL DESIGN QUESTION
Why do we even want to distinguish these? IIUC, all you need to know
is "char *extra;" which can represent *anything* else the user may
tack onto the end of the table name. You don't even need to parse
it... e.g. You could let the "CREATE PUBLICATION" check the syntax.
~
4b.
Current patch is not even assigning these members (??)
~~~
5.
+char *
+pg_strcasestr(const char *haystack, const char *needle)
+{
+ if (!*needle)
+ return (char *) haystack;
+ for (; *haystack; haystack++)
+ {
+ const char *h = haystack;
+ const char *n = needle;
+
+ while (*h && *n && pg_tolower((unsigned char) *h) ==
pg_tolower((unsigned char) *n))
+ ++h, ++n;
+ if (!*n)
+ return (char *) haystack;
+ }
+ return NULL;
+}
+
Not needed.
~~~
create_publication:
6.
+ if (tbl->att_names && strlen(tbl->att_names) > 0)
+ appendPQExpBuffer(str, " ( %s )", tbl->att_names);
+
+ if (tbl->row_filter && strlen(tbl->row_filter) > 0)
+ appendPQExpBuffer(str, " WHERE %s", tbl->row_filter);
Overkill? I imagine this could all be replaced with something simple
without trying to distinguish between them.
if (tbl->extra)
appendPQExpBuffer(str, " %s", tbl->extra);
~~~
main:
7.
+ where_start = pg_strcasestr(copy_arg, " where ");
+ if (where_start != NULL)
+ {
+ *where_start = '\0';
+ where_start += 7;
+ }
+
+ paren_start = strchr(copy_arg, '(');
+ if (paren_start != NULL)
+ {
+ char *paren_end = strrchr(paren_start, ')');
+
+ if (!paren_end)
+ pg_fatal("unmatched '(' in --table argument \"%s\"", optarg);
+
+ *paren_start = '\0';
+ *paren_end = '\0';
+
+ paren_start++;
+ }
+
Overkill? IIUC, all you need to know is whether there's something
beyond the table name. Just looking for a space ' ' or a parenthesis
'(' delimiter could be enough, right?
e.g. Something like:
char *extra = strpbk(copy_arg, ' (');
if (extra)
ts->extra = extra + 1;
======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Friday, August 22, 2025 11:19 AM Euler Taveira <euler@eulerto.com> wrote:
On Thu, Aug 21, 2025, at 6:08 AM, Shubham Khanna wrote:
Attachments:
* v3-0001-Support-tables-via-pg_createsubscriber.patch
* v3-0002-Support-WHERE-clause-and-COLUMN-list-in-table-arg.patch+ <term><option>--table=<replaceable class="parameter">table</replaceable></option></term> + <listitem> + <para> + Adds one or more specific tables to the publication for the most recently + specified <option>--database</option>. This option can be given multiple + times to include additional tables. + </para>I think it is a really bad idea to rely on the order of options to infer which tables
are from which database. The current design about publication and
subscription names imply order but it doesn't mean subscription name must
be the next option after the publication name.
The documentation appears incorrect and needs revision. The latest version no
longer depends on the option order; instead, it requires users to provide
database-qualified table names, such as -t "db1.sch1.tb1". This adjustment
allows the command to internally categorize tables by their target database.
Let's recap the initial goal: pg_createsubscriber creates a new logical replica
from a physical standby server. Your proposal is extending the tool to create a
partial logical replica but doesn't mention what you would do with the other
part; that is garbage after the conversion. I'm not convinced that the current
proposal is solid as-is.
I think we can explore extending the existing --clean option in a separate patch
to support table cleanup. This option is implemented in a way that allows adding
further cleanup objects later, so it should be easy to extend it for table.
Prior to this extension, it should be noted in the documentation that users are
required to clean up the tables themselves.
+ <para> + The argument must be a fully qualified table name in one of the + following forms: + <itemizedlist><listitem><para><literal>schema.table</literal></para></li stitem> + <listitem><para><literal>db.schema.table</literal></para></listitem></it emizedlist> + If the database name is provided, it must match the most recent + <option>--database</option> argument. + </para>Why do you want to include the database if you already specified it?
As mentioned earlier, this document needs to be corrected.
+ <para> + A table specification may also include an optional column list and/or + row filter: + <itemizedlist> + <listitem> + <para> + <literal>schema.table(col1, col2, ...)</literal> — publishes + only the specified columns. + </para> + </listitem> + <listitem> + <para> + <literal>schema.table WHERE (predicate)</literal> — publishes + only rows that satisfy the given condition. + </para> + </listitem> + <listitem> + <para> + Both forms can be combined, e.g. + <literal>schema.table(col1, col2) WHERE (id > 100)</literal>. + </para> + </listitem> + </itemizedlist> + </para>This is a bad idea for some cases. Let's say your setup involves a filter that uses
only 10% of the rows from a certain table. It is better to do a manual setup.
Besides that, expressions in options can open a can of worms. In the column
list case, there might be corner cases like if you have a constraint in a certain
column and that column was not included in the column list, the setup will fail;
there isn't a cheap way to detect such cases.
I agree that supporting row filter and column list is not straightforward, and
we can consider it separately and do not implement that in the first version.
It seems this proposal doesn't serve a general purpose. It is copying a *whole*
cluster to use only a subset of tables. Your task with pg_createsubscriber is
more expensive than doing a manual logical replication setup. If you have 500
tables and want to replicate only 400 tables, it doesn't seem productive to
specify 400 -t options.
Specifying multiple -t options should not be problematic, as users has already
done similar things for "FOR TABLE" publication DDLs. I think it's not hard
for user to convert FOR TABLE list to -t option list.
There are some cases like a small set of big tables that
this feature makes sense. However, I'm wondering if a post script should be
used to adjust your setup.
I think it's not very convenient for users to perform this conversion manually.
I've learned in PGConf.dev this year that some users avoid using
pg_createsubscriber because they are unsure of the standard steps required to
convert it into subset table replication. Automating this process would be
beneficial, enabling more users to use pg_createsubscriber and take advantage of
the rapid initial table synchronization.
There might be cases that involves only dozens of
tables but my experience says it is rare. My expectation is that this feature is
useful for avoiding some specific tables. Hence, the copy of the whole cluster is
worthwhile.
We could consider adding an --exclude-table option later if needed. However, as
mentioned earlier, I think specifying multiple -t options can address this use
case as well.
Best Regards,
Hou zj