[PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi hackers,
I created a patch to reuse tablesync workers and their replication slots
for tables that are not yet synced, so that the overhead of creating and
dropping workers/replication slots can be reduced.
The current version of logical replication has two steps: tablesync and apply.
In the tablesync step, the apply worker creates a tablesync worker for each
table, and those tablesync workers are killed when they're done with their
associated table. (The number of tablesync workers running at the same time
is limited by "max_sync_workers_per_subscription".)
Each tablesync worker also creates a replication slot on the publisher during
its lifetime and drops the slot before exiting.
The purpose of this patch is to get rid of the overhead of creating/killing
a new worker (and replication slot) for each table.
It aims to reuse tablesync workers and their replication slots so that
tablesync workers can copy multiple tables from publisher to subscriber
during their lifetime.
The benefits of reusing tablesync workers can be significant if tables are
empty or close to empty.
In the empty-table case, spawning tablesync workers and handling replication
slots is where most of the time is spent, since the actual copy phase takes
very little time.
The changes in the behaviour of tablesync workers with this patch are as
follows:
1- After the tablesync worker is done syncing the current table, it takes
a lock and fetches the tables in init state
2- It looks for a table that is not already being synced by another worker
among the tables in init state
3- If it finds one, it updates its state for the new table and loops back to
the beginning to start syncing
4- If no table is found, it drops the replication slot and exits
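For reference, the lookup in steps 1-2 is conceptually equivalent to the
following query against pg_subscription_rel (the patch does the equivalent
in C, under the lock mentioned above; 'i' is SUBREL_STATE_INIT, and the OID
below is illustrative):

    SELECT srrelid
    FROM pg_subscription_rel
    WHERE srsubid = 12345        -- this subscription's OID (illustrative)
      AND srsubstate = 'i'       -- SUBREL_STATE_INIT
    LIMIT 1;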
With those changes, I did some benchmarking to see if it improves anything.
These results compare this patch with the latest version of the master branch.
"max_sync_workers_per_subscription" is left at its default of 2.
I got the results by simply averaging timings from 5 consecutive runs for
each branch.
First, I tested logical replication with empty tables.
10 tables
----------------
- master: 286.964 ms
- the patch: 116.852 ms
100 tables
----------------
- master: 2785.328 ms
- the patch: 706.817 ms
10K tables
----------------
- master: 39612.349 ms
- the patch: 12526.981 ms
I also tried replicating tables with some data.
10 tables loaded with 10MB data
----------------
- master: 1517.714 ms
- the patch: 1399.965 ms
100 tables loaded with 10MB data
----------------
- master: 16327.229 ms
- the patch: 11963.696 ms
Then I loaded more data.
10 tables loaded with 100MB data
----------------
- master: 13910.189 ms
- the patch: 14770.982 ms
100 tables loaded with 100MB data
----------------
- master: 146281.457 ms
- the patch: 156957.512 ms
If tables are mostly empty, the improvement can be significant - up to 3x
faster logical replication.
With some data loaded, it can still be faster to some extent.
As table size increases further, the advantage of reusing workers
becomes insignificant.
I would appreciate your comments and suggestions. Thanks in advance for
reviewing.
Best,
Melih
Attachments:
0001-Reuse-Logical-Replication-Background-worker.patch (+488 -213)
On Tue, Jul 5, 2022 at 7:20 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I created a patch to reuse tablesync workers and their replication slots
> for tables that are not yet synced, so that the overhead of creating and
> dropping workers/replication slots can be reduced.
> [...]
> 4- If no table is found, it drops the replication slot and exits
How would you choose the slot name for the table sync? Right now it
contains the relid of the table for which it needs to perform the sync.
If we don't include an appropriate identifier in the slot name, we won't
be able to reuse/drop the slot after a restart of the tablesync worker
due to an error.
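(For context: on master the tablesync slot name is built as
pg_<suboid>_sync_<relid>_<system_identifier>, which is what lets a
restarted worker recompute the name for its relation and find or drop the
slot. An illustrative way to see such slots on the publisher:)

    -- Illustrative: list tablesync slots, whose names on master embed
    -- the subscription OID, the relation OID and the system identifier.
    SELECT slot_name, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name LIKE 'pg\_%\_sync\_%' ESCAPE '\';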
> If tables are mostly empty, the improvement can be significant - up to 3x
> faster logical replication.
> With some data loaded, it can still be faster to some extent.
These results indicate that it is a good idea, especially for very small tables.
> As table size increases further, the advantage of reusing workers
> becomes insignificant.
It seems from your results that performance degrades for large
relations. Did you try to investigate the reasons for the same?
--
With Regards,
Amit Kapila.
On Wed, Jul 6, 2022 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> How would you choose the slot name for the table sync? Right now it
> contains the relid of the table for which it needs to perform the sync.
> If we don't include an appropriate identifier in the slot name, we won't
> be able to reuse/drop the slot after a restart of the tablesync worker
> due to an error.
I had a quick look into the patch and it seems it is using the worker
array index instead of the relid while forming the slot name. I think
that makes sense, because now whichever worker is using that worker
index can reuse the slot created w.r.t. that index.
> It seems from your results that performance degrades for large
> relations. Did you try to investigate the reasons for the same?
Yeah, it would be interesting to know why there is a drop in some cases.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 6, 2022 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jul 6, 2022 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I had a quick look into the patch and it seems it is using the worker
> array index instead of the relid while forming the slot name. I think
> that makes sense, because now whichever worker is using that worker
> index can reuse the slot created w.r.t. that index.
I think that won't work because each time on restart the slot won't be
fixed. Now, it is possible that we may drop the wrong slot if the state
of the copying rel is SUBREL_STATE_DATASYNC. Also, it is possible that
while creating a slot, we fail because a slot with the same name already
exists, because some other worker which had created that slot has been
restarted. Also, what about the origin_name, won't that have similar
problems? Also, if the state is already SUBREL_STATE_FINISHEDCOPY and
the slot is not the same as the one we used in the previous run of a
particular worker, it may start WAL streaming from a different point
based on the slot's confirmed_flush_location.
--
With Regards,
Amit Kapila.
On Wed, Jul 6, 2022 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 6, 2022 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Jul 6, 2022 at 9:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think that won't work because each time on restart the slot won't be
> fixed. Now, it is possible that we may drop the wrong slot if the state
> of the copying rel is SUBREL_STATE_DATASYNC.
So it will drop the previous slot the worker at that index was using, and
it is possible that some relation on that slot was at
SUBREL_STATE_FINISHEDCOPY or so and we will drop that slot. Because the
relid-to-replication-slot association is no longer 1:1, it would be wrong
to drop a slot based on the relstate of the relation picked by this
worker. In short, what you have pointed out makes sense.
> Also, it is possible that while creating a slot, we fail because a slot
> with the same name already exists, because some other worker which had
> created that slot has been restarted. [...] Also, if the state is
> already SUBREL_STATE_FINISHEDCOPY and the slot is not the same as the
> one we used in the previous run of a particular worker, it may start
> WAL streaming from a different point based on the slot's
> confirmed_flush_location.
Yeah, this is also true: when a tablesync worker has to catch up after
completing the copy, it might stream from the wrong LSN.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi Amit and Dilip,
Thanks for the replies.
> I had a quick look into the patch and it seems it is using the worker
> array index instead of the relid while forming the slot name
Yes, I changed the slot names so they include the slot index instead of
the relation id.
This was needed because I aimed to separate replication slots from
relations.
> I think that won't work because each time on restart the slot won't be
> fixed. [...] Also, if the state is already SUBREL_STATE_FINISHEDCOPY and
> the slot is not the same as the one we used in the previous run of a
> particular worker, it may start WAL streaming from a different point
> based on the slot's confirmed_flush_location.
You're right, Amit. In case of a failure, the tablesync phase of a
relation may continue with a different worker and replication slot due to
this change in naming.
It seems the same replication slot should be used from start to end for a
relation during tablesync. However, creating/dropping replication slots
can be a major overhead in some cases.
It would be nice if these slots were somehow reused.
To overcome this issue, I've been thinking about making some changes in my
patch.
So far, my proposal would be as follows:
Slot naming can be like pg_<sub_id>_<worker_pid> instead of
pg_<sub_id>_<slot_index>. This way each worker can use the same
replication slot during its lifetime.
But if a worker is restarted, it will switch to a new replication slot
since its pid has changed.
The pg_subscription_rel catalog can store the replication slot name for
each non-ready relation. Then we can find the slot needed for that
particular relation to complete its tablesync.
If a worker syncs a relation without any error, everything works well and
this new replication slot column from the catalog will not be needed.
However, if a worker is restarted due to a failure, the previous run of
that worker left its slot behind since it did not exit properly.
The restarted worker (with a different pid) will see that the relation is
actually in SUBREL_STATE_FINISHEDCOPY and will want to proceed to the
catchup step.
Then the worker can look up that particular relation's replication slot in
the pg_subscription_rel catalog (the slot name should be there since the
relation state is not ready), and tablesync can proceed with that slot.
There might be some cases where replication slots are left behind. An
example of such a case would be when the slot is removed from the
pg_subscription_rel catalog and detached from any relation, but the slot
itself couldn't be dropped for some reason. For such cases, a slot
cleanup logic is needed. This cleanup can also be done by tablesync
workers. Whenever a tablesync worker is created, it can look for existing
replication slots that do not belong to any relation or any worker (the
slot name has the pid for that), and drop those slots if it finds any.
What do you think about this new way of handling slots? Do you see any
points of concern?
I'm currently working on adding this change into the patch, and would
appreciate any comments.
Thanks,
Melih
> It seems from your results that performance degrades for large
> relations. Did you try to investigate the reasons for the same?
I have not investigated the performance degradation for large relations
yet. Once I'm done with the changes for the slot usage, I'll look into
this and come back with more findings.
Thanks,
Melih
On Fri, Jul 8, 2022 at 10:26 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> Slot naming can be like pg_<sub_id>_<worker_pid> instead of
> pg_<sub_id>_<slot_index>. This way each worker can use the same
> replication slot during its lifetime.
> But if a worker is restarted, it will switch to a new replication slot
> since its pid has changed.
I think using the worker_pid also has similar risks of mixing slots from
different workers, because after a restart the same worker_pid could be
assigned to a totally different worker. Can we think of using a unique
64-bit number instead? This would be allocated when each worker is
started for the very first time, and after that we can refer to the
catalog to find it, as suggested in the idea below.
> The pg_subscription_rel catalog can store the replication slot name for
> each non-ready relation. Then we can find the slot needed for that
> particular relation to complete its tablesync.
Yeah, this is worth investigating. However, instead of storing the
slot_name, we can store just the unique number (as suggested above).
We should use the same for the origin name as well.
> There might be some cases where replication slots are left behind. [...]
> For such cases, a slot cleanup logic is needed. This cleanup can also be
> done by tablesync workers.
This sounds tricky. Why not first drop the slot/origin and then detach it
from pg_subscription_rel? On restarts, it is possible that we error out
after dropping the slot or origin but before updating the catalog entry,
but in such a case we can ignore the missing slot/origin and detach them
from pg_subscription_rel. Also, if we use the unique number as suggested
above, I think it should be okay even if we don't remove it after the
relation state is ready.
> What do you think about this new way of handling slots? Do you see any
> points of concern?
> I'm currently working on adding this change into the patch, and would
> appreciate any comments.
Thanks for making progress!
--
With Regards,
Amit Kapila.
Hi Amit,
I updated the patch to prevent the problems that might be caused by using
different replication slots for syncing a table.
As suggested in previous emails, replication slot names are stored in the
catalog. So slot names can be retrieved later, and it is ensured that the
same replication slot is used during the tablesync step of a table.
With the current version of the patch:
-. A "srrelslotname" column is introduced into the pg_subscription_rel
catalog. It stores the slot name for tablesync.
-. Tablesync worker logic is now as follows:
1. The tablesync worker is launched by the apply worker for a table.
2. The worker generates a default replication slot name for itself. The
slot name includes the subid and the worker pid for tracking purposes.
3. If the table has a slot name value in the catalog:
i. If the table state is DATASYNC, drop the replication slot from the
catalog and proceed with tablesync using a new slot.
ii. If the table state is FINISHEDCOPY, use the replication slot from the
catalog and do not create a new slot.
4. Before the worker moves to a new table, drop any replication slots that
were retrieved from the catalog and used.
5. If no table is left to sync, drop the replication slot of that sync
worker with the worker pid, if it exists. (It's possible that a sync
worker does not create a replication slot for itself but uses slots read
from the catalog in each iteration.)
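To illustrate step 3, the catalog lookup is conceptually as follows
(srrelslotname is the column this patch adds; state codes 'd' and 'f' are
DATASYNC and FINISHEDCOPY; OIDs are illustrative):

    -- Conceptual sketch of step 3: check whether the catalog already
    -- stores a slot name for the table this worker is about to sync.
    SELECT srsubstate, srrelslotname
    FROM pg_subscription_rel
    WHERE srsubid = 12345 AND srrelid = 67890;
    -- 'd' (DATASYNC):     drop the stored slot, use a new one
    -- 'f' (FINISHEDCOPY): reuse the stored slot, create no new slot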
> I think using the worker_pid also has similar risks of mixing slots from
> different workers, because after a restart the same worker_pid could be
> assigned to a totally different worker. Can we think of using a unique
> 64-bit number instead?
I'm not sure how likely it is to have colliding pids for different
tablesync workers in the same subscription.
Though, having the pid in the slot name makes it easier to track which
slot belongs to which worker. That's why I kept using pids in slot names.
But it should be simple to switch to using a unique 64-bit number, so I
can remove pids from slot names if you think that would be better.
> We should use the same for the origin name as well.
I did not really change anything related to origin names. Origin names
are still the same and include the relation id. What do you think would
be an issue with origin names in this patch?
> This sounds tricky. Why not first drop the slot/origin and then detach
> it from pg_subscription_rel? [...] Also, if we use the unique number as
> suggested above, I think it should be okay even if we don't remove it
> after the relation state is ready.
Right, I did not add an additional slot cleanup step. The patch now drops
the slot when we're done with it and then removes it from the catalog.
Thanks,
Melih
Attachments:
v2-0001-Reuse-Logical-Replication-Background-worker.patch (+766 -241)
On Wed, Jul 27, 2022 at 3:56 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> 4. Before the worker moves to a new table, drop any replication slots
> that were retrieved from the catalog and used.
Why do you need to drop the replication slot after step 4? Won't just
clearing the required info from the catalog be sufficient?
> 5. If no table is left to sync, drop the replication slot of that sync
> worker with the worker pid, if it exists.
> [...]
> I'm not sure how likely it is to have colliding pids for different
> tablesync workers in the same subscription.
Hmm, I think even if there is an iota of a chance, which I think there
is, we can't use the worker_pid. Assume that the same worker_pid is
assigned to another worker after the worker using it errored out; the new
worker will fail as soon as it tries to create a replication slot.
> Though, having the pid in the slot name makes it easier to track which
> slot belongs to which worker. That's why I kept using pids in slot
> names.
> But it should be simple to switch to using a unique 64-bit number, so I
> can remove pids from slot names if you think that would be better.
I feel that would be better, or maybe we need to think of some other
identifier. But one thing we need to think about before using a 64-bit
unique identifier here is how we will retrieve its last used value after
a restart of the server. We may need to store it somewhere in a
persistent way.
> I did not really change anything related to origin names. Origin names
> are still the same and include the relation id. What do you think would
> be an issue with origin names in this patch?
The problems will be similar to those with the slot name. The origin is
used to track the progress of replication, so if we use the wrong origin
name after a restart, it can send the wrong start_streaming position to
the publisher.
--
With Regards,
Amit Kapila.
> Why do you need to drop the replication slot after step 4? Won't just
> clearing the required info from the catalog be sufficient?
The replication slots that we read from the catalog will not be used for
anything else after we're done syncing the table the slot belongs to.
Such a slot is removed from the catalog when the sync is completed, and it
basically becomes a slot that is not linked to any table or worker.
That's why I think it should be dropped rather than left behind.
Note that if a worker dies and its replication slot continues to exist,
that slot will only be used to complete the sync process of the one table
that the dead worker was syncing but couldn't finish.
When that particular table is synced and becomes ready, the replication
slot has no use anymore.
> Hmm, I think even if there is an iota of a chance, which I think there
> is, we can't use the worker_pid. Assume that the same worker_pid is
> assigned to another worker after the worker using it errored out; the
> new worker will fail as soon as it tries to create a replication slot.
Right. If something like that happens, the worker will fail without doing
anything. Then a new one will be launched, and that one will continue the
work.
The worst case might be having conflicting pids over and over again while
replication slots whose names include one of those pids still exist.
It seems unlikely, but possible, yes.
> I feel that would be better, or maybe we need to think of some other
> identifier. But one thing we need to think about before using a 64-bit
> unique identifier here is how we will retrieve its last used value after
> a restart of the server. We may need to store it somewhere in a
> persistent way.
We might consider storing this info in a catalog again. Since this last
used value will be different for each subscription, pg_subscription can
be a good place to keep it.
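Conceptually it would be a counter column on pg_subscription, something
like the sketch below (the column name is purely illustrative, and the
server would of course maintain it via in-place catalog updates in C, not
plain SQL):

    -- Illustrative only: a hypothetical sublastusedid column holding the
    -- last unique id handed out for this subscription's slot names.
    SELECT oid, subname, sublastusedid
    FROM pg_subscription
    WHERE subname = 'mysub';
    -- a worker needing a fresh id would bump the counter and build a
    -- slot name like pg_<subid>_<sublastusedid + 1>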
> The problems will be similar to those with the slot name. The origin is
> used to track the progress of replication, so if we use the wrong origin
> name after a restart, it can send the wrong start_streaming position to
> the publisher.
I understand. But the origin naming logic is still the same; its format
is pg_<subid>_<relid>.
I did not need to change this since it seems to me origins should belong
to only one table. The patch does not reuse origins.
So I don't think this change introduces an issue with origins. What do
you think?
Thanks,
Melih
On Thu, Jul 28, 2022 at 9:32 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> Note that if a worker dies and its replication slot continues to exist,
> that slot will only be used to complete the sync process of the one
> table that the dead worker was syncing but couldn't finish.
> When that particular table is synced and becomes ready, the replication
> slot has no use anymore.
Why can't it be used to sync the other tables if any?
> We might consider storing this info in a catalog again. Since this last
> used value will be different for each subscription, pg_subscription can
> be a good place to keep it.
This sounds reasonable. Let's do this unless we get some better idea.
> I did not need to change this since it seems to me origins should belong
> to only one table. The patch does not reuse origins.
There is no such restriction that origins should belong to only one
table. What makes you think that?
--
With Regards,
Amit Kapila.
Hi Amit,
> Why can't it be used to sync the other tables if any?
It can be used. But I thought it would be better not to, for example in
the following case:
Let's say a sync worker starts with a table in INIT state. The worker
creates a new replication slot to sync that table.
When the sync of the table is completed, it will move to the next one.
This time the new table may be in FINISHEDCOPY state, so the worker may
need to use the new table's existing replication slot.
Before the worker moves to the next table again, there will be two
replication slots used by the worker. We might want to keep one and drop
the other.
At this point, I thought it would be better to keep the replication slot
created by this worker in the first place. I think it's easier to track
slots this way since we know how to generate the slot's name.
Otherwise we would need to store the replication slot name somewhere too.
> This sounds reasonable. Let's do this unless we get some better idea.
I updated the patch to use a unique id for replication slot names and to
store the last used id in the catalog.
Can you look into it again, please?
> There is no such restriction that origins should belong to only one
> table. What makes you think that?
I did not reuse origins since I didn't think it would improve the
performance as significantly as reusing replication slots does.
So I just kept the origins as they were, even though it was possible to
reuse them. Does that make sense?
Best,
Melih
Attachments:
v3-0001-Reuse-Logical-Replication-Background-worker.patch (+794 -247)
On Fri, Aug 5, 2022 at 7:25 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> At this point, I thought it would be better to keep the replication
> slot created by this worker in the first place. I think it's easier to
> track slots this way since we know how to generate the slot's name.
> Otherwise we would need to store the replication slot name somewhere
> too.
I think there is a basic flaw in the slot reuse design. Currently, we
copy the table by starting a repeatable read transaction (BEGIN READ
ONLY ISOLATION LEVEL REPEATABLE READ) and creating a slot that
establishes a snapshot which is first used for the copy; then the LSN
returned by it is used in the catchup phase after the copy is done.
The patch won't establish such a snapshot before a table copy as it
won't create a slot each time. If this understanding is correct, I
think we need to use the ExportSnapshot/ImportSnapshot functionality to
achieve that, or do something else to avoid the problem mentioned.
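(For reference, the sequence the tablesync worker currently runs over its
walsender connection is roughly the following; slot and table names are
illustrative:)

    BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
    CREATE_REPLICATION_SLOT "tablesync_slot" LOGICAL pgoutput USE_SNAPSHOT;
    -- the table copy runs under the snapshot the slot just established
    COPY public.tbl TO STDOUT;
    COMMIT;
    -- the consistent_point returned by CREATE_REPLICATION_SLOT is then
    -- used as the starting point for the catchup phase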
> I did not reuse origins since I didn't think it would improve the
> performance as significantly as reusing replication slots does.
> So I just kept the origins as they were, even though it was possible to
> reuse them. Does that make sense?
For small tables, it could make a visible performance difference, as
creating and dropping the origin each time involves database write
operations. But even if we don't want to reuse them, you still need to
set the origin_lsn appropriately. Currently (without this patch), after
creating the slot, we directly use the origin_lsn returned by the
create_slot API, whereas now that won't be the case since the patch
doesn't create a slot every time.
--
With Regards,
Amit Kapila.
Hi Amit,
On Sat, Aug 6, 2022 at 16:01, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think there is a basic flaw in the slot reuse design. [...] The patch
> won't establish such a snapshot before a table copy as it won't create a
> slot each time. If this understanding is correct, I think we need to use
> the ExportSnapshot/ImportSnapshot functionality to achieve that, or do
> something else to avoid the problem mentioned.
I did not really think about the snapshot created by the replication slot
while making this change. Thanks for pointing it out.
I've been thinking about how to fix this issue. There are some points I'm
still not sure about.
If the worker will not create a new replication slot, which snapshot
should we actually export and then import?
At the point where the worker was supposed to create a replication slot
but will now reuse an existing slot instead, calling pg_export_snapshot()
could export a snapshot instead of CREATE_REPLICATION_SLOT.
However, importing that snapshot into the current transaction may not
make any difference, since we exported it from the same transaction. I
don't think this would make sense.
How else could an export/import snapshot logic be placed in this change?
The LSN also needs to be set accurately. The current change does not
handle the LSN properly.
I see that CREATE_REPLICATION_SLOT returns a consistent_point, which
indicates the earliest location from which streaming can start. And this
consistent_point is used as origin_startpos.
If that's the case, would it make sense to use the "confirmed_flush_lsn"
of the replication slot in case the slot is being reused?
Since confirmed_flush_lsn can be considered the safest, earliest location
from which streaming can start, I think it would work.
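(That value is visible in pg_replication_slots on the publisher; an
illustrative check:)

    -- confirmed_flush_lsn: the earliest position from which streaming
    -- can safely resume on an existing slot being reused.
    SELECT slot_name, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'tablesync_slot';   -- illustrative name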
And at this point, with the correct LSN, I'm wondering whether this
export/import logic is really necessary if the worker does not create a
replication slot. What do you think?
> For small tables, it could make a visible performance difference, as
> creating and dropping the origin each time involves database write
> operations. But even if we don't want to reuse them, you still need to
> set the origin_lsn appropriately.
Correct. For this issue, please consider the LSN logic explained above.
Thanks,
Melih
On Mon, Aug 15, 2022 at 4:56 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
Hi Amit,
On Sat, Aug 6, 2022 at 16:01, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I did not really think about the snapshot created by the replication
> slot while making this change. Thanks for pointing it out.
> [...]
> If the worker will not create a new replication slot, which snapshot
> should we actually export and then import?
Can we (export/import) use the snapshot we used the first time, when the
slot was created, for future transactions that copy other tables? If we
can do that, then I think we can use the LSN returned when the slot was
created for all the other table syncs as well.
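(The SQL-level building blocks for this already exist; a sketch of the
idea, with the snapshot id illustrative:)

    -- Session keeping the initial snapshot alive:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();   -- returns e.g. '00000003-000001B-1'

    -- A later transaction copying another table under that same snapshot
    -- (valid only while the exporting transaction remains open):
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-000001B-1';
    -- ... COPY the next table here ...
    COMMIT;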
--
With Regards,
Amit Kapila.
On Fri, Aug 5, 2022 at 22:55, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> [...]
Hi
cfbot reports the patch no longer applies [1]. As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.
[1]: http://cfbot.cputube.org/patch_40_3784.log
Thanks
Ian Barwick
Hi hackers,
I've been working on/struggling with this patch for a while, but I
haven't updated this thread regularly, so I'm sharing what I have done
with this patch so far.
On Sat, Aug 6, 2022 at 16:01, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think there is a basic flaw in the slot reuse design. Currently, we
> copy the table by starting a repeatable read transaction (BEGIN READ
> ONLY ISOLATION LEVEL REPEATABLE READ) and creating a slot that
> establishes a snapshot which is first used for the copy; then the LSN
> returned by it is used in the catchup phase after the copy is done.
> The patch won't establish such a snapshot before a table copy as it
> won't create a slot each time. If this understanding is correct, I
> think we need to use the ExportSnapshot/ImportSnapshot functionality to
> achieve that, or do something else to avoid the problem mentioned.
The issue that Amit mentioned causes problems such as duplicated rows on
the subscriber.
Basically, with this patch, a tablesync worker creates a replication slot
only in its first run. To ensure that table copy and sync are consistent
with each other, the worker needs the correct snapshot and LSN, both of
which are returned by the slot create operation.
Since this patch does not create a replication slot for each table copy
and instead reuses the one created in the beginning, we do not get a new
snapshot and LSN for each table anymore. The snapshot is lost right after
the transaction is committed, but the patch continues to use the same LSN
for the next tables without the proper snapshot.
In the end, for example, the worker might first copy some rows, then
apply changes from the replication slot and insert those rows again for
tables in later iterations.
I discussed some possible ways to resolve this with Amit offline:
1- Copy all tables in one transaction so that we wouldn't need to deal
with snapshots.
It's not easy to keep track of the progress, and if the transaction
fails, we would need to start all over again.
2- Don't lose the first snapshot (by keeping a transaction open with the
snapshot imported, or some other way) and use the same snapshot and LSN
for all tables.
I'm not sure about the side effects of keeping a transaction open that
long, or of using a snapshot that might be too old after some time.
Still, it seems like it might work.
3- For each table, get a new snapshot and LSN by using an existing
replication slot.
Even though this approach wouldn't create a new replication slot,
preparing the slot for a snapshot and then taking the snapshot may be
costly. We need some numbers here to see how much this approach would
improve the performance.
I decided to go with approach 3 (creating a new snapshot with an existing
replication slot) for now, since it requires fewer changes in the
tablesync worker logic than the other options would.
To achieve this, the patch introduces a new command for the Streaming
Replication Protocol.
The new REPLICATION_SLOT_SNAPSHOT command basically mimics how
CREATE_REPLICATION_SLOT creates a snapshot, but without actually creating
a new replication slot.
The tablesync worker calls this command when it decides not to create a
new replication slot, and uses the snapshot created and the LSN returned
by the command.
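To make that concrete, usage of the new command could look like this (the
exact grammar is whatever the 0001 patch implements; this is only the
shape of it):

    -- Hypothetical: ask the walsender to build and export a snapshot
    -- from an existing slot, the way CREATE_REPLICATION_SLOT would, and
    -- return the LSN to start streaming from, without creating a slot.
    REPLICATION_SLOT_SNAPSHOT "tablesync_slot";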
Also:
After the changes discussed here [1], concurrent replication origin drops
by the apply worker and tablesync workers may make each other wait due to
the locks taken by replorigin_drop_by_name.
I see that this harms the performance of logical replication quite a bit
in terms of speed.
Even though reusing replication origins was discussed in this thread
before, the patch didn't include any change to do so.
The updated version of the patch now reuses replication origins too.
It seems that even the changes to reuse origins, by themselves, improve
the performance significantly.
Attached are two patches:
0001: adds the REPLICATION_SLOT_SNAPSHOT command to the replication
protocol.
0002: reuses workers, replication slots and origins for tablesync.
I would appreciate any feedback/review/thoughts on the approach and both
patches.
I will also share some numbers comparing the performance of the patch and
the master branch soon.
[1]: /messages/by-id/20220714115155.GA5439@depesz.com
Best,
--
Melih Mutlu
Microsoft
Attachments:
0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch (+275 -5)
v4-0002-Reuse-Logical-Replication-Background-worker.patch (+1095 -356)
On Mon, Dec 5, 2022 at 6:30 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> Attached are two patches:
> 0001: adds the REPLICATION_SLOT_SNAPSHOT command to the replication
> protocol.
> 0002: reuses workers, replication slots and origins for tablesync.
> I will also share some numbers comparing the performance of the patch
> and the master branch soon.
It would be interesting to see the numbers separately for the reuse of
replication slots and the reuse of origins. This will let us know how
much each of those optimizations helps beyond the reuse of workers.
--
With Regards,
Amit Kapila.
Hi,
Attached are new versions of the patch with some changes/fixes.
Here are also some numbers comparing the performance of logical
replication with this patch against the current master branch.
My benchmarking method is the same as what I used earlier in this thread
(but in a different environment, so don't compare these results with the
ones from earlier emails).
> With those changes, I did some benchmarking to see if it improves
> anything. These results compare this patch with the latest version of
> the master branch. "max_sync_workers_per_subscription" is left at its
> default of 2. I got the results by simply averaging timings from 5
> consecutive runs for each branch.
Since this patch is expected to improve logical replication of
empty/close-to-empty tables, I started by measuring performance with
empty tables.
| 10 tables | 100 tables | 1000 tables
------------------------------------------------------------------------------
master | 283.430 ms | 22739.107 ms | 105226.177 ms
------------------------------------------------------------------------------
patch | 189.139 ms | 1554.802 ms | 23091.434 ms
> After the changes discussed here [1], concurrent replication origin
> drops by the apply worker and tablesync workers may make each other wait
> due to the locks taken by replorigin_drop_by_name.
> I see that this harms the performance of logical replication quite a bit
> in terms of speed.
> [1]: /messages/by-id/20220714115155.GA5439@depesz.com
Firstly, as I mentioned, replication origin drops made things worse for
the master branch.
Locks become a more serious issue as the number of tables increases.
The patch reuses the origin, so it does not need to drop origins in each
iteration. That's why the difference between master and the patch is more
significant now than it was when I first sent the patch.
To show that the improvement is not only the result of reusing origins,
but also of reusing replication slots and workers, I reverted the commits
that cause the origin drop issue.
| 10 tables | 100 tables | 1000 tables
-----------------------------------------------------------------------------
reverted | 270.012 ms | 2483.907 ms | 31660.758 ms
-----------------------------------------------------------------------------
patch | 189.139 ms | 1554.802 ms | 23091.434 ms
With this patch, logical replication is still faster, even without the
issue with replication origin drops.
Also, here are some numbers with 10 tables loaded with some data:
| 10 MB | 100 MB
----------------------------------------------------------
master | 2868.524 ms | 14281.711 ms
----------------------------------------------------------
patch | 1750.226 ms | 14592.800 ms
The difference between master and the patch narrows as the size of the
tables increases, as expected.
I would appreciate any feedback/thoughts on the approach, the patches,
the numbers, etc.
Thanks,
--
Melih Mutlu
Microsoft