Is it OK to perform logging while holding a LWLock?

Started by Chao Li4 months ago10 messageshackers

li.evan.chao@gmail.com

4 months ago

Hi,

As $SUBJECT says, my understanding is no. I think LWLocks are generally only held for a very short duration, like a few CPU cycles to read or modify some shared data, so operations that might involve additional overhead (like logging, which could touch I/O paths) are better done outside the lock section when possible.

I don’t find a concrete reference or documentation explicitly stating this, but it seems to be a common convention in the codebase.

Am I missing something here? Or are there cases where doing so is considered acceptable?

I’m asking because I noticed that in DisableLogicalDecoding(), we emit an ereport() before releasing LogicalDecodingControlLock:
```
if (!in_recovery)
ereport(LOG,
errmsg("logical decoding is disabled because there are no valid logical replication slots"));

LWLockRelease(LogicalDecodingControlLock);
```

In this particular case, the ereport() doesn’t seem to depend on the lock at all, so it looks safe to release the lock first and then log.

So, If my understanding is correct, please see the attached patch. Otherwise, feel free to ignore it.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

tgl@sss.pgh.pa.us

4 months ago

In reply to: Chao Li (#1)

Re: Is it OK to perform logging while holding a LWLock?

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

li.evan.chao@gmail.com

4 months ago

In reply to: Tom Lane (#2)

Re: Is it OK to perform logging while holding a LWLock?

On Feb 11, 2026, at 11:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

Thanks for the detailed explanation. I’ve written this down in my notebook, very helpful.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

li.evan.chao@gmail.com

4 months ago

In reply to: Tom Lane (#2)

Re: Is it OK to perform logging while holding a LWLock?

On Feb 11, 2026, at 11:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

As Tom has agreed with this patch, added to CF for tracking: https://commitfest.postgresql.org/patch/6477/

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Chao Li (#4)

Re: Is it OK to perform logging while holding a LWLock?

Hi,

On Wed, Feb 11, 2026 at 5:07 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Feb 11, 2026, at 11:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

As Tom has agreed with this patch, added to CF for tracking: https://commitfest.postgresql.org/patch/6477/

While I agree with the concern Tom pointed out, I can find that there
are other places where we do ereport() while holding a lwlock. For
instance:

src/backend/access/transam/multixact.c:
else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
{
ereport(LOG,
(errmsg("cannot truncate up to MultiXact %u because it
does not exist on disk, skipping truncation",
newOldestMulti)));
LWLockRelease(MultiXactTruncationLock);
return;
}

src/backend/commands/vacuum.c:
if (frozenAlreadyWrapped)
{
ereport(WARNING,
(errmsg("some databases have not been vacuumed in over
2 billion transactions"),
errdetail("You might have already suffered
transaction-wraparound data loss.")));
LWLockRelease(WrapLimitsVacuumLock);
return;
}

src/backend/postmaster/autovacuum.c:
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

/*
* No other process can put a worker in starting mode, so if
* startingWorker is still INVALID after exchanging our lock,
* we assume it's the same one we saw above (so we don't
* recheck the launch time).
*/
if (AutoVacuumShmem->av_startingWorker != NULL)
{
worker = AutoVacuumShmem->av_startingWorker;
worker->wi_dboid = InvalidOid;
worker->wi_tableoid = InvalidOid;
worker->wi_sharedrel = false;
worker->wi_proc = NULL;
worker->wi_launchtime = 0;
dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
&worker->wi_links);
AutoVacuumShmem->av_startingWorker = NULL;
ereport(WARNING,
errmsg("autovacuum worker took too long to
start; canceled"));
}

Should we fix these places as well?

Also, if we reverse the ereport() and LWLockRelease() in the specific
example in logicalctl.c, it would happen that a concurrent logical
decoding activation writes the log "logical decoding is enabled upon
creating a new logical replication slot" before the deactivation
"logical decoding is disabled because there are no valid logical
replication slots", confusing users since the logical decoding is
active even though the last log saying "logical decoding is disabled".

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

li.evan.chao@gmail.com

3 months ago

In reply to: Masahiko Sawada (#5)

Re: Is it OK to perform logging while holding a LWLock?

On Mar 4, 2026, at 04:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

On Wed, Feb 11, 2026 at 5:07 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Feb 11, 2026, at 11:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

As Tom has agreed with this patch, added to CF for tracking: https://commitfest.postgresql.org/patch/6477/

While I agree with the concern Tom pointed out, I can find that there
are other places where we do ereport() while holding a lwlock. For
instance:

src/backend/access/transam/multixact.c:
else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
{
ereport(LOG,
(errmsg("cannot truncate up to MultiXact %u because it
does not exist on disk, skipping truncation",
newOldestMulti)));
LWLockRelease(MultiXactTruncationLock);
return;
}

src/backend/commands/vacuum.c:
if (frozenAlreadyWrapped)
{
ereport(WARNING,
(errmsg("some databases have not been vacuumed in over
2 billion transactions"),
errdetail("You might have already suffered
transaction-wraparound data loss.")));
LWLockRelease(WrapLimitsVacuumLock);
return;
}

src/backend/postmaster/autovacuum.c:
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

/*
* No other process can put a worker in starting mode, so if
* startingWorker is still INVALID after exchanging our lock,
* we assume it's the same one we saw above (so we don't
* recheck the launch time).
*/
if (AutoVacuumShmem->av_startingWorker != NULL)
{
worker = AutoVacuumShmem->av_startingWorker;
worker->wi_dboid = InvalidOid;
worker->wi_tableoid = InvalidOid;
worker->wi_sharedrel = false;
worker->wi_proc = NULL;
worker->wi_launchtime = 0;
dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
&worker->wi_links);
AutoVacuumShmem->av_startingWorker = NULL;
ereport(WARNING,
errmsg("autovacuum worker took too long to
start; canceled"));
}

Should we fix these places as well?

Maybe. If needed, I can take a look at them one by one.

Also, if we reverse the ereport() and LWLockRelease() in the specific
example in logicalctl.c, it would happen that a concurrent logical
decoding activation writes the log "logical decoding is enabled upon
creating a new logical replication slot" before the deactivation
"logical decoding is disabled because there are no valid logical
replication slots", confusing users since the logical decoding is
active even though the last log saying "logical decoding is disabled".

That sounds reasonable. However, looking at the current code, the “enabling” log is printed after releasing the lock:
```
LWLockRelease(LogicalDecodingControlLock);

END_CRIT_SECTION();

if (!in_recovery)
ereport(LOG,
errmsg("logical decoding is enabled upon creating a new logical replication slot"));
```

So the log order is not currently protected by the lock. If we really want to ensure the ordering between these two messages, we might instead need to move the “enabling” log before releasing the lock.

In my understanding logs are never strictly ordered across processes anyway. Even if we move the “enabling” log into the locked section, it would only reduce the possibility of disorder rather than fully prevent it.

I wonder what you think would be the best way to proceed here.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Chao Li (#6)

Re: Is it OK to perform logging while holding a LWLock?

On Tue, Mar 3, 2026 at 2:49 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Mar 4, 2026, at 04:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

On Wed, Feb 11, 2026 at 5:07 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Feb 11, 2026, at 11:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

As Tom has agreed with this patch, added to CF for tracking: https://commitfest.postgresql.org/patch/6477/

While I agree with the concern Tom pointed out, I can find that there
are other places where we do ereport() while holding a lwlock. For
instance:

src/backend/access/transam/multixact.c:
else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
{
ereport(LOG,
(errmsg("cannot truncate up to MultiXact %u because it
does not exist on disk, skipping truncation",
newOldestMulti)));
LWLockRelease(MultiXactTruncationLock);
return;
}

src/backend/commands/vacuum.c:
if (frozenAlreadyWrapped)
{
ereport(WARNING,
(errmsg("some databases have not been vacuumed in over
2 billion transactions"),
errdetail("You might have already suffered
transaction-wraparound data loss.")));
LWLockRelease(WrapLimitsVacuumLock);
return;
}

src/backend/postmaster/autovacuum.c:
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

/*
* No other process can put a worker in starting mode, so if
* startingWorker is still INVALID after exchanging our lock,
* we assume it's the same one we saw above (so we don't
* recheck the launch time).
*/
if (AutoVacuumShmem->av_startingWorker != NULL)
{
worker = AutoVacuumShmem->av_startingWorker;
worker->wi_dboid = InvalidOid;
worker->wi_tableoid = InvalidOid;
worker->wi_sharedrel = false;
worker->wi_proc = NULL;
worker->wi_launchtime = 0;
dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
&worker->wi_links);
AutoVacuumShmem->av_startingWorker = NULL;
ereport(WARNING,
errmsg("autovacuum worker took too long to
start; canceled"));
}

Should we fix these places as well?

Maybe. If needed, I can take a look at them one by one.

Also, if we reverse the ereport() and LWLockRelease() in the specific
example in logicalctl.c, it would happen that a concurrent logical
decoding activation writes the log "logical decoding is enabled upon
creating a new logical replication slot" before the deactivation
"logical decoding is disabled because there are no valid logical
replication slots", confusing users since the logical decoding is
active even though the last log saying "logical decoding is disabled".

That sounds reasonable. However, looking at the current code, the “enabling” log is printed after releasing the lock:
```
LWLockRelease(LogicalDecodingControlLock);

END_CRIT_SECTION();

if (!in_recovery)
ereport(LOG,
errmsg("logical decoding is enabled upon creating a new logical replication slot"));
```

So the log order is not currently protected by the lock. If we really want to ensure the ordering between these two messages, we might instead need to move the “enabling” log before releasing the lock.

No, IIUC activation happens while a valid logical replication slot is
held. Since deactivation requires that no slots are present, there's
no risk of a concurrent deactivation occurring between the
LWLockRelease() and the ereport().

I wonder what you think would be the best way to proceed here.

Thinking more about this: while the fix might be applicable elsewhere,
it seems unnecessary here. The deactivation process is restricted to
the startup and checkpointer processes. Given that these processes
don't access system catalogs for example, I think there is little or
zero risk of a deadlock occurring even if we do something in
errcontext.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Michael Paquier

michael@paquier.xyz

3 months ago

In reply to: Masahiko Sawada (#5)

Re: Is it OK to perform logging while holding a LWLock?

On Tue, Mar 03, 2026 at 12:48:26PM -0800, Masahiko Sawada wrote:

Also, if we reverse the ereport() and LWLockRelease() in the specific
example in logicalctl.c, it would happen that a concurrent logical
decoding activation writes the log "logical decoding is enabled upon
creating a new logical replication slot" before the deactivation
"logical decoding is disabled because there are no valid logical
replication slots", confusing users since the logical decoding is
active even though the last log saying "logical decoding is disabled".

I don't really understand why we need to care about changing these
code paths. LWLocks are not bound to requirements like avoiding
elog() or Postgres-specific calls while being hold, so what we are
doing is basically fine. None of the code paths changed here are
relevant performance-wise, as well. Hence, why caring at all with
such changes?
--
Michael

li.evan.chao@gmail.com

3 months ago

In reply to: Masahiko Sawada (#7)

Re: Is it OK to perform logging while holding a LWLock?

On Mar 4, 2026, at 07:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 3, 2026 at 2:49 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Mar 4, 2026, at 04:48, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

On Wed, Feb 11, 2026 at 5:07 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Feb 11, 2026, at 11:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:

As $SUBJECT says, my understanding is no.

It's not a great idea, but I'm not sure it's fatal. There are places
that hold LWLocks for awhile.

I think LWLocks are generally only held for a very short duration,
like a few CPU cycles to read or modify some shared data,

Spinlocks are treated that way, but we're willing to hold LWLocks
longer. The main thing I'd be concerned about is that there is no
deadlock-detection infrastructure for LWLocks, so you'd better be
darn certain there is no possibility of deadlock. That usually
means you want to limit the extent of code that could run while
you're holding the lock.

In your specific example, the thing I'd be afraid of is that an
errcontext callback might do something you're not expecting.
We don't forbid errcontext callbacks from doing catalog lookups,
for instance. So on the whole I agree with this patch, with
or without any concrete example that fails.

regards, tom lane

As Tom has agreed with this patch, added to CF for tracking: https://commitfest.postgresql.org/patch/6477/

While I agree with the concern Tom pointed out, I can find that there
are other places where we do ereport() while holding a lwlock. For
instance:

src/backend/access/transam/multixact.c:
else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
{
ereport(LOG,
(errmsg("cannot truncate up to MultiXact %u because it
does not exist on disk, skipping truncation",
newOldestMulti)));
LWLockRelease(MultiXactTruncationLock);
return;
}

src/backend/commands/vacuum.c:
if (frozenAlreadyWrapped)
{
ereport(WARNING,
(errmsg("some databases have not been vacuumed in over
2 billion transactions"),
errdetail("You might have already suffered
transaction-wraparound data loss.")));
LWLockRelease(WrapLimitsVacuumLock);
return;
}

src/backend/postmaster/autovacuum.c:
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);

/*
* No other process can put a worker in starting mode, so if
* startingWorker is still INVALID after exchanging our lock,
* we assume it's the same one we saw above (so we don't
* recheck the launch time).
*/
if (AutoVacuumShmem->av_startingWorker != NULL)
{
worker = AutoVacuumShmem->av_startingWorker;
worker->wi_dboid = InvalidOid;
worker->wi_tableoid = InvalidOid;
worker->wi_sharedrel = false;
worker->wi_proc = NULL;
worker->wi_launchtime = 0;
dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
&worker->wi_links);
AutoVacuumShmem->av_startingWorker = NULL;
ereport(WARNING,
errmsg("autovacuum worker took too long to
start; canceled"));
}

Should we fix these places as well?

Maybe. If needed, I can take a look at them one by one.

Also, if we reverse the ereport() and LWLockRelease() in the specific
example in logicalctl.c, it would happen that a concurrent logical
decoding activation writes the log "logical decoding is enabled upon
creating a new logical replication slot" before the deactivation
"logical decoding is disabled because there are no valid logical
replication slots", confusing users since the logical decoding is
active even though the last log saying "logical decoding is disabled".

That sounds reasonable. However, looking at the current code, the “enabling” log is printed after releasing the lock:
```
LWLockRelease(LogicalDecodingControlLock);

END_CRIT_SECTION();

if (!in_recovery)
ereport(LOG,
errmsg("logical decoding is enabled upon creating a new logical replication slot"));
```

So the log order is not currently protected by the lock. If we really want to ensure the ordering between these two messages, we might instead need to move the “enabling” log before releasing the lock.

No, IIUC activation happens while a valid logical replication slot is
held. Since deactivation requires that no slots are present, there's
no risk of a concurrent deactivation occurring between the
LWLockRelease() and the ereport().

I wonder what you think would be the best way to proceed here.

Thinking more about this: while the fix might be applicable elsewhere,
it seems unnecessary here. The deactivation process is restricted to
the startup and checkpointer processes. Given that these processes
don't access system catalogs for example, I think there is little or
zero risk of a deadlock occurring even if we do something in
errcontext.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Thanks for the explanations. In that case, I will withdraw this patch.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Michael Paquier (#8)

Re: Is it OK to perform logging while holding a LWLock?

On Tue, Mar 3, 2026 at 4:15 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Mar 03, 2026 at 12:48:26PM -0800, Masahiko Sawada wrote:

Also, if we reverse the ereport() and LWLockRelease() in the specific
example in logicalctl.c, it would happen that a concurrent logical
decoding activation writes the log "logical decoding is enabled upon
creating a new logical replication slot" before the deactivation
"logical decoding is disabled because there are no valid logical
replication slots", confusing users since the logical decoding is
active even though the last log saying "logical decoding is disabled".

I don't really understand why we need to care about changing these
code paths. LWLocks are not bound to requirements like avoiding
elog() or Postgres-specific calls while being hold, so what we are
doing is basically fine. None of the code paths changed here are
relevant performance-wise, as well. Hence, why caring at all with
such changes?

We were concerned about potential deadlocks that might happen if we do
something (including system catalog lookups etc.) in errcontext.
However, as I mentioned in the previous email[1]/messages/by-id/CAD21AoDgyTdJgd1Ep1Dgu12Wa7JXzp78f+8-BC=MzeT1qt_9hA@mail.gmail.com, these changes are
not necessary as we don't need to be concerned about deadlocks in this
case.

Regards,

[1]: /messages/by-id/CAD21AoDgyTdJgd1Ep1Dgu12Wa7JXzp78f+8-BC=MzeT1qt_9hA@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com