Bug in v13 due to "Fix corruption when relation truncation fails."

Started by Yura Sokolovabout 1 year ago3 messages
#1Yura Sokolov
y.sokolov@postgrespro.ru

Good day.

Commit "Fix corruption when relation truncation fails." [0]https://git.postgresql.org/gitweb/?p=postgresql.git;h=2280912165d makes
smgrtruncate be called in a critical section. Unfortunately in version
13 it leads to occasional call to palloc0 inside of critical section,
which triggers Assert in debug builds:

- smgrtruncate calls mdtruncate
- mdtruncate may call register_dirty_segment
- register_dirty_segment calls RegisterSyncRequest
- RegisterSyncRequest calls ForwardSyncRequest
- ForwardSyncRequest may call CompactCheckpointerRequestQueue
- CompactCheckpointerRequestQueue may call palloc0 ...

In versions 14 and above CompactCheckpointerRequestQueue does check for
critical section due to commit "Fix bugs in MultiXact truncation" [1]https://git.postgresql.org/gitweb/?p=postgresql.git;h=4c8e00ae9ae,
which were backported down to 14 version, but not 13.

We caught it in our private tests, so it is real.
Cherry-pick of [1]https://git.postgresql.org/gitweb/?p=postgresql.git;h=4c8e00ae9ae from 14 version to 13 solves the issue.

[0]: https://git.postgresql.org/gitweb/?p=postgresql.git;h=2280912165d
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;h=4c8e00ae9ae

------

regards,
Yura Sokolov aka funny-falcon

#2Michael Paquier
michael@paquier.xyz
In reply to: Yura Sokolov (#1)
Re: Bug in v13 due to "Fix corruption when relation truncation fails."

On Thu, Dec 26, 2024 at 01:10:59PM +0300, Yura Sokolov wrote:

Commit "Fix corruption when relation truncation fails." [0] makes
smgrtruncate be called in a critical section. Unfortunately in version 13 it
leads to occasional call to palloc0 inside of critical section, which
triggers Assert in debug builds:

- smgrtruncate calls mdtruncate
- mdtruncate may call register_dirty_segment
- register_dirty_segment calls RegisterSyncRequest
- RegisterSyncRequest calls ForwardSyncRequest
- ForwardSyncRequest may call CompactCheckpointerRequestQueue
- CompactCheckpointerRequestQueue may call palloc0 ...

In versions 14 and above CompactCheckpointerRequestQueue does check for
critical section due to commit "Fix bugs in MultiXact truncation" [1],
which were backported down to 14 version, but not 13.

We caught it in our private tests, so it is real.
Cherry-pick of [1] from 14 version to 13 solves the issue.

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;h=2280912165d
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;h=4c8e00ae9ae

Ah, you're right. This chain of events is possible in REL_13_STABLE,
and only there. Thomas?
--
Michael

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#2)
Re: Bug in v13 due to "Fix corruption when relation truncation fails."

On Wed, Jan 8, 2025 at 4:31 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Dec 26, 2024 at 01:10:59PM +0300, Yura Sokolov wrote:

Commit "Fix corruption when relation truncation fails." [0] makes
smgrtruncate be called in a critical section. Unfortunately in version 13 it
leads to occasional call to palloc0 inside of critical section, which
triggers Assert in debug builds:

- smgrtruncate calls mdtruncate
- mdtruncate may call register_dirty_segment
- register_dirty_segment calls RegisterSyncRequest
- RegisterSyncRequest calls ForwardSyncRequest
- ForwardSyncRequest may call CompactCheckpointerRequestQueue
- CompactCheckpointerRequestQueue may call palloc0 ...

In versions 14 and above CompactCheckpointerRequestQueue does check for
critical section due to commit "Fix bugs in MultiXact truncation" [1],
which were backported down to 14 version, but not 13.

We caught it in our private tests, so it is real.
Cherry-pick of [1] from 14 version to 13 solves the issue.

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;h=2280912165d
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;h=4c8e00ae9ae

Ah, you're right. This chain of events is possible in REL_13_STABLE,
and only there. Thomas?

Yeah. 4c8e00ae9ae addressed two separate issues: (1) another critical
section thing just like this one, and this one in passing, and (2) a
deadlock risk involving checkpoint delays. I had actually worried
about that exact deadlock scenario a bit earlier[1]/messages/by-id/CA+hUKGJC+EDcjXFoeOopcE7zxDwASUyWXh8RkRkgbq=vYTQ-qg@mail.gmail.com and figured out
that it couldn't happen in this code path, so I guess we technically
only need to back-patch the couple of lines with the fast return, but
it seems harmless, simple and good to back-patch Heikki's whole patch
from 14->13. I will study this a bit more and then go ahead and do
that. Thanks for the report, Yura, and the reminder, Michael.

[1]: /messages/by-id/CA+hUKGJC+EDcjXFoeOopcE7zxDwASUyWXh8RkRkgbq=vYTQ-qg@mail.gmail.com