Using XLogFileNameP in critical section

Started by Masahiko Sawadaover 6 years ago12 messageshackers
Jump to latest
#1Masahiko Sawada
sawada.mshk@gmail.com

Hi,

I encountered that the assertion error is reported instead of a proper
PANIC message when failed to fsync WAL. The cause is that there are
multiple places where we call XLogFileNameP function that calls palloc
during critical section, for example XLogWrite function.

TRAP: FailedAssertion("CritSectionCount == 0 ||
(context)->allowInCritSection", File: "mcxt.c", Line: 956)

As far as I can see there are five places we need to fix.I've attached a patch.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

fix_XLogFileNameP.patchapplication/x-patch; name=fix_XLogFileNameP.patchDownload+38-16
#2Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#1)
Re: Using XLogFileNameP in critical section

On Fri, Nov 29, 2019 at 06:44:58PM +0100, Masahiko Sawada wrote:

I encountered that the assertion error is reported instead of a proper
PANIC message when failed to fsync WAL. The cause is that there are
multiple places where we call XLogFileNameP function that calls palloc
during critical section, for example XLogWrite function.

TRAP: FailedAssertion("CritSectionCount == 0 ||
(context)->allowInCritSection", File: "mcxt.c", Line: 956)

As far as I can see there are five places we need to fix. I've
attached a patch.

+ msg = "could not fdatasync file \"%s\": %m";
Missing some translations, no?

You are missing a couple of code paths in walreceiver.c,
XLogWalRcvWrite(), where XLogFileNameP is used on a PANIC. This
brings me the following points:
1) If you look closely, all the callers of XLogFileNameP() are used
for the generation of error strings.
2) I am ready to bet that we'll have the same discussion in the future
because somebody will make the same mistake for a new code path.

I think that we had better just get rid of XLogFileNameP() (on HEAD)
and just change those code paths so as they use a buffer of size
MAXFNAMELEN, with XLogFileName() generating the file name. This leads
actually to some simplifications, see for example XLogWalRcvWrite..
--
Michael

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#2)
Re: Using XLogFileNameP in critical section

On 2019-Dec-02, Michael Paquier wrote:

I think that we had better just get rid of XLogFileNameP() (on HEAD)
and just change those code paths so as they use a buffer of size
MAXFNAMELEN, with XLogFileName() generating the file name. This leads
actually to some simplifications, see for example XLogWalRcvWrite..

+1, that function is a footgun and there's no possible way to fix it so
that it's never misused(*). Also, put a comment atop XLogFileName
indicating never to create such a "helper" function again.

(*) unless there was a way to have a StaticAssert() that the function is
not called in a critical section -- have the compiler work for us. I
don't think it's possible to write such a thing.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alvaro Herrera (#3)
Re: Using XLogFileNameP in critical section

On Mon, 2 Dec 2019 at 16:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Dec-02, Michael Paquier wrote:

I think that we had better just get rid of XLogFileNameP() (on HEAD)
and just change those code paths so as they use a buffer of size
MAXFNAMELEN, with XLogFileName() generating the file name. This leads
actually to some simplifications, see for example XLogWalRcvWrite..

+1, that function is a footgun and there's no possible way to fix it so
that it's never misused(*). Also, put a comment atop XLogFileName
indicating never to create such a "helper" function again.

(*) unless there was a way to have a StaticAssert() that the function is
not called in a critical section -- have the compiler work for us. I
don't think it's possible to write such a thing.

Agreed. I've attached the updated version patch. Please review it.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

fix_XLogFileNameP_v2.patchapplication/octet-stream; name=fix_XLogFileNameP_v2.patchDownload+95-42
#5Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#4)
Re: Using XLogFileNameP in critical section

On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote:

Agreed. I've attached the updated version patch. Please review it.

Thanks, applied on HEAD after a few edits. gettext() does not set
errno, so the new style of issue_xlog_fsync() is actually fine.
Please note that there was one mistake in the patch: you forgot to
assign back errno in assign_xlog_sync_method() after generating the
file name.

Per the low probability of the failures, I did not backpatch that
stuff. I quickly looked at applying that further down, and attached
is a version for v12 FWIW, and I suspect much more conflicts the more
you go down (wal segment size added in 11, different code paths for
replication, etc.).
--
Michael

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#5)
Re: Using XLogFileNameP in critical section

On 2019-Dec-03, Michael Paquier wrote:

Per the low probability of the failures, I did not backpatch that
stuff. I quickly looked at applying that further down, and attached
is a version for v12 FWIW, and I suspect much more conflicts the more
you go down (wal segment size added in 11, different code paths for
replication, etc.).

You didn't attach anything, but I concur about the low probability
aspect: the assertion failure does not occur in production builds
(obviously); and only an out-of-memory situation is a real problem when
an fsync fails. Anyway this should be a very localized fix, right?

I'm not sure that the internationalization stuff in issue_xlog_fsync is
correct. I think the _() should be gettext_noop(), or alternatively the
errmsg() should be errmsg_internal(); otherwise the translation is
invoked twice. (I didn't verify this.)

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: Using XLogFileNameP in critical section

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I'm not sure that the internationalization stuff in issue_xlog_fsync is
correct. I think the _() should be gettext_noop(), or alternatively the
errmsg() should be errmsg_internal(); otherwise the translation is
invoked twice. (I didn't verify this.)

Also, buildfarm member drongo is not happy:

postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]

I'm guessing you missed a reference someplace.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: Using XLogFileNameP in critical section

I wrote:

Also, buildfarm member drongo is not happy:
postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
I'm guessing you missed a reference someplace.

Hm ... grep swears up and down that there is no remaining instance
of the string "XLogFileNameP" anywhere in the tree. So this doesn't
seem to be the fault of 9989d37d1 per se. What my eye now falls on
is this, a bit further up in the build log [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2019-12-03%2007%3A30%3A01:

...
PreLinkEvent:
Generate DEF file
perl src\tools\msvc\gendef.pl Release\postgres x64
:VCEnd
Not re-generating POSTGRES.DEF, file already exists.
Link:
...

So it seems that the problem might really be a faulty rule in our
MSVC build script about when postgres.def needs to be regenerated?
Or else it's some weird caching problem on drongo --- the lack of
complaints from other Windows critters might point the finger
that way.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2019-12-03%2007%3A30%3A01

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#5)
Re: Using XLogFileNameP in critical section

On Tue, 3 Dec 2019 at 07:09, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Dec 02, 2019 at 10:14:58PM +0100, Masahiko Sawada wrote:

Agreed. I've attached the updated version patch. Please review it.

Thanks, applied on HEAD after a few edits. gettext() does not set
errno, so the new style of issue_xlog_fsync() is actually fine.

Thanks!

Please note that there was one mistake in the patch: you forgot to
assign back errno in assign_xlog_sync_method() after generating the
file name.

My bad. Thank you for fixing it.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#8)
Re: Using XLogFileNameP in critical section

On Tue, Dec 03, 2019 at 11:24:57AM -0500, Tom Lane wrote:

So it seems that the problem might really be a faulty rule in our
MSVC build script about when postgres.def needs to be regenerated?
Or else it's some weird caching problem on drongo --- the lack of
complaints from other Windows critters might point the finger
that way.

Yes, I saw the failure from the buildfarm logs, but I got to the
conclusion that the animal just got crazy with a portion of its
caching because there are no more references to routine removed. So I
did not bother much.

FWIW, I have seen sometimes similar warnings regarding conflicting
projects when doing a compilation on MSVC, applying a patch and then
recompiling (because that's just faster than recompiling the whole).
Perhaps we could do better here. I am not completely sure what
though, it's not a problem I have spent much brain-time on.
--
Michael

#11Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#6)
Re: Using XLogFileNameP in critical section

On Tue, Dec 03, 2019 at 09:35:00AM -0300, Alvaro Herrera wrote:

You didn't attach anything, but I concur about the low probability
aspect: the assertion failure does not occur in production builds
(obviously); and only an out-of-memory situation is a real problem
when
an fsync fails. Anyway this should be a very localized fix, right?

Sorry. You get something like the attached. The recent refactoring
work you committed in this area causes already conflicts on
REL_12_STABLE.

I'm not sure that the internationalization stuff in issue_xlog_fsync
is correct. I think the _() should be gettext_noop(), or alternatively
the errmsg() should be errmsg_internal(); otherwise the translation is
invoked twice. (I didn't verify this.)

Hmm. We actually do both in tablecmds.c:ATWrongRelkindError(), and
that's the code I was looking at yesterday when thinking about the
problem.. However, parse_agg.c, parse_expr.c and parse_func.c among
others like vacuumlazy.c use directly errmsg_internal() without
translating the string first. So there is indeed duplicated work for
both. Does the attached patch look correct to you?
--
Michael

Attachments:

gettext-fix.patchtext/x-diff; charset=us-asciiDownload+2-2
0001-Remove-use-of-XLogFileNameP-in-error-string-generati-v12.patchtext/x-diff; charset=iso-8859-1Download+112-36
#12Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: Using XLogFileNameP in critical section

On 12/3/19 11:24 AM, Tom Lane wrote:

I wrote:

Also, buildfarm member drongo is not happy:
postgres.def : error LNK2001: unresolved external symbol XLogFileNameP [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
Release/postgres/postgres.lib : fatal error LNK1120: 1 unresolved externals [C:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]
I'm guessing you missed a reference someplace.

Hm ... grep swears up and down that there is no remaining instance
of the string "XLogFileNameP" anywhere in the tree. So this doesn't
seem to be the fault of 9989d37d1 per se. What my eye now falls on
is this, a bit further up in the build log [1]:

...
PreLinkEvent:
Generate DEF file
perl src\tools\msvc\gendef.pl Release\postgres x64
:VCEnd
Not re-generating POSTGRES.DEF, file already exists.
Link:
...

So it seems that the problem might really be a faulty rule in our
MSVC build script about when postgres.def needs to be regenerated?
Or else it's some weird caching problem on drongo --- the lack of
complaints from other Windows critters might point the finger
that way.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2019-12-03%2007%3A30%3A01

this was pilot error on my part. Should be fixed now.

cheers

andrew