pgbench error: (setshell) of script 0; execution of meta-command failed

Started by Andy Fanover 1 year ago17 messageshackers
Jump to latest
#1Andy Fan
zhihui.fan1213@gmail.com

Hi:

I run into the {subject} issue with the below setup.

cat foo.sql

\setshell txn_mode echo ${TXN_MODE}
\setshell speed echo ${SPEED}
\setshell sleep_ms echo ${SLEEP_MS}
\setshell subtxn_mode echo ${SUBTXN_MODE}

select 1;

$ TXN_MODE=-1 SPEED=1 SLEEP_MS=0 SUBTXN_MODE=-1 pgbench -n -ffoo.sql postgres -T5 -c4 --exit-on-abort

I *randomly*(7/8) get errors like:

pgbench (18devel)
pgbench: error: client 2 aborted in command 0 (setshell) of script 0; execution of meta-command failed
pgbench: error: Run was aborted due to an error in thread 0

I debug this for 1+ hour and didn't find anything useful, so I'd like
have a ask if there is any known issue or the way I use \setshell is
wrong?

Thanks

--
Best Regards
Andy Fan

#2Andy Fan
zhihui.fan1213@gmail.com
In reply to: Andy Fan (#1)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Andy Fan <zhihuifan1213@163.com> writes:

Hi:

I run into the {subject} issue with the below setup.

cat foo.sql

\setshell txn_mode echo ${TXN_MODE}
\setshell speed echo ${SPEED}
\setshell sleep_ms echo ${SLEEP_MS}
\setshell subtxn_mode echo ${SUBTXN_MODE}

select 1;

$ TXN_MODE=-1 SPEED=1 SLEEP_MS=0 SUBTXN_MODE=-1 pgbench -n -ffoo.sql postgres -T5 -c4 --exit-on-abort

I *randomly*(7/8) get errors like:

pgbench (18devel)
pgbench: error: client 2 aborted in command 0 (setshell) of script 0; execution of meta-command failed
pgbench: error: Run was aborted due to an error in thread 0

I think I have figured out the issue, if you want reproduce it quicker,
you can change the '-T5' to '-T1' in the pgbench command and run many times.

Here is the patch to fix it, would someone take a look at?

pgbench: Avoid misleading error for \[set]shell when timer_exceeded.

fgets in executeMetaCommand may return NULL if it receives a signal
during the shell command is executing. Before this commit, pgbench
client raises ERROR like below.

pgbench: error: client 3 aborted in command 3 (setshell) of script 0;
execution of meta-command failed

This behavior is misleading since people may think something is
wrong. In this commit, we just ignore fgets return NULL when
timer_exceeded.

--
Best Regards
Andy Fan

Attachments:

0001-pgbench-Avoid-misleading-error-for-set-shell-when-ti.patchtext/x-diffDownload+7-2
#3Fujii Masao
masao.fujii@gmail.com
In reply to: Andy Fan (#2)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

On 2025/01/10 16:09, Andy Fan wrote:

Andy Fan <zhihuifan1213@163.com> writes:

Hi:

I run into the {subject} issue with the below setup.

cat foo.sql

\setshell txn_mode echo ${TXN_MODE}
\setshell speed echo ${SPEED}
\setshell sleep_ms echo ${SLEEP_MS}
\setshell subtxn_mode echo ${SUBTXN_MODE}

select 1;

$ TXN_MODE=-1 SPEED=1 SLEEP_MS=0 SUBTXN_MODE=-1 pgbench -n -ffoo.sql postgres -T5 -c4 --exit-on-abort

I *randomly*(7/8) get errors like:

pgbench (18devel)
pgbench: error: client 2 aborted in command 0 (setshell) of script 0; execution of meta-command failed
pgbench: error: Run was aborted due to an error in thread 0

Interestingly, my git bisect pointed to the following commit
as the cause of this issue, even though it seems unrelated to
the pgbench problem at all. It’s possible my git bisect result
is incorrect, but when I reverted this commit on HEAD,
the pgbench issue didn’t occur during my tests.

----------------------
06843df4abc5a0c24e4bd154a8a1327e074fa3ae is the first bad commit
commit 06843df4abc5a0c24e4bd154a8a1327e074fa3ae
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri Sep 29 14:07:30 2023 -0400

Suppress macOS warnings about duplicate libraries in link commands.
----------------------

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#4Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andy Fan (#2)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Hi,

On Fri, 10 Jan 2025 at 10:10, Andy Fan <zhihuifan1213@163.com> wrote:

Andy Fan <zhihuifan1213@163.com> writes:

Hi:

I run into the {subject} issue with the below setup.

cat foo.sql

\setshell txn_mode echo ${TXN_MODE}
\setshell speed echo ${SPEED}
\setshell sleep_ms echo ${SLEEP_MS}
\setshell subtxn_mode echo ${SUBTXN_MODE}

select 1;

$ TXN_MODE=-1 SPEED=1 SLEEP_MS=0 SUBTXN_MODE=-1 pgbench -n -ffoo.sql postgres -T5 -c4 --exit-on-abort

I *randomly*(7/8) get errors like:

pgbench (18devel)
pgbench: error: client 2 aborted in command 0 (setshell) of script 0; execution of meta-command failed
pgbench: error: Run was aborted due to an error in thread 0

I think I have figured out the issue, if you want reproduce it quicker,
you can change the '-T5' to '-T1' in the pgbench command and run many times.

I ran it ~500 times on HEAD but the issue did not occur on my machine.
Is 500 times not enough?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#3)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

On 2025/01/10 21:41, Fujii Masao wrote:

On 2025/01/10 16:09, Andy Fan wrote:

Andy Fan <zhihuifan1213@163.com> writes:

Hi:

I run into the {subject} issue with the below setup.

cat foo.sql

\setshell txn_mode echo ${TXN_MODE}
\setshell speed echo ${SPEED}
\setshell sleep_ms echo ${SLEEP_MS}
\setshell subtxn_mode echo ${SUBTXN_MODE}

select 1;

$ TXN_MODE=-1 SPEED=1 SLEEP_MS=0 SUBTXN_MODE=-1 pgbench -n -ffoo.sql postgres -T5 -c4 --exit-on-abort

I *randomly*(7/8) get errors like:

pgbench (18devel)
pgbench: error: client 2 aborted in command 0 (setshell) of script 0; execution of meta-command failed
pgbench: error: Run was aborted due to an error in thread 0

Interestingly, my git bisect pointed to the following commit
as the cause of this issue, even though it seems unrelated to
the pgbench problem at all. It’s possible my git bisect result
is incorrect, but when I reverted this commit on HEAD,
the pgbench issue didn’t occur during my tests.

----------------------
06843df4abc5a0c24e4bd154a8a1327e074fa3ae is the first bad commit
commit 06843df4abc5a0c24e4bd154a8a1327e074fa3ae
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri Sep 29 14:07:30 2023 -0400

    Suppress macOS warnings about duplicate libraries in link commands.
----------------------

Before this commit, pgbench used pqsignal() from port/pqsignal.c
to set the signal handler for SIGALRM. This version of pqsignal()
sets SA_RESTART for frontend code, so fgets() in runShellCommand()
wouldn't return NULL even if SIGALRM arrived during fgets(),
preventing the reported error.

On the other hand, currently, pgbench seems to use pqsignal()
from legacy-pqsignal.c, which doesn't set SA_RESTART for SIGALRM.
As a result, SIGALRM can interrupt fgets() in runShellCommand()
and make it return NULL, leading to the reported error.

I'm not sure if this change was an intentional result of that commit...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#5)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Fujii Masao <masao.fujii@oss.nttdata.com> writes:

Before this commit, pgbench used pqsignal() from port/pqsignal.c
to set the signal handler for SIGALRM. This version of pqsignal()
sets SA_RESTART for frontend code, so fgets() in runShellCommand()
wouldn't return NULL even if SIGALRM arrived during fgets(),
preventing the reported error.

On the other hand, currently, pgbench seems to use pqsignal()
from legacy-pqsignal.c, which doesn't set SA_RESTART for SIGALRM.
As a result, SIGALRM can interrupt fgets() in runShellCommand()
and make it return NULL, leading to the reported error.

Oh, interesting.

I'm not sure if this change was an intentional result of that commit...

Definitely not. I think we'd better look into how to undo that
effect.

Since legacy-pqsignal is really not supposed to be used by clients
anymore, maybe we could just adjust it to set SA_RESTART for SIGALRM?
The other alternatives I can think of amount to re-introducing
link order dependencies, which would be horrid.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nazir Bilal Yavuz (#4)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Nazir Bilal Yavuz <byavuz81@gmail.com> writes:

I ran it ~500 times on HEAD but the issue did not occur on my machine.

What platform are you testing on?

regards, tom lane

#8Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Tom Lane (#7)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Hi,

On Fri, 10 Jan 2025 at 17:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Nazir Bilal Yavuz <byavuz81@gmail.com> writes:

I ran it ~500 times on HEAD but the issue did not occur on my machine.

What platform are you testing on?

My local machine is: Linux 6.12.8-amd64 #1 SMP PREEMPT_DYNAMIC Debian
6.12.8-1 (2025-01-02) x86_64 GNU/Linux

Also, trying on macOS CI VM (by re-running with terminal access) but
no luck so far.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

I wrote:

Since legacy-pqsignal is really not supposed to be used by clients
anymore, maybe we could just adjust it to set SA_RESTART for SIGALRM?
The other alternatives I can think of amount to re-introducing
link order dependencies, which would be horrid.

Actually, after re-reading the thread that led to 06843df4a [1]/messages/by-id/467042.1695766998@sss.pgh.pa.us,
I think a better idea is to introduce some macro magic to force
frontend clients to use libpgport's version of pqsignal() instead
of the one from libpq. We mustn't change the real name of libpq's
version, but I think we could get away with that for libpgport.

I'm a bit tied up today, but can look at this over the weekend if
nobody beats me to it.

regards, tom lane

[1]: /messages/by-id/467042.1695766998@sss.pgh.pa.us

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

I wrote:

Actually, after re-reading the thread that led to 06843df4a [1],
I think a better idea is to introduce some macro magic to force
frontend clients to use libpgport's version of pqsignal() instead
of the one from libpq. We mustn't change the real name of libpq's
version, but I think we could get away with that for libpgport.

After studying this more, I think what we should do in HEAD is
even more aggressive: let's make the real name of port/pqsignal.c's
function be either pqsignal_fe in frontend, or pqsignal_be in backend.
This positively ensures that there's no collision with libpq's
export, and it seems like a good idea for the additional reason that
the frontend and backend versions are not really interchangeable.

However, we can't get away with that in released branches because
it'd be an ABI break for any outside code that calls pqsignal.
What I propose doing in the released branches is what's shown in
the attached patch for v17: rename port/pqsignal.c's function to
pqsignal_fe in frontend, but leave it as pqsignal in the backend.
Leaving it as pqsignal in the backend preserves ABI for extension
modules, at the cost that it's not entirely clear which pqsignal
an extension that's also linked with libpq will get. But that
hazard affects none of our code, and it existed already for any
third-party extensions that call pqsignal. In principle using
"pqsignal_fe" in frontend creates an ABI hazard for outside users of
libpgport.a. But I think it's not really a problem, because we don't
support that as a shared library. As long as people build with
headers and .a files from the same minor release, they'll be OK.

BTW, this decouples legacy-pqsignal.c from pqsignal.c enough
that we could now do what's contemplated in the comments from
3b00fdba9: simplify that version by making it return void,
or perhaps better just a true/false success report.
I've not touched that point here, though.

regards, tom lane

Attachments:

v1-fix-pqsignal-HEAD.patchtext/x-diff; charset=us-ascii; name=v1-fix-pqsignal-HEAD.patchDownload+19-2
v1-fix-pqsignal-REL17.patchtext/x-diff; charset=us-ascii; name=v1-fix-pqsignal-REL17.patchDownload+17-2
#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#10)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

On Sat, Jan 11, 2025 at 02:04:13PM -0500, Tom Lane wrote:

After studying this more, I think what we should do in HEAD is
even more aggressive: let's make the real name of port/pqsignal.c's
function be either pqsignal_fe in frontend, or pqsignal_be in backend.
This positively ensures that there's no collision with libpq's
export, and it seems like a good idea for the additional reason that
the frontend and backend versions are not really interchangeable.

That would be nice.

However, we can't get away with that in released branches because
it'd be an ABI break for any outside code that calls pqsignal.
What I propose doing in the released branches is what's shown in
the attached patch for v17: rename port/pqsignal.c's function to
pqsignal_fe in frontend, but leave it as pqsignal in the backend.
Leaving it as pqsignal in the backend preserves ABI for extension
modules, at the cost that it's not entirely clear which pqsignal
an extension that's also linked with libpq will get. But that
hazard affects none of our code, and it existed already for any
third-party extensions that call pqsignal. In principle using
"pqsignal_fe" in frontend creates an ABI hazard for outside users of
libpgport.a. But I think it's not really a problem, because we don't
support that as a shared library. As long as people build with
headers and .a files from the same minor release, they'll be OK.

I don't have any concrete reasons to think you are wrong about this, but it
does feel somewhat risky to me. It might be worth testing it with a couple
of third-party projects that use the frontend version of pqsignal().
codesearch.debian.net shows a couple that may be doing so [0]https://github.com/gleu/pgstats [1]https://github.com/hapostgres/pg_auto_failover [2]https://github.com/credativ/pg_checksums.

BTW, this decouples legacy-pqsignal.c from pqsignal.c enough
that we could now do what's contemplated in the comments from
3b00fdba9: simplify that version by making it return void,
or perhaps better just a true/false success report.
I've not touched that point here, though.

I think it should just return void since AFAICT nobody checks the return
value. But it would probably be a good idea to add some sort of error
checking within the function. I've attached a draft patch that adds some
new assertions. I originally tried to use elog() where it was available,
but besides making the code even more unreadable, I think it's unnecessary
(since AFAICT any problems are likely coding errors).

[0]: https://github.com/gleu/pgstats
[1]: https://github.com/hapostgres/pg_auto_failover
[2]: https://github.com/credativ/pg_checksums

--
nathan

Attachments:

v1-0001-modify-pqsignal.patchtext/plain; charset=us-asciiDownload+7-28
#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#11)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Nathan Bossart <nathandbossart@gmail.com> writes:

On Sat, Jan 11, 2025 at 02:04:13PM -0500, Tom Lane wrote:

What I propose doing in the released branches is what's shown in
the attached patch for v17: rename port/pqsignal.c's function to
pqsignal_fe in frontend, but leave it as pqsignal in the backend.
Leaving it as pqsignal in the backend preserves ABI for extension
modules, at the cost that it's not entirely clear which pqsignal
an extension that's also linked with libpq will get. But that
hazard affects none of our code, and it existed already for any
third-party extensions that call pqsignal. In principle using
"pqsignal_fe" in frontend creates an ABI hazard for outside users of
libpgport.a. But I think it's not really a problem, because we don't
support that as a shared library. As long as people build with
headers and .a files from the same minor release, they'll be OK.

I don't have any concrete reasons to think you are wrong about this, but it
does feel somewhat risky to me. It might be worth testing it with a couple
of third-party projects that use the frontend version of pqsignal().
codesearch.debian.net shows a couple that may be doing so [0] [1] [2].

It's fair to worry about this, but I don't think my testing that would
prove a lot. AFAICS, whether somebody runs into trouble would depend
on many factors like their specific build process and what versions of
which packages they have installed.

In any case, I think we have a very limited amount of wiggle room.
We definitely cannot change libpq's ABI without provoking howls of
anguish.

I have been wondering whether it would help to add something like
this at the end of port/pqsignal.c in the back branches:

#ifdef FRONTEND
#undef pqsignal

/* ABI-compatibility wrapper */
pqsigfunc
pqsignal(int signo, pqsigfunc func)
{
return pqsignal_fe(signo, func);
}
#endif

(plus or minus an extern or so, but you get the idea). The point of
this is that compiling against old headers and then linking against
newer libpgport.a would still work. It does nothing however for the
reverse scenario of compiling against new headers and then linking
against old libpgport.a. So I haven't persuaded myself that it's
worth the trouble -- but I'm happy to include it if others think
it would help.

regards, tom lane

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#12)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

On Mon, Jan 13, 2025 at 05:51:54PM -0500, Tom Lane wrote:

It's fair to worry about this, but I don't think my testing that would
prove a lot. AFAICS, whether somebody runs into trouble would depend
on many factors like their specific build process and what versions of
which packages they have installed.

In any case, I think we have a very limited amount of wiggle room.
We definitely cannot change libpq's ABI without provoking howls of
anguish.

Fair point.

I have been wondering whether it would help to add something like
this at the end of port/pqsignal.c in the back branches:

#ifdef FRONTEND
#undef pqsignal

/* ABI-compatibility wrapper */
pqsigfunc
pqsignal(int signo, pqsigfunc func)
{
return pqsignal_fe(signo, func);
}
#endif

(plus or minus an extern or so, but you get the idea). The point of
this is that compiling against old headers and then linking against
newer libpgport.a would still work. It does nothing however for the
reverse scenario of compiling against new headers and then linking
against old libpgport.a. So I haven't persuaded myself that it's
worth the trouble -- but I'm happy to include it if others think
it would help.

After sleeping on it, I think I agree that the extra gymnastics aren't
worth it to partially fix something that wasn't supported anyway. But I'm
not mortally opposed to it if someone feels strongly that it should be
included.

--
nathan

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#13)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Nathan Bossart <nathandbossart@gmail.com> writes:

On Mon, Jan 13, 2025 at 05:51:54PM -0500, Tom Lane wrote:

(plus or minus an extern or so, but you get the idea). The point of
this is that compiling against old headers and then linking against
newer libpgport.a would still work. It does nothing however for the
reverse scenario of compiling against new headers and then linking
against old libpgport.a. So I haven't persuaded myself that it's
worth the trouble -- but I'm happy to include it if others think
it would help.

After sleeping on it, I think I agree that the extra gymnastics aren't
worth it to partially fix something that wasn't supported anyway. But I'm
not mortally opposed to it if someone feels strongly that it should be
included.

After more thought I've realized that the asymmetrical detection
here isn't all that bad, because the outcomes are different.
If we fail to catch old-headers-and-new-library, the result will
either be a link failure or (if the extension uses libpq) silently
linking to libpq's pqsignal, which was likely not what was intended.
If we fail to catch the other case, the result is always a link
failure, and that will happen at build time not in the field.

So now I'm inclined to include the ABI-compatible wrapper, which
will ensure that extensions continue to link to libpgport's pqsignal.

regards, tom lane

#15Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#14)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

On Tue, Jan 14, 2025 at 02:52:29PM -0500, Tom Lane wrote:

After more thought I've realized that the asymmetrical detection
here isn't all that bad, because the outcomes are different.
If we fail to catch old-headers-and-new-library, the result will
either be a link failure or (if the extension uses libpq) silently
linking to libpq's pqsignal, which was likely not what was intended.
If we fail to catch the other case, the result is always a link
failure, and that will happen at build time not in the field.

Assuming libpgport is only used as a static library, I think that makes
sense. My web searches indicate that it is possible to do things like
convert a static library to a dynamic one, but perhaps that's far enough
beyond the realm of things we care about.

So now I'm inclined to include the ABI-compatible wrapper, which
will ensure that extensions continue to link to libpgport's pqsignal.

Fine by me.

--
nathan

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#15)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Jan 14, 2025 at 02:52:29PM -0500, Tom Lane wrote:

So now I'm inclined to include the ABI-compatible wrapper, which
will ensure that extensions continue to link to libpgport's pqsignal.

Fine by me.

OK, I'll make it so. Thanks for looking at it!

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: pgbench error: (setshell) of script 0; execution of meta-command failed

I wrote:

OK, I'll make it so. Thanks for looking at it!

Or not. My idea worked okay in v17, but not in older branches.
Pre-v17, libpq itself can call pqsignal (though only in non-
thread-safe builds). With this patch, that would have resulted
in pulling src/port/pqsignal.o into libpq. libpq itself is fine
with calling that version, but if a stub version of pqsignal()
comes along for the ride then we have two candidates for which
function will be exported.

It would probably be possible to fix that, say by making the
wrapper version be a separate .o module within libpgport.
But it would be more work and complication, and I couldn't
get excited about investing such effort for a hypothetical
build problem. So I've pushed the patches as I originally
proposed them. If anyone else is excited about doing something
more, step right up.

regards, tom lane