Is SPI + Procedures (with COMMIT) inside a bgworker broken?
Hi all,
I'm trying to execute a PROCEDURE (with COMMIT inside) called from a
background worker using SPI but I'm always getting the error below:
2021-09-13 09:36:43.568 -03 [23845] LOG: worker_spi worker 2 initialized
with schema2.counted
2021-09-13 09:36:43.568 -03 [23846] LOG: worker_spi worker 1 initialized
with schema1.counted
2021-09-13 09:36:43.571 -03 [23846] ERROR: invalid transaction termination
2021-09-13 09:36:43.571 -03 [23846] CONTEXT: PL/pgSQL function
schema1.counted_proc() line 1 at COMMIT
SQL statement "CALL "schema1"."counted_proc"()"
2021-09-13 09:36:43.571 -03 [23846] STATEMENT: CALL
"schema1"."counted_proc"()
2021-09-13 09:36:43.571 -03 [23845] ERROR: invalid transaction termination
2021-09-13 09:36:43.571 -03 [23845] CONTEXT: PL/pgSQL function
schema2.counted_proc() line 1 at COMMIT
SQL statement "CALL "schema2"."counted_proc"()"
2021-09-13 09:36:43.571 -03 [23845] STATEMENT: CALL
"schema2"."counted_proc"()
2021-09-13 09:36:43.571 -03 [23838] LOG: background worker "worker_spi"
(PID 23845) exited with exit code 1
2021-09-13 09:36:43.571 -03 [23838] LOG: background worker "worker_spi"
(PID 23846) exited with exit code 1
I changed the worker_spi example (attached) a bit to execute a simple
procedure. Even using SPI_connect_ext(SPI_OPT_NONATOMIC) I'm getting the
error "invalid transaction termination".
There are something wrong with the attached example or am I missing
something?
Regards,
--
Fabrízio de Royes Mello
Attachments:
worker_spi_procedure.patchtext/x-patch; charset=US-ASCII; name=worker_spi_procedure.patchDownload+20-35
=?UTF-8?Q?Fabr=C3=ADzio_de_Royes_Mello?= <fabriziomello@gmail.com> writes:
I'm trying to execute a PROCEDURE (with COMMIT inside) called from a
background worker using SPI but I'm always getting the error below:
2021-09-13 09:36:43.571 -03 [23846] ERROR: invalid transaction termination
The direct cause of that is that SPI_execute() doesn't permit the called
query to perform COMMIT/ROLLBACK, which is because most callers would fail
to cope with that. You can instruct SPI to allow that by replacing the
SPI_execute() call with something like
SPIExecuteOptions options;
...
memset(&options, 0, sizeof(options));
options.allow_nonatomic = true;
ret = SPI_execute_extended(buf.data, &options);
However, that's not enough to make this example work :-(.
I find that it still fails inside the procedure's COMMIT,
with
2021-09-13 15:14:54.775 EDT worker_spi[476310] ERROR: portal snapshots (0) did not account for all active snapshots (1)
2021-09-13 15:14:54.775 EDT worker_spi[476310] CONTEXT: PL/pgSQL function schema4.counted_proc() line 1 at COMMIT
SQL statement "CALL "schema4"."counted_proc"()"
I think what this indicates is that worker_spi_main's cavalier
management of the active snapshot isn't up to snuff for this
use-case. The error is coming from ForgetPortalSnapshots, which
is expecting that all active snapshots are attached to Portals;
but that one isn't.
Probably the most appropriate fix is to make worker_spi_main
set up a Portal to run the query inside of. There are other
bits of code that are not happy if they're not inside a Portal,
so if you're hoping to run arbitrary SQL this way, sooner or
later you're going to have to cross that bridge.
(I remain of the opinion that replication/logical/worker.c
is going to have to do that eventually, too...)
regards, tom lane
On Mon, Sep 13, 2021 at 4:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The direct cause of that is that SPI_execute() doesn't permit the called
query to perform COMMIT/ROLLBACK, which is because most callers would fail
to cope with that. You can instruct SPI to allow that by replacing the
SPI_execute() call with something likeSPIExecuteOptions options;
...
memset(&options, 0, sizeof(options));
options.allow_nonatomic = true;ret = SPI_execute_extended(buf.data, &options);
I completely forgot about the SPI execute options... Thanks for the
explanation!!!
However, that's not enough to make this example work :-(.
I find that it still fails inside the procedure's COMMIT,
with2021-09-13 15:14:54.775 EDT worker_spi[476310] ERROR: portal snapshots
(0) did not account for all active snapshots (1)
2021-09-13 15:14:54.775 EDT worker_spi[476310] CONTEXT: PL/pgSQL
function schema4.counted_proc() line 1 at COMMIT
SQL statement "CALL "schema4"."counted_proc"()"
I think what this indicates is that worker_spi_main's cavalier
management of the active snapshot isn't up to snuff for this
use-case. The error is coming from ForgetPortalSnapshots, which
is expecting that all active snapshots are attached to Portals;
but that one isn't.
That is exactly the root cause of all my investigation.
At Timescale we have a scheduler (background worker) that launches another
background worker to "execute a job", and by executing a job it means to
call a function [1]https://github.com/timescale/timescaledb/blob/master/tsl/src/bgw_policy/job.c#L726 or a procedure [2]https://github.com/timescale/timescaledb/blob/master/tsl/src/bgw_policy/job.c#L741 directly without a SPI.
But now a user raised an issue about snapshots [3]https://github.com/timescale/timescaledb/issues/3545 and when I saw the code
for the first time I tried to use SPI and it didn't work as expected.
Even tweaking worker_spi to execute the procedure without SPI by calling
ExecuteCallStmt (attached) we end up with the same situation about the
active snapshots:
2021-09-13 20:14:36.654 -03 [21483] LOG: worker_spi worker 2 initialized
with schema2.counted
2021-09-13 20:14:36.655 -03 [21484] LOG: worker_spi worker 1 initialized
with schema1.counted
2021-09-13 20:14:36.657 -03 [21483] ERROR: portal snapshots (0) did not
account for all active snapshots (1)
2021-09-13 20:14:36.657 -03 [21483] CONTEXT: PL/pgSQL function
schema2.counted_proc() line 1 at COMMIT
2021-09-13 20:14:36.657 -03 [21484] ERROR: portal snapshots (0) did not
account for all active snapshots (1)
2021-09-13 20:14:36.657 -03 [21484] CONTEXT: PL/pgSQL function
schema1.counted_proc() line 1 at COMMIT
2021-09-13 20:14:36.659 -03 [21476] LOG: background worker "worker_spi"
(PID 21483) exited with exit code 1
2021-09-13 20:14:36.659 -03 [21476] LOG: background worker "worker_spi"
(PID 21484) exited with exit code 1
Probably the most appropriate fix is to make worker_spi_main
set up a Portal to run the query inside of. There are other
bits of code that are not happy if they're not inside a Portal,
so if you're hoping to run arbitrary SQL this way, sooner or
later you're going to have to cross that bridge.
I started digging with it [4]https://github.com/fabriziomello/timescaledb/blob/issue/3545/tsl/src/bgw_policy/job.c#L824 by creating a Portal from scratch to execute
the Function or Procedure and it worked.
We're wondering if we can avoid the parser for PortalRun, can we??
Regards,
[1]: https://github.com/timescale/timescaledb/blob/master/tsl/src/bgw_policy/job.c#L726
https://github.com/timescale/timescaledb/blob/master/tsl/src/bgw_policy/job.c#L726
[2]: https://github.com/timescale/timescaledb/blob/master/tsl/src/bgw_policy/job.c#L741
https://github.com/timescale/timescaledb/blob/master/tsl/src/bgw_policy/job.c#L741
[3]: https://github.com/timescale/timescaledb/issues/3545
[4]: https://github.com/fabriziomello/timescaledb/blob/issue/3545/tsl/src/bgw_policy/job.c#L824
https://github.com/fabriziomello/timescaledb/blob/issue/3545/tsl/src/bgw_policy/job.c#L824
--
Fabrízio de Royes Mello