[Proposal] Add foreign-server health checks infrastructure
Dear Hackers,
I want to propose the feature that checks the health of foreign servers.
As a first step I want to add an infrastructure for periodical checking to PostgreSQL core.
Currently this is the WIP, it does not contain docs.
## Background
Currently there is no way to check the status of an foreign server in PostgreSQL.
If an foreign server's postmaster goes down, or if the network between servers is lost,
the backend process will only detect these when it uses the connection corresponding to that foreign server.
Consider a workload that updates data on an foreign server only at the beginning of a transaction,
and runs a lot of local SQLs. Even if the network is disconnected immediately after accessing the foreign server,
the backend process will continue to execute local SQLs without realizing it.
The process will eventually finish to execute SQLs and try to commit.
Only then will it realize that the foreign server cannot be connect and will abort the transaction.
This situation should be detected as soon as possible
because it is impossible to commit a transaction when the foreign server goes down.
This can be more of a problem if you have system-wide downtime requirements.
That's why I want to implement the health-check feature to postgres.
## Design
In general, PostgreSQL can have a variety of RDBMSs as foreign servers,
so the core cannot support all of them.
Therefore, I propose a method to leave the monitoring of the foreign server to each FDW extensions
and register it as a callback function on the body side.
The attached patch adds this monitoring infrastructure to core.
Within the callback functions, it is expected
that each FDWs will check the state of the connection they hold and call ereport (ERROR)
if it cannot connect to someone.
Of course, you can also have the callback function return false.
There is no particular reason to choose the current method.
Callback functions will be called periodically.
## Implementation
This patch introduces a new timeout and a new GUC parameter. GUC controls the timeout interval.
The timeout takes effect when the callback function is first registered,
before each SQL command is executed, and at the end of the timeout.
This implementation is based on the client_connection_check_interval and other timeouts.
## Further work
As the next step I have a plan to implement the callback function to postgres_fdw.
I already made a PoC, but it deeply depends on the following thread:
https://commitfest.postgresql.org/35/3098/
I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand such a dependency
and will throw build error. Do you know how to deal with them in this case?
Your comments and suggestions are very welcome.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v01_add_checking_infrastracture.patchapplication/octet-stream; name=v01_add_checking_infrastracture.patchDownload+139-2
Hi,
Thank you for the patch!
On 2021-10-30 12:50, kuroda.hayato@fujitsu.com wrote:
## Background
Currently there is no way to check the status of an foreign server in
PostgreSQL.
If an foreign server's postmaster goes down, or if the network between
servers is lost,
the backend process will only detect these when it uses the connection
corresponding to that foreign server.Consider a workload that updates data on an foreign server only at the
beginning of a transaction,
and runs a lot of local SQLs. Even if the network is disconnected
immediately after accessing the foreign server,
the backend process will continue to execute local SQLs without
realizing it.The process will eventually finish to execute SQLs and try to commit.
Only then will it realize that the foreign server cannot be connect
and will abort the transaction.
This situation should be detected as soon as possible
because it is impossible to commit a transaction when the foreign
server goes down.
This can be more of a problem if you have system-wide downtime
requirements.
That's why I want to implement the health-check feature to postgres.
It's a good idea. I also think such a situation should be avoided.
## Further work
As the next step I have a plan to implement the callback function to
postgres_fdw.
I already made a PoC, but it deeply depends on the following thread:
https://commitfest.postgresql.org/35/3098/I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand
such a dependency
and will throw build error. Do you know how to deal with them in this
case?
I don't know how to deal with them, but I hope you will attach the PoC,
as it may be easier to review.
--
Regards,
--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Dear Kato-san,
Thank you for your interest!
I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand
such a dependency
and will throw build error. Do you know how to deal with them in this
case?I don't know how to deal with them, but I hope you will attach the PoC,
as it may be easier to review.
OK, I attached the PoC along with the dependent patches. Please see the zip file.
add_helth_check_... patch is written by me, and other two patches are
just copied from [1]https://commitfest.postgresql.org/35/3098/.
In the new callback function ConnectionHash is searched sequentially and
WaitEventSetWait() is performed for WL_SOCKET_CLOSED socket event.
This event is added by the dependent ones.
===
How to use
===
I'll explain how to use it briefly.
1. boot two postmaster processes. One is coordinator, and another is worker
2. set remote_servers_connection_check_interval to non-zero value at the coordinator
3. create tables to worker DB-cluster.
4. create foreign server, user mapping, and foreign table to coordinator.
5. connect to coordinator via psql.
6. open a transaction and access to foreing tables.
7. do "pg_ctl stop" command to woker DB-cluser.
8. execute some commands that does not access an foreign table.
9. Finally the following output will be get:
ERROR: Postgres foreign server XXX might be down.
===
Example in some steps
===
3. at worker
```
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+--------+-------+--------
public | remote | table | hayato
(1 row)
```
4. at coordinator
```
postgres=# select * from pg_foreign_server ;
oid | srvname | srvowner | srvfdw | srvtype | srvversion | srvacl | srvoptions
-------+---------+----------+--------+---------+------------+--------+-----------------------------
16406 | remote | 10 | 16402 | | | | {port=5433,dbname=postgres}
(1 row)
postgres=# select * from pg_user_mapping ;
oid | umuser | umserver | umoptions
-------+--------+----------+---------------
16407 | 10 | 16406 | {user=hayato}
(1 row)
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+--------+---------------+--------
public | local | table | hayato
public | remote | foreign table | hayato
(2 rows)
```
6-9. at coordinator
```
postgres=# begin;
BEGIN
postgres=*# select * from remote ;
id
----
1
(1 row)
postgres=*# select * from local ;
ERROR: Postgres foreign server remote might be down.
postgres=!#
```
Note that some keepalive settings are needed
if you want to detect cable breakdown events.
In my understanding following parameters are needed as server options:
* keepalives_idle
* keepalives_count
* keepalives_interval
[1]: https://commitfest.postgresql.org/35/3098/
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On 2021-11-18 21:43, kuroda.hayato@fujitsu.com wrote:
Dear Kato-san,
Thank you for your interest!
I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand
such a dependency
and will throw build error. Do you know how to deal with them in this
case?I don't know how to deal with them, but I hope you will attach the
PoC,
as it may be easier to review.OK, I attached the PoC along with the dependent patches. Please see
the zip file.
add_helth_check_... patch is written by me, and other two patches are
just copied from [1].
In the new callback function ConnectionHash is searched sequentially
and
WaitEventSetWait() is performed for WL_SOCKET_CLOSED socket event.
This event is added by the dependent ones.
Thank you for sending the patches!
I confirmed that they can be compiled and tested successfully on CentOS
8.
I haven't looked at the core of the code yet, but I took a quick look at
it.
+ {
+ {"remote_servers_connection_check_interval", PGC_USERSET,
CONN_AUTH_SETTINGS,
+ gettext_noop("Sets the time interval between checks for
disconnection of remote servers."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &remote_servers_connection_check_interval,
+ 0, 0, INT_MAX,
+ },
If you don't use check_hook, assign_hook and show_hook, you should
explicitly write "NULL, NULL, NULL", as show below.
+ {
+ {"remote_servers_connection_check_interval", PGC_USERSET,
CONN_AUTH_SETTINGS,
+ gettext_noop("Sets the time interval between checks for
disconnection of remote servers."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &remote_servers_connection_check_interval,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+ ereport(ERROR,
+ errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("Postgres foreign server %s might be down.",
+ server->servername));
According to [1]https://www.postgresql.org/docs/devel/error-style-guide.html, error messages should start with a lowercase letter
and not use a period.
Also, along with the rest of the code, it is a good idea to enclose the
server name in double quotes.
I'll get back to you once I've read all the code.
[1]: https://www.postgresql.org/docs/devel/error-style-guide.html
--
Regards,
--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Dear Kato-san,
Thank you for reviewing!
Thank you for sending the patches!
I confirmed that they can be compiled and tested successfully on CentOS
8.
Thanks!
+ { + {"remote_servers_connection_check_interval", PGC_USERSET, CONN_AUTH_SETTINGS, + gettext_noop("Sets the time interval between checks for disconnection of remote servers."), + NULL, + GUC_UNIT_MS + }, + &remote_servers_connection_check_interval, + 0, 0, INT_MAX, + },If you don't use check_hook, assign_hook and show_hook, you should
explicitly write "NULL, NULL, NULL", as show below.
Yeah I forgot the line. Fixed.
+ ereport(ERROR, + errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("Postgres foreign server %s might be down.", + server->servername));According to [1], error messages should start with a lowercase letter
and not use a period.
Also, along with the rest of the code, it is a good idea to enclose the
server name in double quotes.
I confirmed the postgres error-reporting policy and fixed to follow that.
How do you think?
Attached are the latest version.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Kato-san,
I found the missing space, and I added a test.
I'm very happy if you review this.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, Nov 23, 2021 at 8:57 PM kuroda.hayato@fujitsu.com <
kuroda.hayato@fujitsu.com> wrote:
Dear Kato-san,
Thank you for reviewing!
Thank you for sending the patches!
I confirmed that they can be compiled and tested successfully on CentOS
8.Thanks!
+ { + {"remote_servers_connection_check_interval", PGC_USERSET, CONN_AUTH_SETTINGS, + gettext_noop("Sets the time interval between checks for disconnection of remote servers."), + NULL, + GUC_UNIT_MS + }, + &remote_servers_connection_check_interval, + 0, 0, INT_MAX, + },If you don't use check_hook, assign_hook and show_hook, you should
explicitly write "NULL, NULL, NULL", as show below.Yeah I forgot the line. Fixed.
+ ereport(ERROR, + errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("Postgres foreign server %s might be down.", + server->servername));According to [1], error messages should start with a lowercase letter
and not use a period.
Also, along with the rest of the code, it is a good idea to enclose the
server name in double quotes.I confirmed the postgres error-reporting policy and fixed to follow that.
How do you think?Attached are the latest version.
Best Regards,
Hayato Kuroda
FUJITSU LIMITEDHi,
+#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()
(CheckingRemoteServersHoldoffCount++)
The macro contains only one operation. Can the macro be removed (with
`CheckingRemoteServersHoldoffCount++` inlined) ?
+ if (CheckingRemoteServersTimeoutPending &&
CheckingRemoteServersHoldoffCount != 0)
+ {
+ /*
+ * Skip checking foreign servers while reading messages.
+ */
+ InterruptPending = true;
+ }
+ else if (CheckingRemoteServersTimeoutPending)
Would the code be more readable if the above two if blocks be moved inside
one enclosing if block (factoring the common condition)?
+ if (CheckingRemoteServersTimeoutPending)
Cheers
Dear Zhihong,
Thank you for giving comments! I'll post new patches later.
+#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS() (CheckingRemoteServersHoldoffCount++)
The macro contains only one operation. Can the macro be removed (with `CheckingRemoteServersHoldoffCount++` inlined) ?
Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and HOLD_CANCEL_INTERRUPTS():
```
#define HOLD_INTERRUPTS() (InterruptHoldoffCount++)
#define RESUME_INTERRUPTS() \
do { \
Assert(InterruptHoldoffCount > 0); \
InterruptHoldoffCount--; \
} while(0)
#define HOLD_CANCEL_INTERRUPTS() (QueryCancelHoldoffCount++)
#define RESUME_CANCEL_INTERRUPTS() \
do { \
Assert(QueryCancelHoldoffCount > 0); \
QueryCancelHoldoffCount--; \
} while(0)
#define START_CRIT_SECTION() (CritSectionCount++)
#define END_CRIT_SECTION() \
do { \
Assert(CritSectionCount > 0); \
CritSectionCount--; \
} while(0)
```
So I want to keep the current style. Could you tell me if you have any other reasons?
+ if (CheckingRemoteServersTimeoutPending && CheckingRemoteServersHoldoffCount != 0) + { + /* + * Skip checking foreign servers while reading messages. + */ + InterruptPending = true; + } + else if (CheckingRemoteServersTimeoutPending)Would the code be more readable if the above two if blocks be moved inside one enclosing if block (factoring the common condition)?
+ if (CheckingRemoteServersTimeoutPending)
+1. Will fix.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Hackers,
I attached new version that is based from Zhihong's comments.
And I newly attached a doc patch. I think the infrastructure part is no longer WIP.
Could you review them?
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Nov 29, 2021 at 12:51 AM kuroda.hayato@fujitsu.com <
kuroda.hayato@fujitsu.com> wrote:
Dear Zhihong,
Thank you for giving comments! I'll post new patches later.
+#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()
(CheckingRemoteServersHoldoffCount++)
The macro contains only one operation. Can the macro be removed (with
`CheckingRemoteServersHoldoffCount++` inlined) ?
Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and
HOLD_CANCEL_INTERRUPTS():```
#define HOLD_INTERRUPTS() (InterruptHoldoffCount++)#define RESUME_INTERRUPTS() \
do { \
Assert(InterruptHoldoffCount > 0); \
InterruptHoldoffCount--; \
} while(0)#define HOLD_CANCEL_INTERRUPTS() (QueryCancelHoldoffCount++)
#define RESUME_CANCEL_INTERRUPTS() \
do { \
Assert(QueryCancelHoldoffCount > 0); \
QueryCancelHoldoffCount--; \
} while(0)#define START_CRIT_SECTION() (CritSectionCount++)
#define END_CRIT_SECTION() \
do { \
Assert(CritSectionCount > 0); \
CritSectionCount--; \
} while(0)
```So I want to keep the current style. Could you tell me if you have any
other reasons?+ if (CheckingRemoteServersTimeoutPending &&
CheckingRemoteServersHoldoffCount != 0)
+ { + /* + * Skip checking foreign servers while reading messages. + */ + InterruptPending = true; + } + else if (CheckingRemoteServersTimeoutPending)Would the code be more readable if the above two if blocks be moved
inside one enclosing if block (factoring the common condition)?
+ if (CheckingRemoteServersTimeoutPending)
+1. Will fix.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Hi,
It is Okay to keep the macros.
Thanks
On 2021-11-29 21:36, Zhihong Yu wrote:
On Mon, Nov 29, 2021 at 12:51 AM kuroda.hayato@fujitsu.com
<kuroda.hayato@fujitsu.com> wrote:Dear Zhihong,
Thank you for giving comments! I'll post new patches later.
+#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()
(CheckingRemoteServersHoldoffCount++)
The macro contains only one operation. Can the macro be removed
(with `CheckingRemoteServersHoldoffCount++` inlined) ?
Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and
HOLD_CANCEL_INTERRUPTS():```
#define HOLD_INTERRUPTS() (InterruptHoldoffCount++)#define RESUME_INTERRUPTS() \
do { \
Assert(InterruptHoldoffCount > 0); \
InterruptHoldoffCount--; \
} while(0)#define HOLD_CANCEL_INTERRUPTS() (QueryCancelHoldoffCount++)
#define RESUME_CANCEL_INTERRUPTS() \
do { \
Assert(QueryCancelHoldoffCount > 0); \
QueryCancelHoldoffCount--; \
} while(0)#define START_CRIT_SECTION() (CritSectionCount++)
#define END_CRIT_SECTION() \
do { \
Assert(CritSectionCount > 0); \
CritSectionCount--; \
} while(0)
```So I want to keep the current style. Could you tell me if you have
any other reasons?+ if (CheckingRemoteServersTimeoutPending &&
CheckingRemoteServersHoldoffCount != 0)
+ { + /* + * Skip checking foreign servers while reading messages. + */ + InterruptPending = true; + } + else if (CheckingRemoteServersTimeoutPending)Would the code be more readable if the above two if blocks be
moved inside one enclosing if block (factoring the common
condition)?+ if (CheckingRemoteServersTimeoutPending)
+1. Will fix.
Best Regards,
Hayato Kuroda
FUJITSU LIMITEDHi,
It is Okay to keep the macros.
Thanks
Hi, Kuroda-san. Sorry for late reply.
Even for local-only transaction, I thought it useless to execute
CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
make it so that it determines outside whether it contains SQL to the
remote or not?
The following points are minor.
1. In foreign.c, some of the lines are too long and should be broken.
2. In pqcomm.c, the newline have been removed, what is the intention of
this?
3. In postgres.c,
3-1. how about inserting a comment between lines 2713 and 2714, similar
to line 2707?
3-2. the line breaks in line 3242 and line 3243 should be aligned.
3-3. you should change
/*
* Skip checking foreign servers while reading messages.
*/
to
/*
* Skip checking foreign servers while reading messages.
*/
4. In connection.c, There is a typo in line 1684, so "fucntion" should
be changed to "function".
--
Regards,
--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Dear Kato-san,
Thank you for giving comments! And sorry for late reply.
I rebased my patches.
Even for local-only transaction, I thought it useless to execute
CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
make it so that it determines outside whether it contains SQL to the
remote or not?
Yeah, remote-checking timeout will be enable even ifa local transaction is opened.
In my understanding, postgres cannot distinguish whether opening transactions
are using only local object or not.
My first idea was that turning on the timeout when GetFdwRoutineXXX functions
were called, but in some cases FDWs may not use remote connection even if
they call such a handler function. e.g. postgresExplainForeignScan().
Hence I tried another approach that unregister all checking callback
the end of each transactions. Only FDWs can notice that transactions are remote or local,
so they must register callback when they really connect to other database.
This implementation will avoid calling remote checking in the case of local transaction,
but multiple registering and unregistering may lead another overhead.
I attached which implements that.
The following points are minor.
1. In foreign.c, some of the lines are too long and should be broken.
2. In pqcomm.c, the newline have been removed, what is the intention of
this?
3. In postgres.c,
3-1. how about inserting a comment between lines 2713 and 2714, similar
to line 2707?
3-2. the line breaks in line 3242 and line 3243 should be aligned.
3-3. you should change
/*
* Skip checking foreign servers while reading
messages.
*/
to
/*
* Skip checking foreign servers while reading
messages.
*/
4. In connection.c, There is a typo in line 1684, so "fucntion" should
be changed to "function".
Maybe all of them were fixed. Thanks!
How do you think?
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Thank you for the new patch!
On 2021-12-15 15:40, kuroda.hayato@fujitsu.com wrote:
Dear Kato-san,
Thank you for giving comments! And sorry for late reply.
I rebased my patches.Even for local-only transaction, I thought it useless to execute
CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
make it so that it determines outside whether it contains SQL to the
remote or not?Yeah, remote-checking timeout will be enable even ifa local
transaction is opened.
In my understanding, postgres cannot distinguish whether opening
transactions
are using only local object or not.
My first idea was that turning on the timeout when GetFdwRoutineXXX
functions
were called, but in some cases FDWs may not use remote connection even
if
they call such a handler function. e.g. postgresExplainForeignScan().
Hence I tried another approach that unregister all checking callback
the end of each transactions. Only FDWs can notice that transactions
are remote or local,
so they must register callback when they really connect to other
database.
This implementation will avoid calling remote checking in the case of
local transaction,
but multiple registering and unregistering may lead another overhead.
I attached which implements that.
It certainly incurs another overhead, but I think it's better than the
previous one.
So far, I haven't encountered any problems, but I'd like other people's
opinions.
The following points are minor.
1. In foreign.c, some of the lines are too long and should be broken.
2. In pqcomm.c, the newline have been removed, what is the intention
of
this?
3. In postgres.c,
3-1. how about inserting a comment between lines 2713 and 2714,
similar
to line 2707?
3-2. the line breaks in line 3242 and line 3243 should be aligned.
3-3. you should change
/*
* Skip checking foreign servers while reading
messages.
*/
to
/*
* Skip checking foreign servers while reading
messages.
*/
4. In connection.c, There is a typo in line 1684, so "fucntion" should
be changed to "function".Maybe all of them were fixed. Thanks!
Thank you, and it looks good to me.
--
Regards,
--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Jan 4, 2022 at 10:21 PM Shinya Kato <Shinya11.Kato@oss.nttdata.com>
wrote:
Thank you for the new patch!
On 2021-12-15 15:40, kuroda.hayato@fujitsu.com wrote:
Dear Kato-san,
Thank you for giving comments! And sorry for late reply.
I rebased my patches.Even for local-only transaction, I thought it useless to execute
CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
make it so that it determines outside whether it contains SQL to the
remote or not?Yeah, remote-checking timeout will be enable even ifa local
transaction is opened.
In my understanding, postgres cannot distinguish whether opening
transactions
are using only local object or not.
My first idea was that turning on the timeout when GetFdwRoutineXXX
functions
were called, but in some cases FDWs may not use remote connection even
if
they call such a handler function. e.g. postgresExplainForeignScan().
Hence I tried another approach that unregister all checking callback
the end of each transactions. Only FDWs can notice that transactions
are remote or local,
so they must register callback when they really connect to other
database.
This implementation will avoid calling remote checking in the case of
local transaction,
but multiple registering and unregistering may lead another overhead.
I attached which implements that.It certainly incurs another overhead, but I think it's better than the
previous one.
So far, I haven't encountered any problems, but I'd like other people's
opinions.The following points are minor.
1. In foreign.c, some of the lines are too long and should be broken.
2. In pqcomm.c, the newline have been removed, what is the intention
of
this?
3. In postgres.c,
3-1. how about inserting a comment between lines 2713 and 2714,
similar
to line 2707?
3-2. the line breaks in line 3242 and line 3243 should be aligned.
3-3. you should change
/*
* Skip checking foreign servers while reading
messages.
*/
to
/*
* Skip checking foreign servers while reading
messages.
*/
4. In connection.c, There is a typo in line 1684, so "fucntion" should
be changed to "function".Maybe all of them were fixed. Thanks!
Thank you, and it looks good to me.
Hi,
+ UnregisterAllCheckingRemoteServersCallback();
UnregisterAllCheckingRemoteServersCallback
-> UnregisterAllCheckingRemoteServersCallbacks
+ CheckingRemoteServersCallbackItem *item;
+ item = fdw_callbacks;
The above two lines can be combined.
+UnregisterCheckingRemoteServersCallback(CheckingRemoteServersCallback
callback,
+ void *arg)
Is the above func called anywhere ?
+ if (item->callback == callback && item->arg == arg)
Is comparing argument pointers stable ? What if the same argument is cloned
?
+ CallCheckingRemoteServersCallbacks();
+
+ if (remote_servers_connection_check_interval > 0)
+ enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
+
remote_servers_connection_check_interval);
Should the call to CallCheckingRemoteServersCallbacks() be placed under the
if block checking remote_servers_connection_check_interval ?
If remote_servers_connection_check_interval is 0, it seems the callbacks
shouldn't be called.
Cheers
On 2021/12/15 15:40, kuroda.hayato@fujitsu.com wrote:
Yeah, remote-checking timeout will be enable even ifa local transaction is opened.
In my understanding, postgres cannot distinguish whether opening transactions
are using only local object or not.
My first idea was that turning on the timeout when GetFdwRoutineXXX functions
were called,
What about starting the timeout in GetConnection(), instead?
I attached which implements that.
v05_0004_add_tests.patch failed to be applied to the master. Could you rebase it?
- This option is currently available only on systems that support the
- non-standard <symbol>POLLRDHUP</symbol> extension to the
- <symbol>poll</symbol> system call, including Linux.
+ This option relies on kernel events exposed by Linux, macOS, illumos
+ and the BSD family of operating systems, and is not currently available
+ on other systems.
The above change is included in both v5-0003-Use-WL_SOCKET_CLOSED-for-client_connection_check_.patch and v05_0002_add_doc.patch. If it should be in the former patch, it should be removed from your patch v05_0002_add_doc.patch.
There seems no user of UnregisterCheckingRemoteServersCallback(). So how about removing it?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Dear Zhihong,
Thank you for reviewing! And I have to apologize for the late.
I'll post new patchset later.
+ UnregisterAllCheckingRemoteServersCallback();
UnregisterAllCheckingRemoteServersCallback
-> UnregisterAllCheckingRemoteServersCallbacks
Will fix.
+ CheckingRemoteServersCallbackItem *item; + item = fdw_callbacks;The above two lines can be combined.
Will fix.
+UnregisterCheckingRemoteServersCallback(CheckingRemoteServersCallback callback, + void *arg)Is the above func called anywhere ?
Currently not, I just kept the API because of any other FDW extensions.
But I cannot find any use cases, so will remove.
+ if (item->callback == callback && item->arg == arg)
Is comparing argument pointers stable ? What if the same argument is cloned
?
This part is no longer needed. Will remove.
+ CallCheckingRemoteServersCallbacks(); + + if (remote_servers_connection_check_interval > 0) + enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT, + remote_servers_connection_check_interval);Should the call to CallCheckingRemoteServersCallbacks() be placed under the
if block checking remote_servers_connection_check_interval ?
If remote_servers_connection_check_interval is 0, it seems the callbacks
shouldn't be called.
Agreed. We force stopping timeout when the GUC sets to 0 in assign_hook,
so your saying is consistent. Will move.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Fujii-san,
Thank you for your interest! I'll post new version within several days.
Yeah, remote-checking timeout will be enable even ifa local transaction is
opened.
In my understanding, postgres cannot distinguish whether opening transactions
are using only local object or not.
My first idea was that turning on the timeout when GetFdwRoutineXXXfunctions
were called,
What about starting the timeout in GetConnection(), instead?
Did you said about a function in postgres_fdw/connection.c?
In my understanding that means that the timeout should be enabled or disabled
by each FDW extensions.
I did not find bad cases for that, so I'll change like that and make new APIs.
v05_0004_add_tests.patch failed to be applied to the master. Could you rebase it?
It's caused because a testcase was added in postgres_fdw. Will rebase.
The above change is included in both
v5-0003-Use-WL_SOCKET_CLOSED-for-client_connection_check_.patch and
v05_0002_add_doc.patch. If it should be in the former patch, it should be removed
from your patch v05_0002_add_doc.patch.
I confused about doc-patch. Sorry for inconvenience.
There seems no user of UnregisterCheckingRemoteServersCallback(). So how
about removing it?
Previously I kept the API for any other extensions, but I cannot find use cases.
Will remove.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Fujii-san, Zhihong,
I attached the latest version.
The biggest change is that callbacks are no longer un-registered at the end of transactions.
FDW developer must enable or disable timeout instead, via new APIs.
The timer will be turned on when:
* new GUC is >= 0, and
* anyone calls TryEnableRemoteServerCheckingTimeout().
I think this version is reduced overhead, but it might not be developer friendly...
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On 2022/01/21 15:08, kuroda.hayato@fujitsu.com wrote:
Dear Fujii-san, Zhihong,
I attached the latest version.
Thanks for updating the patch!
+int
+TryDisableRemoteServerCheckingTimeout(void)
When more than one FDWs are used, even if one FDW calls this function to disable the timeout, its remote-server-check-callback can still be called. Is this OK?
Please imagine the case where two FDWs are used and they registered their callbacks. Even when one FDW calls TryDisableRemoteServerCheckingTimeout(), if another FDW has not called that yet, the timeout is still being enabled. If the timeout is triggered during that period, even the callback registered by the FDW that has already called TryDisableRemoteServerCheckingTimeout() would be called.
+ if (remote_servers_connection_check_interval > 0)
+ {
+ CallCheckingRemoteServersCallbacks();
+ enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
+ remote_servers_connection_check_interval);
LockErrorCleanup() needs to be called before reporting the error, doesn't it?
This can cause an error even while DoingCommandRead == true. Is that safe?
The biggest change is that callbacks are no longer un-registered at the end of transactions.
FDW developer must enable or disable timeout instead, via new APIs.The timer will be turned on when:
* new GUC is >= 0, and
This can cause the timeout to be enabled even when no remote transaction is started?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Dear Fujii-san,
Thank you for reviewing! I attached the latest version.
When more than one FDWs are used, even if one FDW calls this function to
disable the timeout, its remote-server-check-callback can still be called. Is this
OK?
Please imagine the case where two FDWs are used and they registered their
callbacks. Even when one FDW calls TryDisableRemoteServerCheckingTimeout(),
if another FDW has not called that yet, the timeout is still being enabled. If the
timeout is triggered during that period, even the callback registered by the FDW
that has already called TryDisableRemoteServerCheckingTimeout() would be
called.
Indeed and it should be avoided. I added a counter to CheckingRemoteServersCallbackItem.
The register function returns the registered item, and it must be set as the argument for
enable and disable functions.
Callback functions will be called when item->counter is larger than zero.
+ if (remote_servers_connection_check_interval > 0) + { + CallCheckingRemoteServersCallbacks(); + enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT, + remote_servers_connection_check_interval);LockErrorCleanup() needs to be called before reporting the error, doesn't it?
You are right. I think this suggests that error-reporting should be done in the core layer.
For meaningful error reporting, I changed a callback specification
that it should return ForeignServer* which points to downed remote server.
This can cause an error even while DoingCommandRead == true. Is that safe?
I read codes again and I think it is not safe. It is OK when whereToSendOutput is DestRemote,
but not good in InteractiveBackend().
I changed that if-statement for CheckingRemoteServersTimeoutPending is moved just after
ClientConnectionLost, because the that may throw a FATAL error and disconnect from client.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED