psql query gets stuck indefinitely
Hi All,
I have postgres installed in a cluster setup. My system has a script which
executes the query below on a remote system in the cluster:
psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
But somehow this query got stuck. It didn't return even after the remote
system (on which the query was supposed to execute) was rebooted. What
could be the reason?
Thanks...
--
Tamanna Madaan | Associate Consultant | GlobalLogic Inc.
Leaders in Software R&D Services
ARGENTINA | CHILE | CHINA | GERMANY | INDIA | ISRAEL | UKRAINE | UK | USA
Office: +0-120-406-2000 x 2971
www.globallogic.com
On 11/28/2011 05:30 PM, tamanna madaan wrote:
> Hi All
> I have postgres installed in a cluster setup. My system has a script
> which executes the query below on a remote system in the cluster.
> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
> But somehow this query got stuck. It didn't return even after the
> remote system (on which the query was supposed to execute) was
> rebooted. What could be the reason?
The issue will most likely be related to the network or to the
client-side host. Perhaps the client machine changed IP addresses (maybe
as part of a switch from WiFi to wired or similar)?
Check the man page for psql in 9.1; I think client-side keepalive
support got committed for 9.1. If it didn't, you can always set it
globally for all TCP/IP connections on your system. See e.g.
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
--
Craig Ringer
I realised just after sending my last message:
You should use ps to find out what exactly psql is doing and which
system call it's blocked on in the kernel (if it's waiting on a
syscall). As you didn't mention your OS I'll assume you're on Linux,
where you'd use:
ps -C psql -o wchan:80=
or
ps -p 1234 -o wchan:80=
... where "1234" is the pid of the stuck psql process. In a psql waiting
for command line input I see it blocked in the kernel routine
"n_tty_read" for example.
If you really want to know what it's doing you can also attach gdb and
get a backtrace to see what code it's paused in inside psql:
gdb -q -p 1234 <<__END__
bt
q
__END__
If you get a message about "missing debuginfos", lots of lines reading
"no debugging symbols found" or lots of lines ending in "?? ()" then you
need to install debug symbols. How to do that depends on your OS/distro
so I won't go into that; it's documented on the PostgreSQL wiki under
"how to get a stack trace" but you probably won't want to bother if this
is just for curiosity's sake.
You're looking for output that looks like:
#1 0x000000369d22a131 in rl_getc () from /lib64/libreadline.so.6
#2 0x000000369d22a8e9 in rl_read_key () from /lib64/libreadline.so.6
#3 0x000000369d215b11 in readline_internal_char () from /lib64/libreadline.so.6
#4 0x000000369d216065 in readline () from /lib64/libreadline.so.6
... etc ...
--
Craig Ringer
Hi Craig,
Thanks for your reply. Unfortunately I don't have that process running
right now; I have already killed it. But I have seen this problem from
time to time on my setup.
It generally happens when the remote system is running slowly for some
reason (high CPU utilization etc.). But whatever the reason, I would assume
that the query should return with an error of some kind if the system the
query is running on is rebooted. Instead it doesn't return and remains
stuck. Moreover, the same query sometimes hangs even when it is run against
the local postgres database, so I don't think network issues play any role
here. Please help.
Thanks....
Regards
Tamanna
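Until the cause is found, one possible stopgap for the calling script is to bound how long the probe may run. The sketch below is not from the thread: it assumes GNU coreutils `timeout` is available on the client, and `run_bounded`, `HOSTIP` and `DBNAME` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Run a command, but give up after a fixed number of seconds.
# Assumes GNU coreutils `timeout`, which exits with status 124 when
# the time limit is reached and the command had to be terminated.
run_bounded() {
    limit="$1"; shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "command timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Hypothetical usage with the probe from this thread (HOSTIP and DBNAME
# are placeholders for the real host and database):
#   run_bounded 30 psql -t -q -Uslon -h"$HOSTIP" -d"$DBNAME" -c "select 1;"
```

This only keeps the script from blocking forever; it does not explain why psql hangs, so the diagnostics described earlier are still worth collecting.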
On 29/11/11 11:21, tamanna madaan wrote:
> Hi Craig
> Thanks for your reply. Unfortunately I don't have that process
> running right now. I have already killed that process. But I have
> seen this problem from time to time on my setup.
> It generally happens when the remote system is running slowly for some
> reason (high CPU utilization etc.). But whatever the reason, I
> would assume that the query should return with an error of some kind
> if the system the query is running on is rebooted. But it
> doesn't return and remains stuck. Moreover, the same query sometimes
> hangs even when it is run against the local postgres database, so I
> don't think network issues play any role here. Please help.
Well, it *really* shouldn't hang locally.
To help you further I'll need you to collect the information on the
stuck process next time you encounter one and post that as a reply.
Maybe with a bit more info we can see what might be going on.
--
Craig Ringer
Well, one question: is TCP keepalive enabled by default in postgres-8.1.2?
I am using postgres on Linux.
Hi Craig,
I am able to reproduce the issue now. I have postgres-8.1.2 installed in a
cluster setup.
I started the query below from one system, say A, against system B in the
cluster:
psql -U<dbname> -h<ip of system B> -c "select sleep(300);"
While this command was running, system B was stopped abruptly by pulling
out its power cable. This caused the above query on system A to hang.
It is still showing in the 'ps -eaf' output after one day. I think the TCP
keepalive mechanism, which has been set at the system level, should have
closed this connection, but it didn't. The following keepalive values are
set on system A:
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
Why is system-level keepalive not working in this case? I learnt from the
link you provided that programs must request keepalive control for their
sockets using the setsockopt interface. I wonder whether postgres 8.1.2
supports / requests system-level keepalive control. If not, which
release/version of postgres supports it?
Thanks...
Tamanna
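One way to check whether a given connection has actually requested keepalive is to look at its socket timers. On Linux, `ss -tno` prints a `timer:(keepalive,...)` field for sockets that have `SO_KEEPALIVE` enabled; the sysctl values above only apply to such sockets. The following is a rough sketch, not something from the thread: the helper names are hypothetical, port 5432 is assumed to be the server port, and an `ss` new enough to support the `state` filter is assumed.

```shell
#!/bin/sh
# Return success if one line of `ss -tno` output shows an armed
# keepalive timer, i.e. the socket has SO_KEEPALIVE enabled.
has_keepalive_timer() {
    case "$1" in
        *"timer:(keepalive,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage on system A: list established connections to port
# 5432 (assumed PostgreSQL port) and flag those without keepalive.
# `sed 1d` drops the header line that ss prints.
check_pg_connections() {
    ss -tno state established '( dport = :5432 )' | sed 1d |
    while IFS= read -r line; do
        if has_keepalive_timer "$line"; then
            echo "keepalive on:  $line"
        else
            echo "keepalive off: $line"
        fi
    done
}
```

If the stuck psql connection shows no keepalive timer, the kernel never sends probes on it, whatever the system-wide values are.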
On 1 December 2011, 12:57, tamanna madaan wrote:
> Hi Craig
> I am able to reproduce the issue now. I have postgres-8.1.2 installed in a
> cluster setup.
Well, the first thing you should do is upgrade, at least to the last
8.1 minor version, which is 8.1.22. It may very well be an already fixed
bug (I haven't checked). BTW the 8.1 branch has not been supported for a
long time, so upgrade to a more recent version if possible.
Second - what OS are you using, and what version? Keepalive needs support
at the OS level, and if the OS is upgraded as frequently as the database
(i.e. not at all), this might already be fixed.
And finally - what do you mean by 'cluster setup'?
Tomas
Hi Tomas,
I tried it on a system running postgres-8.4.0, and the behavior is the same.
By cluster I mean a group of machines that all have postgres installed. The
same database is created on all the machines. One of them works as the
master DB, on which operations (insert/delete/update) are performed, and the
others work as slave DBs, to which data is replicated from the master DB by
Slony. In my cluster setup there are only two machines (A and B), one
hosting the master DB and the other the slave. I execute the query below
from system A against system B:
psql -U<db name> -h<host ip of B> -c "select sleep(300);"
This query can be seen running on system B in the `ps -eaf | grep postgres`
output.
Now, while this query is running, execute the command below on system A,
which will block any packet coming in to this machine:
iptables -I INPUT -i eth0 -j DROP
After 5 minutes (the sleep period), the query finishes on system B, but it
can still be seen running on system A. This is presumably because the
message that the query has finished was never received by system A.
Still, I would assume that after (tcp_keepalive_time +
tcp_keepalive_probes*tcp_keepalive_intvl) the psql query should return on
system A as well. But the query doesn't return until it is killed manually.
What could be the reason for that?
Well, I learnt the following from the postgres release notes:
==========================================================================
postgres 8.1
Server-side changes:
Add configuration parameters to control TCP/IP keep-alive times for idle,
interval, and count (Oliver Jowett)
These values can be changed to allow more rapid detection of lost client
connections.
postgres 9.0
E.8.3.9. Development Tools
E.8.3.9.1. libpq
Add TCP keepalive settings in libpq (Tollef Fog Heen, Fujii Masao, Robert
Haas)
Keepalive settings were already supported on the server end of TCP
connections.
==========================================================================
Does this mean that the TCP keepalive settings provided from postgres 8.1
onwards only work for detecting lost client connections on the server side,
and won't work in the case above, which requires psql (the client) to
return? And would the TCP keepalive settings in libpq (provided from
postgres 9.0 onwards) work for the above case?
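For reference, the client-side settings mentioned in the 9.0 release notes are passed as libpq connection parameters (`keepalives`, `keepalives_idle`, `keepalives_interval`, `keepalives_count`). The sketch below only builds such a connection string; the helper name, host, database, and timing values are illustrative assumptions, and it needs a psql/libpq from 9.0 or later on the client machine.

```shell
#!/bin/sh
# Build a libpq connection string that enables client-side TCP keepalives.
# Arguments: host, database, user, then optional idle seconds, probe
# interval seconds, and probe count. Defaults are illustrative only.
keepalive_conninfo() {
    host="$1"; db="$2"; user="$3"
    printf 'host=%s dbname=%s user=%s keepalives=1 keepalives_idle=%s keepalives_interval=%s keepalives_count=%s' \
        "$host" "$db" "$user" "${4:-60}" "${5:-10}" "${6:-5}"
}

# Hypothetical usage from system A (address, database and user are
# placeholders):
#   psql "$(keepalive_conninfo 192.0.2.10 mydb slon)" -c "select 1;"
```

With these example values a dead peer would be noticed after roughly 60 + 5 * 10 = 110 seconds of silence, instead of relying on the much longer system defaults.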
The kernel version on my system is 2.6.27.7-9-default and postgres is 8.4.0.
The keepalive settings are as below:
postgresql.conf:
#tcp_keepalives_idle = 0 # TCP_KEEPIDLE, in seconds;
# 0 selects the system default
#tcp_keepalives_interval = 0 # TCP_KEEPINTVL, in seconds;
# 0 selects the system default
#tcp_keepalives_count = 0 # TCP_KEEPCNT;
# 0 selects the system default
System-level settings:
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
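As a worked example of the formula given earlier in this message, the system-level values above can be plugged in directly. This is plain arithmetic under the assumption that the socket has keepalive enabled at all; the function name is hypothetical.

```shell
#!/bin/sh
# Worst-case seconds before the kernel declares an idle, keepalive-enabled
# connection dead: idle time plus (probe count * probe interval).
keepalive_detect_seconds() {
    idle="$1"; probes="$2"; intvl="$3"
    echo $(( idle + probes * intvl ))
}

# Values from the sysctl settings listed above.
secs=$(keepalive_detect_seconds 7200 9 75)
echo "${secs}s = $(( secs / 3600 ))h $(( secs % 3600 / 60 ))m $(( secs % 60 ))s"
# prints: 7875s = 2h 11m 15s
```

So with keepalive active, the hang should end after a little over two hours. A connection that stays stuck far longer than that, as in the earlier one-day test, suggests no keepalive probes are being sent on the client socket.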
Regards
Tamanna
Hi All
Please help me.
Thanks...
Tamanna