psql query gets stuck indefinitely

Started by tamanna madaan, over 14 years ago, 10 messages, general

tamanna.madaan@globallogic.com

over 14 years ago

Hi All

I have postgres installed in a cluster setup. My system has a script which
executes the query below on a remote system in the cluster.

psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"

But somehow this query got stuck. It didn't return even after the remote
system (on which the query was supposed to execute) was rebooted. What
could be the reason?

Office: +0-120-406-2000 x 2971

www.globallogic.com

Craig Ringer

craig@2ndquadrant.com

over 14 years ago

In reply to: tamanna madaan (#1)

Re: psql query gets stuck indefinitely

On 11/28/2011 05:30 PM, tamanna madaan wrote:

> Hi All
> I have postgres installed in a cluster setup. My system has a script
> which executes the query below on a remote system in the cluster.
> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
> But somehow this query got stuck. It didn't return even after the
> remote system (on which the query was supposed to execute) was
> rebooted. What could be the reason?

The issue will most likely be related to the network or to the
client-side host. Perhaps the client machine changed IP addresses (maybe
as part of a switch from WiFi to wired or similar) ?

Check the man page for psql in 9.1; I think client-side keepalive
support got committed for 9.1. If it didn't, you can always set it
globally for all TCP/IP connections on your system. See e.g.
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

--
Craig Ringer

Craig Ringer

craig@2ndquadrant.com

over 14 years ago

In reply to: tamanna madaan (#1)

Re: psql query gets stuck indefinitely

On 11/28/2011 05:30 PM, tamanna madaan wrote:

> Hi All
> I have postgres installed in a cluster setup. My system has a script
> which executes the query below on a remote system in the cluster.
> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
> But somehow this query got stuck. It didn't return even after the remote
> system (on which the query was supposed to execute) was rebooted. What
> could be the reason?

I realised just after sending my last message:

You should use ps to find out what exactly psql is doing and which
system call it's blocked in inside the kernel (if it's waiting on a
syscall). As you didn't mention your OS I'll assume you're on Linux,
where you'd use:

ps -C psql -o wchan:80=

or

ps -p 1234 -o wchan:80=

... where "1234" is the pid of the stuck psql process. In a psql waiting
for command line input I see it blocked in the kernel routine
"n_tty_read" for example.

If you really want to know what it's doing you can also attach gdb and
get a backtrace to see what code it's paused in inside psql:

gdb -q -p 1234 <<__END__
bt
q
__END__

If you get a message about "missing debuginfos", lots of lines reading
"no debugging symbols found" or lots of lines ending in "?? ()" then you
need to install debug symbols. How to do that depends on your OS/distro
so I won't go into that; it's documented on the PostgreSQL wiki under
"how to get a stack trace" but you probably won't want to bother if this
is just for curiosity's sake.

You're looking for output that looks like:

#1 0x000000369d22a131 in rl_getc () from /lib64/libreadline.so.6
#2 0x000000369d22a8e9 in rl_read_key () from /lib64/libreadline.so.6
#3 0x000000369d215b11 in readline_internal_char () from /lib64/libreadline.so.6
#4 0x000000369d216065 in readline () from /lib64/libreadline.so.6

... etc ...

--
Craig Ringer

tamanna madaan

tamanna.madaan@globallogic.com

over 14 years ago

In reply to: Craig Ringer (#3)

Re: psql query gets stuck indefinitely

Hi Craig

Thanks for your reply. Unfortunately I don't have that process running
right now; I have already killed it. But I have seen this problem on my
setup from time to time.
It generally happens when the remote system is running slowly for some
reason (high CPU utilization etc.). Whatever the reason, I would expect
the query to return with some error if the system it is running on is
rebooted. But it doesn't return and remains stuck. Moreover, the same
query sometimes hangs even when it is run against the local postgres
database, so I don't think network issues play any role in that. Please help.

Thanks....

Regards
Tamanna

Craig Ringer

craig@2ndquadrant.com

over 14 years ago

In reply to: tamanna madaan (#4)

Re: psql query gets stuck indefinitely

On 29/11/11 11:21, tamanna madaan wrote:

> Hi Craig
>
> Thanks for your reply. Unfortunately I don't have that process
> running right now; I have already killed it. But I have seen this
> problem on my setup from time to time.
> It generally happens when the remote system is running slowly for some
> reason (high CPU utilization etc.). Whatever the reason, I would
> expect the query to return with some error if the system it is
> running on is rebooted. But it doesn't return and remains stuck.
> Moreover, the same query sometimes hangs even when it is run against
> the local postgres database, so I don't think network issues play any
> role in that. Please help.

Well, it *really* shouldn't hang locally.

To help you further I'll need you to collect the information on the
stuck process next time you encounter one and post that as a reply.
Maybe with a bit more info we can see what might be going on.

--
Craig Ringer

tamanna madaan

tamanna.madaan@globallogic.com

over 14 years ago

In reply to: tamanna madaan (#4)

Re: psql query gets stuck indefinitely

Well, one question: is TCP keepalive enabled by default in postgres-8.1.2?

I am using postgres on a Linux platform.

tamanna madaan

tamanna.madaan@globallogic.com

over 14 years ago

In reply to: tamanna madaan (#6)

Re: psql query gets stuck indefinitely

Hi Craig
I am able to reproduce the issue now. I have postgres-8.1.2 installed in a
cluster setup.

I started the query below from one system, say A, against system B in the
cluster:
psql -U<dbname> -h<ip of system B> -c "select sleep(300);"

While this command was running, system B was stopped abruptly by pulling out
its power cable. This caused the query on system A to hang. It is still
showing in the 'ps -eaf' output after one day. I think the TCP keepalive
mechanism, which has been set at system level, should have closed this
connection, but it didn't. The following keepalive values are set on
system A:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Why is system-level keepalive not working in this case? I learnt from the
link you provided that programs must request keepalive control for their
sockets using the setsockopt interface. Does postgres 8.1.2 request
system-level keepalive control for its sockets? If not, which
release/version of postgres supports that?
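
For reference, the longest a dead peer should go unnoticed on a socket that has actually enabled SO_KEEPALIVE can be worked out from the three sysctl values listed above. This is only a sketch of that arithmetic in shell; the function name keepalive_deadline is made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# Worst-case seconds before an idle connection with SO_KEEPALIVE enabled
# is declared dead: the idle time before the first probe, plus one
# interval for each unanswered probe.
# keepalive_deadline is a hypothetical helper, not an existing command.
keepalive_deadline() {
    idle=$1 probes=$2 intvl=$3
    echo $(( idle + probes * intvl ))
}

# Values reported for system A in this thread.
keepalive_deadline 7200 9 75    # prints 7875 (about 2 h 11 min)

# On a live Linux system the same values could be read with sysctl -n:
# keepalive_deadline "$(sysctl -n net.ipv4.tcp_keepalive_time)" \
#                    "$(sysctl -n net.ipv4.tcp_keepalive_probes)" \
#                    "$(sysctl -n net.ipv4.tcp_keepalive_intvl)"
```

A connection that is still hanging after a full day is well past that figure, which is consistent with the probes never having been enabled on the client socket rather than merely being slow to fire.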

Thanks...
Tamanna

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 14 years ago

In reply to: tamanna madaan (#7)

Re: psql query gets stuck indefinitely

On 1 December 2011, 12:57, tamanna madaan wrote:

> Hi Craig
> I am able to reproduce the issue now. I have postgres-8.1.2 installed in a
> cluster setup.

Well, the first thing you should do is upgrade, at least to the last
8.1 minor version, which is 8.1.22. It may very well be an already fixed
bug (haven't checked). BTW the 8.1 branch has not been supported for a long
time, so upgrade to a more recent version if possible.

Second - what OS are you using, what version? The keep-alive needs support
at OS level, and if the OS is upgraded as frequently as the database (i.e.
not at all), this might be already fixed.

And finally - what do you mean by 'cluster setup'?

Tomas

tamanna madaan

tamanna.madaan@globallogic.com

over 14 years ago

In reply to: Tomas Vondra (#8)

Re: psql query gets stuck indefinitely

Hi Tomas

I tried it on a system running postgres-8.4.0, and the behavior is the same.

"Cluster" means a group of machines that all have postgres installed. The
same database is created on all the machines. One of them works as the
master DB, on which operations (insert/delete/update) are performed, and
the others work as slave DBs, which get data replicated to them from the
master DB by Slony. In my cluster setup there are only two machines (A and
B), one holding the master DB and the other being the slave. I execute the
query below from system A against system B:

psql -U<db name> -h<host ip of B> -c "select sleep(300);"

This query can be seen running on system B in the `ps -eaf | grep postgres`
output.

Now, while this query is running, execute the command below on system A,
which blocks every packet coming in to that machine:

iptables -I INPUT -i eth0 -j DROP

After 5 minutes (the sleep period), the query finishes on system B, but it
can still be seen running on system A. This is probably because the message
that the query has finished was never received by system A.

Still, I would assume that after (tcp_keepalive_time +
tcp_keepalive_probes*tcp_keepalive_intvl) the psql query should return on
system A as well. But it doesn't return until it is killed manually.

What could be the reason for that?

Well, I learnt the following from the postgres release notes:

==========================================================================

postgres 8.1

Server-side changes:

Add configuration parameters to control TCP/IP keep-alive times for idle,
interval, and count (Oliver Jowett)

These values can be changed to allow more rapid detection of lost client
connections.

postgres 9.0

E.8.3.9. Development Tools

E.8.3.9.1. libpq

Add TCP keepalive settings in libpq (Tollef Fog Heen, Fujii Masao, Robert
Haas)

Keepalive settings were already supported on the server end of TCP
connections.

==========================================================================

Does this mean that the TCP keepalive settings provided from postgres 8.1
onwards only work on the server, for detecting lost client connections, and
won't work in the case above, which requires psql (the client) to return?
And would the TCP keepalive settings in libpq (provided from postgres 9.0
onwards) work for the case above?
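
For what it's worth, the libpq settings named in the 9.0 release notes are ordinary connection parameters (keepalives, keepalives_idle, keepalives_interval, keepalives_count), so with a 9.0 or later client they can be passed to psql in a conninfo string. The following is a rough sketch only, untested against this setup; the helper name keepalive_conninfo, the host 192.0.2.10, the database name mydb and the timing values are all placeholders chosen for illustration:

```shell
#!/bin/sh
# Build a libpq conninfo string that asks the client socket to send TCP
# keepalive probes. Needs a libpq from PostgreSQL 9.0 or later.
# keepalive_conninfo is a hypothetical helper; host, dbname and user are
# whatever the calling script already uses.
keepalive_conninfo() {
    host=$1 dbname=$2 user=$3
    # keepalives=1         enable keepalive probes on the client side
    # keepalives_idle      seconds of inactivity before the first probe
    # keepalives_interval  seconds between unanswered probes
    # keepalives_count     unanswered probes before the connection is dropped
    printf 'host=%s dbname=%s user=%s keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=3' \
        "$host" "$dbname" "$user"
}

# Print the command instead of running it, since the host is a placeholder.
conninfo=$(keepalive_conninfo 192.0.2.10 mydb slon)
echo "psql -t -q \"$conninfo\" -c \"select 1;\""
```

With these illustrative values a dead peer should be noticed after roughly 90 seconds of silence (60 + 3 * 10). The server-side tcp_keepalives_* settings in postgresql.conf only affect the server's end of the connection, so they would not make a stuck psql return.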

The kernel version on my system is 2.6.27.7-9-default and postgres is 8.4.0.
The keepalive settings are as below:

postgresql.conf:

#tcp_keepalives_idle = 0         # TCP_KEEPIDLE, in seconds;
                                 # 0 selects the system default
#tcp_keepalives_interval = 0     # TCP_KEEPINTVL, in seconds;
                                 # 0 selects the system default
#tcp_keepalives_count = 0        # TCP_KEEPCNT;
                                 # 0 selects the system default

System-level settings:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

Regards

Tamanna

tamanna madaan

tamanna.madaan@globallogic.com

over 14 years ago

In reply to: tamanna madaan (#9)

Re: psql query gets stuck indefinitely

Hi All

Please help me.

Thanks...
Tamanna
