CLOSE_WAIT pileup and Application Timeout

Started by KK CHN · over 1 year ago · 8 messages · general
#1KK CHN
kkchn.in@gmail.com

List,

I am facing a network (TCP/IP connection closing) issue.

We run an Android tablet application that updates the status of a fleet of around 1000 vehicles, each fitted with the app, against a vehicle-tracking application server based on Java and WildFly with a PostgreSQL 16 backend.

Every 30 seconds, the tracking app on each tablet sends the vehicle's location (lat/long coordinates) through the Java backend to the PostgreSQL DB, so that the latest position and movement of each vehicle can be rendered on a map-based front end.

The vehicles in the field communicate via port 443 to port 8080 of WildFly (version 27), which hosts the tracking application written in Java (version 17).

*The tablets communicate with the backend application over mobile data (4G/5G SIMs).*

Vehicles on the move may disconnect, or be unable to send location data for a while, whenever they pass through areas where mobile-data coverage is weak or absent.

Roughly every week or so, the server running the backend application starts showing connection timeouts and can no longer serve vehicle tracking. Restarting the WildFly server returns the application to normal, but the issue repeats after a week or two.

When this bottleneck occurs, the server machine shows a large number of TCP connections stuck in CLOSE_WAIT (3000 to 5000) while the backend is unresponsive.

What is the root cause of this issue? Is it the Android application being unable to complete the TCP close handshake due to poor mobile-data connectivity?

If so, how do people address this issue, and what might a fix be?

Any directions or reference material are most welcome.

Thank you,
Krishane

#2Francesco Benetton
frabe1579@gmail.com
In reply to: KK CHN (#1)
Re: CLOSE_WAIT pileup and Application Timeout

If I understand correctly, PostgreSQL is used as a data server for the backend, so the Android app does not connect directly to PostgreSQL. My first thought is a problem with closing or recycling connections in the backend after each request is executed. Perhaps wrong client connection-pooling settings?
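If the backend does leak pooled connections (a handler path that never returns its connection after a request), each leak eventually shows up as a stuck socket until a restart. A minimal sketch of the leak-vs-fix pattern, using a stand-in resource since the actual DataSource code is not shown in this thread (class and method names are illustrative):

```java
public class LeakSketch {
    // Stand-in for a pooled JDBC connection; the real backend would use
    // javax.sql.DataSource#getConnection() (this class is illustrative only).
    static class PooledConnection implements AutoCloseable {
        static int open = 0;
        PooledConnection() { open++; }
        @Override public void close() { open--; }
    }

    static void handleRequestLeaky() {
        PooledConnection c = new PooledConnection();
        // ... use c ...; if anything throws here, or this path just forgets
        // to close, the connection (and its socket) is gone until restart
    }

    static void handleRequestSafe() {
        // try-with-resources guarantees close() runs, even on exceptions
        try (PooledConnection c = new PooledConnection()) {
            // ... use c ...
        }
    }

    public static void main(String[] args) {
        handleRequestLeaky();
        handleRequestSafe();
        System.out.println("connections still open: " + PooledConnection.open); // → 1
    }
}
```

In the real backend the same shape applies to `Connection`, `Statement` and `ResultSet`, all of which are `AutoCloseable`.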

On Fri, Oct 4, 2024 at 06:29 KK CHN <kkchn.in@gmail.com> wrote:


#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: KK CHN (#1)
Re: CLOSE_WAIT pileup and Application Timeout

On 10/3/24 21:29, KK CHN wrote:

List,

I am facing a  network (TCP IP connection closing issue) .

Running a  mobile tablet application, Android application to update the
status of vehicles fleet say around 1000 numbers installed with the app
on each vehicle along  with a  vehicle tracking  application server
solution based on Java and Wildfly with  PosrgreSQL16 backend.

The  running vehicles may disconnect  or be unable to send the location
data in between if the mobile data coverage is less or absent in a
particular area where data coverage is nil or signal strength less.

The server on which the backend application runs most often ( a week's
time  or so) shows connection timeout and is unable to serve tracking
of  the vehicles further.

When we restart the  Wildfly server  the application returns to normal.
again the issue repeats  after a week or two.

Seems the issue is in the application server. What is not clear to me is whether the connection timeout you refer to is from the mobile devices to the application, or from the application to the Postgres server. I'm guessing the latter, as I would expect the mobile devices to drop connections more often than weekly.

In the Server machine when this bottleneck occurs  I am seeing  a lot
of  TCP/IP CLOSE_WAIT   ( 3000 to 5000 ) when the server backend becomes
unresponsive.

Again not clear, are you referring to the application or the Postgres
database running on the server?

What is the root cause of this issue ?   Is it due to the android
application unable to send the CLOSE_WAIT ACK due to poor mobile data
connectivity ?

 If so, how do people  address this issue ?  and what may be a fix ?

 Any  directions / or reference material most welcome.

Thank you,
Krishane

--
Adrian Klaver
adrian.klaver@aklaver.com

#4KK CHN
kkchn.in@gmail.com
In reply to: Adrian Klaver (#3)
Re: CLOSE_WAIT pileup and Application Timeout

On Fri, Oct 4, 2024 at 9:17 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 10/3/24 21:29, KK CHN wrote:

List,

I am facing a network (TCP IP connection closing issue) .

Running a mobile tablet application, Android application to update the
status of vehicles fleet say around 1000 numbers installed with the app
on each vehicle along with a vehicle tracking application server
solution based on Java and Wildfly with PosrgreSQL16 backend.

The running vehicles may disconnect or be unable to send the location
data in between if the mobile data coverage is less or absent in a
particular area where data coverage is nil or signal strength less.

The server on which the backend application runs most often ( a week's
time or so) shows connection timeout and is unable to serve tracking
of the vehicles further.

When we restart the Wildfly server the application returns to normal.
again the issue repeats after a week or two.

Seems the issue is in the application server. What is not clear to me is
whether the connection timeout you refer to is from the mobile devices
to the application or the application to the Postgres server?

It's from the mobile devices to the application server. When I restart the application server everything goes back to normal, but after a period of time it cripples again. At that point, netstat on the application VM shows lots of CLOSE_WAIT states, as indicated.

I'm
guessing the latter as I would expect the mobile devices to drop
connections more often then weekly.

Yes, the mobile devices may drop connections at any point when they reach an area where signal strength is poor (e.g. underground parking, or areas where mobile-data coverage is weak).

The topology is: the mobile devices connect and update their location via the application VM, which finally writes to the PGSQL VM.

The application server and database server are separate virtual machines. It is the application server that hangs most often, not the database VM: other applications update the database VM without any issue, and the DB VM handles all the writes from those applications. But those applications are different ones, not the fleet-management one.


#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: KK CHN (#4)
Re: CLOSE_WAIT pileup and Application Timeout

On 10/6/24 06:26, KK CHN wrote:

On Fri, Oct 4, 2024 at 9:17 PM Adrian Klaver <adrian.klaver@aklaver.com

Seems the issue is in the application server. What is not clear to
me is
whether the connection timeout you refer to is from the mobile devices
to the application or the application to the Postgres server?

its from mobile devices to application server.  When I do a restart of
application server everything backs to normal.  But after a period of
time again it cripples.  That time when I netstat on Application VM lots
of  CLOSE_WAIT states as indicated.

I'm
guessing the latter as I would expect the mobile devices to drop
connections more often then weekly.

Yes mobile devices may drops connections at any point of time if it
reaches an area where signal strength is poor( eg; Underground
parking or near the areas where mobile data coverage is poor.

The topology is mobile devices  connect and update the location via
application VM then   finally in  PGSQL VM.

The application server and  Database server both separate virtual
machines.      Application server hangs most often not the database VM.
Since there are other applications which update to the database VM
without any issue.  The DB VM caters all the writes from other
applications. But those applications are different, not fleet management
one.

From what I see, this really has nothing to do with the Postgres backend. It is a matter of communication, or rather the lack of communication, between the mobile devices and the application server. A broad answer is that something needs to be done to gracefully deal with mobile-device connection drops.
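One concrete way to deal gracefully with silently dropped clients is a read timeout on each accepted socket, so a server thread never blocks forever waiting on a dead peer; the read fails fast and the server can close the connection. A minimal sketch (the 90-second value is illustrative, not from this thread):

```java
import java.net.Socket;

public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Socket options can be configured before connecting; in a real
        // server this would be applied to each socket returned by accept().
        Socket s = new Socket();
        s.setSoTimeout(90_000);  // reads throw SocketTimeoutException after 90 s of silence
        System.out.println("SO_TIMEOUT = " + s.getSoTimeout() + " ms"); // → SO_TIMEOUT = 90000 ms
        s.close();
    }
}
```

In a WildFly/Undertow deployment the equivalent knobs are the HTTP listener's read/idle timeout options rather than raw sockets, but the principle is the same.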

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: KK CHN (#1)
Re: CLOSE_WAIT pileup and Application Timeout

On 2024-Oct-04, KK CHN wrote:

The mobile tablets are installed with the android based vehicle
tracking app which updated every 30 seconds its location fitted inside the
vehicle ( lat long coordinates) to the PostgreSQL DB through the java
backend application to know the latest location of the vehicle and its
movement which will be rendered in a map based front end.

The vehicles on the field communicate via 443 to 8080 of the Wildfly
(version 27 ) deployed with the vehicle tracking application developed with
Java(version 17).

It sounds like setting TCP keepalives on the connections between Wildfly and the vehicles might help get the number of dead connections down to a reasonable level. Then it's up to Wildfly to close the connections to Postgres in a timely fashion. (It's not clear from your description how vehicle connections to Wildfly relate to Postgres connections.)
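At the application level, keepalive is a per-socket option in Java. A minimal sketch of enabling it (plain java.net, not WildFly-specific; an Undertow listener exposes an equivalent setting):

```java
import java.net.Socket;

public class KeepaliveSketch {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();  // options can be set before connecting
        s.setKeepAlive(true);     // SO_KEEPALIVE: the kernel probes idle peers
                                  // using the OS keepalive timer settings
        System.out.println("keepalive = " + s.getKeepAlive()); // → keepalive = true
        s.close();
    }
}
```

Note that keepalive only detects dead peers; the application still has to close() the socket once its reads or writes start failing.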

I wonder if the connections from Wildfly to Postgres use SSL? Because
there are reported cases where TCP connections are kept and accumulate,
causing problems -- but apparently SSL is a necessary piece for that to
happen.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)

#7KK CHN
kkchn.in@gmail.com
In reply to: Alvaro Herrera (#6)
Re: CLOSE_WAIT pileup and Application Timeout

On Mon, Oct 7, 2024 at 12:07 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2024-Oct-04, KK CHN wrote:

The mobile tablets are installed with the android based vehicle
tracking app which updated every 30 seconds its location fitted inside

the

vehicle ( lat long coordinates) to the PostgreSQL DB through the java
backend application to know the latest location of the vehicle and its
movement which will be rendered in a map based front end.

The vehicles on the field communicate via 443 to 8080 of the Wildfly
(version 27 ) deployed with the vehicle tracking application developed

with

Java(version 17).

It sounds like setting TCP keepalives in the connections between the
Wildfly and the vehicles might help get the number of dead connections
down to a reasonable level. Then it's up to Wildfly to close the
connections to Postgres in a timely fashion. (It's not clear from your
description how do vehicle connections to Wildfly relate to Postgres
connections.)

Where do I have to introduce the TCP keepalives? At the OS level, or in the application code?

[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@dbch wildfly-27.0.0.Final]# cat
/proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@dbch wildfly-27.0.0.Final]#

These are the default values at the OS level. Do I need to reduce all three of the above values to, say, 600, 20 and 5? Or does this need to be handled in the application backend code?

Any hints much appreciated.

I wonder if the connections from Wildfly to Postgres use SSL? Because
there are reported cases where TCP connections are kept and accumulate,
causing problems -- but apparently SSL is a necessary piece for that to
happen.

No SSL between Wildfly (8080) and PGSQL (5432); both machines are internal-LAN VMs on the same network. Only the devices in the field (fitted in the vehicles) communicate with the application backend via a public URL on port 443, which connects to port 8080 of Wildfly; the Java code then connects to the database server running on port 5432 on the internal LAN.


#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: KK CHN (#7)
Re: CLOSE_WAIT pileup and Application Timeout

On 2024-Oct-07, KK CHN wrote:

On Mon, Oct 7, 2024 at 12:07 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Where do I have to introduce the TCP keepalives ? in the OS level or
application code level ?

[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@dbch wildfly-27.0.0.Final]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@dbch wildfly-27.0.0.Final]# cat
/proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@dbch wildfly-27.0.0.Final]#

These are the default values in the OS level. Do I need to reduce all the
above three values to say 600, 20, 5 ? Or need to be handled in the
application backend code ?

My understanding is that these values have no effect unless the socket
gets
setsockopt( ... , SO_KEEPALIVE, ...)

So that's definitely something that the app needs to do -- it's not
enabled automatically.
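For completeness: since JDK 11 the three keepalive timers can also be overridden per socket through the jdk.net extended options, so the application can use aggressive values (say the 600/20/5 floated above) without touching the OS-wide /proc defaults. A sketch, with a support check since these options are platform-dependent (Linux has them):

```java
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

public class KeepaliveTuning {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        s.setOption(StandardSocketOptions.SO_KEEPALIVE, true);  // timers only matter if this is on
        if (s.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
            // Per-socket equivalents of tcp_keepalive_time / _intvl / _probes:
            s.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 600);     // start probing after 600 s idle
            s.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 20);  // probe every 20 s
            s.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);      // declare dead after 5 failures
            System.out.println("keepidle = " + s.getOption(ExtendedSocketOptions.TCP_KEEPIDLE) + " s");
        } else {
            System.out.println("extended keepalive options not supported on this platform");
        }
        s.close();
    }
}
```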

With these default settings, a dead connection would be detected and closed about 2:11 hours after going quiet (7200 s of idle time plus 9 probes at 75 s intervals, i.e. 7875 s), so if your problem only manifests a week later, you would have plenty of time for these to be cleaned up. But of course you should monitor what happens.

I wonder if the connections from Wildfly to Postgres use SSL? Because
there are reported cases where TCP connections are kept and accumulate,
causing problems -- but apparently SSL is a necessary piece for that to
happen.

No SSL in between Wildfly (8080 ) to PGSQL(5432).

Okay, that's unlikely to be relevant then.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Linux transformed my computer from a `machine for doing things'
into a genuinely entertaining device, about which I learn
something new every day" (Jaime Salinas)