errors with high connections rate

Started by Pawel Veselov almost 14 years ago, 8 messages, general
#1 Pawel Veselov
pawel.veselov@gmail.com

Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon
AMI distro).
The application writes data at a high rate (at this point it's 500
transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads that receive "messages" that are then written
out to the DB. There is no connection pool; instead, each worker thread
maintains its own connection that it uses to write data to the database.
The connections are kept in pthread's "specific" data blocks.

Each thread connects to the DB when it receives its first work message,
or when the "error" flag is set on its connection. The error flag is set
any time there is any error running a database statement.

When the work is "slow", I don't see any problem (slow was ~250 messages
per second). With the load increased, when I restart the process the
threads start grabbing work at a high enough rate that, as each one first
opens a connection to the database, these errors start popping up:

Can't connect to DB: could not send data to server: Transport endpoint is
not connected
could not send startup packet: Transport endpoint is not connected

This is a result of executing the following code:

/* PQconnectdb() blocks until the connection attempt succeeds or fails. */
wi->pg_conn = PQconnectdb(conn_str);
ConnectionStatusType cst = PQstatus(wi->pg_conn);

if (cst != CONNECTION_OK) {
    ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
}

Eventually, the errors go away (when worker threads fail to connect,
they just pass the message to another thread, wait for their turn, and
try reconnecting again), so it does seem that the remedy is just
spreading the connections out in time.
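
One way to spread reconnect attempts out in time is exponential backoff with random jitter. The sketch below is only an illustration under stated assumptions: reconnect_delay_ms() and next_random() are hypothetical helpers that do not come from libpq or from the application described here, and the base and cap values are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator. The state lives in *state so that each
 * thread can own its own generator; the state must never be zero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: delay in milliseconds before reconnect attempt
 * number `attempt` (0-based). The window starts at base_ms, doubles per
 * attempt up to cap_ms, and the delay is drawn from [0, window], so
 * threads that failed together are unlikely to retry together. */
unsigned reconnect_delay_ms(unsigned attempt, unsigned base_ms,
                            unsigned cap_ms, uint32_t *state)
{
    assert(base_ms > 0 && cap_ms >= base_ms);
    assert(state != NULL && *state != 0);

    uint64_t window = base_ms;
    while (attempt-- > 0 && window < cap_ms)
        window *= 2;
    if (window > cap_ms)
        window = cap_ms;

    return (unsigned)(next_random(state) % (window + 1));
}
```

A worker would sleep for the returned number of milliseconds before calling PQconnectdb() again, seeding its generator with something thread-specific such as its worker index plus one.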

The connection string is '' (empty); the connection is made through the
Unix socket /tmp/.s.PGSQL.5432

I don't see these errors when:
1) the number of worker threads is reduced (I could never reproduce them
with 200 or fewer, but have seen them with 300 or more)
2) the load is reduced

-- problem 2 --

As I'm trying to debug this (with strace), I could never reproduce it, at
least to see what's going on, but sometimes I get another error: "too many
users connected". Even restarting postmaster doesn't help. The postmaster
is running with -N810, and the role has a connection limit of 1000. Yet, the
"too many" error starts creeping up after only 275 connections are opened
(counted by successful connect() calls in strace).

Any idea where I should dig?

P.S. I looked at fe-connect.c, and I'm wondering if there is a potential
race condition between poll() and the socket actually finishing the
connection. When running under strace, I never see EINPROGRESS returned
from connect(), and the only reason sendto() would result in ENOTCONN is
if the connect didn't finish but the socket was deemed "connected" based
on poll/getsockopt...
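
For reference, the general POSIX pattern for finishing a non-blocking connect() is to wait for writability and then read the pending error from SO_ERROR. The sketch below shows that pattern only; wait_connected() is a hypothetical name, and this is not a copy of what fe-connect.c does.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: wait until a non-blocking connect() on fd has
 * finished. Returns 0 if the socket is connected, otherwise an errno
 * value (ETIMEDOUT if poll() gave up first). */
int wait_connected(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int rc;

    assert(fd >= 0);

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);   /* a signal interrupted the wait */

    if (rc < 0)
        return errno;
    if (rc == 0)
        return ETIMEDOUT;

    /* Writability alone is not proof of success: a failed connect also
     * wakes poll(), so the outcome has to be read from SO_ERROR. */
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}
```

If a caller skipped the SO_ERROR check, or treated an interrupted poll() as success, it could go on to send on a socket that never finished connecting, which is the kind of sequence the question above is asking about.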

Thanks,
Pawel.

#2 Craig Ringer
craig@2ndquadrant.com
In reply to: Pawel Veselov (#1)
Re: errors with high connections rate

On 07/03/2012 03:19 PM, Pawel Veselov wrote:

Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3
(Amazon AMI distro).
The application writes data at a high rate (at this point it's 500
transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads that receive "messages" that are then
written out to the DB. There is no connection pool; instead, each
worker thread maintains its own connection that it uses to write data
to the database. The connections are kept in pthread's "specific" data
blocks.

Hmm. To get that kind of TPS with that design are you running with
fsync=off or on storage that claims to flush I/O without actually doing
so? Have you checked your crash safety? Is it just fairly big hardware?

Why are you using so many connections? Unless you have truly monstrous
hardware your system should achieve considerably greater throughput by
reducing the connection count and queueing bursts of writes. You
wouldn't even need an external pool in your case, just switch to a
producer/consumer model where your accepting threads queue work for a
separate and much smaller set of writer threads that send it to the DB.
Writer threads could then do useful optimisations like multi-value-inserting
or COPYing data, doing small batches in transactions, etc.
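
As an illustration of the multi-value-insert idea, a writer thread could build one parameterised statement for a whole batch. This is a minimal sketch: build_multi_insert() is a hypothetical helper and the table name in any call is a placeholder, not something from the original application.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write
 *   INSERT INTO <table> VALUES ($1,$2),($3,$4),...
 * for `rows` rows of `cols` columns into buf. Returns the statement
 * length, or -1 if buf is too small. The table name is copied in
 * verbatim, so it must come from trusted code, never from user input. */
int build_multi_insert(char *buf, size_t size, const char *table,
                       int rows, int cols)
{
    assert(buf != NULL && table != NULL && rows > 0 && cols > 0);

    int n = snprintf(buf, size, "INSERT INTO %s VALUES ", table);
    if (n < 0 || (size_t)n >= size)
        return -1;
    size_t used = (size_t)n;

    int param = 1;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            /* Open a new tuple at column 0, otherwise separate values. */
            const char *prefix = (c > 0) ? "," : (r > 0) ? ",(" : "(";
            n = snprintf(buf + used, size - used, "%s$%d", prefix, param++);
            if (n < 0 || (size_t)n >= size - used)
                return -1;
            used += (size_t)n;
        }
        if (size - used < 2)            /* room for ')' and the NUL */
            return -1;
        buf[used++] = ')';
        buf[used] = '\0';
    }
    return (int)used;
}
```

The resulting text could be passed to libpq's PQexecParams() together with rows * cols parameter values, so that one round trip writes the whole batch.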

I'm seriously impressed that your system is working under load at all
with 800 concurrent connections fighting to write all at once.

Can't connect to DB: could not send data to server: Transport endpoint
is not connected
could not send startup packet: Transport endpoint is not connected

postmaster forking and failing because of operating system resource
limits like max proc count, anti-forkbomb measures, max file handles, etc?

-- problem 2 --

As I'm trying to debug this (with strace), I could never reproduce it,
at least to see what's going on, but sometimes I get another error:
"too many users connected". Even restarting postmaster doesn't help.
The postmaster is running with -N810, and the role has a connection
limit of 1000. Yet, the "too many" error starts creeping up after only
275 connections are opened (counted by successful connect() calls in
strace).

Any idea where I should dig?

See how many connections the *server* thinks exist by examining
pg_stat_activity.

Check dmesg and the PostgreSQL server logs to see if you're hitting
operating system limits. Look for fork() failures, unexplained
segfaults, etc.
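
A quick way to look at the process and file-descriptor limits mentioned above is getrlimit(). The sketch below is an assumption-laden illustration: the function names are hypothetical, and the values it prints are those of the process that runs it, so it would need to run in the same environment as the postmaster (on Linux, /proc/<pid>/limits shows the same information for an already running process).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Print one limit value, using "unlimited" for RLIM_INFINITY. */
static void print_value(FILE *out, rlim_t value)
{
    if (value == RLIM_INFINITY)
        fprintf(out, "unlimited");
    else
        fprintf(out, "%llu", (unsigned long long)value);
}

/* Print the soft and hard value of one resource limit for the calling
 * process. Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(FILE *out, const char *name, int resource)
{
    struct rlimit rl;

    assert(out != NULL && name != NULL);
    if (getrlimit(resource, &rl) != 0)
        return -1;

    fprintf(out, "%-14s soft=", name);
    print_value(out, rl.rlim_cur);
    fprintf(out, " hard=");
    print_value(out, rl.rlim_max);
    fprintf(out, "\n");
    return 0;
}

/* Report the two limits named in the discussion: processes and files. */
int print_connection_limits(FILE *out)
{
    int rc = print_limit(out, "max processes", RLIMIT_NPROC);
    if (print_limit(out, "open files", RLIMIT_NOFILE) != 0)
        rc = -1;
    return rc;
}
```

Calling print_connection_limits(stdout) as the database user gives a first hint of whether 800 backends could plausibly exceed either limit.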

--
Craig Ringer

#3 John R Pierce
pierce@hogranch.com
In reply to: Craig Ringer (#2)
Re: errors with high connections rate

On 07/03/12 12:34 AM, Craig Ringer wrote:

I'm seriously impressed that your system is working under load at all
with 800 concurrent connections fighting to write all at once.

indeed, in my transactional benchmarks on 12 core, 24 thread dual xeon
x5600 class systems, with 16 or 20 spindle raid10, I find somewhere
around 50 to 80 database connection threads has the highest overall
throughput (several thousand OLTP transactions/second). this hardware
has vastly better IO and CPU performance than any AWS virtual machine.

as craig suggested, your network threads could put the incoming requests
into queue(s), and run a tunable number of database connection threads
that take requests out of the queue and send them to the database, and
if necessary, return results to the network thread. doing this will
give better CPU utilization; you can try different database worker
thread counts until you hit the optimal number for your hardware.
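
The queue arrangement described above can be sketched with a mutex and two condition variables. This is a minimal illustration, not anyone's production design: the struct and function names and the capacity are assumptions, and shutdown handling is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAPACITY 1024   /* illustrative size, tune for the workload */

/* Fixed-size FIFO shared by network threads (producers) and a small,
 * tunable number of database writer threads (consumers). */
struct work_queue {
    void *items[QUEUE_CAPACITY];
    size_t head, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

void queue_init(struct work_queue *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Blocks while the queue is full, which pushes back on producers. */
void queue_push(struct work_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty; each writer thread loops on this. */
void *queue_pop(struct work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```

Each database thread would open one connection at startup and then loop on queue_pop(), so the number of connections equals the number of writer threads regardless of how many network threads feed the queue.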

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#4 Pawel Veselov
pawel.veselov@gmail.com
In reply to: John R Pierce (#3)
Re: errors with high connections rate

On 07/03/2012 12:54 AM, John R Pierce wrote:

On 07/03/12 12:34 AM, Craig Ringer wrote:

I'm seriously impressed that your system is working under load at all
with 800 concurrent connections fighting to write all at once.

indeed, in my transactional benchmarks on 12 core, 24 thread dual
xeon x5600 class systems, with 16 or 20 spindle raid10, I find
somewhere around 50 to 80 database connection threads has the highest
overall throughput (several thousand OLTP transactions/second).
this hardware has vastly better IO and CPU performance than any AWS
virtual machine.

as craig suggested, your network threads could put the incoming
requests into queue(s), and run a tunable number of database
connection threads that take requests out of the queue and send them
to the database, and if necessary, return results to the network
thread. doing this will give better CPU utilization; you can try
different database worker thread counts until you hit the optimal number
for your hardware.

Just to clear the air on this, this is almost exactly what I'm doing.
The number of 800 came out of experimenting with numbers (I'm sure it
took you some time to find the optimum of 50-80 for your configuration).
The number of "worker" threads is configurable, and they do receive
their work from a shared queue. By the way, on the operations that I'm
doing, postgres is performing very well, with an average of less than 10ms
per transaction, and throughput at times over 600 tps.

However, writing data to postgres is not the only thing I need to do to
process the data. If the time to process rises for other reasons, a low
number of threads may not be able to withstand the constant stream of
incoming data, and I have to raise the worker thread number to
compensate. As I was doing this, I ran into the problem described in the
original email, and it puzzled me. However, just because I opened 800
connections doesn't mean that all of the connections are being
actively used concurrently (so not that much fighting). I indeed should
switch to a connection pool model in such a case, just to not over-fork
postgres; however, I don't see the forked server processes consuming any
significant amount of system resources.

Thank you,
Pawel.

#5 Pawel Veselov
pawel.veselov@gmail.com
In reply to: Craig Ringer (#2)
Re: errors with high connections rate

On 07/03/2012 12:34 AM, Craig Ringer wrote:

On 07/03/2012 03:19 PM, Pawel Veselov wrote:

Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3
(Amazon AMI distro).
The application writes data at a high rate (at this point it's 500
transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads that receive "messages" that are then
written out to the DB. There is no connection pool; instead, each
worker thread maintains its own connection that it uses to write
data to the database. The connections are kept in pthread's "specific"
data blocks.

[skipped, replied to separately]

Can't connect to DB: could not send data to server: Transport
endpoint is not connected
could not send startup packet: Transport endpoint is not connected

postmaster forking and failing because of operating system resource
limits like max proc count, anti-forkbomb measures, max file handles, etc?

If accept() succeeded, and fork() failed, the socket would be closed by
the process (the parent will close it, and the child's copy of the socket
wouldn't even be forked). Wouldn't that result in ECONNRESET, and not
ENOTCONN?
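
The distinction can be checked with a small experiment on Unix-domain stream sockets. This is a sketch assuming Linux behaviour, with hypothetical function names: a send on a socket that was never connected is expected to fail with ENOTCONN, while a send after the peer has closed its end is expected to fail with a different error (EPIPE or ECONNRESET, depending on the socket's state). It says nothing about what libpq itself saw.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to send one byte on fd; return 0 on success or the errno value.
 * MSG_NOSIGNAL suppresses SIGPIPE so the failure is reported via errno. */
int send_errno(int fd)
{
    char byte = 'x';

    assert(fd >= 0);
    if (send(fd, &byte, 1, MSG_NOSIGNAL) == 1)
        return 0;
    return errno;
}

/* Error from sending on a stream socket that was never connected. */
int errno_never_connected(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;
    int err = send_errno(fd);
    close(fd);
    return err;
}

/* Error from sending after the peer has closed its end. */
int errno_peer_closed(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return errno;
    close(sv[1]);                 /* the "server" side goes away */
    int err = send_errno(sv[0]);
    close(sv[0]);
    return err;
}
```

If the two functions return different values on the target system, that supports the reasoning that a server-side close after accept() would not by itself explain ENOTCONN on the client.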

-- problem 2 --

As I'm trying to debug this (with strace), I could never reproduce
it, at least to see what's going on, but sometimes I get another
error: "too many users connected". Even restarting postmaster
doesn't help. The postmaster is running with -N810, and the role has a
connection limit of 1000. Yet, the "too many" error starts creeping
up after only 275 connections are opened (counted by successful
connect() calls in strace).

Any idea where I should dig?

See how many connections the *server* thinks exist by examining
pg_stat_activity.

Check dmesg and the PostgreSQL server logs to see if you're hitting
operating system limits. Look for fork() failures, unexplained
segfaults, etc.

That's the thing, no segfaults (dmesg), nothing in the server logs.

It may as well be some sort of an anti-fork-bomb measure, only judging
by the fact that with enough attempts, things do clear out, though I
wish there would be some indication of that, and I'm still confused
about the error code being ENOTCONN.

#6 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Pawel Veselov (#5)
Re: errors with high connections rate

John R Pierce wrote:

On 07/03/12 12:34 AM, Craig Ringer wrote:

I'm seriously impressed that your system is working under load at
all with 800 concurrent connections fighting to write all at once.

indeed, in my transactional benchmarks on 12 core, 24 thread dual
xeon x5600 class systems, with 16 or 20 spindle raid10, I find
somewhere around 50 to 80 database connection threads has the
highest overall throughput (several thousand OLTP
transactions/second). this hardware has vastly better IO and CPU
performance than any AWS virtual machine.

as craig suggested, your network threads could put the incoming
requests into queue(s), and run a tunable number of database
connection threads that take requests out of the queue and send
them to the database, and if necessary, return results to the
network thread. doing this will give better CPU utilization; you
can try different database worker thread counts until you hit the
optimal number for your hardware.

+1

We (at the Wisconsin courts) have definitely found that the best
model for us is to have a separate layer for running database
transactions, with one thread per database connection and each of
those threads pulling from a prioritized FIFO queue into which
*other* layers place requests.

This comes up so often that I threw together a Wiki page for it:

http://wiki.postgresql.org/wiki/Number_Of_Database_Connections

Of course, everyone should feel free to improve the page.

-Kevin

#7 Craig Ringer
craig@2ndquadrant.com
In reply to: Pawel Veselov (#5)
Re: errors with high connections rate

On 07/03/2012 04:26 PM, Pawel S. Veselov wrote:

That's the thing, no segfaults (dmesg), nothing in the server logs.

It may as well be some sort of an anti-fork-bomb measure, only judging
by the fact that with enough attempts, things do clear out, though I
wish there would be some indication of that, and I'm still confused
about the error code being ENOTCONN.

I've managed to produce the endpoint not connected errors with a little
test I wrote here. Only once so far and only during an abnormal test run
where I signalled the test workers as they were starting up, so that's
not really very helpful.

I have no problem using a little Python test program to create 800
connections in about a second. It forks some workers (100 by default)
which grab enough connections each to reach the target connection count.

Ooh, handy. I just triggered it again now. The "Transport endpoint is
not connected" messages were intermixed with some "FATAL: sorry, too
many clients already" messages. The PostgreSQL log is full of "FATAL:
sorry, too many clients already" messages intermixed with "LOG:
unexpected EOF on client connection" messages. Again it was an abnormal
run where I signalled my workers midway through startup.

Interesting, that. I've never seen it on a run where I don't send a
signal. You know what that makes me think? You're using a multithreaded
approach, and there's something going wrong in your app's innards. Yes,
that's a lot of hot air and handwaving, but it fits - you're getting an
error saying that libpq is trying to operate on a socket that isn't
connected.

The fact that there's nothing in the system logs or Pg logs just adds
weight to that. I'm guessing you have a threading bug, possibly signal
related.

--
Craig Ringer

#8 Craig Ringer
craig@2ndquadrant.com
In reply to: Craig Ringer (#7)
Re: errors with high connections rate

Here's the test program, btw:

https://github.com/ringerc/scrapcode/tree/master/scripts/pg_forktest

pgfork.py is a home rolled fork() horror.

pg_mp.py is the same thing done with Python's multiprocessing module.

--
Craig Ringer