Solaris source code
I have purchased the Solaris source code from Sun for $80. (I could
have downloaded it for free after faxing them an 11-page contract, but I
decided I wanted the CDs.) See the slashdot story at:
http://slashdot.org/article.pl?sid=01/06/30/1224257&mode=thread
My hope is that I can use the source code to help debug Solaris
PostgreSQL problems. It includes source for the kernel and all user
programs. The code is similar to the *BSD kernels. It is basically Unix
SVR4 with Sun's enhancements. It has both AT&T and Sun copyrights on
the files.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
At 04:30 PM 7/5/01 -0400, Bruce Momjian wrote:
Bruce,
We are about to roll out PostgreSQL on Solaris, and I am interested in any
Solaris-specific gotchas. Do you have some specifics in mind, or was this
just general preventive maintenance type steps?
--
Naomi Walker
Chief Information Officer
Eldorado Computing, Inc.
602-604-3100 ext 242
Bruce,
We are about to roll out PostgreSQL on Solaris, and I am interested in any
Solaris specific gotcha's. Do you have some specifics in mind, or was this
just general preventive maintenance type steps?
Preventative. I have heard Solaris has higher context-switching overhead, and
that may affect us because we use processes instead of threads.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Thu, Jul 05, 2001 at 02:03:31PM -0700, Naomi Walker wrote:
We are about to roll out PostgreSQL on Solaris, and I am interested
in any Solaris specific gotcha's. Do you have some specifics in mind,
or was this just general preventive maintenance type steps?
There have been reports of trouble with Unix sockets on Solaris.
You can use TCP sockets, which might be slower; or change, in
src/backend/libpq/pqcomm.c, the line
listen(fd, SOMAXCONN);
to
listen(fd, 1024);
(Cf. Stevens, "Unix Network Programming, Volume 1", pp. 96 and 918.)
I don't know of (and Stevens doesn't hint at) any reason not to fold
this change into the mainline sources. However, we haven't heard
from the people who had trouble with Unix sockets whether this
change actually fixes their problems.
The effect of the change is to make it much less likely for a
connection request to be rejected when connections are being opened
very frequently.
Nathan Myers
ncm@zembu.com
On Thu, Jul 05, 2001 at 04:30:40PM -0400, Bruce Momjian allegedly wrote:
Cool. It would be nice to know why the regression tests fail on Solaris when
using a UNIX socket.
Cheers,
Mathijs
On Thu, Jul 05, 2001 at 02:03:31PM -0700, Naomi Walker allegedly wrote:
Bruce,
We are about to roll out PostgreSQL on Solaris, and I am interested in any
Solaris specific gotcha's. Do you have some specifics in mind, or was this
just general preventive maintenance type steps?
PostgreSQL 7.1 fails the regression tests when using a UNIX socket,
which is faster than a TCP/IP socket (when both the client and the
server are running on the same machine). We're running a few small
PostgreSQL databases on Solaris and we're going to implement a bigger
one in the near future. If you connect via TCP/IP sockets, you should be
safe. We're using JDBC to connect to the database and JDBC always uses
a TCP/IP socket. So far we haven't run into any real problems, although
PostgreSQL did crash once, for unknown reasons (probably because someone
was messing with it).
Not really helpful, I guess. Doing some testing of your own is highly
recommended ;)
Cheers,
Mathijs
On Mon, Jul 09, 2001 at 02:03:16PM -0700, Nathan Myers allegedly wrote:
On Mon, Jul 09, 2001 at 02:24:17PM +0200, Mathijs Brands wrote:
PostgreSQL 7.1 fails the regression tests when using a UNIX socket,
which is faster than a TCP/IP socket (when both the client and the
server are running on the same machine).
Have you tried increasing the argument to listen in libpq/pqcomm.c
from SOMAXCONN to 1024? I think many people would be very interested
in your results.
OK, I tried using 1024 (and later 128) instead of SOMAXCONN (defined to
be 5 on Solaris) in src/backend/libpq/pqcomm.c and ran a few regression
tests on two different Sparc boxes (Solaris 7 and 8). The regression
test still fails, but for a different reason: the abstime test fails
not only on Solaris but also on FreeBSD (4.3-RELEASE).
*** ./expected/abstime.out Thu May 3 21:00:37 2001
--- ./results/abstime.out Tue Jul 10 10:34:18 2001
***************
*** 47,56 ****
| Sun Jan 14 03:14:21 1973 PST
| Mon May 01 00:30:30 1995 PDT
| epoch
- | current
| -infinity
| Sat May 10 23:59:12 1947 PST
! (6 rows)
SELECT '' AS six, ABSTIME_TBL.*
WHERE ABSTIME_TBL.f1 > abstime '-infinity';
--- 47,55 ----
| Sun Jan 14 03:14:21 1973 PST
| Mon May 01 00:30:30 1995 PDT
| epoch
| -infinity
| Sat May 10 23:59:12 1947 PST
! (5 rows)
SELECT '' AS six, ABSTIME_TBL.*
WHERE ABSTIME_TBL.f1 > abstime '-infinity';
======================================================================
I've checked the FreeBSD and Linux headers and they've got SOMAXCONN set
to 128.
Here's a snippet from the linux listen(2) manpage:
BUGS
If the socket is of type AF_INET, and the backlog argument
is greater than the constant SOMAXCONN (128 in Linux 2.0 &
2.2), it is silently truncated to SOMAXCONN. Don't rely
on this value in portable applications since BSD (and some
BSD-derived systems) limit the backlog to 5.
I've checked Solaris 2.6, 7 and 8 and the kernels have a default value
of 128 for the number of backlog connections. This number can be
increased to 1000 (maybe even larger). On Solaris 2.4 and 2.5 it is
apparently set to 32. Judging from Adrian Cockcroft's Solaris tuning
guide, Sun has been using a default value of 128 from Solaris 2.5.1
on. You do need some patches for 2.5.1: patches 103582 & 103630 (SPARC)
or patches 103581 & 10361 (X86). Later versions of Solaris don't need
any patches. You can check (and set) the number of backlog connections
by using the following command:
Solaris 2.3, 2.4, 2.5 and unpatched 2.5.1:
/usr/sbin/ndd /dev/tcp tcp_conn_req_max (untested)
Solaris 2.5.1 (patched), 2.6, 7 and 8:
/usr/sbin/ndd /dev/tcp tcp_conn_req_max_q
It'd probably be a good idea to use a value of 128 for the number of
backlog connections and not SOMAXCONN. If the requested number of
backlog connections is bigger than the number the kernel allows, it
should be truncated. Of course, there's no guarantee that this won't
cause problems on arcane platforms such as Ultrix (if it is still
supported).
The Apache survival guide has more info on TCP/IP tuning for several
platforms and includes information on the listen backlog.
Cheers,
Mathijs
Ps. Just checking IRIX 6.5 - it's got the backlog set to 1000
connections.
--
And the beast shall be made legion. Its numbers shall be increased a
thousand thousand fold. The din of a million keyboards like unto a great
storm shall cover the earth, and the followers of Mammon shall tremble.
Mathijs Brands <mathijs@ilse.nl> writes:
OK, I tried using 1024 (and later 128) instead of SOMAXCONN (defined to
be 5 on Solaris) in src/backend/libpq/pqcomm.c and ran a few regression
tests on two different Sparc boxes (Solaris 7 and 8). The regression
test still fails, but for a different reason. The abstime test fails;
not only on Solaris but also on FreeBSD (4.3-RELEASE).
The abstime diff is to be expected (if you look closely, the test is
comparing 'current' to 'June 30, 2001'. Ooops). If that's the only
diff then you are in good shape.
Based on this and previous discussions, I am strongly tempted to remove
the use of SOMAXCONN and instead use, say,
#define PG_SOMAXCONN 1000
defined in config.h.in. That would leave room for configure to twiddle
it, if that proves necessary. Does anyone know of a platform where this
would cause problems? AFAICT, all versions of listen(2) are claimed to
be willing to reduce the passed parameter to whatever they can handle.
regards, tom lane
Tom Lane writes:
Based on this and previous discussions, I am strongly tempted to remove
the use of SOMAXCONN and instead use, say,
#define PG_SOMAXCONN 1000
defined in config.h.in. That would leave room for configure to twiddle
it, if that proves necessary. Does anyone know of a platform where this
would cause problems? AFAICT, all versions of listen(2) are claimed to
be willing to reduce the passed parameter to whatever they can handle.
Could we test SOMAXCONN and set PG_SOMAXCONN to 1000 only if SOMAXCONN
is less than 1000?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Tue, Jul 10, 2001 at 05:06:28PM -0400, Bruce Momjian wrote:
Could we test SOMAXCONN and set PG_SOMAXCONN to 1000 only if SOMAXCONN
is less than 1000?
All the OSes we know of fold it to 128, currently. We can jump it
to 10240 now, or later when there are 20GHz CPUs.
If you want to make it more complicated, it would be more useful to
be able to set the value lower for runtime environments where PG is
competing for OS resources with another daemon that deserves higher
priority.
Nathan Myers
ncm@zembu.com
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Could we test SOMAXCONN and set PG_SOMAXCONN to 1000 only if SOMAXCONN
is less than 1000?
Why bother?
If you've got some plausible scenario where 1000 is too small, we could
just as easily make it 10000. I don't see the need for yet another
configure test for this.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Could we test SOMAXCONN and set PG_SOMAXCONN to 1000 only if SOMAXCONN
is less than 1000?
Why bother?
If you've got some plausible scenario where 1000 is too small, we could
just as easily make it 10000. I don't see the need for yet another
configure test for this.
I was thinking:
#if SOMAXCONN >= 1000
#define PG_SOMAXCONN SOMAXCONN
#else
#define PG_SOMAXCONN 1000
#endif
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I was thinking:
#if SOMAXCONN >= 1000
#define PG_SOMAXCONN SOMAXCONN
#else
#define PG_SOMAXCONN 1000
#endif
Not in config.h, you don't. Unless you want <sys/socket.h> (or
whichever header defines SOMAXCONN; how consistent is that across
platforms, anyway?) to be included by everything in the system ...
regards, tom lane
ncm@zembu.com (Nathan Myers) writes:
All the OSes we know of fold it to 128, currently. We can jump it
to 10240 now, or later when there are 20GHz CPUs.
If you want to make it more complicated, it would be more useful to
be able to set the value lower for runtime environments where PG is
competing for OS resources with another daemon that deserves higher
priority.
Hmm, good point. Does anyone have a feeling for the amount of kernel
resources that are actually sucked up by an accept-queue entry? If 128
is the customary limit, is it actually worth worrying about whether
we are setting it to 128 vs. something smaller?
regards, tom lane
On Tue, Jul 10, 2001 at 06:36:21PM -0400, Tom Lane wrote:
Hmm, good point. Does anyone have a feeling for the amount of kernel
resources that are actually sucked up by an accept-queue entry? If 128
is the customary limit, is it actually worth worrying about whether
we are setting it to 128 vs. something smaller?
I don't think the issue is the resources that are consumed by the
accept-queue entry. Rather, it's a tuning knob to help shed load
at the entry point to the system, before significant resources have
been committed. An administrator would tune it according to actual
system and traffic characteristics.
It is easy enough for somebody to change, if they care, so it seems
to me we have already devoted more time to it than it deserves right now.
Nathan Myers
ncm@zembu.com
Tom Lane <tgl@sss.pgh.pa.us> writes:
Hmm, good point. Does anyone have a feeling for the amount of kernel
resources that are actually sucked up by an accept-queue entry? If 128
is the customary limit, is it actually worth worrying about whether
we are setting it to 128 vs. something smaller?
Not much in the way of kernel resources is required by an entry on the
accept queue. Basically a socket structure and maybe a couple of
addresses, typically about 200 bytes or so.
But I wouldn't worry about it, and I wouldn't worry about Nathan's
suggestion for making the limit configurable, because Postgres
connections don't spend time on the queue. The postgres server will
be picking them off as fast as it can. If the server can't pick
processes off fast enough, then your system has other problems;
reducing the size of the queue won't help those problems. A large
queue will help when a large number of connections arrives
simultaneously--it will permit Postgres to deal with them appropriately,
rather than causing the system to discard them on its terms.
(Matters might be different if the Postgres server were written to not
call accept when it had the maximum number of connections active, and
to just leave connections on the queue in that case. But that's not
how it works today.)
Ian
---------------------------(end of broadcast)---------------------------
TIP 842: "When the only tool you have is a hammer, you tend to treat
everything as if it were a nail."
-- Abraham Maslow
Tom Lane writes:
Hmm, good point. Does anyone have a feeling for the amount of kernel
resources that are actually sucked up by an accept-queue entry? If 128
is the customary limit, is it actually worth worrying about whether
we are setting it to 128 vs. something smaller?
All I can say is keep in mind that Solaris uses SVr4 streams, which are
quite a bit heavier than the BSD-based sockets. I don't know any
numbers.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Quick rundown of our configuration:
Red Hat 7.1 (no changes or extras added by us)
Postgresql 7.1.2 and CVS HEAD from 07/10/2001
3.8 gb database size
I included two pgsql versions because this happens on both.
Here's the problem we're having:
We run a vacuumdb from the server on the entire database. Some large tables
are vacuumed very quickly, but the vacuum process hangs or takes more than a
few hours on a specific table (we haven't let it finish before). The vacuum
process works quickly on a table (loginhistory) with 2.8 million records, but
is extremely slow on a table (inbox) with 1.1 million records (the table with
1.1 million records is actually larger in kb size than the other table).
We've tried to vacuum the inbox table separately ('vacuum inbox' within
psql), but this still takes hours (again we have never let it complete, we
need to use the database for development as well).
We noticed 2 things that are significant to this situation:
The server logs the following:
DEBUG: --Relation msginbox--
DEBUG: Pages 129921: Changed 26735, reaped 85786, Empty 0, New 0; Tup
1129861: Vac 560327, Keep/VTL 0/0, Crash 0, UnUsed 51549, MinLen 100,
MaxLen 2032; Re-using: Free/Avail. Space 359061488/359059332;
EndEmpty/Avail. Pages 0/85785. CPU 11.18s/5.32u sec.
DEBUG: Index msginbox_pkey: Pages 4749; Tuples 1129861: Deleted 76360.
CPU 0.47s/6.70u sec.
DEBUG: Index msginbox_fromto: Pages 5978; Tuples 1129861: Deleted 0.
CPU 0.37s/6.15u sec.
DEBUG: Index msginbox_search: Pages 4536; Tuples 1129861: Deleted 0.
CPU 0.32s/6.30u sec.
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
DEBUG: XLogWrite: new log file created - consider increasing WAL_FILES
the last few lines (XLogWrite .....) repeat for ever and ever and ever. With
7.1.2 this never stops unless we run out of disk space or cancel the query.
With CVS HEAD this still continues, but the log files don't consume all disk
space, but we still have to cancel it or it might run forever.
Perhaps we need to let it run until it completes, but we thought that we
might be doing something wrong or have some data (we're converting data from
MS SQL Server) that isn't friendly.
The major issue we're facing with this is that any read or write access to
the table being vacuumed times out (obviously because the table is still
locked). We plan to use PostgreSQL in our production service, but we can't
until we get this resolved.
We're at a loss, not being familiar enough with PostgreSQL and its source
code. Can anyone please offer some advice or suggestions?
Thanks,
Mark
Ian Lance Taylor <ian@zembu.com> writes:
But I wouldn't worry about it, and I wouldn't worry about Nathan's
suggestion for making the limit configurable, because Postgres
connections don't spend time on the queue. The postgres server will
be picking them off as fast as it can. If the server can't pick
processes off fast enough, then your system has other problems;
Right. Okay, it seems like just making it a hand-configurable entry
in config.h.in is good enough for now. When and if we find that
that's inadequate in a real-world situation, we can improve on it...
regards, tom lane
Tom Lane writes:
Right. Okay, it seems like just making it a hand-configurable entry
in config.h.in is good enough for now. When and if we find that
that's inadequate in a real-world situation, we can improve on it...
Would anything computed from the maximum number of allowed connections
make sense?
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane writes:
Right. Okay, it seems like just making it a hand-configurable entry
in config.h.in is good enough for now. When and if we find that
that's inadequate in a real-world situation, we can improve on it...
Would anything computed from the maximum number of allowed connections
make sense?
[ looks at code ... ] Hmm, MaxBackends is indeed set before we arrive
at the listen(), so it'd be possible to use MaxBackends to compute the
parameter. Offhand I would think that MaxBackends or at most
2*MaxBackends would be a reasonable value.
Question, though: is this better than having a hardwired constant?
The only case I can think of where it might not be is if some platform
out there throws an error from listen() when the parameter is too large
for it, rather than silently reducing the value to what it can handle.
A value set in config.h.in would be simpler to adapt for such a platform.
BTW, while I'm thinking about it: why doesn't pqcomm.c test for a
failure return from the listen() call? Is this just an oversight,
or is there a good reason to ignore errors?
regards, tom lane
Tom Lane writes:
[ looks at code ... ] Hmm, MaxBackends is indeed set before we arrive
at the listen(), so it'd be possible to use MaxBackends to compute the
parameter. Offhand I would think that MaxBackends or at most
2*MaxBackends would be a reasonable value.
Don't we have maxbackends configurable at runtime? If so, any constant
we put in config.h will be inaccurate. Seems we have to track
maxbackends.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Don't we have maxbackends configurable at runtime?
Not after postmaster start, so passing it to the initial listen()
shouldn't be a problem.
The other concern I had could be addressed by making the listen
parameter be MIN(MaxBackends, PG_SOMAXCONN) where PG_SOMAXCONN
is set in config.h --- but now we could make the default value
really large, say 10000. The only reason to change it would be
if you had a kernel that barfed on large listen() parameters.
Have we beat this issue to death yet, or is it still twitching?
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Don't we have maxbackends configurable at runtime?
Not after postmaster start, so passing it to the initial listen()
shouldn't be a problem.
The other concern I had could be addressed by making the listen
parameter be MIN(MaxBackends, PG_SOMAXCONN) where PG_SOMAXCONN
is set in config.h --- but now we could make the default value
really large, say 10000. The only reason to change it would be
if you had a kernel that barfed on large listen() parameters.
Sounds good to me.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane writes:
The other concern I had could be addressed by making the listen
parameter be MIN(MaxBackends, PG_SOMAXCONN) where PG_SOMAXCONN
is set in config.h --- but now we could make the default value
really large, say 10000. The only reason to change it would be
if you had a kernel that barfed on large listen() parameters.
We'll never find that out if we don't try it. If you're concerned about
cooperating with other listen()ing processes, set it to MaxBackends * 2;
if you're not, set it to INT_MAX and watch.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
We increased shared memory in the Linux kernel, which decreased the vacuumdb
time from 40 minutes to 14 minutes on a 450 MHz processor. We calculate that
on our dual 1 GHz box with gigabit Ethernet SAN connection this will go down
to under 5 minutes. This is acceptable to us. Sorry about the unnecessary post.
On Wed, Jul 11, 2001 at 12:26:43PM -0400, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane writes:
Right. Okay, it seems like just making it a hand-configurable entry
in config.h.in is good enough for now. When and if we find that
that's inadequate in a real-world situation, we can improve on it...

Would anything computed from the maximum number of allowed connections
make sense?

[ looks at code ... ] Hmm, MaxBackends is indeed set before we arrive
at the listen(), so it'd be possible to use MaxBackends to compute the
parameter. Offhand I would think that MaxBackends or at most
2*MaxBackends would be a reasonable value.

Question, though: is this better than having a hardwired constant?
The only case I can think of where it might not be is if some platform
out there throws an error from listen() when the parameter is too large
for it, rather than silently reducing the value to what it can handle.
A value set in config.h.in would be simpler to adapt for such a platform.
The question is really whether you ever want a client to get a
"rejected" result from an open attempt, or whether you'd rather they
got a report from the back end telling them they can't log in. The
second is more polite but a lot more expensive. That expense might
really matter if you have MaxBackends already running.
I doubt most clients have tested either failure case more thoroughly
than the other (or at all), but the lower-level code is more likely
to have been cut-and-pasted from well-tested code. :-)
Maybe PG should avoid accept()ing connections once it has MaxBackends
back ends already running (as hinted at by Ian), so that the listen()
parameter actually has some meaningful effect, and excess connections
can be rejected more cheaply. That might also make it easier to respond
more adaptively to true load than we do now.
BTW, while I'm thinking about it: why doesn't pqcomm.c test for a
failure return from the listen() call? Is this just an oversight,
or is there a good reason to ignore errors?
The failure of listen() seems impossible. In the Linux, NetBSD, and
Solaris man pages, none of the error returns mentioned are possible
with PG's current use of the function. It seems as if the most that
might be needed now would be to add a comment to the call to socket()
noting that if any other address families are supported (besides
AF_INET and AF_LOCAL aka AF_UNIX), the call to listen() might need to
be looked at. AF_INET6 (which PG will need to support someday)
doesn't seem to change matters.
Probably if listen() did fail, then one or other of bind(), accept(),
and read() would fail too.
Nathan Myers
ncm@zembu.com
The question is really whether you ever want a client to get a
"rejected" result from an open attempt, or whether you'd rather they
got a report from the back end telling them they can't log in. The
second is more polite but a lot more expensive. That expense might
really matter if you have MaxBackends already running.
One of us has probably misunderstood the listen parameter.
It only limits the number of clients that can connect concurrently.
It has nothing to do with the number of clients that are already connected.
It sort of resembles a maximum queue size for the accept loop.
Incoming connections fill the queue, accept frees the queue by taking the
connection to a newly forked backend.
Andreas
On Thu, Jul 12, 2001 at 10:14:44AM +0200, Zeugswetter Andreas SB wrote:
The question is really whether you ever want a client to get a
"rejected" result from an open attempt, or whether you'd rather they
got a report from the back end telling them they can't log in. The
second is more polite but a lot more expensive. That expense might
really matter if you have MaxBackends already running.

One of us has probably misunderstood the listen parameter.
I don't think so.
It only limits the number of clients that can connect concurrently.
It has nothing to do with the number of clients that are already
connected. It sort of resembles a maximum queue size for the accept
loop. Incoming connections fill the queue, accept frees the queue by
taking the connection to a newly forked backend.
The MaxBackends constant and the listen() parameter have no effect
until the number of clients already connected or trying to connect
and not yet noticed by the postmaster (respectively) exceeds some
threshold. We would like to choose such thresholds so that we don't
promise service we can't deliver.
We can assume the administrator has tuned MaxBackends so that a
system with that many back ends running really _is_ heavily loaded.
(We have talked about providing a better measure of load than the
gross number of back ends; is that on the Todo list?)
When the system is too heavily loaded (however measured), any further
login attempts will fail. What I suggested is, instead of the
postmaster accept()ing the connection, why not leave the connection
attempt in the queue until we can afford a back end to handle it?
Then, the argument to listen() will determine how many attempts can
be in the queue before the network stack itself rejects them without
the postmaster involved.
As it is, the listen() queue limit is not useful. It could be made
useful with a slight change in postmaster behavior.
Nathan Myers
ncm@zembu.com
Nathan Myers writes:
When the system is too heavily loaded (however measured), any further
login attempts will fail. What I suggested is, instead of the
postmaster accept()ing the connection, why not leave the connection
attempt in the queue until we can afford a back end to handle it?
Because the new connection might be a cancel request.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
When the system is too heavily loaded (however measured), any further
login attempts will fail. What I suggested is, instead of the
postmaster accept()ing the connection, why not leave the connection
attempt in the queue until we can afford a back end to handle it?
Because the clients would time out ?
Then, the argument to listen() will determine how many attempts can
be in the queue before the network stack itself rejects them without
the postmaster involved.
You cannot change the argument to listen() at runtime, or are you suggesting
to close and reopen the socket when maxbackends is reached ? I think
that would be nonsense.
I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no use in
accepting more than your total allowed connections concurrently.
Andreas
Zeugswetter Andreas SB wrote:
When the system is too heavily loaded (however measured), any further
login attempts will fail. What I suggested is, instead of the
postmaster accept()ing the connection, why not leave the connection
attempt in the queue until we can afford a back end to handle it?

Because the clients would time out?
Then, the argument to listen() will determine how many attempts can
be in the queue before the network stack itself rejects them without
the postmaster involved.

You cannot change the argument to listen() at runtime, or are you suggesting
to close and reopen the socket when maxbackends is reached? I think
that would be nonsense.

I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no use in
accepting more than your total allowed connections concurrently.

Andreas
I have been following this thread and I am confused why the queue argument to
listen() has anything to do with Max backends. All the parameter to listen does
is specify how long a list of sockets open and waiting for connection can be.
It has nothing to do with the number of back end sockets which are open.
If you have a limit of 128 back end connections, and you have 127 of them open,
a listen with queue size of 128 will still allow 128 sockets to wait for
connection before turning others away.
It should be a parameter based on the time out of a socket connection vs the
ability to answer connection requests within that period of time.
There are two ways to think about this. Either you make this parameter tunable
to give a proper estimate of the usability of the system, i.e. tailor the
listen queue parameter to reject sockets when some number of sockets are
waiting, or you say no one should ever be denied, accept everyone and let them
time out if we are not fast enough.
This debate could go on, why not make it a parameter in the config file that
defaults to some system variable, i.e. SOMAXCONN.
BTW: on linux, the backlog queue parameter is silently truncated to 128 anyway.
On Fri, Jul 13, 2001 at 10:36:13AM +0200, Zeugswetter Andreas SB wrote:
When the system is too heavily loaded (however measured), any further
login attempts will fail. What I suggested is, instead of the
postmaster accept()ing the connection, why not leave the connection
attempt in the queue until we can afford a back end to handle it?

Because the clients would time out?
It takes a long time for half-open connections to time out, by default.
Probably most clients would time out, themselves, first, if PG took too
long to get to them. That would be a Good Thing.
Once the SOMAXCONN threshold is reached (which would only happen when
the system is very heavily loaded, because when it's not then nothing
stays in the queue for long), new connection attempts would fail
immediately, another Good Thing. When the system is very heavily
loaded, we don't want to spare attention for clients we can't serve.
Then, the argument to listen() will determine how many attempts can
be in the queue before the network stack itself rejects them without
the postmaster involved.

You cannot change the argument to listen() at runtime, or are you suggesting
to close and reopen the socket when maxbackends is reached ? I think
that would be nonsense.
Of course that would not work, and indeed nobody suggested it.
If postmaster behaved a little differently, not accept()ing when
the system is too heavily loaded, then it would be reasonable to
call listen() (once!) with PG_SOMAXCONN set to (e.g.) N=20.
Where the system is not too heavily-loaded, the postmaster accept()s
the connection attempts from the queue very quickly, and the number
of half-open connections never builds up to N. (This is how PG has
been running already, under light load -- except that on Solaris with
Unix sockets N has been too small.)
When the system *is* heavily loaded, the first N attempts would be
queued, and then the OS would automatically reject the rest. This
is better than accept()ing any number of attempts and then refusing
to authenticate. The N half-open connections in the queue would be
picked up by postmaster as existing back ends drop off, or time out
and give up if that happens too slowly.
I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no
use in accepting more than your total allowed connections concurrently.
That might not have the effect you imagine, where many short-lived
connections are being made. In some cases it would mean that clients
are rejected that could have been served after a very short delay.
Nathan Myers
ncm@zembu.com
On Fri, Jul 13, 2001 at 07:53:02AM -0400, mlw wrote:
Zeugswetter Andreas SB wrote:
I liked the idea of min(MaxBackends, PG_SOMAXCONN), since there is no use in
accepting more than your total allowed connections concurrently.

I have been following this thread and I am confused why the queue
argument to listen() has anything to do with Max backends. All the
parameter to listen does is specify how long a list of sockets open
and waiting for connection can be. It has nothing to do with the
number of back end sockets which are open.
Correct.
If you have a limit of 128 back end connections, and you have 127
of them open, a listen with queue size of 128 will still allow 128
sockets to wait for connection before turning others away.
Correct.
It should be a parameter based on the time out of a socket connection
vs the ability to answer connection requests within that period of
time.
It's not really meaningful at all, at present.
There are two ways to think about this. Either you make this parameter
tunable to give a proper estimate of the usability of the system, i.e.
tailor the listen queue parameter to reject sockets when some number
of sockets are waiting, or you say no one should ever be denied,
accept everyone and let them time out if we are not fast enough.

This debate could go on; why not make it a parameter in the config
file that defaults to some system variable, i.e. SOMAXCONN.
With postmaster's current behavior there is no benefit in setting
the listen() argument to anything less than 1000. With a small
change in postmaster behavior, a tunable system variable becomes
useful.
But using SOMAXCONN blindly is always wrong; that is often 5, which
is demonstrably too small.
BTW: on linux, the backlog queue parameter is silently truncated to
128 anyway.
The 128 limit is common, applied on BSD and Solaris as well.
It will probably increase in future releases.
Nathan Myers
ncm@zembu.com
On Thu, Jul 12, 2001 at 11:08:34PM +0200, Peter Eisentraut wrote:
Nathan Myers writes:
When the system is too heavily loaded (however measured), any further
login attempts will fail. What I suggested is, instead of the
postmaster accept()ing the connection, why not leave the connection
attempt in the queue until we can afford a back end to handle it?

Because the new connection might be a cancel request.
Supporting cancel requests seems like a poor reason to ignore what
load-shedding support operating systems provide.
To support cancel requests, it would suffice for PG to listen at
another socket dedicated to administrative requests. (It might
even ignore MaxBackends for connections on that socket.)
Nathan Myers
ncm@zembu.com
Nathan Myers wrote:
There are two ways to think about this. Either you make this parameter
tunable to give a proper estimate of the usability of the system, i.e.
tailor the listen queue parameter to reject sockets when some number
of sockets are waiting, or you say no one should ever be denied,
accept everyone and let them time out if we are not fast enough.

This debate could go on; why not make it a parameter in the config
file that defaults to some system variable, i.e. SOMAXCONN.

With postmaster's current behavior there is no benefit in setting
the listen() argument to anything less than 1000. With a small
change in postmaster behavior, a tunable system variable becomes
useful.

But using SOMAXCONN blindly is always wrong; that is often 5, which
is demonstrably too small.
It is rumored that many BSD versions are limited to 5.
BTW: on linux, the backlog queue parameter is silently truncated to
128 anyway.

The 128 limit is common, applied on BSD and Solaris as well.
It will probably increase in future releases.
The point I am trying to make is that the parameter passed to listen() is OS
dependent, on both what it means and its defaults. Trying to tie this to
maxbackends is not the right thought process. It has nothing to do, at all,
with maxbackends.
Passing listen(5) would probably be sufficient for Postgres. Will there ever be
5 sockets in the listen() queue prior to accept()? Probably not. SOMAXCONN
is a system limit; setting a listen() value greater than this is probably
silently adjusted down to the defined SOMAXCONN.
By making it a parameter, and defaulting to SOMAXCONN, this allows the maximum
number of connections a system can handle, while still allowing the DBA to fine
tune connection behavior on high load systems.
mlw <markw@mohawksoft.com> writes:
Nathan Myers wrote:
But using SOMAXCONN blindly is always wrong; that is often 5, which
is demonstrably too small.
It is rumored that many BSD version are limited to 5.
BSD systems tend to claim SOMAXCONN = 5 in the header files, but *not*
to have such a small limit in the kernel. The real step forward that
we have made in this discussion is to realize that we cannot trust
<sys/socket.h> to tell us what the kernel limit actually is.
Passing listen(5) would probably be sufficient for Postgres.
It demonstrably is not sufficient. Set it that way in pqcomm.c
and run the parallel regression tests. Watch them fail.
regards, tom lane
Tom Lane wrote:
mlw <markw@mohawksoft.com> writes:
Nathan Myers wrote:
But using SOMAXCONN blindly is always wrong; that is often 5, which
is demonstrably too small.

It is rumored that many BSD versions are limited to 5.
BSD systems tend to claim SOMAXCONN = 5 in the header files, but *not*
to have such a small limit in the kernel. The real step forward that
we have made in this discussion is to realize that we cannot trust
<sys/socket.h> to tell us what the kernel limit actually is.

Passing listen(5) would probably be sufficient for Postgres.
It demonstrably is not sufficient. Set it that way in pqcomm.c
and run the parallel regression tests. Watch them fail.
That's interesting, I would not have guessed that. I have written a number of
server applications which can handle, literally, over a thousand
connection/operations a second, using only a listen(5). (I do have it as a
configuration parameter, but have never seen a time when I have had to change
it.)
I figured the closest one could come to an expert in all things socket related
would have to be the Apache web server source. They have a different take on
the listen() parameter:
from httpd.h >>>>>>>>>>>
/* The maximum length of the queue of pending connections, as defined
 * by listen(2). Under some systems, it should be increased if you
 * are experiencing a heavy TCP SYN flood attack.
 *
 * It defaults to 511 instead of 512 because some systems store it
 * as an 8-bit datatype; 512 truncated to 8-bits is 0, while 511 is
 * 255 when truncated.
 */

#ifndef DEFAULT_LISTENBACKLOG
#define DEFAULT_LISTENBACKLOG 511
#endif
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I have not found any other location in which DEFAULT_LISTENBACKLOG is defined,
but it is a configuration parameter, and here is what the Apache docs claim:
http://httpd.apache.org/docs/mod/core.html >>>>>>>>>>>>
ListenBacklog directive
Syntax: ListenBacklog backlog
Default: ListenBacklog 511
Context: server config
Status: Core
Compatibility: ListenBacklog is only available in Apache versions after 1.2.0.
The maximum length of the queue of pending connections. Generally no tuning is
needed or desired, however on some systems it is desirable to increase this
when under a TCP SYN flood attack. See the backlog parameter to the listen(2)
system call.
This will often be limited to a smaller number by the operating system. This
varies from OS to OS. Also note that many OSes do not use exactly what is
specified as the backlog, but use a number based on (but normally larger than)
what is set.
<<<<<<<<<<<<<<<<<<<<<<<
Anyway, why not just do what Apache does: set it to some large default which,
even when truncated, is still pretty big, and allow the end user to change
this value in postgresql.conf.
mlw <markw@mohawksoft.com> writes:
Tom Lane wrote:
Passing listen(5) would probably be sufficient for Postgres.
It demonstrably is not sufficient. Set it that way in pqcomm.c
and run the parallel regression tests. Watch them fail.
That's interesting, I would not have guessed that. I have written a number of
server applications which can handle, litterally, over a thousand
connection/operations a second, which only has a listen(5).
The problem should be considerably reduced in latest sources, since
as of a week or three ago, the top postmaster process' outer loop is
basically just accept() and fork() --- client authentication is now
handled after the fork, instead of before. Still, we now know that
(a) SOMAXCONN is a lie on many systems, and (b) values as small as 5
are pushing our luck, even though it might not fail so easily anymore.
The state of affairs in current sources is that the listen queue
parameter is MIN(MaxBackends * 2, PG_SOMAXCONN), where PG_SOMAXCONN
is a constant defined in config.h --- it's 10000, hence a non-factor,
by default, but could be reduced if you have a kernel that doesn't
cope well with large listen-queue requests. We probably won't know
if there are any such systems until we get some field experience with
the new code, but we could have "configure" select a platform-dependent
value if we find such problems.
I believe that this is fine and doesn't need any further tweaking,
pending field experience. What's still open for discussion is Nathan's
thought that the postmaster ought to stop issuing accept() calls once
it has so many children that it will refuse to fork any more. I was
initially against that, but on further reflection I think it might be
a good idea after all, because of another recent change related to the
authenticate-after-fork change. Since the top postmaster doesn't really
know which children have become working backends and which are still
engaged in authentication dialogs, it cannot enforce the MaxBackends
limit directly. Instead, MaxBackends is checked when the child process
is done with authentication and is trying to join the PROC pool in
shared memory. The postmaster will spawn up to 2 * MaxBackends child
processes before refusing to spawn more --- this allows there to be
up to MaxBackends children engaged in auth dialog but not yet working
backends. (It's reasonable to allow extra children since some may fail
the auth dialog, or an extant backend may have quit by the time they
finish auth dialog. Whether 2*MaxBackends is the best choice is
debatable, but that's what we're using at the moment.)
Furthermore, we intend to install a pretty tight timeout on the overall
time spent in auth phase (a few seconds I imagine, although we haven't
yet discussed that number either).
Given this setup, if the postmaster has reached its max-children limit
then it can be quite certain that at least some of those children will
quit within approximately the auth timeout interval. Therefore, not
accept()ing is a state that will probably *not* persist for long enough
to cause the new clients to timeout. By not accept()ing at a time when
we wouldn't fork, we can convert the behavior clients see at peak load
from quick rejection into a short delay before authentication dialog.
Of course, if you are at MaxBackends working backends, then the new
client is still going to get a "too many clients" error; all we have
accomplished with the change is to expend a fork() and an authentication
cycle before issuing the error. So if the intent is to reduce overall
system load then this isn't necessarily an improvement.
IIRC, the rationale for using 2*MaxBackends as the maximum child count
was to make it unlikely that the postmaster would refuse to fork; given
a short auth timeout it's unlikely that as many as MaxBackends clients
will be engaged in auth dialog at any instant. So unless we tighten
that max child count considerably, holding off accept() at max child
count is unlikely to change the behavior under any but worst-case
scenarios anyway. And in a worst-case scenario, shedding load by
rejecting connections quickly is probably just what you want to do.
So, having thought that through, I'm still of the opinion that holding
off accept is of little or no benefit to us. But it's not as simple
as it looks at first glance. Anyone have a different take on what the
behavior is likely to be?
regards, tom lane
On Sat, Jul 14, 2001 at 11:38:51AM -0400, Tom Lane wrote:
The state of affairs in current sources is that the listen queue
parameter is MIN(MaxBackends * 2, PG_SOMAXCONN), where PG_SOMAXCONN
is a constant defined in config.h --- it's 10000, hence a non-factor,
by default, but could be reduced if you have a kernel that doesn't
cope well with large listen-queue requests. We probably won't know
if there are any such systems until we get some field experience with
the new code, but we could have "configure" select a platform-dependent
value if we find such problems.
Considering the Apache comment about some systems truncating instead
of limiting... 10000&0xff is 16. Maybe 10239 would be a better choice,
or 16383.
So, having thought that through, I'm still of the opinion that holding
off accept is of little or no benefit to us. But it's not as simple
as it looks at first glance. Anyone have a different take on what the
behavior is likely to be?
After doing some more reading, I find that most OSes do not reject
connect requests that would exceed the specified backlog; instead,
they ignore the connection request and assume the client will retry
later. Therefore, it appears we cannot use a small backlog to shed load
unless we assume that clients will time out quickly by themselves.
OTOH, maybe it's reasonable to assume that clients will time out,
and that in the normal case authentication happens quickly.
Then we can use a small listen() backlog, and never accept() if we
have more than MaxBackends back ends. The OS will keep a small queue
corresponding to our small backlog, and the clients will do our load
shedding for us.
Nathan Myers
ncm@zembu.com
ncm@zembu.com (Nathan Myers) writes:
Considering the Apache comment about some systems truncating instead
of limiting... 10000&0xff is 16. Maybe 10239 would be a better choice,
or 16383.
Hmm. If the Apache comment is real, then that would not help on those
systems. Remember that the actual listen request is going to be
2*MaxBackends in practically all cases. The only thing that would save
you from getting an unexpectedly small backlog parameter in such a case
is to set PG_SOMAXCONN to 255.
Perhaps we should just do that and not worry about whether the Apache
info is accurate or not. But I'd kind of like to see chapter and verse,
ie, at least one specific system that demonstrably fails to perform the
clamp-to-255 for itself, before we lobotomize the code that way. ISTM a
conformant implementation of listen() would limit the given value to 255
before storing it into an 8-bit field, not just lose high order bits.
After doing some more reading, I find that most OSes do not reject
connect requests that would exceed the specified backlog; instead,
they ignore the connection request and assume the client will retry
later. Therefore, it appears we cannot use a small backlog to shed load
unless we assume that clients will time out quickly by themselves.
Hm. newgate is a machine on my local net that's not currently up.
$ time psql -h newgate postgres
psql: could not connect to server: Connection timed out
Is the server running on host newgate and accepting
TCP/IP connections on port 5432?
real 1m13.33s
user 0m0.02s
sys 0m0.01s
$
That's on HPUX 10.20. On an old Linux distro, the same timeout
seems to be about 21 seconds, which is still pretty long by some
standards. Do the TCP specs recommend anything particular about
no-response-to-SYN timeouts?
regards, tom lane