2nd update on TOAST

Started by Nonameover 25 years ago36 messages
#1Noname
JanWieck@t-online.de
1 attachment(s)

Uh oh -

on one side I'm a happy camper. pg_dump already ignored pg_*
tables and since the TOAST tables are named pg_toast_...,
nothing to be done. I loaded my test DB with TOAST entries,
dumped and restored it. Anything is there, works perfectly.

But then I added the ALTER TABLE for unlimited rewrite rule
size to initdb and the problems started. I can create a table
with 500+ attributes. Also I can create a view on it (the
rules size is 170K - whow). Anything works pretty well, just
a pg_dump output is garbage.

Seems the dynamic string buffers used in pg_dump aren't as
bullet proof as they should. I'm still busy with other
things, so can someone please take a look on it? Attached is
an SQL script that creates the table and the view that I
cannot dump.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

Attachments:

megaview.sqltext/plain; charset=ISO-8859-1Download
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#1)
Re: 2nd update on TOAST

JanWieck@t-online.de (Jan Wieck) writes:

size to initdb and the problems started. I can create a table
with 500+ attributes. Also I can create a view on it (the
rules size is 170K - whow). Anything works pretty well, just
a pg_dump output is garbage.

Are you using current sources? Bruce committed Philip Warner's
pg_dump rewrite a day or so ago (which I thought was way premature,
but anyway...). Just want to know which pg_dump we're talking about.

regards, tom lane

#3Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#1)
Re: 2nd update on TOAST

At 22:01 5/07/00 +0200, Jan Wieck wrote:

Seems the dynamic string buffers used in pg_dump aren't as
bullet proof as they should. I'm still busy with other
things, so can someone please take a look on it? Attached is
an SQL script that creates the table and the view that I
cannot dump.

If the TOAST patch on the FTP site the most recent?

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#4Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#2)
Re: 2nd update on TOAST

At 17:18 5/07/00 -0400, Tom Lane wrote:

JanWieck@t-online.de (Jan Wieck) writes:

size to initdb and the problems started. I can create a table
with 500+ attributes. Also I can create a view on it (the
rules size is 170K - whow). Anything works pretty well, just
a pg_dump output is garbage.

Are you using current sources? Bruce committed Philip Warner's
pg_dump rewrite a day or so ago (which I thought was way premature,
but anyway...). Just want to know which pg_dump we're talking about.

It's good that any bugs had a chnace to come out ASAP.

However, I'd be interested to know what actually happens: I do recall that
pg_dump (both versions) have seomething like '#define COPY_BUFFER_SIZE
8192'; if I were to start looking, that's where I'd go first.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#5Philip Warner
pjw@rhyme.com.au
In reply to: Philip Warner (#4)
Re: 2nd update on TOAST

At 12:50 6/07/00 +1000, Philip Warner wrote:

Are you using current sources? Bruce committed Philip Warner's
pg_dump rewrite a day or so ago (which I thought was way premature,
but anyway...). Just want to know which pg_dump we're talking about.

It's good that any bugs had a chnace to come out ASAP.

However, I'd be interested to know what actually happens: I do recall that
pg_dump (both versions) have seomething like '#define COPY_BUFFER_SIZE
8192'; if I were to start looking, that's where I'd go first.

Further to this, looking at the code, it now uses 'archputs', which
replaces 'fputs', to output the copy buffer (assuming the error is in the
copy, not while dumping the definitions); if PQgetline fills the entire
buffer (no trailing \0), then I could imagine both would croak. The
simplest test would be to tell PQgetline that the buffer size if 1 byte
smaller, and see if it fixes the problem.

I'd be interested to hear from someone...at least to know a little more of
the circumstaces of the error.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#6Philip Warner
pjw@rhyme.com.au
In reply to: Philip Warner (#5)
Re: 2nd update on TOAST

At 13:15 6/07/00 +1000, Philip Warner wrote:

At 12:50 6/07/00 +1000, Philip Warner wrote:

Are you using current sources? Bruce committed Philip Warner's
pg_dump rewrite a day or so ago (which I thought was way premature,
but anyway...). Just want to know which pg_dump we're talking about.

OK, I've built from the latest CVS, run initdb etc, created a new DB,
created the megatable & megaview, and done a pg_dump. It works, except that
I introduced a but that caused the view to be dumped as a table as well. I
will keep looking, and submit a patch shortly. In the mean time, if anyone
can reproduce Jan's problem, that would help...

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#7Noname
JanWieck@t-online.de
In reply to: Philip Warner (#3)
Re: 2nd update on TOAST

Philip Warner wrote:

At 22:01 5/07/00 +0200, Jan Wieck wrote:

Seems the dynamic string buffers used in pg_dump aren't as
bullet proof as they should. I'm still busy with other
things, so can someone please take a look on it? Attached is
an SQL script that creates the table and the view that I
cannot dump.

If the TOAST patch on the FTP site the most recent?

No. It's all in the current CVS tree. I removed that patch
already.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#8Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#7)
Re: 2nd update on TOAST

At 11:09 6/07/00 +0200, Jan Wieck wrote:

No. It's all in the current CVS tree. I removed that patch
already.

OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
'make distclean' then a 'make' & 'make install' again, and now postmaster
wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
process dies so I cant get a backtrace.

I have now rebuilt with SYSLOG support, and get the following:

Jul 6 20:54:54 Cerberus2 kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000038
Jul 6 20:54:54 Cerberus2 kernel: current->tss.cr3 = 02467000, %cr3 = 02467000
Jul 6 20:54:54 Cerberus2 kernel: *pde = 00000000
Jul 6 20:54:54 Cerberus2 kernel: Oops: 0000
Jul 6 20:54:54 Cerberus2 kernel: CPU: 0
Jul 6 20:54:54 Cerberus2 kernel: EIP: 0010:[fcntl_setlk+327/404]
Jul 6 20:54:54 Cerberus2 kernel: EFLAGS: 00000202
Jul 6 20:54:54 Cerberus2 kernel: eax: 00000000 ebx: c15485b0 ecx:
c2494000 edx: c0d49a50
Jul 6 20:54:54 Cerberus2 kernel: esi: bffff574 edi: 00000004 ebp:
fffffff7 esp: c2495f34
Jul 6 20:54:54 Cerberus2 kernel: ds: 0018 es: 0018 ss: 0018
Jul 6 20:54:54 Cerberus2 kernel: Process postmaster (pid: 5661, process
nr: 50, stackpage=c2495000)
Jul 6 20:54:54 Cerberus2 kernel: Stack: 00000000 c15485b0 c2495f40
00000001 00000000 00000000 4000bc74 00000000
Jul 6 20:54:54 Cerberus2 kernel: 00000000 00000000 00000000
00000000 c0d49a50 0000161d 00000000 c15485b0
Jul 6 20:54:54 Cerberus2 kernel: 00000101 00000000 7fffffff
00000000 00000000 00000000 c012c687 00000004
Jul 6 20:54:54 Cerberus2 kernel: Call Trace: [sys_fcntl+595/772]
[sys_open+94/124] [system_call+52/56]
Jul 6 20:54:54 Cerberus2 kernel: Code: 8b 50 38 85 d2 74 15 8d 44 24 24 50
ff 74 24 6c 53 ff d2 89

If anyone can give me some tips on tracking this down, I would appreciate
it....

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#9Noname
JanWieck@t-online.de
In reply to: Philip Warner (#8)
Re: 2nd update on TOAST

Philip Warner wrote:

At 11:09 6/07/00 +0200, Jan Wieck wrote:

No. It's all in the current CVS tree. I removed that patch
already.

OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
'make distclean' then a 'make' & 'make install' again, and now postmaster
wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
process dies so I cant get a backtrace.

Have the same symptom with a completely fresh cvs checkout.

If anyone can give me some tips on tracking this down, I would appreciate
it....

Bruce applied a patch to configure.in yesterday. Read the
comments from the cvslog. It tells that it triggers a bug in
the Linux kernels fcntl(SETLK) code when used with unix
domain sockets, and that the bug is present in Linux kernels
<= 2.2.16. I'm running a 2.2.12 here, and so it exactly dies
in pqcomm.c line 229 on fcntl() against the socket.

Undefining HAVE_FCNTL_SETLK in config.h did it for me
temporary. Don't know how to deal with it finally.

With this setup I did

initdb
createdb
psql <megaview.sql
pg_dump pgsql >megaview.dump

In the dump file, the first 2183 bytes look OK. What's
following then looks like internal tables where pg_dump holds
the info of the schema analyzing.

And don't worry that the view is dumped as table with a later
CREATE RULE. That's correct this way.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#10Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#9)
Re: 2nd update on TOAST

At 14:04 6/07/00 +0200, Jan Wieck wrote:

Philip Warner wrote:

In the dump file, the first 2183 bytes look OK. What's
following then looks like internal tables where pg_dump holds
the info of the schema analyzing.

Any chance you could mail it direct to me?

And don't worry that the view is dumped as table with a later
CREATE RULE. That's correct this way.

I figured this out the hard way!

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#11Philip Warner
pjw@rhyme.com.au
In reply to: Noname (#9)
Re: 2nd update on TOAST

At 14:04 6/07/00 +0200, Jan Wieck wrote:

With this setup I did

initdb
createdb
psql <megaview.sql
pg_dump pgsql >megaview.dump

In the dump file, the first 2183 bytes look OK. What's
following then looks like internal tables where pg_dump holds
the info of the schema analyzing.

Just in case there is some other factor, can you let me know what your
choices in 'configure' were?

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: (+61) 0500 83 82 81 | _________ \
Fax: (+61) 0500 83 82 82 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/

#12Noname
JanWieck@t-online.de
In reply to: Philip Warner (#10)
1 attachment(s)
Re: 2nd update on TOAST

Philip Warner wrote:

At 14:04 6/07/00 +0200, Jan Wieck wrote:

Philip Warner wrote:

In the dump file, the first 2183 bytes look OK. What's
following then looks like internal tables where pg_dump holds
the info of the schema analyzing.

Any chance you could mail it direct to me?

Attached.

And don't worry that the view is dumped as table with a later
CREATE RULE. That's correct this way.

I figured this out the hard way!

:-)

Will be off after this until approx. 1:00 UCT.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

Attachments:

x2.dumptext/plainDownload
#13Noname
JanWieck@t-online.de
In reply to: Philip Warner (#11)
Re: 2nd update on TOAST

Philip Warner wrote:

At 14:04 6/07/00 +0200, Jan Wieck wrote:

With this setup I did

initdb
createdb
psql <megaview.sql
pg_dump pgsql >megaview.dump

In the dump file, the first 2183 bytes look OK. What's
following then looks like internal tables where pg_dump holds
the info of the schema analyzing.

Just in case there is some other factor, can you let me know what your
choices in 'configure' were?

--with-tcl
--enable-cassert
--enable-debug

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#14Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#9)
Re: 2nd update on TOAST

Bruce applied a patch to configure.in yesterday. Read the
comments from the cvslog. It tells that it triggers a bug in
the Linux kernels fcntl(SETLK) code when used with unix
domain sockets, and that the bug is present in Linux kernels
<= 2.2.16. I'm running a 2.2.12 here, and so it exactly dies
in pqcomm.c line 229 on fcntl() against the socket.
Undefining HAVE_FCNTL_SETLK in config.h did it for me
temporary. Don't know how to deal with it finally.

Thanks Jan for the workaround. I'll see if this gets me up and going
again.

- Thomas

#15Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: 2nd update on TOAST

JanWieck@t-online.de (Jan Wieck) writes:

size to initdb and the problems started. I can create a table
with 500+ attributes. Also I can create a view on it (the
rules size is 170K - whow). Anything works pretty well, just
a pg_dump output is garbage.

Are you using current sources? Bruce committed Philip Warner's
pg_dump rewrite a day or so ago (which I thought was way premature,
but anyway...). Just want to know which pg_dump we're talking about.

I committed pg_dump so people could see his changes and start making
additions. I can then have him supply incremental patches. Keeping
stuff out of the tree usually causes the author to get frustrated.

Of course, if it comes in too quickly, it can cause chaos if he needs to
make major changes.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#16Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#9)
Re: 2nd update on TOAST

OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
'make distclean' then a 'make' & 'make install' again, and now postmaster
wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
process dies so I cant get a backtrace.

Have the same symptom with a completely fresh cvs checkout.

If anyone can give me some tips on tracking this down, I would appreciate
it....

Bruce applied a patch to configure.in yesterday. Read the
comments from the cvslog. It tells that it triggers a bug in
the Linux kernels fcntl(SETLK) code when used with unix
domain sockets, and that the bug is present in Linux kernels
<= 2.2.16. I'm running a 2.2.12 here, and so it exactly dies
in pqcomm.c line 229 on fcntl() against the socket.

I thought when he said flock() bug, he meant only on the new IA64
platform, not on all Linux platforms. Yikes, I enable flock(), and it
breaks initdb for all the Linux users. This is a problem!

Tom was mentioning the configure check for flock() was broken recently,
so I was glad to fix it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#17Noname
JanWieck@t-online.de
In reply to: Bruce Momjian (#16)
Re: 2nd update on TOAST

Bruce Momjian wrote:

OK, I've updated from CVS and rebuilt & it worked. The, to be sure, I did a
'make distclean' then a 'make' & 'make install' again, and now postmaster
wont start (SIGSEGV). I have rebuild with '-O0 -g', gone into gdb, but the
process dies so I cant get a backtrace.

Have the same symptom with a completely fresh cvs checkout.

If anyone can give me some tips on tracking this down, I would appreciate
it....

Bruce applied a patch to configure.in yesterday. Read the
comments from the cvslog. It tells that it triggers a bug in
the Linux kernels fcntl(SETLK) code when used with unix
domain sockets, and that the bug is present in Linux kernels
<= 2.2.16. I'm running a 2.2.12 here, and so it exactly dies
in pqcomm.c line 229 on fcntl() against the socket.

I thought when he said flock() bug, he meant only on the new IA64
platform, not on all Linux platforms. Yikes, I enable flock(), and it
breaks initdb for all the Linux users. This is a problem!

Not initdb, but postmaster. That's the one who tries (after a
successful initdb) to do the fcntl(F_SETLK) on the unix
domain socket. Causing the kernel saying "go to hell, go
directly, don't write a core, don't leave useful info in
gdb".

The only reason I see for the entire section is to detect if
it would be safe to unlink the socket because it's left by
another postmaster in case of abnormal termination. Tell me
if I've misread it. So why not doing it on the Linux
platform different, using a separate file like
.s.PGSQL.5432.LCK?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#18Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#17)
Re: 2nd update on TOAST

Not initdb, but postmaster. That's the one who tries (after a
successful initdb) to do the fcntl(F_SETLK) on the unix
domain socket. Causing the kernel saying "go to hell, go
directly, don't write a core, don't leave useful info in
gdb".

The only reason I see for the entire section is to detect if
it would be safe to unlink the socket because it's left by
another postmaster in case of abnormal termination. Tell me
if I've misread it. So why not doing it on the Linux
platform different, using a separate file like
.s.PGSQL.5432.LCK?

But how do you know if that file still belongs to an active postmaster?
What if it exited before removing the file. Seems we would have to
write the PID into the file, and do a kill() to see if it is running.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
fcntl(SETLK) [was Re: 2nd update on TOAST]

The only reason I see for the entire section is to detect if
it would be safe to unlink the socket because it's left by
another postmaster in case of abnormal termination. Tell me
if I've misread it.

That's exactly what it's for. We need to tell whether there is
still another postmaster running on the same port number. Too
bad the kernel is not bright enough to unlink the socket file
automatically when it's no longer in use...

So why not doing it on the Linux
platform different, using a separate file like
.s.PGSQL.5432.LCK?

I think it's a bad idea to do it differently on Linux than other
platforms. If we fix this (other than by just disabling the fcntl
call again on old Linuxen) we should use the new method everywhere.

But how do you know if that file still belongs to an active postmaster?
What if it exited before removing the file. Seems we would have to
write the PID into the file, and do a kill() to see if it is running.

Well, if we wanted to continue to depend on fcntl(SETLK) then we could
use an empty plain file. I read the bug report as being that old Linux
kernels fail if fcntl(SETLK) is applied to a Unix-socket file. They'd
surely have noticed long before if the feature didn't work on plain
files.

But if we are going to change this at all, I'd vote for storing pids
in the lock files the way we are now doing in the data-directory pid
lock files. Then we wouldn't have to depend on fcntl at all, which
would be a Good Thing from a portability point of view.

However, I think it would be a really bad idea to keep the lock files
in /tmp --- that's way too open to accidental removals, not to mention
deliberate denial-of-service attacks. They need to be in a more secure
directory; but where? See the past discussions summarized in the
TODO.detail file.

regards, tom lane

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#19)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Tom Lane writes:

However, I think it would be a really bad idea to keep the lock files
in /tmp --- that's way too open to accidental removals, not to mention
deliberate denial-of-service attacks. They need to be in a more secure
directory; but where? See the past discussions summarized in the
TODO.detail file.

Quoth the file system standard:

`sharedstatedir'
The directory for installing architecture-independent data files
which the programs modify while they run. This should normally be
`/usr/local/com', but write it as `$(prefix)/com'. (If you are
using Autoconf, write it as `@sharedstatedir@'.)

The problem with this approach is making that directory writeable by the
server account. Solutions:

1) Making the postmaster executable as root but later drop root
privileges. (This looks to be the cleanest solution, but it is
probably a security problem waiting to happen.)

2) Making initdb executable as root but with some --user switch. Have it
create a subdirectory of $sharedstatedir writable by the server
account, possibly with sticky bit and whatnot. Use `su' to invoke
`postgres'.

This approach might be convenient also in terms of creating the data
directory.

3) Making "initialize lock file area" a separate initialization step,
possibly encapsulated into a shell script.

Btw., what would happen if we did start a second postmaster at the same
TCP port? Or more interestingly, what happens if some completely different
program already runs at that port? How do we protect against that? This
has something to do with SO_REUSEADDR, but I don't understand those things
too well.

--
Peter Eisentraut Sernanders v�g 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#20)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Peter Eisentraut <peter_e@gmx.net> writes:

Quoth the file system standard:

`sharedstatedir'
The directory for installing architecture-independent data files
which the programs modify while they run. This should normally be
`/usr/local/com', but write it as `$(prefix)/com'. (If you are
using Autoconf, write it as `@sharedstatedir@'.)

The problem with this approach is making that directory writeable by the
server account.

The lock directory should certainly be one used only for Postgres locks,
owned by postgres user and writable only by postgres user.

2) Making initdb executable as root but with some --user switch. Have it
create a subdirectory of $sharedstatedir writable by the server
account, possibly with sticky bit and whatnot. Use `su' to invoke
`postgres'.

This approach might be convenient also in terms of creating the data
directory.

We could do that, or we could just say "you must have arranged for
creation of these directories before you run initdb". For the truly
lazy, a small script that could be executed as root could be provided.

Personally I'd be unwilling to run a script as complex as initdb as
root; what if it goes wrong? Keep the stuff that requires root
permission separate, and as small as possible.

BTW, regardless of where exactly the lock directory lives (and IIRC
there were several schools of thought on that), I believe that the
lock directory pathname has to be wired in at configure time. It
can't be an initdb argument because the whole locking thing is useless
unless all the PG installations on a machine agree on where the port
locks are.

Btw., what would happen if we did start a second postmaster at the same
TCP port? Or more interestingly, what happens if some completely different
program already runs at that port? How do we protect against that? This
has something to do with SO_REUSEADDR, but I don't understand those things
too well.

SO_REUSEADDR solves the problem for TCP sockets. The problem with Unix
sockets is that the kernel's detection of conflicts is pretty braindead:
if there is an existing socket file of the same name, you get an
"address in use" failure from bind(), regardless of whether anyone else
is actually using the socket. So, if the previous postmaster died
ungracefully and didn't delete its socket file, a new postmaster cannot
be started up until the old socket file is removed. What we're trying
to do here is automate that removal so the admin doesn't have to do it.
The trouble is we can't just unlink() the old socket file because
that'll succeed even if there is a postmaster actively using the socket!
So we need to find out whether the old postmaster is still alive
to decide whether it's OK to remove the old socket file or whether we
should abort startup.

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster! In other words, pretend to
be a client and see if your connection attempt is answered. (You don't
have to try to log in, just see if you get a connection.) This might
also answer Peter's concern about socket files that belong to
non-Postgres programs, although I doubt that's really a big issue.

There are some potential pitfalls here, like what if the old postmaster
is there but overloaded? But on the whole it seems like it might be
a cleaner answer than fooling around with lockfiles, and certainly safer
than relying on fcntl(SETLK) to work on a socket file. Comments anyone?

regards, tom lane

#22Noname
JanWieck@t-online.de
In reply to: Tom Lane (#21)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Tom Lane wrote:

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster! In other words, pretend to
be a client and see if your connection attempt is answered. (You don't
have to try to log in, just see if you get a connection.) This might
also answer Peter's concern about socket files that belong to
non-Postgres programs, although I doubt that's really a big issue.

There are some potential pitfalls here, like what if the old postmaster
is there but overloaded? But on the whole it seems like it might be
a cleaner answer than fooling around with lockfiles, and certainly safer
than relying on fcntl(SETLK) to work on a socket file. Comments anyone?

Like it.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#23Alfred Perlstein
bright@wintelcom.net
In reply to: Noname (#22)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

* Jan Wieck <JanWieck@t-online.de> [000708 05:47] wrote:

Tom Lane wrote:

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster! In other words, pretend to
be a client and see if your connection attempt is answered. (You don't
have to try to log in, just see if you get a connection.) This might
also answer Peter's concern about socket files that belong to
non-Postgres programs, although I doubt that's really a big issue.

There are some potential pitfalls here, like what if the old postmaster
is there but overloaded? But on the whole it seems like it might be
a cleaner answer than fooling around with lockfiles, and certainly safer
than relying on fcntl(SETLK) to work on a socket file. Comments anyone?

Like it.

my $pgsocket = "/tmp/.s.PGSQL.5432";

# try to connect to the postmaster
socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
or die "unable to create unix domain socket: $!";

connect(SOCK, sockaddr_un($pgsocket))
and errexit("postmaster is running you must shut it down");

oh yeah... :)

-Alfred

#24Mike Mascari
mascarm@mascari.com
In reply to: Tom Lane (#21)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Alfred Perlstein wrote:

* Jan Wieck <JanWieck@t-online.de> [000708 05:47] wrote:

Tom Lane wrote:

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster! In other words, pretend to
be a client and see if your connection attempt is answered. (You don't
have to try to log in, just see if you get a connection.) This might
also answer Peter's concern about socket files that belong to
non-Postgres programs, although I doubt that's really a big issue.

There are some potential pitfalls here, like what if the old postmaster
is there but overloaded? But on the whole it seems like it might be
a cleaner answer than fooling around with lockfiles, and certainly safer
than relying on fcntl(SETLK) to work on a socket file. Comments anyone?

Like it.

my $pgsocket = "/tmp/.s.PGSQL.5432";

# try to connect to the postmaster
socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
or die "unable to create unix domain socket: $!";

connect(SOCK, sockaddr_un($pgsocket))
and errexit("postmaster is running you must shut it down");

oh yeah... :)

-Alfred

I don't get this. Isn't there a race condition here?

Just curious,

Mike Mascari

#25Alfred Perlstein
bright@wintelcom.net
In reply to: Mike Mascari (#24)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

* Mike Mascari <mascarm@mascari.com> [000708 05:55] wrote:

Alfred Perlstein wrote:

* Jan Wieck <JanWieck@t-online.de> [000708 05:47] wrote:

Tom Lane wrote:

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster! In other words, pretend to
be a client and see if your connection attempt is answered. (You don't
have to try to log in, just see if you get a connection.) This might
also answer Peter's concern about socket files that belong to
non-Postgres programs, although I doubt that's really a big issue.

There are some potential pitfalls here, like what if the old postmaster
is there but overloaded? But on the whole it seems like it might be
a cleaner answer than fooling around with lockfiles, and certainly safer
than relying on fcntl(SETLK) to work on a socket file. Comments anyone?

Like it.

my $pgsocket = "/tmp/.s.PGSQL.5432";

# try to connect to the postmaster
socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
or die "unable to create unix domain socket: $!";

connect(SOCK, sockaddr_un($pgsocket))
and errexit("postmaster is running you must shut it down");

oh yeah... :)

-Alfred

I don't get this. Isn't there a race condition here?

Just curious,

Sure but it's handled, if there's a postmaster starting at this
exact instant, however since the script just runs postmaster
afterwards the conflict will make postmaster abort and I'll get an
error return from my invocation of postmaster.

-Alfred

#26Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mike Mascari (#24)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

my $pgsocket = "/tmp/.s.PGSQL.5432";

# try to connect to the postmaster
socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
or die "unable to create unix domain socket: $!";

connect(SOCK, sockaddr_un($pgsocket))
and errexit("postmaster is running you must shut it down");

oh yeah... :)

-Alfred

I don't get this. Isn't there a race condition here?

That's a good point. I don't think so because the socket will only
create for one user. Basically, we don't need something bulletproof
here. We just need something to prevent admins from accidentally
starting two postmasters on the same port.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#27Alfred Perlstein
bright@wintelcom.net
In reply to: Bruce Momjian (#26)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

* Bruce Momjian <pgman@candle.pha.pa.us> [000708 06:02] wrote:

my $pgsocket = "/tmp/.s.PGSQL.5432";

# try to connect to the postmaster
socket(SOCK, PF_UNIX, SOCK_STREAM, 0)
or die "unable to create unix domain socket: $!";

connect(SOCK, sockaddr_un($pgsocket))
and errexit("postmaster is running you must shut it down");

oh yeah... :)

-Alfred

I don't get this. Isn't there a race condition here?

That's a good point. I don't think so because the socket will only
create for one user. Basically, we don't need something bulletproof
here. We just need something to prevent admins from accidentally
starting two postmasters on the same port.

Actually I just remebered the issue here, if one wants to start
postmaster on an alternate port there will be no conflict and
all hell may break loose.

-Alfred

#28Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Alfred Perlstein (#27)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

That's a good point. I don't think so because the socket will only
create for one user. Basically, we don't need something bulletproof
here. We just need something to prevent admins from accidentally
starting two postmasters on the same port.

Actually I just remebered the issue here, if one wants to start
postmaster on an alternate port there will be no conflict and
all hell may break loose.

We already lock the /data directory. This is for the port lock.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#29Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#19)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

But how do you know if that file still belongs to an active postmaster?
What if it exited before removing the file. Seems we would have to
write the PID into the file, and do a kill() to see if it is running.

I believe we already do this (SetPidFile() in
utils/init/miscinit.c). Isn't it sufficient (1) to prevent starting a
new postmaster on the same data dir and (2) to unlink the accidently
left socket file?
--
Tatsuo Ishii

#30Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#29)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

But how do you know if that file still belongs to an active postmaster?
What if it exited before removing the file. Seems we would have to
write the PID into the file, and do a kill() to see if it is running.

I believe we already do this (SetPidFile() in
utils/init/miscinit.c). Isn't it sufficient (1) to prevent starting a
new postmaster on the same data dir and (2) to unlink the accidently
left socket file?

I noticed what I was missing after sending the mail. Sorry for the
confusion. Seems the idea trying to connect a postmaster looks good.
--
Tatsuo Ishii

#31Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#21)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Tom Lane writes:

Bruce and I were just talking by phone about this, and we realized that
there is a completely different approach to making that decision: if you
want to know whether there's an old postmaster connected to a socket
file, try to connect to the old postmaster!

It seems that that would completely reverse the assumption of risk.
Currently, the postmaster may fail to start because there's a stale socket
file lying around, out of respect to a running colleague. With this idea
it would be the running postmaster's job to "defend" his socket against
newly starting colleagues. That doesn't seem fair.

What are our problems?

There's a possible DoS attack when someone else comes first and creates a
file /tmp/.s.PGSQL.5432. But detecting whether there's another program
running on that socket (if it's a socket) isn't going to help because you
most likely won't be able to delete it anyway. The solution to this is to
make the path of the socket file configurable more easily so that the
administrator has the choice of putting it a safer place that he prepared
appropriately.

A complementary solution is of course to add an option to run without Unix
socket, since we don't rely on the socket file for data directory locking
anymore. In fact, does anybody mind if I add such an option? We can have

tcpip_socket = yes|no
unix_socket = yes|no

(Security-conscious users may choose to turn off both. :-))

The other problem is a socket file left behind by a crashed postmaster. I
don't consider this such a big problem; a crashed postmaster is not the
normal mode of operation. The friendly message we have right now seems
alright to me. And it's a way of tell that the postmaster crashed at all.

One idea to get the pid in there somewhere is creating a socket file
"/tmp/.s.PGSQL.port.pid" and making /tmp/.s.PGSQL.port a symlink to it.
Then clients don't know the difference, but the server knows the pid and
can take appropriate action. Or make the symlink the other way around, not
sure.

--
Peter Eisentraut Sernanders v�g 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#32Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Peter Eisentraut (#31)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

The other problem is a socket file left behind by a crashed postmaster. I
don't consider this such a big problem; a crashed postmaster is not the
normal mode of operation. The friendly message we have right now seems
alright to me. And it's a way of tell that the postmaster crashed at all.

One idea to get the pid in there somewhere is creating a socket file
"/tmp/.s.PGSQL.port.pid" and making /tmp/.s.PGSQL.port a symlink to it.
Then clients don't know the difference, but the server knows the pid and
can take appropriate action. Or make the symlink the other way around, not
sure.

The symlink is an interesting idea. lstat() on the normal name can give
the file name with pid.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#33Chris Bitmead
chris@bitmead.com
In reply to: Peter Eisentraut (#31)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

There's a possible DoS attack when someone else comes first and creates a
file /tmp/.s.PGSQL.5432. But detecting whether there's another program
running on that socket (if it's a socket) isn't going to help because you
most likely won't be able to delete it anyway. The solution to this is to
make the path of the socket file configurable more easily so that the
administrator has the choice of putting it a safer place that he prepared
appropriately.

If you are worried about DoS, I think the only solution is to figure out
a way to be using one of the reserved <1000 ports. I don't think there's
any way around that is there? Also presumably not using a reserved port
is a security risk. Not that I'm worried.

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mike Mascari (#24)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Mike Mascari <mascarm@mascari.com> writes:

I don't get this. Isn't there a race condition here?

Strictly speaking, there is, but the race window is only a couple
of kernel calls wide, and as Bruce pointed out we do not need something
that is absolutely gold-plated bulletproof. We are just trying to
prevent dbadmins from accidentally starting two postmasters on the
same port number.

The way this would work is that pqcomm.c would do something like

if (socketFileAlreadyExists) {
try to open connection to existing postmaster;
if (successful) {
report port conflict and die;
}
delete existing socket file;
}
bind(socket); // kernel creates new socket file here
listen();

The race condition here is that if newly-started postmaster A has
executed bind() but not yet listen(), then newly-started postmaster B
could come along, observe the existing socket file, try to open
connection, fail, delete socket file, proceed. AFAIK B will be allowed
to bind() and create a new socket file, and A ends up listening to a
port that's lost in hyperspace --- no one else can ever connect to it
because it has no visible representative in the filesystem.

But as soon as A has executed listen() it's safe --- even though it's
not really ready to accept connections yet, the attempted connect from
B will wait till it does. (We should, therefore, use a plain vanilla
connect attempt for the probe --- no non-blocking connect or anything
fancy.)

The bind-to-listen delay in pqcomm.c is currently several lines long,
but there's no reason they couldn't be successive kernel calls with
nothing but a test for bind() failure between.

That strikes me as plenty close enough...

regards, tom lane

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#31)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

Peter Eisentraut <peter_e@gmx.net> writes:

It seems that that would completely reverse the assumption of risk.
Currently, the postmaster may fail to start because there's a stale socket
file lying around, out of respect to a running colleague. With this idea
it would be the running postmaster's job to "defend" his socket against
newly starting colleagues. That doesn't seem fair.

True, it would reverse the most probable failure mode, but I'm not sure
that's a bad thing.

The other problem is a socket file left behind by a crashed postmaster. I
don't consider this such a big problem; a crashed postmaster is not the
normal mode of operation. The friendly message we have right now seems
alright to me. And it's a way of tell that the postmaster crashed at all.

No, actually this is a *big* problem. That friendly message is no help
to a system boot script that can't read it (the same point you've made
repeatedly w.r.t configure issues; surprised you don't see it here).

If I do a fast shutdown of my Unix system (the kind where shutdown does
a 'kill -9' on all user processes --- on HPUX systems this is invoked by
hitting the power switch or by the power supply overtemperature sensor)
then the postmaster doesn't get a chance to clean out its socket file.
After reboot, the postmaster fails to start up until I manually
intervene by removing the socket file. That's not robust and not
acceptable.

The way I currently get around this (and I believe it's a pretty popular
thing to do) is that my postmaster-start script unconditionally deletes
the socket file before launching the postmaster. That's actually far
riskier than what we are discussing, because there is *no* safety check
for an already-started postmaster. A connection check would be a big
improvement.

I consider failure-to-start during normal system bootup to be a far
graver risk than the possibility that a second postmaster will usurp
a first postmaster's Unix socket --- especially since the latter could
only happen if the first postmaster isn't answering connections, in
which case allowing it to keep the socket is of dubious value anyhow.
So reversing the presumption of innocence seems like a good idea to me.

... The solution to this is to make the path of the socket file
configurable more easily so that the administrator has the choice of
putting it a safer place that he prepared appropriately.

We talked about that in the original discussion (you might want to
review the flock pghackers thread from late August '98). The trouble is
that the socket file path is a critical part of the client-to-postmaster
protocol: change the path, and existing clients don't know where to
connect. Oops. So even though /tmp is obviously a pretty bogus place
to keep the socket, the compatibility headaches of moving it are so
great that no one really wants to bite the bullet.

We talked about compromises like keeping the real socket in some safer
directory, with a symlink from /tmp for old clients, and I think that's
what will happen eventually. But please note that if the socket file
path is "easily configurable" then the same problem comes right back
to bite you again. It's *not* "easy" to change your mind about where
the socket files live; on any given platform that decision had better be
graven on stone tablets, because you want all your clients of whatever
vintage to be able to find your postmaster(s). I'm inclined to think
that a configure option might be counterproductive --- nailing it down
in the per-OS template file seems much less likely to get screwed up.

The major problem with a hard-wired socket path that's not /tmp is
that you can't install the socket directory if you're not root, so the
ability to fire up a postmaster with no root privs whatever would no
longer exist. We could get around that if it were possible to run with
only TCP connection support, making Unix-domain connections an option
instead of the base requirement.

A complementary solution is of course to add an option to run without Unix
socket, since we don't rely on the socket file for data directory locking
anymore. In fact, does anybody mind if I add such an option? We can have
tcpip_socket = yes|no
unix_socket = yes|no

Yup, it would make a lot of sense to have an option for no Unix socket
connections (we already have that as an #ifdef for a couple of platforms
with no Unix socket support, but not as a postmaster start-time choice).

(Security-conscious users may choose to turn off both. :-))

Uh, not at the moment, because we use the port interlock(s) as a proxy
for a shared-memory interlock. Really there are three resources that
we must prevent concurrent postmasters from sharing:
* data directory;
* listen port number;
* shared-memory blocks (and semaphore sets).
We have a good solution in place now for locking the data directory, but
the port interlock still needs work. Currently we use the port number
to assign shmem/sema keys, and there is no separate interlock to guard
against shmem conflicts. I believe we had a discussion a few months ago
about rejiggering the shmem key assignment method so that shmem
conflicts would be detected and dealt with cleanly --- might be a good
idea to make that happen before we go too far with port interlock
changes.

regards, tom lane

#36Alfred Perlstein
bright@wintelcom.net
In reply to: Bruce Momjian (#28)
Re: fcntl(SETLK) [was Re: 2nd update on TOAST]

* Bruce Momjian <pgman@candle.pha.pa.us> [000708 06:40] wrote:

That's a good point. I don't think so because the socket will only
create for one user. Basically, we don't need something bulletproof
here. We just need something to prevent admins from accidentally
starting two postmasters on the same port.

Actually I just remebered the issue here, if one wants to start
postmaster on an alternate port there will be no conflict and
all hell may break loose.

We already lock the /data directory. This is for the port lock.

The whole process could be locked by an fcntl lock on a seperate file,
which I think was already mentioned, however I've deleted most of the
thread unfortunatly.

/tmp/.l.PGSQL.5432 <- fcntl lockfile, aquired first.
/tmp/.s.PGSQL.5432 <- socket.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."