tables > 1 gig

Started by Bruce Momjian over 26 years ago, 33 messages
#1Bruce Momjian
maillist@candle.pha.pa.us

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

/*
* RELSEG_SIZE is the maximum number of blocks allowed in one disk file.
* Thus, the maximum size of a single file is RELSEG_SIZE * BLCKSZ;
* relations bigger than that are divided into multiple files.
*
* CAUTION: RELSEG_SIZE * BLCKSZ must be less than your OS' limit on file
* size. This is typically 2Gb or 4Gb in a 32-bit operating system. By
* default, we make the limit 1Gb to avoid any possible integer-overflow
* problems within the OS. A limit smaller than necessary only means we
* divide a large relation into more chunks than necessary, so it seems
* best to err in the direction of a small limit. (Besides, a power-of-2
* value saves a few cycles in md.c.)
*
* CAUTION: you had best do an initdb if you change either BLCKSZ or
* RELSEG_SIZE.
*/
#define RELSEG_SIZE (0x40000000 / BLCKSZ)
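
For reference, a tiny standalone program (not from the tree, assuming the stock BLCKSZ of 8192) showing the arithmetic: with the default 0x40000000 each segment file is exactly 1 GiB, comfortably under a 2 GiB OS file-size limit.

#include <stdio.h>

#define BLCKSZ      8192                      /* assumed stock block size */
#define RELSEG_SIZE (0x40000000 / BLCKSZ)     /* blocks per segment file  */

int
main(void)
{
    long long seg_bytes = (long long) RELSEG_SIZE * BLCKSZ;

    /* 131072 blocks * 8192 bytes = 1073741824 bytes = exactly 1 GiB */
    printf("blocks per segment: %d\n", RELSEG_SIZE);
    printf("bytes per segment : %lld\n", seg_bytes);
    return 0;
}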

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#1)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

Why? I thought we'd fixed the mdtruncate issue.

regards, tom lane

#3Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

Why? I thought we'd fixed the mdtruncate issue.

regards, tom lane

I am told we did not by Hiroshi. It was news to me too.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Bruce Momjian (#3)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

Why? I thought we'd fixed the mdtruncate issue.

regards, tom lane

I am told we did not by Hiroshi. It was news to me too.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#4)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

Why? I thought we'd fixed the mdtruncate issue.

I am told we did not by Hiroshi. It was news to me too.

Then we'd better fix the underlying problem. We can't change
RELSEG_SIZE for a minor release, unless you want to give up the
principle of not forcing initdb at minor releases.

regards, tom lane

#6Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#5)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

Why? I thought we'd fixed the mdtruncate issue.

I am told we did not by Hiroshi. It was news to me too.

Then we'd better fix the underlying problem. We can't change
RELSEG_SIZE for a minor release, unless you want to give up the
principle of not forcing initdb at minor releases.

Why can't we increase it?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Then we'd better fix the underlying problem. We can't change
RELSEG_SIZE for a minor release, unless you want to give up the
principle of not forcing initdb at minor releases.

Why can't we increase it?

Consider a 1.5-gig table. 6.5 will store it as one gig in file "table",
one-half gig in file "table.1". Now recompile with larger RELSEG_SIZE.
The file manager will now expect to find all blocks of the relation in
file "table", and will never go to "table.1" at all. Presto, you lost
a bunch of data.
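
To spell out the arithmetic (a sketch only, not code from md.c, and the block number is made up): the storage manager maps a block number to a segment file and an offset using RELSEG_SIZE, so the same block lands in a different file once the constant changes.

#include <stdio.h>

#define BLCKSZ 8192    /* assumed stock block size */

/* Roughly how the storage manager locates a block; illustrative only. */
static void
locate(const char *label, long blkno, long relseg_size)
{
    long segno  = blkno / relseg_size;              /* 0 => "table", 1 => "table.1", ... */
    long offset = (blkno % relseg_size) * BLCKSZ;   /* byte offset within that file      */

    printf("%s: block %ld -> segment %ld, offset %ld\n",
           label, blkno, segno, offset);
}

int
main(void)
{
    long blkno = 150000;    /* some block of a 1.5-gig table */

    locate("old RELSEG_SIZE", blkno, 0x40000000L / BLCKSZ);  /* finds it in "table.1"         */
    locate("new RELSEG_SIZE", blkno, 0x7F000000L / BLCKSZ);  /* looks past the end of "table" */
    return 0;
}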

Bottom line is just as it says in the config.h comments: you can't
change either BLCKSZ or RELSEG_SIZE without doing initdb.

regards, tom lane

#8Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#7)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Then we'd better fix the underlying problem. We can't change
RELSEG_SIZE for a minor release, unless you want to give up the
principle of not forcing initdb at minor releases.

Why can't we increase it?

Consider a 1.5-gig table. 6.5 will store it as one gig in file "table",
one-half gig in file "table.1". Now recompile with larger RELSEG_SIZE.
The file manager will now expect to find all blocks of the relation in
file "table", and will never go to "table.1" at all. Presto, you lost
a bunch of data.

Bottom line is just as it says in the config.h comments: you can't
change either BLCKSZ or RELSEG_SIZE without doing initdb.

OK. I will reverse it out. I never thought that far ahead. Not sure
how we can fix this easily, nor do I understand why more people aren't
complaining about not being able to vacuum tables that are 1.5 gigs that
they used to be able to vacuum.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#8)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

... nor do I understand why more people aren't
complaining about not being able to vacuum tables that are 1.5 gigs that
they used to be able to vacuum.

Most likely, not very many people with tables that big have adopted 6.5
yet ... if I were running a big site, I'd probably wait for 6.5.1 on
general principles ;-)

I think what we ought to do is finish working out how to make mdtruncate
safe for concurrent backends, and then do it. That's the right
long-term answer anyway.

regards, tom lane

#10Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#9)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

... nor do I understand why more people aren't
complaining about not being able to vacuum tables that are 1.5 gigs that
they used to be able to vacuum.

Most likely, not very many people with tables that big have adopted 6.5
yet ... if I were running a big site, I'd probably wait for 6.5.1 on
general principles ;-)

I think what we ought to do is finish working out how to make mdtruncate
safe for concurrent backends, and then do it. That's the right
long-term answer anyway.

Problem is, no one knows how right now. I liked unlinking every
segment, but was told by Hiroshi that it causes a problem with concurrent
access and vacuum because the old backends still think it is there.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#11Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#5)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Config.h has this. Does this need to be updated because we can't vacuum
multi-segment relations? I have changed it to 7F000000:

Why? I thought we'd fixed the mdtruncate issue.

I am told we did not by Hiroshi. It was news to me too.

Then we'd better fix the underlying problem. We can't change
RELSEG_SIZE for a minor release, unless you want to give up the
principle of not forcing initdb at minor releases.

No initdb for minor releases!

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#11)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

I think what we ought to do is finish working out how to make mdtruncate
safe for concurrent backends, and then do it. That's the right
long-term answer anyway.

Problem is, no one knows how right now. I liked unlinking every
segment, but was told by Hiroshi that causes a problem with concurrent
access and vacuum because the old backends still think it is there.

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

You could truncate to zero length *and* then unlink the files if you
had a mind to do that, but I can see why unlink without truncate would
not work reliably.
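
In terms of plain POSIX calls, truncating the trailing segment to zero length while leaving the file in place would look roughly like this (a sketch only; the file names and block count are made up, and the real code goes through md.c's descriptor bookkeeping rather than raw fds):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    long keep_blocks = 100000;              /* blocks that survive, all in segment 0 */

    int fd0 = open("table",   O_RDWR);      /* first segment  */
    int fd1 = open("table.1", O_RDWR);      /* second segment */

    if (fd0 < 0 || fd1 < 0)
    {
        perror("open");
        return 1;
    }

    /* Shrink segment 0 to the surviving blocks ... */
    if (ftruncate(fd0, keep_blocks * 8192L) < 0 ||
        /* ... and empty the trailing segment, but leave the file in place
         * so backends that already have it open keep a valid descriptor. */
        ftruncate(fd1, 0) < 0)
    {
        perror("ftruncate");
        return 1;
    }

    close(fd0);
    close(fd1);
    return 0;
}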

regards, tom lane

#13Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#12)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

I think what we ought to do is finish working out how to make mdtruncate
safe for concurrent backends, and then do it. That's the right
long-term answer anyway.

Problem is, no one knows how right now. I liked unlinking every
segment, but was told by Hiroshi that causes a problem with concurrent
access and vacuum because the old backends still think it is there.

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

You could truncate to zero length *and* then unlink the files if you
had a mind to do that, but I can see why unlink without truncate would
not work reliably.

That seems like the issue. The more complex problem is that when the
relation loses a segment via vacuum, things go strange on the other
backends. Hiroshi seems to have a good testbed for this, and I thought
it was fixed, so I didn't notice.

Unlinking allows other backends to keep their open segments of the
tables, but that causes some problems with backends opening segments
that they think still exist but can no longer be opened.

Truncating segments causes problems because backends are still accessing
their own copies of the tables, and truncation modifies what is seen
through their open file descriptors.

We basically have two methods, and both have problems under certain
circumstances. I wonder if we could unlink the files, but then create
zero-length segments for the ones we unlink. If people think that may
fix the problems, it is easy to do, and we can do it atomically
using the rename() system call: create the zero-length file under a
temp name, then rename it to the segment file name. That may do the
trick of allowing existing file descriptors to stay active, while having
segments in place for those that need to see them.
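
Something like this in plain C (just a sketch of the idea; 'zz' and 'tablename' are placeholder names):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Create a zero-length file under a temporary name. */
    int fd = open("zz", O_CREAT | O_TRUNC | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    close(fd);

    /* Atomically put it in place of the segment file: the path never
     * disappears, but already-open descriptors keep the old file. */
    if (rename("zz", "tablename") < 0)
    {
        perror("rename");
        return 1;
    }
    return 0;
}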

Comments?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#14Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#12)
Re: [HACKERS] tables > 1 gig

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

Not sure about that. When we truncate a single-segment file, the table is
being destroyed, so we invalidate it in the catalog cache and tell other
backends. Also, we have a problem with DROP TABLE in a transaction
while others are using it, as described by a bug report a few days ago,
so I don't think we have that 100% either.

You could truncate to zero length *and* then unlink the files if you
had a mind to do that, but I can see why unlink without truncate would
not work reliably.

That is interesting. I never thought of that. Hiroshi, can you test
that idea? If it is the non-existence of the file that other backends
are checking for, my earlier idea of rename() with a truncated file kept
in place may be better.

Also, I see why we are not getting more bug reports. They only get this
when the table loses a segment, so it is OK to vacuum large tables as
long as the table doesn't lose a segment during the vacuum.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#15Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Bruce Momjian (#14)
Re: [HACKERS] tables > 1 gig

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

Not sure about that. When we truncate single segment file, the table is
being destroyed, so we invalidate it in the catalog cache and tell other
backends. Also, we have a problem with DROP TABLE in a transaction
while others are using it as described by a bug report a few days ago,
so I don't think we have that 100% either.

That is interesting. I never thought of that. Hiroshi, can you test
that idea? If it is the non-existance of the file that other backends
are checking for, my earlier idea of rename() with truncated file kept
in place may be better.

Also, I see why we are not getting more bug reports. They only get this
when the table looses a segment, so it is OK to vacuum large tables as
long as the table doesn't loose a segment during the vacuum.

OK, this is 100% wrong. We truncate from vacuum any time the table size
changes, and vacuum of large tables will fail even when not removing a
segment. I forgot vacuum does this to reduce on-disk table size.

I wonder if truncating a file to reduce its size will cause other table
readers to have problems. I thought vacuum had an exclusive lock on the
table during vacuum, and if so, why are other backends having trouble?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#16Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#12)
RE: [HACKERS] tables > 1 gig

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, June 18, 1999 12:54 AM
To: Bruce Momjian
Cc: PostgreSQL-development; Inoue@tpf.co.jp
Subject: Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

I think what we ought to do is finish working out how to make mdtruncate
safe for concurrent backends, and then do it. That's the right
long-term answer anyway.

Problem is, no one knows how right now. I liked unlinking every
segment, but was told by Hiroshi that causes a problem with concurrent
access and vacuum because the old backends still think it is there.

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

You could truncate to zero length *and* then unlink the files if you
had a mind to do that, but I can see why unlink without truncate would
not work reliably.

Unlinking unused segments after truncating to zero length may cause
the result such as

Existent backends write to the truncated file to extend the relation
while new backends create a new segment file to extend the relation.

Comments ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#17Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#15)
RE: [HACKERS] tables > 1 gig

I haven't been paying much attention, but I imagine that what's really
going on here is that once vacuum has collected all the still-good
tuples at the front of the relation, it doesn't bother to go through
the remaining blocks of the relation and mark everything dead therein?
It just truncates the file after the last block that it put tuples into,
right?

If this procedure works correctly for vacuuming a simple one-segment
table, then it would seem that truncation of all the later segments to
zero length should work correctly.

Not sure about that. When we truncate single segment file, the table is
being destroyed, so we invalidate it in the catalog cache and tell other
backends. Also, we have a problem with DROP TABLE in a transaction
while others are using it as described by a bug report a few days ago,
so I don't think we have that 100% either.

The problem is that (virtual) file descriptors, relcache entries, etc.
are local to each process. I don't know of a reliable way to tell other
processes, just in time, that the target resources should be invalidated.

That is interesting. I never thought of that. Hiroshi, can you test
that idea? If it is the non-existance of the file that other backends
are checking for, my earlier idea of rename() with truncated file kept
in place may be better.

Also, I see why we are not getting more bug reports. They only get this
when the table looses a segment, so it is OK to vacuum large tables as
long as the table doesn't loose a segment during the vacuum.

OK, this is 100% wrong. We truncate from vacuum any time the table size
changes, and vacuum of large tables will fail even if not removing a
segment. I forgot vacuum does this to reduce disk table size.

I wonder if truncating a file to reduce its size will cause other table
readers to have problems.

The current implementation has a hidden bug:
once the size of a segment reached RELSEG_SIZE, mdnblocks()
would never check the real size of the segment again.

I'm not sure other such bugs don't exist.
It's one of the reasons why I don't recommend applying my trial patch
to mdtruncate().

I though vacuum had an exlusive lock on the
table during vacuum, and if so, why are other backends having troubles?

We could not see any errors from unlinking segmented relations when
commands are executed sequentially.
Vacuum calls RelationInvalidateHeapTuple() for a pg_class tuple, and
other backends can recognize that the relcache entry must be
invalidated while executing StartTransaction() or
CommandCounterIncrement().

Even though the target relation is locked exclusively by vacuum, other
backends can still StartTransaction(), CommandCounterIncrement(),
parse, analyze, rewrite, optimize, start the executor stage and open relations.
We cannot rely on the exclusive lock that much.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#18Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hiroshi Inoue (#16)
Re: [HACKERS] tables > 1 gig

Unlinking unused segments after truncating to zero length may cause
the result such as

Existent backends write to the truncated file to extend the relation
while new backends create a new segment file to extend the relation.

How about my idea of creating a truncated file, then renaming it to the
table file? That keeps the table open for other open file descriptors,
but puts a zero-length file in place in an atomic manner.

The fact is that the current code is really bad, so I request you do your
best, and let's get it in there for people to review and improve if
necessary.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#19Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#18)
RE: [HACKERS] tables > 1 gig

-----Original Message-----
From: Bruce Momjian [mailto:maillist@candle.pha.pa.us]
Sent: Friday, June 18, 1999 11:31 AM
To: Hiroshi Inoue
Cc: PostgreSQL-development
Subject: Re: [HACKERS] tables > 1 gig

Unlinking unused segments after truncating to zero length may cause
the result such as

Existent backends write to the truncated file to extend the relation
while new backends create a new segment file to extend the relation.

How about my idea of creating a truncated file, the renaming it to the
table file. That keeps the table open for other open file descriptors,
but put a zero-length file in place in an atomic manner.

Sorry, I couldn't understand what you mean.
How is that different from truncating the existing files to zero length?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#20Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hiroshi Inoue (#19)
Re: [HACKERS] tables > 1 gig

Unlinking unused segments after truncating to zero length may cause
the result such as

Existent backends write to the truncated file to extend the relation
while new backends create a new segment file to extend the relation.

How about my idea of creating a truncated file, the renaming it to the
table file. That keeps the table open for other open file descriptors,
but put a zero-length file in place in an atomic manner.

Sorry,I couldn't understand what you mean.
What is differenct from truncating existent files to zero length ?

Glad to explain. Here is the pseudocode:

create temp file, make it zero length, call it 'zz'
rename(zz,tablename)

What this does is to create a zero-length file, and the rename unlinks
the tablename file and puts the zero-length file in its place.
rename() is atomic, so there is no moment at which the table file does not
exist.

It allows backends that have the table open via a descriptor to keep the
table unchanged, while new backends see a zero-length file.

Does this help?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#21Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#20)
RE: [HACKERS] tables > 1 gig

How about my idea of creating a truncated file, the renaming it to the
table file. That keeps the table open for other open file descriptors,
but put a zero-length file in place in an atomic manner.

Sorry,I couldn't understand what you mean.
What is differenct from truncating existent files to zero length ?

Glad to explain. Here is the pseudocode:

create temp file, make it zero length, call it 'zz'
rename(zz,tablename)

What this does is to create a zero length file, and the rename unlinks
the tablename file, and puts the zero-length file in it's place.
rename() is atomic, so there is no time that the table file does not
exist.

Let
i1 be the inode of zz
i2 be the inode of tablename
before rename().

Does this mean that

new backends read/write inode i1, and
backends that already have the table open read/write inode i2?

If so, it seems wrong.
All backends should see the same data.
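
This is easy to see with fstat()/stat() on the two inodes (a sketch with made-up file names; it assumes "tablename" already exists):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat via_fd, via_path;

    int oldfd = open("tablename", O_RDONLY);                   /* the "existing backend" */
    int zz    = open("zz", O_CREAT | O_TRUNC | O_WRONLY, 0600);

    if (oldfd < 0 || zz < 0)
    {
        perror("open");
        return 1;
    }
    close(zz);

    if (rename("zz", "tablename") < 0)                         /* swap in the empty file */
    {
        perror("rename");
        return 1;
    }

    fstat(oldfd, &via_fd);         /* old inode (i2): still the full data  */
    stat("tablename", &via_path);  /* new inode (i1): the zero-length file */
    printf("open fd sees inode %ld, the path now names inode %ld\n",
           (long) via_fd.st_ino, (long) via_path.st_ino);

    close(oldfd);
    return 0;
}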

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#22Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hiroshi Inoue (#21)
Re: [HACKERS] tables > 1 gig

What this does is to create a zero length file, and the rename unlinks
the tablename file, and puts the zero-length file in it's place.
rename() is atomic, so there is no time that the table file does not
exist.

Let
i1 be the inode of zz
i2 be the inode of tablename
before rename().

Does this mean

New backends read/write i1 inode and
backends that have the table open read/write i2 inode ?

If so,it seems wrong.
All backends should see same data.

Yes, I can see your point. It would show them different views of the
table.

So, as you were saying, we have no way of invalidating file descriptors
of other backends for secondary segments. Why does truncating the file
not work? Any ideas?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#23Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#22)
RE: [HACKERS] tables > 1 gig

What this does is to create a zero length file, and the rename unlinks
the tablename file, and puts the zero-length file in it's place.
rename() is atomic, so there is no time that the table file does not
exist.

Let
i1 be the inode of zz
i2 be the inode of tablename
before rename().

Does this mean

New backends read/write i1 inode and
backends that have the table open read/write i2 inode ?

If so,it seems wrong.
All backends should see same data.

Yes, I can see your point. It would show them different views of the
table.

So, as you were saying, we have no way of invalidating file descriptors
of other backends for secondary segments.

It seems DROP TABLE has a similar problem.
Has it already been solved?

Why does truncating the file
not work? Any ideas?

I have gotten no bug reports for my trial implementation.
AFAIK, only Ole Gjerde has tested my patch.
Is that sufficient?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#24Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hiroshi Inoue (#23)
Re: [HACKERS] tables > 1 gig

Yes, I can see your point. It would show them different views of the
table.

So, as you were saying, we have no way of invalidating file descriptors
of other backends for secondary segments.

It seems DROP TABLE has a similar problem.
It has been already solved ?

Not solved. Someone reported it recently.

Why does truncating the file
not work? Any ideas?

I have gotten no bug reports for my trial implementation.
AFAIK,only Ole Gjerde has tested my patch.
Is it sufficient ?

Yes. We need something, and maybe after we add it, people can do
testing and find any problems. It is better to apply it than to leave
it as it currently exists, no?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#25Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Bruce Momjian (#24)
RE: [HACKERS] tables > 1 gig

Yes, I can see your point. It would show them different views of the
table.

So, as you were saying, we have no way of invalidating file descriptors
of other backends for secondary segments.

Why does truncating the file
not work? Any ideas?

I have gotten no bug reports for my trial implementation.
AFAIK,only Ole Gjerde has tested my patch.
Is it sufficient ?

Yes. We need something, and maybe after we add it, people can do
testing and find any problems. It is better to apply it than to leave
it as it currently exists, no?

OK, here is my patch for the PostgreSQL 6.5 release.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

*** storage/smgr/md.c.orig	Fri Jun 11 12:20:06 1999
--- storage/smgr/md.c	Fri Jun 18 15:10:54 1999
***************
*** 674,684 ****
  	segno = 0;
  	for (;;)
  	{
! 		if (v->mdfd_lstbcnt == RELSEG_SIZE
! 			|| (nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ)) == RELSEG_SIZE)
  		{
- 
- 			v->mdfd_lstbcnt = RELSEG_SIZE;
  			segno++;
  			if (v->mdfd_chain == (MdfdVec *) NULL)
--- 674,685 ----
  	segno = 0;
  	for (;;)
  	{
! 		nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ);
! 		if (nblocks > RELSEG_SIZE)
! 			elog(FATAL, "segment too big in mdnblocks!");
! 		v->mdfd_lstbcnt = nblocks;
! 		if (nblocks == RELSEG_SIZE)
  		{
  			segno++;

if (v->mdfd_chain == (MdfdVec *) NULL)
***************
*** 711,732 ****
MdfdVec *v;

#ifndef LET_OS_MANAGE_FILESIZE
! int curnblk;

curnblk = mdnblocks(reln);
! if (curnblk / RELSEG_SIZE > 0)
! {
! elog(NOTICE, "Can't truncate multi-segments relation %s",
! reln->rd_rel->relname.data);
! return curnblk;
! }
#endif

fd = RelationGetFile(reln);
v = &Md_fdvec[fd];

if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
return -1;

return nblocks;

--- 712,766 ----
  	MdfdVec    *v;

#ifndef LET_OS_MANAGE_FILESIZE
! int curnblk,
! i,
! oldsegno,
! newsegno,
! lastsegblocks;
! MdfdVec **varray;

curnblk = mdnblocks(reln);
! if (nblocks > curnblk)
! return -1;
! oldsegno = curnblk / RELSEG_SIZE;
! newsegno = nblocks / RELSEG_SIZE;
!
#endif

fd = RelationGetFile(reln);
v = &Md_fdvec[fd];

+ #ifndef LET_OS_MANAGE_FILESIZE
+ 	varray = (MdfdVec **)palloc((oldsegno + 1) * sizeof(MdfdVec *));
+ 	for (i = 0; i <= oldsegno; i++)
+ 	{
+ 		if (!v)
+ 			elog(ERROR,"segment isn't open in mdtruncate!");
+ 		varray[i] = v;
+ 		v = v->mdfd_chain;
+ 	}
+ 	for (i = oldsegno; i > newsegno; i--)
+ 	{
+ 		v = varray[i];
+ 		if (FileTruncate(v->mdfd_vfd, 0) < 0)
+ 		{
+ 			pfree(varray);
+ 			return -1;
+ 		}
+ 		v->mdfd_lstbcnt = 0;
+ 	}
+ 	/* Calculate the # of blocks in the last segment */
+ 	lastsegblocks = nblocks - (newsegno * RELSEG_SIZE);
+ 	v = varray[i];
+ 	pfree(varray);
+ 	if (FileTruncate(v->mdfd_vfd, lastsegblocks * BLCKSZ) < 0)
+ 		return -1;
+ 	v->mdfd_lstbcnt = lastsegblocks;
+ #else
  	if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
  		return -1;
+ 	v->mdfd_lstbcnt = nblocks;
+ #endif

return nblocks;

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#25)
Re: [HACKERS] tables > 1 gig

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Unlinking unused segments after truncating to zero length may cause
the result such as
Existent backends write to the truncated file to extend the relation
while new backends create a new segment file to extend the relation.

Ooh, good point. So, unless we want to invent some way for the process
that's running vacuum to force other backends to close their FDs for
segment files, the *only* correct solution is to truncate to zero length
but leave the files in place.

I still don't quite see why there is such a big problem, however, unless
you're asserting that vacuum is broken for single-segment tables too.
Surely vacuum acquires a lock over the whole table, not just a segment?

regards, tom lane

#27Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Hiroshi Inoue (#25)
Re: [HACKERS] tables > 1 gig

Thank you. Applied.


Yes, I can see your point. It would show them different views of the
table.

So, as you were saying, we have no way of invalidating file descriptors
of other backends for secondary segments.

Why does truncating the file
not work? Any ideas?

I have gotten no bug reports for my trial implementation.
AFAIK,only Ole Gjerde has tested my patch.
Is it sufficient ?

Yes. We need something, and maybe after we add it, people can do
testing and find any problems. It is better to apply it than to leave
it as it currently exists, no?

OK,here is my patch for PostgreSQL6.5-release.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

*** storage/smgr/md.c.orig	Fri Jun 11 12:20:06 1999
--- storage/smgr/md.c	Fri Jun 18 15:10:54 1999
***************
*** 674,684 ****
segno = 0;
for (;;)
{
! 		if (v->mdfd_lstbcnt == RELSEG_SIZE
! 			|| (nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ)) == RELSEG_SIZE)
{
- 
- 			v->mdfd_lstbcnt = RELSEG_SIZE;
segno++;
if (v->mdfd_chain == (MdfdVec *) NULL)
--- 674,685 ----
segno = 0;
for (;;)
{
! 		nblocks = _mdnblocks(v->mdfd_vfd, BLCKSZ);
! 		if (nblocks > RELSEG_SIZE)
! 			elog(FATAL, "segment too big in mdnblocks!");
! 		v->mdfd_lstbcnt = nblocks;
! 		if (nblocks == RELSEG_SIZE)
{
segno++;

if (v->mdfd_chain == (MdfdVec *) NULL)
***************
*** 711,732 ****
MdfdVec *v;

#ifndef LET_OS_MANAGE_FILESIZE
! int curnblk;

curnblk = mdnblocks(reln);
! if (curnblk / RELSEG_SIZE > 0)
! {
! elog(NOTICE, "Can't truncate multi-segments relation %s",
! reln->rd_rel->relname.data);
! return curnblk;
! }
#endif

fd = RelationGetFile(reln);
v = &Md_fdvec[fd];

if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
return -1;

return nblocks;

--- 712,766 ----
MdfdVec    *v;

#ifndef LET_OS_MANAGE_FILESIZE
! int curnblk,
! i,
! oldsegno,
! newsegno,
! lastsegblocks;
! MdfdVec **varray;

curnblk = mdnblocks(reln);
! if (nblocks > curnblk)
! return -1;
! oldsegno = curnblk / RELSEG_SIZE;
! newsegno = nblocks / RELSEG_SIZE;
!
#endif

fd = RelationGetFile(reln);
v = &Md_fdvec[fd];

+ #ifndef LET_OS_MANAGE_FILESIZE
+ 	varray = (MdfdVec **)palloc((oldsegno + 1) * sizeof(MdfdVec *));
+ 	for (i = 0; i <= oldsegno; i++)
+ 	{
+ 		if (!v)
+ 			elog(ERROR,"segment isn't open in mdtruncate!");
+ 		varray[i] = v;
+ 		v = v->mdfd_chain;
+ 	}
+ 	for (i = oldsegno; i > newsegno; i--)
+ 	{
+ 		v = varray[i];
+ 		if (FileTruncate(v->mdfd_vfd, 0) < 0)
+ 		{
+ 			pfree(varray);
+ 			return -1;
+ 		}
+ 		v->mdfd_lstbcnt = 0;
+ 	}
+ 	/* Calculate the # of blocks in the last segment */
+ 	lastsegblocks = nblocks - (newsegno * RELSEG_SIZE);
+ 	v = varray[i];
+ 	pfree(varray);
+ 	if (FileTruncate(v->mdfd_vfd, lastsegblocks * BLCKSZ) < 0)
+ 		return -1;
+ 	v->mdfd_lstbcnt = lastsegblocks;
+ #else
if (FileTruncate(v->mdfd_vfd, nblocks * BLCKSZ) < 0)
return -1;
+ 	v->mdfd_lstbcnt = nblocks;
+ #endif

return nblocks;

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#28Ole Gjerde
gjerde@icebox.org
In reply to: Bruce Momjian (#27)
Re: [HACKERS] tables > 1 gig

On Fri, 18 Jun 1999, Bruce Momjian wrote:
[snip - mdtruncate patch]

While talking about this whole issue, there is one piece missing.
Currently there is no way to dump a database/table over 2 GB.
When it hits the 2GB OS limit, it just silently stops and gives no
indication that it didn't finish.

It's not a problem for me yet, but I'm getting very close. I have one
database with 3 tables over 2 GB (in postgres space), but they still come
out under 2 GB after a dump. I can't do a pg_dump of the whole database,
however, which would be very nice.

I suppose it wouldn't be overly hard to have pg_dump/pg_dumpall do
something similar to what postgres does with segments. I haven't looked
at it yet however, so I can't say for sure.

Comments?

Ole Gjerde

#29Hannu Krosing
hannu@trust.ee
In reply to: Ole Gjerde (#28)
Re: [HACKERS] tables > 1 gig

Ole Gjerde wrote:

On Fri, 18 Jun 1999, Bruce Momjian wrote:
[snip - mdtruncate patch]

While talking about this whole issue, there is one piece missing.
Currently there is no way to dump a database/table over 2 GB.
When it hits the 2GB OS limit, it just silently stops and gives no
indication that it didn't finish.

It's not a problem for me yet, but I'm getting very close. I have one
database with 3 tables over 2GB(in postgres space), but they still come
out under 2GB after a dump. I can't do a pg_dump on the whole database
however, which would be very nice.

I suppose it wouldn't be overly hard to have pg_dump/pg_dumpall do
something similar to what postgres does with segments. I haven't looked
at it yet however, so I can't say for sure.

Comments?

As pg_dump writes to stdout, you can just use standard *nix tools:

1. use compressed dumps

pg_dump really_big_db | gzip > really_big_db.dump.gz

reload with

gunzip -c really_big_db.dump.gz | psql newdb
or
cat really_big_db.dump.gz | gunzip | psql newdb

2. use split

pg_dump really_big_db | split -b 1m - really_big_db.dump.

reload with

cat really_big_db.dump.* | psql newdb

-----------------------
Hannu

#30Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tom Lane (#7)
Re: [HACKERS] tables > 1 gig

Bruce Momjian <maillist@candle.pha.pa.us> writes:

Then we'd better fix the underlying problem. We can't change
RELSEG_SIZE for a minor release, unless you want to give up the
principle of not forcing initdb at minor releases.

Why can't we increase it?

Consider a 1.5-gig table. 6.5 will store it as one gig in file "table",
one-half gig in file "table.1". Now recompile with larger RELSEG_SIZE.
The file manager will now expect to find all blocks of the relation in
file "table", and will never go to "table.1" at all. Presto, you lost
a bunch of data.
Bottom line is just as it says in the config.h comments: you can't
change either BLCKSZ or RELSEG_SIZE without doing initdb.

Sorry for backing up so far on this thread...

Would it be possible to make BLCKSZ and/or RELSEG_SIZE (the latter
perhaps the most important, and perhaps the easiest?) a configurable
parameter which is read out of a global variable for each database? If
so, we could later think about moving it, along with things like
"default character set", to pg_database as per-db information, and
make it an option on CREATE DATABASE. That kind of thing might make
in-place upgrades easier too.

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Lockhart (#30)
Re: [HACKERS] tables > 1 gig

Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

Would it be possible to make BLCKSZ and/or RELSEG_SIZE (the latter
perhaps the most important, and perhaps the easiest?) a configurable
parameter which is read out of a global variable for each database?

Doable, perhaps, but I'm not quite sure why it's worth the trouble...
there doesn't seem to be that much value in running different DBs with
different values inside a single installation. Tweaking BLCKSZ, in
particular, will become fairly uninteresting once we solve the tuples-
bigger-than-a-block problem.

regards, tom lane

#32Zeugswetter Andreas IZ5
Andreas.Zeugswetter@telecom.at
In reply to: Tom Lane (#31)
AW: [HACKERS] tables > 1 gig

I suppose it wouldn't be overly hard to have pg_dump/pg_dumpall do
something similar to what postgres does with segments. I haven't looked
at it yet however, so I can't say for sure.

I would not integrate such functionality into pg_dump, since it is not
necessary.
A good thing though would be a little HOWTO on splitting and/or compressing
pg_dump output.

The principle is:

backup:
mkfifo tapepipe
( gzip --fast -c < tapepipe | split -b512m - database.dump.gz. ) &
pg_dump -f tapepipe regression
rm tapepipe

restore:
createdb regression
cat database.dump.gz.* | gzip -cd | psql regression

Instead of gzip you could use a faster compressor like lzop, but you get the
idea :-)

Andreas

#33Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Zeugswetter Andreas IZ5 (#32)
Re: AW: [HACKERS] tables > 1 gig

A good thing though would be a little HOWTO on splitting and/or
compressing pg_dump output.

I already snarfed the description from Hannu Krosing and have put it
into manage-ag.sgml (which I've not yet committed to the web site;
will do so soon).

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California