database 1.2G, pg_dump 73M?!
I have a postgres server for which du reports
1188072 /var/lib/postgresql/8.2/main
on a Linux system.
The server has only one real database, which is for bacula. When I dump
the database, it's 73MB.
This is immediately after I did a full vacuum and restarted the server.
Also,
bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
limit 15;
relname | relpages
---------------------------------+----------
file_jpfid_idx | 27122
file_pathid_idx | 17969
file_jobid_idx | 17948
file_pkey | 14580
file_fp_idx | 12714
file | 11558
file_filenameid_idx | 9806
filename | 3958
filename_name_idx | 2510
filename_pkey | 1367
path | 966
path_name_idx | 950
path_pkey | 151
pg_attribute_relid_attnam_index | 46
pg_proc | 45
It seems very strange to me that there is such a difference in size
between the dump and the database: the data store is almost 15 times
larger than the dump.
Is this to be expected (e.g., from the indices taking up disk space)?
Is there anything I can do to reclaim some disk space?
Thanks.
Ross
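A quick way to see where that space actually goes is to ask the server for
per-relation sizes; a rough sketch, using the size functions available in
8.1 and later:
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
ORDER BY pg_relation_size(oid) DESC
LIMIT 15;
pg_relation_size() counts only the relation's own file, so indexes and
TOAST tables show up as separate rows.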
Ross Boylan wrote:
I have a postgres server for which du reports
1188072 /var/lib/postgresql/8.2/main
on a Linux system.
The server has only one real database, which is for bacula. When I dump
the database, it's 73MB.
This is immediately after I did a full vacuum and restarted the server.
Also,
bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
limit 15;
relname | relpages
---------------------------------+----------
file_jpfid_idx | 27122
file_pathid_idx | 17969
file_jobid_idx | 17948
file_pkey | 14580
file_fp_idx | 12714
file | 11558
file_filenameid_idx | 9806
filename | 3958
filename_name_idx | 2510
filename_pkey | 1367
path | 966
path_name_idx | 950
path_pkey | 151
pg_attribute_relid_attnam_index | 46
pg_proc | 45
It seems very strange to me that there is such a difference in size
between the dump and the database: the data store is almost 15 times
larger than the dump. Is this to be expected (e.g., from the indices
taking up disk space)? Is there anything I can do to reclaim some disk
space?
There are a few factors you need to take into account:
* Data storage in the database is packed into blocks and contains
header data. Since data needs to be put into blocks there is a
potential for wasting space; if you are unlucky, in the worst case a
block may end up holding only a single row.
* You need to vacuum often, to ensure obsolete rows are removed and
space can be reused.
* Tables are never reduced in size, only grown. I think CLUSTER
and VACUUM FULL will reduce the size of your tables.
* Indexes are not in the backup, they are derived from the table
data on a restore.
If you remove the indexes you are left with 150~200 MB of data (I
guessed).
Doing a reindex will rebuild the indexes and get rid of all the bloat
they have collected during use. _I recommend you try this_, as
your indexes on the file table look quite huge.
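For example, a minimal sketch of what that could look like for the largest
tables in the listing above (REINDEX TABLE rebuilds every index on the
table and locks it while it runs, so pick a quiet moment):
bacula=> REINDEX TABLE file;
bacula=> REINDEX TABLE filename;
bacula=> REINDEX TABLE path;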
But the most important factor for you will be the following:
* Backups are compressed. Since you store filenames and paths these
will have a very high amount of regularity and therefore are very
good targets for compression. This can save a huge amount of data.
If you take a compression factor of 50%~70% you will reach your 70 MB.
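For what it's worth, a plain-text dump is not compressed by itself; if a
compressed backup is wanted, something along these lines does it (a
sketch -- pg_dump's custom format compresses by default, or the plain
dump can be piped through gzip):
pg_dump -Fc bacula -f bacula.dump
pg_dump bacula | gzip > bacula.sql.gz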
Oh, and server restarts will not help reduce your database size. In fact,
nothing at all should change, except lower performance until sufficient
data is back in the cache again.
Hope this helps...
- Joris
On Sun, 2008-03-30 at 20:27 +0200, Joris Dobbelsteen wrote:
Ross Boylan wrote:
I have a postgres server for which du reports
1188072 /var/lib/postgresql/8.2/main
on a Linux system.
The server has only one real database, which is for bacula. When I dump
the database, it's 73MB.
This is immediately after I did a full vacuum and restarted the server.
Also,
bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
limit 15;
relname | relpages
---------------------------------+----------
file_jpfid_idx | 27122
file_pathid_idx | 17969
file_jobid_idx | 17948
file_pkey | 14580
file_fp_idx | 12714
file | 11558
file_filenameid_idx | 9806
filename | 3958
filename_name_idx | 2510
filename_pkey | 1367
path | 966
path_name_idx | 950
path_pkey | 151
pg_attribute_relid_attnam_index | 46
pg_proc | 45
It seems very strange to me that there is such a difference in size
between the dump and the database: the data store is almost 15 times
larger than the dump. Is this to be expected (e.g., from the indices
taking up disk space)? Is there anything I can do to reclaim some disk
space?
There are a few factors you need to take into account:
* Data storage in the database is packed into blocks and contains
header data. Since data needs to be put into blocks there is a
potential for wasting space; if you are unlucky, in the worst case a
block may end up holding only a single row.
* You need to vacuum often, to ensure obsolete rows are removed and
space can be reused.
* Tables are never reduced in size, only grown. I think CLUSTER
and VACUUM FULL will reduce the size of your tables.
* Indexes are not in the backup, they are derived from the table
data on a restore.
If you remove the indexes you are left with 150~200 MB of data (I
guessed).
Doing a reindex will rebuild the indexes and get rid of all the bloat
they have collected during use. _I recommend you try this_, as
your indexes on the file table look quite huge.
Reindexing had a huge effect. After reindexing the top tables (file,
filename, and path) I now see
SELECT relname, relpages FROM pg_class ORDER BY relpages DESC limit 15;
relname | relpages
---------------------------------+----------
file | 11558
filename | 3958
filename_name_idx | 2383
file_jpfid_idx | 2145
file_fp_idx | 1787
file_jobid_idx | 1427
file_pathid_idx | 1427
file_pkey | 1427
file_filenameid_idx | 1427
filename_pkey | 1367
path | 966
path_name_idx | 871
path_pkey | 151
pg_attribute_relid_attnam_index | 46
pg_proc | 45
and du now reports 451M. That still seems a bit large, given the size
of the sql dump, but it's almost 2/3 lower than it was before.
Thanks so much!
I guess I need to figure out how to reindex automatically.
But the most important factor for you will be the following:
* Backups are compressed. Since you store filenames and paths these
will have a very high amount of regularity and therefore are very
good targets for compression. This can save a huge amount of data.
If you take a compression factor of 50%~70% you will reach your 70 MB.
I don't see how this is relevant, since my dump file was plain text
(sql).
Oh, and server restarts will not help reduce your database size. In fact,
nothing at all should change, except lower performance until sufficient
data is back in the cache again.
Hope this helps...
It was a huge help.
- Joris
Ross Boylan wrote:
On Sun, 2008-03-30 at 20:27 +0200, Joris Dobbelsteen wrote:
Ross Boylan wrote:
I have a postgres server for which du reports
1188072 /var/lib/postgresql/8.2/main
on a Linux system.
The server has only one real database, which is for bacula. When I dump
the database, it's 73MB.
This is immediately after I did a full vacuum and restarted the server.
Also,
bacula=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC
limit 15;
relname | relpages
---------------------------------+----------
file_jpfid_idx | 27122
file_pathid_idx | 17969
file_jobid_idx | 17948
file_pkey | 14580
file_fp_idx | 12714
file | 11558
file_filenameid_idx | 9806
filename | 3958
filename_name_idx | 2510
filename_pkey | 1367
path | 966
path_name_idx | 950
path_pkey | 151
pg_attribute_relid_attnam_index | 46
pg_proc | 45
It seems very strange to me that there is such a difference in size
between the dump and the database: the data store is almost 15 times
larger than the dump. Is this to be expected (e.g., from the indices
taking up disk space)? Is there anything I can do to reclaim some disk
space?
There are a few factors you need to take into account:
* Data storage in the database is packed into blocks and contains
header data. Since data needs to be put into blocks there is a
potential for wasting space; if you are unlucky, in the worst case a
block may end up holding only a single row.
* You need to vacuum often, to ensure obsolete rows are removed and
space can be reused.
* Tables are never reduced in size, only grown. I think CLUSTER
and VACUUM FULL will reduce the size of your tables.
* Indexes are not in the backup, they are derived from the table
data on a restore.
If you remove the indexes you are left with 150~200 MB of data (I
guessed).
Doing a reindex will rebuild the indexes and get rid of all the bloat
they have collected during use. _I recommend you try this_, as
your indexes on the file table look quite huge.
Reindexing had a huge effect. After reindexing the top tables (file,
filename, and path) I now see
SELECT relname, relpages FROM pg_class ORDER BY relpages DESC limit 15;
relname | relpages
---------------------------------+----------
file | 11558
filename | 3958
filename_name_idx | 2383
file_jpfid_idx | 2145
file_fp_idx | 1787
file_jobid_idx | 1427
file_pathid_idx | 1427
file_pkey | 1427
file_filenameid_idx | 1427
filename_pkey | 1367
path | 966
path_name_idx | 871
path_pkey | 151
pg_attribute_relid_attnam_index | 46
pg_proc | 45
and du now reports 451M. That still seems a bit large, given the size
of the sql dump, but it's almost 2/3 lower than it was before.
Thanks so much!
I guess I need to figure out how to reindex automatically.
Take a threshold, e.g. look at which indexes approach the size of their
table, or something like that; see the sketch below. The bloat is mostly
caused by continuous updates to the indexes on every insert, update and
delete command. The index needs to split pages that might be merged back
some time later. Doing frequent vacuums might, or might not, prevent
this. Even in theory, the algorithms allow trees to grow quite large, up
to a certain constant factor, in order to bound the amount of work that
must be done on any single operation on the index.
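A rough sketch of such a check: list every index whose page count is at
least half that of its table (the one-half threshold is arbitrary; pick
whatever suits you), making it a candidate for REINDEX:
SELECT i.relname AS index_name, t.relname AS table_name,
       i.relpages AS index_pages, t.relpages AS table_pages
FROM pg_class i
JOIN pg_index x ON x.indexrelid = i.oid
JOIN pg_class t ON t.oid = x.indrelid
WHERE i.relpages > t.relpages / 2
ORDER BY i.relpages DESC;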
But the most important factor for you will be the following:
* Backups are compressed. Since you store filenames and paths these
will have a very high amount of regularity and therefore are very
good targets for compression. This can save a huge amount of data.
If you take a compression factor of 50%~70% you will reach your 70 MB.
I don't see how this is relevant, since my dump file was plain text
(sql).
From the top contenders, about half are indexes, so you are stuck with
~200 MB of data in the tables.
Postgresql has some wasted space due to placement of the tuples in a
block and overhead for each block and row. I don't know those values,
but they are in the range of 24 bytes per tuple, I believe. Secondly a
block is 8 KB by default and tuples cannot be stored into multiple
blocks (that's what TOAST works around).
All in all: look up the tuple sizes; if they are small, then the overhead from
postgresql can be a big factor. If you are huge you loose on portions of
unoccupied space in blocks. I believe pg_statistics will provide this
information.
Another factor is that the representation in the SQL dump might be more
efficient than in the database, but this highly depends on your data set.
For example, an int8 takes 8 bytes in a table, while it takes between 1
and ~20 bytes in a SQL dump.
How the plain SQL dump becomes this small I cannot explain without much
more detail.
Hope this helps...
It was a huge help.
Glad it was,
- Joris
Ross Boylan <RossBoylan@stanfordalumni.org> writes:
reindexing had a huge effect.
So the indexes were indeed bloated. There are some known usage patterns
in which regular vacuum isn't very good at reclaiming space in b-tree
indexes. For example if you make daily entries in an index by date and
later remove all but the last-of-the-month entry --- this leaves a few
entries on every index page and we don't have code to collapse that,
short of reindexing.
However what seems more likely is that you're getting burnt by excessive
use of VACUUM FULL. V.F., far from shrinking indexes, tends to bloat
them. Recommended practice is to use plain VACUUM often enough that you
don't need VACUUM FULL.
regards, tom lane
On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
From the top contenders, about half are indexes, so you are stuck with
~200 MB of data in the tables.
Postgresql has some wasted space due to placement of the tuples in a
block and overhead for each block and row. I don't know those values,
but they are in the range of 24 bytes per tuple, I believe. Secondly a
block is 8 KB by default and tuples cannot be stored into multiple
blocks (that's what TOAST works around).
All in all: look up the tuple sizes; if they are small, then the overhead
from postgresql can be a big factor. If you are huge you loose on portions
of unoccupied space in blocks. I believe pg_statistics will provide this
information.
There is a pg_statistic (no "s") table, but I don't know how to get
tuple size from it--the documentation refers to the source code to
figure out the codes. Backing up a step, I don't know what a tuple is
in Postgres, and don't see an entry for it in the index.
Is a tuple just a row? That's what the docs say for the following
report:
# select distinct relname, reltuples, relpages from pg_class where
relkind='r' and substring(relname, 1, 3) != 'pg_';
relname | reltuples | relpages
-------------------------+-----------+----------
basefiles | 0 | 0
cdimages | 0 | 0
client | 2 | 1
counters | 1 | 1
device | 0 | 0
file | 650659 | 11558
filename | 623012 | 3958
fileset | 22 | 1
job | 384 | 10
jobmedia | 596 | 7
location | 0 | 0
locationlog | 0 | 0
log | 0 | 0
media | 245 | 9
mediatype | 2 | 1
path | 67908 | 966
pool | 5 | 1
sql_features | 439 | 6
sql_implementation_info | 12 | 1
sql_languages | 4 | 1
sql_packages | 10 | 1
sql_parts | 9 | 1
sql_sizing | 23 | 1
sql_sizing_profiles | 0 | 0
status | 19 | 1
storage | 2 | 1
unsavedfiles | 0 | 0
version | 1 | 1
Another factor is that the representation in the SQL dump might be more
efficient than in the database, but this highly depends on your data set.
For example, an int8 takes 8 bytes in a table, while it takes between 1
and ~20 bytes in a SQL dump.
How the plain SQL dump becomes this small I cannot explain without much
more detail.
On Tom's point, bacula regularly inserts entries into the tables and
then, days to months later, deletes them. As far as I know, the VACUUM
FULLs I just did were the first ever; I did do several of them because I
kept getting messages about needing more fsm_pages.
I am still trying to figure out if the database was getting any
automatic vacuuming at all. The Postgres documentation (the database is
8.2, though I'm moving to 8.3 soon) sounds as if it's on automatically,
but the Debian-specific documentation suggests I may need to do some
additional things to enable it.
Probably the fsm_pages being low also hurt disk usage, since the
message accompanying the vacuum said that's what happens if fsm_pages
is low. It was 20k; vacuum said I needed 56k, and I upped it to 80k.
I'm not sure if my recent cleaning has brought the needed fsm_pages
down.
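One way to check is to look at the current settings and re-run a
database-wide VACUUM VERBOSE; in 8.2 the tail of its output reports how
many free-space-map page slots are actually needed compared with the
configured limit (a sketch):
SHOW max_fsm_pages;
SHOW max_fsm_relations;
-- the last lines of the following summarize FSM slots needed vs. the limit
VACUUM VERBOSE;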
I've only been doing partial backups for the last few months, so there's
probably less info in the tables than under normal conditions. I
suppose it's possible the space I gained was just a temporary win.
Ross
-----Original Message-----
From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
Sent: Sunday, 30 March 2008 23:43
To: Joris Dobbelsteen
Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
From the top contenders, about half are indexes, so you are stuck
with ~200 MB of data in the tables.
Postgresql has some wasted space due to placement of the tuples in a
block and overhead for each block and row. I don't know those values,
but they are in the range of 24 bytes per tuple, I believe.
Secondly a
block is 8 KB by default and tuples cannot be stored into multiple
blocks (that's what TOAST works around).
All in all: look up the tuple sizes; if they are small, then the overhead
from postgresql can be a big factor. If you are huge you loose on
portions of unoccupied space in blocks. I believe pg_statistics will
provide this information.
There is a pg_statistic (no "s") table, but I don't know how
to get tuple size from it--the documentation refers to the
source code to figure out the codes. Backing up a step, I
don't know what a tuple is in Postgres, and don't see an entry
for it in the index.
It was pg_stats.
You get avg_width. It gives this per column.
So probably you want
SELECT tablename, SUM(avg_width)
FROM pg_stats
WHERE schemaname = 'public'
GROUP BY tablename;
Is a tuple just a row? That's what the docs say for the following
report:
Yes.
Also, where I typed "block", it refers to "page".
# select distinct relname, reltuples, relpages from pg_class
where relkind='r' and substring(relname, 1, 3) != 'pg_';
relname | reltuples | relpages
-------------------------+-----------+----------
basefiles | 0 | 0
cdimages | 0 | 0
client | 2 | 1
counters | 1 | 1
device | 0 | 0
file | 650659 | 11558
filename | 623012 | 3958
fileset | 22 | 1
job | 384 | 10
jobmedia | 596 | 7
location | 0 | 0
locationlog | 0 | 0
log | 0 | 0
media | 245 | 9
mediatype | 2 | 1
path | 67908 | 966
pool | 5 | 1
sql_features | 439 | 6
sql_implementation_info | 12 | 1
sql_languages | 4 | 1
sql_packages | 10 | 1
sql_parts | 9 | 1
sql_sizing | 23 | 1
sql_sizing_profiles | 0 | 0
status | 19 | 1
storage | 2 | 1
unsavedfiles | 0 | 0
version | 1 | 1
Do relpages * 8192 / reltuples.
Default installs use 8 KB (8192-byte) pages/blocks.
For file that gives ~146 bytes/tuple.
For filename it gives ~52 bytes/tuple.
Probably you will get some significant differences here.
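As a sketch, the same arithmetic for every user table in one go (8192
bytes being the default block size):
SELECT relname, relpages, reltuples,
       round(relpages::numeric * 8192 / reltuples::numeric) AS bytes_per_tuple
FROM pg_class
WHERE relkind = 'r' AND reltuples > 0
  AND substring(relname, 1, 3) != 'pg_'
ORDER BY relpages DESC;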
Another factor is that the representation in the SQL dump might be more
efficient than in the database, but this highly depends on your data
set. For example, an int8 takes 8 bytes in a table, while it takes
between 1 and ~20 bytes in a SQL dump.
How the plain SQL dump becomes this small I cannot explain without much
more detail.
On Tom's point, bacula regularly inserts entries into the
tables and then, days to months later, deletes them. As far
as I know, the VACUUM FULLs I just did were the first ever; I
did do several of them because I kept getting messages about
needing more fsm_pages.
I am still trying to figure out if the database was getting
any automatic vacuuming at all. The Postgres documentation
(the database is 8.2, though I'm moving to 8.3 soon) sounds as
if it's on automatically, but the Debian-specific
documentation suggests I may need to do some additional things
to enable it.
Probably the fsm_pages being low also hurt disk usage, since
the message accompanying the vacuum said that's what happens
if fsm_pages is low. It was 20k; vacuum said I needed 56k,
and I upped it to 80k.
I'm not sure if my recent cleaning has brought the needed
fsm_pages down.
I've only been doing partial backups for the last few months,
so there's probably less info in the tables than under normal
conditions. I suppose it's possible the space I gained was
just a temporary win.
Looks like some configuration changes are needed to tune your
installation to better suit your needs. fsm_pages can be modified, but I
don't know if any other parameters might need changing for that.
The documentation can help, and probably a lot of people here could do
and explain it blindfolded.
- Joris
On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
-----Original Message-----
From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
Sent: Sunday, 30 March 2008 23:43
To: Joris Dobbelsteen
Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
From the top contenders, about half are indexes, so you are stuck
with ~200 MB of data in the tables.
Postgresql has some wasted space due to placement of the tuples in a
block and overhead for each block and row. I don't know those values,
but they are in the range of 24 bytes per tuple, I believe.
Secondly a
block is 8 KB by default and tuples cannot be stored into multiple
blocks (that's what TOAST works around).
All in all: look up the tuple sizes; if they are small, then the overhead
from postgresql can be a big factor. If you are huge you loose on
portions of unoccupied space in blocks. I believe pg_statistics will
provide this information.
There is a pg_statistic (no "s") table, but I don't know how
to get tuple size from it--the documentation refers to the
source code to figure out the codes. Backing up a step, I
don't know what a tuple is in Postgres, and don't see an entry
for it in the index.
It was pg_stats.
You get avg_width. It gives this per column.
So probably you want
SELECT tablename, SUM(avg_width)
FROM pg_stats
WHERE schemaname = 'public'
[Thanks; I didn't know about using schemaname to limit it to interesting
tables]
GROUP BY tablename [RB added] ORDER BY tablename;
tablename | sum
-----------+-----
client | 62
counters | 25
file | 109
filename | 18
fileset | 53
job | 149
jobmedia | 52
media | 226
mediatype | 16
path | 82
pool | 179
status | 29
storage | 23
version | 4
So, for example, if each tuple has 24 bytes of overhead, the overhead
more than doubles the size of the filename table (18 bytes of data per
row), which has a big record count. Am I following correctly?
Between the space taken up by indices and the other overhead, the size
difference between the sql dump and the db disk usage is starting to
seem more reasonable.
The database uses SQL-ASCII encoding, so I'd expect the textual parts
(filenames and paths) to take up the same space (just for the basic
storage, not counting overhead/indices) in Postgres as on the dump, 1
byte/character.
I'm not sure what "If you are huge you loose on portions of unoccupied
space in blocks" means.
Is a tuple just a row? That's what the docs say for the following
report:
Yes.
Also, where I typed "block", it refers to "page".
# select distinct relname, reltuples, relpages from pg_class
where relkind='r' and substring(relname, 1, 3) != 'pg_';
relname | reltuples | relpages
-------------------------+-----------+----------
basefiles | 0 | 0
cdimages | 0 | 0
client | 2 | 1
counters | 1 | 1
device | 0 | 0
file | 650659 | 11558
filename | 623012 | 3958
fileset | 22 | 1
job | 384 | 10
jobmedia | 596 | 7
location | 0 | 0
locationlog | 0 | 0
log | 0 | 0
media | 245 | 9
mediatype | 2 | 1
path | 67908 | 966
pool | 5 | 1
sql_features | 439 | 6
sql_implementation_info | 12 | 1
sql_languages | 4 | 1
sql_packages | 10 | 1
sql_parts | 9 | 1
sql_sizing | 23 | 1
sql_sizing_profiles | 0 | 0
status | 19 | 1
storage | 2 | 1
unsavedfiles | 0 | 0
version | 1 | 1
Do relpages * 8192 / reltuples.
Default installs use 8 KB (8192-byte) pages/blocks.
For file that gives ~146 bytes/tuple.
For filename it gives ~52 bytes/tuple.
Those tuple size estimates seem consistent with the tuple sizes reported
earlier when the overhead is added in. I'm impressed!
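For the record, a sketch of that comparison in one query, putting the
on-disk bytes per tuple next to the avg_width sums from pg_stats plus the
rough 24-byte per-row header mentioned above:
SELECT c.relname,
       round(c.relpages::numeric * 8192 / c.reltuples::numeric) AS on_disk,
       s.data_bytes + 24 AS avg_width_plus_header
FROM pg_class c
JOIN (SELECT tablename, SUM(avg_width) AS data_bytes
      FROM pg_stats
      WHERE schemaname = 'public'
      GROUP BY tablename) s ON s.tablename = c.relname
WHERE c.reltuples > 0;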
Probably you will get some significant differences here.
Another factor is that the representation in the SQL dump might be more
efficient than in the database, but this highly depends on your data
set. For example, an int8 takes 8 bytes in a table, while it takes
between 1 and ~20 bytes in a SQL dump.
How the plain SQL dump becomes this small I cannot explain without much
more detail.
On Tom's point, bacula regularly inserts entries into the
tables and then, days to months later, deletes them. As far
as I know, the VACUUM FULLs I just did were the first ever; I
did do several of them because I kept getting messages about
needing more fsm_pages.
I am still trying to figure out if the database was getting
any automatic vacuuming at all. The Postgres documentation
(the database is 8.2, though I'm moving to 8.3 soon) sounds as
if it's on automatically, but the Debian-specific
documentation suggests I may need to do some additional things
to enable it.
Probably the fsm_pages being low also hurt disk usage, since
the message accompanying the vacuum said that's what happens
if fsm_pages is low. It was 20k; vacuum said I needed 56k,
and I upped it to 80k.
I'm not sure if my recent cleaning has brought the needed
fsm_pages down.
I've only been doing partial backups for the last few months,
so there's probably less info in the tables than under normal
conditions. I suppose it's possible the space I gained was
just a temporary win.
Looks like some configuration changes are needed to tune your
installation to better suit your needs. fsm_pages can be modified,
[RB: done]
but I
don't know if any other parameters might need changing for that.
I didn't seem to bump into any system limits; I had already upped the
overall shared memory limit a bit.
The documentation can help, and probably a lot of people here could do
and explain it blindfolded.
I'm not a DB admin; I only play one on my computer. I clearly need to
figure out how to get regular vacuum, analyze, and reindexing going (if
they aren't going already).
Thanks for all your help.
Ross
-----Original Message-----
From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
Sent: Monday, 31 March 2008 0:23
To: Joris Dobbelsteen
Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
Subject: RE: [GENERAL] database 1.2G, pg_dump 73M?!
On Sun, 2008-03-30 at 22:59 +0100, Joris Dobbelsteen wrote:
-----Original Message-----
From: Ross Boylan [mailto:RossBoylan@stanfordalumni.org]
Sent: Sunday, 30 March 2008 23:43
To: Joris Dobbelsteen
Cc: RossBoylan@stanfordalumni.org; pgsql-general@postgresql.org
Subject: Re: [GENERAL] database 1.2G, pg_dump 73M?!
On Sun, 2008-03-30 at 21:22 +0200, Joris Dobbelsteen wrote:
From the top contenders, about half are indexes, so you
are stuck
with ~200 MB of data in the tables.
Postgresql has some wasted space due to placement of the tuples in
a block and overhead for each block and row. I don't know
those values,
but they are in the range of 24 bytes per tuple, I believe.
Secondly a
block is 8 KB by default and tuples cannot be stored into
multiple
blocks (that's what TOAST works around).
All in all: look up the tuple sizes; if they are small, then
the overhead
from postgresql can be a big factor. If you are huge you loose on
portions of unoccupied space in blocks. I believe pg_statistics
will provide this information.
There is a pg_statistic (no "s") table, but I don't know
how to get
tuple size from it--the documentation refers to the source code to
figure out the codes. Backing up a step, I don't know what a tuple
is in Postgres, and don't see an entry for it in the index.
It was pg_stats.
You get avg_width. It gives this per column.
So probably you want
SELECT tablename, SUM(avg_width)
FROM pg_stats
WHERE schemaname = 'public'
[Thanks; I didn't know about using schemaname to limit it to
interesting tables]
GROUP BY tablename [RB added] ORDER BY tablename;
tablename | sum
-----------+-----
client | 62
counters | 25
file | 109
filename | 18
fileset | 53
job | 149
jobmedia | 52
media | 226
mediatype | 16
path | 82
pool | 179
status | 29
storage | 23
version | 4
So, for example, if each tuple has 24 bytes of overhead, the
overhead more than doubles the size of the filename table (18
bytes of data per row), which has a big record count. Am I following
correctly?
Yes.
(Don't take the 24 bytes too literally; it varies between versions. I
think it is close enough, however.)
Between the space taken up by indices and the other overhead,
the size difference between the sql dump and the db disk
usage is starting to seem more reasonable.
The database uses SQL-ASCII encoding, so I'd expect the
textual parts (filenames and paths) to take up the same space
(just for the basic storage, not counting overhead/indices) in
Postgres as on the dump, 1 byte/character.
That's the idea.
I'm not sure what "If you are huge you loose on portions of
unoccupied space in blocks" means.
A tuple has to be in exactly 1 page (or block). It cannot span multiple
pages. Hence if your tuples happen to be half a page in size or slightly
larger, then only a single one will fit in a page. So you waste nearly
half the page. You don't have to worry about this, as your tuples are
significantly smaller than that.
(As a reference only: if a tuple is larger than a block size, some
attributes will be evicted to the TOAST table or compressed. That is
how the limitation of 8000 bytes per tuple is worked around.)
[snip]
# select distinct relname, reltuples, relpages from pg_class where
relkind='r' and substring(relname, 1, 3) != 'pg_';
relname | reltuples | relpages
-------------------------+-----------+----------
basefiles | 0 | 0
cdimages | 0 | 0
client | 2 | 1
counters | 1 | 1
device | 0 | 0
file | 650659 | 11558
filename | 623012 | 3958
fileset | 22 | 1
job | 384 | 10
jobmedia | 596 | 7
location | 0 | 0
locationlog | 0 | 0
log | 0 | 0
media | 245 | 9
mediatype | 2 | 1
path | 67908 | 966
pool | 5 | 1
sql_features | 439 | 6
sql_implementation_info | 12 | 1
sql_languages | 4 | 1
sql_packages | 10 | 1
sql_parts | 9 | 1
sql_sizing | 23 | 1
sql_sizing_profiles | 0 | 0
status | 19 | 1
storage | 2 | 1
unsavedfiles | 0 | 0
version | 1 | 1
Do relpages * 8192 / reltuples.
Default installs use 8 KB (8192-byte) pages/blocks.
For file that gives ~146 bytes/tuple.
For filename it gives ~52 bytes/tuple.
Those tuple size estimates seem consistent with the tuple
sizes reported earlier when the overhead is added in. I'm impressed!
[snip]
- Joris
Ross Boylan wrote:
I'm not a DB admin; I only play one on my computer. I clearly need to
figure out how to get regular vacuum, analyze, and reindexing going (if
they aren't going already).
Thanks for all your help.
Ross
1. Optimize your SQL queries and 'understand' index usage; don't index
everything just because 'somebody told you indexes are good to speed
things up'. For example:
db=> explain analyze select email from users where email='abc@abc.com';
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..652.40 rows=1 width=42) (actual
time=3.467..3.467 rows=0 loops=1)
Filter: (email = 'abc@abc.com'::text)
Total runtime: 3.497 ms
Now we add an index on table 'users', column 'email':
db=> create index users_email_idx on users (email);
repeat the query:
db=> explain analyze select email from users where email='abc@abc.com';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Index Scan using us_email on users (cost=0.00..8.05 rows=1 width=42)
(actual time=0.113..0.113 rows=0 loops=1)
Index Cond: (email = 'abc@abc.com'::text)
Total runtime: 0.152 ms
(3 rows)
Compare just the 'Total runtime' for a start: 0.152 ms for the second
query versus 3.497 ms for the first; I guess we have some improvement?
Always try to use 'explain analyze' to see whether any change you make
to your database actually makes a difference.
2. Do 'vacuum verbose analyze <table>' occasionally, depending on your
database usage: if you're deleting/updating stuff frequently, you will
have to vacuum more often. If you import all your data once and then
basically only do 'select' (i.e. read data), do 'vacuum' on New Year's
Eve :)
3. You want to know the size of your tables and indexes? You want to
know their size before and after a 'vacuum' command? Example:
db=> select * from pg_size_pretty(pg_relation_size('users'));
pg_size_pretty
----------------
5496 kB
(1 row)
then we do the ordinary 'vacuum':
sms=> vacuum users;
VACUUM
db=> select * from pg_size_pretty(pg_relation_size('users'));
pg_size_pretty
----------------
5520 kB
(1 row)
Not much improvement; OK, we're going to do the full 'vacuum':
db=> vacuum full users;
VACUUM
db=> select * from pg_size_pretty(pg_relation_size('users'));
pg_size_pretty
----------------
80 kB
(1 row)
How about indexes? Example:
db=> select * from pg_size_pretty(pg_relation_size('users_pkey'));
pg_size_pretty
----------------
192 kB
(1 row)
For indexes there is no 'vacuum', but there is a 'reindex' command, like:
db=> REINDEX INDEX users_pkey ;
REINDEX
We look at the index size after the 'reindex' command:
db=> select * from pg_size_pretty(pg_relation_size('users_pkey'));
pg_size_pretty
----------------
16 kB
(1 row)
4. Don't bother to 'analyze' anything :) If all those numbers and stats
don't mean anything to you, your life will probably be easier than if
you try to comprehend all those analyses :) Just do a regular clean-up
on your DB (for example as sketched below) and you should be fine.
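A minimal sketch of such a clean-up schedule, using the vacuumdb client
utility from cron (run as the postgres user; the times and the weekly
reindex of the big bacula tables are just an example):
# m  h  dom mon dow  command
30   3  *   *   *    vacuumdb --all --analyze --quiet
0    4  *   *   0    psql -d bacula -c "REINDEX TABLE file"
5    4  *   *   0    psql -d bacula -c "REINDEX TABLE filename"
Once autovacuum is enabled (see the note at the end of the thread), the
nightly vacuumdb becomes largely redundant.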
Unfortunately, PostgreSQL is not a database meant for people looking
for a 'black box' solution, meaning plug-in-electricity-and-forget.
But if you belong to that group of people who are willing to understand
better what their database is up to and to maximize the ratio of 'what
do I get' to 'for how much money', you're welcome by all means :)
Sincerely
Dragan
I am still trying to figure out if the database was getting any
automatic vacuuming at all. The Postgres documentation (the database is
8.2, though I'm moving to 8.3 soon) sounds as if it's on automatically,
but the Debian-specific documentation suggests I may need to do some
additional things to enable it.
This is probably far too late, but anyway: you do indeed have to enable
autovacuum with 8.2, as it isn't enabled by default, at least with
most distributions. In 8.3 it's on by default with most distributions.
# show autovacuum;
will tell you if it's on.
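In 8.2 that means roughly the following lines in postgresql.conf; note
that 8.2 also needs the row-level statistics collector turned on for
autovacuum to do anything (a sketch, leaving the other autovacuum_*
settings at their defaults):
stats_start_collector = on
stats_row_level = on
autovacuum = on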
Cheers
Anton