Limitations on PGSQL

Started by Balaji Venkatesan, about 24 years ago. 26 messages.
#1 Balaji Venkatesan
balaji.venkatesan@megasoft.com

Hi,
Can someone please specify the features and, more importantly, the limitations of using
PostgreSQL. More info regarding performance etc. would be of immense help.
Regards
Bv :-)

#2 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Balaji Venkatesan (#1)

At 12:04 05/11/01 +0530, you wrote:

Hi,
Can someone please specify the features and, more importantly, the
limitations of using
PostgreSQL. More info regarding performance etc. would be of immense help.
Regards
Bv :-)

Hello Balaji,

There are no real limitations when using PostgreSQL's smart programming
features: views, triggers, rules, types and the PL/pgSQL server-side language.

For example:

1) FAST READS: triggers can store display values instead of performing
several LEFT JOINs or calling PL/pgSQL functions. Similarly, you can use
triggers to perform complex initialization or maintain consistency when
adding/modifying a record. Cron jobs and functions can run queries and
store the results for instant access (e.g. statistics tables). This makes your
database very fast for complex reads (e.g. in a web environment). This concept
of storing values is the basis of optimization (see the sketch after this list).
2) SAFETY: PostgreSQL is a real transactional system. Using a
combination of views and rules, you can control data modification very
neatly. For example, you can define a sub-select of a table and control the
scope of queries. This is very important in a commercial environment where
your data is valuable and must not be deleted or modified except under a set of rules.
3) CODING: server-side coding is mainly performed in PL/pgSQL, a very easy
and powerful server-side language.
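
To make point 1 concrete, here is a minimal sketch of such a trigger
(table, column and function names are invented for the example; the
syntax shown is for a reasonably recent PostgreSQL -- on 7.x the function
body would be quoted with single quotes and declared to return "opaque"):

CREATE TABLE customers (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE orders (
    id            integer PRIMARY KEY,
    customer_id   integer REFERENCES customers,
    customer_name text    -- denormalized copy maintained by the trigger
);

-- Copy the display value into the row at write time, so that listing
-- orders later needs no join against customers.
CREATE FUNCTION orders_fill_name() RETURNS trigger AS $$
BEGIN
    SELECT name INTO NEW.customer_name
      FROM customers
     WHERE id = NEW.customer_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_fill_name
    BEFORE INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_fill_name();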

This is paradise if you are a programmer. IMHO, the only drawbacks are:

1) TABLE DEFINITION: it is impossible to delete a column or to
promote/demote a column type. You have to drop the table and import the old
values into a new table (see the sketch after this list). This makes life harder when working on large
databases. You are always afraid of losing your data. Even with backups,
it is always 'heartbreaking' to modify a table. You have to perform tests
to ensure all data is there and safe.

2) VIEWS/TRIGGERS cannot be modified. You have to drop them and create them
again. This makes programming a little bit tricky. Furthermore, if you
create a view, let's say "SELECT table1.*, table2.* FROM table1 a LEFT JOIN
table2 b ON a.oid = b.oida", the resulting view displays all fields, hence
making it harder for a non-programmer to read the view's content.
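
As a rough illustration of both workarounds (names are invented; on
current PostgreSQL releases ALTER TABLE ... DROP COLUMN and CREATE OR
REPLACE VIEW make much of this unnecessary):

-- Drawback 1: "dropping" a column by rebuilding the table.
-- (Back up first; indexes, constraints and defaults must be recreated by hand.)
CREATE TABLE customers_new AS
    SELECT id, name, zipcode          -- every column except the unwanted one
    FROM customers;
DROP TABLE customers;
ALTER TABLE customers_new RENAME TO customers;

-- Drawback 2: a view cannot be altered in place, so drop and recreate it,
-- listing the columns explicitly instead of table1.*, table2.* so the
-- result stays readable.
DROP VIEW order_summary;
CREATE VIEW order_summary AS
    SELECT a.id, a.name, b.descr
    FROM table1 a LEFT JOIN table2 b ON a.oid = b.oida;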

These are very small drawbacks compared to the power and reliability of PostgreSQL.

Best regards,
Jean-Michel POURE

#3 Aasmund Midttun Godal
postgresql@envisity.com
In reply to: Jean-Michel POURE (#2)
Re: Limitations on PGSQL

IMHO Postgres' drawbacks are the following:

Severely limited access/grants system - Postgres gives little or no control over anything beyond access to whole tables. Yes, you can create views (a sketch follows below), but views have a couple of drawbacks too... This is especially a problem with regard to functions (no trusted functions).
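
A hedged sketch of the view workaround mentioned above (table, column
and user names are invented): revoke access to the base table and grant
access to a view that exposes only part of it. The catch, as noted, is
that without rules the view is read-only.

REVOKE ALL ON employees FROM clerk;

CREATE VIEW employees_public AS
    SELECT id, name, department       -- no salary column
    FROM employees
    WHERE NOT confidential;           -- and only non-confidential rows

GRANT SELECT ON employees_public TO clerk;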

Long connection time - if you are using the web you will have to use some sort of persistent scheme, e.g. Apache::DBI, otherwise you will handle around 5 requests per second on a decent computer. I wonder whether it would be possible for it to either reconnect, keeping the connection while switching to a new database or user, or reuse its children - like Apache.

No schemas/tablespaces/cross-database access (and it's listed under EXOTIC :()
- You can emulate some of these features, yet it's not the same.

Error messages take a long time to get used to, and generally figuring things out may take some time (at least for me).

If you create a function/trigger/view/rule etc. which accesses a table, and then you drop that table, and recreate it, you may have to recreate the function etc.

Its advantages are:

Runs on practically any platform (I run OpenBSD so it matters).

Supports triggers, rules (statement level triggers), views and stored procedures!

Fast - my queries, which may be quite complex at times, are generally fast, and if they are not I can always speed them up with EXPLAIN, indexes, triggers creating derived tables and so on.

Did I say stored procedures?

License - Do ANYTHING you want with it (more or less); not as communistic as the ubiquitous GPL.

Price - Depending on your internet connection generally less than $0.02...

Great community - Does not mind answering questions and seems to forgive quickly as well.

Write-ahead logging, and many other features I haven't really exploited yet.

Regards,

Aasmund


Aasmund Midttun Godal

aasmund@godal.com - http://www.godal.com/
+47 40 45 20 46

#4 Balaji Venkatesan
balaji.venkatesan@megasoft.com
In reply to: Balaji Venkatesan (#1)
Re: Limitations on PGSQL

Hi Jeff, Poure, Godal,
Sorry Jeff, I was not too sure about where to post.
I shall hereafter post questions like these in pgsql-general. Sorry for the
inconvenience.
And Poure and Godal, really, thanks for the timely info that you people have
given me; I hope it
will be of great help.
Cheers
Bv :-)

----- Original Message -----
From: "Jeff Davis" <list-pgsql-hackers@dynworks.com>
To: "Balaji Venkatesan" <balaji.venkatesan@megasoft.com>
Sent: Monday, November 05, 2001 4:22 PM
Subject: Re: [HACKERS] Limitations on PGSQL

You'll want to ask this type of question on pgsql-general not -hackers. More
importantly, you should narrow your question because what you ask below
leaves too much for us to comment on.

There are great docs at postgresql.org that cover most of what you need to
know. Try reading a few entries in the docs (not every word, just see if you
can find some general answers to your questions) and then see what questions
you still have. Another approach is to ask whether PostgreSQL will work for
your needs (but nobody knows what those needs are).

PostgreSQL is an awesome database. I hope it works out great for you.

Regards,
Jeff Davis

On Sunday 04 November 2001 10:34 pm, you wrote:

Hi,
Can someone please specify the features and, more importantly, the
limitations of using PostgreSQL. More info regarding performance etc. would
be of immense help. Regards
Bv :-)

#5 Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Aasmund Midttun Godal (#3)
Re: Limitations on PGSQL

Long connection time - if you are using the web you will have to use some
sort of persistent scheme, e.g. Apache::DBI, otherwise you will handle
around 5 requests per second on a decent computer. I wonder whether it would
be possible for it to either reconnect, keeping the connection while switching
to a new database or user, or reuse its children - like Apache.

PHP allows persistent connections. Don't you think?

http://uk.php.net/manual/en/configuration.php#ini.sect.pgsql
Postgres Configuration Directives

pgsql.allow_persistent (boolean)
    Whether to allow persistent Postgres connections.
pgsql.max_persistent (integer)
    The maximum number of persistent Postgres connections per process.
pgsql.max_links (integer)
    The maximum number of Postgres connections per process, including
    persistent connections.

Best regards,
Jean-Michel POURE

#6 Sam Cao
scao@verio.net
In reply to: Jean-Michel POURE (#2)
Foreign Key?

First-time user of PostgreSQL....
After creating the database, how do I check what foreign keys (constraint
references, in PostgreSQL terms) were created? I looked around using
"psql" and "pgaccess", but with no success.

Thanks for the help,

Sam,

#7 Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Sam Cao (#6)
Re: Foreign Key?

On Mon, 21 Jan 2002, Sam Cao wrote:

First-time user of PostgreSQL....
After creating the database, how do I check what foreign keys (constraint
references, in PostgreSQL terms) were created? I looked around using
"psql" and "pgaccess", but with no success.

The best thing to look at is probably the "Referential Integrity Tutorial &
Hacking the Referential Integrity Tables" tutorial at
http://techdocs.postgresql.org/

I believe that includes a view definition that gets a lot of that
information out.
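
For what it's worth, on releases that have the pg_constraint system
catalog (7.3 and later) a query along these lines pulls the same
information out directly; on 7.1/7.2 the foreign keys are stored as RI
triggers in pg_trigger, which is what the tutorial's view unpacks:

SELECT conname             AS constraint_name,
       conrelid::regclass  AS referencing_table,
       confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE contype = 'f';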

#8 jtp
john@akadine.com
In reply to: Stephan Szabo (#7)
killed select?

I'm running Postgres 7.1.2 on a FreeBSD machine -- a Celeron 500 with 128
megs of RAM (256 swap). Not the best for real gut-wrenching work, but
it was what was around to get a feel for what was wanted.

A question was asked which I threw at the database to see how it was
able to handle the question at hand, and it failed . . . after 50 minutes
of processing it flopped to the ground, killed: out of swap space.

Granted, the query was a large one (explanations below), but a few
questions:

Is there a way to predict the requirements a system would need to handle a
query of specific size / complexity? (and how?)

Is there a way to pull this type of query off on this system? (Is there a
solution other than throwing more RAM / swap at it? One easy one would be to
handle it in chunks, but other suggestions are welcome.)

What would this type of query need to execute? How about to execute well?

Table and query explanations follow...

The query was joining three tables, which I know is not quite a good idea,
but I didn't see much of another way. The question was posed to find all
the subcategories all customers have ordered from a company.

The history table (history of orders) contains the id, date, cost,
and orderid and has 838500 records.

The ordered table (line items of orders) contains the orderid and a sku
and has 2670000 records

The subcategories table has the sku and subcategory and has 20000 records.

Each customer can have many orders, which can have many items, which can
have many subcategories.

the query was posed as:
SELECT history.id, sub
FROM insub
WHERE history.orderid = ordered.orderid
AND ordered.items = insub.sku
ORDER BY ID;

Any help would be greatly appreciated.
Thanks in advance.

.jtp

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: jtp (#8)
Re: killed select?

jtp <john@akadine.com> writes:

A question was asked which I threw at the database to see how it was
able to handle the question at hand, and it failed . . . after 50 minutes
of processing it flopped to the ground, killed: out of swap space.

My guess is that what actually bombed out was psql, which tries to
buffer the entire result of a query. (Well, actually it's libpq not
psql that does that, but anyway the client side is what's failing.)

I suspect that your query is insufficiently constrained and will return
many millions of rows --- are you sure you have the WHERE clauses right?

If you actually do need to process a query that returns gazillions of
rows, the best bet is to declare a cursor so you can fetch the result
in bite-size chunks, say a few hundred rows at a time.
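
A minimal sketch of the cursor approach, using the tables from the
original question (the FROM list here names all three tables explicitly,
and column names are taken from the original query; adjust to taste):

BEGIN;

DECLARE bigq CURSOR FOR
    SELECT history.id, insub.sub
    FROM history, ordered, insub
    WHERE history.orderid = ordered.orderid
      AND ordered.items = insub.sku
    ORDER BY history.id;

FETCH 500 FROM bigq;    -- process this batch, then FETCH again
FETCH 500 FROM bigq;    -- ...and so on, until no rows come back

CLOSE bigq;
COMMIT;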

regards, tom lane

#10 jtp
john@akadine.com
In reply to: jtp (#8)
general design question

Hi, just a general design question and wondering how postgres would handle
either situation.

I have a gob of information (400000+ records) on individual accounts. I
need to store all of their personal information (name, address, etc.) as
well as all of their more dynamic company information (last purchase,
times ordered, etc.).

One: All their dynamic information can be rebuilt from other tables,
but it will be called upon rather frequently, so the redundancy, so as to
not have to rebuild on every call, seems acceptable to me. (Smack me if I'm
wrong.)

Two: There is only a one-to-one ratio between an account (personal
information) and that account's account information (makes sense,
eh?). But does it make sense to keep this information in the same table or
to break it up? I estimate about 20 fields in two separate tables or 40
in one big one. The personal information will almost always be index
searched by name or zipcode, whereas the other information they (the
proverbial they) will probably want sorted in weirdass ways that the
design was never intended for. Basically, it will be subjected to more
sequential scans than something with close to half a million records
should be. My basic question ends up being: does Postgres handle
sequential scans across tables with fewer fields better? Is there any
performance increase from separating this into two tables?

Thanks for any hints you could give me.
.jtp

#11 Curt Sampson
cjs@cynic.net
In reply to: jtp (#10)
Re: general design question

On Fri, 19 Apr 2002, jtp wrote:

One: All their dynamic information can be rebuilt from other tables,
but it will be called upon rather frequently, so the redundancy, so as to
not have to rebuild on every call, seems acceptable to me. (Smack me if I'm
wrong.)

It's quite reasonable to keep a summary table of information for
fast reference. The only difficulty you have to deal with is how
you keep it up to date. (Update every time the summarized data
change? Update once an hour? Once a day? That kind of thing. It
depends on your application.)
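
As one hedged example of the "update once an hour/day" approach (the
summary and detail table names and columns here are invented), a cron
job can simply rebuild the summary from the detail data:

BEGIN;
DELETE FROM account_summary;
INSERT INTO account_summary (account_id, order_count, last_purchase)
    SELECT account_id, count(*), max(order_date)
    FROM orders
    GROUP BY account_id;
COMMIT;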

My basic question ends up being: does Postgres handle
sequential scans across tables with fewer fields better?

Definitely. Given the same number of rows, a narrower table (fewer
columns, shorter data types, that kind of thing) will always be
scanned faster than a wider one simply because you need to read
less data from the disk. This is database-independent, in fact.

Since vacuuming also effectively involves a sequential scan, you'll
also vacuum faster on a narrower table. So it makes sense to separate
frequently updated data from less frequently updated data, and
vacuum the frequently updated table more often, I would think.

However, for tables that are already narrow, you may get little
performance gain, or in some cases performance may even get worse,
not to mention your data size blowing up bigger. Postgres has a
quite high per-tuple overhead (31 bytes or more) so splitting small
tables can actually cause growth and make things slower, if you
frequently access both tables.
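
For reference, the vertical split being discussed would look roughly
like this (column names borrowed loosely from the earlier message; the
shared key ties the two halves together):

CREATE TABLE account_static (
    account_id int PRIMARY KEY,
    name       text,
    address    text,
    zipcode    text
);

CREATE TABLE account_dynamic (
    account_id    int PRIMARY KEY REFERENCES account_static,
    last_purchase date,
    times_ordered int
);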

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curt Sampson (#11)
Re: general design question

Curt Sampson <cjs@cynic.net> writes:

However, for tables that are already narrow, you may get little
performance gain, or in some cases performance may even get worse,
not to mention your data size blowing up bigger. Postgres has a
quite high per-tuple overhead (31 bytes or more) so splitting small
tables can actually cause growth and make things slower, if you
frequently access both tables.

Right. The *minimum* row overhead in Postgres is 36 bytes (32-byte
tuple header plus 4-byte line pointer). More, the actual data space
will be rounded up to the next MAXALIGN boundary, either 4 or 8 bytes
depending on your platform. On an 8-byte-MAXALIGN platform like mine,
a table containing a single int4 column will actually occupy 44 bytes
per row. Ouch. So database designs involving lots of narrow tables
are not to be preferred over designs with a few wide tables.
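
Spelled out, the 44-byte figure is:

    32 bytes  tuple header (already a MAXALIGN multiple)
     8 bytes  the 4-byte int4, rounded up to the 8-byte MAXALIGN boundary
     4 bytes  line pointer
    --------
    44 bytes per row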

AFAIK, all databases have nontrivial per-row overheads; PG might be
a bit worse than average, but this is a significant issue no matter
which DB you use.

regards, tom lane

#13 Curt Sampson
cjs@cynic.net
In reply to: Tom Lane (#12)
Re: general design question

On Fri, 19 Apr 2002, Tom Lane wrote:

Right. The *minimum* row overhead in Postgres is 36 bytes (32-byte
tuple header plus 4-byte line pointer).

Ah, right! The line pointer is four bytes because it includes the length
of the tuple.

But I'm not sure why we need this length, possibly because I don't
understand the function of the LP_USED and LP_DELETED flags in the line
pointer. (I'm guessing that if LP_USED is not set, the line pointer does
not point to any data, and that if LP_DELETED is set, it points to a
chunk of free space.)

Why could we not just make all unallocated space be pointed to by
LP_DELETED pointers, and then when we need space, use it from those
(splitting and joining as necessary)? That gets rid of the need for
a length. Then we could declare that all tuples must be aligned on a
four-byte boundary, use the top 14 bits of a 16-bit line pointer as the
address, and the bottom two bits for the LP_USED and LP_DELETED flag.
This would slightly simplify the code for determining the flags, and
incidentally boost the maximum page size to 64K.

If you're willing to use a mask and shift to determine the address,
rather than just a mask, you could make the maximum page size 128K,
use the top 15 bits of the line pointer as the address, and use the
remaining bit as the LP_USED flag, since I don't see why we would then
need the LP_DELETED flag at all.

Or am I smoking crack here?

AFAIK, all databases have nontrivial per-row overheads; PG might be
a bit worse than average, but this is a significant issue no matter
which DB you use.

For certain types of tables, such as the sort of table joining two
others for which I forget the proper term:

CREATE TABLE folder_contents (
    folder_id int NOT NULL,
    item_id int NOT NULL,
    PRIMARY KEY (folder_id, item_id))

some databases are much better. In MS SQL server, for example, since
there are no variable length columns, the tuple format will be:

1 byte   status bits A
1 byte   status bits B
2 bytes  fixed-length columns data length
4 bytes  DATA: folder_id
4 bytes  DATA: item_id
2 bytes  number of columns
1 byte   null bitmap (unfortunately doesn't go away in SQL
         server even when there are no nullable columns)

(If there were variable length columns, you would have after this:
two bytes for the number of columns, 2 bytes per column for the
data offsets within the tuple, and then the variable data.)

So in Postgres this would take, what, 44 bytes per tuple? But in
SQL Server this takes 17 bytes per tuple (including the two-byte
line pointer in what they call the page's "row offset array"), or
about 40% of the space.
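
(Spelling that out: 1+1+2+4+4+2+1 = 15 bytes of header and data, plus the
2-byte row offset entry, gives the 17 bytes.)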

Needless to say, in my last job, where I was dealing with a table
like this with 85 million rows, I was happy for this to be a 1.3
GB table instead of a 3.5 GB table. Not that this made much
performance difference in that application anyway, since, with a
clustered index and typical folder sizes at a couple of dozen to
a hundred or so items, I was basically never going to read more
than one or two pages from disk to find the contents of a folder.

Hm. I guess this really should be on hackers, shouldn't it?

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curt Sampson (#13)
Re: general design question

Curt Sampson <cjs@cynic.net> writes:

... Then we could declare that all tuples must be aligned on a
four-byte boundary, use the top 14 bits of a 16-bit line pointer as the
address, and the bottom two bits for the LP_USED and LP_DELETED flag.
This would slightly simplify the code for determining the flags, and
incidentally boost the maximum page size to 64K.

Hmm. Maybe, but the net effect would only be to reduce the minimum row
overhead from 36 to 34 bytes. Not sure it's worth worrying about.
Eliminating redundancy from the item headers has its downside, too,
in terms of ability to detect problems.

... I don't see why we would then
need the LP_DELETED flag at all.

I believe we do want to distinguish three states: live tuple, dead
tuple, and empty space. Otherwise there will be cases where you're
forced to move data immediately to collapse empty space, when there's
not a good reason to, except that your representation can't cope.

Hm. I guess this really should be on hackers, shouldn't it?

Yup...

regards, tom lane

#15 Martijn van Oosterhout
kleptog@svana.org
In reply to: Curt Sampson (#13)
Re: general design question

On Sat, Apr 20, 2002 at 01:55:38PM +0900, Curt Sampson wrote:

AFAIK, all databases have nontrivial per-row overheads; PG might be
a bit worse than average, but this is a significant issue no matter
which DB you use.

For certain types of tables, such as the sort of table joining two
others for which I forget the proper term:

CREATE TABLE folder_contents (
    folder_id int NOT NULL,
    item_id int NOT NULL,
    PRIMARY KEY (folder_id, item_id))

some databases are much better. In MS SQL server, for example, since
there are no variable length columns, the tuple format will be:

1 byte   status bits A
1 byte   status bits B
2 bytes  fixed-length columns data length
4 bytes  DATA: folder_id
4 bytes  DATA: item_id
2 bytes  number of columns
1 byte   null bitmap (unfortunately doesn't go away in SQL
         server even when there are no nullable columns)

Where is the information needed to determine visibility for transactions? In
Postgres that's at least 16 bytes (cmin,cmax,xmin,xmax). How does SQL server
do that?

(If there were variable length columns, you would have after this:
two bytes for the number of columns, 2 bytes per column for the
data offsets within the tuple, and then the variable data.)

In postgres, variable length columns don't cost anything if you don't use
them. An int is always 4 bytes, even if there are variable length columns
elsewhere. The only other overhead is 4 bytes for the OID and 6 bytes for
the CTID, which I guess may be unnecessary.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Canada, Mexico, and Australia form the Axis of Nations That
Are Actually Quite Nice But Secretly Have Nasty Thoughts About America

#16 Curt Sampson
cjs@cynic.net
In reply to: Tom Lane (#14)
On-disk Tuple Size

[I've moved this discussion about changing the line pointer from four
bytes to two from -general to -hackers, since it's fairly technical.
The message Tom is responding to appears in full as #13 above.]

On Sat, 20 Apr 2002, Tom Lane wrote:

Curt Sampson <cjs@cynic.net> writes:

... Then we could declare that all tuples must be aligned on a
four-byte boundary, use the top 14 bits of a 16-bit line pointer as the
address, and the bottom two bits for the LP_USED and LP_DELETED flag.
This would slightly simplify the code for determining the flags, and
incidentally boost the maximum page size to 64K.

Hmm. Maybe, but the net effect would only be to reduce the minimum row
overhead from 36 to 34 bytes. Not sure it's worth worrying about.

Well, unless the implementation is hideously complex, I'd say that
every byte is worth worrying about, given the amount of overhead that's
currently there. 36 to 34 bytes could give something approaching a 5%
performance increase for tables with short rows. (Actually, do we prefer
the tables/rows or relations/tuples terminology here? I guess I kinda
tend to use the latter for physical stuff.)

If we could drop the OID from the tuple when it's not being used,
that would be another four bytes, bringing the performance increase
up towards 15% on tables with short rows.

Of course I understand that all this is contingent not only on such
changes being acceptable, but someone actually caring enough to
write them.

While we're at it, would someone have the time to explain to me
how the on-disk CommandIds are used? A quick look at the code
indicates that this is used for cursor consistency, among other
things, but it's still a bit mysterious to me.

... I don't see why we would then
need the LP_DELETED flag at all.

I believe we do want to distinguish three states: live tuple, dead
tuple, and empty space. Otherwise there will be cases where you're
forced to move data immediately to collapse empty space, when there's
not a good reason to, except that your representation can't cope.

I don't understand this. Why do you need to collapse empty space
immediately? Why not just wait until you can't find an empty fragment
in the page that's big enough, and then do the collapse?

Oh, on a final unrelated note, <john@akadine.com>, you're bouncing
mail from my host for reasons not well explained ("550 Access
denied.") I tried postmaster at your site, but that bounces mail
too. If you want to work out the problem, drop me e-mail from some
address at which you can be responded to.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC


#17 Curt Sampson
cjs@cynic.net
In reply to: Martijn van Oosterhout (#15)
Re: On-Disk Tuple Size

[Moved from general to -hackers.]

On Sat, 20 Apr 2002, Martijn van Oosterhout wrote:

In MS SQL server, for example....

Where is the information needed to determine visibility for transactions? In
Postgres that's at least 16 bytes (cmin,cmax,xmin,xmax). How does SQL server
do that?

SQL Server doesn't use MVCC; it uses locking. (This is not necessarily
less advanced, IMHO; it has the nice properties of saving a bunch of
space and ensuring that, when transaction isolation is serializable,
commits won't fail due to someone else doing updates. But it has costs,
too, as we all know.)

(If there were variable length columns, you would have after this:
two bytes for the number of columns, 2 bytes per column for the
data offsets within the tuple, and then the variable data.)

In postgres, variable length columns don't cost anything if you don't use
them.

Right; just as in SQL server. This was just sort of a side note
for those who are curious.

An int is always 4 bytes, even if there are variable length columns
elsewhere. The only other overhead is 4 bytes for the OID....

Which would be good to get rid of, if we can.

...and 6 bytes for the CTID, which I guess may be unnecessary.

Really? How would things work without it?

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#18 Martijn van Oosterhout
kleptog@svana.org
In reply to: Curt Sampson (#17)
Re: On-Disk Tuple Size

On Sat, Apr 20, 2002 at 05:22:20PM +0900, Curt Sampson wrote:

...and 6 bytes for the CTID, which I guess may be unnecessary.

Really? How would things work without it?

Well, from my examination of the on-disk data the CTID stored there is the
same as its location in the file, so it could just be filled in while
reading.

Unless I'm misunderstanding its purpose.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Canada, Mexico, and Australia form the Axis of Nations That
Are Actually Quite Nice But Secretly Have Nasty Thoughts About America

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#18)
Re: On-Disk Tuple Size

Martijn van Oosterhout <kleptog@svana.org> writes:

Well, from my examination of the on-disk data the CTID stored there is the
same as its location in the file, so it could just be filled in while
reading.

Nope. CTID is used as a forward link from an updated tuple to its newer
version.

regards, tom lane

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curt Sampson (#16)
Re: On-disk Tuple Size

Curt Sampson <cjs@cynic.net> writes:

While we're at it, would someone have the time to explain to me
how the on-disk CommandIds are used?

To determine visibility of tuples for commands within a transaction.
Just as you don't want your transaction's effects to become visible
until you commit, you don't want an individual command's effects to
become visible until you do CommandCounterIncrement. Among other
things this solves the Halloween problem for us (how do you stop
an UPDATE from trying to re-update the tuples it's already emitted,
should it chance to hit them during its table scan).

The command IDs aren't interesting anymore once the originating
transaction is over, but I don't see a realistic way to recycle
the space ...

I believe we do want to distinguish three states: live tuple, dead
tuple, and empty space. Otherwise there will be cases where you're
forced to move data immediately to collapse empty space, when there's
not a good reason to, except that your representation can't cope.

I don't understand this.

I thought more about this in the shower this morning, and realized the
fundamental drawback of the scheme you are suggesting: it requires the
line pointers and physical storage to be in the same order. (Or you
could make it work in reverse order, by looking at the prior pointer
instead of the next one to determine item size; that would actually
work a little better. But in any case line pointer order and physical
storage order are tied together.)

This is clearly a loser for index pages: most inserts would require
a data shuffle. But it is also a loser for heap pages, and the reason
is that on heap pages we cannot change a tuple's index (line pointer
number) once it's been created. If we did, it'd invalidate CTID
forward links, index entries, and heapscan cursor positions for open
scans. Indeed, pretty much the whole point of having the line pointers
is to provide a stable ID for a tuple --- if we didn't need that we
could just walk through the physical storage.

When VACUUM removes a dead tuple, it compacts out the physical space
and marks the line pointer as unused. (Of course, it makes sure all
references to the tuple are gone first.) The next time we want to
insert a tuple on that page, we can recycle the unused line pointer
instead of allocating a new one from the end of the line pointer array.
However, the physical space for the new tuple should come from the
main free-space pool in the middle of the page. To implement the
pointers-without-sizes representation, we'd be forced to shuffle data
to make room for the tuple between the two adjacent-by-line-number tuples.

The three states of a line pointer that I referred to are live
(pointing at a good tuple), dead (pointing at storage that used
to contain a good tuple, doesn't anymore, but hasn't been compacted
out yet), and empty (doesn't point at storage at all; the space it
used to describe has been merged into the middle-of-the-page free
pool). ISTM a pointers-only representation can handle the live and
dead cases nicely, but the empty case is going to be a real headache.

In short, a pointers-only representation would give us a lot less
flexibility in free space management. It's an interesting idea but
I doubt that saving two bytes per row is worth the extra overhead.

regards, tom lane

#21 Curt Sampson
cjs@cynic.net
In reply to: Tom Lane (#20)
Re: On-disk Tuple Size

On Sat, 20 Apr 2002, Tom Lane wrote:

Curt Sampson <cjs@cynic.net> writes:

While we're at it, would someone have the time to explain to me
how the on-disk CommandIds are used?

To determine visibility of tuples for commands within a transaction.
Just as you don't want your transaction's effects to become visible
until you commit, you don't want an individual command's effects to
become visible until you do CommandCounterIncrement. Among other
things this solves the Halloween problem for us (how do you stop
an UPDATE from trying to re-update the tuples it's already emitted,
should it chance to hit them during its table scan).

The command IDs aren't interesting anymore once the originating
transaction is over, but I don't see a realistic way to recycle
the space ...

Ah, I see. So basically, it's exactly parallel to the transaction IDs
except it's for commands instead of transactions?

So this seems to imply to me that the insert command ID fields are of
interest only to the transaction that did the insert. In other words, if
your transaction ID is not the one listed in t_xmin, the t_cmin field is
always ignored. And the same goes for t_cmax and t_xmax, right?

If this is the case, would it be possible to number the commands
per-transaction, rather than globally? Then the t_cmin for a particular
tuple might be say, 7, but though there might be many transactions that
have processed or will process command number 7, we would know which
transaction this belongs to by the t_xmin field.

Does this work for cursors, which currently seem to rely on a global
command ID? If you keep track of the transaction ID as well, I think so,
right?

Having per-transaction command IDs might allow us to reduce the range of
the t_cmin and t_cmax fields. Unfortunately, probably by not all that
much, since one doesn't want to limit the number of commands within a
single transaction to something as silly as 65536.

But perhaps we don't need to increment the command ID for every command.
If I do an insert, but I know that the previous command was also an
insert, I know that there were no intervening reads in this transaction,
so can I use the previous command's ID? Could it be that we need to
increment the command ID only when we switch from writing to reading
or vice versa? There could still be transactions that would run into
problems, of course, but these might all be rather pathological cases.

Or is everybody wishing they had some of whatever I'm smoking? :-)

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#22 Curt Sampson
cjs@cynic.net
In reply to: Tom Lane (#20)
Re: On-disk Tuple Size

On Sat, 20 Apr 2002, Tom Lane wrote:

I believe we do want to distinguish three states: live tuple, dead
tuple, and empty space. Otherwise there will be cases where you're
forced to move data immediately to collapse empty space, when there's
not a good reason to, except that your representation can't cope.

I don't understand this.

I thought more about this in the shower this morning, and realized the
fundamental drawback of the scheme you are suggesting: it requires the
line pointers and physical storage to be in the same order.
...But in any case line pointer order and physical storage order are
tied together.)

I thought that for a while too. As you point out, you want a list of
line pointers ordered by address in the block to build your map of the space
after the free space in the middle of the page, because you get the end
of the current block from the start address of the next block. (We know
where the free space in the middle is from the end of the line pointer
array.)

However, there's no reason it has to be stored in this order on the
disk. You can build a sorted list of line pointers in a separate area of
memory after you read the page.

Yes, this uses a bit more CPU, but I think it's going to be a pretty
trivial amount. It's a short list, and since you're touching the data
anyway, it's going to be in the CPU cache. The real cost you'll pay is
in the time to access the area of memory where you're storing the sorted
list of line pointers. But the potential saving here is up to 5% in I/O
costs (due to using less disk space).

The three states of a line pointer that I referred to are live
(pointing at a good tuple), dead (pointing at storage that used
to contain a good tuple, doesn't anymore, but hasn't been compacted
out yet), and empty (doesn't point at storage at all; the space it
used to describe has been merged into the middle-of-the-page free
pool).

Right. I now realize that we do still need the three states,
which in my case are:

live: points to tuple data in use

free space: points to unused space in the page, i.e., a dead tuple.

unused: a line pointer that doesn't point to anything at all.

ISTM a pointers-only representation can handle the live and
dead cases nicely, but the empty case is going to be a real headache.

This doesn't need a separate flag, since we can just have the line
pointer point to something obviously invalid, such as the page
header. (0 seems quite convenient for this.)

In the header, we need a count of the number of line pointers
(line_id_count in the example below), but we can drop the beginning/end of free
space pointers, since we know that data space starts after the last
line pointer, and ends at the beginning of special space.

So here's an example of a page layout. Sizes are arbitrary ones I
picked for the sake of the example, except for the line_id sizes.

Address  Size  Item

      0    24  page header (line_id_count = 6)

     24     2  line_id: 7751 (free space 1)
     26     2  line_id: 7800 (tuple 1)
     28     2  line_id: 0    (unused)
     30     2  line_id: 7600 (tuple 2)
     32     2  line_id: 8000 (tuple 3)
     34     2  line_id: 7941 (free space 2)

     36  7564  free space in the middle of the page

   7600   150  tuple 2
   7750    50  free space 1
   7800   100  tuple 1
   7940    60  free space 2
   8000    96  tuple 3
   8096    96  special space

Note above that the free space pointers have the LSB set to indicate
that they point to free space, not tuples. So the first line_id
actually points to 7750.

When I do an insert, the first thing I do is scan for a free line
pointer. Finding a free one at 28, I decide to re-use that. Then
I look for the smallest block of free space that will hold the data
that I need to insert. If it fits exactly, I use it. If not, I
need to extend the line pointer array by one and make that point
to the remaining free space in the block of free space I used.

If a big enough block of free space doesn't exist, I compact the
page and try again.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

#23 Rod Taylor
rbt@zort.ca
In reply to: Curt Sampson (#21)
Re: On-disk Tuple Size

Having per-transaction command IDs might allow us to reduce the
range of the t_cmin and t_cmax fields. Unfortunately, probably by not all
that much, since one doesn't want to limit the number of commands within
a single transaction to something as silly as 65536.

If you can figure out how to make that roll over, sure, but that's a
very small number.

Consider users who do most of their stuff via functions (one
transaction). Now consider the function that builds reports, stats,
etc. for some department. It's likely these work on a per-account
basis.

We have a function making invoices that would wrap around that at least
10x.

#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curt Sampson (#21)
Re: On-disk Tuple Size

Curt Sampson <cjs@cynic.net> writes:

If this is the case, would it be possible to number the commands
per-transaction, rather than globally?

They are.

regards, tom lane

#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Curt Sampson (#22)
Re: On-disk Tuple Size

Curt Sampson <cjs@cynic.net> writes:

Yes, this uses a bit more CPU, but I think it's going to be a pretty
trivial amount. It's a short list, and since you're touching the data
anyway, it's going to be in the CPU cache. The real cost you'll pay is
in the time to access the area of memory where you're storing the sorted
list of line pointers. But the potential saving here is up to 5% in I/O
costs (due to using less disk space).

At this point you're essentially arguing that it's faster to recompute
the list of item sizes than it is to read it off disk. Given that the
recomputation would require sorting the list of item locations (with
up to a couple hundred entries --- more than that if blocksize > 8K)
I'm not convinced of that.

Another difficulty is that we'd lose the ability to record item sizes
to the exact byte. What we'd reconstruct from the item locations are
sizes rounded up to the next MAXALIGN boundary. I am not sure that
this is a problem, but I'm not sure it's not either.

The part of this idea that I actually like is overlapping the status
bits with the low order part of the item location, using the assumption
that MAXALIGN is at least 4. That would allow us to support BLCKSZ up
to 64K, and probably save a cycle or two in fetching/storing the item
fields as well. The larger BLCKSZ limit isn't nearly as desirable
as it used to be, because of TOAST, and in fact it could be a net loser
because of increased WAL traffic. But it'd be interesting to try it
and see.

regards, tom lane

#26 Curt Sampson
cjs@cynic.net
In reply to: Tom Lane (#25)
Re: On-disk Tuple Size

On Sun, 21 Apr 2002, Tom Lane wrote:

At this point you're essentially arguing that it's faster to recompute
the list of item sizes than it is to read it off disk. Given that the
recomputation would require sorting the list of item locations (with
up to a couple hundred entries --- more than that if blocksize > 8K)
I'm not convinced of that.

No, not at all. What I'm arguing is that the I/O savings gained from
removing two bytes from the tuple overhead will more than compensate for
having to do a little bit more computation after reading the block.

How do I know? Well, I have very solid figures. I know because I pulled
them straight out of my....anyway. :-) Yeah, it's more or less instinct
that says to me that this would be a win. If others don't agree, there's
a pretty reasonable chance that I'm wrong here. But I think it might
be worthwhile spending a bit of effort to see what we can do to reduce
our tuple overhead. After all, there is a good commercial DB that has
much, much lower overhead, even if it's not really comparable because it
doesn't use MVCC. The best thing really would be to see what other good
MVCC databases do. I'm going to go to the bookshop in the next few days
and try to find out what Oracle's physical layout is.

Another difficulty is that we'd lose the ability to record item sizes
to the exact byte. What we'd reconstruct from the item locations are
sizes rounded up to the next MAXALIGN boundary. I am not sure that
this is a problem, but I'm not sure it's not either.

Well, I don't see any real problem with it, but yeah, I might well be
missing something here.

The larger BLCKSZ limit isn't nearly as desirable as it used to be,
because of TOAST, and in fact it could be a net loser because of
increased WAL traffic. But it'd be interesting to try it and see.

Mmmm, I hadn't thought about the WAL side of things. In an ideal world,
it wouldn't be a problem because WAL writes would be related only to
tuple size, and would have nothing to do with block size. Or so it seems
to me. But I have to go read the WAL code a bit before I care to make
any real assertions there.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC