Win2K Questions

Started by Unknown User · over 23 years ago · 21 messages · general
Jump to latest
#1Unknown User
unknown@unknown.user

A company I am currently doing work for has MS SQL Server 2000 running on a
dedicated host. In the near future hosting of the website and other services
is going to be brought internal. With that they face a very large cost for
purchasing SQL Server and the various licenses. I heard a few people mention
PostgreSQL as a possible solution, and at the outset it seems to support
many of the things that would be required, however, I do have a couple
questions.

How does the performance of PostgreSQL stack up to SQL Server 2000 or
Oracle? Is it even aimed at being an enterprise level database?

What is VACUUM? I have seen this mentioned in various places, saying that it
needs to be run at times. What is it exactly, why does it need to be run,
and is there any way to automate its running?

Lastly, where can I download an executable for Win2K so I can go ahead and
take it for a spin? I already found the pgAdmin app for administration
purposes, so I just need the database engine itself. Also, any tips for
installation on 2K would be most appreciated.

Thanks,
Steve

#2Richard Huxton
dev@archonet.com
In reply to: Unknown User (#1)
Re: Win2K Questions

Please don't tinker with your email address if you're posting to lists. Grab a
hotmail account or something.

On Thursday 07 Nov 2002 4:24 pm, SV wrote:

How does the performance of PostgreSQL stack up to SQL Server 2000 or
Oracle? Is it even aimed at being an enterprise level database?

Depends on usage patterns and how you build your application. There are a
couple of oddities with workarounds: count() and max() aren't very optimised,
for example. There are plenty of people who have replaced MSSQL or Oracle
with PG, so for many people they are comparable. Don't forget that some of the
money you save on licensing can go on better hardware.

In terms of features, replication needs work and we're still waiting on nested
transactions. An "official" replication system is due in 7.4, I think. Other
than that the 7.x versions are very mature; the imminent 7.3 adds
schemas and functions that can return a set of results (among other goodies).

What is VACUUM? I have seen this mentioned in various places, saying that
it needs to be run at times. What is it exactly, why does it need to be
run, and is there any way to automate its running?

There are two types of vacuuming. The first reclaims dead space: since PG uses
MVCC, an update is equivalent to a delete plus an insert, and deleted rows are
just marked as such, without the files being compacted.
The second analyses the tables to keep the stats up to date. This helps PG
decide when it should use an index vs a sequential scan.
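For reference, both kinds can be run by hand from psql; the table name below is just an illustration, and on the 7.x releases of that era the usual way to automate this was a cron job rather than anything built in:

```sql
-- Reclaim space left by dead row versions in one table
VACUUM mytable;

-- Reclaim space and refresh the planner's statistics in one pass
VACUUM ANALYZE mytable;

-- Refresh statistics only (cheaper)
ANALYZE mytable;
```

A nightly crontab entry along the lines of `0 3 * * * vacuumdb --analyze mydb` keeps it from being forgotten.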

Lastly, where can I download an executable for Win2K so I can go ahead and
take it for a spin? I already found the pgAdmin app for administration
purposes, so I just need the database engine itself. Also, any tips for
installation on 2K would be most appreciated.

Look for something called Cygwin. I think you can get a binary download of it
and PostgreSQL together. It's probably easier to run it on Linux/*BSD though - it's
been running on Unix-like systems for years.

--
Richard Huxton

#3Charles H. Woloszynski
chw@clearmetrix.com
In reply to: Unknown User (#1)
Re: Win2K Questions

Richard Huxton wrote:

Depends on usage patterns and how you build your application. There are a
couple of oddities with workarounds: count() and max() aren't very optimised
for example.

You can 'fix' the max() SNAFU with a query of the form
"select field from tbl order by field desc limit 1" (not precise syntax, but the idea is correct).

I call it a SNAFU since I hate having to change queries from something obvious to a more obscure format just to work around an optimizer issue.

Not sure if there is an equivalent query to make count() work faster.
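For the record, a precise form of that rewrite (table and column names are placeholders) puts ORDER BY before LIMIT, which lets the planner satisfy the query with a single backward index scan:

```sql
-- Slow on 7.x: max() cannot use the index, so the whole table is scanned
SELECT max(field) FROM tbl;

-- Equivalent result, but stops after reading one index entry
SELECT field FROM tbl ORDER BY field DESC LIMIT 1;
```

One caveat: if field can contain NULLs, add WHERE field IS NOT NULL, since max() ignores NULLs but in PostgreSQL NULLs sort as largest, so the rewrite would return one.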

Charlie

--

Charles H. Woloszynski

ClearMetrix, Inc.
115 Research Drive
Bethlehem, PA 18015

tel: 610-419-2210 x400
fax: 240-371-3256
web: www.clearmetrix.com

#4Bruce Momjian
bruce@momjian.us
In reply to: Richard Huxton (#2)
Re: Win2K Questions

Richard Huxton wrote:

In terms of features, replication needs work and we're still
waiting on nested transactions. An "official" replication system
is due in 7.4 I think. Other than that the 7.x versions are very
mature, the imminent 7.3 has added schemas and functions that
can return a set of results (among other goodies)

I may do nested transactions for 7.4.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#5Bruce Momjian
bruce@momjian.us
In reply to: Charles H. Woloszynski (#3)
Re: Win2K Questions

Charles H. Woloszynski wrote:

Richard Huxton wrote:

Depends on usage patterns and how you build your application. There are a
couple of oddities with workarounds: count() and max() aren't very optimised
for example.

You can 'fix' the max() SNAFU with a query of the form
"select field from tbl order by field desc limit 1" (not precise
syntax, but the idea is correct).

I call it a SNAFU since I hate having to change queries from
something obvious to a more obscure format just to work around
an optimizer issue.

Not sure if there is an equivalent query to make count() work
faster

The problem with optimizing COUNT() is that different backends have
different tuple views, meaning the count from one backend could be
different than from another backend. I can't see how to optimize that.
Does Oracle do it? Maybe by looking at their redo segments. We don't
have those because redo is stored in the main table.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#6Jean-Luc Lachance
jllachan@nsd.ca
In reply to: Bruce Momjian (#5)
Re: Win2K Questions

Here is a suggestion.

When a count(*) is computed (for all records), store that value and
invalidate it if there is a later insert or delete on the table. The next
improvement would be to maintain a count per active transaction.

Bruce Momjian wrote:

The problem with optimizing COUNT() is that different backends have
different tuple views, meaning the count from one backend could be
different than from another backend. I can't see how to optimize that.
Does Oracle do it? Maybe by looking at their redo segments. We don't
have those because redo is stored in the main table.


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#7scott.marlowe
scott.marlowe@ihs.com
In reply to: Jean-Luc Lachance (#6)
Re: Win2K Questions

But how do you handle the case where two people have two different
connections, and one starts a serializable transaction and adds n rows to
the table? For that transaction there are x+n rows in the table, while
for the transaction started before his there are only x rows. Which is
the "right" answer?

On Fri, 8 Nov 2002, Jean-Luc Lachance wrote:

Here is a suggestion.

When a count(*) is computed (for all records), store that value and
invalidate it if there is a later insert or delete on the table. The next
improvement would be to maintain a count per active transaction.


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#8Jean-Luc Lachance
jllachan@nsd.ca
In reply to: scott.marlowe (#7)
Re: Win2K Questions

Scott,

You answered the question yourself. The operative keyword here is
*before* the transaction started.
You store the global count before the transaction. While in a
transaction, you save the number of inserted and deleted records. When
*all* parallel transactions are committed, you update the global count
with the total of inserted and deleted records. If a connection starts
a new transaction before the other transactions have been
committed, you take the global count plus the adjustment from the previous
transaction.

JLL

"scott.marlowe" wrote:

But how do you handle the case where two people have two different
connections, and one starts a serializable transaction and adds n rows to
the table? For that transaction there are x+n rows in the table, while
for the transaction started before his there are only x rows. Which is
the "right" answer?


#9scott.marlowe
scott.marlowe@ihs.com
In reply to: Jean-Luc Lachance (#8)
Re: Win2K Questions

Only helps the case where you're getting a total count though, and
requires that there be a "count" variable for each table for each
transaction in progress, since each can have a different count. But it
doesn't help at all for

select count(*) from table where id >10000;

which is also pretty common. I think the real problem is that this is one
of those things that is quite hard to optimize in an MVCC database.

This solution may be best implemented in userland, by having a separate
table that stores the counts of the tables you're interested in, and uses
the MVCC system to provide different counts to each transaction.

But the performance of updating that secondary table may be worse than
just running a count(*).

I doubt the black (gray??? :-) magic needed to do this will be put into
the backend of postgresql any time soon. But the userland solution is
something that could be quite useful.
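A rough sketch of that userland approach, with invented names and no claim to being the best layout: a side table holds the current count, and row-level triggers keep it in step, so each transaction sees whatever counter value its own snapshot allows:

```sql
-- Side table: one row per tracked table
CREATE TABLE row_counts (
    table_name text PRIMARY KEY,
    n          bigint NOT NULL
);

-- Trigger function (7.x-era plpgsql, hence the quoted body and
-- the opaque return type): bump on INSERT, decrement on DELETE
CREATE FUNCTION maintain_count() RETURNS opaque AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = TG_RELNAME;
    ELSIF TG_OP = ''DELETE'' THEN
        UPDATE row_counts SET n = n - 1 WHERE table_name = TG_RELNAME;
    END IF;
    RETURN NULL;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER mytable_count AFTER INSERT OR DELETE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE maintain_count();
```

Note the cost: every writer now contends for the same row_counts row, exactly the serialization bottleneck MVCC is designed to avoid, so this only pays off when reads of the count vastly outnumber writes.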

On Fri, 8 Nov 2002, Jean-Luc Lachance wrote:

Scott,

You answered the question yourself. The operative keyword here is
*before* the transaction started.
You store the global count before the transaction. While in a
transaction, you save the number of inserted and deleted records. When
*all* parallel transactions are committed, you update the global count
with the total of inserted and deleted records. If a connection starts
a new transaction before the other transactions have been
committed, you take the global count plus the adjustment from the previous
transaction.

JLL

"scott.marlowe" wrote:

but how do you handle the case where two people have two different
connections, and one starts a serializable transaction and adds n rows to
the table. For that transaction, there are x+n rows in the table, while
for the transaction started before his, there are only x rows. which is
the "right" answer?

On Fri, 8 Nov 2002, Jean-Luc Lachance wrote:

Here is a suggestion.

When a count(*) is computed (for all records) store that value and
unvalidate it if there is a later insert or delete on the table. Next
improvement would be to maintain a count per active transaction.

Bruce Momjian wrote:

Charles H. Woloszynski wrote:

Richard Huxton wrote:

Depends on usage patterns and how you build your application. There are a
couple of oddities with workarounds: count() and max() aren't very optimised
for example.

You can 'fix' the max() SNAFU with a new query of the form
"select field from tbl limit 1 order by field desc" (not precise
syntax, but the idea is correct)

I call it a SNAFU since it I hate to have to change queries from
something obvious to a more obscure format just to work around
an optimizer issue.

Not sure if there is an equivalent query to make count() work
faster

The problem with optimizing COUNT() is that different backends have
different tuple views, meaning the count from one backend could be
different than from another backend. I can't see how to optimize that.
Does oracle do it? Maybe by looking their redo segements. We don't
have those because redo is stored in the main table.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#10Jean-Luc Lachance
jllachan@nsd.ca
In reply to: scott.marlowe (#9)
Re: Win2K Questions

Scott,

unless id is indexed there is nothing that can be done with
select count(*) from table where id >10000;
Otherwise, the index should be scanned, not the table.

And scanning a large table to get count(*) will always be worse than
maintaining your own count.

JLL

"scott.marlowe" wrote:


[...]

select count(*) from table where id >10000;

[...]

But the performance of updating that secondary table may be worse than
just running a count(*).

I doubt the black (gray??? :-) magic needed to do this will be put into
the backend of postgresql any time soon. But the userland solution is
something that could be quite useful.


#11Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Jean-Luc Lachance (#10)
Re: Win2K Questions

On Fri, 8 Nov 2002, Jean-Luc Lachance wrote:

Scott,

unless id is indexed there is nothing that can be done with
select count(*) from table where id >10000;
Otherwise, the index should be scanned, not the table.

Maybe, maybe not. If id>10000 is most of the table,
you're effectively scanning the whole table (in fact in that
case it'll probably opt to do a seq scan anyway) since you
don't know if the tuple is live until you can actually see
it.

And scanning a large table to get count(*) will always be worse than
maintaining your own count.

The select may be faster, but in overall speed you may lose if there's
a lot of contention on changing the count relative to the frequency of
reading the count.

It could be a good thing, but you'd have to make sure that you
could accurately reproduce the count for all the various visibility
rules that it might be executed in. At the very least it'd have to
give the right results for both base serializable and read committed
(note that in the latter you may see rows that were committed by
a transaction that was not committed at the start).

#12Neil Conway
neilc@samurai.com
In reply to: Jean-Luc Lachance (#10)
Re: Win2K Questions

Jean-Luc Lachance <jllachan@nsd.ca> writes:

unless id is indexed there is nothing that can be done with
select count(*) from table where id >10000;
Otherwise, the index should be scanned, not the table.

Indexes don't store heap tuple visibility information; you'd need to
scan the heap as well in order to determine which tuples your
transaction can see.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#13Richard Huxton
dev@archonet.com
In reply to: Bruce Momjian (#5)
Re: Win2K Questions

On Friday 08 Nov 2002 5:21 pm, Bruce Momjian wrote:

Charles H. Woloszynski wrote:

Not sure if there is an equivalent query to make count() work
faster

The problem with optimizing COUNT() is that different backends have
different tuple views, meaning the count from one backend could be
different than from another backend. I can't see how to optimize that.
Does oracle do it? Maybe by looking their redo segements. We don't
have those because redo is stored in the main table.

The only way I could model it when I thought about it some time ago was as
though you had a separate table "pg_table_counts" with columns (tableoid,
count) - every insert/delete would also update this table. Then the standard
transaction-id semantics would work re: visibility of the "current" value.

Of course, this only helps in the scenario of count(*) for a real table and
nothing more complicated (count distinct, views etc). I can also imagine a
fair performance hit unless you optimised quite heavily.

--
Richard Huxton

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Huxton (#13)
Re: Win2K Questions

Richard Huxton <dev@archonet.com> writes:

On Friday 08 Nov 2002 5:21 pm, Bruce Momjian wrote:

The problem with optimizing COUNT() is that different backends have
different tuple views, meaning the count from one backend could be
different than from another backend. I can't see how to optimize that.

The only way I could model it when I thought about it some time ago was as
though you had a separate table "pg_table_counts" with columns (tableoid,
count) - every insert/delete would also update this table.

The problem with that is that it would create a serialization
bottleneck: if transaction A has done an insert into table X, then every
other transaction B that wants to insert or delete in X has to wait for
A to commit or abort before B can update X's row in pg_table_counts.
That is exactly the scenario that MVCC was designed to avoid.

What it comes down to is that you can optimize "select count(*) from
foo" at the expense of slowing down *every* kind of database-update
operation. We don't think that's a win.

regards, tom lane

#15Jean-Luc Lachance
jllachan@nsd.ca
In reply to: scott.marlowe (#9)
Re: Win2K Questions

This explains it all.

What would be involved in adding version and visibility to the index?

It would allow scanning the index instead of the whole table for
many of the count() requests.

JLL

Neil Conway wrote:


Jean-Luc Lachance <jllachan@nsd.ca> writes:

unless id is indexed there is nothing that can be done with
select count(*) from table where id >10000;
Otherwise, the index should be scanned, not the table.

Indexes don't store heap tuple visibility information; you'd need to
scan the heap as well in order to determine which tuples your
transaction can see.


#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jean-Luc Lachance (#15)
Re: Win2K Questions

Jean-Luc Lachance <jllachan@nsd.ca> writes:

What would be involved in adding version and visibility to the index?

* Index bloat. An index entry is currently 8 bytes plus the index key,
eg 12 bytes for an int4 index. Version info would add 12 bytes.
Doubling the size of indexes would double the time for index scans.

* Update costs. Instead of one place to update when a row is updated,
now all the associated index entries would have to be updated too.

regards, tom lane

#17Jean-Luc Lachance
jllachan@nsd.ca
In reply to: scott.marlowe (#9)
Re: Win2K Questions

Tom Lane wrote:

Jean-Luc Lachance <jllachan@nsd.ca> writes:

What would be involved in adding version and visibility to the index?

* Index bloat. An index entry is currently 8 bytes plus the index key,
eg 12 bytes for an int4 index. Version info would add 12 bytes.
Doubling the size of indexes would double the time for index scans.

That is true for small keys, but for a varchar(20) key the impact is less.

* Update costs. Instead of one place to update when a row is updated,
now all the associated index entries would have to be updated too.

The index has to be updated anyhow to reflect the new record. Doesn't
it?


#18Bruce Momjian
bruce@momjian.us
In reply to: Jean-Luc Lachance (#17)
Re: Win2K Questions

Jean-Luc Lachance wrote:

Tom Lane wrote:

Jean-Luc Lachance <jllachan@nsd.ca> writes:

What would be involved in adding version and visibility to the index?

* Index bloat. An index entry is currently 8 bytes plus the index key,
eg 12 bytes for an int4 index. Version info would add 12 bytes.
Doubling the size of indexes would double the time for index scans.

That is true for for small keys, but for varchar(20) the impact is less.

* Update costs. Instead of one place to update when a row is updated,
now all the associated index entries would have to be updated too.

The index has to be updated anyhow to reflect the new record. Doesn't
it?

Actually no. Index scans can go from the index to the heap, see the
tuple is dead, and move on to the next one. We do have some code in 7.3
which updates the index tuple status bit so we know not to look again.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
Re: Win2K Questions

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Jean-Luc Lachance wrote:

The index has to be updated anyhow to reflect the new record. Doesn't
it?

Actually no. Index scans can go from the index to the heap, see the
tuple is dead, and move on to the next one.

More specifically: an UPDATE operation has to insert *new* index entries
pointing at the new version of the row. It does not presently have to
touch the index entries for the prior version of the row. Similarly,
DELETE need not modify index entries at all. To maintain version status
in index entries, both those operations would have to get slower.
(The eventual cleanup of the dead index entries is handled by VACUUM,
which we hope is not critical to interactive performance.)

I also think that Jean-Luc is underestimating the significance of the
index-bloat issue. The primary reason to have an index at all is that
it's much smaller than the table it indexes, and therefore is
considerably cheaper to scan. Increasing the size of index entries
is a fundamental blow to their usefulness.

regards, tom lane

#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#19)
Re: Win2K Questions

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Jean-Luc Lachance wrote:

The index has to be updated anyhow to reflect the new record. Doesn't
it?

Actually no. Index scans can go from the index to the heap, see the
tuple is dead, and move on to the next one.

More specifically: an UPDATE operation has to insert *new* index entries
pointing at the new version of the row. It does not presently have to
touch the index entries for the prior version of the row. Similarly,
DELETE need not modify index entries at all. To maintain version status
in index entries, both those operations would have to get slower.
(The eventual cleanup of the dead index entries is handled by VACUUM,
which we hope is not critical to interactive performance.)

Also, consider how hard it is to find the index entries matching a given
heap row being updated. Being able to skip that step is a big win for
UPDATE and DELETE. The nice thing is that it is updated later when
someone accesses it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#21Jean-Luc Lachance
jllachan@nsd.ca
In reply to: Bruce Momjian (#18)