proposals for LLL, part 1

Started by Vadim Mikheev, almost 28 years ago, 12 messages, hackers
#1Vadim Mikheev
vadim@krs.ru

Ok, I'm not sure that LLL will appear in 6.4, but it's a good time to
discuss it.

First, PostgreSQL is a multi-version system due to its
non-overwriting storage manager. And so, the first proposal is to
use this feature (multi-versioning) in the LLL implementation.

In multi-version systems, access methods don't use locks to read
consistent data, so readers don't block writers, writers don't
block readers, and only writers of the same row block other
writers. In such systems access methods return a snapshot of the
data as it was at _some_ point in time. For the read committed
isolation level this moment is the time when the statement began.
For the serialized isolation level it is the time when the current
transaction began.

Oracle uses rollback segments to reconstruct blocks that were
changed after the statement/transaction began, so a statement sees
only data that was committed by then.

In our case we have to analyze a tuple's xmin/xmax to determine
_when_ the corresponding transaction was committed, relative to
the last transaction (LCX) that was committed when the
statement/transaction began.

If xmin/xmax was committed before LCX then the tuple
insertion/deletion is visible to the statement; otherwise it is
not.

To achieve this, the second proposal is to use a special SCN -
System Change Number, (C) Oracle :) - that will be incremented by
1 at each transaction commit. Each committed transaction will have
a corresponding SCN (4 bytes -> equal to sizeof XID).

We have to keep the XID --> SCN mapping as long as there is a
running transaction that is "interested" in the XID: when a
transaction begins it will determine the XID of the first (the
oldest) running transaction, and this will be the minimum XID
whose SCN the transaction would like to know.

Access methods will have to determine the SCN for xmin/xmax only
if FRX <= xmin/xmax <= LSX, where FRX is the XID of the first
(oldest) running transaction and LSX is the last started
transaction - at the moment when the statement (for read
committed) or transaction (for serialized) began. For such
xmin/xmax, their SCNs will be compared with the SCN determined at
the moment the statement/transaction began...

Changes made by xmin/xmax < FRX are visible to the
statement/transaction, and changes made by xmin/xmax > LSX are not
visible - without any xmin/xmax SCN lookup.
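Taken together, the rules above can be sketched as a single visibility test. This is a sketch under assumptions: the struct layout, the function names, and the tie rule (a transaction whose SCN equals the snapshot SCN counts as visible) are invented for illustration, not part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Xid;   /* transaction id */
typedef unsigned int Scn;   /* system change number */

/* Snapshot taken when the statement (read committed) or the
 * transaction (serialized) begins; all names are invented. */
typedef struct {
    Xid frx;    /* XID of the first (oldest) running transaction */
    Xid lsx;    /* XID of the last started transaction */
    Scn scn;    /* SCN at snapshot time */
} Snapshot;

/* scn_lookup stands in for the shared XID --> SCN queue.
 * Assumed tie rule: an SCN equal to the snapshot SCN is visible. */
static bool change_is_visible(Xid xid, const Snapshot *snap,
                              Scn (*scn_lookup)(Xid))
{
    if (xid < snap->frx)
        return true;     /* committed before any running transaction */
    if (xid > snap->lsx)
        return false;    /* started after the snapshot was taken */
    /* only in this window do we pay for the SCN lookup */
    return scn_lookup(xid) <= snap->scn;
}
```

Note that the SCN lookup is needed only for the narrow FRX..LSX window, which is what keeps the shared queue small.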

For the XID --> SCN mapping I propose to use the simplest scheme:
an ordered queue of SCNs (or something like this) - i.e. keep SCNs
for all transactions, from the first one whose SCN could be
required by some running transaction up to the last started one.

This queue must be shared!

The size of this queue and the average number of commits/aborts
per second will define how long transactions will be able to run:
30 xacts/sec and 400K of queue (at 4 bytes per SCN, about 100K
entries) will enable 30 - 60 minute running transactions...

Keeping the queue in shared memory may be unacceptable in some
cases... mmap or the shared buffer pool could be used to access
the queue. We'll see...

Also note that Oracle has a special READ ONLY transaction mode.
READ ONLY transactions are disallowed from changing anything in
the database. This is a good mode for pg_dump and other
long-running applications. Because no one will be "interested" in
the SCN of a READ ONLY transaction, such transactions can make a
private copy of their part of the queue, after which the queue
could be truncated...

Having 4 bytes per SCN enables us to use special values to mark
the corresponding transaction as running or aborted, avoiding a
pg_log lookup when we need both the SCN and the state of a
transaction.
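A minimal sketch of what such sentinel values could look like; the particular values and names here are invented, not taken from the proposal:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Scn;

/* Two sentinel SCN values (invented for illustration) reserved
 * out of the 4-byte range, so a queue entry alone tells us
 * whether the transaction is running, aborted, or committed with
 * a real SCN - no pg_log lookup needed. */
#define SCN_RUNNING ((Scn) 0xFFFFFFFFu)
#define SCN_ABORTED ((Scn) 0xFFFFFFFEu)

static bool xact_committed(Scn s)
{
    return s != SCN_RUNNING && s != SCN_ABORTED;
}
```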

...Well, it's time to sleep :)

To be continued...

Comments ?

Vadim

#2Bruce Momjian
bruce@momjian.us
In reply to: Vadim Mikheev (#1)
Re: [HACKERS] proposals for LLL, part 1

I have a good solution for this. It will require only 4k of shared
memory, and will have no restrictions on the age or number of
transactions.

First, I think we only want to implement the "read committed"
isolation level, not serialized. Not sure why someone would want
serialized.

OK, when a backend is looking at a row that has been committed, it must
decide if the row was committed before or after my transaction started.
If the transaction id (xmin) is greater than our current xid, we
know we should not look at it, because it is from a transaction that
started after our own transaction.

The problem is with transactions started before our own (having
xmin less than our xid), which may have committed before or after
our transaction started.

Here is my idea. We add a field to the shared memory Proc structure
that can contain up to 32 transaction ids. When a transaction starts,
we spin through all other open Proc structures, and record all
currently-running transaction ids in our own Proc field used to store up
to 32 transaction ids. While we do this, we remember the lowest of
these open transaction ids.

This is our snapshot of current transactions at the time our transaction
starts. While analyzing a row, if its xmin is greater than our
transaction id, then that transaction was not even started before our
transaction. If the xmin is lower than the minimum transaction id that
we remembered from the Proc structures, it was committed before our
transaction started. If it is greater than or equal to the minimum
remembered transaction id, we must spin through our stored transaction
ids. If it is in the stored list, we don't look at the row, because
that transaction was not committed when we started our transaction. If
it is not in the list, it must have been committed before our
transaction started. We know this because any backend starting a
transaction after ours would get a transaction id higher than ours.
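The test described above can be sketched as follows; this is an illustration with invented names and struct layout, not actual backend code:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Xid;

#define MAX_SNAPSHOT_XIDS 32

/* Snapshot recorded at transaction start (names invented). */
typedef struct {
    Xid my_xid;                    /* our own transaction id */
    Xid min_xid;                   /* lowest in-progress xid we saw */
    int nxids;
    Xid xids[MAX_SNAPSHOT_XIDS];   /* xids running when we started */
} Snapshot;

/* true if the row's inserting transaction committed before ours began */
static bool committed_before_us(Xid row_xmin, const Snapshot *snap)
{
    if (row_xmin >= snap->my_xid)
        return false;              /* started at or after our xact */
    if (row_xmin < snap->min_xid)
        return true;               /* older than any in-progress xact */
    for (int i = 0; i < snap->nxids; i++)
        if (snap->xids[i] == row_xmin)
            return false;          /* was still running at our start */
    return true;                   /* not in the list: committed */
}
```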

Comments?


-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#3Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#2)
Re: [HACKERS] proposals for LLL, part 1

Bruce Momjian wrote:

I am retaining your entire message here for reference.

I have a good solution for this. It will require only 4k of shared
memory, and will have no restrictions on the age or number of
transactions.

First, I think we only want to implement "read committed isolation
level", not serialized. Not sure why someone would want serialized.

Serialized is the DEFAULT isolation level in the standard.
It must be implemented. Would you like inconsistent results
from pg_dump, etc.?

OK, when a backend is looking at a row that has been committed, it must
decide if the row was committed before or after my transaction started.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the transaction commit id(xmin) is greater than our current xid, we
know we should not look at it because it is for a transaction that
started after our own transaction.

That's right for serialized, but not for read committed.
In read committed mode the backend must decide if the row was
committed before or after the STATEMENT started...

The problem is for transactions started before our own (have xmin's less
than our own), and may have committed before or after our transaction.

Here is my idea. We add a field to the shared memory Proc structure
that can contain up to 32 transaction ids. When a transaction starts,
we spin though all other open Proc structures, and record all
currently-running transaction ids in our own Proc field used to store up
to 32 transaction ids. While we do this, we remember the lowest of
these open transaction ids.

This is our snapshot of current transactions at the time our transaction
starts. While analyzing a row, if it is greater than our transaction
id, then the transaction was not even started before our transaction.
If the xmin is lower than the min transaction id that we remembered from
the Proc structures, it was committed before our transaction started.
If it is greater than or equal to the min remembered transaction id, we
must spin through our stored transaction ids. If it is in the stored
list, we don't look at the row, because that transaction was not
committed when we started our transaction. If it is not in the list, it
must have been committed before our transaction started. We know this
because if any backend starting a transaction after ours would get a
transaction id higher than ours.

Yes, this is the way.
But, first, why should we store running transaction xids in shmem?
Who is interested in these xids?
We have to store in shmem only the min of these xids: vacuum must
not delete rows deleted by transactions with xid greater than
(or equal to) this min xid, or we risk getting inconsistent
results...
Also, as you see, we have to lock the Proc structures in shmem
to get the list of xids for each statement in read committed
mode...

I don't know which way is better, but using a list of xids
is much easier to implement...

Vadim

#4Bruce Momjian
bruce@momjian.us
In reply to: Vadim Mikheev (#3)
Re: [HACKERS] proposals for LLL, part 1

Bruce Momjian wrote:

I am retaining your entire message here for reference.

I have a good solution for this. It will require only 4k of shared
memory, and will have no restrictions on the age or number of
transactions.

First, I think we only want to implement "read committed isolation
level", not serialized. Not sure why someone would want serialized.

Serialized is DEFAULT isolation level in standards.
It must be implemented. Would you like inconsistent results
from pg_dump, etc?

OK, I didn't know that.

OK, when a backend is looking at a row that has been committed, it must
decide if the row was committed before or after my transaction started.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the transaction commit id(xmin) is greater than our current xid, we
know we should not look at it because it is for a transaction that
started after our own transaction.

It's right for serialized, not for read committed.
In read committed backend must decide if the row was committed
before or after STATEMENT started...

OK.

The problem is for transactions started before our own (have xmin's less
than our own), and may have committed before or after our transaction.

Here is my idea. We add a field to the shared memory Proc structure
that can contain up to 32 transaction ids. When a transaction starts,
we spin though all other open Proc structures, and record all
currently-running transaction ids in our own Proc field used to store up
to 32 transaction ids. While we do this, we remember the lowest of
these open transaction ids.

This is our snapshot of current transactions at the time our transaction
starts. While analyzing a row, if it is greater than our transaction
id, then the transaction was not even started before our transaction.
If the xmin is lower than the min transaction id that we remembered from
the Proc structures, it was committed before our transaction started.
If it is greater than or equal to the min remembered transaction id, we
must spin through our stored transaction ids. If it is in the stored
list, we don't look at the row, because that transaction was not
committed when we started our transaction. If it is not in the list, it
must have been committed before our transaction started. We know this
because if any backend starting a transaction after ours would get a
transaction id higher than ours.

Yes, this is way.
But, first, why should we store running transaction xids in shmem ?
Who is interested in these xids?

We have to store in shmem only min of these xids: vacuum must
not delete rows deleted by transactions with xid greater
(or equal) than this min xid or we risk to get inconsistent
results...

Also, as you see, we have to lock Proc structures in shmem
to get list of xids for each statement in read committed
mode...

You are correct. We need to lock the Proc structures during our scan,
but we don't need to keep the list in shared memory. No reason to do
it. Do we have to keep the Procs locked while we get our table data
locks? I sure hope not. Not sure how we are going to prevent someone
from committing their transaction between our Proc scan and when we
start our transaction. Not even sure if I should be worried about that.

I don't know what way is better but using list of xids
is much easy to implement...

Sure, do a list. Getting the min allows you to reduce the number of
times it has to be scanned. I had not thought about vacuum, but keeping
the min in shared memory will certainly fix that issue.

#5Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#4)
Re: [HACKERS] proposals for LLL, part 1

Bruce Momjian wrote:

You are correct. We need to lock Proc stuctures during our scan, but we
don't need to keep the list in shared memory. No reason to do it. Do
we have to keep the Proc's locked while we get our table data locks. I

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No! Only while we are scanning Procs...

sure hope not. Not sure how we are going prevent someone from
committing their transaction between our Proc scan and when we start our
transaction. Not even sure if I should be worried about that.

We shouldn't... It doesn't matter.

Vadim

#6Vadim Mikheev
vadim@krs.ru
In reply to: Vadim Mikheev (#5)
Re: [HACKERS] proposals for LLL, part 1

Stan Brown wrote:

First, PostgreSQL is multi-version system due to its
non-overwriting storage manager. And so, first proposal is use
this feature (multi-versioning) in LLL implementation.

I must ask one basic question here. Since we deleted time travel, and
the non-overwriting storage manager is no longer required, should we at
least discuss changing that, either as a part of the LLL work, or prior
to it?

I think one of the primary reasons to do so would be to eliminate
vacuum. Would this not be better for a system that needs to be up
24x7x365?

Yes, this is very important question...

In the original Postgres there was a dedicated vacuum process...
Vacuuming without human administration is possible, but
in any case a commit in a non-overwriting system requires
~2 data block writes (first - to write the changes, second - to
write the updated xmin/xmax statuses). In WAL systems only
1 data block write is required...

Ok, we have to decide two issues about what we would like
to use in the future:

1. type of storage manager/transaction system -

WAL or non-overwriting.

2. type of concurrency/consistency control -

Locking or multi-versions.

These are quite different issues!

Oracle is WAL and multi-version system!

We could implement multi-version control now and switch
to WAL later...

If we decide that locking is ok for concurrency/consistency
then it's better to switch to WAL before implementing LLL.

I personally like multi-versioning very much...

Comments/votes..?

Vadim
P.S. I'll be off-line up to the monday...

#7Bruce Momjian
bruce@momjian.us
In reply to: Vadim Mikheev (#6)
Re: [HACKERS] proposals for LLL, part 1

Yes, this is very important question...

In original postgres there was dedicated vacuum process...
Vacuuming without human administration is possible but
in any case commit in non-overwriting system requires
~2 data block writes (first - to write changes, second - to
write updated xmin/xmax statuses). In WAL systems only
1 data block write required...

Ok, we have to decide two issues about what would we like
to use in future:

1. type of storage manager/transaction system -

WAL or non-overwriting.

Can you explain WAL? I understand locking vs. multi-version.

#8Bruce Momjian
bruce@momjian.us
In reply to: Vadim Mikheev (#5)
Re: [HACKERS] proposals for LLL, part 1

Bruce Momjian wrote:

You are correct. We need to lock Proc stuctures during our scan, but we
don't need to keep the list in shared memory. No reason to do it. Do
we have to keep the Proc's locked while we get our table data locks. I

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No! Only while we are scanning Procs...

sure hope not. Not sure how we are going prevent someone from
committing their transaction between our Proc scan and when we start our
transaction. Not even sure if I should be worried about that.

We shouldn't... It doesn't matter.

One more item. If we don't lock Proc between the scan and our
acquisition of a transaction id, it is possible some other backend will
get a transaction id between the time we scan the Proc structure and
when we get our transaction id, causing us to look at rows that are part
of a non-committed transaction. I think we have to get our transaction
id first, before scanning Proc.

There is definitely an area of vulnerability there. I am now wondering
how much we need to lock Proc during the scan.
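The safety argument for taking our own xid first reduces to the monotonicity of xid assignment; a toy model (all names invented) that makes the point:

```c
#include <assert.h>

/* Toy model of the ordering fix; every name is invented. Because
 * xids come from a monotonically increasing counter, taking our
 * own xid BEFORE scanning Proc guarantees that any transaction
 * starting after our scan gets a higher xid than ours, and is
 * filtered out by the "xmin greater than my xid" test regardless
 * of what the scan saw. At worst we treat a just-committed
 * transaction as still running, never the reverse. */
static unsigned next_xid = 1;

static unsigned get_new_xid(void)   /* stand-in for the shared counter */
{
    return next_xid++;
}
```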

#9Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#7)
Re: [HACKERS] proposals for LLL, part 1

Bruce Momjian wrote:

Ok, we have to decide two issues about what would we like
to use in future:

1. type of storage manager/transaction system -

WAL or non-overwriting.

Can you explain WAL. I understand locking vs. multi-version.

Write Ahead Log - log of changes.

Vadim

#10Bruce Momjian
bruce@momjian.us
In reply to: Vadim Mikheev (#6)
Re: [HACKERS] proposals for LLL, part 1

Yes, this is very important question...

In original postgres there was dedicated vacuum process...
Vacuuming without human administration is possible but
in any case commit in non-overwriting system requires
~2 data block writes (first - to write changes, second - to
write updated xmin/xmax statuses). In WAL systems only
1 data block write required...

Doesn't a WAL system have to do an update by writing the old row to a
log, then write the changes to the real table? Is it only inserts that
have only one write?

Ok, we have to decide two issues about what would we like
to use in future:

1. type of storage manager/transaction system -

WAL or non-overwriting.

2. type of concurrency/consistency control -

Locking or multi-versions.

If we could just get superseded row reuse without vacuum, we can stick
with non-overwriting, can't we?

These are quite different issues!

Oracle is WAL and multi-version system!

We could implement multi-version control now and switch
to WAL latter...

If we decide that locking is ok for concurrency/consistency
then it's better to switch to WAL before implementing LLL.

I personally very like multi-versions...

OK, now I have to ask what multi-version is.

I have to read that Gray book. Did you get my e-mail on it?

#11Vadim Mikheev
vadim@krs.ru
In reply to: Bruce Momjian (#10)
Re: [HACKERS] proposals for LLL, part 1

Bruce Momjian wrote:

Yes, this is very important question...

In original postgres there was dedicated vacuum process...
Vacuuming without human administration is possible but
in any case commit in non-overwriting system requires
~2 data block writes (first - to write changes, second - to
write updated xmin/xmax statuses). In WAL systems only
1 data block write required...

Doesn't a WAL have do an update by writing the old row to a log, then
write the changes to the real table? It is only inserts that have only
one write?

I was wrong: WAL systems need 2 writes (data block and log block)
and we need 3 writes (data block, log block and data block
with updated xact statuses). But at commit time WAL systems have
to fsync only the log (Oracle places new data there and keeps old
data in rollback segments), while we have to fsync 2 files.

Ok, we have to decide two issues about what would we like
to use in future:

1. type of storage manager/transaction system -

WAL or non-overwriting.

If we could just get superseded row reuse without vacuum, we can stick
with non-overwriting, can't we?

I hope...

I personally very like multi-versions...

OK, now I have to ask what multi-version is.

This is our ability to return a snapshot of the data as committed
at some point in time. Note that we are able to return
different snapshots to different backends simultaneously.

I have to read that Gray book. Did you get my e-mail on it?

Thanks, and sorry... Yes, this could help me. Do you have my
mail address?

Vadim

#12Michael Meskes
meskes@postgresql.org
In reply to: Vadim Mikheev (#6)
Re: [HACKERS] proposals for LLL, part 1

On Fri, Jul 17, 1998 at 01:00:10PM +0800, Vadim Mikheev wrote:

In original postgres there was dedicated vacuum process...
Vacuuming without human administration is possible but
in any case commit in non-overwriting system requires
~2 data block writes (first - to write changes, second - to
write updated xmin/xmax statuses). In WAL systems only
1 data block write required...

Then the arguments are clearly in favor of WAL.

Oracle is WAL and multi-version system!

While Oracle has some faults (or more) I agree with that choice.

We could implement multi-version control now and switch
to WAL latter...

Once again I fully agree.

I personally very like multi-versions...

Me too.

Michael
--
Dr. Michael Meskes meskes@online-club.de, meskes@debian.org
Go SF49ers! Go Rhein Fire! Use Debian GNU/Linux!