Question about behavior of snapshot too old feature

Started by Masahiko Sawadaabout 9 years ago10 messages
#1Masahiko Sawada
sawada.mshk@gmail.com

Hi all,

I have a question about behavior of snapshot too old feature.

For example, I set old_snapshot_threshold = 1min and prepare a table
and two terminals.
And I did the followings steps.

1. [Terminal 1] Begin transaction and get snapshot data and wait.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test;

2. [Terminal 2] Another session updates test table in order to make
snapshot dirty.
BEGIN;
UPDATE test SET c = c + 100;
COMMIT;

3. [Terminal 1] 1 minute after, read the test table again in same
transaction opened at #1. I got no error.
SELECT * FROM test;

4. [Terminal 2] Another session reads the test table.
BEGIN;
SELECT * FROM test;
COMMIT;

5. [Terminal 1] 1 minute after, read the test table again, and got
"snapshot error" error.
SELECT * FROM test;

Since #2 makes a snapshot I got at #1 dirty, I expected to get
"snapshot too old" error at #3 where I read test table again after
enough time. But I could never get "snapshot too old" error at #3.
On the other hand, when I set old_snapshot_threshold = 0 I can got the
error at #3.

Is this expected behavior?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#1)
Re: Question about behavior of snapshot too old feature

On Fri, Oct 14, 2016 at 1:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all,

I have a question about behavior of snapshot too old feature.

For example, I set old_snapshot_threshold = 1min and prepare a table
and two terminals.
And I did the followings steps.

1. [Terminal 1] Begin transaction and get snapshot data and wait.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test;

2. [Terminal 2] Another session updates test table in order to make
snapshot dirty.
BEGIN;
UPDATE test SET c = c + 100;
COMMIT;

3. [Terminal 1] 1 minute after, read the test table again in same
transaction opened at #1. I got no error.
SELECT * FROM test;

4. [Terminal 2] Another session reads the test table.
BEGIN;
SELECT * FROM test;
COMMIT;

5. [Terminal 1] 1 minute after, read the test table again, and got
"snapshot error" error.
SELECT * FROM test;

Since #2 makes a snapshot I got at #1 dirty, I expected to get
"snapshot too old" error at #3 where I read test table again after
enough time. But I could never get "snapshot too old" error at #3.

Here, the basic idea is that till the time corresponding page is not
pruned or table vacuuming hasn't triggered, this error won't occur.
So, I think what is happening here that during step #4 or step #3, it
has pruned the table, after which you started getting error.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Kevin Grittner
kgrittn@gmail.com
In reply to: Amit Kapila (#2)
Re: Question about behavior of snapshot too old feature

On Fri, Oct 14, 2016 at 8:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 14, 2016 at 1:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

For example, I set old_snapshot_threshold = 1min and prepare a table
and two terminals.
And I did the followings steps.

1. [Terminal 1] Begin transaction and get snapshot data and wait.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test;

2. [Terminal 2] Another session updates test table in order to make
snapshot dirty.
BEGIN;
UPDATE test SET c = c + 100;
COMMIT;

3. [Terminal 1] 1 minute after, read the test table again in same
transaction opened at #1. I got no error.
SELECT * FROM test;

4. [Terminal 2] Another session reads the test table.
BEGIN;
SELECT * FROM test;
COMMIT;

5. [Terminal 1] 1 minute after, read the test table again, and got
"snapshot error" error.
SELECT * FROM test;

Since #2 makes a snapshot I got at #1 dirty, I expected to get
"snapshot too old" error at #3 where I read test table again after
enough time. But I could never get "snapshot too old" error at #3.

Here, the basic idea is that till the time corresponding page is not
pruned or table vacuuming hasn't triggered, this error won't occur.
So, I think what is happening here that during step #4 or step #3, it
has pruned the table, after which you started getting error.

The pruning might be one factor. Another possible issue is that
effectively it doesn't start timing that 1 minute until the clock
hits the start of the next minute (i.e., 0 seconds after the next
minute). The old_snapshot_threshold does not attempt to guarantee
that the snapshot too old error will happen at the earliest
opportunity, but that the error will *not* happen until the
snapshot is *at least* that old. Keep in mind that the expected
useful values for this parameter are from a small number of hours
to a day or two, depending on the workload. The emphasis was on
minimizing overhead, even when it meant the cleanup might not be
quite as "eager" as it could otherwise be.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kevin Grittner (#3)
Re: Question about behavior of snapshot too old feature

On Fri, Oct 14, 2016 at 11:29 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

On Fri, Oct 14, 2016 at 8:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 14, 2016 at 1:40 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

For example, I set old_snapshot_threshold = 1min and prepare a table
and two terminals.
And I did the followings steps.

1. [Terminal 1] Begin transaction and get snapshot data and wait.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test;

2. [Terminal 2] Another session updates test table in order to make
snapshot dirty.
BEGIN;
UPDATE test SET c = c + 100;
COMMIT;

3. [Terminal 1] 1 minute after, read the test table again in same
transaction opened at #1. I got no error.
SELECT * FROM test;

4. [Terminal 2] Another session reads the test table.
BEGIN;
SELECT * FROM test;
COMMIT;

5. [Terminal 1] 1 minute after, read the test table again, and got
"snapshot error" error.
SELECT * FROM test;

Since #2 makes a snapshot I got at #1 dirty, I expected to get
"snapshot too old" error at #3 where I read test table again after
enough time. But I could never get "snapshot too old" error at #3.

Here, the basic idea is that till the time corresponding page is not
pruned or table vacuuming hasn't triggered, this error won't occur.
So, I think what is happening here that during step #4 or step #3, it
has pruned the table, after which you started getting error.

The pruning might be one factor. Another possible issue is that
effectively it doesn't start timing that 1 minute until the clock
hits the start of the next minute (i.e., 0 seconds after the next
minute). The old_snapshot_threshold does not attempt to guarantee
that the snapshot too old error will happen at the earliest
opportunity, but that the error will *not* happen until the
snapshot is *at least* that old. Keep in mind that the expected
useful values for this parameter are from a small number of hours
to a day or two, depending on the workload. The emphasis was on
minimizing overhead, even when it meant the cleanup might not be
quite as "eager" as it could otherwise be.

Thanks! I understood.
I've tested with autovacuum = off, so it has pruned the table at step #4.

When I set old_snapshot_threshold = 0 I got error at step #3, which
means that the error is occurred without table pruning.
We have regression test for this feature but it sets
old_snapshot_threshold = 0, I doubt about we can test it properly.
Am I missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Kevin Grittner
kgrittn@gmail.com
In reply to: Masahiko Sawada (#4)
Re: Question about behavior of snapshot too old feature

On Sun, Oct 16, 2016 at 9:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

When I set old_snapshot_threshold = 0 I got error at step #3, which
means that the error is occurred without table pruning.

The "snapshot too old" error can happen without pruning, but only
because there is no way to tell the difference between a page that
has been pruned since the snapshot was taken and a page which has
had some other kind of modification since the snapshot was taken.

Ignoring false positives for a moment (where the page is updated by
something other than pruning), what is required for early pruning
is that the snapshot has expired (which due to "rounding" and
avoidance of locking could easily take up to a minute or two more
than the old_snapshot_threshold setting) and then there is page
pruning due to a vacuum or just HOT pruning from a page read. At
some point after that, a read which is part of returning data to
the user (e.g., not just positioning for index modification) can
see that the snapshot is too old and that the LSN for the page is
past the snapshot LSN. That is when you get the error.

We have regression test for this feature but it sets
old_snapshot_threshold = 0, I doubt about we can test it properly.
Am I missing something?

This is a hard feature to test properly, and certainly hard to test
without the test running for a long time. The zero setting is
really not intended to be used in production, but only to allow
some half-way decent testing that doesn't take extreme lengths of
time. If you add some delays of a few minutes each at key points
in a test, you should be able to get a test that works with a
setting of 1min. It is not impossible that we might need to add a
memory barrier to one or two places to get such tests to behave
consistently, but I have not been able to spot where, if anywhere,
that would be.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Bruce Momjian
bruce@momjian.us
In reply to: Kevin Grittner (#5)
Re: Question about behavior of snapshot too old feature

On Mon, Oct 17, 2016 at 08:04:43AM -0500, Kevin Grittner wrote:

We have regression test for this feature but it sets
old_snapshot_threshold = 0, I doubt about we can test it properly.
Am I missing something?

This is a hard feature to test properly, and certainly hard to test
without the test running for a long time. The zero setting is
really not intended to be used in production, but only to allow
some half-way decent testing that doesn't take extreme lengths of
time. If you add some delays of a few minutes each at key points
in a test, you should be able to get a test that works with a
setting of 1min. It is not impossible that we might need to add a
memory barrier to one or two places to get such tests to behave
consistently, but I have not been able to spot where, if anywhere,
that would be.

Slide 10 of this presentation has an example showing
old_snapshot_threshold set to '1min':

http://momjian.us/main/writings/pgsql/features.pdf

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Kevin Grittner
kgrittn@gmail.com
In reply to: Bruce Momjian (#6)
Re: Question about behavior of snapshot too old feature

On Wed, Oct 19, 2016 at 10:04 AM, Bruce Momjian <bruce@momjian.us> wrote:

Slide 10 of this presentation has an example showing
old_snapshot_threshold set to '1min':

http://momjian.us/main/writings/pgsql/features.pdf

If the presentation is intending to show reasonable values for
production use, that would be better as, maybe, '2h'.

As the documentation says at:

https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-OLD-SNAPSHOT-THRESHOLD

| Useful values for production work probably range from a small
| number of hours to a few days. The setting will be coerced to a
| granularity of minutes, and small numbers (such as 0 or 1min) are
| only allowed because they may sometimes be useful for testing.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Bruce Momjian
bruce@momjian.us
In reply to: Kevin Grittner (#7)
Re: Question about behavior of snapshot too old feature

On Wed, Oct 19, 2016 at 11:08:28AM -0500, Kevin Grittner wrote:

On Wed, Oct 19, 2016 at 10:04 AM, Bruce Momjian <bruce@momjian.us> wrote:

Slide 10 of this presentation has an example showing
old_snapshot_threshold set to '1min':

http://momjian.us/main/writings/pgsql/features.pdf

If the presentation is intending to show reasonable values for
production use, that would be better as, maybe, '2h'.

As the documentation says at:

https://www.postgresql.org/docs/current/static/runtime-config-resource.html#GUC-OLD-SNAPSHOT-THRESHOLD

The example is just to illustrate the activities required to trigger it,
e.g. pg_sleep(), VACUUM.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Kevin Grittner
kgrittn@gmail.com
In reply to: Bruce Momjian (#8)
Re: Question about behavior of snapshot too old feature

On Wed, Oct 19, 2016 at 11:11 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Oct 19, 2016 at 11:08:28AM -0500, Kevin Grittner wrote:

On Wed, Oct 19, 2016 at 10:04 AM, Bruce Momjian <bruce@momjian.us> wrote:

Slide 10 of this presentation has an example showing
old_snapshot_threshold set to '1min':

http://momjian.us/main/writings/pgsql/features.pdf

If the presentation is intending to show reasonable values for
production use, that would be better as, maybe, '2h'.

The example is just to illustrate the activities required to trigger it,
e.g. pg_sleep(), VACUUM.

Yeah, in such a demonstration you probably don't want to have
everyone sit for 2 hours, so 1 minute makes perfect sense. :-)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Kevin Grittner (#5)
Re: Question about behavior of snapshot too old feature

On Mon, Oct 17, 2016 at 10:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

On Sun, Oct 16, 2016 at 9:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

When I set old_snapshot_threshold = 0 I got error at step #3, which
means that the error is occurred without table pruning.

The "snapshot too old" error can happen without pruning, but only
because there is no way to tell the difference between a page that
has been pruned since the snapshot was taken and a page which has
had some other kind of modification since the snapshot was taken.

Ignoring false positives for a moment (where the page is updated by
something other than pruning), what is required for early pruning
is that the snapshot has expired (which due to "rounding" and
avoidance of locking could easily take up to a minute or two more
than the old_snapshot_threshold setting) and then there is page
pruning due to a vacuum or just HOT pruning from a page read. At
some point after that, a read which is part of returning data to
the user (e.g., not just positioning for index modification) can
see that the snapshot is too old and that the LSN for the page is
past the snapshot LSN. That is when you get the error.

We have regression test for this feature but it sets
old_snapshot_threshold = 0, I doubt about we can test it properly.
Am I missing something?

This is a hard feature to test properly, and certainly hard to test
without the test running for a long time. The zero setting is
really not intended to be used in production, but only to allow
some half-way decent testing that doesn't take extreme lengths of
time. If you add some delays of a few minutes each at key points
in a test, you should be able to get a test that works with a
setting of 1min. It is not impossible that we might need to add a
memory barrier to one or two places to get such tests to behave
consistently, but I have not been able to spot where, if anywhere,
that would be.

Thank you for explanation! I understood.
When old_snapshot_threshold = 0, it skips to allocate shared memory
area for the xid array and skips the some logic in order to avoid
using the shared memory, so I was concerned about that a little.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers