Re: Slow planning time for simple query
On hot standby I have faced a similar problem.
The following simple query:
SELECT array_to_json(array_agg(t)) from (
    select *
    from main.message m
    join main.message_instance mi on m.message_id = mi.message_id
    join main.channel_type ct on mi.channel_type_id = ct.channel_type_id
    where m.user_id = 2152595
      and ct.name = 'android'
      and m.user_delete_ts is null
    order by
      case
        when read_ts is null then 0
        else 1
      end,
      m.message_id desc
    limit 100
) t;
is planned in 4.940 ms on master and *254.741* ms on standby. I tried to
investigate the reasons for such long planning on the replica and determined
that index accesses at the planning stage result in multiple heap scans.
Here is the execution plan statistics of query above:
Aggregate (actual time=0.641..0.642 rows=1 loops=1)
  Buffers: shared hit=14
  ->  Subquery Scan on t (actual time=0.337..0.360 rows=2 loops=1)
        Buffers: shared hit=14
        ->  Limit (actual time=0.292..0.301 rows=2 loops=1)
              Buffers: shared hit=14
              ->  Sort (actual time=0.287..0.291 rows=2 loops=1)
                    Sort Key: (CASE WHEN (m.read_ts IS NULL) THEN 0 ELSE 1 END), m.message_id DESC
                    Sort Method: quicksort Memory: 27kB
                    Buffers: shared hit=14
                    ->  Nested Loop (actual time=0.157..0.219 rows=2 loops=1)
                          Buffers: shared hit=14
                          ->  Seq Scan on channel_type ct (actual time=0.043..0.048 rows=1 loops=1)
                                Filter: (name = 'android'::text)
                                Rows Removed by Filter: 7
                                Buffers: shared hit=1
                          ->  Nested Loop (actual time=0.098..0.148 rows=2 loops=1)
                                Buffers: shared hit=13
                                ->  Index Scan using message_user_id_idx1 on message m (actual time=0.055..0.063 rows=2 loops=1)
                                      Index Cond: (user_id = 2152595)
                                      Buffers: shared hit=5
                                ->  Index Scan using message_instance_pkey on message_instance mi (actual time=0.026..0.029 rows=1 loops=2)
                                      Index Cond: ((message_id = m.message_id) AND (channel_type_id = ct.channel_type_id))
                                      Buffers: shared hit=8
That is 14 buffer cache accesses during query execution. But the
pg_statio_user_tables and pg_statio_user_indexes views show a different
picture:
select
schemaname, relname, indexrelname, idx_blks_hit, idx_blks_read,
pg_size_pretty(pg_relation_size(indexrelid::regclass))
from pg_statio_user_indexes
where idx_blks_hit + idx_blks_read > 0;
 schemaname |     relname      |     indexrelname      | idx_blks_hit | idx_blks_read | pg_size_pretty
------------+------------------+-----------------------+--------------+---------------+----------------
 main       | channel_type     | channel_type_pkey     |            2 |             0 | 16 kB
 main       | message_instance | message_instance_pkey |          666 |             0 | 345 MB
 main       | message          | message_pkey          |           56 |             0 | 53 MB
 main       | message          | message_user_id_idx1  |            3 |             0 | 17 MB
select
schemaname, relname, heap_blks_read, heap_blks_hit
from pg_statio_user_tables
where heap_blks_read + heap_blks_hit + idx_blks_hit + idx_blks_read > 0;
schemaname | relname | heap_blks_read | heap_blks_hit
------------+------------------+----------------+---------------
main | channel_type | 0 | 3
main | message | 0 | 8682
main | message_instance | 0 | 114922
(This experiment was carried out on a test hot standby without parallel
activity and with cleared system statistics.)
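Presumably the statistics were cleared with something along these lines (a
guess, since the exact command isn't shown above):

-- Reset the standby's local statistics counters before the experiment, so
-- that the pg_statio_* views afterwards reflect only this query's accesses.
SELECT pg_stat_reset();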
Running VACUUM on the problem tables (main.message and main.message_instance)
on the master node resolves the problem to some extent, but running it
frequently slows down all queries and generally increases IO.
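Concretely, the workaround amounts to something like this on the master (a
sketch; the option list is only illustrative):

-- Manually vacuum the two problem tables on the master. This removes dead
-- index entries, so planning on the standby no longer has to visit their
-- heap tuples.
VACUUM (VERBOSE) main.message;
VACUUM (VERBOSE) main.message_instance;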
Is there any way to overcome this problem, or is it a fundamental issue that
makes it necessary to rewrite the query to simplify planning?
--
Regards,
Maksim Milyutin
On 09.06.2018 21:49, Maksim Milyutin wrote:
On hot standby I have faced a similar problem.
Sorry, the problem in question is described here
/messages/by-id/22136.1528312205@sss.pgh.pa.us
--
Regards,
Maksim Milyutin
Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I have faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.
Presumably the problem is that the standby isn't authorized to change
the btree index's "entry is dead" bits, so it's unable to prune index
entries previously detected as dead, and thus the logic that intends
to improve this situation doesn't work on the standby.
(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)
I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
non vacuumable. This would essentially get rid of long standby planning
times in this sort of scenario by instead accepting worse (possibly much
worse) planner range estimates. I'm unsure if that's a good tradeoff or
not.
regards, tom lane
On 09.06.2018 22:49, Tom Lane wrote:
Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I have faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.

(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)
I have checked the number of dead item pointers (through the pageinspect
extension) in the first leaf page of the index participating in the query
('main.message_instance_pkey') on the master and standby nodes and have
noticed a big difference.
SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);
On master:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65
On standby:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1
The vacuum routine improves the situation.
Could there be something I have configured incorrectly in WAL logging or
replication?
I wonder if we should extend the "SnapshotNonVacuumable" logic introduced
in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed
non vacuumable. This would essentially get rid of long standby planning
times in this sort of scenario by instead accepting worse (possibly much
worse) planner range estimates. I'm unsure if that's a good tradeoff or
not.
I applied the patch introduced in this commit to the test standby (not the
master; I don't know whether this is correct) and haven't noticed any
difference.
--
Regards,
Maksim Milyutin
On 13.06.2018 12:40, Maksim Milyutin wrote:

On 09.06.2018 22:49, Tom Lane wrote:

Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I have faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.

(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)

I have checked the number of dead item pointers (through the pageinspect
extension) in the first leaf page of the index participating in the query
('main.message_instance_pkey') on the master and standby nodes and have
noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1
At this point I want to highlight the issue that changes to the *lp_flags*
bits (namely, marking items as dead) for index item pointers do not
propagate from master to replica in my case. As a consequence, on standby I
have live index items, most of which are marked as dead on master. And at
the planning stage my queries are forced to descend to heap pages during
*get_actual_variable_range* execution, which considerably slows down
planning.
Is this a bug, a restriction of the implementation, or a misconfiguration of
WAL/replication?
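For anyone trying to reproduce this, a hedged sketch of how the effect can be
observed (not from the original message; any query whose selectivity
estimation probes the extremes of main.message_instance_pkey via
get_actual_variable_range should show it):

-- Compare the "Planning time" line of EXPLAIN ANALYZE on master and standby.
-- The join below is only an illustration of a statement that makes the
-- planner probe message_instance's index during selectivity estimation.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM main.message m
JOIN main.message_instance mi ON m.message_id = mi.message_id
WHERE m.user_id = 2152595;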
--
Regards,
Maksim Milyutin
On Sun, Jun 10, 2018 at 1:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I have faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.

Presumably the problem is that the standby isn't authorized to change
the btree index's "entry is dead" bits,
I don't see anything like that in the code. We use _bt_killitems to
mark the items as dead, and neither that function nor any of its
callers has any such assumption.
so it's unable to prune index
entries previously detected as dead, and thus the logic that intends
to improve this situation doesn't work on the standby.
If my above understanding is correct, then one thing that could lead
to such behavior is the computation of RecentGlobalXmin on standby.
Basically, if the RecentGlobalXmin has a different value on standby,
then it is possible that the decision whether a particular item is
dead differs on master and standby.
(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)
Because we don't WAL log it. See _bt_killitems.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 14, 2018 at 4:34 AM, Maksim Milyutin <milyutinma@gmail.com> wrote:
On 13.06.2018 12:40, Maksim Milyutin wrote:

On 09.06.2018 22:49, Tom Lane wrote:

Maksim Milyutin <milyutinma@gmail.com> writes:
On hot standby I have faced a similar problem.
...
is planned in 4.940 ms on master and *254.741* ms on standby.

(I wonder though why, if you executed the same query on the master,
its setting of the index-entry-is-dead bits didn't propagate to the
standby.)

I have checked the number of dead item pointers (through the pageinspect
extension) in the first leaf page of the index participating in the query
('main.message_instance_pkey') on the master and standby nodes and have
noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

At this point I want to highlight the issue that changes to the lp_flags
bits (namely, marking items as dead) for index item pointers do not
propagate from master to replica in my case. As a consequence, on standby
I have live index items, most of which are marked as dead on master. And
at the planning stage my queries are forced to descend to heap pages
during get_actual_variable_range execution, which considerably slows down
planning.

Is this a bug, a restriction of the implementation, or a misconfiguration
of WAL/replication?
It is not a misconfiguration issue.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
"Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
Presumably the problem is that the standby isn't authorized to change
the btree index's "entry is dead" bits,
Amit> I don't see anything like that in the code. We use _bt_killitems
Amit> to mark the items as dead, and neither that function nor any of
Amit> its callers has any such assumption.
See index_fetch_heap:
/*
* If we scanned a whole HOT chain and found only dead tuples, tell index
* AM to kill its entry for that TID (this will take effect in the next
* amgettuple call, in index_getnext_tid). We do not do this when in
* recovery because it may violate MVCC to do so. See comments in
* RelationGetIndexScan().
*/
if (!scan->xactStartedInRecovery)
scan->kill_prior_tuple = all_dead;
(this is the only place where kill_prior_tuple can be set to true)
--
Andrew (irc:RhodiumToad)
[ trimming cc's ]
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
See index_fetch_heap:
/*
* If we scanned a whole HOT chain and found only dead tuples, tell index
* AM to kill its entry for that TID (this will take effect in the next
* amgettuple call, in index_getnext_tid). We do not do this when in
* recovery because it may violate MVCC to do so. See comments in
* RelationGetIndexScan().
*/
if (!scan->xactStartedInRecovery)
scan->kill_prior_tuple = all_dead;
So the referenced comment in RelationGetIndexScan says
* During recovery we ignore killed tuples and don't bother to kill them
* either. We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples. Do *not* set ignore_killed_tuples to true when running in a
* transaction that was started during recovery. xactStartedInRecovery
* should not be altered by index AMs.
but it seems to me that this is not terribly carefully thought through.
1. If global xmin on the primary is later than on the standby, VACUUM could
remove tuples that should be visible on the standby, and that would shortly
propagate to the standby. Simply ignoring index kill bits on the standby
won't prevent that.
2. Although _bt_killitems doesn't WAL-log its setting of kill bits, those
bits could propagate to the standby anyway, as a result of a subsequent
WAL action on the index page that gets a full-page image added.
I believe that in some replication modes, #1 isn't a problem because we
have mechanisms to hold back the primary's global xmin. If that's true
always, then all of this logic is obsolete silliness that we could remove.
If it's not true always, we have bigger problems to fix.
Similarly, the possibility of #2 means the standby can't trust the kill
bits to reflect its own xmin anyway. So if we aren't junking this entire
stack of logic, we could perhaps allow the standby to set kill bits per
its own understanding of global xmin, and then allow the kill bits to be
used in SnapshotNonVacuumable scans even though they're unsafe for regular
MVCC scans.
However, I'm still caffeine-short so maybe there's something I missed.
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> * During recovery we ignore killed tuples and don't bother to kill them
Tom> * either. We do this because the xmin on the primary node could easily be
Tom> * later than the xmin on the standby node, so that what the primary
Tom> * thinks is killed is supposed to be visible on standby. So for correct
Tom> * MVCC for queries during recovery we must ignore these hints and check
Tom> * all tuples. Do *not* set ignore_killed_tuples to true when running in a
Tom> * transaction that was started during recovery. xactStartedInRecovery
Tom> * should not be altered by index AMs.
Tom> but it seems to me that this is not terribly carefully thought through.
Tom> 1. If global xmin on the primary is later than on the standby,
Tom> VACUUM could remove tuples that should be visible on the standby,
Tom> and that would shortly propagate to the standby. Simply ignoring
Tom> index kill bits on the standby won't prevent that.
Right, but we have conflict detection for vacuum's tuple removal
actions, which we don't have for the index hints.
Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill
Tom> bits, those bits could propagate to the standby anyway, as a
Tom> result of a subsequent WAL action on the index page that gets a
Tom> full-page image added.
That's OK as long as we're ignoring those hints on the standby.
Tom> I believe that in some replication modes, #1 isn't a problem
Tom> because we have mechanisms to hold back the primary's global xmin.
That's the case if feedback is enabled, but not if it's disabled, which
is sometimes done intentionally to ensure that long-running queries on
the standby don't hold back the master's xmin horizon.
Conflict detection then comes into play to kill the aforesaid
long-running queries before vacuuming away anything they might see.
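For reference, a hedged illustration of the feedback setting referred to
here (the comments are editorial; only the GUC itself comes from
PostgreSQL):

-- hot_standby_feedback is set on the standby. When on, the standby reports
-- its xmin to the primary and holds back the primary's horizon; when off
-- (the default), long-running standby queries rely on conflict detection
-- instead and may be cancelled.
SHOW hot_standby_feedback;
-- ALTER SYSTEM SET hot_standby_feedback = on;  -- takes effect on reload
-- SELECT pg_reload_conf();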
--
Andrew (irc:RhodiumToad)
On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> 2. Although _bt_killitems doesn't WAL-log its setting of kill
Tom> bits, those bits could propagate to the standby anyway, as a
Tom> result of a subsequent WAL action on the index page that gets a
Tom> full-page image added.

That's OK as long as we're ignoring those hints on the standby.
What if we don't ignore those hints on standby for a specific case
like the one in get_actual_variable_range? Now, if the user has
enabled wal_log_hints on the master, it could save the time spent
scanning many dead tuples on the standby.
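For reference, a hedged sketch of the setting mentioned above (the comments
are editorial, not from the thread):

-- wal_log_hints is set on the master (postgresql.conf) and requires a
-- restart. When on, the first modification of a page after a checkpoint is
-- WAL-logged even if only hint bits changed, which is how kill bits set on
-- the master could reach the standby in the case described here.
SHOW wal_log_hints;
-- In postgresql.conf on the master:
--   wal_log_hints = on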
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:

That's OK as long as we're ignoring those hints on the standby.
What if we don't ignore those hints on standby for a specific case
like the one in get_actual_variable_range?
Yeah, that's the same idea I suggested upthread. Andrew shot down
my first thought (correctly I think) but the second one still seems
feasible.
Now, if the user has enabled wal_log_hints on the master, it could
save the time spent scanning many dead tuples on the standby.
We should have the standby set the hint bits for itself, rather than
relying on wal_log_hints.
regards, tom lane
On Mon, Jun 18, 2018 at 7:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Sun, Jun 17, 2018 at 9:22 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:

That's OK as long as we're ignoring those hints on the standby.

What if we don't ignore those hints on standby for a specific case
like the one in get_actual_variable_range?

Yeah, that's the same idea I suggested upthread. Andrew shot down
my first thought (correctly I think) but the second one still seems
feasible.

Now, if the user has enabled wal_log_hints on the master, it could
save the time spent scanning many dead tuples on the standby.

We should have the standby set the hint bits for itself, rather than
relying on wal_log_hints.
Yeah, I think that is a good idea unless someone sees a hole in it,
but we have to keep in mind that the bits set by the standby can be
overwritten later, for instance (a) due to an FPI for a change on that
page on the master, and (b) when that page is evicted from shared
buffers, because we don't allow dirtying the page during recovery.
However, I think the same is true for regular hint bits (a hint bit on
a tuple that says the transaction is committed). It would be good if
we could do this as a single patch, but it is not difficult to imagine
doing it as two separate patches: the first introduces the idea of not
ignoring kill bits on standby in a particular case
(get_actual_variable_range), and the second introduces something like
StandbyGlobalXmin, which would then be used in
get_actual_variable_range. Both are independently useful.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com