Question about updates and MVCC

Started by Net Virtual Mailing Lists · about 17 years ago · 5 messages · general
#1Net Virtual Mailing Lists
mailinglists@net-virtual.com

Hello,

I have a couple of questions regarding how MVCC (in Postgres 8.3.3, if it
makes a difference) affects vacuum.

#1. If I am doing an update to a row and none of the values have changed,
will that cause a "hole" that requires vacuum to reclaim?

#2. I have a column in my table (called "status", if you can believe
*that*). This contains 1 of 4 values:

-1: row is expired, but needs to be marked deleted from index
0: row is expired, and has been indexed
1: row is active, and has been indexed
2: row is new or updated, and needs to be indexed

.. The point of all this is that when a row is added or updated, it
goes into status = 2, so the process that comes along later to build
search indexes can quickly query any listings in status = 2 and
incrementally update the index. (Same with respect to status -1, except
those rows are no longer active and need to be deleted from the index)...
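The workflow described above can be sketched roughly like this (the table
name is hypothetical, since the original isn't shown):

    -- pick up rows needing (re)indexing
    SELECT * FROM my_table WHERE status = 2;
    -- ... build/merge the search index ...
    UPDATE my_table SET status = 1 WHERE status = 2;

    -- same for expired rows awaiting deletion from the index
    SELECT * FROM my_table WHERE status = -1;
    UPDATE my_table SET status = 0 WHERE status = -1;

(In practice the UPDATEs would be restricted to the rows actually indexed,
to avoid racing concurrent writers.)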

The issue with this is that it seems to be causing a lot of vacuum
work.... The total number of rows in the table is about 30 million, but
partitioned into about 130 segments based on a category... I'm trying to
minimize the amount of vacuum work, because not much else changes in the
table over time, but the status column will get fiddled with 4 times
during the life of a row...

Thanks, as always!

- Greg

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Net Virtual Mailing Lists (#1)
Re: Question about updates and MVCC

On Sat, Jan 10, 2009 at 08:00:29AM -0800, mailinglists@net-virtual.com wrote:

Hello,

I have a couple of questions regarding how MVCC (in Postgres 8.3.3, if it
makes a difference) affects vacuum.

#1. If I am doing an update to a row and none of the values have changed,
will that cause a "hole" that requires vacuum to reclaim?

Yes. There are various ways to work with this, depending on your
capabilities.
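One common way (a sketch, using a hypothetical listings table with a
status column and a hypothetical id; IS DISTINCT FROM also handles NULLs)
is to skip the write entirely when the value is already current, so no
dead row version is created in the first place:

    UPDATE listings
       SET status = 2
     WHERE listing_id = 12345
       AND status IS DISTINCT FROM 2;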

#2. I have a column in my table (called "status", if you can believe
*that*). This contains 1 of 4 values:

-1: row is expired, but needs to be marked deleted from index
0: row is expired, and has been indexed
1: row is active, and has been indexed
2: row is new or updated, and needs to be indexed

How wide is the row otherwise? Each status update will make a copy of
the complete row. One thought is to consider how often you need the
status anyway and decide whether it's worthwhile to split the status off
into a separate table, which will be much smaller and vacuum quicker.
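Such a split might look like this (a sketch with hypothetical names; the
wide, mostly-static data stays put, and only the narrow status rows churn):

    CREATE TABLE listing_status (
        listing_id INT PRIMARY KEY,
        status     SMALLINT NOT NULL
    );

    -- a status change now rewrites only a tiny row
    UPDATE listing_status SET status = 2 WHERE listing_id = 12345;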

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#3Net Virtual Mailing Lists
mailinglists@net-virtual.com
In reply to: Net Virtual Mailing Lists (#1)
Query question

Hello,

Thanks to Martijn's reply (thank you very much!) to an earlier question I
had about MVCC, I've decided to entirely re-think my use of the status
column in a table. What I've decided to do is to put all of my new or
changed records into a "holding" table, then, after they are indexed,
commit the changes to their final location. This has worked extremely
well, except when I am querying the holding table.

This is not the actual table, but my problem can be demonstrated by this
(mode can be "U" for an update/insert or "D" for a delete):

CREATE TABLE listings (
    trans_id   SERIAL,
    mode       CHAR(1),
    listing_id INT,
    region_id  INT,
    category   INT
);

.. so, my process goes along and inserts all these rows into the table,
about 2,000,000 a day. Then, when it comes time to query the data, I do a
query like this:

"SELECT * FROM listings ORDER BY region_id, category, listing_id,
trans_id" -- this is *very* expensive obviously, but since multiple rows
can be inserted for the same listing_id I have to get the data into some
deterministic order.

There can be multiple writers adding to this listings table. When it comes
time to process it, what I want to do is get only the last transaction for
a given listing_id, because the earlier ones don't matter. On top of
that, each region_id and category_id has its own index. I need to be able
to process the indexes in full, one at a time, because there are too many
to hold that many open filehandles/processes at one time.

So, my question is: is there some way to return the rows in a
deterministic order, without actually having to do an explicit sort on the
data? What I mean is, I don't care if category_id 4 / region_id 10 /
listing_id 10000 comes before category_id 1 / region_id 1 / listing_id 1
-- I just need them returned to me in that sort of grouped order (although
sorted by trans_id). And would this even be more efficient in the first
place, or am I barking up the wrong tree?

I hope this makes sense; I've been up all night, so I'm not thinking too
clearly....

Thanks!

- Greg

#4Sam Mason
sam@samason.me.uk
In reply to: Net Virtual Mailing Lists (#3)
Re: Query question

On Wed, Jan 14, 2009 at 07:36:03PM -0800, mailinglists@net-virtual.com wrote:

CREATE TABLE listings (
    trans_id   SERIAL,
    mode       CHAR(1),
    listing_id INT,
    region_id  INT,
    category   INT
);

"SELECT * FROM listings ORDER BY region_id, category, listing_id,
trans_id"

[...] what I want to do is get only the last transaction for
a given listing_id, because the earlier ones don't matter.

If you have an index on (region_id,category,listing_id,trans_id) you
should be able to do:

SELECT region_id,category,listing_id,MAX(trans_id)
FROM listings
GROUP BY region_id,category,listing_id;

And have PG answer this using the index (it'll only do this if it thinks
there are many transactions for each group though).
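Concretely, that index might be created like this (a sketch; the index
name is hypothetical):

    CREATE INDEX listings_region_cat_idx
        ON listings (region_id, category, listing_id, trans_id);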

On top of
that, each region_id and category_id has its own index. I need to be able
to process the indexes in-full, one-at-a-time because there are too many
to hold that many open filehandles/processes at one time.

Not sure what you mean by "index" here; I'm assuming you're talking
about something outside PG, or am I missing some context?

If it's inside PG, then you do know that every index you have will slow
down every modification of the table?

So, my question is, is there some way to return the rows in a
deterministic order, without actually having to do an explicit sort on the
data? What I mean is, I don't care if category_id 4 / region_id 10 /
listing_id 10000 comes before category_id 1 / region_id 1 / listing_id 1
-- I just need them returned to me in that sort of grouped order (although
sorted by trans_id).

If you want to know all the transaction ids then you need to do the
sort; if you only want the largest/latest then you're probably better
off telling PG that's what you want (i.e. using GROUP BY and the MAX
aggregate and letting it make an appropriate decision).
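A PostgreSQL-specific alternative worth knowing (a sketch; useful if whole
rows, e.g. the mode column, are needed rather than just the max id) is
DISTINCT ON, which keeps only the first row of each group under the given
ordering:

    SELECT DISTINCT ON (region_id, category, listing_id)
           region_id, category, listing_id, trans_id, mode
    FROM listings
    ORDER BY region_id, category, listing_id, trans_id DESC;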

--

Sam http://samason.me.uk/

#5Net Virtual Mailing Lists
mailinglists@net-virtual.com
In reply to: Sam Mason (#4)
Re: Query question

On Wed, Jan 14, 2009 at 07:36:03PM -0800, mailinglists@net-virtual.com
wrote:

CREATE TABLE listings (
    trans_id   SERIAL,
    mode       CHAR(1),
    listing_id INT,
    region_id  INT,
    category   INT
);

"SELECT * FROM listings ORDER BY region_id, category, listing_id,
trans_id"

[...] what I want to do is get only the last transaction for
a given listing_id, because the earlier ones don't matter.

If you have an index on (region_id,category,listing_id,trans_id) you
should be able to do:

SELECT region_id,category,listing_id,MAX(trans_id)
FROM listings
GROUP BY region_id,category,listing_id;

And have PG answer this using the index (it'll only do this if it thinks
there are many transactions for each group though).

On top of
that, each region_id and category_id has its own index. I need to be able
to process the indexes in-full, one-at-a-time because there are too many
to hold that many open filehandles/processes at one time.

Not sure what you mean by "index" here; I'm assuming you're talking
about something outside PG, or am I missing some context?

Yes, sorry I wasn't clear... This is to keep track of incremental updates
to an index outside of Postgres.

So, my question is, is there some way to return the rows in a
deterministic order, without actually having to do an explicit sort on the
data? What I mean is, I don't care if category_id 4 / region_id 10 /
listing_id 10000 comes before category_id 1 / region_id 1 / listing_id 1
-- I just need them returned to me in that sort of grouped order (although
sorted by trans_id).

If you want to know all the transaction ids then you need to do the
sort; if you only want the largest/latest then you're probably better
off telling PG that's what you want (i.e. using GROUP BY and the MAX
aggregate and letting it make an appropriate decision).

I only need to know the max id, but the problem is that each region_id
has N categories and I need to process each one individually. The query
you suggested does, indeed, group them by category, but it does not return
them sorted by region_id.

Here's an example:

    SELECT region_id, category, listing_id, MAX(trans_id)
    FROM listings_pending
    WHERE region_id IN ('3134000000', '2222000000')
    GROUP BY region_id, category, listing_id;

 region_id  | category_id | listing_id |  max
------------+-------------+------------+--------
 2222000000 |           1 |    2221473 | 640799
 2222000000 |           1 |    2426142 | 845468
 2222000000 |           1 |    2103599 | 522925
 3134000000 |           1 |    2146326 | 565652
 2222000000 |           1 |    2462112 | 881438
 2222000000 |           1 |    1947690 | 367016
 2222000000 |           1 |    2526731 | 946057
 2222000000 |           1 |    2217864 | 637190
 2222000000 |           1 |    2288420 | 707746

As you can see, at transaction 565652 I would close the index (which is
very expensive, because it actually has to do a "merge" of the newly
created index with the old one), then reopen it at transaction 881438.
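For what it's worth, an explicit ORDER BY over the grouping columns (a
sketch; it does add a sort, but only over the already-grouped output) would
keep each region_id/category run contiguous:

    SELECT region_id, category, listing_id, MAX(trans_id)
    FROM listings_pending
    WHERE region_id IN ('3134000000', '2222000000')
    GROUP BY region_id, category, listing_id
    ORDER BY region_id, category;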

Thanks for your help!

- Greg