Deletion Challenge

Started by Berend Tober · over 10 years ago · 13 messages · general
#1Berend Tober
btober@computer.org

/*

Deletion Challenge

I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely affect performance,
and also becomes less relevant for retention.

I have devised a way to accomplish this, but it is a 'two-stage'
approach: that is, it requires two delete statements. I would like to
know if there is a way to do it in a single statement.

Bonus challenge: Same question, except preserving the most recent N, for
N > 1, rows for each person so that a short history is retained after
the deletion.

I have included below an annotated test case and my current solution for
the N = 1 case.

*/

DROP TABLE IF EXISTS cash_journal;

CREATE TABLE cash_journal (
    click            bigint NOT NULL,
    cash_journal_id  bigint NOT NULL,
    fairian_id       bigint NOT NULL,
    debit            double precision,
    credit           double precision,
    balance          real DEFAULT 0,
    description      text
);

COMMENT ON COLUMN cash_journal.click IS 'Time of transaction.';
COMMENT ON COLUMN cash_journal.cash_journal_id IS 'Sequence of transaction within current click.';
COMMENT ON COLUMN cash_journal.fairian_id IS 'Fairian account affected.';
COMMENT ON COLUMN cash_journal.debit IS 'Account balance increase amount.';
COMMENT ON COLUMN cash_journal.credit IS 'Account balance decrease amount.';
COMMENT ON COLUMN cash_journal.balance IS 'Account balance, per Fairian running total.';
COMMENT ON COLUMN cash_journal.description IS 'Transaction description.';

/*

Below is some sample data, listed in the click/sequence order that the
data would actually be entered. That is, the 'click' column represents
advancing time, and within each click, transactions are sequenced by the
'cash_journal_id' column. Note there are some missing cash_journal_id
sequence numbers. This is an artifact of having presented here only
an illustrative sample. Generally, within each click, the sequence
would start at one and increment uniformly by one for each new row
in the same click, and then reset to one for the next click. The
missing increments in the sample data should not make any difference
in the solution.

The 'balance' column is a per-player running total, which is a
deliberate denormalization. It is calculated in a before insert trigger
by starting with the per-player previous balance, and then adding
the new row debit, if any, and subtracting the new row credit, if any.

Note, not all Fairians will have a transaction in every click, but any
number of Fairians may have multiple transactions in any click.

*/
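The before-insert trigger described above is not part of this test case. As context only, a minimal sketch of what such a trigger might look like (this is a hypothetical reconstruction, not the game's actual code; it assumes the latest row per Fairian is the one with the greatest (click, cash_journal_id) pair):

```sql
-- Hypothetical sketch of the balance-maintenance trigger described above.
-- Looks up the player's most recent balance, then applies debit/credit.
CREATE OR REPLACE FUNCTION cash_journal_balance() RETURNS trigger AS $$
BEGIN
    NEW.balance :=
        COALESCE((SELECT balance
                    FROM cash_journal
                   WHERE fairian_id = NEW.fairian_id
                   ORDER BY click DESC, cash_journal_id DESC
                   LIMIT 1), 0)
        + COALESCE(NEW.debit, 0)
        - COALESCE(NEW.credit, 0);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cash_journal_balance_trg
    BEFORE INSERT ON cash_journal
    FOR EACH ROW EXECUTE PROCEDURE cash_journal_balance();
```

Note that if such a trigger were actually installed, it would also fire during the COPY below; the sample balances happen to be consistent with this logic.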

copy cash_journal (click,cash_journal_id,fairian_id,debit,credit,balance,description) from stdin;
36 3 7 0 0 0 Initial cash balance
36 4 8 0 0 0 Initial cash balance
36 5 9 0 0 0 Initial cash balance
36 14 18 0 0 0 initial cash balance
37 5 7 9 \N 9 Ratified contract fa35e192121eab
37 7 8 8 \N 8 Ratified contract f1abd670358e03
37 9 9 7 \N 7 Ratified contract 1574bddb75c78a
411 1 25 0 0 0 Initial cash balance
411 2 25 1000 \N 1000 Issued bond 7719a1c782a1ba
412 1 7 5 \N 14 Sold food quantity 7 units.
412 2 25 \N 5 995 Bought food quantity 7 units.
413 1 25 \N 995 0 Redeemed bond 7719a1c782a1ba
\.

SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;

/*

The sample starting data is shown here in order by Fairian so that it is
perhaps easier to see what is happening for each player. Note that the
result of the deletion should be the last row for each player.

click | cash_journal_id | fairian_id | debit | credit | balance | description
-------+-----------------+------------+-------+--------+---------+----------------------------------
36 | 3 | 7 | 0 | 0 | 0 | Initial cash balance
37 | 5 | 7 | 9 | | 9 | Ratified contract fa35e192121eab
412 | 1 | 7 | 5 | | 14 | Sold food quantity 7 units.
36 | 4 | 8 | 0 | 0 | 0 | Initial cash balance
37 | 7 | 8 | 8 | | 8 | Ratified contract f1abd670358e03
36 | 5 | 9 | 0 | 0 | 0 | Initial cash balance
37 | 9 | 9 | 7 | | 7 | Ratified contract 1574bddb75c78a
36 | 14 | 18 | 0 | 0 | 0 | initial cash balance
411 | 1 | 25 | 0 | 0 | 0 | Initial cash balance
411 | 2 | 25 | 1000 | | 1000 | Issued bond 7719a1c782a1ba
412 | 2 | 25 | | 5 | 995 | Bought food quantity 7 units.
413 | 1 | 25 | | 995 | 0 | Redeemed bond 7719a1c782a1ba
(12 rows)

*/

/*

Here is the current, two-stage solution in use. Is there a way to do it
with a single statement?

Can you create a solution that retains an arbitrarily specified number
of rows per player?

*/
BEGIN;

WITH max_click AS (
    SELECT
        cash_journal.fairian_id,
        max(cash_journal.click) AS click
    FROM cash_journal
    GROUP BY cash_journal.fairian_id
)
delete from cash_journal j
using max_click b
where j.fairian_id = b.fairian_id
  and j.click < b.click;

WITH max_journal_id AS (
    SELECT
        cash_journal.fairian_id,
        cash_journal.click,
        max(cash_journal.cash_journal_id) AS cash_journal_id
    FROM cash_journal
    GROUP BY cash_journal.fairian_id, cash_journal.click
)
delete from cash_journal j
using max_journal_id b
where j.fairian_id = b.fairian_id
  and j.click = b.click
  and j.cash_journal_id < b.cash_journal_id;

COMMIT;

SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;

/*

click | cash_journal_id | fairian_id | debit | credit | balance | description
-------+-----------------+------------+-------+--------+---------+----------------------------------
412 | 1 | 7 | 5 | | 14 | Sold food quantity 7 units.
37 | 7 | 8 | 8 | | 8 | Ratified contract f1abd670358e03
37 | 9 | 9 | 7 | | 7 | Ratified contract 1574bddb75c78a
36 | 14 | 18 | 0 | 0 | 0 | initial cash balance
413 | 1 | 25 | | 995 | 0 | Redeemed bond 7719a1c782a1ba
(5 rows)

*/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Steve Crawford
scrawford@pinpointresearch.com
In reply to: Berend Tober (#1)
Re: Deletion Challenge

If I understand correctly, the value of "click" always advances, and within a
"click" the "cash_journal_id" always advances (not necessarily by single
steps), so within a fairian_id, ordering by "click" plus "cash_journal_id"
would return the records in order, from which you want the most recent 5 for
each fairian_id.

Typing without testing and ignoring performance optimizations, something
along the lines of the following should work and covers the "last 5" issue
as well.

with stuff_to_delete as (
    select fairian_id, click, cash_journal_id,
        rank() over (partition by fairian_id
                     order by click desc, cash_journal_id desc) as howold
    from cash_journal)
delete from cash_journal
using stuff_to_delete
where
    cash_journal.fairian_id = stuff_to_delete.fairian_id
    and cash_journal.click = stuff_to_delete.click
    and cash_journal.cash_journal_id = stuff_to_delete.cash_journal_id
    and stuff_to_delete.howold > 5;

Cheers,
Steve

On Sat, Dec 5, 2015 at 8:08 AM, Berend Tober <btober@computer.org> wrote:

...

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Berend Tober (#1)
Re: Deletion Challenge

On 12/05/2015 08:08 AM, Berend Tober wrote:

...

test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal group by fairian_id);
DELETE 7

test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
click | cash_journal_id | fairian_id | debit | credit | balance | description
-------+-----------------+------------+-------+--------+---------+----------------------------------
412 | 1 | 7 | 5 | | 14 | Sold food quantity 7 units.
37 | 7 | 8 | 8 | | 8 | Ratified contract f1abd670358e03
37 | 9 | 9 | 7 | | 7 | Ratified contract 1574bddb75c78a
36 | 14 | 18 | 0 | 0 | 0 | initial cash balance
413 | 1 | 25 | | 995 | 0 | Redeemed bond 7719a1c782a1ba
(5 rows)
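For anyone puzzled by the array trick: PostgreSQL compares arrays element by element, left to right, so max(ARRAY[click, cash_journal_id]) picks the chronologically newest row per group. A quick illustration:

```sql
-- Arrays compare element by element, left to right:
SELECT ARRAY[37,9]  > ARRAY[36,14];  -- true:  37 > 36 decides immediately
SELECT ARRAY[412,1] > ARRAY[412,2];  -- false: tie on 412, then 1 < 2
```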

--
Adrian Klaver
adrian.klaver@aklaver.com


#4Berend Tober
btober@computer.org
In reply to: Steve Crawford (#2)
Re: Deletion Challenge

Steve Crawford wrote:

If I understand correctly ... ordering by "click" plus "cash_journal_id"
would return the records in order, from which you want the most recent 5 for
each fairian_id.

with stuff_to_delete as (
    select fairian_id, click, cash_journal_id,
        rank() over (partition by fairian_id
                     order by click desc, cash_journal_id desc) as howold
    from cash_journal)
delete from cash_journal
using stuff_to_delete
where
    cash_journal.fairian_id = stuff_to_delete.fairian_id
    and cash_journal.click = stuff_to_delete.click
    and cash_journal.cash_journal_id = stuff_to_delete.cash_journal_id
    and stuff_to_delete.howold > 5;

Assessing without testing, I like that. Thanks!

Although the above is not exactly the form I was using, an earlier iteration of a related
problem employed window functions. But as the data set grew, performance suffered: unless deletes
were done on a regular, continuing basis to keep the historical data set approximately
"small", the process execution time using a windowing scheme eventually exceeded the extent of my
patience.

That "non-scalable" situation is actually what motivated the deliberate de-normalization (of
retaining the "running balance" in a separate column) and the desire to delete old data. The
original implementation calculated the running balance on the fly, employing windowing per
fairian_id, and those tallies of the net balance entailed increasingly lengthy execution times as
the number of rows increased. Hence I was motivated to retain only a relatively constant-sized
per-fairian history, and I dismissed the use of windowing for the delete problem since it was so
problematic for the running-balance-without-delete problem.

Thanks for knocking some sense into me!


#5Berend Tober
btober@computer.org
In reply to: Adrian Klaver (#3)
Re: Deletion Challenge

Adrian Klaver wrote:

On 12/05/2015 08:08 AM, Berend Tober wrote:

/*

Deletion Challenge

I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely affect performance,
and also becomes less relevant for retention.

...

test=> delete from cash_journal where ARRAY[click, cash_journal_id] NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal group by fairian_id);
DELETE 7

test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
click | cash_journal_id | fairian_id | debit | credit | balance | description
-------+-----------------+------------+-------+--------+---------+----------------------------------
412 | 1 | 7 | 5 | | 14 | Sold food quantity 7 units.
37 | 7 | 8 | 8 | | 8 | Ratified contract f1abd670358e03
37 | 9 | 9 | 7 | | 7 | Ratified contract 1574bddb75c78a
36 | 14 | 18 | 0 | 0 | 0 | initial cash balance
413 | 1 | 25 | | 995 | 0 | Redeemed bond 7719a1c782a1ba
(5 rows)

Nice.

The idea of a NOT IN query had occurred to me briefly, but I failed to pursue it because at some
point in the distant past I had gained the impression that NOT IN queries were not computationally
efficient. During one round of testing I had like a million rows. I'll have to run some EXPLAIN
query testing with a larger data sample for comparison. Thanks!


#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Berend Tober (#5)
Re: Deletion Challenge

On 12/09/2015 12:24 AM, Berend Tober wrote:

...

The idea of a NOT IN query had occurred to me briefly, but I failed to
pursue it because at some point in the distant past I had gained the
impression that NOT IN queries were not computationally efficient.
During one round of testing I had like a million rows. I'll have to run
some EXPLAIN query testing with a larger data sample for comparison.
Thanks!

Plan B:

WITH d AS (
    SELECT * FROM cash_journal
    LEFT JOIN (
        SELECT MAX(ARRAY[click, cash_journal_id]) AS mx
        FROM cash_journal
        GROUP BY fairian_id
    ) AS mxa
    ON mxa.mx = ARRAY[click, cash_journal_id]
    WHERE mx IS NULL
)
DELETE FROM cash_journal
USING d
WHERE d.click = cash_journal.click
  AND d.cash_journal_id = cash_journal.cash_journal_id;

--
Adrian Klaver
adrian.klaver@aklaver.com


#7David G. Johnston
david.g.johnston@gmail.com
In reply to: Adrian Klaver (#6)
Re: Deletion Challenge

On Wed, Dec 9, 2015 at 1:31 PM, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

...

Plan B:

WITH d AS (
    SELECT * FROM cash_journal
    LEFT JOIN (
        SELECT MAX(ARRAY[click, cash_journal_id]) AS mx
        FROM cash_journal
        GROUP BY fairian_id
    ) AS mxa
    ON mxa.mx = ARRAY[click, cash_journal_id]
    WHERE mx IS NULL
)
DELETE FROM cash_journal
USING d
WHERE d.click = cash_journal.click
  AND d.cash_journal_id = cash_journal.cash_journal_id;

Couldn't the LEFT JOIN relation in the CTE be better written using "SELECT
DISTINCT ON (click, cash_journal_id) click, cash_journal_id [...] ORDER BY
click DESC, cash_journal_id" or something similar? It doesn't seem like
you should need to introduce an array and an aggregate here.

It does have the negative property of only providing a single row, which
excludes using it for the "last 5" part, but I suspect it will be
considerably faster for the single version.
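A sketch of the DISTINCT ON idea (untested; note that it keys DISTINCT ON on fairian_id, rather than David's literal wording, so that exactly one newest row per player is kept):

```sql
-- Keep each fairian's single newest row; delete everything else.
WITH keep AS (
    SELECT DISTINCT ON (fairian_id)
           fairian_id, click, cash_journal_id
    FROM cash_journal
    ORDER BY fairian_id, click DESC, cash_journal_id DESC
)
DELETE FROM cash_journal
WHERE NOT EXISTS (
    SELECT 1 FROM keep
    WHERE keep.fairian_id = cash_journal.fairian_id
      AND keep.click = cash_journal.click
      AND keep.cash_journal_id = cash_journal.cash_journal_id
);
```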

David J.

#8Steve Crawford
scrawford@pinpointresearch.com
In reply to: David G. Johnston (#7)
Re: Deletion Challenge

The two general solutions are the "keep the last one" proposed by Adrian and
the "keep the last N" that I sent.

But it might be worth stepping back a bit. You said you are having
performance problems that you feel would be improved by removing only a
million rows which doesn't sound like that much to me. It's less than half
of what I *add* to just one of my tables every week and my database is
dwarfed by those of many of the participants on this list.

This suggests that there may be other issues such as tuning, indexing or
query optimization at play. Depending on your requirements, partitioning
might be useful. It wouldn't be last N but could easily be done to
partition by date-ranges which makes archiving and purging a low-cost
operation.
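The date-range partitioning Steve describes can be sketched with PostgreSQL's declarative partitioning, a feature of version 10 and later (which postdates this thread; earlier versions used inheritance plus triggers). Partition names and click bounds below are purely illustrative:

```sql
-- Hypothetical range partitioning of the journal by click.
CREATE TABLE cash_journal (
    click            bigint NOT NULL,
    cash_journal_id  bigint NOT NULL,
    fairian_id       bigint NOT NULL,
    debit            double precision,
    credit           double precision,
    balance          real DEFAULT 0,
    description      text
) PARTITION BY RANGE (click);

CREATE TABLE cash_journal_click_0 PARTITION OF cash_journal
    FOR VALUES FROM (0) TO (1000);
CREATE TABLE cash_journal_click_1000 PARTITION OF cash_journal
    FOR VALUES FROM (1000) TO (2000);

-- Purging the oldest history then becomes a cheap metadata operation
-- rather than a row-by-row DELETE:
DROP TABLE cash_journal_click_0;
```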

You might want to expand a bit on the core issue you are trying to solve.

Cheers,
Steve

On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <
david.g.johnston@gmail.com> wrote:

...

#9Berend Tober
btober@computer.org
In reply to: Steve Crawford (#8)
Re: Deletion Challenge

Steve Crawford wrote:

The two general solutions are the "keep the last one" proposed by Adrian and
the "keep the last N" that I sent.

But it might be worth stepping back a bit. You said you are having
performance problems that you feel would be improved by removing only a
million rows which doesn't sound like that much to me. It's less than
half of what I *add* to just one of my tables every week and my database
is dwarfed by those of many of the participants on this list.

This suggests that there may be other issues such as tuning, indexing or
query optimization at play. Depending on your requirements, partitioning
might be useful. It wouldn't be last N but could easily be done to
partition by date-ranges which makes archiving and purging a low-cost
operation.
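With declarative partitioning (available in PostgreSQL 10 and later),
the range-partitioning idea could be sketched against this schema using
the click column as the partition key (the boundary values below are
purely illustrative):

```sql
CREATE TABLE cash_journal_partitioned (
    click           bigint NOT NULL,
    cash_journal_id bigint NOT NULL,
    fairian_id      bigint NOT NULL,
    debit           double precision,
    credit          double precision,
    balance         real DEFAULT 0,
    description     text
) PARTITION BY RANGE (click);

CREATE TABLE cash_journal_clicks_0 PARTITION OF cash_journal_partitioned
    FOR VALUES FROM (0) TO (1000);
CREATE TABLE cash_journal_clicks_1 PARTITION OF cash_journal_partitioned
    FOR VALUES FROM (1000) TO (2000);

-- Purging an old range is then a cheap metadata operation:
-- DROP TABLE cash_journal_clicks_0;
```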

You might want to expand a bit on the core issue you are trying to solve.

I really appreciate the deep-dive.

I'm quite sure the performance issue is mostly a hardware limitation at
this point. The application is in a developmental phase and running for
test purposes on grossly inadequate hardware for production purposes ...
think home laptop computer in one case.

The issue is that I'd like the application (that is, the database and
its stored procedures) to be robust enough to be a "long-running"
application, i.e. one that doesn't suffer gradual performance
degradation as time and the accumulated data increase. For the cash and
similar journals, I'd like to retain some amount of history so that
players can review "recent" transactions so as to understand and verify
how the current balance was arrived at, but at some point let old
transactions age off and be deleted.

This question was sort of addressed at the "query tuning" aspect, and
I'm confident that partitioning would help. But since this is just a
game, retention of a full and auditable history is not really required:
I have a lot of flexibility to determine what to keep and in fact am not
exactly sure how much to keep. I know I need at least one row in order
to retain the current balance, but I'm thinking something on the order
of scores or hundreds, maybe a thousand transactions per player in each
of several similar journals retained at any point in time would be
sufficient.

This project is a game, btw, described at

https://github.com/bmtober/fairwinds

for those interested in the backstory.

I am eager to get some real-world experience with multiple players
actually using the application and providing feedback, which is
gradually happening by means of presentations I have made and am
scheduled to make at local user groups. Eventually I want to raise some
money to rent some virtual private server space and host it on a
publicly-available site when obvious scalability issues like this are
mitigated.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Berend Tober (#9)
Re: Deletion Challenge

On 12/9/15 7:59 PM, Berend Tober wrote:

The issue is that I'd like the application (that is, the database and
its stored procedures) to be robust enough to be a "long-running"
application, i.e. one that doesn't suffer gradual performance
degradation as time and the accumulated data increase.

In my experience, no such thing exists. This is one of the things that
makes database development very different from other forms of
programming. Once you write a program, it's basically always going to
perform the same. Database performance slowly changes over time, not
only due to different amounts of data, but also different *composition*
of data stored.

Of course, it is wise to avoid things that will obviously hurt future
performance, so it's good that you're thinking about this. But don't
expect something you can "set and forget". :)

This question was sort of addressed at the "query tuning" aspect, and
I'm confident that partitioning would help.

Keep in mind that partitioning isn't a magic bullet either, though in
this case I agree it would help. Sometimes something as simple as having
"active" and "history" partitions is enough.

This project is a game, btw, described at

You might be interested in https://schemaverse.com/
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#11Benjamin Smith
lists@benjamindsmith.com
In reply to: Berend Tober (#1)
Re: Deletion Challenge

On Saturday, December 05, 2015 11:08:05 AM Berend Tober wrote:

WITH max_click AS (
SELECT
cash_journal.fairian_id,
max(cash_journal.click) AS click
FROM cash_journal
GROUP BY cash_journal.fairian_id
)
delete from cash_journal j
using max_click b
where j.fairian_id = b.fairian_id
and j.click < b.click;

WITH max_journal_id AS (
SELECT
cash_journal.fairian_id,
cash_journal.click,
max(cash_journal.cash_journal_id) AS cash_journal_id
FROM cash_journal
GROUP BY cash_journal.fairian_id, cash_journal.click
)
delete from cash_journal j
using max_journal_id b
where j.fairian_id = b.fairian_id
and j.click = b.click
and j.cash_journal_id < b.cash_journal_id;

Although I couldn't be sure if this would provide atomicity, I'd merge these
into one query like:

WITH max_click AS (
SELECT
cash_journal.fairian_id,
max(cash_journal.click) AS click
FROM cash_journal
GROUP BY cash_journal.fairian_id
),
max_journal_id AS (
SELECT
cash_journal.fairian_id,
cash_journal.click,
max(cash_journal.cash_journal_id) AS cash_journal_id
FROM cash_journal
GROUP BY cash_journal.fairian_id, cash_journal.click
),
delete_journal1 AS
(
delete from cash_journal j
using max_click b
where j.fairian_id = b.fairian_id
and j.click < b.click
returning j.*, 'journal1'::varchar AS source
),
delete_journal2 AS
(
delete from cash_journal j
using max_journal_id b
where j.fairian_id = b.fairian_id
and j.click = b.click
and j.cash_journal_id < b.cash_journal_id
returning j.*, 'journal2'::varchar AS source
)
-- AND THEN TO FIND OUT WHAT HAPPENED
SELECT * FROM delete_journal1
UNION ALL
SELECT * FROM delete_journal2;


#12Benjamin Smith
lists@benjamindsmith.com
In reply to: Adrian Klaver (#3)
Re: Deletion Challenge

test=> delete from cash_journal
         where ARRAY[click, cash_journal_id] NOT IN
           (select max(ARRAY[click, cash_journal_id])
            from cash_journal
            group by fairian_id);
DELETE 7

For what it's worth, we've run into *severe* performance issues using IN ()
with a large number of values (e.g. more than 10,000) in conjunction with a
complex query. Using a WITH () prefix table and joining against that doesn't
seem to carry anything like a similar performance penalty.
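For the NOT IN delete quoted above, one such rewrite joins against a
grouped derived table instead; a sketch (the same subquery could equally
sit in a WITH clause):

```sql
DELETE FROM cash_journal AS j
USING (
    SELECT fairian_id,
           MAX(ARRAY[click, cash_journal_id]) AS mx
    FROM cash_journal
    GROUP BY fairian_id
) AS m
WHERE j.fairian_id = m.fairian_id
  AND ARRAY[j.click, j.cash_journal_id] <> m.mx;
```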


#13Berend Tober
btober@computer.org
In reply to: Jim Nasby (#10)
Re: Deletion Challenge

Jim Nasby wrote:

On 12/9/15 7:59 PM, Berend Tober wrote:

This project is a game, btw, described at

You might be interested in https://schemaverse.com/

Schemaverse looks somewhat interesting. It seems that Schemaverse and
Fairwinds share PostgreSQL as a foundation, but they are very different
games: while they both have an aspect of conquest, one by battle and one
by economic dominance, Fairwinds involves a different kind of strategy
that entails a balance of cooperation and competition. Also, whereas in
Schemaverse you get "free money" to build your spaceship just by
joining the game, a fundamental feature of Fairwinds is wealth creation
... you have to make your own success out of nothing.
