Limitting full join to one match

Started by Phil Endecottabout 7 years ago7 messages

spam_from_pgsql_lists@chezphil.org

about 7 years ago

Dear Experts,

I have a couple of tables that I want to reconcile, finding rows
that match and places where rows are missing from one table or the
other:

db=> select * from a;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-02-01 | 5.00 | <-- missing from b
| 2018-04-01 | 5.00 |
+------------+--------+

db=> select * from b;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-03-01 | 8.00 | <-- missing from a
| 2018-04-01 | 5.00 |
+------------+--------+

db=> select a.date, a.amount, b.date, b.amount from a full join b using (date,amount);
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
+------------+--------+------------+--------+

This works fine until I have multiple items with the same date
and amount:

db=> select * from a;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-02-01 | 5.00 |
| 2018-04-01 | 5.00 |
| 2018-05-01 | 20.00 | <--
| 2018-05-01 | 20.00 | <--
+------------+--------+

db=> select * from b;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-03-01 | 8.00 |
| 2018-04-01 | 5.00 |
| 2018-05-01 | 20.00 | <--
| 2018-05-01 | 20.00 | <--
+------------+--------+

db=> select a.date, a.amount, b.date, b.amount from a full join b using (date,amount);
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  3
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  4
+------------+--------+------------+--------+

It has, of course, put four rows in the output for the new items.

So my question is: how can I modify my query to output only two rows,
like this:?

+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
+------------+--------+------------+--------+

Any suggestions anyone?

The best I have found so far is something involving EXCEPT ALL:

db=> select * from a except all select * from b;
db=> select * from b except all select * from a;

That's not ideal, though, as what I ultimately want is something
that lists everything with its status:

+------------+--------+--------+
| date | amount | status |
+------------+--------+--------+
| 2018-01-01 | 10.00 | OK |
| 2018-02-01 | 5.00 | a_only |
| 2018-03-01 | 8.00 | b_only |
| 2018-04-01 | 5.00 | OK |
| 2018-05-01 | 20.00 | OK |
| 2018-05-01 | 20.00 | OK |
+------------+--------+--------+

That would be easy enough to achieve from the JOIN.

Thanks, Phil.

John W Higgins

wishdev@gmail.com

about 7 years ago

In reply to: Phil Endecott (#1)

Re: Limitting full join to one match

On Wed, Dec 5, 2018 at 4:34 PM Phil Endecott <
spam_from_pgsql_lists@chezphil.org> wrote:

Dear Experts,

I have a couple of tables that I want to reconcile, finding rows
that match and places where rows are missing from one table or the
other:

...

So my question is: how can I modify my query to output only two rows,
like this:?

+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
+------------+--------+------------+--------+

Evening Phil,

Window functions are your friend here. I prefer views for this stuff - but
subqueries would work just fine.

create view a_rows as (select *,
row_number() OVER (PARTITION BY date, amount) AS pos
from a);
create view b_rows as (select *,
row_number() OVER (PARTITION BY date, amount) AS pos
from b);

select
a_rows.date,
a_rows.amount,
a_rows.pos,
b_rows.date,
b_rows.amount,
b_rows.pos
from
a_rows full join b_rows using (date,amount,pos);

Example here - http://sqlfiddle.com/#!17/305d6/3

John

Show quoted text

Any suggestions anyone?

The best I have found so far is something involving EXCEPT ALL:

db=> select * from a except all select * from b;
db=> select * from b except all select * from a;

That's not ideal, though, as what I ultimately want is something
that lists everything with its status:

+------------+--------+--------+
| date | amount | status |
+------------+--------+--------+
| 2018-01-01 | 10.00 | OK |
| 2018-02-01 | 5.00 | a_only |
| 2018-03-01 | 8.00 | b_only |
| 2018-04-01 | 5.00 | OK |
| 2018-05-01 | 20.00 | OK |
| 2018-05-01 | 20.00 | OK |
+------------+--------+--------+

That would be easy enough to achieve from the JOIN.

Thanks, Phil.

Sergei Agalakov

sergei.agalakov@gmail.com

about 7 years ago

In reply to: John W Higgins (#2)

Re: Limitting full join to one match

On 12/5/2018 8:30 PM, John W Higgins wrote:

On Wed, Dec 5, 2018 at 4:34 PM Phil Endecott
<spam_from_pgsql_lists@chezphil.org
<mailto:spam_from_pgsql_lists@chezphil.org>> wrote:

Dear Experts,

I have a couple of tables that I want to reconcile, finding rows
that match and places where rows are missing from one table or the
other:

...

So my question is: how can I modify my query to output only two rows,
like this:?
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
+------------+--------+------------+--------+
Evening Phil,

Window functions are your friend here. I prefer views for this stuff -
but subqueries would work just fine.

create view a_rows as (select *,
row_number() OVER (PARTITION BY date, amount)
AS pos from a);
create view b_rows as (select *,
row_number() OVER (PARTITION BY date, amount)
AS pos from b);

select
a_rows.date,
a_rows.amount,
a_rows.pos,
b_rows.date,
b_rows.amount,
b_rows.pos
from
a_rows full join b_rows using (date,amount,pos);

Example here - http://sqlfiddle.com/#!17/305d6/3

John

Any suggestions anyone?

The best I have found so far is something involving EXCEPT ALL:

db=> select * from a except all select * from b;
db=> select * from b except all select * from a;

That's not ideal, though, as what I ultimately want is something
that lists everything with its status:

+------------+--------+--------+
| date | amount | status |
+------------+--------+--------+
| 2018-01-01 | 10.00 | OK |
| 2018-02-01 | 5.00 | a_only |
| 2018-03-01 | 8.00 | b_only |
| 2018-04-01 | 5.00 | OK |
| 2018-05-01 | 20.00 | OK |
| 2018-05-01 | 20.00 | OK |
+------------+--------+--------+

That would be easy enough to achieve from the JOIN.

Thanks, Phil.

This question is always asked time to time.
I have found an old article with so far the best solution for big tables.
https://asktom.oracle.com/pls/asktom/f?p=100:11:::::P11_QUESTION_ID:2151582681236#15393095283923

On the same test data
create table a (date date, amount money);
create table b (date date, amount money);

insert into a values ('2018-01-01', 10);
insert into a values ('2018-02-01', 5);
insert into a values ('2018-04-01', 5);
insert into a values ('2018-05-01', 20);
insert into a values ('2018-05-01', 20);
insert into b values ('2018-01-01', 10);
insert into b values ('2018-03-01', 8);
insert into b values ('2018-04-01', 5);
insert into b values ('2018-05-01', 20);
insert into b values ('2018-05-01', 20);

select tt.date,
       tt.amount,
       count(tt.src1) CNT1,
       count(tt.src2) CNT2
from
(
select a.date,
        a.amount,
        1 src1,
        null::integer src2
   from a
union all
select b.date,
        b.amount,
        null::integer src1,
        2 src2
from b
) tt
group by tt.date, tt.amount;

date amount    cnt1    cnt2
2018-01-01    $10.00     1 1
2018-02-01    $5.00 1       0
2018-03-01    $8.00      0   1
2018-04-01    $5.00      1       1
2018-05-01    $20.00   2       2

It requires a sort, so you may want to increase work_mem before
execution, and then return it back like
SET work_mem = '512MB';
... run your query
RESET work_mem;

Regards,

Sergei Agalakov

Ron

ronljohnsonjr@gmail.com

about 7 years ago

In reply to: Phil Endecott (#1)

Re: Limitting full join to one match

On 12/05/2018 06:34 PM, Phil Endecott wrote:

Dear Experts,

I have a couple of tables that I want to reconcile, finding rows
that match and places where rows are missing from one table or the
other:

db=> select * from a;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-02-01 | 5.00 | <-- missing from b
| 2018-04-01 | 5.00 |
+------------+--------+

db=> select * from b;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-03-01 | 8.00 | <-- missing from a
| 2018-04-01 | 5.00 |
+------------+--------+
db=> select a.date, a.amount, b.date, b.amount from a full join b using (date,amount);
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
+------------+--------+------------+--------+
This works fine until I have multiple items with the same date
and amount:

db=> select * from a;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-02-01 | 5.00 |
| 2018-04-01 | 5.00 |
| 2018-05-01 | 20.00 | <--
| 2018-05-01 | 20.00 | <--
+------------+--------+

db=> select * from b;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-03-01 | 8.00 |
| 2018-04-01 | 5.00 |
| 2018-05-01 | 20.00 | <--
| 2018-05-01 | 20.00 | <--
+------------+--------+

What's your PK on "a" and "b"?

(Also, gmail seems to think that all -- or at least most -- of your email is
spam.)

db=> select a.date, a.amount, b.date, b.amount from a full join b using (date,amount);
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  3
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  4
+------------+--------+------------+--------+
It has, of course, put four rows in the output for the new items.

So my question is: how can I modify my query to output only two rows,
like this:?
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
+------------+--------+------------+--------+
Any suggestions anyone?

The best I have found so far is something involving EXCEPT ALL:

db=> select * from a except all select * from b;
db=> select * from b except all select * from a;

That's not ideal, though, as what I ultimately want is something
that lists everything with its status:

+------------+--------+--------+
| date | amount | status |
+------------+--------+--------+
| 2018-01-01 | 10.00 | OK |
| 2018-02-01 | 5.00 | a_only |
| 2018-03-01 | 8.00 | b_only |
| 2018-04-01 | 5.00 | OK |
| 2018-05-01 | 20.00 | OK |
| 2018-05-01 | 20.00 | OK |
+------------+--------+--------+

That would be easy enough to achieve from the JOIN.

Thanks, Phil.

--
Angular momentum makes the world go 'round.

Phil Endecott

spam_from_pgsql_lists@chezphil.org

about 7 years ago

In reply to: John W Higgins (#2)

Re: Limitting full join to one match

John W Higgins wrote:

On Wed, Dec 5, 2018 at 4:34 PM Phil Endecott <
spam_from_pgsql_lists@chezphil.org> wrote:

Dear Experts,

I have a couple of tables that I want to reconcile, finding rows
that match and places where rows are missing from one table or the
other:

...
So my question is: how can I modify my query to output only two rows,
like this:?
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  1
| 2018-05-01 |  20.00 | 2018-05-01 |  20.00 |  2
+------------+--------+------------+--------+
Evening Phil,

Window functions are your friend here. I prefer views for this stuff - but
subqueries would work just fine.

create view a_rows as (select *,
row_number() OVER (PARTITION BY date, amount) AS pos
from a);
create view b_rows as (select *,
row_number() OVER (PARTITION BY date, amount) AS pos
from b);

select
a_rows.date,
a_rows.amount,
a_rows.pos,
b_rows.date,
b_rows.amount,
b_rows.pos
from
a_rows full join b_rows using (date,amount,pos);

Thanks John, that's great. I'm a little surprised that there isn't an
easier way, but this certainly works.

Regard, Phil.

Phil Endecott

spam_from_pgsql_lists@chezphil.org

about 7 years ago

In reply to: Ron (#4)

Re: Limitting full join to one match

Hi Ron,

Ron wrote:

On 12/05/2018 06:34 PM, Phil Endecott wrote:
Dear Experts,

I have a couple of tables that I want to reconcile, finding rows
that match and places where rows are missing from one table or the
other:

db=> select * from a;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-02-01 | 5.00 | <-- missing from b
| 2018-04-01 | 5.00 |
+------------+--------+

db=> select * from b;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-03-01 | 8.00 | <-- missing from a
| 2018-04-01 | 5.00 |
+------------+--------+
db=> select a.date, a.amount, b.date, b.amount from a full join b using (date,amount);
+------------+--------+------------+--------+
|    date    | amount |    date    | amount |
+------------+--------+------------+--------+
| 2018-01-01 |  10.00 | 2018-01-01 |  10.00 |
| 2018-02-01 |   5.00 |            |        |
|            |        | 2018-03-01 |   8.00 |
| 2018-04-01 |   5.00 | 2018-04-01 |   5.00 |
+------------+--------+------------+--------+
This works fine until I have multiple items with the same date
and amount:

db=> select * from a;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-02-01 | 5.00 |
| 2018-04-01 | 5.00 |
| 2018-05-01 | 20.00 | <--
| 2018-05-01 | 20.00 | <--
+------------+--------+

db=> select * from b;
+------------+--------+
| date | amount |
+------------+--------+
| 2018-01-01 | 10.00 |
| 2018-03-01 | 8.00 |
| 2018-04-01 | 5.00 |
| 2018-05-01 | 20.00 | <--
| 2018-05-01 | 20.00 | <--
+------------+--------+
What's your PK on "a" and "b"?

These input tables can have duplicate rows, so defining a primary key
requires something like a row ID or similar.

(Also, gmail seems to think that all -- or at least most -- of your email is
spam.)

Yes, it is becoming increasingly difficult to persuade gmail etc. that
you are not a spammer if you run your own mail server. If you have any
interesting headers suggesting exactly what they disliked about my message,
could you please forward them off-list? Thanks.

Regards, Phil.

Ravi Krishna

srkrishna@fastmail.com

about 7 years ago

In reply to: Phil Endecott (#6)

Re: Limitting full join to one match

Yes, it is becoming increasingly difficult to persuade gmail etc. that> you are not a spammer if you run your own mail server. If you
have any> interesting headers suggesting exactly what they disliked about my
message,> could you please forward them off-list? Thanks.

It is for this reason (and few others) I am off gmail and other free
email accounts.I have tried gmail / outlook / yahoo / aol and all of them mark many
mails to this listas spam.

I like fastmail a lot and $3 per month is practically free.