Upsert Functionality using CTEs
I am using a query like this to try and normalize a table.
WITH nd as (select * from sales order by id limit 100),
people_update as (update people p set first_name = nd.first_name
    from nd where p.email = nd.email
    returning nd.id)
insert into people (first_name, email, created_at, updated_at)
select first_name, email, now(), now()
from nd
left join people_update using (id)
where people_update.id is null;
This works pretty well except when the top 100 records contain
duplicate email addresses (two sales for the same email address).
I am wondering what the best strategy is for dealing with this
scenario. Processing the records one at a time would work, but it
would obviously be much slower. There are no other columns I can
rely on to make the records more unique either.
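For example, suppose the top 100 contain rows like these (data made up
for illustration):

```sql
-- hypothetical rows in the top-100 batch:
--   sales: id=1, email='a@example.com', first_name='Ann'
--          id=2, email='a@example.com', first_name='Anne'
-- If 'a@example.com' is not yet in people, people_update returns no ids,
-- so both rows pass the "people_update.id is null" filter and the insert
-- tries to create two people rows for the same email (or raises a unique
-- violation if people(email) has a unique constraint).
```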
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Sun, Feb 10, 2013 at 5:23 PM, Tim Uckun <timuckun@gmail.com> wrote:
> This works pretty well except when the top 100 records contain
> duplicate email addresses (two sales for the same email address).
How is it supposed to work when the email being migrated already exists in people?
> I am wondering what the best strategy is for dealing with this
> scenario. Processing the records one at a time would work, but it
> would obviously be much slower. There are no other columns I can
> rely on to make the records more unique either.
--
Sergey Konoplev
Database and Software Architect
http://www.linkedin.com/in/grayhemp
Phones:
USA +1 415 867 9984
Russia, Moscow +7 901 903 0499
Russia, Krasnodar +7 988 888 1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com
On Feb 11, 2013, at 2:23, Tim Uckun <timuckun@gmail.com> wrote:
> This works pretty well except when the top 100 records contain
> duplicate email addresses (two sales for the same email address).
>
> I am wondering what the best strategy is for dealing with this
> scenario. Processing the records one at a time would work, but it
> would obviously be much slower. There are no other columns I can
> rely on to make the records more unique either.
The best strategy is fixing your data model so that you have a unique key. As you found out already, e-mail addresses aren't very suitable as unique keys for people. For this particular case I'd suggest adding a surrogate key.

Alternatively, you might try using (first_name, email) as your key. You'll probably still get some duplicates, but they should be fewer and perhaps few enough for your case.
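As a sketch of that composite-key variant (the constraint name is made up; it assumes people has first_name and email columns), one could enforce the key and then match on both columns:

```sql
-- hypothetical constraint name, for illustration only
alter table people
    add constraint people_first_name_email_key unique (first_name, email);

-- the update/insert matching would then use both columns, e.g.:
--   where p.first_name = nd.first_name and p.email = nd.email
```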
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
> The best strategy is fixing your data model so that you have a unique
> key. As you found out already, e-mail addresses aren't very suitable as
> unique keys for people. For this particular case I'd suggest adding a
> surrogate key.
>
> Alternatively, you might try using (first_name, email) as your key.
> You'll probably still get some duplicates, but they should be fewer and
> perhaps few enough for your case.
Yeah, I'll have to see if I can do a better match.
On 11 February 2013 07:47, Alban Hertroys <haramrae@gmail.com> wrote:
> On Feb 11, 2013, at 2:23, Tim Uckun <timuckun@gmail.com> wrote:
>> This works pretty well except when the top 100 records contain
>> duplicate email addresses (two sales for the same email address).
>> I am wondering what the best strategy is for dealing with this
>> scenario. Processing the records one at a time would work, but it
>> would obviously be much slower. There are no other columns I can
>> rely on to make the records more unique either.
>
> The best strategy is fixing your data model so that you have a unique
> key. As you found out already, e-mail addresses aren't very suitable as
> unique keys for people. For this particular case I'd suggest adding a
> surrogate key.
>
> Alternatively, you might try using (first_name, email) as your key.
> You'll probably still get some duplicates, but they should be fewer and
> perhaps few enough for your case.
I read it that he has multiple sales from the same person? In which case
pretending that the two sales were from different people isn't the correct
result at all.
I may be missing the point of the query, but wasn't it to add an entry for
each email address and (if it already exists in people) to update the
firstname field to match the firstname in the order?
Given that the results will be returned in random order, I would just change
the "nd" query to SELECT DISTINCT ON (email), but like I said I may be
missing the point.
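For concreteness, a sketch of that change (untested; it assumes the batch should still be the first 100 sales by id, and that the latest sale per email should win):

```sql
WITH batch as (
    select * from sales order by id limit 100
),
nd as (
    -- DISTINCT ON keeps one row per email; the ORDER BY decides
    -- which one survives (here, the row with the highest id)
    select distinct on (email) *
    from batch
    order by email, id desc
),
people_update as (
    update people p set first_name = nd.first_name
    from nd where p.email = nd.email
    returning nd.id
)
insert into people (first_name, email, created_at, updated_at)
select first_name, email, now(), now()
from nd
left join people_update using (id)
where people_update.id is null;
```

Deduplicating inside the batch (rather than over the whole sales table) keeps the original "top 100 by id" semantics intact.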
Geoff
> I read it that he has multiple sales from the same person? In which case
> pretending that the two sales were from different people isn't the correct
> result at all.
Actually it turns out that both cases exist.
> I may be missing the point of the query, but wasn't it to add an entry for
> each email address and (if it already exists in people) to update the
> firstname field to match the firstname in the order?
Yes. Basically the sales table is denormalized and I need to create or
update normalized tables from it. As you have pointed out, the
difficult scenario is when the same person has multiple sales.
> Given that the results will be returned in random order, I would just
> change the "nd" query to SELECT DISTINCT ON (email), but like I said I
> may be missing the point.
I think in this case I am probably going to have to go through the
records one at a time. I run into the same issues with other columns,
such as products.