Manual query vs trigger during data load

Started by yudhi sover 1 year ago10 messagesgeneral
Jump to latest
#1yudhi s
learnerdatabase99@gmail.com

Hello All,

We are having a table which is going to be inserted with 100's of millions
of rows each day. And we now want to have a requirement in which we need to
do some transformation/lookup logic built on top of a few of the input bind
values , while inserting the data. So I wanted to understand ,is it
possible to do it along with the INSERT query or is it better to have a
trigger created for the same?

For. e.g Below is the current Insert query used in the Java code. We want
to fetch the value for "column2" from a lookup table rather than directly
inserting as it's coming from the customer side. So I am thinking of a
trigger like below. But at the same time I also want to compare the
performance of a normal way of doing the lookup vs having it performed
using triggers.

So one way i am thinking is first fetching the value of the "column2" from
reference_tab1 using a separate "select query" in Java code itself, and
then passing that to the below insert query, but i think that will increase
the response time as that will be a separate DB call.

1)So, is there a way I can do it directly using the single INSERT query
itself without additional SELECT query? And then will try to compare that
with the trigger based approach.
2)Additionally , if this decision will impact a batch insert approach. i.e.
say , in case of trigger , will the batch insert fail because trigger will
force it to make it row by row?

INSERT INTO tab_part1 (column1, column2, column3, column4, column5,
part_date)
VALUES (:v_col1, :v_col2, :v_col3, :v_col3, :v_col4,:v_col5,
CURRENT_DATE);

CREATE OR REPLACE FUNCTION trg_func_populate_column2() RETURNS TRIGGER AS $$
BEGIN
-- Fetch reference value and populate column2
NEW.column2 := (SELECT lookup_key FROM reference_tab1 WHERE lookup_key
= old.column2);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Regards
Yudhi

#2Thiemo Kellner
thiemo@gelassene-pferde.biz
In reply to: yudhi s (#1)

Hi

To me it seems possible to create an insert select, e.g.

Insert into tab1 (val1, val2)
Select valA, valB
From tab2
Where valC = :param1

Cheers

Thiemo

#3Juan Rodrigo Alejandro Burgos Mella
rodrigoburgosmella@gmail.com
In reply to: yudhi s (#1)
Re: Manual query vs trigger during data load

Hello, I find it unlikely that the trigger will work properly, since the
reserved fields of the OLD subset have no value in an INSERT

Atte
JRBN

El vie, 13 de sept de 2024, 04:32, yudhi s <learnerdatabase99@gmail.com>
escribió:

Show quoted text

Hello All,

We are having a table which is going to be inserted with 100's of millions
of rows each day. And we now want to have a requirement in which we need to
do some transformation/lookup logic built on top of a few of the input bind
values , while inserting the data. So I wanted to understand ,is it
possible to do it along with the INSERT query or is it better to have a
trigger created for the same?

For. e.g Below is the current Insert query used in the Java code. We want
to fetch the value for "column2" from a lookup table rather than directly
inserting as it's coming from the customer side. So I am thinking of a
trigger like below. But at the same time I also want to compare the
performance of a normal way of doing the lookup vs having it performed
using triggers.

So one way i am thinking is first fetching the value of the "column2" from
reference_tab1 using a separate "select query" in Java code itself, and
then passing that to the below insert query, but i think that will increase
the response time as that will be a separate DB call.

1)So, is there a way I can do it directly using the single INSERT query
itself without additional SELECT query? And then will try to compare that
with the trigger based approach.
2)Additionally , if this decision will impact a batch insert approach.
i.e. say , in case of trigger , will the batch insert fail because
trigger will force it to make it row by row?

INSERT INTO tab_part1 (column1, column2, column3, column4, column5,
part_date)
VALUES (:v_col1, :v_col2, :v_col3, :v_col3, :v_col4,:v_col5,
CURRENT_DATE);

CREATE OR REPLACE FUNCTION trg_func_populate_column2() RETURNS TRIGGER AS
$$
BEGIN
-- Fetch reference value and populate column2
NEW.column2 := (SELECT lookup_key FROM reference_tab1 WHERE lookup_key
= old.column2);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Regards
Yudhi

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Juan Rodrigo Alejandro Burgos Mella (#3)
Re: Manual query vs trigger during data load

On 9/13/24 02:58, Juan Rodrigo Alejandro Burgos Mella wrote:

Hello, I find it unlikely that the trigger will work properly, since the
reserved fields of the OLD subset have no value in an INSERT

I'm not seeing that the OP is asking for OLD.* values, they are just
looking to include the result of a lookup on another table in the INSERT.

Atte
JRBN

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Adrian Klaver (#4)
Re: Manual query vs trigger during data load

On 9/13/24 07:50, Adrian Klaver wrote:

On 9/13/24 02:58, Juan Rodrigo Alejandro Burgos Mella wrote:

Hello, I find it unlikely that the trigger will work properly, since
the reserved fields of the OLD subset have no value in an INSERT

I'm not seeing that the OP is asking for OLD.* values, they are just
looking to include the result of a lookup on another table in the INSERT.

My mistake I see the OLD reference now.

Atte
JRBN

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Rob Sargent
robjsargent@gmail.com
In reply to: Adrian Klaver (#5)
Re: Manual query vs trigger during data load

On Sep 13, 2024, at 10:57 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 9/13/24 07:50, Adrian Klaver wrote:

On 9/13/24 02:58, Juan Rodrigo Alejandro Burgos Mella wrote:
Hello, I find it unlikely that the trigger will work properly, since the reserved fields of the OLD subset have no value in an INSERT

I'm not seeing that the OP is asking for OLD.* values, they are just looking to include the result of a lookup on another table in the INSERT.

My mistake I see the OLD reference now.

Personally I would cache the lookup on the java side a send the correct value to a simple insert statement

Show quoted text
#7yudhi s
learnerdatabase99@gmail.com
In reply to: Adrian Klaver (#5)
Re: Manual query vs trigger during data load

On Fri, Sep 13, 2024 at 8:27 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 9/13/24 07:50, Adrian Klaver wrote:

On 9/13/24 02:58, Juan Rodrigo Alejandro Burgos Mella wrote:

Hello, I find it unlikely that the trigger will work properly, since
the reserved fields of the OLD subset have no value in an INSERT

I'm not seeing that the OP is asking for OLD.* values, they are just
looking to include the result of a lookup on another table in the INSERT.

My mistake I see the OLD reference now.

My mistake.The trigger was supposed to use "new.col2" and fetch the
corresponding lookup value from the lookup table and insert that value to
the target table.

Now my question was ,in such a situation , the trigger will work fine , but
is that the optimal way of doing ? Or should we convert the query someway
such that the lookup table can be queried along with the INSERT at one shot
from the database with a single DB call? And is it true that the trigger on
the target table will suppress the batch insert and make it row by row,
even if we call it in a batch fashion?

As "thiemo" mentioned , it can be done as below method, but if we have
multiple lookup tables to be populated for multiple columns , then , how
can the INSERT query be tweaked to cater the need here? And I understand ,
the lookup table can be cached in Java and refreshed at a certain point in
time, but I was trying to understand if this can be doable by directly
querying the database, considering the lookup tables are having large data
sets in them.

Insert into tab1 (val1, val2)
Select valA, valB
From tab2
Where valC = :param1

#8Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: yudhi s (#7)
Re: Manual query vs trigger during data load

On 2024-09-14 00:54:49 +0530, yudhi s wrote:

As "thiemo" mentioned , it can be done as below method, but if we have
multiple lookup tables to be populated for multiple columns , then , how can
the INSERT query be tweaked to cater the need here?

Just use a join:
insert into target(val1, val2, val3, val4)
select :param1, cfgA.substA, :param3, cfgB.substB
from cfgA, cfgB
where cfgA.keyA = :param2 and cfgB.keyB = :param4

Or use a CTE per lookup which might be more readable:

with cA as ( select substA from cfgA where keyA = :param2 ),
cB as ( select substB from cfgB where keyB = :param4 )
insert into target(val1, val2, val3, val4)
select :param1, cA.substA, :param3, cB.substB
from cA, cB

However, I agree with Rob here. It's probably better to do the
substitution in Java.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#9yudhi s
learnerdatabase99@gmail.com
In reply to: Peter J. Holzer (#8)
Re: Manual query vs trigger during data load

On Sat, Sep 14, 2024 at 4:17 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

On 2024-09-14 00:54:49 +0530, yudhi s wrote:

As "thiemo" mentioned , it can be done as below method, but if we have
multiple lookup tables to be populated for multiple columns , then , how

can

the INSERT query be tweaked to cater the need here?

Just use a join:
insert into target(val1, val2, val3, val4)
select :param1, cfgA.substA, :param3, cfgB.substB
from cfgA, cfgB
where cfgA.keyA = :param2 and cfgB.keyB = :param4

Or use a CTE per lookup which might be more readable:

with cA as ( select substA from cfgA where keyA = :param2 ),
cB as ( select substB from cfgB where keyB = :param4 )
insert into target(val1, val2, val3, val4)
select :param1, cA.substA, :param3, cB.substB
from cA, cB

Thank you. I will try these options.
Also we are trying to do something as below , which will separate the
tables based on the specific lookup fields for the target tables and thus
it will look simple rather than using those reference tables in the From
clause which may cause some confusion in reading the code or not sure if it
will cause cartesian. Please correct me if I'm wrong.

INSERT INTO tab_part1 (column1, column2, column3, column4, column5,
part_date)
VALUES ( :v_col1, (SELECT lookup_value FROM reference_tab1 WHERE lookup_key
= :v_col2), :v_col3, :v_col4, :v_col5, CURRENT_DATE );

#10Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: yudhi s (#9)
Re: Manual query vs trigger during data load

On 2024-09-14 21:21:45 +0530, yudhi s wrote:

On Sat, Sep 14, 2024 at 4:17 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2024-09-14 00:54:49 +0530, yudhi s wrote:

As "thiemo" mentioned , it can be done as below method, but if
we have multiple lookup tables to be populated for multiple
columns , then , how can the INSERT query be tweaked to cater
the need here?

Just use a join:
    insert into target(val1, val2, val3, val4)
    select :param1, cfgA.substA, :param3, cfgB.substB
    from cfgA, cfgB
    where cfgA.keyA = :param2 and cfgB.keyB = :param4

Or use a CTE per lookup which might be more readable:

    with cA as ( select substA from cfgA where keyA = :param2 ),
         cB as ( select substB from cfgB where keyB = :param4 )
    insert into target(val1, val2, val3, val4)
    select :param1, cA.substA, :param3, cB.substB
    from cA, cB

Thank you. I will try these options. 
Also we are trying to do something as below , which will separate the tables
based on the specific lookup fields for the target tables and thus it will look
simple rather than using those reference tables in the From clause which may
cause some confusion in reading the code or not sure if it will cause
cartesian. Please correct me if I'm wrong.

My examples do form a cartesian product, but as long as the keys are
unique, that's 1 * 1 * 1 ... * 1 = 1 rows. So that should not be a
problem in case of simple lookup tables.

That may not be immediately apparent to someone reading the code,
though. And it might fail horribly if the lookups aren't guaranteed to
return a single row.

INSERT INTO tab_part1 (column1, column2, column3, column4, column5, part_date)
VALUES ( :v_col1, (SELECT lookup_value FROM reference_tab1 WHERE lookup_key =
:v_col2), :v_col3, :v_col4, :v_col5, CURRENT_DATE ); 

Your approach is safer in that it will abort with an error if the
subquery ever returns more than one value. It will also still insert a
row (with null in column2) if the subquery returns no rows, which may or
may not be what you want (and if you don't want it you can probably
prevent it with a not null constraint). Looks good to me.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"