Bulk inserts into two (related) tables
I'm cleaning and formatting an 800-line data file to be inserted into a
database. Some of the input file fields will be inserted into an
'organizations' table letting postgres assign sequential org_id numbers.
Other fields will be inserted into a separate 'people' table associated with
each organization. The people table insert allows postgres to assign the
person_id number and this table's foreign key, people.org_id ==
organizations.org_id.
I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table, but that is time- and effort-intensive.
Is there a way to populate the two tables sequentially without manually adding
the organizations.org_id to the appropriate foreign key column (people.org_id)
of the people table?
Rich
On 5/21/19 9:56 AM, Rich Shepard wrote:
I'm cleaning and formatting a 800-line data file to be inserted into a
database. Some of the input file fields will be inserted into an
'organizations' table letting postgres assign sequential org_id numbers.
Other fields will be inserted into a separate 'people' table associated
with
each organization. The people table insert allows postgres to assign the
person_id number and this table's foreign key, people.org_id ==
organizations.org_id.
I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. Time and effort intense.
Is there a way to insert the two tables sequentially without manually
adding
the organizations.org_id to the appropriate foreign key column
(people.org_id)
of the people table?
Well you are not going to know the org_id until the organization table
is loaded, which means something like:
1) On the first run through the file, load the organizations table.
2) Build a mapping of organization to org_id.
3) Run through the data file again and load the people data, using the
mapping from 2) to provide the people.org_id.
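Those three steps can be sketched in Python; sqlite3 stands in for postgres here, and the file layout and column names are assumptions for illustration only:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT);
    CREATE TABLE people (person_id INTEGER PRIMARY KEY,
                         org_id INTEGER REFERENCES organizations(org_id),
                         person_name TEXT);
""")

# Hypothetical stand-in for the 800-line data file: one org field, one person field.
data_file = io.StringIO("org_name,person_name\nAcme,Alice\nAcme,Bob\nWidgets,Carol\n")
rows = list(csv.DictReader(data_file))

# 1) First pass: load the organizations table, one row per distinct org,
#    letting the database assign sequential org_id numbers.
for org in sorted({r["org_name"] for r in rows}):
    conn.execute("INSERT INTO organizations (org_name) VALUES (?)", (org,))

# 2) Build a mapping of organization -> org_id.
org_ids = dict(conn.execute("SELECT org_name, org_id FROM organizations"))

# 3) Second pass: load people, using the mapping to fill people.org_id.
for r in rows:
    conn.execute("INSERT INTO people (org_id, person_name) VALUES (?, ?)",
                 (org_ids[r["org_name"]], r["person_name"]))

linked = conn.execute("""SELECT p.person_name, o.org_name
                         FROM people p JOIN organizations o USING (org_id)
                         ORDER BY p.person_id""").fetchall()
print(linked)
```

The same two-pass shape works with psycopg2 against a real postgres database; only the connection and placeholder style change.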
Rich
--
Adrian Klaver
adrian.klaver@aklaver.com
Rich:
On Tue, May 21, 2019 at 6:56 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
I'm cleaning and formatting a 800-line data file to be inserted into a
database. Some of the input file fields will be inserted into an
'organizations' table letting postgres assign sequential org_id numbers.
Other fields will be inserted into a separate 'people' table associated with
each organization. The people table insert allows postgres to assign the
person_id number and this table's foreign key, people.org_id ==
organizations.org_id.
I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. Time and effort intense.
Is there a way to insert the two tables sequentially without manually adding
the organizations.org_id to the appropriate foreign key column (people.org_id)
of the people table?
From how you say it, I assume you have some data in your original
dumps which can relate both; let's assume it's org_name, but it may be
an org-code. If you do not have it, you cannot match people to orgs in
your data, and all is lost.
You can do it in a couple of steps: first do a COPY (org_id, org_name,
org.others) into the organizations table,
then create a temporary table and COPY (people.id, people.org_name,
people.*) into it,
and then
  insert into people
  select tmp_people.id, org.org_id as people_org_id, tmp_people.*
  from tmp_people, organizations org
  where tmp_people.org_name = org.org_name;
and drop the temp table.
If the matching data is some kind of code you do not want in the final
organization table, create a temporary organizations table, copy into
it, join it with the temporary people table, insert into the final
organizations table slicing with a select, drop it.
If you are using a sequence or similar thing for generating
organization ids and use the second approach, remember to use the same
sequence to generate the default values into the temporary table, or
chaos will ensue ( not really, but you'll need to advance it manually
).
This is a classic problem, and the classic way to solve it is just
that: prepare a script which loads some temporary tables and then
inserts joins into the final ones. On small data sets like yours you
can just edit everything into a single SQL script.
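A minimal sketch of that staging-table approach; sqlite3 stands in for postgres, executemany stands in for COPY, and the table and column names are assumed from the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT);
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, org_id INTEGER,
                         person_name TEXT);
    -- temporary staging table that still carries the matching column (org_name)
    CREATE TEMP TABLE tmp_people (org_name TEXT, person_name TEXT);
""")

# Load the org rows and the raw people rows (a stand-in for COPY / \copy).
conn.executemany("INSERT INTO organizations (org_name) VALUES (?)",
                 [("Acme",), ("Widgets",)])
conn.executemany("INSERT INTO tmp_people VALUES (?, ?)",
                 [("Acme", "Alice"), ("Widgets", "Bob")])

# Insert the join of the staging table with organizations, resolving
# org_name to the generated org_id, then drop the staging table.
conn.execute("""
    INSERT INTO people (org_id, person_name)
    SELECT o.org_id, t.person_name
    FROM tmp_people t JOIN organizations o ON t.org_name = o.org_name
""")
conn.execute("DROP TABLE tmp_people")

people_rows = conn.execute(
    "SELECT org_id, person_name FROM people ORDER BY person_id").fetchall()
print(people_rows)
```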
Francisco Olarte.
On Tue, 21 May 2019, Adrian Klaver wrote:
Well you are not going to know the org_id until the organization table is
loaded, which means something like:
1) First run through file load the organizations table.
2) Build a mapping of org_id to organization.
3) Run through data file again and load people data using the mapping in 2)
to provide the people.org_id.
Adrian,
That's what I planned to do. I wondered if there was a more direct way known
to those with more experience than I have. The short answer is 'no.'
Thanks,
Rich
For each row-
Insert into organizations table if the record does not exist, returning ID.
Insert into people using that ID.
Else, load all the data with empty ID column on person table, then just
update the person table afterward and drop the org name column.
Perhaps I am missing something.
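A per-row sketch of the first option; sqlite3's lastrowid stands in for postgres's INSERT ... RETURNING org_id, and the schema is assumed from the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (org_id INTEGER PRIMARY KEY, org_name TEXT UNIQUE);
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, org_id INTEGER,
                         person_name TEXT);
""")

def insert_row(org_name, person_name):
    # Insert into the organizations table only if the record does not exist...
    found = conn.execute("SELECT org_id FROM organizations WHERE org_name = ?",
                         (org_name,)).fetchone()
    if found:
        org_id = found[0]
    else:
        # ...returning the new ID (in postgres: INSERT ... RETURNING org_id).
        org_id = conn.execute("INSERT INTO organizations (org_name) VALUES (?)",
                              (org_name,)).lastrowid
    # Insert into people using that ID.
    conn.execute("INSERT INTO people (org_id, person_name) VALUES (?, ?)",
                 (org_id, person_name))

for org, person in [("Acme", "Alice"), ("Acme", "Bob"), ("Widgets", "Carol")]:
    insert_row(org, person)

people_rows = conn.execute(
    "SELECT org_id, person_name FROM people ORDER BY person_id").fetchall()
print(people_rows)
```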
On Tue, 21 May 2019, Francisco Olarte wrote:
From how you say it, I assume you have some data in your original
dumps which can relate boths, lets assume it's org_name, but may be an
org-code. If you do not have it it means you cannot match people to
orgs in your data, all is lost.
Francisco,
Not yet with these new data.
I'll manually insert the org_id numbers from the organizations table into
the people table.
Thanks,
Rich
On 5/21/19 10:22 AM, Rich Shepard wrote:
On Tue, 21 May 2019, Adrian Klaver wrote:
Well you are not going to know the org_id until the organization table
is loaded, which means something like:
1) First run through file load the organizations table.
2) Build a mapping of org_id to organization.
3) Run through data file again and load people data using the mapping
in 2) to provide the people.org_id.
Adrian,
That's what I planned to do. I wondered if there was a more direct way
known
to those with more experience than I have. The short answer is 'no.'
The other way is to create the org_id for each organization ahead of
time and put it into the data file. Either way you have to create the
org_id for the FK relationship; it is just a matter of where and when.
Thanks,
Rich
--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, 21 May 2019, Michael Lewis wrote:
For each row-
Insert into organizations table if the record does not exist, returning ID.
Insert into people using that ID.
Michael,
The org_id will not exist until I run the insert script.
Else, load all the data with empty ID column on person table,then just
update the person table afterward and drop the org name column.
Then I'll copy the org_id numbers to the appropriate row(s) in the people
table.
Thanks,
Rich
On Tue, 21 May 2019, Adrian Klaver wrote:
The other way is to create the org_id for each organization ahead of time
and put it into the data file. Either way you have to create the org_id for
the FK relationship, it is just a matter of where and when.
Adrian,
I had thought of that but overlooked it. The max(org_id) number is 338 so I
can assign sequential numbers above that as I format the data for the insert
scripts.
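Numbering the new organizations above the existing max(org_id) of 338 can be sketched like this (the org names are made up); note Francisco's caveat about afterwards advancing the sequence behind org_id:

```python
# Hypothetical list of new organizations extracted from the data file.
new_orgs = ["Acme", "Widgets", "Initech"]

# Assign sequential ids above the current max(org_id) in the organizations table.
start = 338 + 1
numbered = {org: org_id for org_id, org in enumerate(new_orgs, start=start)}
print(numbered)

# After inserting rows with explicit ids, the sequence must be advanced so the
# next default nextval() does not collide, e.g. in postgres:
#   SELECT setval('organizations_org_id_seq',
#                 (SELECT max(org_id) FROM organizations));
```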
Thanks for the reminder,
Rich
On 5/21/19 12:27 PM, Rich Shepard wrote:
On Tue, 21 May 2019, Michael Lewis wrote:
For each row-
Insert into organizations table if the record does not exist, returning ID.
Insert into people using that ID.
Michael,
The org_id will not exist until I run the insert script.
Else, load all the data with empty ID column on person table, then just
update the person table afterward and drop the org name column.
Then I'll copy the org_id numbers to the appropriate row(s) in the people
table.
A Perl/Python/etc script would solve your problem.
--
Angular momentum makes the world go 'round.
On May 21, 2019, at 9:56 AM, Rich Shepard <rshepard@appl-ecosys.com> wrote:
I could insert all new rows into the organizations table, then produce a
list of the org_id and org_name to manually insert the foreign key in the
related people table. Time and effort intense.
You can either use a script for the entire process, or, convert the people table to INSERT statements that have a SELECT for the foreign key as part of the insert.
INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM org WHERE org_name='Main Office'))
On Tue, 21 May 2019, Will Hartung wrote:
You can either use a script for the entire process, or, convert the people
table to INSERT statements that have a SELECT for the foreign key as part
of the insert.
INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id
FROM org WHERE org_name='Main Office'))
Thanks, Will. That's a good alternative.
Regards,
Rich
On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshepard@appl-ecosys.com>
wrote:
On Tue, 21 May 2019, Francisco Olarte wrote:
From how you say it, I assume you have some data in your original
dumps which can relate boths, lets assume it's org_name, but may be an
org-code. If you do not have it it means you cannot match people to
orgs in your data, all is lost.
Francisco,
Not yet with these new data.
I'll manually insert the org_id numbers from the organizations table into
the people table.
To me, this is the key to your problem and what will either make this a
time saver or a time waster. Somehow you are accounting for what uniquely
identifies organizations, right?
Say there are 5 fields that correspond to an organization. I assume then
you are creating only one new org_id for each unique combination of these
fields?
Then take Francisco's suggestion, only use an md5 of the organization
fields to create yourself a unique identifier. Then you can use ctid
(the internal identifier of each row) to join back. You can use SQL like
this:
SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM table;
Assume in example below that your unique "org" rows are the first 3 fields:
1. Load your data file into a loading table like so:
CREATE TABLE loader (org_name text, org_stuff_1 text, org_stuff_2 text,
person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;
Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');
2. Load org table:
test=# CREATE TABLE organizations (org_id serial primary key, org_name
text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4
3. Build mapping directly and update:
-- build hash of org fields in loader table; take ctid in order to map back later
WITH map_source AS (
  SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
  FROM loader)
-- build hash of org fields in organizations table to join back to loader,
-- and bring in org_id of course
, map_org AS (
  SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
  FROM organizations)
-- map by joining together on hash_identifier
, final_map AS (
  SELECT org_id, ctid
  FROM map_source l
  INNER JOIN map_org o USING (hash_identifier)
)
-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;
Final data ready for the person table to be populated:
test=# table organizations;
org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
1 | a | | b
2 | a | |
3 | a | | c
4 | b | | c
(4 rows)
test=# table loader;
org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
 a        |             |             | Jerry       |      2
 a        |             | b           | Bob         |      1
 a        |             | b           | Janice      |      1
 a        |             | c           | Chris       |      3
 b        |             | c           | Jason       |      4
 a        |             |             | Alice       |      2
(6 rows)
Hope this helps!
Thanks,
Jeremy
On Tue, May 21, 2019 at 7:24 PM Rich Shepard <rshepard@appl-ecosys.com> wrote:
From how you say it, I assume you have some data in your original
dumps which can relate boths, lets assume it's org_name, but may be an
org-code. If you do not have it it means you cannot match people to
orgs in your data, all is lost.
Not yet with these new data.
I'll manually insert the org_id numbers from the organizations table into
the people table.
I'm curious: what org_id do you put (manually) to the people? You must
have some way to match it ( like, say, "I have an organization line,
followed by lines for people in this organization" ).
Because if you cannot match them by hand, the DB is not going to be of use.
Knowing how you know the org for a people is crucial for a solution.
I.e., for the previous example I'd just add a synthetic org_code line
to each one and do the two temp tables trick for a long file, or just
advance the sequence ( if needed manually, to reserve org_id ) and
then do a perl one-liner to augment the data ( but I've been perling
since the 4.019 times, so this may not be as easy for others ).
Francisco Olarte.
Jeremy:
On Tue, May 21, 2019 at 11:58 PM Jeremy Finzel <finzelj@gmail.com> wrote:
Then take Francisco's suggestion, only use an md5 of the organization fields to create yourself a unique identifier. Then you can use ctid (unique internal identifier for each row) to join back. You use SQL like this:
Sadly, my suggestion only works if you can ( manually ) assign an
organization line to a people line; md5, field concatenation and
everything else are just optimization.
From what the OP has already told us, you have a heap of people, a heap
of organizations and a magic device to assign one to the others; the
org_id assignment ( using an unknown algorithm, we do not know if he
wants sequences, texts or cat gifs as IDs ) is easy, but the pairing
part is unsolvable with the data we have.
At this moment I think the only useful link for this is
http://xyproblem.info/ ( for the OP, not for you ).
Francisco Olarte.
On Wed, 22 May 2019, Francisco Olarte wrote:
I'm curious, what org_id do you put (manually) to the people? you must
have some way to match it ( like, say, "I have an organization line,
followed by lines for people in this organization"
Francisco,
The first data entered was in small chunks so I'd add a few rows to the
organization table, output a file of org_id and org_name, then use the
source data to associate that org_id to the people associated with it.
With this new data source I want to populate the two tables more quickly. I
think the time-consuming part is associating people with their organization.
I'll do more thinking about this.
I don't do perl, though.
Thanks,
Rich
On Wed, May 22, 2019 at 7:40 AM Rich Shepard <rshepard@appl-ecosys.com>
wrote:
On Wed, 22 May 2019, Francisco Olarte wrote:
I'm curious, what org_id do you put (manually) to the people? you must
have some way to match it ( like, say, "I have an organization line,
followed by lines for people in this organization"
Francisco,
The first data entered was in small chunks so I'd add a few rows to the
organization table, output a file of org_id and org_name, then use the
source data to associate that org_id to the people associated with it.
With this new data source I want to populate the two tables more quickly. I
think the time consuming part is associating people with their
organization.
I'll do more thinking about this.
I don't do perl, though.
There's absolutely no need to use anything beyond SQL here, though you
could if you want to.
I really wonder how much we are just talking past each other simply because
we don't know what your data looks like, so we can't show you how our
examples apply to your use case. If you provided a sample scrubbed data
file, this whole thread probably would have been much shorter :). Can you
do that?
You said here again the most time consuming part is associating people with
their organization. Well, that's the whole question I was trying to
optimize on. You told Francisco that the data file does not have a unique
org name that could be used as a unique organization identifier. However
you seem to have contradicted that by responding favorably to this solution:
INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id FROM
org WHERE org_name='Main Office'))
How can this solution work if you said org_name cannot be used to link a
person in the data file?
So, the question again becomes: does your data file have a combination of
org fields that allows you to assign a person to a unique organization?
- If the answer is still no, how could you ever assign people to a
unique organization? In that case you need to massage your data file first
before trying to build a database schema, as Francisco noted. This is
basically a non-starter for your database schema.
- If the answer is yes, that means you can use the combination of those
fields to uniquely identify an organization, and thus link people to it
later. That's the reason for the md5: to easily use many fields in
combination as a unique id.
Thanks,
Jeremy
On Wed, 22 May 2019, Jeremy Finzel wrote:
There's absolutely no need to use anything beyond SQL here, though you
could if you want to.
Jeremy,
This is a new experience for me so I didn't think of a SQL solution.
I really wonder how much we are just talking past each other simply because
we don't know what your data looks like, so we can't show you how our
examples apply to your use case. If you provided a sample scrubbed data
file, this whole thread probably would have been much shorter :). Can you
do that?
Not necessary; see below. Also, these data come from a regulator and are
provided as an Excel spreadsheet. If they were extracted from a database
then that database was very poorly designed, because there's no
consistency in how fields/columns are formatted. This requires manual
cleaning.
Each row in the source file (exported from the spreadsheet as .csv and
renamed to .txt for processing in emacs and awk) is a mixture of attributes
that belong in either or both of the organization and people tables in my
database. An awk script will extract the appropriate fields for each table.
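That extraction step could equally be sketched in Python (the real script is awk, and the column layout here is invented for illustration): split each mixed row into the fields destined for each table, keeping org_name on the people side so the rows can be linked later.

```python
import csv
import io

# Invented stand-in for the cleaned source file: each row mixes org and
# person attributes, as described above.
source = io.StringIO(
    "org_name,org_city,person_name,person_email\n"
    "Acme,Portland,Alice,alice@example.com\n"
    "Widgets,Salem,Bob,bob@example.com\n"
)

orgs, people = [], []
for row in csv.DictReader(source):
    # Fields destined for the organizations table...
    orgs.append((row["org_name"], row["org_city"]))
    # ...and fields destined for the people table, keeping org_name so the
    # people rows can later be joined back to their organization.
    people.append((row["org_name"], row["person_name"], row["person_email"]))

print(orgs)
print(people)
```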
You told Francisco that the data file does not have a unique org name that
could be used as a unique organization identifier. However you seem to
have contradicted that by responding favorably to this solution:
The org_name is not the PK; the org_id is. This was assigned by postgres
when the original rows were inserted. Now, I can add the org_id in the
values to be inserted as I know the maximum org_id number in that table.
INSERT INTO PEOPLE (id, name, org_id) VALUES (1, ‘Bob’, (SELECT org_id FROM
org WHERE org_name=‘Main Office’))
Question: do I use this same syntax for each row to be inserted or can I
make it one long insert statement by separating the parenthesized values
with commas as I do when I update multiple rows in a table?
Thanks very much,
Rich
Each row in the source file (exported from the spreadsheet as .csv and
renamed to .txt for processing in emacs and awk) is a mixture of attributes
Absolutely no need to rename the .csv for those tools.
that belong in either or both of the organization and people tables in my
database. An awk script will extract the appropriate fields for each table.
On 5/22/19 7:38 AM, Rich Shepard wrote:
On Wed, 22 May 2019, Jeremy Finzel wrote:
There's absolutely no need to use anything beyond SQL here, though you
could if you want to.
Jeremy,
This is a new experience for me so I didn't think of a SQL solution.
I really wonder how much we are just talking past each other simply
because
we don't know what your data looks like, so we can't show you how our
examples apply to your use case. If you provided a sample scrubbed data
file, this whole thread probably would have been much shorter :). Can
you
do that?
Not necessary; see below. Also, these data come from a regulator and
provided as an Excel spreadsheet. If they were extracted from a database
then that was very poorly designed because there's no consistency in how
fields/columns are formatted. This requires manual cleaning.
Each row in the source file (exported from the spreadsheet as .csv and
renamed to .txt for processing in emacs and awk) is a mixture of attributes
that belong in either or both of the organization and people tables in my
database. An awk script will extract the appropriate fields for each table.
So does the people data have an organization attribute?
If so why not just assign the org_id while cleaning up the data?
You told Francisco that the data file does not have a unique org name
that
could be used as a unique organization identifier. However you seem to
have contradicted that by responding favorably to this solution:
The org_name is not the PK; the org_id is. This was assigned by postgres
when the original rows were inserted. Now, I can add the org_id in the
values to be inserted as I know the maximum org_id number in that table.
INSERT INTO PEOPLE (id, name, org_id) VALUES (1, 'Bob', (SELECT org_id
FROM org WHERE org_name='Main Office'))
Question: do I use this same syntax for each row to be inserted or can I
make it one long insert statement by separating the parenthesized values
with commas as I do when I update multiple rows in a table?
Thanks very much,
Rich
--
Adrian Klaver
adrian.klaver@aklaver.com