scaling up from t1n to 60 million records

Started by Martin Mueller6 days ago6 messagesgeneral
Jump to latest
#1Martin Mueller
martinmueller@northwestern.edu

I use Postgres with a GUI frontend (Aquafold) as a very large spreadsheet on steroids that analyzes rare or defective spellings in a corpus of 65,000 texts and1.5 billion words. I typically extract data from the corpus with python scripts, turn them into tables and load them into the database.

On my Mac with 32 GB of memory performance is OK with queries that typically within seconds extract data rows from tables with up to ten million rows. If the result set is large, I suspect that most of time machine's time is spent displaying result sets. I have used indexing sparingly. While it helps, the time savings often don't matter much.

I am thinking about scaling up to table with about 60 million rows. Are there things to do or watch out for? Or should I proceed on the assumption that that 60 million records are within scope and that the added timecost is roughly linear?

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

#2Jan Karremans
karremans.ja@gmail.com
In reply to: Martin Mueller (#1)
Re: scaling up from t1n to 60 million records

Dear Martin,

I think you would be mostly good for just going ahead with this.
You might look at the size of your tables, but I expect that all to be well within safe ranges.

Cheers,
Jan

Show quoted text

On 19 May 2026, at 16:27, Martin Mueller <martinmueller@northwestern.edu> wrote:

I use Postgres with a GUI frontend (Aquafold) as a very large spreadsheet on steroids that analyzes rare or defective spellings in a corpus of 65,000 texts and1.5 billion words. I typically extract data from the corpus with python scripts, turn them into tables and load them into the database.

On my Mac with 32 GB of memory performance is OK with queries that typically within seconds extract data rows from tables with up to ten million rows. If the result set is large, I suspect that most of time machine's time is spent displaying result sets. I have used indexing sparingly. While it helps, the time savings often don't matter much.

I am thinking about scaling up to table with about 60 million rows. Are there things to do or watch out for? Or should I proceed on the assumption that that 60 million records are within scope and that the added timecost is roughly linear?

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

#3Ron
ronljohnsonjr@gmail.com
In reply to: Martin Mueller (#1)
Re: scaling up from t1n to 60 million records

On Tue, May 19, 2026 at 10:27 AM Martin Mueller <
martinmueller@northwestern.edu> wrote:

I use Postgres with a GUI frontend (Aquafold) as a very large spreadsheet
on steroids that analyzes rare or defective spellings in a corpus of 65,000
texts and1.5 billion words. I typically extract data from the corpus with
python scripts, turn them into tables and load them into the database.

On my Mac with 32 GB of memory performance is OK with queries that
typically within seconds extract data rows from tables with up to ten
million rows. If the result set is large, I suspect that most of time
machine's time is spent displaying result sets. I have used indexing
sparingly. While it helps, the time savings often don't matter much.

I am thinking about scaling up to table with about 60 million rows. Are
there things to do or watch out for?

Use the correct tool for the task at hand, even if you are not a carpenter
and thus only know how to use a hammer.

Or should I proceed on the assumption that that 60 million records are

within scope and that the added timecost is roughly linear?

In my experience, database performance shows a hockey stick graph: good
while stuff fits in memory, and then suddenly not so good.

The correct tool for full text search is PG's Full Text Search (ts_vector)
facility, paired with GIN indexes. Do you use them? Probably not, based
on your comments, but that would "keep 'everything' in memory", thus
staving off performance degradation.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Martin Mueller (#1)
Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:

I use Postgres with a GUI frontend (Aquafold) as a very large
spreadsheet on steroids that analyzes rare or defective spellings in a
corpus of 65,000 texts and1.5 billion words.  I typically extract  data
from the corpus with python scripts, turn them into tables and load them
into the database.

On my Mac with 32 GB of memory performance is OK with queries that
typically within seconds extract data rows from tables  with up to ten
million rows.  If the result set is large, I suspect that most of time
machine's time is spent displaying result sets. I have used indexing
sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

I am thinking about scaling up to table with about 60 million rows.  Are
there things to do or watch out for? Or should I proceed on the
assumption that that 60 million records are within scope and that the
added timecost is roughly linear?

Martin Mueller

Professor emeritus of English and Classics

Northwestern University

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Martin Mueller
martinmueller@northwestern.edu
In reply to: Adrian Klaver (#4)
Re: scaling up from t1n to 60 million records

Here is a more detailed version.

I work on the curation of a corpus of some 65,000 Early Modern texts with 1.5 billion words. They exist as TEI-XML files and each word is wrapped in a <w> element. Here are the first and last two words in the corpus

                  <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>

      <w lemma="begin" pos="vvz" reg="beginneth" xml:id="a73abc-001-b-0020">begynneth</w>

...

     <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>

<pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>

The corpus has many corrupt spellings and errors in the linguistic annotation. Most of them are low-frequency phenomena and occur in no more than 64 documents. In nearly all cases you have enough evidence to correct a word or its annotation if you can see the word in the middle of a text string that includes

1.
the spelling
2.
the lemma
3.
the part of speech tag
4.
a standard spelling (e.g. 'loue' for 'love')
5.
up to seven previous words
6.
up to seven next words
7.
the spelling and POS tag of the previous
8.
the lemma and POS tag of the next word
9.
the Xpath of the current work

My goal is to involve users of the corpus in identifying and correcting corrupt readings. I call this a "philological shopping cart" since the offering of a correction can be thought of as a sale. Instead of buying something, with the machine registering the who, what, when, and where of the purchase, I offer an emendation, with the machine registering the who, what, when, and where of my emendation.

My hunch is that it would not be particularly difficult to build such a philological shopping cart and that in terms of scale it would not be a big thing.

I am trying to mirror that "shopping cart" on my Mac. There are about 60 million word occurrences that occur in no more than 64 texts. The basic table has the columns described above, and half a dozen other columns for data entry and various counts. There are some helper tables. The most important of them is a simple case-insensitive list of spellings with their document frequencies. This is very useful for finding suspect spellings with queries like "show me all spellings in a low frequency range that contain 'tb' and look for words where replacing 'tb' with 'th' will find a word with a higher document frequency. That picks up spellings like 'tbe', 'tbat', 'autboritie', etc.

I've worked with KWIC tables of this kind for several years. I have Aqua Data Studio as a front end for Postgres, currently version 17, running on a five-year old Mac with an Intel processor and 32 GB of memory. I know a lot less about the innards of a SQL database than I should.

My largest kwic table has about 15 million rows with dozen columns for each row. Except for the left and right context, the columns consists of single words or numbers. The left and right context columns rarely add up to more than 35 characters each.. I have used plain indexes for some columns, with commands like "Create index on kwics16(keyword)", where 'kwics16' is the table name. My typical routine takes a single-user interactive form: ask a query, wait for the results (typically seconds, sometimes a minute or more), and do something with the results. I know next to nothing about the size of the database or tables, and it's not something Ihave needed to worry about. There are occasional memory bottle necks, because Aqua Data Studio isn't particularly good at release memory once it's no longer used. Closing and reopening the client fixes that.

It takes an hour or so to upload a table of this kind into the database. Several tables of that size exist on my database and don't cause any trouble. I don't know at what point I would be running into constraints of an aging Mac with 32 GB of memory and a 2 TB hard drive.

I could comfortably live with what I'm doing now, dividing the data into three or four frequency ranges.

Given this information, should I try and create a single table or am I likely to run into serious constraints if I move beyond my current maximum table size of 15 million records.

Perhaps there is no clear answer, and I should just experiment. But if any reader with more knowledge of Postgres thinks that in my environment I would be skating on thin ice if I move beyond current limits, I'd be grateful to be told so.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
From: Adrian Klaver <adrian.klaver@aklaver.com>
Date: Tuesday, May 19, 2026 at 09:45
To: Martin Mueller <martinmueller@northwestern.edu>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:

I use Postgres with a GUI frontend (Aquafold) as a very large
spreadsheet on steroids that analyzes rare or defective spellings in a
corpus of 65,000 texts and1.5 billion words. I typically extract data
from the corpus with python scripts, turn them into tables and load them
into the database.

On my Mac with 32 GB of memory performance is OK with queries that
typically within seconds extract data rows from tables with up to ten
million rows. If the result set is large, I suspect that most of time
machine's time is spent displaying result sets. I have used indexing
sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

I am thinking about scaling up to table with about 60 million rows. Are
there things to do or watch out for? Or should I proceed on the
assumption that that 60 million records are within scope and that the
added timecost is roughly linear?

Martin Mueller

Professor emeritus of English and Classics

Northwestern University

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Ron
ronljohnsonjr@gmail.com
In reply to: Martin Mueller (#5)
Re: scaling up from t1n to 60 million records

Indices are your friend! (Except when loading data.)

Add them on any relevant column.
https://www.postgresql.org/docs/16/sql-createindex.html

The "(expression)" clause might be useful in your situation, since it can
exclude some words from an index, exclude empty cells, index upper-case
versions of the word, etc.

On Tue, May 19, 2026 at 2:53 PM Martin Mueller <
martinmueller@northwestern.edu> wrote:

Here is a more detailed version.

I work on the curation of a corpus of some 65,000 Early Modern texts with
1.5 billion words. They exist as TEI-XML files and each word is wrapped
in a <w> element. Here are the first and last two words in the corpus

<w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>

<w lemma="begin" pos="vvz" reg="beginneth" xml:id="a73abc-001-b-0020">begynneth</w>

...

<w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>

<pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>

The corpus has many corrupt spellings and errors in the linguistic
annotation. Most of them are low-frequency phenomena and occur in no more
than 64 documents. In nearly all cases you have enough evidence to correct
a word or its annotation if you can see the word in the middle of a text
string that includes

1. the spelling
2. the lemma
3. the part of speech tag
4. a standard spelling (e.g. 'loue' for 'love')
5. up to seven previous words
6. up to seven next words
7. the spelling and POS tag of the previous
8. the lemma and POS tag of the next word
9. the Xpath of the current work

My goal is to involve users of the corpus in identifying and correcting
corrupt readings. I call this a "philological shopping cart" since the
offering of a correction can be thought of as a sale. Instead of buying
something, with the machine registering the who, what, when, and where of
the purchase, I offer an emendation, with the machine registering the who,
what, when, and where of my emendation.

My hunch is that it would not be particularly difficult to build such a
philological shopping cart and that in terms of scale it would not be a big
thing.

I am trying to mirror that "shopping cart" on my Mac. There are about 60
million word occurrences that occur in no more than 64 texts. The basic
table has the columns described above, and half a dozen other columns for
data entry and various counts. There are some helper tables. The most
important of them is a simple case-insensitive list of spellings with their
document frequencies. This is very useful for finding suspect spellings
with queries like "show me all spellings in a low frequency range that
contain 'tb' and look for words where replacing 'tb' with 'th' will find a
word with a higher document frequency. That picks up spellings like 'tbe',
'tbat', 'autboritie', etc.

I've worked with KWIC tables of this kind for several years. I have Aqua
Data Studio as a front end for Postgres, currently version 17, running on a
five-year old Mac with an Intel processor and 32 GB of memory. I know a
lot less about the innards of a SQL database than I should.

My largest kwic table has about 15 million rows with dozen columns for
each row. Except for the left and right context, the columns consists of
single words or numbers. The left and right context columns rarely add up
to more than 35 characters each.. I have used plain indexes for some
columns, with commands like "Create index on kwics16(keyword)", where
'kwics16' is the table name. My typical routine takes a single-user
interactive form: ask a query, wait for the results (typically seconds,
sometimes a minute or more), and do something with the results. I know
next to nothing about the size of the database or tables, and it's not
something Ihave needed to worry about. There are occasional memory bottle
necks, because Aqua Data Studio isn't particularly good at release memory
once it's no longer used. Closing and reopening the client fixes that.

It takes an hour or so to upload a table of this kind into the database.
Several tables of that size exist on my database and don't cause any
trouble. I don't know at what point I would be running into constraints of
an aging Mac with 32 GB of memory and a 2 TB hard drive.

I could comfortably live with what I'm doing now, dividing the data into
three or four frequency ranges.

Given this information, should I try and create a single table or am I
likely to run into serious constraints if I move beyond my current maximum
table size of 15 million records.

Perhaps there is no clear answer, and I should just experiment. But if
any reader with more knowledge of Postgres thinks that in my environment I
would be skating on thin ice if I move beyond current limits, I'd be
grateful to be told so.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
*From: *Adrian Klaver <adrian.klaver@aklaver.com>
*Date: *Tuesday, May 19, 2026 at 09:45
*To: *Martin Mueller <martinmueller@northwestern.edu>;
pgsql-general@postgresql.org <pgsql-general@postgresql.org>
*Subject: *Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:

I use Postgres with a GUI frontend (Aquafold) as a very large
spreadsheet on steroids that analyzes rare or defective spellings in a
corpus of 65,000 texts and1.5 billion words. I typically extract data
from the corpus with python scripts, turn them into tables and load them
into the database.

On my Mac with 32 GB of memory performance is OK with queries that
typically within seconds extract data rows from tables with up to ten
million rows. If the result set is large, I suspect that most of time
machine's time is spent displaying result sets. I have used indexing
sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

I am thinking about scaling up to table with about 60 million rows. Are
there things to do or watch out for? Or should I proceed on the
assumption that that 60 million records are within scope and that the
added timecost is roughly linear?

Martin Mueller

Professor emeritus of English and Classics

Northwestern University

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!