Re: Add A Glossary

Started by Alvaro Herrera almost 6 years ago | 26 messages | hackers, docs

#1 Alvaro Herrera
alvherre@2ndquadrant.com

Thanks everybody. I have compiled together all the suggestions and the
result is in the attached patch. Some of it is of my own devising.

* I changed "instance", and made "cluster" be mostly a synonym of that.

* I removed "global SQL object" and made "SQL object" explain it.

* Added definitions for ACID, sequence, bloat, fork, FSM, VM, data page,
transaction ID, epoch.

* Changed "a SQL" to "an SQL" everywhere.

* Sorted alphabetically.

* Removed caps in term names.

I think I should get this pushed, and if there are further suggestions,
they're welcome.

Dim Fontaine and others suggested a number of terms that could be
included; see https://twitter.com/alvherre/status/1246192786287865856

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

glossfixes-2.patch (text/x-diff; charset=us-ascii), +287 -213

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Alvaro Herrera (#1)

On Thu, May 14, 2020 at 08:00:17PM -0400, Alvaro Herrera wrote:

+   <glossterm>ACID</glossterm>
+   <glossdef>
+    <para>
+     <glossterm linkend="glossary-atomicity">Atomicity</glossterm>,
+     <glossterm linkend="glossary-consistency">consistency</glossterm>,
+     <glossterm linkend="glossary-isolation">isolation</glossterm>, and
+     <glossterm linkend="glossary-durability">durability</glossterm>.
+     A set of properties of database transactions intended to guarantee validity
+     in concurrent operation and even in event of errors, power failures, etc.

I would capitalize Consistency, Isolation, Durability, and say "These four
properties" or "This set of four properties" (although that makes this sound
more like a fun game of DBA jeopardy).

+   <glossterm>Background writer (process)</glossterm>
<glossdef>
<para>
-     A process that continuously writes dirty pages from
+     A process that continuously writes dirty

I don't like "continuously"

+ <glossterm linkend="glossary-data-page">data pages</glossterm> from

+  <glossentry id="glossary-bloat">
+   <glossterm>Bloat</glossterm>
+   <glossdef>
+    <para>
+     Space in data pages which does not contain relevant data,
+     such as unused (free) space or outdated row versions.

"current row versions" instead of relevant ?

+  <glossentry id="glossary-data-page">
+   <glossterm>Data page</glossterm>
+   <glossdef>
+    <para>
+     The basic structure used to store relation data.
+     All pages are of the same size.
+     Data pages are typically stored on disk, each in a specific file,
+     and can be read to <glossterm linkend="glossary-shared-memory">shared buffers</glossterm>
+     where they can be modified, becoming
+     <firstterm>dirty</firstterm>.  They get clean by being written down

say "They become clean when written to disk"

+     to disk.  New pages, which initially exist in memory only, are also
+     dirty until written.
+  <glossentry id="glossary-fork">
+   <glossterm>Fork</glossterm>
+   <glossdef>
+    <para>
+     Each of the separate segmented file sets that a relation stores its
+     data in.  There exist a <firstterm>main fork</firstterm> and two secondary

"in which a relation's data is stored"

+     forks: the <glossterm linkend="glossary-fsm">free space map</glossterm>
+     <glossterm linkend="glossary-vm">visibility map</glossterm>.

missing "and" ?

+  <glossentry id="glossary-fsm">
+   <glossterm>Free space map (fork)</glossterm>
+   <glossdef>
+    <para>
+     A storage structure that keeps metadata about each data page in a table's
+     main storage space.

s/in/of/

just say "main fork"?

The free space map entry for each space stores the

for each page ?

+     amount of free space that's available for future tuples, and is structured
+     so it is efficient to search for available space for a new tuple of a given
+     size.

..to be efficiently searched to find free space..

The heap is realized within
-     <glossterm linkend="glossary-file-segment">segment files</glossterm>.
+     <glossterm linkend="glossary-file-segment">segmented files</glossterm>
+     in the relation's <glossterm linkend="glossary-fork">main fork</glossterm>.

Hm, the files aren't segmented. Say "one or more file segments per relation"

+      There also exist local objects that do not belong to schemas; some examples are
+      <glossterm linkend="glossary-extension">extensions</glossterm>,
+      <glossterm linkend="glossary-cast">data type casts</glossterm>, and
+      <glossterm linkend="glossary-foreign-data-wrapper">foreign data wrappers</glossterm>.

Don't extensions have schemas ?

+  <glossentry id="glossary-xid">
+   <glossterm>Transaction ID</glossterm>
+   <glossdef>
+    <para>
+     The numerical, unique, sequentially-assigned identifier that each
+     transaction receives when it first causes a database modification.
+     Frequently abbreviated <firstterm>xid</firstterm>.

abbreviated *as* xid

+     approximately four billion write transactions IDs can be generated;
+     to permit the system to run for longer than that would allow,

remove "would allow"
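
As an aside on the "approximately four billion" figure in the hunk above: a minimal sketch of the arithmetic, assuming a 32-bit transaction ID counter and an epoch that counts how often that counter has wrapped around. The names below are hypothetical and are not PostgreSQL functions; the sketch also ignores details such as reserved ID values.

```python
# Illustrative only: these names are hypothetical, not PostgreSQL APIs.
XID_BITS = 32                # assumed width of a transaction ID
XID_SPACE = 2 ** XID_BITS    # number of distinct 32-bit values


def split_full_xid(full_xid):
    """Split an ever-increasing counter into (epoch, xid)."""
    return divmod(full_xid, XID_SPACE)


def join_full_xid(epoch, xid):
    """Combine an epoch and a 32-bit xid back into one counter value."""
    return epoch * XID_SPACE + xid


if __name__ == "__main__":
    print(XID_SPACE)
    print(split_full_xid(XID_SPACE + 5))
```

The point of the epoch is only that the pair (epoch, xid) keeps growing even though the xid part on its own starts over after each wraparound.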

<para>
The process of removing outdated <glossterm linkend="glossary-tuple">tuple
versions</glossterm> from tables, and other closely related

actually tables or materialized views..

+  <glossentry id="glossary-vm">
+   <glossterm>Visibility map (fork)</glossterm>
+   <glossdef>
+    <para>
+     A storage structure that keeps metadata about each data page
+     in a table's main storage space.  The visibility map entry for

s/in/of/

main fork?

--
Justin

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Justin Pryzby (#2)

Applied all these suggestions, and made a few additional very small
edits, and pushed -- better to ship what we have now in beta1, but
further edits are still possible.

Other possible terms to define, including those from the tweet I linked
to and a couple more:

archive
availability
backup
composite type
common table expression
data type
domain
dump
export
fault tolerance
GUC
high availability
hot standby
LSN
restore
secondary server (?)
snapshot
transactions per second

Anybody want to try their hand at a tentative definition?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v3-0001-Review-of-the-glossary.patch (text/x-diff; charset=iso-8859-1), +316 -222

#4 Erik Rijkers
er@xs4all.nl
In reply to: Alvaro Herrera (#3)

On 2020-05-15 19:26, Alvaro Herrera wrote:

Applied all these suggestions, and made a few additional very small
edits, and pushed -- better to ship what we have now in beta1, but
further edits are still possible.

I've gone through the glossary as committed and found some more small
things; patch attached.

Thanks,

Erik Rijkers

Attachments:

glossary-20200516.sgml.diff (text/x-diff), +11 -11

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Erik Rijkers (#4)

On 2020-May-16, Erik Rijkers wrote:

On 2020-05-15 19:26, Alvaro Herrera wrote:

Applied all these suggestions, and made a few additional very small
edits, and pushed -- better to ship what we have now in beta1, but
further edits are still possible.

I've gone through the glossary as committed and found some more small
things; patch attached.

All pushed! Many thanks,

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#1)

On 15.05.20 02:00, Alvaro Herrera wrote:

Thanks everybody. I have compiled together all the suggestions and the
result is in the attached patch. Some of it is of my own devising.

* I changed "instance", and made "cluster" be mostly a synonym of that.

In my understanding, "instance" and "cluster" should be different
things, not only synonyms. "instance" can be the term for permanently
fluctuating objects (processes and RAM) and "cluster" can denote the
more static objects (directories and files). What do you think? If you
agree, I would create a patch.

* I removed "global SQL object" and made "SQL object" explain it.

+1., but see the (huge) different spellings in patch.

bloat: changed 'current row' to 'relevant row' because not only the
youngest one is relevant (non-bloat).

data type casts: Are you sure that they are global? In pg_cast
'relisshared' is 'false'.

--

Jürgen Purtz

Attachments:

0002-glossfixes-purtz.patch (text/x-patch; charset=UTF-8), +18 -14

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jürgen Purtz (#6)

On 2020-May-17, Jürgen Purtz wrote:

On 15.05.20 02:00, Alvaro Herrera wrote:

Thanks everybody. I have compiled together all the suggestions and the
result is in the attached patch. Some of it is of my own devising.

* I changed "instance", and made "cluster" be mostly a synonym of that.

In my understanding, "instance" and "cluster" should be different things,
not only synonyms. "instance" can be the term for permanently fluctuating
objects (processes and RAM) and "cluster" can denote the more static objects
(directories and files). What do you think? If you agree, I would create a
patch.

I don't think that's the general understanding of those terms. For all
I know, they *are* synonyms, and there's no specific term for "the
fluctuating objects" as you call them. The instance is either running
(in which case there are processes and RAM) or it isn't.

* I removed "global SQL object" and made "SQL object" explain it.

+1., but see the (huge) different spellings in patch.

This seems a misunderstanding of what "local" means. Any object that
exists in a database is local, regardless of whether it exists in a
schema or not. "Extensions" is one type of object that does not belong
in a schema. "Foreign data wrapper" is another type of object that does
not belong in a schema. Same with data type casts. They are *not*
global objects.

bloat: changed 'current row' to 'relevant row' because not only the youngest
one is relevant (non-bloat).

Hm. TBH I'm not sure of this term at all. I think we sometimes use the
term "bloat" to talk about the dead rows only, ignoring the free space.

data type casts: Are you sure that they are global? In pg_cast 'relisshared'
is 'false'.

I'm not saying they're global. I'm saying they're outside schemas.
Maybe this definition needs more rewording, if this bit is unclear.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8 Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#7)

On 17.05.20 08:51, Alvaro Herrera wrote:

Any object that
exists in a database is local, regardless of whether it exists in a
schema or not.

This implies that the term "local" is unnecessary, just call them "SQL
object".

"Extensions" is one type of object that does not belong
in a schema. "Foreign data wrapper" is another type of object that does
not belong in a schema. ... They are *not*
global objects.

postgres_fdw is a module among many others. It's only an example for
"extensions" and has no different nature. Yes, they are not global SQL
objects because they don't belong to the cluster.

In summary we have 3 types of objects: belonging to a schema, to a
database, or to the cluster (global). Maybe, we can avoid the use of the
different names 'local SQL object' and 'global SQL object' at all and
just call them 'SQL object'. 'global SQL object' is used only once. We
could rephrase "A set of databases and accompanying global SQL objects
... " to "A set of databases and accompanying SQL objects, which exists
at the cluster level, ... "

TBH I'm not sure of this term at all. I think we sometimes use the
term "bloat" to talk about the dead rows only, ignoring the free space.

That's a good example of why the glossary is necessary. Currently we
don't have a common understanding of all the terms we use. The
glossary should fix that and give an authoritative definition, after a
clarifying discussion.

--

Jürgen Purtz

#9 Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#7)

On 17.05.20 08:51, Alvaro Herrera wrote:

On 15.05.20 02:00, Alvaro Herrera wrote:

Thanks everybody. I have compiled together all the suggestions and the
result is in the attached patch. Some of it is of my own devising.

* I changed "instance", and made "cluster" be mostly a synonym of that.

In my understanding, "instance" and "cluster" should be different things,
not only synonyms. "instance" can be the term for permanently fluctuating
objects (processes and RAM) and "cluster" can denote the more static objects
(directories and files). What do you think? If you agree, I would create a
patch.

I don't think that's the general understanding of those terms. For all
I know, they *are* synonyms, and there's no specific term for "the
fluctuating objects" as you call them. The instance is either running
(in which case there are processes and RAM) or it isn't.

We have the basic tools "initdb — create a new PostgreSQL database
cluster" which affects nothing but files, and we have "pg_ctl —
initialize, start, stop, or control a PostgreSQL server" which -
directly - affects nothing but processes and RAM. (Here the term
"server" collides with new definitions in the glossary. But that's
another story.)

--

Jürgen Purtz

#10 Erik Rijkers
er@xs4all.nl
In reply to: Alvaro Herrera (#7)

On 2020-05-17 08:51, Alvaro Herrera wrote:

On 2020-May-17, Jürgen Purtz wrote:

On 15.05.20 02:00, Alvaro Herrera wrote:

Thanks everybody. I have compiled together all the suggestions and the

* I changed "instance", and made "cluster" be mostly a synonym of that.

In my understanding, "instance" and "cluster" should be different
things,

I don't think that's the general understanding of those terms. For all
I know, they *are* synonyms, and there's no specific term for "the
fluctuating objects" as you call them. The instance is either running
(in which case there are processes and RAM) or it isn't.

For what it's worth, I've also always understood 'instance' as 'a
running database'. I admit it might be a left-over from my oracle
years:

https://docs.oracle.com/cd/E11882_01/server.112/e40540/startup.htm#CNCPT601

There, 'instance' clearly refers to a running database. When that
database is stopped, it ceases to be an instance. I've always
understood this to be the same for the PostgreSQL 'instance'. Once
stopped, it is no longer an instance, but it is, of course, still a
cluster.

I know, we don't have to do the same as Oracle, but clearly it's going
to be an ongoing source of misunderstanding if we define such a
high-level term differently.

Erik Rijkers

#11 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Erik Rijkers (#10)

On 2020-May-17, Erik Rijkers wrote:

On 2020-05-17 08:51, Alvaro Herrera wrote:

I don't think that's the general understanding of those terms. For all
I know, they *are* synonyms, and there's no specific term for "the
fluctuating objects" as you call them. The instance is either running
(in which case there are processes and RAM) or it isn't.

For what it's worth, I've also always understood 'instance' as 'a running
database'. I admit it might be a left-over from my oracle years:

https://docs.oracle.com/cd/E11882_01/server.112/e40540/startup.htm#CNCPT601

There, 'instance' clearly refers to a running database. When that database
is stopped, it ceases to be an instance.

I've never understood it that way, but I'm open to having my opinion on
it changed. So let's discuss it and maybe gather opinions from others.

I think the terms under discussion are just

* cluster
* instance
* server

We don't have "host" (I just made it a synonym for server), but perhaps
we can add that too, if it's useful. It would be good to be consistent
with historical Postgres usage, such as the initdb usage of "cluster"
etc.

Perhaps we should not only define what our use of each term is, but also
explain how each term is used outside PostgreSQL and highlight the
differences. (This would be particularly useful for "cluster" ISTM.)

It seems difficult to get this sorted out before beta1, but there's
still time before the glossary is released.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12 Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#11)

On 17.05.20 17:28, Alvaro Herrera wrote:

On 2020-May-17, Erik Rijkers wrote:

On 2020-05-17 08:51, Alvaro Herrera wrote:

I don't think that's the general understanding of those terms. For all
I know, they *are* synonyms, and there's no specific term for "the
fluctuating objects" as you call them. The instance is either running
(in which case there are processes and RAM) or it isn't.

For what it's worth, I've also always understood 'instance' as 'a running
database'. I admit it might be a left-over from my oracle years:

https://docs.oracle.com/cd/E11882_01/server.112/e40540/startup.htm#CNCPT601

There, 'instance' clearly refers to a running database. When that database
is stopped, it ceases to be an instance.

I've never understood it that way, but I'm open to having my opinion on
it changed. So let's discuss it and maybe gather opinions from others.

I think the terms under discussion are just

* cluster
* instance
* server

We don't have "host" (I just made it a synonym for server), but perhaps
we can add that too, if it's useful. It would be good to be consistent
with historical Postgres usage, such as the initdb usage of "cluster"
etc.

Perhaps we should not only define what our use of each term is, but also
explain how each term is used outside PostgreSQL and highlight the
differences. (This would be particularly useful for "cluster" ISTM.)

In fact, we have reached a point where we don't have a common
understanding of a group of terms. I'm sure that we will meet some more
situations like this in the future. Such discussions, subsequent
decisions, and implementations in the docs are necessary to gain a solid
foundation - primarily for newcomers (which is my main motivation) as
well as for more complex discussions among experts. Obviously, each of
us will bring along a previous understanding of terms. But we should
also be open to sometimes revising old terms.

Here are my two cents.

cluster/instance: PG (mainly) consists of a group of processes that
commonly act on shared buffers. The processes are very closely related
to each other and with the buffers. They exist altogether or not at all.
They use a common initialization file and are incarnated by one command.
Everything exists solely in RAM and therefore has a fluctuating nature.
In summary: they build a unit and this unit needs to have a name of
itself. In some pages we used to use the term *instance* - sometimes in
extended forms: *database instance*, *PG instance*, *standby instance*,
*standby server instance*, *server instance*, or *remote instance*.  For
me, the term *instance* makes sense, the extensions *standby instance*
and *remote instance* in their context too.

The next essential component is the data itself. It is organized as a
group of databases plus some common management information (global,
pg_wal, pg_xact, pg_tblspc, ...). The complete data must be treated as a
whole because the management information concerns all databases. Its
nature is different from the processes and shared buffers. Of course,
its content changes, but it has a steady nature. It even survives a
'power down'. There is one command to instantiate a new incarnation of
the directory structure and all files. In summary, it's something of its
own and should have its own name. 'database' is not possible because it
consists of databases and other things. My favorite is *cluster*;
*database cluster* is also possible.

server/host: We need a term to describe the underlying hardware
respectively the virtual machine or container, where PG is running. I
suggest to use both *server* and *host*. In computer science, both have
their eligibility and are widely used. Everybody understands
*client/server architecture* or *host* in TCP/IP configuration. We
cannot change such matter of course. I suggest to use both depending on
the context, but with the same meaning: "real hardware, a container, or
a virtual machine".

--

Jürgen Purtz

(PS: I added the docs mailing list)

#13 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Jürgen Purtz (#12)

On Mon, 2020-05-18 at 18:08 +0200, Jürgen Purtz wrote:

cluster/instance: PG (mainly) consists of a group of processes that commonly
act on shared buffers. The processes are very closely related to each other
and with the buffers. They exist altogether or not at all. They use a common
initialization file and are incarnated by one command. Everything exists
solely in RAM and therefore has a fluctuating nature. In summary: they build
a unit and this unit needs to have a name of itself. In some pages we used
to use the term *instance* - sometimes in extended forms: *database instance*,
*PG instance*, *standby instance*, *standby server instance*, *server instance*,
or *remote instance*. For me, the term *instance* makes sense, the extensions
*standby instance* and *remote instance* in their context too.

FWIW, I feel somewhat like Alvaro on that point; I use those terms synonymously,
perhaps distinguishing between a "started cluster" and a "stopped cluster".
After all, "cluster" refers to "a cluster of databases", which are there, regardless
if you start the server or not.

The term "cluster" is unfortunate, because to most people it suggests a group of
machines, so the term "instance" is better, but that ship has sailed long ago.

The static part of a cluster to me is the "data directory".

server/host: We need a term to describe the underlying hardware respectively
the virtual machine or container, where PG is running. I suggest to use both
*server* and *host*. In computer science, both have their eligibility and are
widely used. Everybody understands *client/server architecture* or *host* in
TCP/IP configuration. We cannot change such matter of course. I suggest to
use both depending on the context, but with the same meaning: "real hardware,
a container, or a virtual machine".

On this I have a strong opinion because of my Unix mindset.
"machine" and "host" are synonyms, and it doesn't matter to the database if they
are virtualized or not. You can always disambiguate by adding "virtual" or "physical".

A "server" is a piece of software that responds to client requests, never a machine.
In my book, this is purely Windows jargon. The term "client-server architecture"
that you quote emphasized that.

Perhaps "machine" would be the preferable term, because "host" is more prone to
misunderstandings (except in a networking context).

Yours,
Laurenz Albe

#14 Andrew Grillet
andrew@grillet.co.uk
In reply to: Laurenz Albe (#13)

I think there needs to be a careful analysis of the language and a formal
effort to stabilise it for the future.

In the context of, say, an Oracle T series, which is partitioned into
multiple domains (virtual machines), each of these has multiple CPUs and
can run an instance of the OS which hosts multiple virtual instances of
the same or different OSes. Some domains might do this while others do
not!

A host could be a domain, one of many virtual machines, or it could be one
of many hosts on that VM, but even these hosts could be virtual machines
that each run several virtual servers!

Of course, PostgreSQL can run on any tier of this regime, but the
documentation at least needs to be consistent about language.

A "machine" should probably refer to hardware, although I would accept that
a domain might count as "virtual hardware", while a host should probably
refer to a single instance of an OS.

Of course it is possible for a single instance of an OS to run multiple
instances of PostgreSQL, and people do this. (I have in the past.)

Slightly more confusingly, it would appear possible for a single instance
of an OS to have multiple IP addresses, and if there are multiple instances
of PostgreSQL, they may serve different IP addresses uniquely, or share
them. I think this case suggests that a host probably best describes an OS
instance. I might be wrong.

The word "server" might be an instance of any of the above, or a waiter
with a bowl of soup. It is best reserved for situations where clarity is
not required.

If you are new to all this, I am sure it is very confusing, and
inconsistent language is not going to help.

Andrew

AFAICT

#15 Peter Eisentraut
peter_e@gmx.net
In reply to: Laurenz Albe (#13)

On 2020-05-19 08:17, Laurenz Albe wrote:

The term "cluster" is unfortunate, because to most people it suggests a group of
machines, so the term "instance" is better, but that ship has sailed long ago.

I don't see what would stop us from renaming some things, with some care.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16 Jürgen Purtz
juergen@purtz.de
In reply to: Laurenz Albe (#13)

On 19.05.20 08:17, Laurenz Albe wrote:

On Mon, 2020-05-18 at 18:08 +0200, Jürgen Purtz wrote:

cluster/instance: PG (mainly) consists of a group of processes that commonly
act on shared buffers. The processes are very closely related to each other
and with the buffers. They exist altogether or not at all. They use a common
initialization file and are incarnated by one command. Everything exists
solely in RAM and therefore has a fluctuating nature. In summary: they build
a unit and this unit needs to have a name of itself. In some pages we used
to use the term *instance* - sometimes in extended forms: *database instance*,
*PG instance*, *standby instance*, *standby server instance*, *server instance*,
or *remote instance*. For me, the term *instance* makes sense, the extensions
*standby instance* and *remote instance* in their context too.

FWIW, I feel somewhat like Alvaro on that point; I use those terms synonymously,
perhaps distinguishing between a "started cluster" and a "stopped cluster".
After all, "cluster" refers to "a cluster of databases", which are there, regardless
if you start the server or not.

The term "cluster" is unfortunate, because to most people it suggests a group of
machines, so the term "instance" is better, but that ship has sailed long ago.

The static part of a cluster to me is the "data directory".

cluster/instance: The different nature (static/dynamic) of what I call
"cluster" and "instance" as well as the existence of the two commands
"initdb — create a new PostgreSQL database cluster" and "pg_ctl —
initialize, start, stop, or control a PostgreSQL server" confirm my
opinion that we need two different terms for them. Those two terms
should not be synonyms of each other; they label distinct things. If
people prefer "data directory" instead of "cluster", that is fine with me.

There are situations where we need a single term for both of them.
"Instance and its data directory" or "Instance and its cluster" are too
wordy. In many cases we use "database server" or "server" in this sense.
Imo "Server" is too short and ambiguous. "database server", the plural
form "databases server", or the new term "cluster server", which is more
accurate, would be ok for me. (Similar to "server", the term "cluster"
is also used in many different contexts - but only outside of the PG
world; within our context "cluster" is not ambiguous.)

server/host: We need a term to describe the underlying hardware respectively
the virtual machine or container, where PG is running. I suggest to use both
*server* and *host*. In computer science, both have their eligibility and are
widely used. Everybody understands *client/server architecture* or *host* in
TCP/IP configuration. We cannot change such matter of course. I suggest to
use both depending on the context, but with the same meaning: "real hardware,
a container, or a virtual machine".

On this I have a strong opinion because of my Unix mindset.
"machine" and "host" are synonyms, and it doesn't matter to the database if they
are virtualized or not. You can always disambiguate by adding "virtual" or "physical".

A "server" is a piece of software that responds to client requests, never a machine.
In my book, this is purely Windows jargon. The term "client-server architecture"
that you quote emphasized that.

Perhaps "machine" would be the preferable term, because "host" is more prone to
misunderstandings (except in a networking context).

server/host: I agree that we are not interested in the question whether
there is real hardware or any virtualization container. We are even not
interested in the operating system. Our primary concern is the existence
of a port of the Internet Protocol. But is the term "server" appropriate
to name an IP-port? Additionally, "server" is used for other meanings:
a) the previously mentioned "database server" b) a (virtual) machine:
"server-side", "... the file ... loaded by the server ..." c) binaries
"... the server must be built with SSL support ..." d) whenever it seems
to be appropriate: "standby server", "... the server parses query ...",
"server configuration", "server process".

Because of its ambiguous usage, the definition of "server" must clarify
the allowed meanings. What about:

--

server: Depending on the context, the term *server* denotes:

* An IP-port which is offered by any OS.   ?????
* A - possibly virtualized - machine
* An abbreviation for the slightly longer term "database(s)/cluster
server"  ??? this will support the readability, but not the clarity ???
* More ?

--

The term "host" is used mainly for IP configuration "host name", "host
address" and in the context of compiling "host language", "host
variable". These are clear situations and can be defined easily.

#17Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Jürgen Purtz (#16)
hackersdocs

On Wed, 2020-05-20 at 13:17 +0200, Jürgen Purtz wrote:

FWIW, I feel somewhat like Alvaro on that point; I use those terms synonymously,
perhaps distinguishing between a "started cluster" and a "stopped cluster".
After all, "cluster" refers to "a cluster of databases", which are there, regardless
if you start the server or not.

The term "cluster" is unfortunate, because to most people it suggests a group of
machines, so the term "instance" is better, but that ship has sailed long ago.

The static part of a cluster to me is the "data directory".

cluster/instance: The different nature (static/dynamic) of what I
call "cluster" and "instance" as well as the existence of the two
commands "initdb — create a new PostgreSQL database cluster" and
"pg_ctl — initialize, start, stop, or control a PostgreSQL server"
confirms me in my opinion that we need two different terms for
them.

I think that the "pg_ctl" example does not apply:
It does not talk about starting the cluster, but about starting the server process,
that is "server" in the way I understand it.

There are situations where we need a single term for both of
them. "Instance and its data directory" or "Instance and its
cluster" are too wordy. In many cases we use "database server" or
"server" in this sense. Imo "Server" is too short and ambiguous.
"database server", the plural form "databases server", or the new
term "cluster server", which is more accurate, would be ok for me.
(Similar to "server", the term "cluster" is also used in many
different contexts - but only outside of the PG world; within our
context "cluster" is not ambiguous.)

That does not feel right to me.

"cluster server", ouch. "databases server", ouch as well.

I never felt the term "cluster" was unclear in these contexts.
Sometimes it means "data directory", sometimes it is used for "server process",
but I think few people would think one could connect to a data directory
or create a process in a directory (initdb).

I think clarity is a Good Thing, but it can be overdone.

server/host: We need a term to describe the underlying hardware respectively
the virtual machine or container, where PG is running. I suggest to use both
*server* and *host*. In computer science, both have their eligibility and are
widely used. Everybody understands *client/server architecture* or *host* in
TCP/IP configuration. We cannot change such matter of course. I suggest to
use both depending on the context, but with the same meaning: "real hardware,
a container, or a virtual machine".

On this I have a strong opinion because of my Unix mindset.
"machine" and "host" are synonyms, and it doesn't matter to the database if they
are virtualized or not. You can always disambiguate by adding "virtual" or "physical".

A "server" is a piece of software that responds to client requests, never a machine.
In my book, this is purely Windows jargon. The term "client-server architecture"
that you quote emphasized that.

Perhaps "machine" would be the preferable term, because "host" is more prone to
misunderstandings (except in a networking context).

server/host: I agree that we are not interested in the question
whether there is real hardware or any virtualization container. We
are not even interested in the operating system. Our primary
concern is the existence of a port of the Internet Protocol. But
is the term "server" appropriate to name an IP-port? Additionally,
"server" is used for other meanings: a) the previously mentioned
"database server" b) a (virtual) machine: "server-side", "... the
file ... loaded by the server ..." c) binaries "... the server
must be built with SSL support ..." d) whenever it seems to be
appropriate: "standby server", "... the server parses query ...",
"server configuration", "server process".

You are most thorough :^)

Because of its ambiguous usage, the definition of "server" must
clarify the allowed meanings. What about:

server: Depending on the context, the term *server* denotes:

An IP-port which is offered by any OS. ?????

A port is a server? No way.

A - possibly virtualized - machine

It might be good to disambiguate that, but I don't think that the PostgreSQL
documentation should use the word "server" to mean "machine".

An abbreviation for the slightly longer term
"database(s)/cluster server" ??? this will support the
readability, but not the clarity ???

"Server" is short for "database server" and is a set of processes that listen
for and handle incoming database client requests.

I think that covers all the meanings you quoted from the documentation,
except c), where it is used as shorthand for "server executable".

Yours,
Laurenz Albe

#18Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#11)
hackersdocs

On 17.05.20 17:28, Alvaro Herrera wrote:

I think the terms under discussion are just

* cluster
* instance
* server

Despite the short period of its existence the glossary achieved some
importance, see:
/messages/by-id/b8e12875ebec9e6d3107df5fa1129e1e@postgrespro.ru
. We have to be careful with publications. It's not acceptable that we
change definitions from release to release. Therefore IMO we should mark
or even ignore such terms for which we cannot reach consensus.

Can you agree to the following definitions? If no, we can alternatively
formulate for each of them: "Under discussion - currently not defined".
My proposals are inspired by chapter 2.2 Concepts: "Tables are grouped
into databases, and a collection of databases managed by a single
PostgreSQL server instance constitutes a database cluster."

- "Database" (No change to existing definition): "A named collection of
SQL objects."

- "Database Cluster", "Cluster" (New definition and rearrangements of
some sentences): "A collection of related databases, and their common
static and dynamic meta-data.

This term is sometimes used to refer to an instance.

(Don't confuse the term CLUSTER with the SQL command CLUSTER.)"

- "Data Directory" (Replaced 'instance' by 'cluster'): "The base
directory on the filesystem of a server that contains all data files and
subdirectories associated with a cluster (with the exception of
tablespaces). The environment variable PGDATA is commonly used to refer
to the data directory.

A cluster's storage space comprises the data directory plus any
additional tablespaces.

For more information, see Section 68.1."

- "Database Server", "Instance" (Major changes): "A group of backend and
auxiliary processes that communicate using a common shared memory area.
One postmaster process manages the instance; one instance manages
exactly one cluster with all its databases. Many instances can run on
the same server as long as their TCP ports do not conflict.

The instance handles all key features of a DBMS: read and write access
to files and shared memory, assurance of the ACID properties,
connections to client processes, privilege verification, crash recovery,
replication, etc."

- "Server" (No change to existing definition): "A computer on which
PostgreSQL instances run. The term server denotes real hardware, a
container, or a virtual machine.

This term is sometimes used to refer to an instance or to a host."

- "Host" (No change to existing definition): "A computer that
communicates with other computers over a network. This is sometimes used
as a synonym for server. It is also used to refer to a computer where
client processes run."
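
As a rough illustration of how the proposed definitions relate to each
other, here is a minimal Python sketch. The class names (Cluster,
Instance, Server) and the start() method are hypothetical and exist only
for this example; nothing here is PostgreSQL code. It only encodes three
statements from the list above: a cluster's storage space is the data
directory plus any tablespaces, one instance manages exactly one cluster,
and several instances can share a server as long as their TCP ports do
not conflict.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    """On-disk state: a data directory plus any additional tablespaces."""
    data_directory: str
    tablespaces: tuple = ()

    def storage_space(self):
        # Storage space = data directory plus additional tablespaces.
        return (self.data_directory,) + tuple(self.tablespaces)


@dataclass(frozen=True)
class Instance:
    """Running state: processes that manage exactly one cluster."""
    cluster: Cluster
    port: int = 5432  # PostgreSQL's default TCP port


@dataclass
class Server:
    """A machine (real hardware, container, or VM) where instances run."""
    instances: list = field(default_factory=list)

    def start(self, cluster, port=5432):
        # Hypothetical helper: refuse conflicting ports and refuse to
        # run two instances on the same cluster.
        for inst in self.instances:
            if inst.port == port:
                raise ValueError(f"port {port} is already in use")
            if inst.cluster == cluster:
                raise ValueError("cluster is already managed by an instance")
        inst = Instance(cluster, port)
        self.instances.append(inst)
        return inst
```

With this model, starting two clusters on ports 5432 and 5433 succeeds,
while starting a third one on 5432 again raises an error.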

--

Jürgen Purtz

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jürgen Purtz (#18)
hackersdocs

On 2020-Jun-09, Jürgen Purtz wrote:

Can you agree to the following definitions? If no, we can alternatively
formulate for each of them: "Under discussion - currently not defined". My
proposals are inspired by chapter 2.2 Concepts: "Tables are grouped into
databases, and a collection of databases managed by a single PostgreSQL
server instance constitutes a database cluster."

After sleeping on it a few more times, I don't oppose the idea of making
"instance" be the running state and "database cluster" the on-disk stuff
that supports the instance. Here's a patch that does things pretty much
along the lines you suggested.

I made small adjustments to "SQL objects":

* SQL objects in schemas were said to have their names unique in the
schema, but we failed to say anything about names of objects not in
schemas and global objects. Added that.

* Had example object types for global objects and objects not in
schemas, but no examples for objects in schemas. Added that.

Some programs whose output we could tweak per this:
pg_ctl

pg_ctl is a utility to initialize, start, stop, or control a PostgreSQL server.
-D, --pgdata=DATADIR location of the database storage area

to:

pg_ctl is a utility to initialize or control a PostgreSQL database cluster.
-D, --pgdata=DATADIR location of the database directory

pg_basebackup:

pg_basebackup takes a base backup of a running PostgreSQL server.

to:

pg_basebackup takes a base backup of a PostgreSQL instance.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

glossary.patch (text/x-diff; charset=us-ascii), +54 -32

#20Justin Pryzby
pryzby@telsasoft.com
In reply to: Alvaro Herrera (#19)
hackersdocs

On Tue, Jun 16, 2020 at 08:09:26PM -0400, Alvaro Herrera wrote:

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 25b03f3b37..e29b55e5ac 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -395,15 +395,15 @@
<para>
The base directory on the filesystem of a
<glossterm linkend="glossary-server">server</glossterm> that contains all
-     data files and subdirectories associated with an
-     <glossterm linkend="glossary-instance">instance</glossterm> (with the
-     exception of <glossterm linkend="glossary-tablespace">tablespaces</glossterm>).
+     data files and subdirectories associated with a
+     <glossterm linkend="glossary-db-cluster">database cluster</glossterm>
+     (with the exception of
+     <glossterm linkend="glossary-tablespace">tablespaces</glossterm>).

and (optionally) WAL

+  <glossentry id="glossary-db-cluster">
+   <glossterm>Database cluster</glossterm>
+   <glossdef>
+    <para>
+     A collection of databases and global SQL objects,
+     and their common static and dynamic meta-data.

metadata

@@ -1245,12 +1255,17 @@
<glossterm linkend="glossary-sql-object">SQL objects</glossterm>,
which all reside in the same
<glossterm linkend="glossary-database">database</glossterm>.
-     Each SQL object must reside in exactly one schema.
+     Each SQL object must reside in exactly one schema
+     (though certain types of SQL objects exist outside schemas).

(except for global objects which ..)

<para>
The names of SQL objects of the same type in the same schema are enforced
to be unique.
There is no restriction on reusing a name in multiple schemas.
+     For local objects that exist outside schemas, their names are enforced
+     unique across the whole database.  For global objects, their names

I would say "unique within the database"

+     are enforced unique across the whole
+     <glossterm linkend="glossary-db-cluster">database cluster</glossterm>.

and "unique within the whole db cluster"

Most local objects belong to a specific
-      <glossterm linkend="glossary-schema">schema</glossterm> in their containing database.
+      <glossterm linkend="glossary-schema">schema</glossterm> in their
+      containing database, such as
+      <glossterm linkend="glossary-relation">all types of relations</glossterm>,
+      <glossterm linkend="glossary-function">all types of functions</glossterm>,

Maybe say: >Relations< (all types), and >Functions< (all types)

used as the default one for all SQL objects, called <literal>pg_default</literal>.

"the default" (remove "one")
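
The three uniqueness scopes touched by the schema hunk above (within a
schema, within the database, within the whole database cluster) can be
sketched as follows. This is only an illustration in Python: the function
names and the dictionary layout are made up for this example, and keying
the two wider scopes by object type as well is an assumption carried over
from the "same type in the same schema" wording, not something the patch
text states.

```python
def uniqueness_scope(kind, name, database=None, schema=None):
    """Return the key under which an object's name must be unique."""
    if schema is not None:
        if database is None:
            raise ValueError("a schema always belongs to a database")
        # Objects in a schema: unique per type within that schema.
        return ("schema", database, schema, kind, name)
    if database is not None:
        # Local objects outside schemas: unique within the database.
        return ("database", database, kind, name)
    # Global objects: unique within the whole database cluster.
    return ("cluster", kind, name)


def find_name_conflicts(objects):
    """Return the scope keys that occur more than once in `objects`."""
    seen = set()
    conflicts = []
    for obj in objects:
        key = uniqueness_scope(**obj)
        if key in seen:
            conflicts.append(key)
        seen.add(key)
    return conflicts
```

For example, two tables named "t" in different schemas of one database do
not conflict under this model, and neither do two extensions of the same
name in different databases, whereas two roles named "alice" in one
cluster do.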

--
Justin

#21Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#19)
hackersdocs
#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jürgen Purtz (#21)
hackersdocs
#23Erik Rijkers
er@xs4all.nl
In reply to: Alvaro Herrera (#22)
hackersdocs
#24Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Erik Rijkers (#23)
hackersdocs
#25Jürgen Purtz
juergen@purtz.de
In reply to: Alvaro Herrera (#24)
hackersdocs
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jürgen Purtz (#25)
hackersdocs