Reg: Help to understand the source code

Started by Preethi S about 6 years ago, 7 messages, general

preethi7091@gmail.com

about 6 years ago

Hello,

I am fairly new to Postgres and I am trying to understand how data is
processed during an insert, from the buffer to the disk. Can someone help me
with that? Also, I would like to see the source code workflow. Can someone help
me find the source code for the data insertion/modification
workflow?

Thank you for helping a beginner.

Adrian Klaver

adrian.klaver@aklaver.com

about 6 years ago

In reply to: Preethi S (#1)

Re: Reg: Help to understand the source code

On 4/23/20 8:44 AM, Preethi S wrote:

> Hello,
>
> I am fairly new to Postgres and I am trying to understand how data
> is processed during an insert, from the buffer to the disk. Can someone help
> me with that? Also, I would like to see the source code workflow. Can
> someone help me find the source code for the data
> insertion/modification workflow?

Have you looked at this?

https://www.postgresql.org/developer/backend/

> Thank you for helping a beginner.

--
Adrian Klaver
adrian.klaver@aklaver.com

Preethi S

preethi7091@gmail.com

about 6 years ago

In reply to: Adrian Klaver (#2)

Re: Reg: Help to understand the source code

Hello Adrian,

Thank you for the quick reply. The link is indeed helpful. It
explains how a query is processed, which I am already familiar
with.

In addition, I am looking for how the data itself is processed: when data is
inserted or modified, is the new data written to shared buffers -> WAL
-> disk?

I would like to see the code that does this (for example, data written
into shared_buffers, wal_buffers, and WAL segments, and then fsynced).

Rob Sargent

robjsargent@gmail.com

about 6 years ago

In reply to: Preethi S (#3)

Re: Reg: Help to understand the source code

On 4/23/20 10:28 AM, Preethi S wrote:

> Hello Adrian,
>
> Thank you for the quick reply. The link is indeed helpful. It
> explains how a query is processed, which I am already familiar
> with.
>
> In addition, I am looking for how the data itself is processed: when data is
> inserted or modified, is the new data written to shared buffers ->
> WAL -> disk?
>
> I would like to see the code that does this (for example, data
> written into shared_buffers, wal_buffers, and WAL segments, and then fsynced).

What tools are you using to examine the code?

Preethi S

preethi7091@gmail.com

about 6 years ago

In reply to: Rob Sargent (#4)

Re: Reg: Help to understand the source code

I am using Doxygen.

Paul Jungwirth

pj@illuminatedcomputing.com

about 6 years ago

In reply to: Preethi S (#1)

Re: Reg: Help to understand the source code

On 4/23/20 8:44 AM, Preethi S wrote:

> I am fairly new to Postgres and I am trying to understand how data
> is processed during an insert, from the buffer to the disk. Can someone help
> me with that? Also, I would like to see the source code workflow. Can
> someone help me find the source code for the data
> insertion/modification workflow?

I'm also a Postgres hacker newbie, but I've spent some time adding
SQL:2011 FOR PORTION OF support to UPDATE/DELETE, so I've gone through
that learning process. (I should say "going through". :-)

I'd say be prepared to spend a *lot* of time reading the code.
Personally I use `grep -r` a lot and just read and read. For specifics
you can use a debugger or insert
`ereport(NOTICE, (errmsg("something %s", foo)))`
and run queries (or the test suite). Also many subfolders
have an extensive README that will guide you. Some of the READMEs may
take an hour or more to get through and understand, but reading them is
worth it.

It helped me a lot to spend several years writing occasional Postgres C
extensions before really doing anything in the core codebase. There are
lots of basics you learn that way. There are a bunch of articles and
presentations out there about writing extensions that you might find helpful.

Postgres processes queries in several steps:

- parse
- analyze
- rewrite
- plan
- optimize
- execute

The parse step is a bison grammar (look for gram.y). Basically it fills
in structs cutting up what the user typed.

The analyze step starts to make sense of the parse results. Look at
parser/analyze.c. It maps input strings to database objects---for
example looking up table/column names (and making sure they really
exist). Here you're sort of just copying things from the parse structs
to different structs. You're building up Node trees that later steps can
use. I think the analyze step is often considered to be still part of
the parse phase.

It seems like each SQL "clause" has its own transformFoo function, so
probably you'll want to add your own (transformMyAwesomeFeatureClause)
and then call it from its "parent" (e.g. transformUpdateStmt).
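To make the analyze step and the parent/child transform pattern a bit more concrete, here is a minimal, self-contained sketch. It is a toy, not code from the Postgres tree: the types `RawUpdate`, `AnalyzedUpdate`, and `ToyTable`, the functions `toy_transform_update` and `toy_transform_column`, and the one-table catalog are all made up for illustration. It only shows the general idea of turning the strings from the parse step into references to objects that really exist.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Everything here is a hypothetical stand-in, not a real PostgreSQL type. */

/* What a parser might hand over: just the strings the user typed. */
typedef struct RawUpdate {
    const char *table_name;
    const char *column_name;
} RawUpdate;

/* What analysis might produce: references to known objects. */
typedef struct AnalyzedUpdate {
    int table_id;    /* made-up identifier from the toy catalog */
    int column_no;   /* 1-based position of the column in the table */
} AnalyzedUpdate;

/* A toy catalog containing a single three-column table. */
typedef struct ToyTable {
    int id;
    const char *name;
    const char *columns[3];
} ToyTable;

static const ToyTable toy_catalog[] = {
    { 16384, "accounts", { "id", "owner", "balance" } },
};

/* "Child" transform: resolve one column name.
 * Returns the 1-based column number, or 0 if the column does not exist. */
int
toy_transform_column(const ToyTable *table, const char *column_name)
{
    assert(table != NULL && column_name != NULL);
    for (int i = 0; i < 3; i++)
        if (strcmp(table->columns[i], column_name) == 0)
            return i + 1;
    return 0;
}

/* "Parent" transform: resolve the table, then call the child transform.
 * Returns 1 on success, 0 if the table or column name is unknown. */
int
toy_transform_update(const RawUpdate *raw, AnalyzedUpdate *out)
{
    assert(raw != NULL && out != NULL);
    size_t ntables = sizeof(toy_catalog) / sizeof(toy_catalog[0]);

    for (size_t t = 0; t < ntables; t++) {
        if (strcmp(toy_catalog[t].name, raw->table_name) != 0)
            continue;
        int column_no = toy_transform_column(&toy_catalog[t], raw->column_name);
        if (column_no == 0)
            return 0;               /* unknown column */
        out->table_id = toy_catalog[t].id;
        out->column_no = column_no;
        return 1;
    }
    return 0;                       /* unknown table */
}
```

Calling `toy_transform_update` with `{"accounts", "balance"}` fills in the table id and column number, while a misspelled table or column name makes it return 0. The real analysis code reports an error in that situation and builds much richer Node trees.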

If you add new Node types you'll need to edit nodes/*funcs.c and also
probably teach some switch statements how to handle them. If you are
filling in a struct but then later in the pipeline find that what you
wrote isn't there anymore, you probably forgot to implement a copy function.
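The "forgot to implement a copy function" symptom can be reproduced in miniature. The sketch below is a toy under stated assumptions: `ToyUpdateNode`, `my_new_field`, `toy_copy_stale`, and `toy_copy_fixed` are hypothetical names, not anything from nodes/*funcs.c. It only shows why a field that the copy code does not know about seems to vanish later in the pipeline.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type, not a real PostgreSQL struct. */
typedef struct ToyUpdateNode {
    int target_table;
    int my_new_field;   /* field added later for a new feature */
} ToyUpdateNode;

/* Copy function written before my_new_field existed. It never copies
 * the new field, so the copy silently ends up with zero there.
 * The caller owns the result and must free() it. */
ToyUpdateNode *
toy_copy_stale(const ToyUpdateNode *from)
{
    assert(from != NULL);
    ToyUpdateNode *copy = calloc(1, sizeof(*copy));
    if (copy == NULL)
        return NULL;
    copy->target_table = from->target_table;
    return copy;
}

/* Updated copy function that copies every field. */
ToyUpdateNode *
toy_copy_fixed(const ToyUpdateNode *from)
{
    assert(from != NULL);
    ToyUpdateNode *copy = calloc(1, sizeof(*copy));
    if (copy == NULL)
        return NULL;
    copy->target_table = from->target_table;
    copy->my_new_field = from->my_new_field;
    return copy;
}
```

If a node with `my_new_field` set to 42 goes through `toy_copy_stale`, the copy has 0 in that field, which is the "what I wrote isn't there anymore" effect. `toy_copy_fixed` preserves it.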

The rewrite/plan/optimize steps aren't things you need to worry about
too much if you're interested in DML, but you can read more about them
in the source code. Rewrite in particular is pretty niche (views and RULEs).

The execute step is the most challenging I think. It has its own Node
trees and also keeps an execution state. Probably you'll need to look at
src/backend/executor/nodeModifyTable.c among others. You'll also need to
learn about TupleTableSlots. (If anyone here has a good learning
resource for TTS I would also be glad to read it.)

I'm afraid this description is comically dumbed down, but hopefully it
can be something like a map. I'd probably just take an UPDATE statement
and try to trace it through the pipeline, and maybe experiment with
small changes along the way. You can add things to src/test/regress as
you go.

And the mailing list is a very friendly place to ask questions.

Yours,

--
Paul ~{:-)
pj@illuminatedcomputing.com

Preethi S

preethi7091@gmail.com

about 6 years ago

In reply to: Paul Jungwirth (#6)

Re: Reg: Help to understand the source code

Thank you Paul! This certainly helps.
