a question about relkind of RelationData handed over to heap_update function
Dear hackers,
I'm modifying the backend source code of pgsql.
While inspecting the heap_update function (src/backend/access/heap/heapam.c),
I found that the relkind field of every RelationData handed over to
heap_update is 'r'.
I want to distinguish normal relations (actual tables) from primary index
relations (the primary indexes of some tables).
As you know, there are 7 different relkinds (i, r, S, u, t, v, c).
I guess a primary index relation's relkind would be the same as a normal
relation's (i.e. 'r').
Is there any way I can distinguish a normal relation from a primary index
relation in the heap_update function?
In the following code, I want to make 'doIcl = false' for the primary
index relation.
Thank you for reading this.
------------------------------------------------------------------------
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
            ItemPointer ctid, TransactionId *update_xmax,
            CommandId cid, Snapshot crosscheck, bool wait)
{
    HTSU_Result result;
    TransactionId xid = GetCurrentTransactionId();
    Bitmapset  *hot_attrs;
    ItemId      lp;
    HeapTupleData oldtup;
    HeapTuple   heaptup;
    Page        page;
    Buffer      buffer, newbuf;
    bool        need_toast, already_marked;
    Size        newtupsize, pagefree;
    bool        have_tuple_lock = false;
    bool        iscombo;
    bool        use_hot_update = false;
    bool        all_visible_cleared = false;
    bool        all_visible_cleared_new = false;
    /* hongs added; variables */
#ifdef USE_ICL
    bool        doIcl = false, newDoIcl = false;
    BufferDesc *bufHdr = NULL;
    BufferDesc *newBufHdr = NULL;   /* for inserting icl log of PageSetLSN */
    Page        newpage;            /* for inserting icl log of PageSetLSN */
    ItemId      newlp;

    /*
     * Log only ordinary tables (relkind 'r'); doIcl stays false for
     * everything else, including index relations.
     */
    doIcl = (relation->rd_rel->relkind == 'r');
#endif
    /* ... remainder of heap_update ... */
------------------------------------------------------------------------
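For illustration only, the intended filter can be isolated from the backend as below. This is a standalone sketch, not backend code: the RELKIND_* macros are local copies of the codes PostgreSQL stores in pg_class.relkind, and icl_should_log and icl_should_log_selfcheck are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the relkind codes stored in pg_class.relkind. */
#define RELKIND_RELATION        'r'   /* ordinary table (heap) */
#define RELKIND_INDEX           'i'   /* index, including a primary key index */
#define RELKIND_SEQUENCE        'S'
#define RELKIND_TOASTVALUE      't'
#define RELKIND_VIEW            'v'
#define RELKIND_COMPOSITE_TYPE  'c'

/* Hypothetical helper: record dirty ranges only for ordinary tables. */
bool
icl_should_log(char relkind)
{
    return relkind == RELKIND_RELATION;
}

/* Small self-check showing the expected answers for a few relkinds. */
void
icl_should_log_selfcheck(void)
{
    assert(icl_should_log(RELKIND_RELATION));
    assert(!icl_should_log(RELKIND_INDEX));
    assert(!icl_should_log(RELKIND_TOASTVALUE));
}
```

With a helper like this, the assignment in heap_update would reduce to doIcl = icl_should_log(relation->rd_rel->relkind), assuming the helper is made visible to heapam.c.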
- Best Regards
Hongchan
(fallsmal@cs.yonsei.ac.kr, (02)2123-7757) -
노홍찬 <fallsmal@cs.yonsei.ac.kr> writes:
> I found that the relkind fields of all RelationData which is handed over to
> heap_update are all the same as 'r'.
Well, yeah: heap_update is applied to heaps (ordinary tables). Not indexes.
The indexes are generally updated in a separate operation afterwards.
> I want to distinguish normal relation (actual table) from primary index
> relation (primary indexes of some tables).
Perhaps you should take about three steps back and explain what it is
you want to do, because heap_update is probably not the right place
to be doing it.
regards, tom lane
Dear Tom Lane and hackers,
I am sorry, I should have explained the reason.
Actually, I'm not modifying the backend source code in order to submit a patch.
Since I am not a native speaker, I am not good at writing in English.
I'm just trying to build my own version of pgsql for research purposes.
Later, if my research turns out to be successful, I can contribute to enhancing pgsql
by implementing it concretely.
I'm researching DBMS I/O performance issues regarding flash memory and flash SSDs.
Flash memory has asymmetric read/write latency, and so do flash SSDs.
Therefore, reducing random writes to flash-based storage is quite an issue.
What I am trying to do now is to examine the real dirty portions of buffer pages to be flushed, like the following.
page 1
-------------
| | dportion1 (real dirty portion 1) ranges between 20 ~ 80
| dportion1 |
| | dportion2 (real dirty portion 2) ranges between 8190 ~ 8192
| |
| dportion2 |
-------------
There are many different kinds of page updates, such as updates to local buffers, temp relations, indexes, toasted attributes, and so forth,
so it would be a big burden for me to inspect all of that code.
Therefore, I decided to start by inspecting only updates to ordinary tables.
I added a log array field to the BufferDesc struct, and I add logs to the BufferDesc of the updated buffer
whenever an ordinary table is updated (the logs specify the real dirty portion ranges of the buffer).
So far, I have covered (at least I think I have covered) several functions: heap_update, heap_insert, heap_delete, heap_inplace_update,
heap_lock_tuple, heap_page_prune, heap_page_prune_execute, PageAddItem, PageRepairFragmentation,
and RelationPutHeapTuple.
Until now I haven't cared about vacuum-related functions, since I turned off the autovacuum option in the conf file.
I think it's too early to tell how my idea is going to work. When I am ready to say confidently that my idea
can enhance pgsql's performance a little at a small cost to other features, I will submit a proposal.
It's certainly not easy to grasp how the backend works, though.
Several articles and wiki pages helped me a lot, and the well-annotated code was the most helpful of all.
What I have been going through has helped me a lot in understanding the internals of DBMSs, and it was actually fun to read
the real working code of a DBMS.
This remarkable open-source DBMS code is so well maintained and continuously enhanced by this community
that many people, including me, can study it and participate, and for that I really thank you and the hackers.
About the question, I think I am a little confused. I don't know why, but the debug routine in my code still says that
the log inserted in heap_update belongs to a primary index relation. I will figure it out.
- Best Regards
Hongchan Roh -
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, October 26, 2009 12:07 AM
To: 노홍찬
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] a question about relkind of RelationData handed over to heap_update function
On Mon, 26 Oct 2009, 노홍찬 wrote:
> What I am trying to do now is to examine the real dirty portion of
> buffer pages to be flushed like the following.
You can trivially use pg_buffercache to view this, and its code in
contrib/pg_buffercache will show you how to navigate the buffer cache data
too. There's an example of how to use it in the documentation for that
module, and I've got some additional ones on my web page at
http://www.westnet.com/~gsmith/content/postgresql in the slides and
examples for "Inside the PostgreSQL Buffer Cache".
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Dear Greg Smith,
Thank you for letting me know about the presentations on your homepage.
They are going to be very helpful for understanding the internals of PostgreSQL further.
- Best Regards
Hongchan Roh -
-----Original Message-----
From: Greg Smith [mailto:gsmith@gregsmith.com]
Sent: Monday, October 26, 2009 5:32 AM
To: 노홍찬
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] a question about relkind of RelationData handed over to heap_update function
On Sun, Oct 25, 2009 at 9:37 AM, 노홍찬 <fallsmal@cs.yonsei.ac.kr> wrote:
> What I am trying to do now is to examine the real dirty portion of buffer pages to be flushed like the following.
> page 1
> -------------
> | | dportion1 (real dirty portion 1) ranges between 20 ~ 80
> | dportion1 |
> | | dportion2 (real dirty portion 2) ranges between 8190 ~ 8192
> | |
> | dportion2 |
> -------------
> Since there are many different kinds of page-updates such as updates to local buffer, temp relation, indexes, toasted attributes, and so forth.
> It would be a big burden to me if I inspect all that codes.
> Therefore, I decided to make a start point as inspecting only updates to the ordinary tables.
> I added a log array field to BufferDesc struct, and added logs to the designated bufferDesc of the updated buffer
> when it comes to ordinary table updates (The logs specifies the real dirty portion ranges of the buffer).
I would think you would want to modify MarkBufferDirty to take a start
and end point and store that in your log. Then modify every existing
MarkBufferDirty operation that you can to specify the range that the
subsequent operation is going to modify. You're going to run into
problems where you have code which looks like:
- mark buffer dirty
- do some work which modifies a predictable portion
- if (some rare condition)
- do some more work which modifies other parts of the buffer
The "some more work" may be some function call which doesn't usually
do much either.
So you may end up having to restructure a lot of code so that every
function is responsible for marking the buffer range dirty itself
instead of assuming it's already been marked.
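A rough standalone sketch of that restructuring, under stated assumptions: SketchBuffer, mark_buffer_dirty_range, rare_extra_work and update_tuple are hypothetical names, the 20-entry cap is arbitrary, and nothing here is the real bufmgr API (the real MarkBufferDirty takes only a Buffer). The point it shows is that the rarely taken path reports its own range instead of relying on the caller's earlier mark.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_BLCKSZ      8192   /* assumed page size in bytes */
#define SKETCH_MAX_RANGES  20     /* assumed cap on ranges kept per buffer */

typedef struct SketchRange
{
    int         start;            /* first dirty byte */
    int         end;              /* one past the last dirty byte */
} SketchRange;

typedef struct SketchBuffer
{
    bool        dirty;
    bool        ranges_overflowed;    /* true once a range had to be dropped */
    int         nranges;
    SketchRange ranges[SKETCH_MAX_RANGES];
} SketchBuffer;

/* Hypothetical range-aware variant: mark dirty and record [start, end). */
void
mark_buffer_dirty_range(SketchBuffer *buf, int start, int end)
{
    assert(0 <= start && start < end && end <= SKETCH_BLCKSZ);
    buf->dirty = true;
    if (buf->nranges < SKETCH_MAX_RANGES)
    {
        buf->ranges[buf->nranges].start = start;
        buf->ranges[buf->nranges].end = end;
        buf->nranges++;
    }
    else
        buf->ranges_overflowed = true;
}

/* The rarely taken path marks the bytes it touches itself. */
void
rare_extra_work(SketchBuffer *buf)
{
    mark_buffer_dirty_range(buf, 8190, 8192);
}

/* Caller: marks the predictable portion, then maybe does the extra work. */
void
update_tuple(SketchBuffer *buf, bool rare_condition)
{
    mark_buffer_dirty_range(buf, 20, 80);
    if (rare_condition)
        rare_extra_work(buf);
}
```

If rare_extra_work did not call mark_buffer_dirty_range, the buffer would still be flagged dirty, but the log would silently miss the bytes at 8190 to 8192.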
--
greg
Dear Greg Stark,
You are totally right. I want to record all of the updated regions.
So the "some work" involved is not a small amount of work.
But I am trying to touch the existing code as little as I can.
Therefore, I have mostly added my own code. I didn't change the MarkBufferDirty function at all, but of course I have created a function that is supposed to work similarly to what you described.
I am sorry that I couldn't understand the meaning of the following sentence: The "some more work" may be some function call which doesn't usually do much either.
What did you mean by that sentence? Please excuse my poor English; it would be great if you could explain it once more.
So far, it looks like this: I have appended several fields to the BufferDesc structure, and my own structure (IclNewLog) is used for recording those dirty regions.
------------------------------------------------------------------------
/* Declared before BufferDesc because BufferDesc embeds an array of it. */
typedef struct IclNewLog
{
    int         change_start;
    int         change_end;
    uint32      file;                   /* for ICL_DEBUG */
    int         line;                   /* for ICL_DEBUG */
    int         icl_log_global_seq;     /* for ICL_DEBUG */
} IclNewLog;

typedef struct sbufdesc
{
    BufferTag   tag;                    /* ID of page contained in buffer */
    BufFlags    flags;                  /* see bit definitions above */
    uint16      usage_count;            /* usage counter for clock sweep code */
    unsigned    refcount;               /* # of backends holding pins on buffer */
    int         wait_backend_pid;       /* backend PID of pin-count waiter */
    slock_t     buf_hdr_lock;           /* protects the above fields */
    int         buf_id;                 /* buffer's index number (from 0) */
    int         freeNext;               /* link in freelist chain */
    LWLockId    io_in_progress_lock;    /* to wait for I/O to complete */
    LWLockId    content_lock;           /* to lock access to buffer contents */
    /* hongs added */
#ifdef USE_ICL
    bool        isBufferPageNewOrXlogRead;
    int         icl_length;             /* number of entries used in icl_logs */
    IclNewLog   icl_logs[ICL_LEN_LIMIT];
#endif
    /* hongs added */
} BufferDesc;
------------------------------------------------------------------------
* a part of heap_update function *
    /* heapam.c line 2761 (existing code) */
    oldtup.t_data->t_ctid = heaptup->t_self;

    /* hongs added; ICL logs oldtuple's tupleheader */
#ifdef USE_ICL
    if (doIcl)
    {
        /*
         * The buffer header lock and the buffer content lock are separate,
         * so I guess the buffer header lock is needed.
         */
        LockBufHdr(bufHdr);
        if (bufHdr->icl_length < ICL_LEN_LIMIT - 1)
        {
            bufHdr->icl_logs[bufHdr->icl_length].change_start = lp->lp_off;
            bufHdr->icl_logs[bufHdr->icl_length].change_end = lp->lp_off + sizeof(HeapTupleHeaderData);
            bufHdr->icl_logs[bufHdr->icl_length].file = HEAPAM;
            bufHdr->icl_logs[bufHdr->icl_length].line = 3003;
            /* make sure the recorded log size is correct */
            IclAssert(IsIclLogValid(bufHdr->icl_logs[bufHdr->icl_length]));
            bufHdr->icl_length++;
        }
        UnlockBufHdr(bufHdr);
    }
#endif
    /* hongs added end */

    /* heapam.c lines 2762-2764 (existing code) */
    if (newbuf != buffer)
        MarkBufferDirty(newbuf);
    MarkBufferDirty(buffer);
------------------------------------------------------------------------
I named the log the "icl log".
The above code records the update to the old tuple's header in the log array field of the buffer descriptor whose buffer page is about to be marked dirty.
I'm not interested in buffers that are updated frequently. I'm interested in buffers to be flushed that contain only a very small amount of genuinely updated area.
Since pgsql's update policy uses the MVCC snapshot model, every update also modifies the old tuple's header (changing its xmax field).
There might be some buffer pages to be flushed that have only one or two small regions of genuine updates, such as an updated xmax field or an updated XLogRecPtr.
I think, purely in my opinion, that flush operations with such small genuine update regions are inefficient.
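As a rough illustration of how such pages could be picked out, the sketch below sums the bytes covered by a page's recorded ranges and compares the sum with a threshold. It is standalone C, not backend code: ByteRange, dirty_bytes and is_small_update are hypothetical names, the threshold is arbitrary, ranges are treated as [start, end), and they are assumed not to overlap.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size in bytes */

typedef struct ByteRange
{
    int         start;          /* first dirty byte */
    int         end;            /* one past the last dirty byte */
} ByteRange;

/* Total bytes covered by the ranges; assumes they do not overlap. */
int
dirty_bytes(const ByteRange *ranges, int n)
{
    int         total = 0;

    for (int i = 0; i < n; i++)
    {
        assert(0 <= ranges[i].start);
        assert(ranges[i].start <= ranges[i].end);
        assert(ranges[i].end <= SKETCH_BLCKSZ);
        total += ranges[i].end - ranges[i].start;
    }
    return total;
}

/* Hypothetical test: does the page carry only a small genuine update? */
bool
is_small_update(const ByteRange *ranges, int n, int threshold_bytes)
{
    return dirty_bytes(ranges, n) <= threshold_bytes;
}
```

For the two portions drawn earlier (20 ~ 80 and 8190 ~ 8192), dirty_bytes gives 62 bytes out of an 8192-byte page.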
This is not a problem of pgsql alone, though. The in-place update operations of every DBMS have similar problems.
I think pgsql's update logic is less problematic than others,
since the main updates (not the old tuple's header update but the new tuples themselves) can pile up in one buffer page (not in scattered pages),
and the HOT update mechanism handles the earlier problems of snapshot-based MVCC well in pgsql.
Therefore, I limited the maximum log array size to 20. If I apply some log merge logic (since many logs can be merged together, e.g. 8152 ~ 8172 and 8162 ~ 8192 -> 8152 ~ 8192),
the array size would be enough to locate the buffers having small genuine update regions. I don't care about buffers that have more logs than the maximum log array size.
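One possible shape for that merge step is sketched below: sort the ranges by start offset, then fold each range into the previous one when they overlap or touch. This is only a standalone illustration; ByteRange and merge_ranges are hypothetical names, ranges are treated as [start, end), and it is not the code currently in my tree.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ByteRange
{
    int         start;          /* first dirty byte */
    int         end;            /* one past the last dirty byte */
} ByteRange;

/* qsort comparator: order ranges by ascending start offset. */
static int
cmp_start(const void *a, const void *b)
{
    const ByteRange *x = a;
    const ByteRange *y = b;

    return (x->start > y->start) - (x->start < y->start);
}

/*
 * Sort the ranges and merge overlapping or adjacent ones in place.
 * Returns the number of ranges left at the front of the array.
 */
int
merge_ranges(ByteRange *r, int n)
{
    int         out = 0;

    if (n <= 0)
        return 0;
    qsort(r, (size_t) n, sizeof(ByteRange), cmp_start);
    for (int i = 1; i < n; i++)
    {
        assert(r[i].start <= r[i].end);
        if (r[i].start <= r[out].end)
        {
            /* overlaps or touches the current range: extend it if needed */
            if (r[i].end > r[out].end)
                r[out].end = r[i].end;
        }
        else
            r[++out] = r[i];    /* gap found: start a new merged range */
    }
    return out + 1;
}
```

Applied to 8152 ~ 8172 and 8162 ~ 8192, this leaves the single range 8152 ~ 8192, while a distant range such as 20 ~ 80 stays separate.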
This is just an example; the current code doesn't look exactly like this.
I am trying not to touch the existing code but only to append my logic, so that later my code can be applied as an additional module for a specific purpose such as flash-based storage.
I want to emphasize once more that this attempt is not for a pgsql patch or enhancement but for my own research, at least for now.
Besides, this attempt is just preparation for implementing my research idea.
Therefore, if you see a lot of inefficiency and clumsiness in it, please understand.
Later, when I am confident enough to show the whole picture of my idea and working code (at least after passing the regression tests and my own tests using the dbt2 benchmark),
I'll present it to you and the hackers.
I really appreciate your interest in my attempt.
As for the original question, I found my mistake: I confused the relation OID with relNode (of RelFileNode). Sorry for the hasty question.
Thank you for reading this.
- Best Regards
Hongchan Roh -
-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Tuesday, October 27, 2009 2:22 AM
To: 노홍찬
Cc: pgsql-hackers@postgresql.org
Subject: Re: a question about relkind of RelationData handed over to heap_update function