提交 e95a8ada 编写于 作者: H Heikki Linnakangas 提交者: Xin Zhang

Remove unused fields, and README.

上级 2aa8a67e
......@@ -190,13 +190,9 @@ OpenAOSegmentFile(Relation rel,
void
CloseAOSegmentFile(MirroredAppendOnlyOpen *mirroredOpen)
{
bool mirrorDataLossOccurred; // UNDONE: We need to do something now...
Assert(mirroredOpen->primaryFile > 0);
MirroredAppendOnly_Close(
mirroredOpen,
&mirrorDataLossOccurred);
MirroredAppendOnly_Close(mirroredOpen);
}
/*
......@@ -206,8 +202,6 @@ void
TruncateAOSegmentFile(MirroredAppendOnlyOpen *mirroredOpen, Relation rel, int64 offset)
{
int primaryError;
bool mirrorDataLossOccurred; // We'll look at this at close time.
char *relname = RelationGetRelationName(rel);
Assert(mirroredOpen->primaryFile > 0);
......@@ -217,11 +211,9 @@ TruncateAOSegmentFile(MirroredAppendOnlyOpen *mirroredOpen, Relation rel, int64
* Call the 'fd' module with a 64-bit length since AO segment files
* can be multi-gigabyte to the terabytes...
*/
MirroredAppendOnly_Truncate(
mirroredOpen,
offset,
&primaryError,
&mirrorDataLossOccurred);
MirroredAppendOnly_Truncate(mirroredOpen,
offset,
&primaryError);
if (primaryError != 0)
ereport(ERROR,
(errmsg("\"%s\": failed to truncate data after eof: %s",
......
CHANGETRACKING
CT phase as the name implies tracks the changes performed only on the
bufpool with some exceptions like PT. In this phase the mirror is down
and hence the CT log helps to remember the changes in a concise way,
which helps the resync phase to copy changes when the mirror is up. In
CT mode only the main filerep and the CT recovery processes run.
CT Recovery
============
The FileRep CT recovery process on the primary is spawned by the
filerep main process. It is the only filerep main’s child process (in
CT mode) that is active because there is no mirror to communicate data
to.
This recovery process performs two functions i.e. changetracking
transition and then compaction of the changetracking log generated
during modification to primary data in changetracking mode.
During this CT transition phase, it seems like the database IO is
ceased. (XXX Need to confirm).
CT transition – FileRepPrimary_StartRecoveryInChangeTracking():
1. The system can transition to changetracking mode either from InSync
mode or InResync mode. Depending on that –
If (the system completed transition from resync to insync) then Drop
all the old CT log and CT meta file because the changes tracked in
them are no longer required. (A little complex to understand - Refer to
ResyncTransition, ResyncMgr, Insync transition from resync manager
part)
else (the system did not complete resync transition) then based on if
it was full or incr resync mark the resync type in CT meta file
In case the system has transitioned from insync to changetracking
mode, the CT files may or may not be present depending on if there was
any past transition from resync to insync mode. If there are no CT
files just create the CT meta file and mark incr resync (The reasoning
why we need to make incr resync is still unclear)
2. XlogInChangetrackingTransition()
Set the filerep state to changetrackingintransition.
Before the system can add records to CT full log during runtime (info
about CT log types can be found under normal IO on primary section),
it needs to initialize the CT full log. Initializing the CT full log
means reading Xlog records from latest checkpoint/redo location till
the end of the Xlog. The reason behind this will be clear soon.
Every time a record is inserted into the xlog buffer in mem, the
system maintains the last location where the record was inserted in
all the phases insync, ct or inresync. Let's call this location
'lastchangetrackingendloc'. This location points to the last record in
the xlog which may or may not be on the disk depending on if the xlog
was flushed or not.
So, flush it on the disk till
'lastchangetrackingendloc'. (XlogFileRepFlushCache())
Changetracking_CreateInitialFromPreviousCheckpoint() - Once flushed,
copy the records from the xlog from the latest redo/checkpoint
location to ‘lastchangetrackingendloc’ to the CT full file. Only
interesting records are copied. Info about these records will be
detailed under normal IO operations. The need to copy records from
latest ckpt till the end is that there is no guarantee from the latest
checkpoint which xlog records and thus the changes were flushed to the
mirror.
Here’s a special case: - During runtime in CT mode, there is a chance
that records logged in xlog buffer may be not be on the disk but the
CT buffer can get full and flush these records on the disk. In short,
xlog record ‘A’ in xlog buffer in mem but is on disk at CT log. Now if
the system restarts due to some reason, the xlog on disk and CT log on
disk will be out of sync. To resolve it this function takes care by
converting the CT full log to CT transient log. This transient CT file
will be processed as part of the compaction function in the next phase
of the recovery process. (‘lastchangetrackingendloc’ during restart is 0)
3. Set filrep subprocess state to ready indicating the transition is
complete.
4. Resume IO on the database.
CT logs compaction – FileRepPrimary_RunChangeTrackingCompaction()
------------------
Please refer first to the types of CT log files detailed under normal
IO during CT.
Compaction shrinks the full CT file so that the system can avoid
processing of large amount of records and thus at the same save disk
space. There are too many details on how compaction works. Let's try to
understand it by simple illustration –
Initially lets assume that following records were added to the CT full
log file:
Relfilenode [tblspc,db,rel] Block num Xlog loc PTTID PTSNUM
[1,2,3] 2 A 10 10
[1,2,3] 3 B 10 10
[1,2,4] 1 C 12 11
[1,2,3] 3 D 10 10
Now we compact this when either the largest lsn in the CT full file is
> than the lsn on the disk (for e.g. during system restart) or if the
CT full log file goes beyond threshold size.
For compaction, the system first renames the CT full log file to CT
transient file. This transient file is queried in such a way that only
the one record per unique combination of relfilenode, blocknum, PTTID,
PTSNUM are generated with max xlog loc. These new records are added to
the CT compact file. That’s why it is called compact. This reduces the
number of records logged for a block under a relfilenode. So, per the
example above the compact CT file would look like as follows:
Relfilenode [tblspc,db,rel] Block num Xlog loc PTTID PTSNUM
[1,2,3] 2 A 10 10
[1,2,4] 1 C 12 11
[1,2,3] 3 D 10 10
Changetracking - IO on primary
------------------------------
During CT mode, the IO on the primary seg performs normally except
that changes are not pushed to the mirror (naturally) but are logged
into CT log. Only bufpool changes are logged into the log (heap,
btree, bitmap, gist). But note that PT, GpIDRel and SharedDependRelId
relation changes are not logged in the CT log as PT are always
resynced in the scan incremental way (in short full copy).
CT logs are of 4 types: CT Meta, CT full, CT Compact and CT
Transient.
The "Full" log includes all the changetracking records that were
accumulated from the xlog. The full log is the first "landing place"
for all the relevant xlog records. Every record in the full log has a
format [relfilenode, xlogloc, bufpoolblocknum, pers tid, pers serial
num].
The "Compact" log is a much smaller version and includes only the
necessary change tracking records that are needed for resync. This is
essentially a shrinked down version of all the records that ever
landed in the full log.
The "Transient" log is a temporary log file that is used to help with
the process of compacting the full and compact logs. In general, when
we want to compact the full or compact log files, we rename them to be
"transient" and then do the actual compacting into the new compact
file. (More details on compacting – refer CT recovery process)
The META file is where we store information about if it's a full
resync and the last resync lsn, and other meta info.
ChangeTracking_AddRecordFromXlog() - On an xlog insert based on the
xlog record a changetracking log is added to CT log iff it’s a bufpool
change (no PT, other changes) as said before.
RESYNCHRONIZATION
Resync phase comes after CT phase and ideally goes into InSync
phase. Any events while resync is in progress can keep it in resync
phase itself or transition to CT. Resync relies on 2 main components
resync manager and resync workers.
Resync Manager
==============
Resync manager is one of the two important entities that initiates the
whole resync process when gprecoverseg –a/-f is run. It performs all
foundation work that includes creation/dropping of fsys objects based
on what changed on primary when the mirror was down. The main work of
carrying the changes to the mirror is performed by the resync
workers. Once all the fsys objects are resynced , the resync manager
transitions the segment into insync phase.
Resync manager initiates a phase called ‘Resync transition’ (or in
some case its called resync recovery). In this it gets ready by
marking PT entries either to go full copy or incremental resync. After
resync transition, it performs accounting of relations which undergo
changes as part of resync operation and once all of them are
processed(resynced), it quits indicating the system is in InSync
mode. At last, the resync manager performs transition to insync phase
where all the flat files are mirrored because they are not mirrored
(not copied to the mirror) during normal resync phase (resync
transition and accounting).
Under InResyncTransition:
1. Acquire mirror lock and then do primaryMirrorSetIOSuspended(TRUE)
Database transitions to suspended state, IO activity on the segment is
suspended. We suspend the IO on the primary segment so that the system
can track atomically the correct last LSN tracked in CT log.
2. ChangeTracking_RecordLastChangeTrackedLoc()
Store last change tracked LSN into shared memory and in Change
Tracking meta file (LSN has to be fsync-ed to persistent
media). Let's call it as 'endresync LSN'. Remember 'endResync LSN'
for future reference. Also make sure to write any remaining
changetracking full log file data and fsync it.
3. If the mode is Full Resync then
PersistentFileSysObj_MarkWholeMirrorFullCopy ()
Set mirrorexistencestate in all PT (database, filespace, relation)
tuples to
if persistent state is (*createpending or created)
then set to mirrordownbeforecreate if persistent state is (droppending
or abortingcreate) if (mirrorexistencestate =
mirroronlydropremains) Free the tuple in that Persistent Table as
the fsys object has already been dropped on primary and the mirror
is already all cleaned ? XXX: Where do we clean the whole mirror
(on the mirror itself, maybe ?)
Else Skip that PT tuple from
processing Calculate number of blocks to be resynced if the fsys
object is a relation Set MirrorDataSyncstate for that tuple to
FullCopy
4. If not in Full Resync Mode then:
PersistentFileSysObj_MarkSpecialScanIncremental(): First, mark the
special persistent tables and others as 'Scan Incremental'. These
include OIDs {5090, 5091, 5092, 5093, and 5096} and others (see
GpPersistent_SkipXLogInfo). And, mark any Buffer Pool managed
relations that were physically truncated as 'Scan incremental'
because we don't know how to process old changes in the change
tracking log. At the same time calculate total number of blocks in
these relations and add them to be resynced. *Setting
MirrorDataSyncState to BufferPoolScanIncremental*
PersistentFileSysObj_MarkPageIncrementalFromChangeLog(): First,
compact any remaining changetracking full files (Refer to CT
compaction in the CT phase section). Query the compact CT log to get
unique pair of PTTID and PTSN along with count of distinct blocks
change tracked and at the same time where PTSN is max. See this e.g if
at some point in time table ‘a’ was at PTTID/SN (2,13)/19, then its 10
blocks were change tracked, then it was dropped and the free tuple was
re-used by table ‘b’. Now table b will have PTTID/SN (2,13)/20. Now if
Table b was changed tracked for 8 blocks out query should return
(2,13)/20 and 8 (NOT (2,13)/19 & 10). The query output is [PTTID,
PTSN, Block #]. Using this update tuples in relation PT with all such
blocks # at mirror_bufpool_resync_changed_page_count iff that tuple
has not been marked for scan incremental and some other valid
constraints are satisfied. Else skip updating the tuple. Add all this
block count to blocks to synchronize. *Set MirrorDataSyncState to
BufferPoolPageIncremental*
PersistentFileSysObj_MarkAppendonlyCatchup(): For all the fsys
objects in relation PT which are AO/CO and they are created and their
lossEOF < NewEOF mark them as AppednonlyCatchup. *Set
MirroDataSyncState to AppendonlyCatchup*
XXX: There is a comment in the code which explains why we don’t look at
'create pending' AO. Not clear yet.
5. PersistentFileSysObj_MirrorReDrop() – Do Mirror drops from
relations to filespaces (bottom to top) using corresponding PT
entries. Scan thru the each PT and do the following:
If (PersistentState is DropPending or Aborting create) and
(mirrorExsState is onlyMirroDropRemians)
then Perform drops of fsys objects on Mirror
Mark the PT entry to dropped and thus Free the PT entry (tuple)
*Freeing the PT tuple due to drop*
XXX: Funny comment on duplicate AO entries in PT (Note this for future ref)
6. PersistentFileSysObj_MirrorReCreate() – Do Mirror creates from
FileSpaces to relations (top to bottom) using corresponding PT
entries. Scan thru each PT entry and do the following – If
Persistent State is *createPending or created and
MirroExsState is MirrDwnBefCrt or MirrDwnDurCrt then Do re
create of fsys object Else if MirrorExsState is MirroRCreated
then Make sure that it really exists on the mirror Else
request the user to run full resync
7. Set File State to FileRepStateReady (XXX: What’s the reason to set it
to Ready?)
8. PersistentFileSysObj_MarkMirrorRecreated(): Mark in PT that the
Mirror fsys objects have been recreated. In short it updates the PT
entries for fsys objects created in step 6 above as created or create
pending. (Top to bottom approach used here) Setting
MirrorExistanceState to either created or create pending (XXX: Why
didn’t we do it as part of step 6 itself?)
9. Mark in CT Meta data file that resync transition is complete. And
at the same time RESET full resync.
10. Release mirror lock and resume database IO
B. Run the Resync Manager (accounting)
--------------------------------------
Inside a loop, PersistentFileSysObj_ResynchronizeScan():
Scan thru the relation PT and get a non-free entry which is not
BulkLoadCreatePending with any kind of mirror create state
(mirrocreated, mirrordown beforecreate etc). These consist of
entries we got updated from the resync transition phase.
But if this entry has mirror data sync state as synced or none
then we can skip and go to the next entry. Thus if the entry
passes above constraints, we lock that relation (obtained from
the entry) to avoid any truncates or drops.
We add this entry to a hash table (fileRepReSyncHash) used in
the actual resync process by the resync workers.
As we move forward, we see if the entries from the hash are
processed by the worker task (those relations are completely
resynced). If yes, delete them from the hash table and thus update the
PT entry to indicate that the resync is completed for that
relation/segnum combination. Setting MirroDataSyncState to DataSynced
AND If it is a bufpool relation reset resync_changed_page_count,
bufpool_resync_ckpt_loc and bufpool_resync_ckpt_block_num Else if AO
then set ao_loss_eof to the ao_new_eof
C. Resync Manager Resync to Insync Transition
---------------------------------------------
MirroredFlatFile_DropFileFromDir():
Drop all flat files used during normal runtime on the mirror like
everything inside two_phase dir, clog dir, distributed log dir, xlog
dir etc. This avoids the need to do incremental changes to them. In
short we do full copy of such files.
Create Checkpoint():
Creating a checkpoint at this stage helps to flush all the buffers
(wal, heap etc), control file, other flat files in mem onto the disk
and thus on the mirror. One hack in this routine helps to resync the
deleted files in the previous step to get the mirror upto speed. I
hate such code.
Create_checkpoint():
We create again a new checkpoint in order to mirror pg_control with
the last checkpoint position in the xlog file that is mirrored.
XXX: No clue yet
FileRepPrimary_MirrorInSyncTransition:
Transition the mirror from resync to insync phase by sending a message.
ChangeTracking_MarkTransitionToInsyncCompleted:
Mark in the change tracking meta data file that transition to insync is
complete and fsync the change.
Resync Worker
===============
Resync workers do the actual work of resyncing the relation which was
change tracked. Normally, we have 4 worker processes working at the
same time. But their number is configurable.
Inside a loop till the system can process relations, do
FileRepPrimary_GetResyncEntry():
Find from the hash table created by the resync manager entries
which have bufpoolpageincremental state or entries which have
either scan incremental, full copy or ao catchup. A set of
relations with bufpool incremental will be processed by the
current worker at a time but only one either scan incremental or
full resync or ao catchup case will be processed by the worker
task at a time.
FileRepPrimary_ResyncWrite():
Resync relations which are either scan incremental (all PT) or ao
catchup (ao tables) or full copy (both heap and ao including PT). For
bufpool relstoragemgr (heap tables) either scan incremental or full
copy, for every block/page in the relation (Total number of block =
'X') 1. if (Page LSM <= endResync LSN) and ( relation’s
mirrorBufPoolResyncCkptLoc <= Page LSN) then write that page/block on
the mirror (XXX: Do we ever use mirrorBufPoolResyncCkptLoc or even set
it?)
Now, truncate from the relation on the mirror upto the block # 'X'. (XXX:
No clue why we need to do this here.)
Flush all the buffers in the bufpool for this relation both on the
mirror and the primary.
For ao type relations divide the remaining part of the ao seg file
(neweof – losseof) to be resynced into 64k blocks and then write such
blocks to the mirror till the neweof. Then flush all such data on the
mirror.
Then update the entry in the hash created by the resync manager that
the worker has completed resync operations for that relation/seg file
combination.
Note here that the system does not update the PT tuples related to the
relations here but it will be done by the resync manager before it
removes the entry from the hash table. Refer to the resync manager
working. FYI: This is the function that resyncs the PTs and Full
resync relations.
FileRepPrimary_ResyncBufferPoolIncrementalWrite():
For a given relation which needs bufpoolpageincremental
(incremental) resync, find from the ‘CT compact log’ entries for all
the blocks that were modified when the system was in changetracking
mode. CT compact log has to be used as it has shirked version of CT
full log. For e.g one entry from this log would look like this
[relfilenode, block #, highest LSN, PTTID, PTSN]. For each such block
that was changed during changetracking phase:
If (page/block LSN == LSN obtained from CT compact log for
that block) Write that block to mirror
Else if the
page LSN is greater than CT compact log LSN, skip such
pages/block processing for now (See below)
But, make sure that the LSN obtained from the CT compact log is NOT
greater than Page/Block LSN i.e. we have an error/issue.
Now perform immediate buf pool sync for that relation which will apply
changes both on the primary and the mirror for the blocks we skipped
before and make them persist on the media.
Update the entry in hash table indicating that incremental resync has
been performed for the relation.
......@@ -369,10 +369,7 @@ AppendOnlyStorageWrite_OpenFile(AppendOnlyStorageWrite *storageWrite,
seekResult = FileSeek(file, logicalEof, SEEK_SET);
if (seekResult != logicalEof)
{
bool mirrorDataLossOccurred;
MirroredAppendOnly_Close(&storageWrite->bufferedAppend.mirroredOpen,
&mirrorDataLossOccurred);
MirroredAppendOnly_Close(&storageWrite->bufferedAppend.mirroredOpen);
ereport(ERROR,
(errcode(ERRCODE_IO_ERROR),
......
......@@ -41,13 +41,8 @@
* un-mirrored bulk initial data.
*/
void
MirroredAppendOnly_Flush(
MirroredAppendOnlyOpen *open,
/* The open struct. */
int *primaryError,
bool *mirrorDataLossOccurred)
MirroredAppendOnly_Flush(MirroredAppendOnlyOpen *open,
int *primaryError)
{
Assert(open != NULL);
Assert(open->isActive);
......@@ -58,9 +53,6 @@ MirroredAppendOnly_Flush(
{
*primaryError = errno;
}
*mirrorDataLossOccurred = open->mirrorDataLossOccurred;
/* Keep reporting-- it may have occurred anytime during the open session. */
}
/*
......@@ -68,11 +60,7 @@ MirroredAppendOnly_Flush(
*
*/
void
MirroredAppendOnly_Close(
MirroredAppendOnlyOpen *open,
/* The open struct. */
bool *mirrorDataLossOccurred)
MirroredAppendOnly_Close(MirroredAppendOnlyOpen *open)
{
Assert(open != NULL);
Assert(open->isActive);
......@@ -82,16 +70,12 @@ MirroredAppendOnly_Close(
FileClose(open->primaryFile);
*mirrorDataLossOccurred = open->mirrorDataLossOccurred;
/* Keep reporting-- it may have occurred anytime during the open session. */
open->isActive = false;
open->primaryFile = 0;
}
void
MirroredAppendOnly_Drop(
RelFileNode *relFileNode,
MirroredAppendOnly_Drop(RelFileNode *relFileNode,
int32 segmentFileNum,
int *primaryError)
{
......@@ -210,20 +194,13 @@ MirroredAppendOnly_Truncate(
/* The open struct. */
int64 position,
/* The position to cutoff the data. */
int *primaryError,
bool *mirrorDataLossOccurred)
int *primaryError)
{
*primaryError = 0;
*mirrorDataLossOccurred = false;
errno = 0;
if (FileTruncate(open->primaryFile, position) < 0)
*primaryError = errno;
*mirrorDataLossOccurred = open->mirrorDataLossOccurred;
/* Keep reporting-- it may have occurred anytime during the open session. */
}
/*
......
......@@ -38,22 +38,8 @@ typedef struct MirroredAppendOnlyOpen
uint32 segmentFileNum;
char mirrorFilespaceLocation[MAXPGPATH+1];
// UNDONE: Consider palloc'ing this instead of statically allocating
// UNDONE: to use less stack space...
File primaryFile;
bool create;
bool primaryOnlyToLetResynchronizeWork;
bool mirrorOnly;
bool copyToMirror;
bool mirrorDataLossOccurred;
} MirroredAppendOnlyOpen;
// -----------------------------------------------------------------------------
......@@ -63,22 +49,13 @@ typedef struct MirroredAppendOnlyOpen
/*
* Flush an Append-Only relation file.
*/
extern void MirroredAppendOnly_Flush(
MirroredAppendOnlyOpen *open,
/* The open struct. */
int *primaryError,
bool *mirrorDataLossOccurred);
extern void MirroredAppendOnly_Flush(MirroredAppendOnlyOpen *open,
int *primaryError);
/*
* Close an Append-Only relation file.
*/
extern void MirroredAppendOnly_Close(
MirroredAppendOnlyOpen *open,
/* The open struct. */
bool *mirrorDataLossOccurred);
extern void MirroredAppendOnly_Close(MirroredAppendOnlyOpen *open);
extern void MirroredAppendOnly_Drop(
......@@ -111,13 +88,9 @@ extern void MirroredAppendOnly_Append(
extern void MirroredAppendOnly_Truncate(
MirroredAppendOnlyOpen *open,
/* The open struct. */
int64 position,
/* The position to cutoff the data. */
int *primaryError,
bool *mirrorDataLossOccurred);
int *primaryError);
extern void ao_create_segfile_replay(XLogRecord *record);
extern void ao_insert_replay(XLogRecord *record);
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册