The key difference between AOF persistence and RDB persistence is that the former records data changes while the latter saves the data itself. This article covers AOF persistence: how the AOF data is organized and how the mechanism operates. Redis implements the AOF operations mainly in aof.c.
Like RDB, AOF persistence involves file reads and writes and therefore uses the rio data structure. rio was covered in the previous article, so it is not discussed again here.
Suppose Redis holds the key-value pair "name:Jhon" in memory. After AOF persistence, the AOF file contains the following:
*2      # 2 arguments
$6      # length of the 1st argument: 6
SELECT  # 1st argument
$1      # length of the 2nd argument: 1
8       # 2nd argument
*3      # 3 arguments
$3      # length of the 1st argument: 3
SET     # 1st argument
$4      # length of the 2nd argument: 4
name    # 2nd argument
$4      # length of the 3rd argument: 4
Jhon    # 3rd argument
Replaying this content therefore reconstructs two familiar Redis commands: SELECT 8; SET name Jhon. You can picture the process: Redis iterates over every key-value pair in the in-memory dataset and writes each one to disk in turn; at startup, Redis reads the AOF file back and replays it to restore the data.
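The record format above is simply the RESP multibulk encoding of a command. As a minimal sketch of that encoding (the helper name resp_encode is illustrative, not part of Redis), a command vector can be serialized like this:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative helper, not part of Redis: encode an argv-style command
 * into the RESP multibulk format that AOF files use. */
static size_t resp_encode(char *out, size_t outlen, int argc, const char **argv) {
    size_t n = 0;
    int i;

    /* "*<argc>\r\n" header */
    n += snprintf(out + n, outlen - n, "*%d\r\n", argc);
    /* "$<len>\r\n<arg>\r\n" for every argument */
    for (i = 0; i < argc; i++)
        n += snprintf(out + n, outlen - n, "$%zu\r\n%s\r\n",
                      strlen(argv[i]), argv[i]);
    return n;
}
```

Encoding {"SET", "name", "Jhon"} produces exactly the *3/$3/SET/$4/name/$4/Jhon byte sequence shown above.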
Unlike the RDB persistence mechanism, Redis AOF operates in two modes: background execution, and appending while serving.
(Figure: AOF persistence overview, http://wiki.jikexueyuan.com/project/redis/images/redis18.png)
1) Background execution of AOF resembles RDB: Redis forks a child process; the parent keeps serving clients while the child performs AOF persistence, dumping the data to disk. Unlike RDB, while the background child is persisting, the parent records every data change produced in the meantime (it is still serving) and stores them in server.aof_rewrite_buf_blocks; after the child exits, Redis appends this update cache to the AOF file, something RDB persistence does not do.
A word about this "update cache". Whenever the Redis server applies a data change, e.g. set name Jhon, it not only modifies the in-memory dataset but also records the update operation, encoded in the format described above.
The update cache can live in server.aof_buf, which you can think of as a small staging area: all accumulated update records go here first, and at specific moments they are written to the file or inserted into the server.aof_rewrite_buf_blocks list (detailed below). Data is added to server.aof_buf in propagate(), which is called everywhere a data update happens so that changes accumulate. The update cache can also live in server.aof_rewrite_buf_blocks, a linked list whose elements are of type struct aofrwblock; think of it as a warehouse. While a background AOF child is running, the accumulated update records (from server.aof_buf) are inserted into this list, and when the AOF child finishes, the whole list is written to the file. The two buffers are thus related.
The intent is to avoid triggering a write for every single data change: writes are first cached in memory and flushed to disk at a suitable moment, avoiding frequent write operations. Of course, Redis could instead push every change to disk immediately; weighing the two approaches against each other is a trade-off.
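The batching idea above can be sketched in a few lines. This is a minimal illustration of the pattern, not Redis's actual buffer code; the struct and function names are invented for the example:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of write batching: change records accumulate in an in-memory
 * buffer and are flushed to the file at a chosen moment, trading a small
 * durability window for far fewer write(2) calls. */
typedef struct {
    char   buf[4096];
    size_t used;
} change_buf;

/* Append one serialized change record to the in-memory buffer. */
static void buf_append(change_buf *b, const char *rec, size_t len) {
    if (b->used + len <= sizeof(b->buf)) {
        memcpy(b->buf + b->used, rec, len);
        b->used += len;
    }
}

/* Flush everything accumulated so far with a single write. */
static void buf_flush(change_buf *b, FILE *fp) {
    fwrite(b->buf, 1, b->used, fp);
    fflush(fp);
    b->used = 0;
}
```

Many small appends become one file write, which is exactly the effect server.aof_buf has in Redis.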
Below is the main code for background execution:
// Start a background child process to perform AOF persistence. Called from
// bgrewriteaofCommand(), startAppendOnly(), and serverCron().
/* This is how rewriting of the append only file in background works:
 * 1) The user calls BGREWRITEAOF
 * 2) Redis calls this function, that forks():
 *    2a) the child rewrite the append only file in a temp file.
 *    2b) the parent accumulates differences in server.aof_rewrite_buf.
 * 3) When the child finished '2a' exits.
 * 4) The parent will trap the exit code, if it's OK, will append the
 *    data accumulated into server.aof_rewrite_buf into the temp file, and
 *    finally will rename(2) the temp file in the actual file name.
 *    Then the new file is reopened as the new append only file. Profit!
 */
int rewriteAppendOnlyFileBackground(void) {
    pid_t childpid;
    long long start;

    // A rewrite child is already running
    if (server.aof_child_pid != -1) return REDIS_ERR;
    start = ustime();
    if ((childpid = fork()) == 0) {
        char tmpfile[256];

        /* Child */
        // Stop listening for new connections
        closeListeningSockets(0);
        // Set the process title
        redisSetProcTitle("redis-aof-rewrite");
        // Temporary file name
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
        // Perform the AOF rewrite
        if (rewriteAppendOnlyFile(tmpfile) == REDIS_OK) {
            // "Dirty" data here is the copy-on-write memory consumed
            // by the child process; get its size
            size_t private_dirty = zmalloc_get_private_dirty();

            // Log the copy-on-write cost
            if (private_dirty) {
                redisLog(REDIS_NOTICE,
                    "AOF rewrite: %zu MB of memory used by copy-on-write",
                    private_dirty/(1024*1024));
            }
            exitFromChild(0);
        } else {
            exitFromChild(1);
        }
    } else {
        /* Parent */
        server.stat_fork_time = ustime()-start;
        if (childpid == -1) {
            redisLog(REDIS_WARNING,
                "Can't rewrite append only file in background: fork: %s",
                strerror(errno));
            return REDIS_ERR;
        }
        redisLog(REDIS_NOTICE,
            "Background append only file rewriting started by pid %d",childpid);
        // The rewrite has started; cancel any scheduled rewrite
        server.aof_rewrite_scheduled = 0;
        // Start time of the most recent AOF rewrite
        server.aof_rewrite_time_start = time(NULL);
        // Child process ID
        server.aof_child_pid = childpid;
        updateDictResizePolicy();
        // Because the accumulated updates will be appended to the file, force
        // the next record to emit a SELECT command so databases cannot be
        // mixed up during the merge.
        /* We set appendseldb to -1 in order to force the next call to the
         * feedAppendOnlyFile() to issue a SELECT command, so the differences
         * accumulated by the parent into server.aof_rewrite_buf will start
         * with a SELECT statement and it will be safe to merge. */
        server.aof_selected_db = -1;
        replicationScriptCacheFlush();
        return REDIS_OK;
    }
    return REDIS_OK; /* unreached */
}
As shown above, the child performs the AOF persistence while the parent records some bookkeeping about it. Now, how does the rewrite itself work?
// Main AOF rewrite function. Called only from rewriteAppendOnlyFileBackground().
/* Write a sequence of commands able to fully rebuild the dataset into
 * "filename". Used both by REWRITEAOF and BGREWRITEAOF.
 *
 * In order to minimize the number of commands needed in the rewritten
 * log Redis uses variadic commands when possible, such as RPUSH, SADD
 * and ZADD. However at max REDIS_AOF_REWRITE_ITEMS_PER_CMD items per time
 * are inserted using a single command. */
int rewriteAppendOnlyFile(char *filename) {
    dictIterator *di = NULL;
    dictEntry *de;
    rio aof;
    FILE *fp;
    char tmpfile[256];
    int j;
    long long now = mstime();

    /* Note that we have to use a different temp name here compared to the
     * one used by rewriteAppendOnlyFileBackground() function. */
    snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
    // Open the temporary file
    fp = fopen(tmpfile,"w");
    if (!fp) {
        redisLog(REDIS_WARNING, "Opening the temp file for AOF rewrite in"
            "rewriteAppendOnlyFile(): %s", strerror(errno));
        return REDIS_ERR;
    }
    // Initialize the rio structure
    rioInitWithFile(&aof,fp);
    // If incremental fsync is configured, enable it
    if (server.aof_rewrite_incremental_fsync)
        rioSetAutoSync(&aof,REDIS_AOF_AUTOSYNC_BYTES);
    // Dump every database
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;
        // Get a safe iterator over the dataset
        di = dictGetSafeIterator(d);
        if (!di) {
            fclose(fp);
            return REDIS_ERR;
        }
        // Write the SELECT command
        /* SELECT the new DB */
        if (rioWrite(&aof,selectcmd,sizeof(selectcmd)-1) == 0) goto werr;
        // Write the database index
        if (rioWriteBulkLongLong(&aof,j) == 0) goto werr;
        // Write every entry in this database
        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {
            sds keystr;
            robj key, *o;
            long long expiretime;

            keystr = dictGetKey(de);
            o = dictGetVal(de);
            // Wrap keystr in a robj
            initStaticStringObject(key,keystr);
            // Get the expire time
            expiretime = getExpire(db,&key);
            // Skip keys that have already expired
            /* If this key is already expired skip it */
            if (expiretime != -1 && expiretime < now) continue;
            // Write the command that recreates this key-value pair
            /* Save the key and associated value */
            if (o->type == REDIS_STRING) {
                /* Emit a SET command */
                char cmd[]="*3\r\n$3\r\nSET\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                /* Key and value */
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkObject(&aof,o) == 0) goto werr;
            } else if (o->type == REDIS_LIST) {
                if (rewriteListObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_SET) {
                if (rewriteSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_ZSET) {
                if (rewriteSortedSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_HASH) {
                if (rewriteHashObject(&aof,&key,o) == 0) goto werr;
            } else {
                redisPanic("Unknown object type");
            }
            // Write the expire time
            /* Save the expire time */
            if (expiretime != -1) {
                char cmd[]="*3\r\n$9\r\nPEXPIREAT\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkLongLong(&aof,expiretime) == 0) goto werr;
            }
        }
        // Release the iterator
        dictReleaseIterator(di);
    }
    // Flush to disk
    /* Make sure data will not remain on the OS's output buffers */
    fflush(fp);
    aof_fsync(fileno(fp));
    fclose(fp);
    // Rename the temporary file
    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generate DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        redisLog(REDIS_WARNING,"Error moving temp append only file on the "
            "final destination: %s", strerror(errno));
        unlink(tmpfile);
        return REDIS_ERR;
    }
    redisLog(REDIS_NOTICE,"SYNC append only file rewrite performed");
    return REDIS_OK;

werr:
    // Clean up on write error
    fclose(fp);
    unlink(tmpfile);
    redisLog(REDIS_WARNING,"Write error writing append only file on disk: "
        "%s", strerror(errno));
    if (di) dictReleaseIterator(di);
    return REDIS_ERR;
}
As mentioned, after AOF persistence finishes, the data changes produced during it are also appended to the AOF file. If you look at the periodic handler serverCron(): after the child exits, the parent appends the changes accumulated during persistence to the AOF file. That is the job of backgroundRewriteDoneHandler(): append server.aof_rewrite_buf_blocks to the AOF file.
// After the background child exits, Redis appends the update cache
// server.aof_rewrite_buf_blocks to the AOF file. This function runs when AOF
// persistence ends; its main job is writing that AOF cache to the file.
/* A background append only file rewriting (BGREWRITEAOF) terminated its work.
 * Handle this. */
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    ......
    // Write the AOF cache server.aof_rewrite_buf_blocks to disk
    if (aofRewriteBufferWrite(newfd) == -1) {
        redisLog(REDIS_WARNING,
            "Error trying to flush the parent diff to the rewritten AOF: %s",
            strerror(errno));
        close(newfd);
        goto cleanup;
    }
    ......
}
// Flush the accumulated update cache server.aof_rewrite_buf_blocks to the
// given file descriptor
/* Write the buffer (possibly composed of multiple blocks) into the specified
 * fd. If a short write or any other error happens -1 is returned,
 * otherwise the number of bytes written is returned. */
ssize_t aofRewriteBufferWrite(int fd) {
    listNode *ln;
    listIter li;
    ssize_t count = 0;

    listRewind(server.aof_rewrite_buf_blocks,&li);
    while((ln = listNext(&li))) {
        aofrwblock *block = listNodeValue(ln);
        ssize_t nwritten;

        if (block->used) {
            nwritten = write(fd,block->buf,block->used);
            if (nwritten != block->used) {
                if (nwritten == 0) errno = EIO;
                return -1;
            }
            count += nwritten;
        }
    }
    return count;
}
2) In the append-while-serving mode, the Redis server stores every data change in server.aof_buf and writes the update cache to the configured file (server.aof_filename) at specific moments. There are three such moments, chosen by the appendfsync policy: after every write command (always), once per second (everysec), or whenever the operating system decides to flush (no).
Redis simply wants to avoid losing too much data if the server suddenly crashes. By default it performs the append-while-serving flush at a fixed interval, writing the accumulated changes to the file once per period.
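For reference, the flush policy is selected in redis.conf with the appendfsync directive; a typical fragment (the commented-out lines show the alternatives) looks like this:

```conf
# appendfsync always   # fsync after every write command: safest, slowest
appendfsync everysec   # fsync once per second: the default trade-off
# appendfsync no       # let the OS decide when to flush: fastest, least safe
```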
Below is the main code for the append-while-serving AOF path:
// Sync to disk: write everything accumulated in server.aof_buf to disk
/* Write the append only file buffer on disk.
 *
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is entering when the
 * the event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 *
 * About the 'force' argument:
 *
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;

    // Nothing buffered, nothing to sync
    if (sdslen(server.aof_buf) == 0) return;

    // Check whether a background fsync() job is already pending
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;

    // Without the force flag, the write may be postponed
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        // Possibly postpone the flush
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                // Record when we started postponing
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.unixtime;
                return;
            // Postponed for less than 2 seconds: keep waiting
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            // Otherwise, we must write to disk now
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            redisLog(REDIS_NOTICE,"Asynchronous AOF fsync is taking too long (disk"
                " is busy?). Writing the AOF buffer without waiting for fsync to "
                "complete, this may slow down Redis.");
        }
    }
    // Reset the postponed-flush timestamp
    /* If you are following this code path, then we are going to write so
     * set reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike */
    // The AOF file is already open; write everything cached in
    // server.aof_buf to it
    nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    if (nwritten != (signed)sdslen(server.aof_buf)) {
        /* Ooops, we are in troubles. The best thing to do for now is
         * aborting instead of giving the illusion that everything is
         * working as expected. */
        if (nwritten == -1) {
            redisLog(REDIS_WARNING,"Exiting on error writing to the append-only"
                " file: %s",strerror(errno));
        } else {
            redisLog(REDIS_WARNING,"Exiting on short write while writing to "
                "the append-only file: %s (nwritten=%ld, "
                "expected=%ld)",
                strerror(errno),
                (long)nwritten,
                (long)sdslen(server.aof_buf));
            if (ftruncate(server.aof_fd, server.aof_current_size) == -1) {
                redisLog(REDIS_WARNING, "Could not remove short write "
                    "from the append-only file. Redis may refuse "
                    "to load the AOF the next time it starts. "
                    "ftruncate: %s", strerror(errno));
            }
        }
        exit(1);
    }
    // Update the AOF file size
    server.aof_current_size += nwritten;

    // When server.aof_buf is small enough, reuse its space to avoid frequent
    // allocations; when it occupies a lot of memory, release it instead.
    // Redis is clearly careful about memory.
    /* Re-use AOF buffer when it is small enough. The maximum comes from the
     * arena size of 4k minus some overhead (but is otherwise arbitrary). */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    // fsync to disk if needed
    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        aof_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) aof_background_fsync(server.aof_fd);
        server.aof_last_fsync = server.unixtime;
    }
}
The "update cache" mentioned twice above is simply the data changes Redis accumulates.
The update cache can be stored in server.aof_buf or in the server.aof_rewrite_buf_blocks list. Their relationship: every data-change record is written to server.aof_buf, and if a background child is persisting at the time, the record is also written to server.aof_rewrite_buf_blocks. server.aof_buf is written to the configured file at specific moments; server.aof_rewrite_buf_blocks is appended to the file after the background persistence finishes.
In the Redis source the path is: propagate() -> feedAppendOnlyFile() -> aofRewriteBufferAppend().
Note that feedAppendOnlyFile() appends the update to server.aof_buf; then, if an AOF child exists, it calls aofRewriteBufferAppend() to copy the same record into the server.aof_rewrite_buf_blocks list. This explains why, after the AOF child finishes, the parent appends server.aof_rewrite_buf_blocks to the AOF file.
// Propagate a data update to the AOF and to the slaves
/* Propagate the specified command (in the context of the specified database id)
 * to AOF and Slaves.
 *
 * flags are an xor between:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    // If AOF is enabled and the AOF flag is set, feed the update to the
    // local file
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    // If the replication flag is set, feed the update to the slaves
    if (flags & REDIS_PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}
// Record a data update in the AOF cache
void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv,
                        int argc) {
    sds buf = sdsempty();
    robj *tmpargv[3];

    /* The DB this command was targeting is not the same as the last command
     * we appended. To issue a SELECT command is needed. */
    if (dictid != server.aof_selected_db) {
        char seldb[64];

        snprintf(seldb,sizeof(seldb),"%d",dictid);
        buf = sdscatprintf(buf,"*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long)strlen(seldb),seldb);
        server.aof_selected_db = dictid;
    }
    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand ||
        cmd->proc == expireatCommand) {
        /* Translate EXPIRE/PEXPIRE/EXPIREAT into PEXPIREAT */
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else if (cmd->proc == setexCommand || cmd->proc == psetexCommand) {
        /* Translate SETEX/PSETEX to SET and PEXPIREAT */
        tmpargv[0] = createStringObject("SET",3);
        tmpargv[1] = argv[1];
        tmpargv[2] = argv[3];
        buf = catAppendOnlyGenericCommand(buf,3,tmpargv);
        decrRefCount(tmpargv[0]);
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else {
        /* All the other commands don't need translation or need the
         * same translation already operated in the command vector
         * for the replication itself. */
        buf = catAppendOnlyGenericCommand(buf,argc,argv);
    }
    // Append the generated record to server.aof_buf; its contents will be
    // written to disk before the next trip into the event loop
    /* Append to the AOF buffer. This will be flushed on disk just before
     * of re-entering the event loop, so before the client will get a
     * positive reply about the operation performed. */
    if (server.aof_state == REDIS_AOF_ON)
        server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf));
    // If an AOF child is already running, Redis accumulates the differences
    // between the child's snapshot and the current dataset:
    // aofRewriteBufferAppend() appends buf to the
    // server.aof_rewrite_buf_blocks list
    /* If a background append only file rewriting is in progress we want to
     * accumulate the differences between the child DB and the current one
     * in a buffer, so that when the child process will do its work we
     * can append the differences to the new append only file. */
    if (server.aof_child_pid != -1)
        aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf));
    sdsfree(buf);
}
// Append a data-update record to server.aof_rewrite_buf_blocks; called only
// from feedAppendOnlyFile()
/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Append at the tail of the list
    listNode *ln = listLast(server.aof_rewrite_buf_blocks);
    aofrwblock *block = ln ? ln->value : NULL;

    while(len) {
        /* If we already got at least an allocated block, try appending
         * at least some piece into it. */
        if (block) {
            unsigned long thislen = (block->free < len) ? block->free : len;
            if (thislen) { /* The current block is not already full. */
                memcpy(block->buf+block->used, s, thislen);
                block->used += thislen;
                block->free -= thislen;
                s += thislen;
                len -= thislen;
            }
        }
        if (len) { /* First block to allocate, or need another block. */
            int numblocks;

            // Allocate a new block and insert it at the tail
            block = zmalloc(sizeof(*block));
            block->free = AOF_RW_BUF_BLOCK_SIZE;
            block->used = 0;
            listAddNodeTail(server.aof_rewrite_buf_blocks,block);

            /* Log every time we cross more 10 or 100 blocks, respectively
             * as a notice or warning. */
            numblocks = listLength(server.aof_rewrite_buf_blocks);
            if (((numblocks+1) % 10) == 0) {
                int level = ((numblocks+1) % 100) == 0 ? REDIS_WARNING :
                                                         REDIS_NOTICE;
                redisLog(level,"Background AOF buffer size: %lu MB",
                    aofRewriteBufferSize()/(1024*1024));
            }
        }
    }
}
A picture to rest your eyes on, showing how AOF persistence operates:
(Figure: AOF persistence mechanism, http://wiki.jikexueyuan.com/project/redis/images/redis19.png)
These two ways of landing data on disk are the two main threads of AOF. In other words, Redis AOF persistence has two main lines, background execution and append-while-serving; grasp those two and you have understood Redis AOF.
A question arises: both paths write a file. The background rewrite writes one AOF file and the append-while-serving path writes another, so which one wins?
The background rewrite first writes its data to "temp-rewriteaof-bg-%d.aof", where "%d" is the AOF child's pid. After the AOF child exits, "temp-rewriteaof-bg-%d.aof" is opened in append mode and the update cache in server.aof_rewrite_buf_blocks is written into it; finally, "temp-rewriteaof-bg-%d.aof" is renamed to server.aof_filename, so the previous file named server.aof_filename is deleted. That is, the file written by the append-while-serving path is discarded. Until that moment, append-while-serving keeps writing into server.aof_filename.
So two files are indeed produced, but in the end both converge on server.aof_filename. One might also ask: with background persistence available, why append while serving at all? Over time, append-while-serving accumulates redundant and stale records, which background persistence eliminates. See, this is Redis's double insurance.
The AOF data-recovery process is cleverly designed: it simulates an ordinary Redis serving session. Redis first creates a fake client, reads the AOF file, and reconstructs each command with its arguments; then, exactly as when serving a real client, it executes the corresponding command function, thereby restoring the data. The point, of course, is code reuse. This is implemented mainly in loadAppendOnlyFile().
// Load the AOF file and restore the data
/* Replay the append log file. On error REDIS_OK is returned. On non fatal
 * error (the append only file is zero-length) REDIS_ERR is returned. On
 * fatal error an error message is logged and the program exits. */
int loadAppendOnlyFile(char *filename) {
    struct redisClient *fakeClient;
    FILE *fp = fopen(filename,"r");
    struct redis_stat sb;
    int old_aof_state = server.aof_state;
    long loops = 0;

    // A zero-length file is treated as a non-fatal error
    if (fp && redis_fstat(fileno(fp),&sb) != -1 && sb.st_size == 0) {
        server.aof_current_size = 0;
        fclose(fp);
        return REDIS_ERR;
    }
    if (fp == NULL) {
        redisLog(REDIS_WARNING,"Fatal error: can't open the append log file "
            "for reading: %s",strerror(errno));
        exit(1);
    }
    // We are replaying the AOF, so temporarily disable all AOF feeding
    // to avoid confusion
    /* Temporarily disable AOF, to prevent EXEC from feeding a MULTI
     * to the same file we're about to read. */
    server.aof_state = REDIS_AOF_OFF;

    // Create a fake client, i.e. a redisClient not bound to a connection
    fakeClient = createFakeClient();
    startLoading(fp);
    while(1) {
        int argc, j;
        unsigned long len;
        robj **argv;
        char buf[128];
        sds argsds;
        struct redisCommand *cmd;

        // Every 1000 iterations, also serve connected clients;
        // aeProcessEvents() enters the event loop
        /* Serve the clients from time to time */
        if (!(loops++ % 1000)) {
            loadingProgress(ftello(fp));
            aeProcessEvents(server.el, AE_FILE_EVENTS|AE_DONT_WAIT);
        }
        // Possibly the end of the AOF file
        if (fgets(buf,sizeof(buf),fp) == NULL) {
            if (feof(fp))
                break;
            else
                goto readerr;
        }
        // Each record must start with "*"; otherwise the format is wrong
        if (buf[0] != '*') goto fmterr;
        // Number of arguments
        argc = atoi(buf+1);
        // Invalid argument count
        if (argc < 1) goto fmterr;
        // Allocate space for the arguments
        argv = zmalloc(sizeof(robj*)*argc);
        // Read the arguments one by one
        for (j = 0; j < argc; j++) {
            if (fgets(buf,sizeof(buf),fp) == NULL) goto readerr;
            if (buf[0] != '$') goto fmterr;
            len = strtol(buf+1,NULL,10);
            argsds = sdsnewlen(NULL,len);
            if (len && fread(argsds,len,1,fp) == 0) goto fmterr;
            argv[j] = createObject(REDIS_STRING,argsds);
            if (fread(buf,2,1,fp) == 0) goto fmterr; /* discard CRLF */
        }
        // Look up the command
        /* Command lookup */
        cmd = lookupCommand(argv[0]->ptr);
        if (!cmd) {
            redisLog(REDIS_WARNING,"Unknown command '%s' reading the "
                "append only file", (char*)argv[0]->ptr);
            exit(1);
        }
        // Execute the command, simulating a client request, thereby
        // writing the data back into memory
        /* Run the command in the context of a fake client */
        fakeClient->argc = argc;
        fakeClient->argv = argv;
        cmd->proc(fakeClient);

        /* The fake client should not have a reply */
        redisAssert(fakeClient->bufpos == 0 && listLength(fakeClient->reply)
                    == 0);
        /* The fake client should never get blocked */
        redisAssert((fakeClient->flags & REDIS_BLOCKED) == 0);

        // Free the argument vector
        /* Clean up. Command code may have changed argv/argc so we use the
         * argv/argc of the client instead of the local variables. */
        for (j = 0; j < fakeClient->argc; j++)
            decrRefCount(fakeClient->argv[j]);
        zfree(fakeClient->argv);
    }

    /* This point can only be reached when EOF is reached without errors.
     * If the client is in the middle of a MULTI/EXEC, log error and quit. */
    if (fakeClient->flags & REDIS_MULTI) goto readerr;

    // Clean up
    fclose(fp);
    freeFakeClient(fakeClient);
    // Restore the previous AOF state
    server.aof_state = old_aof_state;
    stopLoading();
    // Record the current AOF file size
    aofUpdateCurrentSize();
    server.aof_rewrite_base_size = server.aof_current_size;
    return REDIS_OK;

readerr:
    // Read error: log and exit
    if (feof(fp)) {
        redisLog(REDIS_WARNING,"Unexpected end of file reading the append "
            "only file");
    } else {
        redisLog(REDIS_WARNING,"Unrecoverable error reading the append only "
            "file: %s", strerror(errno));
    }
    exit(1);

fmterr:
    redisLog(REDIS_WARNING,"Bad file format reading the append only file: "
        "make a backup of your AOF file, then use ./redis-check-aof --fix "
        "<filename>");
    exit(1);
}
If you care about your data and every second counts, use AOF persistence; an AOF file is also easy to analyze.