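Just as mget allows us to retrieve multiple documents at once, the bulk API allows us to perform multiple create, index, update, or delete requests in a single request. This is particularly useful when indexing a data stream such as log events, which can be queued up and indexed in order, in batches of hundreds or thousands.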

The bulk request body has the following, slightly unusual, format:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

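This format is like a stream of one-line JSON documents joined together by newline ("\n") characters. Two important points to note:

Every line must end with a newline character ("\n"), including the last line. The newlines act as markers that allow the lines to be separated efficiently.
The lines cannot contain unescaped newline characters, because these would interfere with parsing. This means that the JSON must not be pretty-printed.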
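Both rules can be satisfied mechanically when a request body is built in code. The following is a minimal sketch in Python, assuming the action and body lines are held as dictionaries; the helper name `to_bulk_body` is hypothetical and not part of any Elasticsearch client.

```python
import json


def to_bulk_body(lines):
    """Serialize dicts into a newline-delimited bulk request body.

    Hypothetical helper for illustration only.
    """
    # json.dumps without `indent` emits no raw newlines: a newline inside
    # a string value is escaped as \n, so every dict stays on one line.
    # Appending "\n" to each line also terminates the last line.
    return "".join(json.dumps(line) + "\n" for line in lines)


body = to_bulk_body([
    {"index": {"_index": "website", "_type": "blog"}},
    {"title": "Line one\nline two"},  # embedded newline gets escaped
])
```

Because the serializer is never asked to indent, the output cannot be pretty-printed by accident, and the trailing newline is added as part of every line rather than as a separate step that could be forgotten.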

Tip:

The chapter "Bulk Format" explains why the bulk API uses this format.

The action/metadata line specifies what action to perform on which document.

The action must be one of the following:

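Action	Explanation
`create`	Create a document only if it does not already exist. See "Creating a Document".
`index`	Create a new document or replace an existing document. See "Indexing a Document" and "Updating a Document".
`update`	Do a partial update on a document. See "Partial Updates".
`delete`	Delete a document. See "Deleting a Document".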

The metadata must specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.

For instance, a delete request could look like this:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

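The request body line consists of the document _source itself: the fields and values that the document contains. It is required for index and create operations, which makes sense: you must supply the document to index.

It is also required for update operations, where it should consist of the same request body that you would pass to the update API (doc, upsert, script, and so forth). No request body line is required for a delete.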

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

If no _id is specified, an ID will be generated automatically:

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }

To put it all together, a complete bulk request has this form:

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} <1>
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} } <2>

<1> Notice how the `delete` **action** does not have a request body; it is followed directly by another **action**.
<2> Remember the final newline character.
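The pairing rule in the request above (every action line except delete is followed by exactly one body line) can be sketched as a small parser. This is an illustrative model in Python, not how Elasticsearch itself reads the stream; `parse_bulk` and `ACTIONS_WITH_BODY` are hypothetical names.

```python
import json

# Actions that must be followed by a request body line.
ACTIONS_WITH_BODY = {"create", "index", "update"}


def parse_bulk(body):
    """Yield (action, metadata, source) tuples from a bulk request body."""
    if not body.endswith("\n"):
        raise ValueError("bulk body must end with a newline")
    # Drop the empty string that follows the final newline.
    lines = iter(body.split("\n")[:-1])
    for line in lines:
        # Each action line is a single-key object: {action: {metadata}}.
        (action, metadata), = json.loads(line).items()
        source = None
        if action in ACTIONS_WITH_BODY:
            body_line = next(lines, None)
            if body_line is None:
                raise ValueError(f"missing request body for {action}")
            source = json.loads(body_line)
        elif action != "delete":
            raise ValueError(f"unknown action: {action}")
        yield action, metadata, source
```

Feeding it the first three lines of the request above yields a delete with no source, followed by a create whose source is the title document.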

The Elasticsearch response contains the items array, which lists the result of each request, in the same order as we requested them:

{
   "took": 4,
   "errors": false, <1>
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
}

<1> All subrequests completed successfully.

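Each subrequest is executed independently, so the failure of one subrequest does not affect the others. If any of the requests fail, the top-level errors flag is set to true and the error details are reported under the relevant request: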

POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }

In the response, we can see that it failed to create document 123 because it already exists, but the subsequent index request, also on document 123, succeeded:

{
   "took": 3,
   "errors": true, <1>
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409, <2>
            "error":    "DocumentAlreadyExistsException <3>
                        [[website][4] [blog][123]:
                        document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200 <4>
      }}
   ]
}

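<1> One or more requests failed.
<2> The HTTP status code for this request is reported as `409 CONFLICT`.
<3> The error message explains why the request failed.
<4> The second request succeeded, with an HTTP status code of `200 OK`.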

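This also means that bulk requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not interfere with the others.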
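Because a bulk request can partly succeed, client code usually has to walk the items array rather than rely on the overall HTTP result. A minimal sketch in Python, assuming the response has already been decoded into a dictionary shaped like the ones shown above; `failed_items` is a hypothetical helper, and treating any status of 400 or higher as a failure is an assumption of this sketch.

```python
def failed_items(response):
    """Return (position, action, status, error) for each failed subrequest."""
    # The top-level flag lets us skip the scan when everything succeeded.
    if not response.get("errors"):
        return []
    failures = []
    for position, item in enumerate(response["items"]):
        # Each item is a single-key object: {action: {result}}.
        (action, result), = item.items()
        if result.get("status", 0) >= 400:  # assumption: 4xx/5xx means failure
            failures.append(
                (position, action, result["status"], result.get("error"))
            )
    return failures
```

Since items are returned in request order, the reported position identifies which action line of the original request would need to be retried or corrected.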

Don't Repeat Yourself

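Perhaps you are batch-indexing log data into the same index, and with the same type. Having to specify the same metadata for every document is redundant. Just as for the mget API, the bulk request accepts a default /_index or /_index/_type in the URL: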

POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }

You can still override the _index and _type in the metadata line; where they are not overridden, the values in the URL are used as defaults:

POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
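The defaulting rule can be modeled as a dictionary merge in which the metadata line wins over the URL. This Python sketch only illustrates the rule as described here and is not Elasticsearch's implementation; `resolve_metadata` is a hypothetical name.

```python
def resolve_metadata(url_index, url_type, metadata):
    """Merge URL defaults with a metadata line; the line's values win."""
    resolved = {"_index": url_index, "_type": url_type}
    resolved.update(metadata)  # explicit values override URL defaults
    # Drop defaults that the URL did not supply (passed as None).
    return {key: value for key, value in resolved.items() if value is not None}


# POST /website/log/_bulk with an empty metadata line:
first = resolve_metadata("website", "log", {})
# Same URL, but the metadata line overrides the type:
second = resolve_metadata("website", "log", {"_type": "blog"})
```

Here `first` resolves to index website and type log, while `second` keeps the index from the URL and takes the type blog from the metadata line.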

How Big Is Too Big?

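The entire bulk request needs to be loaded into memory by the node that receives it, so the bigger the request, the less memory is available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off.

The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load. Fortunately, this sweet spot is easy to find:

Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1kB documents is very different from one thousand 1MB documents. A good bulk size to aim for is around 5-15MB.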
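One way to respect both limits at once is to close a batch as soon as either the document count or the byte size would be exceeded. The sketch below is a starting point in Python under stated assumptions: the defaults of 1,000 documents and 5MB are simply taken from the low end of the ranges above, the size estimate counts only the body lines (not the action/metadata lines), and `batches` is a hypothetical helper.

```python
import json


def batches(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of docs, closing a batch when either limit would be exceeded."""
    batch, size = [], 0
    for doc in docs:
        # Approximate the on-the-wire size of the body line, plus its newline.
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or size + doc_bytes > max_bytes):
            yield batch
            batch, size = [], 0
        # A single oversized document still gets a batch of its own.
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch
```

Timing each batch while raising `max_docs` or `max_bytes` step by step is one way to carry out the experiment described above and find where throughput stops improving.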
