OpenStack Object Storage (Swift) is open source software for building a redundant, scalable object storage engine. By reading Swift's technical documentation we can understand the principles behind its design and how they are implemented.
The Swift project has been under way for two years now and open to the public for more than a year. Plenty of help is available from the community abroad, but in China only scattered and incomplete material can be found, and many people would rather enjoy the results than take part. I started working with Swift at the end of September; at first the documentation was hard to follow, and I only found my way in thanks to the blogs of zzcase and others. I very much agree with what Zheng Ye wrote in the preface to a certain book: "Translation has always been a thankless task." I share this work in the spirit of sharing knowledge and improving together. As my understanding of Swift's design and source code has deepened, the document has gone through several rounds of revision; I hope it is helpful to anyone studying Swift. My ability is limited, so if you find mistakes, please point them out. The parts of the document marked in red still need more thought, and suggestions and ideas of all kinds are welcome.
Please credit the translator and the source when reposting. Thank you!
Original link: http://www.cnblogs.com/yuxc/archive/2011/12/06/2278303.html
1. Swift Architectural Overview
1.1 Proxy Server
The Proxy Server is responsible for tying together the rest of the Swift architecture. For each client request, it looks up the location of the account, container, or object in the ring and routes the request accordingly. The public API is also exposed through the Proxy Server.
A large number of failures are also handled by the Proxy Server. For example, if a server is unavailable for an object PUT, it will ask the ring for a handoff server and route the request there instead.
Objects are streamed to or from the object servers directly to or from the user; the Proxy Server does not spool them.
1.2 The Ring
A ring represents a mapping between the names of entities stored on disk and their physical locations. There are separate rings for accounts, containers, and objects. When other components of Swift (replication, for example) need to operate on an account, container, or object, they query the appropriate ring to determine its location in the cluster.
The ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the ring is (by default) replicated three times across the cluster, and the locations of a partition are stored in the mapping maintained by the ring. The ring is also responsible for determining which devices are used for handoff when a request forwarded by the proxy server fails.
The ring uses the concept of zones to ensure data isolation. Each replica of a partition is guaranteed to reside in a different zone. A zone could be a drive, a server, a cabinet, a switch, or even a data center.
The partitions of the ring are distributed evenly among all of the devices in the Swift installation. When partitions need to be moved around (for example, when a new device is added to the cluster), the ring ensures that a minimum number of partitions are moved at a time, and only one replica of a partition is moved at a time.
Weights can be used to balance the distribution of partitions on drives across the cluster. This can be useful, for example, when drives of different sizes are used in a cluster.
The ring is used by the proxy server and several background processes (such as replication).
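To see what a ring lookup looks like in practice, the sketch below uses swift.common.ring.Ring to resolve a path to its partition and primary devices. It assumes a Swift installation with an object ring at /etc/swift/object.ring.gz, and the account, container, and object names are placeholders; the exact constructor arguments can vary a little between Swift versions.
from swift.common.ring import Ring

# Load the serialized object ring that the ring-builder shipped out.
object_ring = Ring('/etc/swift/object.ring.gz')

# get_nodes() returns the partition number and one primary device per replica.
partition, nodes = object_ring.get_nodes('AUTH_test', 'photos', 'puppy.jpg')
for node in nodes:
    print(node['zone'], node['ip'], node['port'], node['device'])

# get_more_nodes() yields handoff devices to try when a primary is unavailable.
handoffs = list(object_ring.get_more_nodes(partition))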
1.3 Object Server
The Object Server is a very simple blob storage server that can store, retrieve, and delete objects stored on local devices. Objects are stored as binary files on the filesystem, with metadata stored in the file's extended attributes (xattrs). This requires that the underlying filesystem used for the object servers support xattrs on files. Some filesystems, such as ext3, have xattrs turned off by default.
Each object is stored using a path derived from the hash of the object name and the operation's timestamp. The last write always wins, ensuring that the latest object version will be served. A deletion is also treated as a version of the file: a zero-byte file ending in ".ts", which stands for tombstone. This ensures that deleted files are replicated correctly and older versions don't magically reappear due to failure scenarios.
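As a rough illustration of that layout, the sketch below derives an object directory from a hash of the object name and shows the timestamped .data/.ts naming. The hash suffix value, the timestamp format, and the directory layout details are simplified assumptions, not the exact swift.obj implementation.
import os
from hashlib import md5
from time import time

HASH_PATH_SUFFIX = b'changeme'          # assumed per-cluster suffix

def object_dir(device_path, partition, account, container, obj):
    name = ('/%s/%s/%s' % (account, container, obj)).encode('utf-8')
    name_hash = md5(name + HASH_PATH_SUFFIX).hexdigest()
    suffix = name_hash[-3:]             # objects are grouped by hash suffix
    return os.path.join(device_path, 'objects', str(partition), suffix, name_hash)

# A PUT writes a <timestamp>.data file and a DELETE writes a zero-byte
# <timestamp>.ts tombstone; the file with the newest timestamp wins.
print(object_dir('/srv/node/sdb1', 1234, 'AUTH_test', 'photos', 'puppy.jpg'))
print('%.5f.data' % time(), '%.5f.ts' % time())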
1.4 Container Server
The Container Server's primary job is to handle listings of objects. It doesn't know where those objects are, just which objects are in a specific container. The listings are stored as sqlite database files, and replicated across the cluster similarly to the way objects are. The container server also tracks statistics such as the total number of objects and the total storage usage of the container.
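Conceptually, a container listing is just a query against that sqlite file. The sketch below is only meant to show the idea; the database path and the object table schema used here (name, size, content_type, etag, deleted) are assumptions for illustration, not an exact description of the on-disk layout.
import sqlite3

db_path = '/srv/node/sdb1/containers/.../container.db'   # placeholder path
conn = sqlite3.connect(db_path)
for name, size, content_type, etag in conn.execute(
        "SELECT name, size, content_type, etag FROM object "
        "WHERE deleted = 0 ORDER BY name LIMIT 100"):
    print(name, size, content_type, etag)
conn.close()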
1.5 Account Server
The Account Server is very similar to the Container Server, except that it is responsible for listings of containers rather than objects.
1.6 Replication
Replication is designed to keep the system in a consistent state in the face of temporary error conditions like network outages or drive failures.
The replication processes compare local data with each remote copy to ensure they all contain the latest version. Object replication uses a hash list to quickly compare subsections of each partition, while container and account replication use a combination of hashes and shared high water marks to compare versions.
Replication updates are push based. For object replication, updating is just a matter of rsyncing files to the peer. Account and container replication push missing records over HTTP or rsync whole database files.
The replicator also ensures that data is removed from the system. When an item (an object, container, or account) is deleted, a tombstone is set as the latest version of the item. The replicator will see the tombstone and ensure that the item is removed from the entire system.
1.7 Updaters
There are times when container or account data cannot be updated immediately. This usually occurs during failure scenarios or periods of high load. If an update fails, it is queued locally on the filesystem and an updater process will keep retrying the failed updates. This is where an eventual consistency window comes into play. For example, suppose a container server is under load and a new object is put into the system. The object will be immediately available for reads as soon as the proxy server responds to the client with success. However, the container server did not update the object listing, so the update is queued for later. Container listings, therefore, may not immediately contain the new object.
In practice, the consistency window is only as large as the frequency at which the updater runs, and it may not even be noticed, since the proxy server will route listing requests to the first container server that responds. The server under load may not be the one that serves subsequent listing requests; one of the other two replicas may handle them instead.
1.8 Auditors
Auditors crawl the local server repeatedly, checking the integrity of objects, containers, and accounts. If corruption is found (in the case of bit rot, for example), the file is quarantined and replication will replace the bad file from another replica. If other errors are found (for example, an object listing that can't be found on any container server), they are logged.
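The core of the object audit is easy to picture: re-read the object's bytes, recompute the MD5, and compare it against the ETag recorded in the object's metadata. The sketch below shows only that comparison; reading the ETag from xattrs and the actual quarantine step are left out and assumed to be handled elsewhere.
from hashlib import md5

def audit_object(data_path, expected_etag):
    hasher = md5()
    with open(data_path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(65536), b''):
            hasher.update(chunk)
    if hasher.hexdigest() != expected_etag:
        # The real auditor moves the object into a quarantine area so that
        # replication can restore a clean copy from another replica.
        return False
    return True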
2. The Rings
The rings determine where data lives in the cluster. There are separate rings for account databases, container databases, and individual objects, but each ring works in the same way. The rings are externally managed: the server processes themselves never modify the rings, they are instead given new rings modified by other tools.
The ring uses a configurable number of bits from a path's MD5 hash as a partition index that designates a device. The number of bits kept from the hash is known as the partition power, and 2 raised to the partition power gives the partition count. Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches of items at once, which ends up either more efficient or at least less complex than working with each item separately or with the entire cluster all at once.
Another configurable value is the replica count, which indicates how many of the partition-to-device assignments make up a single ring. For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical location, power separation, network separation, or any other attribute that reduces the chance of multiple replicas becoming unavailable at the same time.
2.1 Ring Builder
The rings are built and managed manually by a utility called the ring-builder. The ring-builder assigns partitions to devices and writes an optimized Python structure to a gzipped, pickled file on disk for shipping out to the servers. The server processes just check the modification time of the file occasionally and reload their in-memory copy of the ring structure as needed. Because of how the ring-builder manages changes to the ring, using a slightly older ring usually just means that one of the three replicas for a small subset of the partitions will be incorrect, which is easy to work around.
The ring-builder also keeps its own builder file with the ring information plus additional data required to build future rings. It is very important to keep multiple backup copies of these builder files. One option is to copy the builder files out to every server along with the ring files themselves; another is to upload the builder files into the cluster itself. Losing a builder file means having to create a new ring from scratch: nearly all partitions would end up assigned to different devices, and therefore nearly all of the stored data would have to be replicated to new locations. So, recovery from a lost or corrupted builder file is possible, but data would be unreachable for an extended period of time.
2.2 Ring Data Structure
The ring data structure consists of three top-level fields: a list of the devices in the cluster, a list of lists of device ids indicating the partition-to-device assignments, and an integer giving the number of bits by which to shift an MD5 hash to calculate the partition for that hash.
2.2.1 List of Devices
The list of devices is known internally to the Ring class as devs. Each item in the list of devices is a dictionary with the following keys:
id  integer  The index of the device in the list of devices.
zone  integer  The zone the device resides in.
weight  float  The relative weight of the device compared to the other devices. This usually corresponds directly to the amount of disk space the device has relative to the other devices. For example, a device with 1 terabyte of space might have a weight of 100.0 and a device with 2 terabytes a weight of 200.0. The weight can also be used to bring back into balance a device that has ended up with more or less data than desired. A good average weight of 100.0 leaves room to lower the weight later if necessary.
ip  string  The IP address of the server containing the device.
port  int  The TCP port the server process listens on to serve requests for the device.
device  string  The on-disk name of the device on the server, for example sdb1.
meta  string  A general-use field for storing additional information about the device. This information is not used directly by the server processes, but can be useful in debugging. For example, the date and time of installation and the hardware manufacturer could be stored here.
Note: The list of devices may contain holes, i.e. indexes set to None, for devices that have been removed from the cluster. Generally, device ids are not reused. Some devices may also be temporarily disabled by setting their weight to 0.0. To obtain a list of active devices (for uptime polling, for example), the Python code would look like: devices = [device for device in self.devs if device and device['weight']]
2.2.2 Partition Assignment List
This is a list of array('I')s of device ids, one array('I') per replica. Each array('I') has a length equal to the partition count of the ring, and each integer in the array('I') is an index into the device list described above. The partition list is known internally to the Ring class as _replica2part2dev_id.
So, to create a list of the device dictionaries assigned to a partition, the Python code would look like: devices = [self.devs[part2dev_id[partition]] for part2dev_id in self._replica2part2dev_id]
array('I') is used for memory conservation, as there may be millions of partitions.
2.2.3 Partition Shift Value
The partition shift value is known internally to the Ring class as _part_shift. This value is used to shift the MD5 hash of an item's path to calculate the partition on which the data for that item should reside. Only the top four bytes of the hash are used in this process. For example, to compute the partition for the path /account/container/object, the Python code might look like:
partition = unpack_from('>I', md5('/account/container/object').digest())[0] >> self._part_shift
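For completeness, here is a self-contained version of that computation with the required imports, plus a toy device table and assignment list so the partition can be mapped to one device per replica. The partition power, the three devices, and the round-robin assignment are illustrative assumptions, not real ring contents.
from hashlib import md5
from struct import unpack_from

# Toy parameters so the structures stay small; a real ring would use a much
# larger partition power (for example 18 or 20).
part_power = 4
part_shift = 32 - part_power
replicas = 3

devs = [{'id': 0, 'zone': 1, 'device': 'sdb1'},
        {'id': 1, 'zone': 2, 'device': 'sdb1'},
        {'id': 2, 'zone': 3, 'device': 'sdb1'}]
replica2part2dev_id = [[(part + r) % len(devs) for part in range(2 ** part_power)]
                       for r in range(replicas)]

digest = md5(b'/account/container/object').digest()
partition = unpack_from('>I', digest)[0] >> part_shift
nodes = [devs[part2dev_id[partition]] for part2dev_id in replica2part2dev_id]
print(partition, [node['id'] for node in nodes])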
2.3 Building the Ring
The initial building of the ring first calculates the number of partitions that should ideally be assigned to each device, based on the device's weight. For example, given a partition power of 20, the ring has 1,048,576 partitions. If there are 1,000 devices of equal weight, each will desire 1,048.576 partitions. The devices are then sorted by the number of partitions they desire and kept in order throughout the initialization process.
Then, the ring builder assigns each replica of each partition to the device that desires the most partitions at that point, with the restriction that the device not be in the same zone as any other replica of that partition. Once assigned, the device's desired partition count is decremented, the device is moved to its new sorted location in the list of devices, and the process continues.
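The following sketch condenses the initial assignment just described: each device wants partitions in proportion to its weight, and replicas of the same partition must land in different zones. It ignores the sorting, tie-breaking, min_part_hours, and rebalance logic of the real RingBuilder, and assumes the devices span at least as many zones as there are replicas.
def assign(devices, part_power, replicas=3):
    parts = 2 ** part_power
    total_weight = float(sum(d['weight'] for d in devices))
    for d in devices:
        # each device wants a share of (parts * replicas) proportional to weight
        d['parts_wanted'] = parts * replicas * d['weight'] / total_weight
    assignment = [[None] * parts for _ in range(replicas)]
    for part in range(parts):
        used_zones = set()
        for replica in range(replicas):
            # only devices in zones not already holding a replica of this partition
            candidates = [d for d in devices if d['zone'] not in used_zones]
            dev = max(candidates, key=lambda d: d['parts_wanted'])
            assignment[replica][part] = dev['id']
            dev['parts_wanted'] -= 1
            used_zones.add(dev['zone'])
    return assignment

devices = [{'id': i, 'zone': i % 4 + 1, 'weight': 100.0} for i in range(8)]
assignment = assign(devices, part_power=8)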
When building a new ring based on an old ring, the desired number of partitions for each device is recalculated. Next, the partitions that need to be reassigned are gathered up: any partitions on devices that have been removed are unassigned and added to the gathered list, and any device that now has more partitions than it desires has random partitions unassigned from it and added to the gathered list. Finally, the gathered partitions are reassigned to devices using a method similar to the initial assignment described above.
Whenever a partition replica is reassigned, the time of the reassignment is recorded. This is taken into account when gathering partitions for reassignment, so that no partition is moved twice within a configurable amount of time, known internally to the RingBuilder class as min_part_hours. This restriction is ignored for replicas of partitions that were on removed devices, since a device is only removed on failure and there is no choice but to reassign.
Because of the random nature of gathering partitions for reassignment, the process above does not always perfectly rebalance a ring. To help reach a more balanced ring, the rebalance process is repeated until the balance is nearly perfect (less than 1% off) or until the improvement in balance drops below 1% (indicating that a perfect balance probably cannot be reached due to wildly unbalanced zones or too many recently moved partitions).
2.4 History
The ring code went through many iterations before arriving at its current form, and while it has been stable for a while now, the algorithm may be revisited or even fundamentally changed if new ideas emerge. This section describes the ideas that were tried previously and explains why they were discarded.
A “live ring” option was considered where each server could maintain its own copy of the ring and the servers would use a gossip protocol to communicate the changes they made. This was discarded as too complex and error prone to code correctly in the project time span available. One bug could easily gossip bad data out to the entire cluster and be difficult to recover from. Having an externally managed ring simplifies the process, allows full validation of data before it's shipped out to the servers, and guarantees each server is using a ring from the same timeline. It also means that the servers themselves aren't spending a lot of resources maintaining rings.
A couple of “ring server” options were considered. One was where all ring lookups would be done by calling a service on a separate server or set of servers, but this was discarded due to the latency involved. Another was much like the current process but where servers could submit change requests to the ring server to have a new ring built and shipped back out to the servers. This was discarded due to project time constraints and because ring changes are currently infrequent enough that manual control was sufficient. However, lack of quick automatic ring changes did mean that other parts of the system had to be coded to handle devices being unavailable for a period of hours until someone could manually update the ring.
The current ring process has each replica of a partition independently assigned to a device. A version of the ring that used a third of the memory was tried, where the first replica of a partition was directly assigned and the other two were determined by “walking” the ring until finding additional devices in other zones. This was discarded as control was lost as to how many replicas for a given partition moved at once. Keeping each replica independent allows for moving only one partition replica within a given time window (except due to device failures). Using the additional memory was deemed a good tradeoff for moving data around the cluster much less often.
Another ring design was tried where the partition to device assignments weren't stored in a big list in memory but instead each device was assigned a set of hashes, or anchors. The partition would be determined from the data item's hash and the nearest device anchors would determine where the replicas should be stored. However, to get reasonable distribution of data each device had to have a lot of anchors and walking through those anchors to find replicas started to add up. In the end, the memory savings wasn't that great and more processing power was used, so the idea was discarded.
A completely non-partitioned ring was also tried but discarded as the partitioning helps many other parts of the system, especially replication. Replication can be attempted and retried in a partition batch with the other replicas rather than each data item independently attempted and retried. Hashes of directory structures can be calculated and compared with other replicas to reduce directory walking and network traffic.
Partitioning and independently assigning partition replicas also allowed for the best balanced cluster. The best of the other strategies tended to give +-10% variance on device balance with devices of equal weight and +-15% with devices of varying weights. The current strategy allows us to get +-3% and +-8% respectively.
Various hashing algorithms were tried. SHA offers better security, but the ring doesn't need to be cryptographically secure and SHA is slower. Murmur was much faster, but MD5 was built-in and hash computation is a small percentage of the overall request handling time. In all, once it was decided the servers wouldn't be maintaining the rings themselves anyway and only doing hash lookups, MD5 was chosen for its general availability, good distribution, and adequate speed.
3. The Account Reaper
The Account Reaper removes data from deleted accounts in the background.
An account is marked for deletion by a reseller through the services server's remove_storage_account XMLRPC call. This simply puts the value DELETED into the status column of the account_stat table in the account database (and replicas), indicating the data for the account should be deleted later. There is no set retention time and no undelete; it is assumed the reseller will implement such features and only call remove_storage_account once it is truly desired the account's data be removed.
The account reaper runs on each account server and scans the server occasionally for account databases marked for deletion. It will only trigger on accounts that server is the primary node for, so that multiple account servers aren't all trying to do the same work at the same time. Using multiple servers to delete one account might improve deletion speed, but requires coordination so they aren't duplicating effort. Speed really isn't as much of a concern with data deletion and large accounts aren't deleted that often.
The deletion process for an account itself is pretty straightforward. For each container in the account, each object is deleted and then the container is deleted. Any deletion requests that fail won't stop the overall process, but will cause the overall process to fail eventually (for example, if an object delete times out, the container won't be able to be deleted later and therefore the account won't be deleted either). The overall process continues even on a failure so that it doesn't get hung up reclaiming cluster space because of one troublesome spot. The account reaper will keep trying to delete an account until it eventually becomes empty, at which point the database reclaim process within the db_replicator will eventually remove the database files.
3.1 History
At first, a simple approach of deleting an account through completely external calls was considered as it required no changes to the system. All data would simply be deleted in the same way the actual user would, through the public ReST API. However, the downside was that it would use proxy resources and log everything when it didn't really need to. Also, it would likely need a dedicated server or two, just for issuing the delete requests.
A completely bottom-up approach was also considered, where the object and container servers would occasionally scan the data they held and check if the account was deleted, removing the data if so. The upside was the speed of reclamation with no impact on the proxies or logging, but the downside was that nearly 100% of the scanning would result in no action, creating a lot of I/O load for no reason.
A more container server centric approach was also considered, where the account server would mark all the containers for deletion and the container servers would delete the objects in each container and then themselves. This has the benefit of still speedy reclamation for accounts with a lot of containers, but has the downside of a pretty big load spike. The process could be slowed down to alleviate the load spike possibility, but then the benefit of speedy reclamation is lost and what's left is just a more complex process. Also, scanning all the containers for those marked for deletion when the majority wouldn't be seemed wasteful. The db_replicator could do this work while performing its replication scan, but it would have to spawn and track deletion processes which seemed needlessly complex.
In the end, an account server centric approach seemed best, as described above.
4. The Auth System
4.1 TempAuth
The auth system for Swift is loosely based on the auth system from the existing Rackspace architecture – actually from a few existing auth systems – and is therefore a bit disjointed. The distilled points about it are:
1. The authentication/authorization part can be an external system or a subsystem run within Swift as WSGI middleware.
2. The user of Swift passes in an auth token with each request.
3. Swift validates each token with the external auth system or auth subsystem and caches the result.
4. The token does not change from request to request, but does expire.
The token can be passed into Swift using the X-Auth-Token or the X-Storage-Token header. Both have the same format: just a simple string representing the token. Some auth systems use UUID tokens, some an MD5 hash of something unique, some use “something else” but the salient point is that the token is a string which can be sent as-is back to the auth system for validation.
Swift will make calls to the auth system, giving the auth token to be validated. For a valid token, the auth system responds with an overall expiration in seconds from now. Swift will cache the token up to the expiration time.
The included TempAuth also has the concept of admin and non-admin users within an account. Admin users can do anything within the account. Non-admin users can only perform operations per container based on the container's X-Container-Read and X-Container-Write ACLs. For more information on ACLs, see swift.common.middleware.acl.
Additionally, if the auth system sets the request environ's swift_owner key to True, the proxy will return additional header information in some requests, such as the X-Container-Sync-Key for a container GET or HEAD.
The user starts a session by sending a ReST request to the auth system to receive the auth token and a URL to the Swift system.
4.2 Extending Auth
TempAuth is written as wsgi middleware, so implementing your own auth is as easy as writing new wsgi middleware, and plugging it in to the proxy server. The KeyStone project and the Swauth project are examples of additional auth services.
Also, see Auth Server and Middleware.
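To make the plug-in point concrete, here is a bare-bones sketch of what an auth-style WSGI filter can look like. The hard-coded token table is purely a stand-in assumption; a real system would validate tokens against an auth service, cache the result, and set swift_owner for admin users as described above.
class StubAuth(object):
    def __init__(self, app, conf):
        self.app = app
        # stand-in token store; real systems validate against an auth service
        self.valid_tokens = {'AUTH_tk_example': 'AUTH_test'}

    def __call__(self, environ, start_response):
        token = environ.get('HTTP_X_AUTH_TOKEN') or environ.get('HTTP_X_STORAGE_TOKEN')
        account = self.valid_tokens.get(token)
        if not account:
            start_response('401 Unauthorized', [('Content-Length', '0')])
            return []
        environ['REMOTE_USER'] = account
        return self.app(environ, start_response)

def filter_factory(global_conf, **local_conf):
    def auth_filter(app):
        return StubAuth(app, local_conf)
    return auth_filter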
5. Replication
Since each replica in swift functions independently, and clients generally require only a simple majority of nodes responding to consider an operation successful, transient failures like network partitions can quickly cause replicas to diverge. These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes. The replicator processes traverse their local filesystems, concurrently performing operations in a manner that balances load across physical disks.
Replication uses a push model, with records and files generally only being copied from local to remote replicas. This is important because data on the node may not belong there (as in the case of handoffs and ring changes), and a replicator can't know what data exists elsewhere in the cluster that it should pull in. It's the duty of any node that contains data to ensure that data gets to where it belongs. Replica placement is handled by the ring.
Every deleted record or file in the system is marked by a tombstone, so that deletions can be replicated alongside creations. These tombstones are cleaned up by the replication process after a period of time referred to as the consistency window, which is related to replication duration and how long transient failures can remove a node from the cluster. Tombstone cleanup must be tied to replication to reach replica convergence.
If a replicator detects that a remote drive has failed, it will use the ring's “get_more_nodes” interface to choose an alternate node to synchronize with. The replicator can generally maintain desired levels of replication in the face of hardware failures, though some replicas may not be in an immediately usable location.
Replication is an area of active development, and likely rife with potential improvements to speed and correctness.
There are two major classes of replicator - the db replicator, which replicates accounts and containers, and the object replicator, which replicates object data.
5.1 DB Replication
The first step performed by db replication is a low-cost hash comparison to find out whether or not two replicas already match. Under normal operation, this check is able to verify that most databases in the system are already synchronized very quickly. If the hashes differ, the replicator brings the databases in sync by sharing records added since the last sync point.
This sync point is a high water mark noting the last record at which two databases were known to be in sync, and is stored in each database as a tuple of the remote database id and record id. Database ids are unique amongst all replicas of the database, and record ids are monotonically increasing integers. After all new records have been pushed to the remote database, the entire sync table of the local database is pushed, so the remote database knows it's now in sync with everyone the local database has previously synchronized with.
If a replica is found to be missing entirely, the whole local database file is transmitted to the peer using rsync(1) and vested with a new unique id.
In practice, DB replication can process hundreds of databases per concurrency setting per second (up to the number of available CPUs or disks) and is bound by the number of DB transactions that must be performed.
5.2 Object Replication
The initial implementation of object replication simply performed an rsync to push data from a local partition to all remote servers it was expected to exist on. While this performed adequately at small scale, replication times skyrocketed once directory structures could no longer be held in RAM. We now use a modification of this scheme in which a hash of the contents for each suffix directory is saved to a per-partition hashes file. The hash for a suffix directory is invalidated when the contents of that suffix directory are modified.
The object replication process reads in these hash files, calculating any invalidated hashes. It then transmits the hashes to each remote server that should hold the partition, and only suffix directories with differing hashes on the remote server are rsynced. After pushing files to the remote server, the replication process notifies it to recalculate hashes for the rsynced suffix directories.
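The sketch below captures the per-suffix hashing idea: hash the file names under each suffix directory of a partition, then rsync only the suffixes whose hashes differ from the remote side. The real replicator caches these hashes in a per-partition hashes.pkl file and invalidates entries on writes; that caching is omitted here, and the partition/suffix/object-hash directory layout is an assumption for illustration.
import os
from hashlib import md5

def suffix_hashes(partition_path):
    hashes = {}
    for suffix in os.listdir(partition_path):
        suffix_path = os.path.join(partition_path, suffix)
        if not os.path.isdir(suffix_path):
            continue
        hasher = md5()
        for object_hash in sorted(os.listdir(suffix_path)):
            object_dir = os.path.join(suffix_path, object_hash)
            if not os.path.isdir(object_dir):
                continue
            for filename in sorted(os.listdir(object_dir)):
                hasher.update(filename.encode('utf-8'))
        hashes[suffix] = hasher.hexdigest()
    return hashes

def suffixes_to_sync(local_hashes, remote_hashes):
    # only suffix directories whose hashes differ need to be rsynced
    return [suffix for suffix, digest in local_hashes.items()
            if remote_hashes.get(suffix) != digest]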
Performance of object replication is generally bound by the number of uncached directories it has to traverse, usually as a result of invalidated suffix directory hashes. Using write volume and partition counts from our running systems, it was designed so that around 2% of the hash space on a normal node will be invalidated per day, which has experimentally given us acceptable replication speeds.
6. Rate Limiting
Rate limiting in swift is implemented as a pluggable middleware. Rate limiting is performed on requests that result in database writes to the account and container sqlite dbs. It uses memcached and is dependent on the proxy servers having highly synchronized time. The rate limits are limited by the accuracy of the proxy server clocks.
6.1 Configuration
All configuration is optional. If no account or container limits are provided there will be no rate limiting. Configuration available:
Option  Default  Description
clock_accuracy  1000  Represents how accurate the proxy servers' system clocks are with each other. 1000 means that all the proxies' clocks are accurate to each other within 1 millisecond. No ratelimit should be higher than the clock accuracy.
max_sleep_time_seconds  60  App will immediately return a 498 response if the necessary sleep time ever exceeds the given max_sleep_time_seconds.
log_sleep_time_seconds  0  To allow visibility into rate limiting set this value > 0 and all sleeps greater than the number will be logged.
rate_buffer_seconds  5  Number of seconds the rate counter can drop and be allowed to catch up (at a faster than listed rate). A larger number will result in larger spikes in rate but better average accuracy.
account_ratelimit  0  If set, will limit PUT and DELETE requests to /account_name/container_name. Number is in requests per second.
account_whitelist  ''  Comma separated list of account names that will not be rate limited.
account_blacklist  ''  Comma separated list of account names that will not be allowed. Returns a 497 response.
container_ratelimit_size  ''  When set with container_ratelimit_x = r: for containers of size x, limit requests per second to r. Will limit PUT, DELETE, and POST requests to /a/c/o.
The container rate limits are linearly interpolated from the values given. A sample container rate limiting could be:
container_ratelimit_100 = 100
container_ratelimit_200 = 50
container_ratelimit_500 = 20
This would result in the following limits (a small sketch reproducing this interpolation follows the table):
Container Size Rate Limit
0-99 No limiting
100 100
150 75
500 20
1000 20
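To make the interpolation concrete, the sketch below reproduces the table above from the three sample container_ratelimit_* settings. It is not the middleware's actual code, just the linear interpolation rule with clamping at the largest configured size.
import bisect

limits = [(100, 100.0), (200, 50.0), (500, 20.0)]   # (container size, requests/sec)

def container_rate_limit(container_size):
    sizes = [size for size, _rate in limits]
    if container_size < sizes[0]:
        return None                      # below the smallest size: no limiting
    if container_size >= sizes[-1]:
        return limits[-1][1]             # at or above the largest size: clamp
    i = bisect.bisect_right(sizes, container_size)
    (lo_size, lo_rate), (hi_size, hi_rate) = limits[i - 1], limits[i]
    frac = (container_size - lo_size) / float(hi_size - lo_size)
    return lo_rate + (hi_rate - lo_rate) * frac

for size in (50, 100, 150, 500, 1000):
    print(size, container_rate_limit(size))   # matches the table above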
7. Large Object Support
7.1 Overview
Swift has a limit on the size of a single uploaded object; by default this is 5GB. However, the download size of a single object is virtually unlimited with the concept of segmentation. Segments of the larger object are uploaded and a special manifest file is created that, when downloaded, sends all the segments concatenated as a single object. This also offers much greater upload speed with the possibility of parallel uploads of the segments.
7.2 Using swift for Segmented Objects
The quickest way to try out this feature is to use the included swift command-line tool. You can use the -S option to specify the segment size to use when splitting a large file. For example:
swift upload test_container -S 1073741824 large_file
This would split the large_file into 1G segments and begin uploading those segments in parallel. Once all the segments have been uploaded, swift will then create the manifest file so the segments can be downloaded as one.
So now, the following swift command would download the entire large object:
swift download test_container large_file
swift uses a strict convention for its segmented object support. In the above example it will upload all the segments into a second container named test_container_segments. These segments will have names like large_file/1290206778.25/21474836480/00000000, large_file/1290206778.25/21474836480/00000001, etc.
The main benefit for using a separate container is that the main container listings will not be polluted with all the segment names. The reason for using the segment name format of <name>/<timestamp>/<size>/<segment> is so that an upload of a new file with the same name won't overwrite the contents of the first until the last moment when the manifest file is updated.
swift will manage these segment files for you, deleting old segments on deletes and overwrites, etc. You can override this behavior with the --leave-segments option if desired; this is useful if you want to have multiple versions of the same large object available.
7.3 Direct API
You can also work with the segments and manifests directly with HTTP requests instead of having swift do that for you. You can just upload the segments like you would any other object and the manifest is just a zero-byte file with an extra X-Object-Manifest header.
All the object segments need to be in the same container, have a common object name prefix, and their names sort in the order they should be concatenated. They don't have to be in the same container as the manifest file will be, which is useful to keep container listings clean as explained above with swift.
The manifest file is simply a zero-byte file with the extra X-Object-Manifest: <container>/<prefix> header, where <container> is the container the object segments are in and <prefix> is the common prefix for all the segments.
It is best to upload all the segments first and then create or update the manifest. In this way, the full object won't be available for downloading until the upload is complete. Also, you can upload a new set of segments to a second location and then update the manifest to point to this new location. During the upload of the new segments, the original manifest will still be available to download the first set of segments.
Here's an example using curl with tiny 1-byte segments:
# First, upload the segments
curl -X PUT -H 'X-Auth-Token: <token>' \
http://<storage_url>/container/myobject/1 --data-binary '1'
curl -X PUT -H 'X-Auth-Token: <token>' \
http://<storage_url>/container/myobject/2 --data-binary '2'
curl -X PUT -H 'X-Auth-Token: <token>' \
http://<storage_url>/container/myobject/3 --data-binary '3'
# Next, create the manifest file
curl -X PUT -H 'X-Auth-Token: <token>' \
-H 'X-Object-Manifest: container/myobject/' \
http://<storage_url>/container/myobject --data-binary ''
# And now we can download the segments as a single object
curl -H 'X-Auth-Token: <token>' \
http://<storage_url>/container/myobject
7.4 Additional Notes
With a GET or HEAD of a manifest file, the X-Object-Manifest: <container>/<prefix> header will be returned with the concatenated object so you can tell where it's getting its segments from.
The response's Content-Length for a GET or HEAD on the manifest file will be the sum of all the segments in the <container>/<prefix> listing, dynamically. So, uploading additional segments after the manifest is created will cause the concatenated object to be that much larger; there's no need to recreate the manifest file.
The response's Content-Type for a GET or HEAD on the manifest will be the same as the Content-Type set during the PUT request that created the manifest. You can easily change the Content-Type by reissuing the PUT.
The response's ETag for a GET or HEAD on the manifest file will be the MD5 sum of the concatenated string of ETags for each of the segments in the <container>/<prefix> listing, dynamically. Usually in Swift the ETag is the MD5 sum of the contents of the object, and that holds true for each segment independently. But it's not feasible to generate such an ETag for the manifest itself, so this method was chosen to at least offer change detection.
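A short sketch of that ETag rule, using three made-up segment ETags: the manifest's ETag is the MD5 of their concatenation rather than of the object bytes.
from hashlib import md5

segment_etags = ['0cc175b9c0f1b6a831c399e269772661',   # ETag of segment 1
                 '92eb5ffee6ae2fec3ad71c777531578f',   # ETag of segment 2
                 '4a8a08f09d37b73795649038408b5f33']   # ETag of segment 3

manifest_etag = md5(''.join(segment_etags).encode('ascii')).hexdigest()
print(manifest_etag)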
Note
If you are using the container sync feature you will need to ensure both your manifest file and your segment files are synced if they happen to be in different containers.
7.5 History
Large object support has gone through various iterations before settling on this implementation.
The primary factor driving the limitation of object size in swift is maintaining balance among the partitions of the ring. To maintain an even dispersion of disk usage throughout the cluster the obvious storage pattern was to simply split larger objects into smaller segments, which could then be glued together during a read.
Before the introduction of large object support some applications were already splitting their uploads into segments and re-assembling them on the client side after retrieving the individual pieces. This design allowed the client to support backup and archiving of large data sets, but was also frequently employed to improve performance or reduce errors due to network interruption. The major disadvantage of this method is that knowledge of the original partitioning scheme is required to properly reassemble the object, which is not practical for some use cases, such as CDN origination.
In order to eliminate any barrier to entry for clients wanting to store objects larger than 5GB, initially we also prototyped fully transparent support for large object uploads. A fully transparent implementation would support a larger max size by automatically splitting objects into segments during upload within the proxy without any changes to the client API. All segments were completely hidden from the client API.
This solution introduced a number of challenging failure conditions into the cluster, wouldn't provide the client with any option to do parallel uploads, and had no basis for a resume feature. The transparent implementation was deemed just too complex for the benefit.
The current “user manifest” design was chosen in order to provide a transparent download of large objects to the client and still provide the uploading client a clean API to support segmented uploads.
Alternative “explicit” user manifest options were discussed which would have required a pre-defined format for listing the segments to “finalize” the segmented upload. While this may offer some potential advantages, it was decided that pushing an added burden onto the client which could potentially limit adoption should be avoided in favor of a simpler “API” (essentially just the format of the 'X-Object-Manifest' header).
During development it was noted that this “implicit” user manifest approach which is based on the path prefix can be potentially affected by the eventual consistency window of the container listings, which could theoretically cause a GET on the manifest object to return an invalid whole object for that short term. In reality you're unlikely to encounter this scenario unless you're running very high concurrency uploads against a small testing environment which isn't running the object-updaters or container-replicators.
Like all of swift, Large Object Support is a living feature which will continue to improve and may change over time.
8. Container to Container Synchronization
8.1 Overview
Swift has a feature where all the contents of a container can be mirrored to another container through background synchronization. Swift cluster operators configure their cluster to allow/accept sync requests to/from other clusters, and the user specifies where to sync their container to along with a secret synchronization key.
Note
Container sync will sync object POSTs only if the proxy server is set to use “object_post_as_copy = true” which is the default. So-called fast object posts, “object_post_as_copy = false” do not update the container listings and therefore can't be detected for synchronization.
Note
If you are using the large objects feature you will need to ensure both your manifest file and your segment files are synced if they happen to be in different containers.
8.2 Configuring a Cluster's Allowable Sync Hosts
The Swift cluster operator must allow synchronization with a set of hosts before the user can enable container synchronization. First, the backend container server needs to be given this list of hosts in the container-server.conf file:
[DEFAULT]
# This is a comma separated list of hosts allowed in the
# X-Container-Sync-To field for containers.
# allowed_sync_hosts = 127.0.0.1
allowed_sync_hosts = host1,host2,etc.
...
[container-sync]
# You can override the default log routing for this app here (don't
# use set!):
# log_name = container-sync
# log_facility = LOG_LOCAL0
# log_level = INFO
# Will sync, at most, each container once per interval
# interval = 300
# Maximum amount of time to spend syncing each container
# container_time = 60
Tracking sync progress, problems, and just general activity can only be achieved with log processing for this first release of container synchronization. In that light, you may wish to set the above log_ options to direct the container-sync logs to a different file for easier monitoring. Additionally, it should be noted there is no way for an end user to detect sync progress or problems other than HEADing both containers and comparing the overall information.
The authentication system also needs to be configured to allow synchronization requests. Here is an example with TempAuth:
[filter:tempauth]
# This is a comma separated list of hosts allowed to send
# X-Container-Sync-Key requests.
# allowed_sync_hosts = 127.0.0.1
allowed_sync_hosts = host1,host2,etc.
The default of 127.0.0.1 is just so no configuration is required for SAIO setups – for testing.
8.3 Using the swift tool to set up synchronized containers
Note
You must be the account admin on the account to set synchronization targets and keys.
You simply tell each container where to sync to and give it a secret synchronization key. First, let's get the account details for our two cluster accounts:
$ swift -A http://cluster1/auth/v1.0 -U test:tester -K testing stat -v
StorageURL: http://cluster1/v1/AUTH_208d1854-e475-4500-b315-81de645d060e
Auth Token: AUTH_tkd5359e46ff9e419fa193dbd367f3cd19
Account: AUTH_208d1854-e475-4500-b315-81de645d060e
Containers: 0
Objects: 0
Bytes: 0
$ swift -A http://cluster2/auth/v1.0 -U test2:tester2 -K testing2 stat -v
StorageURL: http://cluster2/v1/AUTH_33cdcad8-09fb-4940-90da-0f00cbf21c7c
Auth Token: AUTH_tk816a1aaf403c49adb92ecfca2f88e430
Account: AUTH_33cdcad8-09fb-4940-90da-0f00cbf21c7c
Containers: 0
Objects: 0
Bytes: 0
Now, let's make our first container and tell it to synchronize to a second we'll make next:
$ swift -A http://cluster1/auth/v1.0 -U test:tester -K testing post \
-t 'http://cluster2/v1/AUTH_33cdcad8-09fb-4940-90da-0f00cbf21c7c/container2' \
-k 'secret' container1
The -t indicates the URL to sync to, which is the StorageURL from cluster2 we retrieved above plus the container name. The -k specifies the secret key the two containers will share for synchronization. Now, we'll do something similar for the second cluster's container:
$ swift -A http://cluster2/auth/v1.0 -U test2:tester2 -K testing2 post \
-t 'http://cluster1/v1/AUTH_208d1854-e475-4500-b315-81de645d060e/container1' \
-k 'secret' container2
That's it. Now we can upload a bunch of stuff to the first container and watch as it gets synchronized over to the second:
$ swift -A http://cluster1/auth/v1.0 -U test:tester -K testing \
upload container1 .
photo002.png
photo004.png
photo001.png
photo003.png
$ swift -A http://cluster2/auth/v1.0 -U test2:tester2 -K testing2 \
list container2
[Nothing there yet, so we wait a bit...]
[If you're an operator running SAIO and just testing, you may need to
run 'swift-init container-sync once' to perform a sync scan.]
$ swift -A http://cluster2/auth/v1.0 -U test2:tester2 -K testing2 \
list container2
photo001.png
photo002.png
photo003.png
photo004.png
You can also set up a chain of synced containers if you want more than two. You'd point 1 -> 2, then 2 -> 3, and finally 3 -> 1 for three containers. They'd all need to share the same secret synchronization key.
8.4 Using curl (or other tools) instead
So what's swift doing behind the scenes? Nothing overly complicated. It translates the -t option into an X-Container-Sync-To: header and the -k option into an X-Container-Sync-Key: header.
For instance, when we created the first container above and told it to synchronize to the second, we could have used this curl command:
$ curl -i -X POST -H 'X-Auth-Token: AUTH_tkd5359e46ff9e419fa193dbd367f3cd19' \
-H 'X-Container-Sync-To: http://cluster2/v1/AUTH_33cdcad8-09fb-4940-90da-0f00cbf21c7c/container2' \
-H 'X-Container-Sync-Key: secret' \
'http://cluster1/v1/AUTH_208d1854-e475-4500-b315-81de645d060e/container1'
HTTP/1.1 204 No Content
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Date: Thu, 24 Feb 2011 22:39:14 GMT
8.5 What's going on behind the scenes, in the cluster?
The swift-container-sync does the job of sending updates to the remote container.
This is done by scanning the local devices for container databases and checking for x-container-sync-to and x-container-sync-key metadata values. If they exist, newer rows since the last sync will trigger PUTs or DELETEs to the other container.
Note
Container sync will sync object POSTs only if the proxy server is set to use “object_post_as_copy = true” which is the default. So-called fast object posts, “object_post_as_copy = false” do not update the container listings and therefore can't be detected for synchronization.
The actual syncing is slightly more complicated to make use of the three (or number-of-replicas) main nodes for a container without each trying to do the exact same work but also without missing work if one node happens to be down.
Two sync points are kept per container database. All rows between the two sync points trigger updates. Any rows newer than both sync points cause updates depending on the node's position for the container (primary nodes do one third, etc. depending on the replica count of course). After a sync run, the first sync point is set to the newest ROWID known and the second sync point is set to newest ROWID for which all updates have been sent.
An example may help. Assume replica count is 3 and perfectly matching ROWIDs starting at 1. (A small code sketch of the same bookkeeping follows the example.)
First sync run, database has 6 rows:
SyncPoint1 starts as -1.
SyncPoint2 starts as -1.
No rows between points, so no “all updates” rows.
Six rows newer than SyncPoint1, so a third of the rows are sent by node 1, another third by node 2, remaining third by node 3.
SyncPoint1 is set as 6 (the newest ROWID known).
SyncPoint2 is left as -1 since no “all updates” rows were synced.
Next sync run, database has 12 rows:
SyncPoint1 starts as 6.
SyncPoint2 starts as -1.
The rows between -1 and 6 all trigger updates (most of which should short-circuit on the remote end as having already been done).
Six more rows newer than SyncPoint1, so a third of the rows are sent by node 1, another third by node 2, remaining third by node 3.
SyncPoint1 is set as 12 (the newest ROWID known).
SyncPoint2 is set as 6 (the newest “all updates” ROWID).
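The sketch below condenses that bookkeeping. The exact way Swift divides the newer rows among the primary nodes is abstracted here as a simple modulo on the row id, which is an assumption for illustration; the point is that rows between the two sync points are always sent, newer rows are split between the nodes, and the sync points then advance as described.
def rows_to_send(row_ids, sync_point1, sync_point2, node_index, replica_count=3):
    # Rows between the two sync points are sent unconditionally ("all updates");
    # rows newer than both sync points are split between the primary nodes.
    to_send = []
    for row_id in row_ids:
        if sync_point2 < row_id <= sync_point1:
            to_send.append(row_id)
        elif row_id > sync_point1 and row_id % replica_count == node_index:
            to_send.append(row_id)
    return to_send

# First run from the example above: 6 rows, both sync points at -1.
print(rows_to_send(range(1, 7), -1, -1, node_index=0))
# Next run: 12 rows, SyncPoint1 now 6, SyncPoint2 still -1.
print(rows_to_send(range(1, 13), 6, -1, node_index=0))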
In this way, under normal circumstances each node sends its share of updates each run and just sends a batch of older updates to ensure nothing was missed.