Yanyg - Software Engineer

Data Placement


In distributed storage, data placement solves the problem of distributing data sensibly across multiple nodes. It needs to satisfy:

(Figure: data-placement.png)

1 Technical Points

1.1 Probability Distribution Function

2 HDFS Implementation

2.1 Translation

2.1.1 Introduction

HDFS uses BlockPlacementPolicyDefault by default. Under this policy, one replica is placed on the local node and the other two replicas on two different nodes of the same remote rack. Beyond this, HDFS supports several pluggable block placement policies, and users can choose a policy based on their infrastructure and use case. This section describes these policies in detail, along with their use cases and configuration.
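To make the default layout concrete, here is a minimal Python sketch of the selection logic. The topology, node names, and simplifications are mine for illustration, not HDFS's actual Java implementation:

import random

# Hypothetical cluster map: rack -> datanodes on that rack.
TOPOLOGY = {
    "/rack1": ["dn1", "dn2", "dn3"],
    "/rack2": ["dn4", "dn5", "dn6"],
    "/rack3": ["dn7", "dn8", "dn9"],
}

def default_placement(writer, writer_rack):
    """Mimic BlockPlacementPolicyDefault for a replication factor of 3:
    first replica on the writer's node, the other two on two different
    nodes of a single remote rack."""
    remote_rack = random.choice([r for r in TOPOLOGY if r != writer_rack])
    second, third = random.sample(TOPOLOGY[remote_rack], 2)
    return [writer, second, third]

print(default_placement("dn1", "/rack1"))  # e.g. ['dn1', 'dn5', 'dn4']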

2.1.2 BlockPlacementPolicyRackFaultTolerant

BlockPlacementPolicyRackFaultTolerant can be used when block replicas need to be spread across multiple racks. With a replication factor of 3, BlockPlacementPolicyDefault behaves as follows: if the writer runs on a datanode, one replica is placed on that local machine, otherwise on a random datanode in the writer's rack; the second replica is placed on a datanode in a different (remote) rack; and the last replica on a different datanode in that same remote rack. Only 2 racks are used in total, so if those 2 racks go down at the same time, the data becomes unavailable. BlockPlacementPolicyRackFaultTolerant instead places the three replicas on three different racks. For more details see HDFS-7891.

Configuration (hdfs-site.xml):

<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>
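For contrast, a sketch of the rack-fault-tolerant selection under the same made-up topology as above, assuming the replication factor does not exceed the number of racks:

import random

TOPOLOGY = {
    "/rack1": ["dn1", "dn2", "dn3"],
    "/rack2": ["dn4", "dn5", "dn6"],
    "/rack3": ["dn7", "dn8", "dn9"],
}

def rack_fault_tolerant_placement(replication=3):
    """Mimic BlockPlacementPolicyRackFaultTolerant: one replica per rack
    (assuming replication <= number of racks), so any two racks can fail
    and at least one replica survives."""
    racks = random.sample(list(TOPOLOGY), replication)
    return [random.choice(TOPOLOGY[r]) for r in racks]

print(rack_fault_tolerant_placement())  # e.g. ['dn2', 'dn6', 'dn7']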

2.1.3 BlockPlacementPolicyWithNodeGroup

The new 3-layer topology introduces the concept of a node group, which maps well onto infrastructure based on a virtualized environment. In a virtualized environment, multiple VMs run on the same physical machine, and VMs on the same physical host are affected by the same hardware failures. By mapping each physical host to a node group, this placement policy guarantees that no more than one replica is ever placed on the same node group, so when a node group fails, at most one replica is lost.
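A sketch of what that guarantee looks like as a selection rule (the VM names and node groups below are made up):

import random

# Hypothetical /rack/nodegroup topology: one node group per physical host.
NODES = {
    "vm1": "/rack1/nodegroup1", "vm2": "/rack1/nodegroup1",
    "vm3": "/rack1/nodegroup2", "vm4": "/rack2/nodegroup3",
    "vm5": "/rack2/nodegroup3", "vm6": "/rack2/nodegroup4",
}

def choose_targets(replication=3):
    """Pick replica targets such that no two replicas share a node group
    (physical host), so losing one host costs at most one replica."""
    chosen, used_groups = [], set()
    for vm in random.sample(list(NODES), len(NODES)):
        if NODES[vm] not in used_groups:
            chosen.append(vm)
            used_groups.add(NODES[vm])
        if len(chosen) == replication:
            break
    return chosen

print(choose_targets())  # never two VMs from the same nodegroup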

Configurations:

  • core-site.xml
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
  • hdfs-site.xml:
<property>
  <name>dfs.block.replicator.classname</name>
  <value>
    org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup
  </value>
</property>
  • Topology script

The topology script is the same as in the examples above; the only difference is that instead of returning only /{rack}, the script should return /{rack}/{nodegroup}. The following is an example topology mapping table (a sketch of such a script follows the table):

192.168.0.1 /rack1/nodegroup1
192.168.0.2 /rack1/nodegroup1
192.168.0.3 /rack1/nodegroup2
192.168.0.4 /rack1/nodegroup2
192.168.0.5 /rack2/nodegroup3
192.168.0.6 /rack2/nodegroup3
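Hadoop runs the script configured via net.topology.script.file.name with hostnames or IPs as arguments and expects one topology path per output line, so a toy Python version of the table above could look like this (the fallback path is my own choice, not a Hadoop default):

#!/usr/bin/env python3
"""Toy topology script: print /{rack}/{nodegroup} for every host argument."""
import sys

MAPPING = {
    "192.168.0.1": "/rack1/nodegroup1",
    "192.168.0.2": "/rack1/nodegroup1",
    "192.168.0.3": "/rack1/nodegroup2",
    "192.168.0.4": "/rack1/nodegroup2",
    "192.168.0.5": "/rack2/nodegroup3",
    "192.168.0.6": "/rack2/nodegroup3",
}

for host in sys.argv[1:]:
    # Fall back to a default location for hosts missing from the table.
    print(MAPPING.get(host, "/default-rack/default-nodegroup"))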

For more details see HDFS-8468.

AvailableSpaceBlockPlacementPolicy is a space-balanced placement policy. It is similar to BlockPlacementPolicyDefault, but for new blocks it chooses datanodes with a lower used-space percentage with slightly higher probability.

For more details see HDFS-8131.
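The mechanism, as I understand HDFS-8131, is a biased choice between candidate datanodes; here is a simplified two-candidate sketch, where the 0.6 default mirrors dfs.namenode.available-space-block-placement-policy.balanced-space-preference-fraction:

import random

def choose_with_space_bias(a, b, used_pct, preference=0.6):
    """Given two candidate datanodes, return the one with the lower
    used-space percentage with probability `preference`, otherwise the
    other. With preference > 0.5, emptier nodes win slightly more often."""
    low, high = sorted((a, b), key=lambda d: used_pct[d])
    return low if random.random() < preference else high

usage = {"dn1": 85.0, "dn2": 40.0}
picks = [choose_with_space_bias("dn1", "dn2", usage) for _ in range(10000)]
print(picks.count("dn2") / len(picks))  # ~0.6: dn2 is favored, not forced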

AvailableSpaceRackFaultTolerantBlockPlacementPolicy is a space-balanced placement policy similar to BlockPlacementPolicyRackFaultTolerant: it distributes blocks across as many racks as possible, while also choosing datanodes with a lower used-space percentage with slightly higher probability.

For more details see HDFS-15288.

2.2 Original Text

From https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsBlockPlacementPolicies.html

BlockPlacementPolicies

Introduction


By default HDFS supports BlockPlacementPolicyDefault. Where one block on local and copy on 2 different nodes of same remote rack. Additional to this HDFS supports several different pluggable block placement policies. Users can choose the policy based on their infrastructure and use case. This document describes the detailed information about the type of policies with its use cases and configuration.

BlockPlacementPolicyRackFaultTolerant


BlockPlacementPolicyRackFaultTolerant can be used to split the placement of blocks across multiple rack. By default with replication of 3 BlockPlacementPolicyDefault will put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode in the same rack as that of the writer, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. So totally 2 racks will be used, in scenario like 2 racks going down at the same time will cause data inavailability where using BlockPlacementPolicyRackFaultTolerant will help in placing 3 blocks on 3 different racks.

For more detail check HDFS-7891.

(Figure: data-placement-hdfs-rack-fault.png)

Configurations: hdfs-site.xml

<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant</value>
</property>

BlockPlacementPolicyWithNodeGroup

With new 3 layer hierarchical topology, a node group level got introduced, which maps well onto a infrastructure that is based on a virtualized environment. In Virtualized environment multiple vm's will be hosted on same physical machine. Vm's on the same physical host are affected by the same hardware failure. So mapping the physical host a node groups this block placement gurantees that it will never place more than one replica on the same node group (physical host), in case of node group failure, only one replica will be lost at the maximum.

Configurations:

  • core-site.xml

<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>

  • hdfs-site.xml

<property>
  <name>dfs.block.replicator.classname</name>
  <value>
    org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup
  </value>
</property>

  • Topology script

Topology script is the same as the examples above, the only difference is, instead of returning only /{rack}, the script should return /{rack}/{nodegroup}. Following is an example topology mapping table:

192.168.0.1 /rack1/nodegroup1
192.168.0.2 /rack1/nodegroup1
192.168.0.3 /rack1/nodegroup2
192.168.0.4 /rack1/nodegroup2
192.168.0.5 /rack2/nodegroup3
192.168.0.6 /rack2/nodegroup3

For more details check HDFS-8468.

BlockPlacementPolicyWithUpgradeDomain

To address the limitation of block placement policy on rolling upgrade, the concept of upgrade domain has been added to HDFS via a new block placement policy. The idea is to group datanodes in a new dimension called upgrade domain, in addition to the existing rack-based grouping. For example, we can assign all datanodes in the first position of any rack to upgrade domain ud_01, nodes in the second position to upgrade domain ud_02 and so on. It will make sure replicas of any given block are distributed across machines from different upgrade domains. By default, 3 replicas of any given block are placed on 3 different upgrade domains. This means all datanodes belonging to a specific upgrade domain collectively won’t store more than one replica of any block.

For more details check HDFS-9006.

Detailed info about configuration Upgrade Domain Policy.
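A small sketch of the invariant described above (the upgrade-domain assignments are made up, following the by-rack-position example):

# Hypothetical mapping: datanode -> upgrade domain (here by rack position).
UPGRADE_DOMAIN = {
    "r1-dn1": "ud_01", "r1-dn2": "ud_02", "r1-dn3": "ud_03",
    "r2-dn1": "ud_01", "r2-dn2": "ud_02", "r2-dn3": "ud_03",
}

def at_most_one_replica_per_domain(replica_nodes):
    """The invariant the policy enforces: replicas of a block never share
    an upgrade domain, so a whole domain can be taken down for an upgrade
    while every block keeps its other replicas live."""
    domains = [UPGRADE_DOMAIN[n] for n in replica_nodes]
    return len(set(domains)) == len(domains)

assert at_most_one_replica_per_domain(["r1-dn1", "r2-dn2", "r1-dn3"])
assert not at_most_one_replica_per_domain(["r1-dn1", "r2-dn1", "r1-dn2"])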

AvailableSpaceBlockPlacementPolicy

The AvailableSpaceBlockPlacementPolicy is a space balanced block placement policy. It is similar to BlockPlacementPolicyDefault but will choose low used percent datanodes for new blocks with a little high possibility.

AvailableSpaceRackFaultTolerantBlockPlacementPolicy

The AvailableSpaceRackFaultTolerantBlockPlacementPolicy is a space balanced block placement policy similar to AvailableSpaceBlockPlacementPolicy. It extends BlockPlacementPolicyRackFaultTolerant and distributes the blocks amongst maximum number of racks possible and at the same time will try to choose datanodes with low used percent with high probability.

3 How to test
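A minimal end-to-end check, assuming a running cluster and a local file test.bin (the paths here are placeholders): write a file, then ask the NameNode where the replicas landed and verify they match the configured policy, e.g. three distinct racks for BlockPlacementPolicyRackFaultTolerant.

import subprocess

# Upload a test file, then ask the NameNode where its replicas landed.
subprocess.run(
    ["hdfs", "dfs", "-put", "test.bin", "/tmp/placement-test.bin"],
    check=True)
report = subprocess.run(
    ["hdfs", "fsck", "/tmp/placement-test.bin",
     "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True).stdout
# Each block line lists the datanodes (with rack paths) holding a replica;
# check that they satisfy the configured placement policy.
print(report)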

4 References