Yanyg - Software Engineer

XArray/SparseArray

目录

1 概述

XArray(eXtensible ARRAY)/SparseArray 是一种多维数组。 XArray 实现和 Trie 类似,都是数字查找检索。 XArray 适用于稀疏数组情况(不存在的元素比较多)。典型的使用场景是内核的页缓存,对于块设备或文件,其一是尺寸远小于64位地址空间所能表示的大小,另外实际存在的部分,通常大部分不会在页缓存中。

数字检索技术在TAOCP卷3的6.3节有详细描述:

From <The Art of Computer Programming> V3 6.3:

Instead of basing a search method on comparisons between keys, we can make use of their representation as a sequence of digits or alphabetic characters. Consider, for example, the thumb index on a large dictionary; from the first letter of a given word, we can immediately locate the pages that contain all words beginning with that letter.

If we pursue the thumb-index idea to one of its logical conclusions, we come up with a searching scheme based on repeating "subscripting".

XArray 内部实现为多层(Multi Layer)检索数据结构,每层有特定数量的bits。最外一层称为Leaf,存放实际数组对象。其他各层存放下一层级的索引地址。

XArray 在Linux kernel有实现,已经替换掉原来的Radix Tree。参见这里 https://elixir.bootlin.com/linux/latest/source/lib/xarray.c

假定有一个 {2,3,3}XArray ,访问 123234 的示意图如下:

maxIndex=(1<<(2+3+3)) - 1 = 255

1. Xarray[123], (1<<6) + (7<<3) + 3
layer1_idx = 1
layer2_idx = 7
layer3_idx = 3

LayerIdx   Layer1          Layer2         Layer3
   0        | |             | |            | |
   1        |1|------|      | |            | |
   2        | |      |      | |            | |
   3        | |      |      | |     |----->|3|<value>
   4        | |      |      | |     |      | |
   5        | |      |      | |     |      | |
   6        | |      |      | |     |      | |
   7        | |      |----->|7|-----|      | |

2. Xarray[234], (3<<6) + (5<<3) + 2
layer1_idx = 3
layer2_idx = 5
layer3_idx = 2

LayerIdx   Layer1          Layer2         Layer3
   0        | |             | |            | |
   1        | |             | |            | |
   2        | |             | |     |----->|2|<value>
   3        |3|------|      | |     |      |3|
   4        | |      |      | |     |      | |
   5        | |      |----->|5|-----|      | |
   6        | |             | |            | |
   7        | |             |7|            | |

2 XArray vs. RadixTree in Linux Kernel

  • 两者实质都是 Trie ,每一层4/6 bits,因此一个大的index需要查找16/11层;
  • XArray有更良好的API设计;

3 Adaptive Radix Tree

Paper: https://db.in.tum.de/~leis/papers/ART.pdf

XArray也可参考此论文,实现大Index的压缩。比如16Byte UUID的XArray。 UUID规范:https://tools.ietf.org/html/rfc4122

UUID                   = time-low "-" time-mid "-"
                         time-high-and-version "-"
                         clock-seq-and-reserved
                         clock-seq-low "-" node
time-low               = 4hexOctet
time-mid               = 2hexOctet
time-high-and-version  = 2hexOctet
clock-seq-and-reserved = hexOctet
clock-seq-low          = hexOctet
node                   = 6hexOctet
hexOctet               = hexDigit hexDigit
hexDigit =
      "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9" /
      "a" / "b" / "c" / "d" / "e" / "f" /
      "A" / "B" / "C" / "D" / "E" / "F"

4 Implementation

暂不公开。

5 Learn from youtube

5.1 Xarray: One Data Structure To Rule Them All

5.1.1 What is the Xarray

  • Automatically resizing array of pointers
  • Indexed by unsigned long
  • All pointers initially NULL
  • Embeds a spinlock
  • Loads are store-free (using rcu)

5.1.2 Fundamentals

  • Some users only need load, store and (maybe) iterate:
void *xa_load(struct xarray*, unsigned long index);
void *xa_store(struct xarray*, unsigned long index, void *entry, gfp_t);
void *xa_erase(struct xarray*, unsigned long index);
xa_for_each(struct xarray*, unsigned long index, void *entry);

5.1.3 Three auxiliary bits per non-NULL entry

void xa_set_mark(struct xarray*, unsigned long index, xa_mark_t);
void xa_clear_mark(struct xarray*, unsigned long index, xa_mark_t);
bool xa_get_mark(struct xarray*, unsigned long index, xa_mark_t);

5.1.4 Some users need something a little more complex

5.1.5 The Xarray can trace free entries for you

6 Reference