深入剖析HashSet和HashMap实现

HashSet是一个包含非重复元素的集合,如何实现的,要从底层实现代码看起。

背景

首先非重复元素如何定义,看Set的描述:

More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element.

Set不会找到两个元素,并且两个元素满足e1.equals(e2)为true;并且最多只有一个null元素。

如果没有重写equals方法,查看Object类中equal方法的实现,==比较的其实是两个对象在内存中的地址。

1
2
3
public boolean equals(Object obj) {
    return (this == obj);
}

说起equals方法,就不得不说hashCode方法了。Java中对于hashCode有个常规协定

The general contract of hashCode is:

  • Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

  • 程序执行期间,在同一个对象上执行多次hashCode方法,都返回相同的整数,前提是equals比较中所使用的字段没有被修改。跨应用中的hashCode方法调用返回的整数不要求相同。

  • 如果两个对象根据equals方法比较相同,那hashCode返回的整数也必须相同。

  • 如果两个对象equals方法比较不相同,调用hashCode返回的整数不需要不同。但是程序员应该知道为不相等的对象生成不同的整数可以提高哈希表的性能。

HashSet的底层实现

HashSet的底层是通过HashMap实现的,将元素作为map的key以达到去重的目的,value使用的是同一个虚拟的Object实例。

1
2
3
4
5
6
7
8
private transient HashMap<E,Object> map;

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

public boolean add(E e) {
   return map.put(e, PRESENT)==null;
}

HashMap的底层实现

{% asset_img hashmap-structure.jpg %}

到最后我们要看HashMap的实现了,简单说就是一个数组+链表的结合。

Entry元素

Entry是链表的结果,key为Map中的key,value为Map中的value,hash为key的hash结果,next为下一个元素。

1
2
3
4
5
6
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;
}
添加元素

JDK8此处有更新,见末尾

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
public V put(K key, V value) {
    if (table == EMPTY_TABLE) {
        inflateTable(threshold);
    }
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
扩充table

对toSzie算出最小的2的幂值,用了Integer.highestOneBit((toSize -1) << 1)。减一之后左移一位,然后取最高位值,其余为补0。

为什么数组长度必须为2的幂值,请继续看。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
/**
 * 扩充table
**/
private void inflateTable(int toSize) {
    // Find a power of 2 >= toSize
    int capacity = roundUpToPowerOf2(toSize);
    threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    table = new Entry[capacity];
    initHashSeedAsNeeded(capacity);
}
计算hash值

hashSeed值为0,将key的hashCode值做多次位移和异或运算

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }
    h ^= k.hashCode();
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
计算元素位置

这里的逻辑很简单:将hash值跟数组长度-1做了按位与。

在进行查找的时候是通过key的hash值,如果我们将元素的位置分布得尽量均匀一些,尽量做到每个位置上只有一个元素,达到O(1)的查找。这种查找通过取余就可以做到,在Java中如何做到比较快的取余呢,答案是位与运算。

上面扩充数组的时候我们保证长度为2的幂值,那减一之后就是每位都是1。做位与运算就能保证低位不同的hash值会落在不同的位置上,降低冲突(碰撞),最大程度做到均匀分布,减少链表的出现(查找变成O(n))。

1
2
3
4
static int indexFor(int h, int length) {
    // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
    return h & (length-1);
}
添加entry

添加新的元素时要检查元素个数是否达到阈值,否则要做扩容处理,新table的容量为当前table长度的两倍。

1
2
3
4
5
6
7
8
void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }
    createEntry(hash, key, value, bucketIndex);
}
resize

新table的容量为当前table长度的两倍(table.length >= size),将旧数据中的数据迁移到新的数组中,迁移的过程中要重新计算元素在新数组中的位置。网上很多地方提到这个操作rehash,但我觉得reindex反而更恰当一些。JDK中对rehash有额外的定义,就是initHashSeedAsNeeded。当新的容量>=jdk.map.althashing.threshold的配置时,会重新计算key的hash值,即hash(e.key)。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
void resize(int newCapacity) {
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }
    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable, initHashSeedAsNeeded(newCapacity));
    table = newTable;
    threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
}
reindex
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) {
        while(null != e) {
            Entry<K,V> next = e.next;
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            int i = indexFor(e.hash, newCapacity);
            e.next = newTable[i];
            newTable[i] = e;
            e = next;
        }
    }
}

JDK8 update 添加元素的时候,如果发生哈希冲突,会遍历链表。加入链表的长度大于TREEIFY_THRESHOLD(默认为8),会将链表转成红黑树

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}
<!---->
final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        TreeNode<K,V> hd = null, tl = null;
        do {
            TreeNode<K,V> p = replacementTreeNode(e, null);
            if (tl == null)
                hd = p;
            else {
                p.prev = tl;
                tl.next = p;
            }
            tl = p;
        } while ((e = e.next) != null);
        if ((tab[index] = hd) != null)
            hd.treeify(tab);
    }
}

同样,get(key)的时候也会相应的从树中查找元素。

Comments

comments powered by Disqus