Finish the set lecture notes and code

PegasusWang · PegasusWang · commit 1ac17eb34d2e · 2018-04-22T16:22:18.000+08:00
diff --git a/docs/7_哈希表/hashtable.md b/docs/7_哈希表/hashtable.md
@@ -147,6 +147,9 @@ class HashTable(object):
 
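 The concrete implementation and the coding are walked through in the video. This code is not easy to get right and a small slip introduces a bug, so we again write unit tests to verify it.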
 
+# Exercises
+- Why can't a Slot simply be deleted under quadratic probing? Why do we define several states for it?
+
 # Further reading
 - Data Structures and Algorithms in Python, chapter 11, Hash Tables
 - Introduction to Algorithms, 3rd edition, chapter 11, Hash Tables
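
As one way to approach the Slot exercise above, here is a minimal sketch of why a removed slot needs its own state. It uses linear probing instead of the chapter's quadratic probing to keep the arithmetic obvious, stores bare integer keys, and the helper names (`probe`, `insert`, `find`, `demo`) are hypothetical, not part of the HashTable class.

```python
UNUSED = None          # slot never used: a lookup may stop here
EMPTY = object()       # tombstone: slot was used, then removed


def probe(table, key):
    """Yield slot indexes for key using linear probing (a simplification)."""
    start = key % len(table)
    for step in range(len(table)):
        yield (start + step) % len(table)


def insert(table, key):
    """Store key in the first free slot of its probe sequence."""
    for i in probe(table, key):
        if table[i] is UNUSED or table[i] is EMPTY:
            table[i] = key
            return i
    raise RuntimeError("table is full")


def find(table, key):
    """Return the index of key, or None if it is absent."""
    for i in probe(table, key):
        if table[i] is UNUSED:      # never used: key cannot be further along
            return None
        if table[i] is not EMPTY and table[i] == key:
            return i
    return None


def demo(marker):
    """Insert two colliding keys, remove the first using marker, find the second."""
    table = [UNUSED] * 7
    insert(table, 0)                 # lands in slot 0
    insert(table, 7)                 # 7 % 7 == 0 collides, lands in slot 1
    table[find(table, 0)] = marker   # "remove" key 0
    return find(table, 7)


naive_result = demo(UNUSED)      # removed slot reset to "never used"
tombstone_result = demo(EMPTY)   # removed slot marked with a tombstone
```

When the removed slot is reset to `UNUSED`, the lookup for 7 stops at slot 0 and reports a miss even though 7 is still stored in slot 1. With the `EMPTY` tombstone the lookup keeps probing and finds it.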
diff --git a/docs/8_字典/dict.md b/docs/8_字典/dict.md
@@ -11,7 +11,7 @@
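 The most common use of a dictionary is key-value storage, often as a cache; its keys are unique.
 The built-in collections.OrderedDict also preserves the insertion order of keys; you could build your own OrderedDict using the linked list we implemented earlier.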
 
-# Implementing dict
+# Implementing the dict ADT
 
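 The three basic methods implemented by HashTable above are exactly the three we use most often on a dictionary. Here we inherit from that class
 and implement more of the methods dict supports: items(), keys(), values(). Note, however, that in python2 and python3 these methods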
@@ -24,3 +24,29 @@ class DictADT(HashTable):
 ```
 
 In the video we demonstrate how to implement these methods and write unit tests to verify them.
+
+# Hashable
+A dict key must be hashable, which means it cannot be a mutable object such as a list. If you doubt it, run the following in ipython:
+
+```py
+d = dict()
+d[[1]] = 1
+# TypeError: unhashable type: 'list'
+```
+
+Here is how the python documentation puts it; read it and draw your own conclusions:
+
+> An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.
+
+> Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.
+
+> All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal (except with themselves), and their hash value is derived from their id().
+
+
+# Exercises
+- Can you implement the other dict operations on top of the hash table?
+- Which of python's built-in data types are hashable?
+- Do you know how python's hash function works?
+
+# Further reading
+Read the sections of the python documentation that cover dict.
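
To make the Hashable section and its exercises more concrete, the following sketch probes a few built-in values with `hash()` and defines a small value type that can serve as a dict key. The `Point` class and the `is_hashable` helper are hypothetical names introduced only for this illustration.

```python
class Point(object):
    """A hypothetical value type that is usable as a dict key."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __ne__(self, other):   # python2 does not derive != from ==
        return not self == other

    def __hash__(self):
        # objects that compare equal must return the same hash value
        return hash((self.x, self.y))


def is_hashable(obj):
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


results = {
    'int': is_hashable(1),
    'str': is_hashable('a'),
    'tuple of ints': is_hashable((1, 2)),
    'tuple holding a list': is_hashable((1, [2])),   # the inner list is mutable
    'list': is_hashable([1]),
    'dict': is_hashable({}),
    'set': is_hashable(set([1])),
    'frozenset': is_hashable(frozenset([1])),
}

d = {Point(1, 2): 'a'}
lookup = d[Point(1, 2)]   # an equal but distinct object finds the same entry
```

Note that a tuple is only hashable when everything inside it is hashable, and that `Point` instances should not be mutated while they are used as keys, because their hash is computed from `x` and `y`.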
diff --git a/docs/9_集合/set.md b/docs/9_集合/set.md
@@ -0,0 +1,47 @@
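+# Sets (set)
+
+This chapter covers sets. Under the hood a set is also implemented with a hash table, so, just like DictADT, it is fairly simple to build on top of HashTable.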
+
+
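+# Set operations
+The most common uses of a set are probably deduplication and membership tests, but compared with dict, set offers a richer group of operations, mostly the mathematical ones.
+If you have studied sets in a discrete mathematics course, the concepts are essentially the same. python's set provides the following basic operations.
+Given two sets A and B: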
+
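+- Intersection: A & B, the elements that are in both A and B. Implemented in python by overloading `__and__`
+- Union: A | B, the elements that are in A or in B (the two sets combined). Implemented in python by overloading `__or__`
+- Difference: A - B, the elements that are in A but not in B. Implemented in python by overloading `__sub__`
+- Symmetric difference: A ^ B, the elements that are in A or B but not in both, i.e. (A|B) - (A&B). Implemented in python by overloading `__xor__`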
+
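+python's built-in set implements &, |, -, ^ by overloading the built-in operators, and we implement them the same way here;
+I demonstrate the implementation in the video. python also provides the four methods intersection, union, difference, and symmetric_difference,
+which do the same thing as the operators.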
+
+![](./set.png)
+
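+# python frozenset
+python also has frozenset. As the name suggests, it is a set too, but its contents cannot be changed. The typical use
+is to initialize it from an iterable and then use it only for operations such as membership tests.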
+
+
+# Implementing a set ADT
+How do we implement a set ADT? It is still just a hash table: a hash table has keys and values, so we store every key with the placeholder value True.
+
+```py
+class SetADT(HashTable):
+
+    def add(self, key):
+        # A set is really just a dict whose values are all the placeholder True
+        return super(SetADT, self).add(key, True)
+```
+
+The other mathematical operations take a bit more work.
+
+
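+# Exercises
+- Can you implement the symmetric difference operation? I have not implemented it here; it is left to you as an exercise.
+- Do you know how to overload python's built-in operators? Our set operations rely on overloading; please read the relevant python documentation.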
+
+
+# Further reading
+Read the sections of the python documentation on set to find out what other operations it supports, for example the comparison operators.
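
As a quick check of the set.md notes above, this sketch runs the four operators against their named methods on python's built-in set and shows the read-only nature of frozenset. It uses only built-in types; the variable names are illustrative.

```python
a = set([1, 2, 3])
b = set([3, 4, 5])

# Each operator has a named method that gives the same result for two sets.
pairs = [
    (a & b, a.intersection(b)),
    (a | b, a.union(b)),
    (a - b, a.difference(b)),
    (a ^ b, a.symmetric_difference(b)),
]
operators_match_methods = all(x == y for x, y in pairs)

# One difference: the methods accept any iterable, while the operators
# require a set (or frozenset) on both sides.
from_list = a.union([9])
try:
    a | [9]
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

# Comparison operators express subset relations.
is_subset = set([1, 2]) <= a

# frozenset has no mutating methods, which is what makes it hashable.
frozen = frozenset([1, 2, 3])
can_mutate = hasattr(frozen, 'add')
seen = {frozen: 'usable as a dict key'}
```

So the operators and methods agree whenever both operands are sets; the methods are simply more permissive about the type of their argument.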
diff --git a/docs/9_集合/set.png b/docs/9_集合/set.png
diff --git a/docs/9_集合/set_adt.py b/docs/9_集合/set_adt.py
@@ -0,0 +1,203 @@
+# -*- coding: utf-8 -*-
+
+# Code copied from the array and list chapter
+
+
+class Array(object):
+
+    def __init__(self, size=32):
+        self._size = size
+        self._items = [None] * size
+
+    def __getitem__(self, index):
+        return self._items[index]
+
+    def __setitem__(self, index, value):
+        self._items[index] = value
+
+    def __len__(self):
+        return self._size
+
+    def clear(self, value=None):
+        for i in range(len(self._items)):  # iterate over indexes, not the list itself
+            self._items[i] = value
+
+    def __iter__(self):
+        for item in self._items:
+            yield item
+
+
+class Slot(object):
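+    """A slot in the hash table's array.
+    Note that a slot has three states; see whether you can work out why. Compared with resolving collisions by chaining, removing a key under quadratic probing is slightly more involved.
+    1. Never used: HashTable.UNUSED. The slot has never held a key or taken part in a collision, so a lookup can stop probing as soon as it reaches UNUSED.
+    2. Used and then removed: HashTable.EMPTY. Slots later in the probe sequence may still hold keys.
+    3. In use: a Slot node.
+    """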
+
+    def __init__(self, key, value):
+        self.key, self.value = key, value
+
+
+class HashTable(object):
+
+    UNUSED = None    # a slot that was never used; a class-level singleton, compared by identity below
+    EMPTY = Slot(None, None)     # a slot that was used and then removed
+
+    def __init__(self):
+        self._table = Array(7)
+        self.length = 0
+
+    @property
+    def _load_factor(self):
+        # add() rehashes once the load factor reaches 0.8
+        return self.length / float(len(self._table))
+
+    def __len__(self):
+        return self.length
+
+    def _hash1(self, key):
+        """Compute the slot index for key from its hash value."""
+        return abs(hash(key)) % len(self._table)
+
+    def _find_slot(self, key, for_insert=False):
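+        """Find the slot index for key.
+
+        :param key: the key to look up or insert
+        :param for_insert: True to find a slot to insert into, False to only look up
+        :return: slot index, or None if a lookup does not find key
+        """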
+        index = self._hash1(key)
+        base_index = index
+        hash_times = 1
+        _len = len(self._table)
+
+        if not for_insert:  # look up whether key exists
+            while self._table[index] is not HashTable.UNUSED:
+                slot = self._table[index]
+                # skip EMPTY (removed) slots but keep probing past them
+                if slot is not HashTable.EMPTY and slot.key == key:
+                    return index
+                # quadratic step; hash_times must advance on every probe
+                index = (index + hash_times * hash_times) % _len
+                hash_times += 1
+            return None
+        else:
+            while not self._slot_can_insert(index):  # loop until an insertable slot is found
+                index = (index + hash_times * hash_times) % _len
+                hash_times += 1
+            return index
+
+    def _slot_can_insert(self, index):
+        return (self._table[index] is HashTable.EMPTY or self._table[index] is HashTable.UNUSED)
+
+    def __contains__(self, key):   # in operator
+        index = self._find_slot(key, for_insert=False)
+        return index is not None
+
+    def add(self, key, value):
+        if key in self:    # key already present: overwrite with the new value
+            index = self._find_slot(key, for_insert=False)
+            self._table[index].value = value
+            return False
+        else:
+            index = self._find_slot(key, for_insert=True)
+            self._table[index] = Slot(key, value)
+            self.length += 1
+            if self._load_factor >= 0.8:    # threshold reached: rehash
+                self._rehash()
+            return True
+
+    def _rehash(self):
+        old_table = self._table
+        newsize = len(self._table) * 2 + 1   # grow to 2*n + 1
+        self._table = Array(newsize)
+
+        self.length = 0
+
+        for slot in old_table:
+            if slot is not HashTable.UNUSED and slot is not HashTable.EMPTY:
+                index = self._find_slot(slot.key, for_insert=True)
+                self._table[index] = slot
+                self.length += 1
+
+    def get(self, key, default=None):
+        index = self._find_slot(key, for_insert=False)
+        if index is None:
+            return default
+        else:
+            return self._table[index].value
+
+    def remove(self, key):
+        assert key in self, 'keyerror'
+        index = self._find_slot(key, for_insert=False)
+        value = self._table[index].value
+        self.length -= 1
+        self._table[index] = HashTable.EMPTY
+        return value
+
+    def __iter__(self):
+        for slot in self._table:
+            if slot not in (HashTable.EMPTY, HashTable.UNUSED):
+                yield slot.key   # like python dict, iterate over keys by default; write an items() method if you need values
+
+
+#########################################
+# The code above is copied from the hash table chapter; we inherit directly from HashTable to implement the set
+#########################################
+
+class SetADT(HashTable):
+
+    def add(self, key):
+        # A set is really just a dict whose values are all the placeholder True
+        return super(SetADT, self).add(key, True)
+
+    def __and__(self, other_set):
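+        """Intersection A & B: elements present in both sets."""
+        new_set = SetADT()
+        # Intersection is symmetric, so one pass over A suffices: every
+        # common element is found while iterating A, and iterating B as
+        # well would only re-add elements that are already in new_set.
+        for element_a in self:
+            if element_a in other_set:
+                new_set.add(element_a)
+        return new_set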
+
+    def __sub__(self, other_set):
+        """Difference A - B: elements in A that are not in B."""
+        new_set = SetADT()
+        # keep only the elements of A that do not appear in B
+        for element_a in self:
+            if element_a not in other_set:
+                new_set.add(element_a)
+        return new_set
+
+    def __or__(self, other_set):
+        """Union A | B: elements in A or in B."""
+        new_set = SetADT()
+        for element_a in self:
+            new_set.add(element_a)
+        for element_b in other_set:
+            new_set.add(element_b)
+        return new_set
+
+
+def test_set_adt():
+    sa = SetADT()
+    sa.add(1)
+    sa.add(2)
+    sa.add(3)
+    assert 1 in sa    # exercises __contains__; with add and __contains__ in place, the most basic set features work
+
+    sb = SetADT()
+    sb.add(3)
+    sb.add(4)
+    sb.add(5)
+
+    assert sorted(list(sa & sb)) == [3]
+    assert sorted(list(sa - sb)) == [1, 2]
+    assert sorted(list(sa | sb)) == [1, 2, 3, 4, 5]
+
+
+if __name__ == '__main__':
+    test_set_adt()
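
SetADT above overloads `__and__`, `__sub__` and `__or__` but leaves the symmetric difference as an exercise. One possible shape for it is sketched below. To keep the snippet self-contained it uses `TinySet`, a hypothetical stand-in backed by a built-in dict rather than by HashTable; the `__xor__` body relies only on iteration, `in`, and `add`, so under that assumption the same logic should carry over to SetADT.

```python
class TinySet(object):
    """Hypothetical stand-in for SetADT, backed by a built-in dict."""

    def __init__(self, iterable=()):
        self._items = {}
        for key in iterable:
            self.add(key)

    def add(self, key):
        # same idea as SetADT.add: the value is only a placeholder
        self._items[key] = True

    def __contains__(self, key):
        return key in self._items

    def __iter__(self):
        return iter(self._items)

    def __xor__(self, other_set):
        """Symmetric difference A ^ B: elements in exactly one of the sets."""
        new_set = TinySet()
        for element_a in self:          # elements only in A
            if element_a not in other_set:
                new_set.add(element_a)
        for element_b in other_set:     # elements only in B
            if element_b not in self:
                new_set.add(element_b)
        return new_set


a = TinySet([1, 2, 3])
b = TinySet([3, 4, 5])
sym_diff = sorted(a ^ b)
builtin_sym_diff = sorted(set([1, 2, 3]) ^ set([3, 4, 5]))
```

This computes (A - B) | (B - A) in two passes, which gives the same elements as the (A|B) - (A&B) formulation in the notes without building the intermediate sets.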