Set Type Follow-up

After hearing from a few folks on Twitter, and thinking through the implementation a little more, I have a couple of changes to the Set type from my last post. I think each of these changes brings the Set type a little closer to how it might have been implemented as a first-party datatype.

A better Generator

In my last post, I added SequenceType conformance by returning the internal Dictionary’s keys.generator(), like this:

extension Set : SequenceType {
    typealias Generator = MapSequenceGenerator<DictionaryGenerator<T, Bool>, T>
    func generate() -> Generator {
        return contents.keys.generate()
    }
}

It works fine, but get a load of that return type! MapSequenceGenerator<DictionaryGenerator<T, Bool>, T> is pretty ugly, and exposes way too much of the internal workings of the Set type for my comfort. What if I want to change the underlying data structure? (Ahem.)

Instead, I can use the handy GeneratorOf type to create a generator that is custom-tailored to the Set’s subtype and masks all the internal gobbledegook. The initializer for GeneratorOf takes a closure that acts as the next() method for the generator. Much cleaner:

extension Set : SequenceType {
    typealias Generator = GeneratorOf<T>
    public func generate() -> Generator {
        var generator = contents.keys.generate()
        return Swift.GeneratorOf {
            return generator.next()
        }
    }
}

Bool vs. Void?

I had a couple requests about why I used Bool instead of Void in the Set type’s internal Dictionary. It’s a good question – just the presence of the key is enough to establish an element’s existence in the set, so why use space for even a single-bit data type when Void should take up no space at all? What if the internal storage is inefficient, and the Bool value is taking up even more than a single bit?

I decided to run a quick test to see if I could figure out how much space I could save converting to a Void, and it turns out: zero. No matter how large my Dictionary got, the [Key: Bool] and [Key: Void] versions of the process used the same amount of memory.¹ My hunch is that a Void type is represented internally by a single zero bit, so it’s no savings over a Bool using that same bit for true/false.

Interestingly enough, the Void testing brought up another challenge as well, since the conversion broke the the Set equality test. To check for equality between two sets, I’m deferring to the == operator on the underlying contents Dictionary, but for that to work both the key and value types of the Dictionary need to be Equatable. The Void type isn’t equatable (it isn’t anything, really), so that comparison stopped working.

In the end, I’m going to stick with the boolean true as the sentinel value – it feels better (semantically) to have the internal Dictionary actually representing the Set in that way.

Better indexing for a better collection

After taking a second look, I realized my ExtensibleCollectionType implementation was pretty far off the mark. Dictionary instances use the DictionaryIndex type as their index, which is nearly opaque (it seems to hold some sort of enum). Since the ExtensibleCollectionType protocol requires that you implement a subscript for your index type, this allows for both “indexed” subscripting and keyed subscripting.

var dict = ["a": "Apple",                   // set up the Dictionary
            "b": "Banana",
            "c": "Cherry"]                  
let bananaIndex = dict.indexForKey("b")!    // retrieve an index
println(dict[bananaIndex])                  // subscript via DictionaryIndex<String, String>
> > (b, Banana)
println(dict["c"])                          // subscript via String key
> > Optional("Cherry")

My original Set used a simple Int as an index, which, in addition to not really making sense, prevented me from having an element-based subscript.

let set: Set = ["Apple", "Banana", "Cherry"]  // set up the Set
let start: Int = set.startIndex
println(set[start])
> > Cherry

println(set["Apple"])
> error: type 'Index' does not conform to protocol 'StringLiteralConvertible'

A better solution is to create a new SetIndex type that embeds the DictionaryIndex we saw in action up above. The SetIndex type can simply pass the index between the caller and the internal Dictionary, once again obfuscating the working details. To be used as an index, SetIndex has to at least implement the ForwardIndexType protocol, but we’ll go all the way and implement BidirectionalIndexType. Note that the initializer is private – this is typical for an index in Swift. You can retrieve one from the instance you want to traverse, but you can’t simply create an index directly.

struct SetIndex<T: Hashable> : BidirectionalIndexType {
    private var index: DictionaryIndex<T, Bool>
    private init(_ dictionaryIndex: DictionaryIndex<T, Bool>) {
        self.index = dictionaryIndex
    }
    func predecessor() -> SetIndex<T> {
        return SetIndex(self.index.predecessor())
    }
    func successor() -> SetIndex<T> {
        return SetIndex(self.index.successor())
    }
}

func ==<T: Hashable>(lhs: SetIndex<T>, rhs: SetIndex<T>) -> Bool {
    return lhs.index == rhs.index
}

Next we need to update our ExtensibleCollectionType extension to use the new SetIndex type instead of Int as the index. I’ve added two additional methods – indexForElement() returns a SetIndex for the requested element, or nil if the element isn’t in the set, and subscripting with set elements now acts as a shortcut for the contains() method, returning true if the element is in the set and false otherwise.

extension Set : ExtensibleCollectionType {
    typealias Index = SetIndex<Element>
    
    var startIndex: Index { return Index(contents.startIndex) }
    var endIndex: Index { return Index(contents.endIndex) }
    
    func indexForElement(element: Element) -> Index? {
        if let index = contents.indexforKey(element) {
            return Index(index)
        }
        return nil
    }
    
    subscript(i: Index) -> Element {
        return contents.keys[i.index]
    }

    subscript(element: Element) -> Bool {
        return contains(element)
    }

    // some things that didn't change...
}

Back to our indexing code up above, we can now use either a SetIndex or the Set’s subtype for subscripting:

let set: Set = ["Apple", "Banana", "Cherry"]  // set up the Set
let start = set.startIndex
println(set[start])
> > Cherry

println((set["Apple"], set["Pear"]))
> > (true, false)

To see all the changes from this post, you can view the diff on GitHub. I’ve moved this into a new project on GitHub, SwiftSets, to make room for some more data types in the future. Please take a look!

We don’t know how a Dictionary’s storage is handled internally, but with a million Int: Bool pairs in the Dictionary the swift process was consuming about 43.5MB. If Dictionary is implemented as some kind of binary tree, each node would use five 8-byte values (for the key itself, the hashed key, and pointers to the parent and two children) plus a single bit for the Bool. Multiplying everything together gives a total of roughly 38.3MB – could be?² ↩
Nope! David Smith weighed in on twitter and it turns out my poorly informed back-of-the-envelope calculations are not entirely reliable. ↩