After hearing from a few folks on Twitter, and thinking through the implementation a little more, I have a couple of changes to the Set type from my last post. I think each of these changes brings the Set type a little closer to how it might have been implemented as a first-party datatype.
A better Generator
In my last post, I added SequenceType
conformance by returning the internal Dictionary’s keys.generator()
, like this:
extension Set : SequenceType {
typealias Generator = MapSequenceGenerator<DictionaryGenerator<T, Bool>, T>
func generate() -> Generator {
return contents.keys.generate()
}
}
It works fine, but get a load of that return type! MapSequenceGenerator<DictionaryGenerator<T, Bool>, T>
is pretty ugly, and exposes way too much of the internal workings of the Set type for my comfort. What if I want to change the underlying data structure? (Ahem.)
Instead, I can use the handy GeneratorOf
type to create a generator that is custom-tailored to the Set’s subtype and masks all the internal gobbledegook. The initializer for GeneratorOf
takes a closure that acts as the next()
method for the generator. Much cleaner:
extension Set : SequenceType {
typealias Generator = GeneratorOf<T>
public func generate() -> Generator {
var generator = contents.keys.generate()
return Swift.GeneratorOf {
return generator.next()
}
}
}
Bool vs. Void?
I had a couple requests about why I used Bool
instead of Void
in the Set type’s internal Dictionary. It’s a good question – just the presence of the key is enough to establish an element’s existence in the set, so why use space for even a single-bit data type when Void
should take up no space at all? What if the internal storage is inefficient, and the Bool
value is taking up even more than a single bit?
I decided to run a quick test to see if I could figure out how much space I could save converting to a Void, and it turns out: zero. No matter how large my Dictionary got, the [Key: Bool]
and [Key: Void]
versions of the process used the same amount of memory.1 My hunch is that a Void type is represented internally by a single zero bit, so it’s no savings over a Bool using that same bit for true/false.
Interestingly enough, the Void
testing brought up another challenge as well, since the conversion broke the the Set equality test. To check for equality between two sets, I’m deferring to the ==
operator on the underlying contents
Dictionary, but for that to work both the key and value types of the Dictionary need to be Equatable
. The Void type isn’t equatable (it isn’t anything, really), so that comparison stopped working.
In the end, I’m going to stick with the boolean true
as the sentinel value – it feels better (semantically) to have the internal Dictionary actually representing the Set in that way.
Better indexing for a better collection
After taking a second look, I realized my ExtensibleCollectionType
implementation was pretty far off the mark. Dictionary instances use the DictionaryIndex
type as their index, which is nearly opaque (it seems to hold some sort of enum
). Since the ExtensibleCollectionType
protocol requires that you implement a subscript for your index type, this allows for both “indexed” subscripting and keyed subscripting.
var dict = ["a": "Apple", // set up the Dictionary
"b": "Banana",
"c": "Cherry"]
let bananaIndex = dict.indexForKey("b")! // retrieve an index
println(dict[bananaIndex]) // subscript via DictionaryIndex<String, String>
> > (b, Banana)
println(dict["c"]) // subscript via String key
> > Optional("Cherry")
My original Set used a simple Int
as an index, which, in addition to not really making sense, prevented me from having an element-based subscript.
let set: Set = ["Apple", "Banana", "Cherry"] // set up the Set
let start: Int = set.startIndex
println(set[start])
> > Cherry
println(set["Apple"])
> error: type 'Index' does not conform to protocol 'StringLiteralConvertible'
A better solution is to create a new SetIndex
type that embeds the DictionaryIndex
we saw in action up above. The SetIndex type can simply pass the index between the caller and the internal Dictionary, once again obfuscating the working details. To be used as an index, SetIndex
has to at least implement the ForwardIndexType
protocol, but we’ll go all the way and implement BidirectionalIndexType
. Note that the initializer is private – this is typical for an index in Swift. You can retrieve one from the instance you want to traverse, but you can’t simply create an index directly.
struct SetIndex<T: Hashable> : BidirectionalIndexType {
private var index: DictionaryIndex<T, Bool>
private init(_ dictionaryIndex: DictionaryIndex<T, Bool>) {
self.index = dictionaryIndex
}
func predecessor() -> SetIndex<T> {
return SetIndex(self.index.predecessor())
}
func successor() -> SetIndex<T> {
return SetIndex(self.index.successor())
}
}
func ==<T: Hashable>(lhs: SetIndex<T>, rhs: SetIndex<T>) -> Bool {
return lhs.index == rhs.index
}
Next we need to update our ExtensibleCollectionType
extension to use the new SetIndex type instead of Int as the index. I’ve added two additional methods – indexForElement()
returns a SetIndex for the requested element, or nil
if the element isn’t in the set, and subscripting with set elements now acts as a shortcut for the contains()
method, returning true
if the element is in the set and false
otherwise.
extension Set : ExtensibleCollectionType {
typealias Index = SetIndex<Element>
var startIndex: Index { return Index(contents.startIndex) }
var endIndex: Index { return Index(contents.endIndex) }
func indexForElement(element: Element) -> Index? {
if let index = contents.indexforKey(element) {
return Index(index)
}
return nil
}
subscript(i: Index) -> Element {
return contents.keys[i.index]
}
subscript(element: Element) -> Bool {
return contains(element)
}
// some things that didn't change...
}
Back to our indexing code up above, we can now use either a SetIndex
or the Set’s subtype for subscripting:
let set: Set = ["Apple", "Banana", "Cherry"] // set up the Set
let start = set.startIndex
println(set[start])
> > Cherry
println((set["Apple"], set["Pear"]))
> > (true, false)
To see all the changes from this post, you can view the diff on GitHub. I’ve moved this into a new project on GitHub, SwiftSets, to make room for some more data types in the future. Please take a look!
-
We don’t know how a Dictionary’s storage is handled internally, but with a million
Int: Bool
pairs in the Dictionary theswift
process was consuming about 43.5MB. If Dictionary is implemented as some kind of binary tree, each node would use five 8-byte values (for the key itself, the hashed key, and pointers to the parent and two children) plus a single bit for the Bool. Multiplying everything together gives a total of roughly 38.3MB – could be?2 ↩ -
Nope! David Smith weighed in on twitter and it turns out my poorly informed back-of-the-envelope calculations are not entirely reliable. ↩