I was recently surprised to discover a big-O time complexity difference in measuring the size of Map
vs HashMap
through their respective size
functions. This sent me on a journey to double-check the other container types we commonly use in Haskell, to be sure that I haven't been punishing myself unawares.
The crux is this: some containers expose a custom length
(or size
) function, while some present Foldable.length
as the canonical measurement tool. In both camps, some override Foldable.length
with a better implementation, and some rely on the default. That default is implemented as follows:
-- Hurray, foldl'!
length :: t a -> Int
length = foldl' (\c _ -> c+1) 0
-- Wait...
foldl' :: (b -> a -> b) -> b -> t a -> b
foldl' f z0 xs = foldr f' id xs z0
where f' x k z = k $! f z x
For some structures, foldr
is naughty. For others, it isn't. We seek then definitive answers to these questions:
Foldable.length
?Legend:
length = size
: Is Foldable.length
defined in terms of a custom size
function?Container | Custom length | Foldable.length | length = size ? | Notes |
---|---|---|---|---|
[] | O(n) | O(n) | Yes | Rewrite rules for performance |
NonEmpty | O(n) | O(n) | No | |
Map | O(1) | O(1) | Yes | Embedded Size field |
IntMap | O(n) | O(n) | Yes | |
Set | O(1) | O(1) | Yes | Embedded Size field |
IntSet | O(n) | N/A | N/A | No Foldable instance |
Seq | O(1) | O(1) | Yes | Embedded Int field |
Tree | N/A | O(n) | N/A | Default Foldable instance |
Array | O(1) | O(1) | Yes | Embedded Int field |
HashMap | O(n) | Worse O(n) | No | Default Foldable instance |
HashSet | O(n) | Worse O(n) | No | Default Foldable instance |
Vector | Amortized O(1) | Amortized O(1) | Yes |
[]
Perhaps surprisingly, list length is calculated in the way we were all told to implement it when first learning Haskell:
length :: [a] -> Int
length xs = lenAcc xs 0
lenAcc :: [a] -> Int -> Int
lenAcc [] n = n
lenAcc (_:ys) n = lenAcc ys (n+1)
Although why is there no !
on the n
? Futhermore, there are some rewrite rules that claim to be for performance:
{-# RULES
"length" [~1] forall xs . length xs = foldr lengthFB idLength xs 0
"lengthList" [1] foldr lengthFB idLength = lenAcc
#-}
-- The lambda form turns out to be necessary to make this inline
-- when we need it to and give good performance.
{-# INLINE [0] lengthFB #-}
lengthFB :: x -> (Int -> Int) -> Int -> Int
lengthFB _ r = \ !a -> r (a + 1)
NonEmpty
Not a surprise:
But note that its Foldable
instance does not override length
! Its default calls foldl'
, which calls foldr
, which is overridden as:
Map
Map
is defined like this (altered slightly for clarity):
Calls like insert
alter the Size
field at the same time that they write the new element. Fetching that Size
couldn't get more O(1)
:
Set
Set
is not defined in terms of Map
as one might have thought, but it follows the same idea:
Seq
Seq
is defined as:
newtype Seq a = Seq (FingerTree (Elem a))
data FingerTree a
= EmptyT
| Single a
| Deep Int (Digit a) (FingerTree (Node a)) (Digit a)
length :: Seq a -> Int
length (Seq xs) = size xs
class Sized a where
size :: a -> Int
instance Sized a => Sized (FingerTree a) where
size EmptyT = 0
size (Single x) = size x
size (Deep v _ _ _) = v
The Int
field of Deep
is the size, which is updated upon every insert / removal.
Tree
We don't get much:
data Tree a = Node a [Tree a]
instance Foldable Tree where
foldMap f (Node x ts) = f x `mappend` foldMap (foldMap f) ts
So this is as "default" as Foldable.length
can get. Recall length -> foldl'
-> foldr
, where:
Array
data Array i e = Array i i Int (Array# e)
numElements :: Array i e -> Int
numElements (Array _ _ n _) = n
{-# INLINE numElements #-}
Where the Int
is explained to be:
A cache of (rangeSize (l,u)) used to make sure an index is really in range.
HashMap
This type has an interesting story.
data HashMap k v
= Empty
| BitmapIndexed Bitmap (A.Array (HashMap k v))
| Leaf Hash (Leaf k v)
| Full (A.Array (HashMap k v))
| Collision Hash (A.Array (Leaf k v))
This A.Array
type is defined internally to unordered-containers
, but is based off of GHC.Exts.SmallArray#
:
With the appearance of GHC magic, I shall tread no deeper. SmallArray#
doesn't have an embedded size field like the usual Data.Array.Array
does, hence O(n)
for HashMap
:
size :: HashMap k v -> Int
size t = go t 0
where
go Empty !n = n
go (Leaf _ _) n = n + 1
go (BitmapIndexed _ ary) n = A.foldl' (flip go) n ary
go (Full ary) n = A.foldl' (flip go) n ary
go (Collision _ ary) n = n + A.length ary
--- In `Data.HashMap.Array` ---
length :: Array a -> Int
length ary = I# (sizeofArray# (unArray ary))
sizeofArray# :: SmallArray# a -> Int#
sizeofArray# = sizeofSmallArray#
--- In `GHC.Exts` --
sizeofSmallArray# :: SmallArray# a -> Int#
And what's this? A near-default Foldable
instance?
instance Foldable.Foldable (HashMap k) where
foldr f = foldrWithKey (const f)
foldrWithKey :: (k -> v -> a -> a) -> a -> HashMap k v -> a
foldrWithKey f = go
where
go z Empty = z
go z (Leaf _ (L k v)) = f k v z
go z (BitmapIndexed _ ary) = A.foldr (flip go) z ary
go z (Full ary) = A.foldr (flip go) z ary
go z (Collision _ ary) = A.foldr (\ (L k v) z' -> f k v z') z ary
Notice the strict foldl'
versus foldr
.
HashSet
The smoking gun. This must be why I thought Set
was defined in terms of Map
:
I wonder if that redundantly allocates all the ()
, or if a pointer is shared internally?
instance Foldable.Foldable HashSet where
foldr = Data.HashSet.Base.foldr
foldr :: (b -> a -> a) -> a -> HashSet b -> a
foldr f z0 = foldrWithKey g z0 . asMap
where g k _ z = f k z
So, entirely based on the performance of HashMap
.
Vector
Crawl down the rabbit hole with me:
length :: Vector a -> Int
length = G.length
length :: Vector v a => v a -> Int
length = Bundle.length . stream'
length :: Bundle v a -> Int
length = unId . M.length
length :: Monad m => Bundle m v a -> m Int
length Bundle{sSize = Exact n} = return n
length Bundle{sChunks = s} = S.foldl' (\n (Chunk k _) -> n+k) 0 s
data Bundle m v a = Bundle
{ sElems :: Stream m a
, sChunks :: Stream m (Chunk v a)
, sVector :: Maybe (v a)
, sSize :: Size }
data Size = Exact Int -- ^ Exact size
| Max Int -- ^ Upper bound on the size
| Unknown -- ^ Unknown size
So O(1)
if dealing with an untouched Vector
, but potentially O(c)
had you done some splitting and twisting which invoked Fusion.
Blog Archive