I was recently surprised to discover a big-O time complexity difference in measuring the size of `Map`

vs `HashMap`

through their respective `size`

functions. This sent me on a journey to double-check the other container types we commonly use in Haskell, to be sure that I haven't been punishing myself unawares.

The crux is this: some containers expose a custom `length`

(or `size`

) function, while some present `Foldable.length`

as the canonical measurement tool. In both camps, some override `Foldable.length`

with a better implementation, and some rely on the default. That default is implemented as follows:

```
-- Hurray, foldl'!
length :: t a -> Int
length = foldl' (\c _ -> c+1) 0
-- Wait...
foldl' :: (b -> a -> b) -> b -> t a -> b
foldl' f z0 xs = foldr f' id xs z0
where f' x k z = k $! f z x
```

For some structures, `foldr`

is naughty. For others, it isn't. We seek then definitive answers to these questions:

- What's the best way to measure the size of each container in Haskell?
- In general, can we trust
`Foldable.length`

?

Legend:

`length = size`

: Is`Foldable.length`

defined in terms of a custom`size`

function?

Container | Custom `length` | `Foldable.length` | `length = size` ? | Notes |
---|---|---|---|---|

`[]` | `O(n)` | `O(n)` | Yes | Rewrite rules for performance |

`NonEmpty` | `O(n)` | `O(n)` | No | |

`Map` | `O(1)` | `O(1)` | Yes | Embedded `Size` field |

`IntMap` | `O(n)` | `O(n)` | Yes | |

`Set` | `O(1)` | `O(1)` | Yes | Embedded `Size` field |

`IntSet` | `O(n)` | N/A | N/A | No `Foldable` instance |

`Seq` | `O(1)` | `O(1)` | Yes | Embedded `Int` field |

`Tree` | N/A | `O(n)` | N/A | Default `Foldable` instance |

`Array` | `O(1)` | `O(1)` | Yes | Embedded `Int` field |

`HashMap` | `O(n)` | Worse `O(n)` | No | Default `Foldable` instance |

`HashSet` | `O(n)` | Worse `O(n)` | No | Default `Foldable` instance |

`Vector` | Amortized `O(1)` | Amortized `O(1)` | Yes |

`[]`

Perhaps surprisingly, list length is calculated in the way we were all told to implement it when first learning Haskell:

```
length :: [a] -> Int
length xs = lenAcc xs 0
lenAcc :: [a] -> Int -> Int
lenAcc [] n = n
lenAcc (_:ys) n = lenAcc ys (n+1)
```

Although why is there no `!`

on the `n`

? Futhermore, there are some rewrite rules that claim to be for performance:

```
{-# RULES
"length" [~1] forall xs . length xs = foldr lengthFB idLength xs 0
"lengthList" [1] foldr lengthFB idLength = lenAcc
#-}
-- The lambda form turns out to be necessary to make this inline
-- when we need it to and give good performance.
{-# INLINE [0] lengthFB #-}
lengthFB :: x -> (Int -> Int) -> Int -> Int
lengthFB _ r = \ !a -> r (a + 1)
```

`NonEmpty`

Not a surprise:

But note that its `Foldable`

instance does not override `length`

! Its default calls `foldl'`

, which calls `foldr`

, which is overridden as:

`Map`

`Map`

is defined like this (altered slightly for clarity):

Calls like `insert`

alter the `Size`

field at the same time that they write the new element. Fetching that `Size`

couldn't get more `O(1)`

:

`Set`

`Set`

is not defined in terms of `Map`

as one might have thought, but it follows the same idea:

`Seq`

`Seq`

is defined as:

```
newtype Seq a = Seq (FingerTree (Elem a))
data FingerTree a
= EmptyT
| Single a
| Deep Int (Digit a) (FingerTree (Node a)) (Digit a)
length :: Seq a -> Int
length (Seq xs) = size xs
class Sized a where
size :: a -> Int
instance Sized a => Sized (FingerTree a) where
size EmptyT = 0
size (Single x) = size x
size (Deep v _ _ _) = v
```

The `Int`

field of `Deep`

is the size, which is updated upon every insert / removal.

`Tree`

We don't get much:

```
data Tree a = Node a [Tree a]
instance Foldable Tree where
foldMap f (Node x ts) = f x `mappend` foldMap (foldMap f) ts
```

So this is as "default" as `Foldable.length`

can get. Recall ```
length -> foldl'
-> foldr
```

, where:

`Array`

```
data Array i e = Array i i Int (Array# e)
numElements :: Array i e -> Int
numElements (Array _ _ n _) = n
{-# INLINE numElements #-}
```

Where the `Int`

is explained to be:

A cache of (rangeSize (l,u)) used to make sure an index is really in range.

`HashMap`

This type has an interesting story.

```
data HashMap k v
= Empty
| BitmapIndexed Bitmap (A.Array (HashMap k v))
| Leaf Hash (Leaf k v)
| Full (A.Array (HashMap k v))
| Collision Hash (A.Array (Leaf k v))
```

This `A.Array`

type is defined internally to `unordered-containers`

, but is based off of `GHC.Exts.SmallArray#`

:

With the appearance of GHC magic, I shall tread no deeper. `SmallArray#`

doesn't have an embedded size field like the usual `Data.Array.Array`

does, hence `O(n)`

for `HashMap`

:

```
size :: HashMap k v -> Int
size t = go t 0
where
go Empty !n = n
go (Leaf _ _) n = n + 1
go (BitmapIndexed _ ary) n = A.foldl' (flip go) n ary
go (Full ary) n = A.foldl' (flip go) n ary
go (Collision _ ary) n = n + A.length ary
--- In `Data.HashMap.Array` ---
length :: Array a -> Int
length ary = I# (sizeofArray# (unArray ary))
sizeofArray# :: SmallArray# a -> Int#
sizeofArray# = sizeofSmallArray#
--- In `GHC.Exts` --
sizeofSmallArray# :: SmallArray# a -> Int#
```

And what's this? A near-default `Foldable`

instance?

```
instance Foldable.Foldable (HashMap k) where
foldr f = foldrWithKey (const f)
foldrWithKey :: (k -> v -> a -> a) -> a -> HashMap k v -> a
foldrWithKey f = go
where
go z Empty = z
go z (Leaf _ (L k v)) = f k v z
go z (BitmapIndexed _ ary) = A.foldr (flip go) z ary
go z (Full ary) = A.foldr (flip go) z ary
go z (Collision _ ary) = A.foldr (\ (L k v) z' -> f k v z') z ary
```

Notice the strict `foldl'`

versus `foldr`

.

`HashSet`

The smoking gun. This must be why I thought `Set`

was defined in terms of `Map`

:

I wonder if that redundantly allocates all the `()`

, or if a pointer is shared internally?

```
instance Foldable.Foldable HashSet where
foldr = Data.HashSet.Base.foldr
foldr :: (b -> a -> a) -> a -> HashSet b -> a
foldr f z0 = foldrWithKey g z0 . asMap
where g k _ z = f k z
```

So, entirely based on the performance of `HashMap`

.

`Vector`

Crawl down the rabbit hole with me:

```
length :: Vector a -> Int
length = G.length
length :: Vector v a => v a -> Int
length = Bundle.length . stream'
length :: Bundle v a -> Int
length = unId . M.length
length :: Monad m => Bundle m v a -> m Int
length Bundle{sSize = Exact n} = return n
length Bundle{sChunks = s} = S.foldl' (\n (Chunk k _) -> n+k) 0 s
data Bundle m v a = Bundle
{ sElems :: Stream m a
, sChunks :: Stream m (Chunk v a)
, sVector :: Maybe (v a)
, sSize :: Size }
data Size = Exact Int -- ^ Exact size
| Max Int -- ^ Upper bound on the size
| Unknown -- ^ Unknown size
```

So `O(1)`

if dealing with an untouched `Vector`

, but potentially `O(c)`

had you done some splitting and twisting which invoked Fusion.

Blog Archive