I recently released a Parser Combinator library for Common Lisp, but was unhappy with its performance. This article describes how I used sb-sprof, built into SBCL, to identify both CPU and memory allocation hotspots, improving the runtime speed of the parcom/json module by 3x and decreasing memory allocation by 25x.
SBCL has a number of built-in extension libraries. sb-sprof is a "statistical profiler" that periodically samples the call stack to see where time is being spent and who is doing the most allocating. In my parcom/benchmarks system, I have:
(require :sb-sprof)

;; OVERALL
(let ((s (uiop:read-file-string "tests/data/large-file.json")))
  (time (pj:parse s)))

;; SPEED
(let ((s (uiop:read-file-string "tests/data/large-file.json")))
  (sb-sprof:with-profiling (:max-samples 100000
                            :sample-interval 0.00001
                            :report :graph)
    (aref (pj:parse s) 0)))

;; MEMORY
(let ((s (uiop:read-file-string "tests/data/large-file.json")))
  (sb-sprof:with-profiling (:max-samples 100000
                            :sample-interval 0.00001
                            :report :graph
                            :mode :alloc)
    (aref (pj:parse s) 0)))
Like time, just by wrapping a call with with-profiling we can interactively profile any function we wish.
time itself is available in any CL implementation and is a quick way to gauge the overall performance of something. Running the OVERALL form above, we see:
Evaluation took:
  0.881 seconds of real time
  0.877017 seconds of total run time (0.867189 user, 0.009828 system)
  [ Real times consist of 0.079 seconds GC time, and 0.802 seconds non-GC time. ]
  [ Run times consist of 0.078 seconds GC time, and 0.800 seconds non-GC time. ]
  99.55% CPU
  2,474,196,347 processor cycles
  261,309,152 bytes consed
Before I started optimizing, time was reporting 6.5 billion bytes consed, which is where the "25x reduction" stat comes from.
Evaluating the SPEED form above, we see:
           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    159  19.4    270  33.0    159  19.4        -  (LAMBDA (PC::OFFSET) :IN PC:CONSUME)
   2     73   8.9     73   8.9    232  28.3        -  SB-KERNEL:TWO-ARG->=
   3     63   7.7     67   8.2    295  36.0        -  PJ::ESCAPED
   4     55   6.7     55   6.7    350  42.7        -  (LAMBDA (PJ::C) :IN PJ:STRING)
   5     39   4.8     45   5.5    389  47.5        -  (LAMBDA (PC::OFFSET) :IN PC:CHAR)
   6     38   4.6     76   9.3    427  52.1        -  PC:CHAR
   7     37   4.5     37   4.5    464  56.7        -  SB-IMPL::GETHASH/EQL-HASH
   8     33   4.0     48   5.9    497  60.7        -  PC:CONSUME
   9     27   3.3     27   3.3    524  64.0        -  SB-IMPL::GETHASH/EQ-HASH/FLAT
  10     18   2.2    405  49.5    542  66.2        -  PJ:STRING
 ...
And so on. Perhaps the most relevant column here is Self %, which tells us how much of the total runtime was spent inside the named function. In this case we see it spending the most time within the innards of the consume parser. "Data ends arguments", as they say, so it is helpful to know where the time is actually being spent as opposed to guessing. Here, it wouldn't make sense for me to optimize string (only 2% of the total) when consume is so clearly a hotspot.
Normally the :report mode is :flat, but by setting it to :graph we are shown where each function is called from, and what it calls:
                               Callers
                 Total.     Function
 Count     %  Count     %      Callees
------------------------------------------------------------------------
      3   0.4                  (LAMBDA (PC::OFFSET) :IN PC:CONSUME) [1]
     35   4.3                  PJ::SKIP-SPACE [17]
      1   0.1                  (LAMBDA (#:*>-OFFSET2) :IN PC:BETWEEN) [11]
    159  19.4    270  33.0  (LAMBDA (PC::OFFSET) :IN PC:CONSUME) [1]
      1   0.1                  PC:CONSUME [8]
      3   0.4                  (LAMBDA (PC::OFFSET) :IN PC:CONSUME) [1]
      4   0.5                  PC::OFF [23]
      6   0.7                  PC:SPACE? [22]
     50   6.1                  (LAMBDA (PJ::C) :IN PJ:STRING) [4]
     47   5.7                  SB-KERNEL:TWO-ARG->= [2]
------------------------------------------------------------------------
 ...
And so on. Here it looks like it's often being used to skip over whitespace.
By adding :mode :alloc, the output of with-profiling becomes memory-focused.
           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1   6282  74.3   6282  74.3   6282  74.3        -  ALLOC-TRAMP
   2   1177  13.9   1177  13.9   7459  88.3        -  SYS-ALLOC-TRAMP
   3    989  11.7    989  11.7   8448 100.0        -  LIST-ALLOC-TRAMP
   4      2   0.0      2   0.0   8450 100.0        -  SYS-LIST-ALLOC-TRAMP
   5      0   0.0   8450 100.0   8450 100.0        -  PJ:ARRAY
   6      0   0.0   8450 100.0   8450 100.0        -  PJ:COLLECTION
   7      0   0.0   8450 100.0   8450 100.0        -  PJ:JSON
   8      0   0.0   8450 100.0   8450 100.0        -  PC:PARSE
The top results are all internal functions, so we need to view the "graph" to see who is actually allocating:
                               Callers
                 Total.     Function
 Count     %  Count     %      Callees
------------------------------------------------------------------------
    108   1.3                  PC:TAKE-WHILE [63]
    136   1.6                  PC:TAKE-WHILE1 [61]
    286   3.4                  PC:OPT [58]
    230   2.7                  PC:RECOGNIZE [56]
     52   0.6                  PC:CONST [54]
    418   4.9                  SB-IMPL::%ALLOC-HASH-TABLE [49]
    436   5.2                  SB-KERNEL:%MAKE-ARRAY [48]
     30   0.4                  PC:FLOAT [47]
     33   0.4                  PJ:SCIENTIFIC [39]
    958  11.3                  SB-IMPL::%MAKE-HASH-TABLE [38]
     66   0.8                  PC:STRING [37]
    122   1.4                  PC:<$ [36]
   3397  40.2                  PJ::ESCAPED [34]
      1   0.0                  PJ:ARRAY [25]
   6282  74.3   6282  74.3  ALLOC-TRAMP [1]
------------------------------------------------------------------------
 ...
The top results all make sense:

- ESCAPED for allocating freshly escaped strings.
- %MAKE-HASH-TABLE and %ALLOC-HASH-TABLE for all the JSON objects.

As above, it would make sense to start with escaped if I were to optimize further.
If your code takes 10 seconds, reducing that to 5 is a 2x speedup, or a 50% reduction in runtime. If you reduce it to 1, that's a 10x speedup, or a 90% reduction. What, then, is the best possible runtime? A 100% reduction of course; 0 seconds. In other words:
The fastest way to solve a problem is to get rid of it.
or
The best possible optimization is the deletion of the code.
This optimization technique comes first as it is the most obvious and not unique to Common Lisp. If you can't find a way to make your existing code faster, step back and ask yourself if there is a way to remove the need for that code in the first place. In the case of parcom, code like this to parse a JSON string:
(p:between (p:char #\") (p:many #'compound-char) (p:char #\"))
is elegant on the surface but suffers from over-allocation of lists within many, which is wasteful considering that the list of chars must later be converted to a concrete string anyway. Here we defer to the masters and recall that take-while should be favoured over many whenever possible, as it yields us a slice (called "displaced arrays" in CL) of the original string instead:
(p:between (p:char #\") (p:take-while (lambda (c) ...)) (p:char #\"))
Unfortunately, as before, this slice itself isn't sufficient, because the source string may contain escaped characters that we need to interpret. Either way we're allocating a new string, which means we don't actually need the slice, only the parsing offset range that the slice would have inhabited.
(p:between (p:char #\") (p:consume (lambda (c) ...)) (p:char #\"))
Now our char escaping function, which processes the output of this parse, can operate over the raw input string, not on a slice. Read on for why that matters.
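To make the flow concrete, here is a simplified sketch. span-of is a hypothetical stand-in for parcom's consume, not its actual definition; escaped is the escape function discussed further below.

(defun span-of (pred s offset)
  "Hypothetical stand-in for a CONSUME-style parser: advance OFFSET
while PRED holds on the chars of S, returning the final offset. No
intermediate list or slice is allocated along the way."
  (loop while (and (< offset (length s))
                   (funcall pred (char s offset)))
        do (incf offset))
  offset)

;; The string parser can then remember both offsets and hand them, along
;; with the original input, to the escape step:
;;   (let* ((from offset)  ; just past the opening quote
;;          (to (span-of (lambda (c) (char/= c #\")) s from)))
;;     (escaped s from to))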
In Common Lisp, Vectors are just Arrays, and Strings are just Vectors. If you inspect the type of a string literal like "hello", you'll see:
(SIMPLE-ARRAY CHARACTER (5))
Which is a subtype of simple-string. If we know our source input is a simple-string, we gain access to schar, a faster version of char (or aref), for accessing elements. The importance of this can't be overstated. With char, and especially on a string slice / "displaced array", we're likely to see our sb-sprof results filled with calls to %DATA-VECTOR-INDEX and mentions of "hairy data". This is the compiler being unsure of the contents of the underlying array, and doing lots of wasteful bounds checking and offset calculations. With schar, we get what we'd expect: no-nonsense, O(1) access to the char.
In parcom, I use this wherever possible:
(deftype char-string () '(simple-array character (*)))
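As a minimal sketch of what this unlocks (nth-char is illustrative, not parcom's actual code):

(defun nth-char (s i)
  ;; Illustrative only: with the CHAR-STRING and FIXNUM declarations,
  ;; SBCL can compile SCHAR down to a direct indexed read, with none of
  ;; the "hairy data" dispatch seen on displaced arrays.
  (declare (type char-string s)
           (type fixnum i)
           (optimize (speed 3) (safety 0)))
  (schar s i))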
Usage of a char-string type declaration was mentioned above. In strictly typed languages, we imagine these types as contracts that protect us from stupidity like adding ints to strings. Some CL compilers can treat them this way, but their main advantage is in helping the compiler optimize. Consider this signature:
(declaim (ftype (function (p::char-string fixnum fixnum) p::char-string) escaped))
Telling it exactly what our string type is, as mentioned above, will allow the compiler to optimize schar calls. Likewise, being clear about numeric inputs and outputs is critical to avoid internal calls to GENERIC-+ and similar functions. If your code needs to do generic dispatch for basic numerics, it will be slow.
Declaring fixnum should be sufficient for integer operations, but there are other subdivisions if you need them.
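For example, a small sketch (advance is a hypothetical helper, not from parcom):

;; Hypothetical helper: with FIXNUM declared on the inputs and output,
;; SBCL emits raw machine arithmetic here instead of calling GENERIC-+.
(declaim (ftype (function (fixnum fixnum) fixnum) advance))
(defun advance (offset n)
  (declare (optimize (speed 3) (safety 0)))
  (+ offset n))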
I had perhaps been spoiled by my history with Haskell and Rust, where it's common to return multiple results as tuples and we expect this to be inconsequential with regard to Garbage Collection. This understanding does not (totally) transfer to Common Lisp, where a (cons a b) call does heap allocate, which gets quite expensive when you're using it to communicate all parsing successes and failures. Especially for long alt and *> chains, this can generate a lot of garbage.
I avoided this cost by wholly embracing values and multiple-value-bind for the first time. Within parcom, all inspection of parse results now looks like:
(multiple-value-bind (res next) (funcall parser offset)
  (if (ok? res)
      ...))
This grants us both the ability to return multiple things at once and a way to desugar the "tuple" back out into its separate parts. Best of all, we get this essentially for free, performance-wise. To quote Bike, a Clasp developer:
values and multiple-value-bind are not usually going to cons. The implementation can stuff values into registers, the stack, or a thread-local preallocated vector without needing any buy-in from the caller.
Indeed, switching from cons cells to values reduced my memory usage by 30%.
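As a self-contained sketch of the pattern (any-char, *input*, and the :fail convention are illustrative, not parcom's definitions):

;; Illustrative sketch: a parser returns its result and the next offset
;; as multiple values instead of consing a pair.
(defparameter *input* "hello")

(defun any-char (offset)
  (if (>= offset (length *input*))
      (values :fail offset)
      (values (char *input* offset) (1+ offset))))

;; The caller destructures the "tuple" with no heap allocation.
(multiple-value-bind (res next) (any-char 0)
  (unless (eq res :fail)
    (format t "parsed ~c, next offset ~d~%" res next)))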
In Common Lisp, you can declare to the compiler when you want a local variable to be allocated on the stack and not the heap. Within parcom/json, I have:
(declaim (ftype (function (p::char-string fixnum fixnum) p::char-string) escaped))
(defun escaped (s from to)
  "Escape a string."
  (declare (optimize (speed 3) (safety 0)))
  (let* ((len (- to from))
         (work (make-array len :element-type 'character)))
    (declare (dynamic-extent work))
    ...))
The key is (declare (dynamic-extent work)). This causes the compiler to allocate the work array on the stack, automatically freeing the memory when this function returns. Removing this line and rerunning the benchmarks shows a 30% increase in memory usage (note: the function is a hot path for JSON parsing), so having it is certainly effective!
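A standalone sketch of the same idea (sum-squares is illustrative, not from parcom):

;; Illustrative only: SCRATCH is stack-allocated via DYNAMIC-EXTENT and
;; reclaimed for free on return, so this function conses nothing.
(defun sum-squares (n)
  (declare (type fixnum n)
           (optimize (speed 3) (safety 0)))
  (let ((scratch (make-array n :element-type 'fixnum)))
    (declare (dynamic-extent scratch))
    (loop for i of-type fixnum below n
          do (setf (aref scratch i) (* i i)))
    (loop for x of-type fixnum across scratch
          sum x of-type fixnum)))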
I found the necessity for this to be unfortunate, but I enjoy the results. It turns out that Common Lisp (at least SBCL) must freshly allocate closures every time they are created. Many parcom parsers are "higher order", in that they take an initial argument that "primes" the parser and yields it as a lambda. For example, here is the original definition of the char parser:
(defun char (c)
  "Parse a given character."
  (lambda (offset)
    (declare (optimize (speed 3) (safety 0)))
    (if (>= offset +input-length+)
        (fail offset)
        (let ((head (schar +input+ offset)))
          (if (equal c head)
              (ok (off 1 offset) head)
              (fail offset))))))
Here, c is captured ("closed over") by the lambda, which means that CL has to do some bookkeeping every time char is called somewhere else. This gets especially bad for compound structures like:
(p:between (*> (p:char #\{) #'skip-space)
           (p:sep (*> (p:char #\,) #'skip-space)
                  (<*> #'string
                       (*> #'skip-space
                           (p:char #\:)
                           #'skip-space
                           (<* #'json #'skip-space))))
           (*> #'skip-space (p:char #\})))
This parses a JSON object. All the *> calls are actually macros that expand out into a lambda, which means that every call to between and char is actually doing a lot of allocation you never intended. We can avoid this with custom "lambda caches":
(defparameter +char-cache+ (make-hash-table :size 64 :test #'eql))

(defun char (c)
  "Parse a given character."
  (or (gethash c +char-cache+)
      (let ((f (lambda (offset)
                 (declare (optimize (speed 3) (safety 0)))
                 (if (>= offset +input-length+)
                     (fail offset)
                     (let ((head (schar +input+ offset)))
                       (if (equal c head)
                           (ok (off 1 offset) head)
                           (fail offset)))))))
        (setf (gethash c +char-cache+) f)
        f)))
Now, every call to char will start with a Hash Table lookup to see if we've already allocated a lambda for the given character. This does incur a lookup cost, but it was deemed worth it, as the memory savings are significant. If we inspect our cache after some time, we see that it does indeed contain a unique lambda per associated character:
Contents:
#\Z       = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {1001959B8B}>
#\        = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {1001885EAB}>
#\T       = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {10017C2D9B}>
#\_       = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {100E2CFE8B}>
#\Newline = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {100E2CFE6B}>
...
I still feel like it should be the compiler's responsibility to handle this, but the workaround is reasonable.
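A quick REPL check (assuming the cached definition above) confirms the reuse via function identity:

;; Repeated calls now return the very same function object; before the
;; cache was added, this comparison would have yielded NIL.
(eq (char #\a) (char #\a))  ;; => T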
You can use time and sb-sprof:with-profiling to, in an entirely interactive way, determine where your code is spending its time and where memory is being allocated. With these tools in hand I stopped needing flamegraphs, as the turnaround in producing them was slower than just reading the text charts. Thanks to these tools I was able to vastly improve the performance of my library, which demonstrates that Parser Combinators can be both an elegant and performant way to parse text in Common Lisp.