I recently released a Parser Combinator library for Common Lisp, but was unhappy with its performance. This article describes how I used sb-sprof, built into SBCL, to identify both CPU and memory allocation hotspots, improving the runtime speed of the parcom/json module by 3x and decreasing memory allocation by 25x.
SBCL has a number of built-in extension libraries. sb-sprof is a "statistical profiler" that periodically samples the call stack to see where time is being spent and who is doing the most allocating. In my parcom/benchmarks system, I have:
(require :sb-sprof)

;; OVERALL
(let ((s (uiop:read-file-string "tests/data/large-file.json")))
  (time (pj:parse s)))

;; SPEED
(let ((s (uiop:read-file-string "tests/data/large-file.json")))
  (sb-sprof:with-profiling (:max-samples 100000
                            :sample-interval 0.00001
                            :report :graph)
    (aref (pj:parse s) 0)))

;; MEMORY
(let ((s (uiop:read-file-string "tests/data/large-file.json")))
  (sb-sprof:with-profiling (:max-samples 100000
                            :sample-interval 0.00001
                            :report :graph
                            :mode :alloc)
    (aref (pj:parse s) 0)))
Like time, just by wrapping a call with with-profiling we can interactively profile any function we wish.
time itself is available in any CL implementation and is a quick way to gauge the overall performance of something. Running the OVERALL form above, we see:
Evaluation took:
  0.881 seconds of real time
  0.877017 seconds of total run time (0.867189 user, 0.009828 system)
  [ Real times consist of 0.079 seconds GC time, and 0.802 seconds non-GC time. ]
  [ Run times consist of 0.078 seconds GC time, and 0.800 seconds non-GC time. ]
  99.55% CPU
  2,474,196,347 processor cycles
  261,309,152 bytes consed
Before I started optimizing, time was reporting 6.5 billion bytes consed, which is where the "25x reduction" stat comes from.
Evaluating the SPEED form above, we see:
           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    159  19.4    270  33.0    159  19.4        -  (LAMBDA (PC::OFFSET) :IN PC:CONSUME)
   2     73   8.9     73   8.9    232  28.3        -  SB-KERNEL:TWO-ARG->=
   3     63   7.7     67   8.2    295  36.0        -  PJ::ESCAPED
   4     55   6.7     55   6.7    350  42.7        -  (LAMBDA (PJ::C) :IN PJ:STRING)
   5     39   4.8     45   5.5    389  47.5        -  (LAMBDA (PC::OFFSET) :IN PC:CHAR)
   6     38   4.6     76   9.3    427  52.1        -  PC:CHAR
   7     37   4.5     37   4.5    464  56.7        -  SB-IMPL::GETHASH/EQL-HASH
   8     33   4.0     48   5.9    497  60.7        -  PC:CONSUME
   9     27   3.3     27   3.3    524  64.0        -  SB-IMPL::GETHASH/EQ-HASH/FLAT
  10     18   2.2    405  49.5    542  66.2        -  PJ:STRING
 ...
And so on. Perhaps the most relevant column here is Self %, which tells us how much of the total runtime was spent inside the named function. In this case we see it spending the most time within the innards of the consume parser. "Data ends arguments", as they say, so it is helpful to know where the time is actually being spent as opposed to guessing. Here, it wouldn't make sense for me to optimize string (only 2% of the total) when consume is so clearly a hotspot.
Normally the :report mode is :flat, but by setting it to :graph we are shown where each function is called from, and what it calls:
                               Callers
                 Total.     Function
 Count     %  Count     %      Callees
------------------------------------------------------------------------
      3   0.4                  (LAMBDA (PC::OFFSET) :IN PC:CONSUME) [1]
     35   4.3                  PJ::SKIP-SPACE [17]
      1   0.1                  (LAMBDA (#:*>-OFFSET2) :IN PC:BETWEEN) [11]
    159  19.4    270  33.0  (LAMBDA (PC::OFFSET) :IN PC:CONSUME) [1]
      1   0.1                  PC:CONSUME [8]
      3   0.4                  (LAMBDA (PC::OFFSET) :IN PC:CONSUME) [1]
      4   0.5                  PC::OFF [23]
      6   0.7                  PC:SPACE? [22]
     50   6.1                  (LAMBDA (PJ::C) :IN PJ:STRING) [4]
     47   5.7                  SB-KERNEL:TWO-ARG->= [2]
------------------------------------------------------------------------
 ...
And so on. Here it looks like it's often being used to skip over whitespace.
By adding :mode :alloc, the output of with-profiling becomes memory-focused.
           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1   6282  74.3   6282  74.3   6282  74.3        -  ALLOC-TRAMP
   2   1177  13.9   1177  13.9   7459  88.3        -  SYS-ALLOC-TRAMP
   3    989  11.7    989  11.7   8448 100.0        -  LIST-ALLOC-TRAMP
   4      2   0.0      2   0.0   8450 100.0        -  SYS-LIST-ALLOC-TRAMP
   5      0   0.0   8450 100.0   8450 100.0        -  PJ:ARRAY
   6      0   0.0   8450 100.0   8450 100.0        -  PJ:COLLECTION
   7      0   0.0   8450 100.0   8450 100.0        -  PJ:JSON
   8      0   0.0   8450 100.0   8450 100.0        -  PC:PARSE
The top results are all internal functions, so we need to view the "graph" to see who is actually allocating:
                               Callers
                 Total.     Function
 Count     %  Count     %      Callees
------------------------------------------------------------------------
    108   1.3                  PC:TAKE-WHILE [63]
    136   1.6                  PC:TAKE-WHILE1 [61]
    286   3.4                  PC:OPT [58]
    230   2.7                  PC:RECOGNIZE [56]
     52   0.6                  PC:CONST [54]
    418   4.9                  SB-IMPL::%ALLOC-HASH-TABLE [49]
    436   5.2                  SB-KERNEL:%MAKE-ARRAY [48]
     30   0.4                  PC:FLOAT [47]
     33   0.4                  PJ:SCIENTIFIC [39]
    958  11.3                  SB-IMPL::%MAKE-HASH-TABLE [38]
     66   0.8                  PC:STRING [37]
    122   1.4                  PC:<$ [36]
   3397  40.2                  PJ::ESCAPED [34]
      1   0.0                  PJ:ARRAY [25]
   6282  74.3   6282  74.3  ALLOC-TRAMP [1]
------------------------------------------------------------------------
 ...
The top results all make sense:

- ESCAPED for allocating freshly escaped strings.
- %MAKE-HASH-TABLE and %ALLOC-HASH-TABLE for all the JSON objects.

As above, it would make sense to start with escaped if I were to optimize further.
If your code takes 10 seconds, reducing that to 5 is a 2x speedup, or a 50% reduction in runtime. If you reduce it to 1, that's a 10x speedup, or a 90% reduction. What, then, is the best possible runtime? A 100% reduction of course; 0 seconds. In other words:
The fastest way to solve a problem is to get rid of it.
or
The best possible optimization is the deletion of the code.
This optimization technique comes first as it is the most obvious and not unique to Common Lisp. If you can't find a way to make your existing code faster, step back and ask yourself if there is a way to remove the need for that code in the first place. In the case of parcom, code like this to parse a JSON string:
(p:between (p:char #\") (p:many #'compound-char) (p:char #\"))
is elegant on the surface but suffers from over-allocation of lists within many, which is wasteful considering that the list of chars must later be converted to a concrete string anyway. Here we defer to the masters and recall that take-while should be favoured over many whenever possible, as it yields us a slice (called "displaced arrays" in CL) of the original string instead:
(p:between (p:char #\") (p:take-while (lambda (c) ...)) (p:char #\"))
Unfortunately, as before, this slice itself isn't sufficient, because the source string may contain escaped characters that we need to interpret. Either way we're allocating a new string, which means we don't actually need the slice, only the parsing offset range that the slice would have inhabited.
(p:between (p:char #\") (p:consume (lambda (c) ...)) (p:char #\"))
Now our char escaping function, which processes the output of this parse, can operate over the raw input string, not on a slice. Read on for why that matters.
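To make the flow concrete, here is a simplified sketch. span-of is a hypothetical stand-in for parcom's consume, not its actual definition; escaped is the escape function discussed further below.

(defun span-of (pred s offset)
  "Hypothetical stand-in for a CONSUME-style parser: advance OFFSET
while PRED holds on the chars of S, returning the final offset. No
intermediate list or slice is allocated along the way."
  (loop while (and (< offset (length s))
                   (funcall pred (char s offset)))
        do (incf offset))
  offset)

;; The string parser can then remember both offsets and hand them, along
;; with the original input, to the escape step:
;;   (let* ((from offset)  ; just past the opening quote
;;          (to (span-of (lambda (c) (char/= c #\")) s from)))
;;     (escaped s from to))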
In Common Lisp, Vectors are just Arrays, and Strings are just Vectors. If you inspect the type of a string literal like "hello", you'll see:
(SIMPLE-ARRAY CHARACTER (5))
Which is a subtype of simple-string. If we know our source input is a simple-string, we gain access to schar, a faster version of char (or aref), for accessing elements. The importance of this can't be overstated. With char, and especially on a string slice / "displaced array", we're likely to see our sb-sprof results filled with calls to %DATA-VECTOR-INDEX and mentions of "hairy data". This is the compiler being unsure of the contents of the underlying array, and doing lots of wasteful bounds checking and offset calculations. With schar, we get what we'd expect: no-nonsense, O(1) access to the char.
In parcom, I use this wherever possible:
(deftype char-string () '(simple-array character (*)))
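As a minimal sketch of what this unlocks (nth-char is illustrative, not parcom's actual code):

(defun nth-char (s i)
  ;; Illustrative only: with the CHAR-STRING and FIXNUM declarations,
  ;; SBCL can compile SCHAR down to a direct indexed read, with none of
  ;; the "hairy data" dispatch seen on displaced arrays.
  (declare (type char-string s)
           (type fixnum i)
           (optimize (speed 3) (safety 0)))
  (schar s i))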
Usage of a char-string type declaration was mentioned above. In strictly typed languages, we imagine these types as contracts that protect us from stupidity like adding ints to strings. Some CL compilers can treat them this way, but their main advantage is in helping the compiler optimize. Consider this signature:
(declaim (ftype (function (p::char-string fixnum fixnum) p::char-string) escaped))
Telling it exactly what our string type is, as mentioned above, will allow the compiler to optimize schar calls. Likewise, being clear about numeric inputs and outputs is critical to avoid internal calls to GENERIC-+ and similar functions. If your code needs to do generic dispatch for basic numerics, it will be slow.
Declaring fixnum should be sufficient for integer operations, but there are other subdivisions if you need them.
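For example, a small sketch (advance is a hypothetical helper, not from parcom):

;; Hypothetical helper: with FIXNUM declared on the inputs and output,
;; SBCL emits raw machine arithmetic here instead of calling GENERIC-+.
(declaim (ftype (function (fixnum fixnum) fixnum) advance))
(defun advance (offset n)
  (declare (optimize (speed 3) (safety 0)))
  (+ offset n))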
I had perhaps been spoiled by my history with Haskell and Rust, where it's common to return multiple results as tuples and we expect this to be inconsequential with regard to Garbage Collection. This understanding does not (totally) transfer to Common Lisp, where a (cons a b) call does heap allocate, which gets quite expensive when you're using it to communicate all parsing successes and failures. Especially for long alt and *> chains, this can generate a lot of garbage.
I avoided this cost by wholly embracing values and multiple-value-bind for the first time. Within parcom, all inspection of parse results now looks like:
(multiple-value-bind (res next) (funcall parser offset)
  (if (ok? res)
      ...))
This grants us both the ability to return multiple things at once and a way to desugar the "tuple" back out into its separate parts. Best of all, we get this essentially for free, performance-wise. To quote Bike, a Clasp developer:
values and multiple-value-bind are not usually going to cons. The implementation can stuff values into registers, the stack, or a thread-local preallocated vector without needing any buy-in from the caller.
Indeed, switching from cons cells to values reduced my memory usage by 30%.
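As a self-contained sketch of the pattern (any-char, *input*, and the :fail convention are illustrative, not parcom's definitions):

;; Illustrative sketch: a parser returns its result and the next offset
;; as multiple values instead of consing a pair.
(defparameter *input* "hello")

(defun any-char (offset)
  (if (>= offset (length *input*))
      (values :fail offset)
      (values (char *input* offset) (1+ offset))))

;; The caller destructures the "tuple" with no heap allocation.
(multiple-value-bind (res next) (any-char 0)
  (unless (eq res :fail)
    (format t "parsed ~c, next offset ~d~%" res next)))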
In Common Lisp, you can declare to the compiler when you want a local variable to be allocated on the stack and not the heap. Within parcom/json, I have:
(declaim (ftype (function (p::char-string fixnum fixnum) p::char-string) escaped))
(defun escaped (s from to)
  "Escape a string."
  (declare (optimize (speed 3) (safety 0)))
  (let* ((len (- to from))
         (work (make-array len :element-type 'character)))
    (declare (dynamic-extent work))
    ...))
The key is (declare (dynamic-extent work)). This causes the compiler to allocate the work array on the stack, automatically freeing the memory when this function returns. Removing this line and rerunning the benchmarks shows a 30% increase in memory usage (note: the function is a hot path for JSON parsing), so having it is certainly effective!
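A standalone sketch of the same idea (sum-squares is illustrative, not from parcom):

;; Illustrative only: SCRATCH is stack-allocated via DYNAMIC-EXTENT and
;; reclaimed for free on return, so this function conses nothing.
(defun sum-squares (n)
  (declare (type fixnum n)
           (optimize (speed 3) (safety 0)))
  (let ((scratch (make-array n :element-type 'fixnum)))
    (declare (dynamic-extent scratch))
    (loop for i of-type fixnum below n
          do (setf (aref scratch i) (* i i)))
    (loop for x of-type fixnum across scratch
          sum x of-type fixnum)))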
I found the necessity for this to be unfortunate, but I enjoy the results. It turns out that Common Lisp (at least SBCL) must freshly allocate closures every time they are created. Many parcom parsers are "higher order", in that they take an initial argument that "primes" the parser and yields it as a lambda. For example, here is the original definition of the char parser:
(defun char (c)
  "Parse a given character."
  (lambda (offset)
    (declare (optimize (speed 3) (safety 0)))
    (if (>= offset +input-length+)
        (fail offset)
        (let ((head (schar +input+ offset)))
          (if (equal c head)
              (ok (off 1 offset) head)
              (fail offset))))))
Here, c is captured ("closed over") by the lambda, which means that CL has to do some bookkeeping every time char is called somewhere else. This gets especially bad for compound structures like:
(p:between (*> (p:char #\{) #'skip-space)
           (p:sep (*> (p:char #\,) #'skip-space)
                  (<*> #'string
                       (*> #'skip-space
                           (p:char #\:)
                           #'skip-space
                           (<* #'json #'skip-space))))
           (*> #'skip-space (p:char #\})))
This parses a JSON object. All the *> calls are actually macros that expand out into a lambda, which means that every call to between and char is actually doing a lot of allocation you never intended. We can avoid this with custom "lambda caches":
(defparameter +char-cache+ (make-hash-table :size 64 :test #'eql))

(defun char (c)
  "Parse a given character."
  (or (gethash c +char-cache+)
      (let ((f (lambda (offset)
                 (declare (optimize (speed 3) (safety 0)))
                 (if (>= offset +input-length+)
                     (fail offset)
                     (let ((head (schar +input+ offset)))
                       (if (equal c head)
                           (ok (off 1 offset) head)
                           (fail offset)))))))
        (setf (gethash c +char-cache+) f)
        f)))
Now, every call to char will start with a Hash Table lookup to see if we've already allocated a lambda for the given character. This does incur a lookup cost, but it was deemed worth it, as the memory savings are significant. If we inspect our cache after some time, we see that it does indeed contain a unique lambda per associated character:
Contents:
#\Z       = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {1001959B8B}>
#\        = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {1001885EAB}>
#\T       = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {10017C2D9B}>
#\_       = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {100E2CFE8B}>
#\Newline = #<FUNCTION (LAMBDA (OFFSET) :IN CHAR) {100E2CFE6B}>
...
I still feel like it should be the compiler's responsibility to handle this, but the workaround is reasonable.
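A quick REPL check (assuming the cached definition above) confirms the reuse via function identity:

;; Repeated calls now return the very same function object; before the
;; cache was added, this comparison would have yielded NIL.
(eq (char #\a) (char #\a))  ;; => T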
You can use time and sb-sprof:with-profiling to, in an entirely interactive way, determine where your code is spending its time and where memory is being allocated. With these tools in hand I stopped needing flamegraphs, as the turnaround in producing them was slower than just reading the text charts. Thanks to these tools I was able to vastly improve the performance of my library, which demonstrates that Parser Combinators can be both an elegant and performant way to parse text in Common Lisp.