How I fixed Atom

When good regexes go bad

Atom is the hot new up-and-comer in the world of text editing. It is my editor of choice for building software, and it’s open source, so I decided to check out its issues to see how I could contribute. I came across a strange bug: the Atom user speter had written a line of text that, when you pressed Enter at its end, caused Atom to calculate for half an hour before writing a new line. I was pretty stunned that such a simple and common operation could perform so atrociously, so I decided to jump in and figure out what was going on.


The Search

Here’s the text in question:

vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() == reflect.Slice

It’s a line of code in Go, a new programming language created by Google. I’m not sure what it’s supposed to do, but that’s irrelevant here: all I care about is how pressing Enter at the end of that line could cause Atom to hang.

Having very little to go on, I began the search by searching the whole codebase for the word “newline”. After a few leads that went nowhere, I found the function responsible for writing new lines when Enter is pressed, a function called insertNewline in the file text-editor.coffee. Here’s how it looked:

insertNewline: -> 
  @insertText('\n')

Atom is written in CoffeeScript. CoffeeScript is like Javascript without most of the keywords, parentheses and brackets. It uses whitespace to determine scope, like Python. It also supports classes. These two lines define a method of the TextEditor class called insertNewline. insertNewline just calls the class method insertText on a newline character. To make sure I was in the right place, I made a quick tweak to the method:

insertNewline: ->
  console.log "Hello!"
  @insertText('\n')
  console.log "Goodbye"

With this change, every time I typed Enter I saw the lines Hello! and Goodbye show up in Atom’s console. But when I typed Enter at the end of the slow Go line, I saw Hello! but no Goodbye, since the program was too tied up in insertText to reach the console.log "Goodbye" line. This told me that something in the insertText method was causing the slowness.

So I checked out the insertText method of TextEditor. TextEditor maintains an object called a Selection that keeps track of what text is currently selected, and the insertText method of TextEditor just calls the insertText method of this Selection. (Atom actually supports multi-select, so there can be any number of these Selection objects, but in my case there was only one).

Selection.insertText is a pretty long method, but using my Hello/Goodbye logging I traced the issue to these lines at the end:

if options.autoIndentNewline and text is '\n'
  @editor.autoIndentBufferRow(newBufferRange.end.row, preserveLeadingWhitespace: true, skipBlankLines: false)

When writing a newline, this code calls back into TextEditor to determine the indentation for the new line. Atom calculates the new line’s indentation based on scope. If the new line is at the same scope as the previous line, it has the same indentation. If it enters a new scope, it’s one tab more indented, and if it exits a scope, it’s one level less indented.

TextEditor has an object called LanguageMode that keeps track of which programming language the file being edited is written in. LanguageMode knows what entering and exiting scope looks like in its language, so the autoIndentBufferRow method of TextEditor calls the autoIndentBufferRow method of LanguageMode.

The autoIndentBufferRow method of LanguageMode calls LanguageMode‘s extravagantly-named suggestedIndentForTokenizedLineAtBufferRow method to calculate the indentation level. My Hello/Goodbyes were telling me that this was where the slowness was. Here’s a streamlined version of suggestedIndentForTokenizedLineAtBufferRow:

suggestedIndentForTokenizedLineAtBufferRow: (bufferRow, line, tokenizedLine, options) ->
  iterator = tokenizedLine.getTokenIterator()
  iterator.next()
  scopeDescriptor = new ScopeDescriptor(scopes: iterator.getScopes())

  increaseIndentRegex = @increaseIndentRegexForScopeDescriptor(scopeDescriptor)
  decreaseNextIndentRegex = @decreaseNextIndentRegexForScopeDescriptor(scopeDescriptor)

  precedingRow = bufferRow - 1
  desiredIndentLevel = @editor.indentationForBufferRow(precedingRow)

  unless @editor.isBufferRowCommented(precedingRow)
    precedingLine = @buffer.lineForRow(precedingRow)
    desiredIndentLevel += 1 if increaseIndentRegex?.testSync(precedingLine)
    desiredIndentLevel -= 1 if decreaseNextIndentRegex?.testSync(precedingLine)

  Math.max(desiredIndentLevel, 0)

First it builds a ScopeDescriptor, an object encapsulating scope information about the current line. Based on this ScopeDescriptor, it gets the regular expressions increaseIndentRegex and decreaseNextIndentRegex. increaseIndentRegex matches lines that look like entering scope, and decreaseNextIndentRegex matches lines that look like exiting scope. So when the new line is entering a new scope, increaseIndentRegex?.testSync(precedingLine) is true, so desiredIndentLevel is incremented. If the new line is exiting a scope, decreaseNextIndentRegex?.testSync(precedingLine) is true, so desiredIndentLevel is decremented. One final round of Hello/Goodbye told me that decreaseNextIndentRegex?.testSync(precedingLine) was the ultimate cause of my performance woes.

Regular expressions

decreaseNextIndentRegex is a regular expression. A regular expression defines a pattern. Its test method takes a string argument and tells whether the string matches the pattern. Regular expression syntax is a little wonky, but they’re immensely powerful.

Here’s a small example to get us warmed up:

^[b]+

That matches any string that starts with the letter b, like “bananas”. Breaking it down, the initial ^ means “start of the string”. Without the initial ^, the regular expression would match any string that contains a b, like “abacus”. Then [b] just looks for the letter b. Finally, the + at the end means “at least one”. So this regular expression matches “bananas”, “bbbananas”, and any number of initial bs, but not “abacus”, because that doesn’t start with b.

The goal of decreaseNextIndentRegex, is to match a line if it has unbalanced parentheses, with more ) than (. Here’s an example:

someObject.someFunction(arg1,
  arg2,
  arg3) // <---- this line has more ) than (, so it matches decreaseNextIndentRegex

It’s pretty common style to have one argument per line like this, especially if instead of arg1 etc. you had really long argument names. The arg2 and arg3) lines are indented to make it clear they are part of the argument list. arg3) is unbalanced in favor of closed parentheses, so it matches decreaseNextIndentRegex. When you press Enter at the end of that line, Atom will indent the new line back at the level of someObject.someFunction(arg1,.

Ok, are you ready for decreaseNextIndentRegex? You are not. No one can ever be ready for decreaseNextIndentRegex. But take a deep breath, because here it is, as I found it:

^\s*[^\s()}]+(?<m>[^()]*\((?:\g<m>|[^()]*)\)[^()]*)*[^()]*\)[,]?$

Oof. I must admit, I almost called it quits when I saw that beast. But then I figured, I’d come this far, might as well see this thing through. Let’s see if we can break that down into manageable chunks:

^\s*

As we’ve seen, ^ means the start of a string. \s is the whitespace pattern. It matches tab characters and spaces. * means “At least 0”. So this matches any and all whitespace at the start of a string, such as the 2 spaces at the beginning of our arg3 line.

    [^\s()}]+

Here we have a [^, which means “not any of these characters”. So [^\s()}] looks for any character that isn’t whitespace or a ( or a ) or a }. The + again means “at least one”, so this matches arg3 in our example.

             (?<m>[^()]*\((?:\g<m>|[^()]*)\)[^()]*)*

All right, buckle up. What if instead of the simple arg3), our final argument was itself the result of evaluating a function or functions? For instance:

someObject.someFunction(arg1,
  arg2,
  f1(f2(xyz), abc).def) // <--- still has more ) than (, so still matches decreaseNextIndentRegex

The goal of this chunk of decreaseNextIndentRegex is to match function calls like f1(f2(xyz), abc), no matter how deeply they nest. Let’s break it into sub-chunks:

             (?<m>

This defines a named capture group, <m>. <m> is the name of this whole chunk of decreaseNextIndentRegex. <m> wants to match function calls to any depth.

                  [^()]*

[^()]* matches any number of non-parenthesis characters. In our example, this chunk matches f1.

                        \(

This matches a ( character such as the one following f1. The ( indicates a function call, so now we have to match strings that look like function arguments.

                          (?:\g<m>|[^()]*)

The arguments to a function can be simple lists of characters, like the xyz argument to f2, or they can be function calls themselves, like the f2(xyz) argument to f1. For simple lists of characters like xyz, another [^()]* will work. But how can <m> match a function call? And that function call might have other function calls as its arguments, so the expression we use needs to match function calls to any depth. But we have an expression that matches function calls to any depth — it’s called <m>!

That’s how (?:\g<m>|[^()]*) matches the arguments of a function. | is the alternation operator, which means “either the thing on the left of me or the thing on the right of me”. On the left side of | is \g<m>, a recursive reference to <m> for the case where the arguments to a function are function calls. On the right side of | is [^()]* for the case where the arguments are simple strings. I hope that makes some sense. I had to meditate on it at length before I understood at all.

                                          \)

This matches a ) character, closing the arguments list.

                                            [^()]*

This last bit of <m> covers the case where a function has multiple arguments, one a function call and the other a string, like our f1(?:\g<m>|[^()]*) matches the function call argument f2(xyz), and this extra [^()]* matches the string , abc.

That’s all there is to <m>. It’s all downhill from here!

                                                    [^()]*

After <m>, we have yet another [^()]*. This is to match any characters that show up between the function call(s) that <m> matched and the end of the line. In our example, this [^()]* matches .def.

                                                          \)[,]?$

The last chunk of decreaseNextIndentRegex starts with one more \). This is the extra ) that makes the line parenthetically unbalanced: remember that detecting lines with more ) than ( is the goal of decreaseNextIndentRegex. Then, [,]? matches zero or one comma characters, and $ matches the end of the string. So after the unbalancing ), the only character that is allowed in the string for decreaseNextIndentRegex to match it is an optional comma.

Phew! We made it all the way through decreaseNextIndentRegex. Great work!

Catastrophe strikes

But of course the whole reason we’re here is because there’s something wrong with decreaseNextIndentRegex: it’s ridiculously slow at matching vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() == reflect.Slice. I hadn’t the faintest idea what might cause this, so I just Googled “regular expression performance”. I ended up at https://regex101.com/, a super-nifty site where you can input regular expressions and test them on any string, and the site gives a detailed breakdown of the matching process. When I input decreaseNextIndentRegex and that string, it said Catastrophic backtracking detected. That didn’t sound good.

regex101.com pointed me to this thorough explanation of the catastrophic backtracking phenomenon. I won’t cover the cause in depth here, but it’s a problem that afflicts regular expressions that are careless with their * operators. Basically, it happens if you use the * operator on a sequence of patterns that have * operators of their own, and the nested *ed expressions both match the same string. In this case, the cost of identifying a string that doesn’t match the overall regular expression but does match both *ed subexpressions becomes exponential in the length of the string.

For a simple example, the regular expression ([a]*[ab]*)*c will hit catastrophic backtracking if you give it a string like aaaaaaaaaaaaaaaaaa, since either the [a]* or the [ab]* can match each a.

Let’s see how catastrophic backtracking was affecting decreaseNextIndentRegex. To simplify the question, I removed chunks of decreaseNextIndentRegex until I found a minimal set that still triggered catastrophe. I was able to delete almost all of it and still see catastrophic backtracking. I finally whittled it down to:

^([^()]*\(\)[^()]*)*\)$

That’s just <m>, minus the (?:\g<m>|[^()]*) between \( and \)., plus the extra \) for unbalancing. With this minimal problematic expression, I could see the cause of the catastrophe. We use * on a pattern that contains two [^()]*s! Since [^()] and [^()] are exactly the same expression, they match the same strings, triggering catastrophic backtracking since they are *ed subexpressions of a *ed expression. In between the two is a (), so we only hit catastrophic backtracking with a string that has a lot of () in it, as our vVar.Type().Name()… string does.

The fix

To prevent catastrophic backtracking, one of the [^()]*s making up <m> had to go. But I couldn’t just take it out without changing the set of strings that decreaseNextIndentRegex matched, which wouldn’t be good. Remember why the second [^()]* is there: to cover the case where a function has multiple arguments, one a function call and the other a string. It is there to match this string following the function call argument. But in its current position, it is applied to the text following every function call. It only needs to check the text following a function call that is itself an argument to a function, since the [^()]* outside <m> handles the text following the top-level function call.

We detected these nested function calls with the recursive reference to <m>. So to make the second [^()]* only apply to nested function calls, I moved it to immediately after \g<m>. This moved [^()]* still tries to match the text it needs to match, but it doesn’t try to match text that is already covered by the other [^()]*. With the redundancy removed, catastrophic backtracking no longer occurs. The final decreaseNextIndentRegex looks like this:

^\s*[^\s()}]+(?<m>[^()]*\((?:\g<m>[^()]*|[^()]*)\))*[^()]*\)[,]?$

I put together a pull request with this change and a few extra tests and pushed it up. The project collaborator svanharmelen agreed with my assessment and fix, and my commit landed shortly thereafter. Just like that, I had fixed Atom!

This was surely the wildest adventure I’ve had so far in the open-source world. At the start, I had never seen CoffeeScript, and I only had a vague recollection of regular expressions left over from the two weeks we covered them in CS 164 at Berkeley all those years ago. By the end, I had read thousands of CoffeeScript lines, and I’d learned more about regular expressions than I even knew existed. Best of all, I got to fix a bizarre bug in my favorite text editor. What a trip!

41 thoughts on “How I fixed Atom”

  1. This is brilliant! I think debugging is one of the hardest skills to learn as a programmer, and unfortunately one of the least talked about. Thanks for writing this up and giving new programmers an insight into what debugging looks like in a real piece of code.

  2. Looks like a regex is simply not the right tool for this job. A very simple parser would be a lot better. Thats also the way other IDEs such as Eclipse implement it. So why regex?

    1. Great observation, Jan! It’s part of the architecture of Atom. There’s a central repository that contains most of the code, and there are scores of very small other repositories that contain code and information specific to each programming language. Part of the contract between the central repository and these language-specific repositories is that each language-specific repository supplies regular expressions like decreaseNextIndentRegex that the central repository uses to calculate indentations. So to replace the use of regexes with a parser, we’d need to change the central repository as well as the repositories for every language! A daunting task.

      1. Not sure I agree with that assessment. I admit I have no knowledge of Atom internals, but wouldn’t it be quite simple to add an option for a matching function instead of a matching regex which is called instead if provided by the language mode?

        desiredIndentLevel += 1 if increaseNextIndentCheck?(precedingLine)
        desiredIndentLevel -= 1 if decreaseNextIndentCheck?(precedingLine)

        Then the language modes could choose to provide those instead. This could both improve performance, make it easier for (non-regex-wizard) language mode developers and possibly avoid the issue you fixed reoccurring in the future.

        Thanks for the writeup, btw. It was an interesting read!

        1. Its open source… go for it!

          (Seriously, its a legit solution, and could probably be patched in as a way to start the process backwards compatibly) … do the work, pull request it, and see what happens…

    2. I think the whole approach has been inherited from Textmate, which is the text editor that Atom is largely inspired by. It allowed them to quickly support the syntax of a large number of programming languages when they launched by converting the equivalent Textmate bundles – see https://github.com/atom/language-ruby/blob/master/README.md for example. See https://manual.macromates.com/en/appendix#indentation_rules for Textmate’s explanation.

      1. Also, rsc (Russ Cox) is one of the core developers of Go itself *and* the author of Go’s reflect API. Talk about industrial-grade irony 🙂

  3. Whenever I’m debugging regular expressions, I miss /x from Perl, which lets you add insignificant whitespace and comments to regular expressions.

    This regular expression:

    ^\s*[^\s()}]+(?<m>[^()]*\((?:\g<m>|[^()]*)\)[^()]*)*[^()]*\)[,]?$

    Becomes the much more beautiful:

    /
    ^ # start of string
    \s* # whitespace
    [^\s()}]+ # at least one non-whitespace character or () or }

    (? <m> # match function calls
    [^()]* # function name
    \(
    (?: \g<m> | [^()]*) # arguments
    \)
    [^()]*
    )*
    [^()]* # match any characters that show up between the function call(s) that matched and the end of the line
    \) # unbalancing closing paren
    [,]?
    $
    /x;

    (Hopefully WordPress will format the code correctly!)

  4. Rather than take a regular expression, it would be good if it took any object supporting a .test(string) method that returned true or false, where true means indent, false means don’t indent. A regex is one such object, but for some languages it would be a lot easier to use a parser of some kind.

    You could possibly extend it so that if it returned an integer, that would be the number of steps to indent.

    1. Indeed. ^\s*[^\s()}]+(?[^()]*\((?:\g[^()]*|[^()]*)\))*[^()]*\)[,]?$ just doesn’t roll off the tongue quite so well.

      Regardless of my aesthetic opinions on the regular expressions involved, thank you for the excellent and informative post, David.

  5. Thank you for your contribution, and well-done for tracking down this tricky bug.
    It would be of inestimable value for you to embed the insight you gained into the code itself, for the benefit of the person who has to tweak or further debug this horrible regex.
    Regexes are composable from string snippets, so you can do:
    test_starts_with_whitespace = “^\\s+” (note extra escaping)
    test_next_char = “(a|b)”
    re = new RegExp(test_starts_with_whitespace + test_next_char)
    re.test(some_string) # -> true or false

  6. Fantastic article. Are you on Twitter? It would be another way to let people know when you have posted a new article.

  7. I threw up in my mouth a little bit when I first saw the regex.

    Your explanation was like Pepto Bismol. I’m fine now, thank you for asking.

    Your fix is aces!

  8. Let’s all just remember that, whenever you solve a problem with regex, you get two problems.

    I know that wasn’t the purpose of David’s post (not even the purpose of the fix), but removing the regex would be the best way to go. Any other problem with said regex and any other contributor would take the same 25 hours David took to get to the bottom of the code. Which, it seems, just count parenthesis?

  9. What happens when you have a paren inside a string with this regex? It seems like it will get counted for the indent level.

Leave a Reply

Your email address will not be published. Required fields are marked *