I Slipped on JavaScript’s Banana Peel

Posted by Mike Pennisi

Nov 20 2017

Think of the last time you goofed up on the job. Maybe you forgot to clean out the microwave in the break room. Maybe you hit “Reply All” when you really meant “Reply.” Or maybe you nodded off during an all-hands meeting.

Probably your mistake was a little less banal than any of that, but I’ll bet the result was similar: your face got red, you apologized, and within a day or two, it was all business as usual.

If that’s accurate, then, I envy you. My latest antics violated a fundamental principle of today’s most widely-used programming language. Fortunately, smarter folks than me are in charge, and the slip-up was quickly corrected. But it took much more than a few days for my complexion to return to normal.

In this post, I’ll explain what I was thinking, why I was wrong, and how “LR(1)” (a special trait of some programming languages) can be so subtle but also so important.

The “problem” (as I saw it)

Here at Bocoup, we’re routinely contributing to the development of the JavaScript programming language. Sometimes, we’re designing new features. Other times, we’re improving the standards process. Most of all, though, we’re writing tests. It was in this latter capacity that I stumbled over what seemed like an overly complicated and confusing detail in the language specification. Specifically, this detail concerned the grammar for ES2015 modules.

The syntax for export declarations in ES2015 modules is described (in part) by the ExportSpecifier:

ExportSpecifier:
  IdentifierName
  IdentifierName `as` IdentifierName

When you write export Link from './hyrule.js'; or export Zelda as Shiek;, you’re relying on ExportSpecifier.” The tricky part is that while IdentifierName includes your typical variable names like foo and bar, it’s also satisfied by reserved words like new and var. Many JavaScript developers have an intuitive understanding of this from its use to define property names in object literals:

var myObject = {
  foo: 1,
  bar: 2,
  // A little odd, but valid since ES5:
  new: 3,
  var: 4
};

Its use in ExportSpecifier makes the following code fair game:

export { var } from './strange.js';

That declaration doesn’t actually create any bindings–it just re-exports a binding defined by strange.js–so maybe that seems okay. But it begs the question: how did strange.js define that binding in the first place? Well, “local” bindings can be re-named as they are exported:

var x;
export { x as var };

So also, strange, but no problem. What threw me for a loop was that the ExportSpecifier was shared by both “indirect” exports and “local” exports. Both use ExportClause, which uses ExportsList, which uses ExportSpecifier with IdentifierName.

ExportDeclaration:
  `export` ExportClause FromClause `;`
  `export` ExportClause `;`

(Some additional parsing goals omitted for clarity’s sake.)

From there, we can trace our way through the “productions” in the grammar until we eventually come to the IdentifierName in ExportSpecifier:

ExportClause:
  `{` `}`
  `{` ExportsList `}`
  `{` ExportsList `,` `}`

ExportsList:
  ExportSpecifier
  ExportsList `,` ExportSpecifier

ExportSpecifier:
  IdentifierName
  IdentifierName `as` IdentifierName

This seemed to allow exporting impossible local bindings, such as:

// (define `var`, somehow)
export { var };

But you can’t write var var = 3;, so what should that export declaration do? Produce a ReferenceError? The error message “var is not defined.” would probably confuse even the most seasoned JavaScript developer. Or maybe it should just check the global object. After all, while you may not be able to write var var = 3;, you can write window["var"] = 3; (please don’t). But the ES2015 module system resolves all imports and exports before executing any code, so it can’t reference properties created at run time. (Although this behavior has caused headaches for implementers in some contexts, it also enables a lot of advanced static analysis and transformation such as “tree shaking.”)

It turns out that the spec defines an “early error” for exactly this case. Early errors are a way the specification disallows code that would otherwise be allowed by the grammar. For instance, it’s only thanks to an early error that using the with statement in strict mode code causes a parsing failure. When parsing “local” exports, the following early error comes in to play:

For each IdentifierName n in ReferencedBindings of ExportClause: It is a Syntax Error if StringValue of n is a ReservedWord or if the StringValue of n is one of: “implements”, “interface”, “let”, “package”, “private”, “protected”, “public”, or “static”.

NOTE The above rule means that each ReferencedBindings of ExportClause is treated as an IdentifierReference.

That means that export var; is a SyntaxError and everything is technically correct. So why was I all bent out of shape?

Imagine you’re on the phone with animal control to report a giraffe that escaped from the zoo. You could tell them, “there is a giraffe in my back yard.” That would probably be the quickest way to convey the necessary information. Or you could say, “there is a creature in my back yard,” wait for them to ask for more information, and then proceed to describe the giraffe in great detail–taking care not to use the word “giraffe.”

Whether you’re describing a Giraffa camelopardalis in terms of a “creature” or an IdentifierReference in terms of an IdentifierName, “technically correct” is not the same as “intuitive.” I felt like if a parsing rule takes half a blog post to explain, well, maybe that rule could be phrased in a better way.

My “solution”

I proposed an additional “production” named ExportSpecifier_local to compliment ExportSpecifier. Here’s how they looked side-by-side:

ExportSpecifier:
  IdentifierName
  IdentifierName `as` IdentifierName

ExportSpecifier_local:
  IdentifierReference
  IdentifierReference `as` IdentifierName

This would be used by another new production, ExportsList_local, which would be used by a third new production ExportClause_local. All of this was the necessary groundwork to make the definition of ExportDeclaration more intuitive:

ExportDeclaration:
  `export` ExportClause FromClause `;`
  `export` ExportClause_local `;`

With that in place, we could remove that early error because the grammar itself would disallow export { var };. I labeled the patch “editorial” because it wasn’t intended to change any behavior of the language, just improve readability of the specification. My hope was that this new version would make the whole IdentifierName/IdentifierReference distinction easier to discover and understand. My reviewers tended to agree: after some discussion about the grammar conventions in use, the patch was merged.

Little did I know that this seemingly-harmless change actually violated a core feature of the language.

The flaw

Months later, while reviewing that same section of the specification, I noticed that my change was missing. I opened up the old pull request and found some recent activity: a new issue titled, “Are the changes from #637 LR(1) compatible?” In a discussion that was frankly way over my head, the participants concluded that no, my changes were not “LR(1) compatible,” and they therefore had to be reverted as a matter of course.

If you’ve contributed to a few open source projects, you might be familiar with the special kind of shame that results from a reverted patch. My embarrassment in this case was “extra special” because I didn’t even understand the rationale. So I started researching.

The issue reporter verified this interpretation by building a small parser. They referred to it as a “toy grammar,” which sure sounded like fun, so I followed suit with my own version and found the same thing. The parser generator GNU Bison reported “3 reduce/reduce conflicts” when attempting to produce a parser from my change to the grammar. To understand why, we’ll need to dig a bit deeper.

LR(1) is the term for a particular kind of parser that accepts deterministic, “context-free” languages in linear time. It considers input “tokens” one after the next and usually knows what to expect following each one. For example, given the following code:

var x, y = 0;

Here’s what the parsing process might look like:

var: This is a variable declaration. Now expecting a list of bindings
x: This is a binding identifier. Now expecting either a comma, an “equals” sign, or a semicolon
,: This marks the end of the binding declaration. Now expecting another binding
y: This is another binding identifier. Expecting a comma, an “equals” sign, or a semicolon
=: This is an initializer. Now expecting a value
0: This is an expression. Now expecting a comma, an “equals” sign, or a semicolon
;: This is the end of the variable declaration. Now expecting a new statement

The next step is only “usually” known because there may be more than one way to interpret some specific input. One case of this ambiguity comes up is ES2015 arrow functions; consider the following statement:

((x) => {});

The parsing strategy we used above couldn’t cope with this:

(: This is a parenthesized expression. Now expecting an expression
(: This is an arrow function. Now expecting a list of bindings
x: This is a parameter name. Now expecting either a comma, an “equals” sign (for default parameters), or a closing parenthesis
): This is the end of the parameter list. Now expecting an “arrow”
=>: Now expecting a block or an expression
{: I’m confused–is this the beginning of a function body or the beginning of an object literal? I no longer feel so confident about the world or my place within it

When the parser reaches the opening brace character, it cannot know how to proceed–should it be interpreting the rest of the input as a series of statements or as properties of an object literal? To avoid this confusion, the specification grammar has an extra restriction: it only accepts expressions if they do not begin with that { character. That means the fifth step reads more like: “Not sure what to expect; waiting for the next token… It’s an opening brace, so I’m now expecting a function body.”

This need to “look ahead” by a single piece of input is common when parsing many programming languages–not just JavaScript. The “1” in the name “LR(1)” describes that ability.

The bad news is: a parser written to accept my change would need to “look ahead” by more than just one token. More specifically, it would need to look ahead by a variable number of tokens. To see what I mean, check out this exaggerated nonsense code:

export { a, b, c, d, e, var, f, g, h, i, j } from './elsewhere.js';

We saw something like this earlier. The parser ought to accept this because the binding named var is allowed in “indirect” exports. Unfortunately, even with the newfound ability to look ahead to the next piece of input, we’re hosed:

export: This is an export declaration. Now expecting an opening brace.
{: Not sure what to expect. If this is a “local” export, then I should expect an IdentifierName. If this is a “indirect” export, then I should expect an IdentifierReference. Waiting for the next token… It’s a. Great, I still don’t know what to do.

Hopefully at this point, my folly is more obvious:

ExportDeclaration:
  `export` ExportClause FromClause `;`
  `export` ExportClause_local `;`

With this grammar, the LR(1) parser can’t choose between ExportClause and ExportClause_local without looking ahead through the entire list of exported bindings. That’s not LR(1) or even LR(2)–that’s “LR(as many bindings as I darn well please)”. (For kicks, I experimented with how a more Python-like syntax would actually support this distinction. There are no conflicts when from comes first.) We’ve answered the question we initially asked, but it begs a more important question…

Why does this even matter?

It would be easy to say, “the patch broke JavaScript because it is impossible to write a parser that implements the change.” But this would be oversimplifying. Remember that my change was “editorial”–it only modified how the grammar was described. For all its faults, it still described the same programming language. Even with my change reverted, this code is valid:

export { new } from './elsewhere.js';

And this code is invalid:

export { new };

It’s more accurate to say, “it is impossible to write an LR(1) parser that implements the change.”

However, the most prevalent JavaScript parsers are not LR(1) parsers. They use completely different strategies to interpret source code, and they are certainly capable of “looking ahead” by a variable number of tokens. The real question is: why do the language authors bother preserving a trait that is technically unnecessary?

It comes down to an issue of verifiability. As long as the grammar is LR(1) compatible, we can use tools like GNU Bison to automatically verify that no ambiguities exist. Without that, it would be all too easy to introduce new language features that are ambiguous.

That said, JavaScript will always need additional restrictions that are not LR(1) compatible. We specify those extra rules as “early errors” because that gives us a limited set of “special cases” that have to be verified manually. We can deterministically prove that any given source code is valid JavaScript thanks to two observations: (1) the grammar is unambiguous, and (2) each of the early errors is unambiguous. LR(1) buys us the first part, so the difficult task of case-by-case verification is limited to early errors.

So while developers working to support Firefox and Chrome may complain when new browser APIs behave differently (as in new CuttingEdgeWebFeature()), they don’t have to worry about consistency in languages features (as in ((x) => {});). LR(1) compatibility helps us say this with mathematical certainty.

In theory, anyway. Scan the language specification for the term “LR(1)” and you’ll come away empty-handed. In other words: the requirement is undocumented. This went a long way in healing my bruised ego because it made my mistake less like breaking a window and more like slipping on a banana peel.

My initial inclination was to add some documentation to the spec to help others avoid making the same mistake (“Caution: Banana Peel Here”). I’ve since found reason to hold off. It turns out that there isn’t consensus about this restriction even within TC39–the standards body that maintains the language. Some members are concerned that LR(1) might unnecessarily restrict the possibilities for new syntaxes in the future. They wonder if there might be other ways to validate the determinism of the grammar (like picking up the banana peel and laying down some non-slippery food refuse… maybe a corn husk or something). So instead, we’re requesting that the committee discusses this at their next meeting later this month.

The real lesson

At Bocoup, we spend a lot of time contributing to web standards, but we also continue to consult on application development. From an application developer’s perspective, all of this might seem somewhat academic. Knowing the motivations for an early error isn’t going to help you configure a Webpack build, after all.

Then again, that same Webpack build likely relies on Babel, and an adventurous configuration may even enable support for new language features. Although experimentation requires caution, it’s also an important part of engaging with the web platform. Your next SyntaxError may be the result of a flaw in your customized programming language, so it’s good to be aware of what to watch out for.

More importantly: you should know that the industry is full of smart, welcoming folks who are eager to help you contribute. I learned a lot as a result of my mistake, and it’s all thanks to André Bargull, Michael Dyck, Shu-yu Guo, Dave Herman, Waldemar Horwat, Caitlin Potter and Brian Terlson. It’s comforting to know the web is resilient to mistakes, but it’s inspiring to collaborate with the dedicated and outgoing professionals who make it so.

Posted by
Mike Pennisi
on November 20th, 2017

Contact Us

We'd love to hear from you. Get in touch!

Email

hello@bocoup.com