# lop -- parsing library for JavaScript
lop is a library to create parsers using parser combinators with helpful errors.

```javascript
var lop = require("lop");
var rules = lop.rules;

function parse(tokens) {
    var parser = new lop.Parser();
    return parser.parseTokens(expressionRule, tokens);
}

// This rule is wrapped inside lop.rule to defer evaluation until
// the rule is used -- otherwise, it would reference integerRule
// and ifRule, which don't exist yet.
var expressionRule = lop.rule(function() {
    return rules.firstOf("expression",
        integerRule,
        ifRule
    );
});

var integerRule = rules.then(
    rules.tokenOfType("integer"),
    function(value) {
        return new IntegerNode(parseInt(value, 10));
    }
);

var ifRule = rules.sequence(
    rules.token("keyword", "if"),
    rules.sequence.cut(),
    rules.sequence.capture(expressionRule),
    rules.token("keyword", "then"),
    rules.sequence.capture(expressionRule),
    rules.token("keyword", "else"),
    rules.sequence.capture(expressionRule)
).map(function(condition, trueBranch, falseBranch) {
    return new IfNode(condition, trueBranch, falseBranch);
});
```
lop tries to provide helpful errors where possible. For instance, in `ifRule`
as defined above, there is a cut following the keyword `if`. Before the cut,
if we fail to match the input, we can backtrack -- in this case, we backtrack
and see if another form of expression might match the input. However, after the
cut, we prevent backtracking. Once we've seen the keyword `if`, there's no doubt
about which sort of expression this is, so if parsing fails later in this rule,
there's no point in backtracking. This allows informative error messages to be
generated: if we try to parse the string `"if 1 42 else 12"`, we get the error:

    Error: File: /tmp/lop-example
    Line number: 1
    Character number: 6:
    Expected keyword "then"
    but got integer "42"
## Tokens
When using a parser built with lop, the input is an array of tokens. A token can be any value so long as it has the property `source`, which must be a `StringSourceRange`.
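
For instance, a minimal hand-built token might look like the following sketch (the string and range here are illustrative; `StringSource` is shown in full in the custom tokeniser example below):

```javascript
var StringSource = require("lop").StringSource;

// Any object with a `source` property holding a StringSourceRange
// is a valid token; `type` and `value` are conventions used by the
// tokenisers in this README.
var source = new StringSource("hello world", "example");
var token = {
    type: "word",
    value: "hello",
    source: source.range(0, 5)
};
```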
### Regex tokeniser
The easiest way to create a tokeniser is to use lop's regex tokeniser.
A regex tokeniser can be constructed by calling `new lop.RegexTokeniser(rules)`,
where `rules` is a list of token rules.
A token rule should have a `name` property that uniquely identifies that rule,
and a `regex` property that is an instance of `RegExp` describing the token.

Calling `tokenise` with a string will return a list of tokens.
Each token has three properties:

* `type`
* `value`
* `source`

The tokeniser will apply the regex from each rule in order at the current position.
The current position is initially zero, the start of the string.
The first rule with a matching regex is used to produce a token,
with the token's `value` being the first capture of the regex,
or `undefined` if the regex does not define any capture groups.
The current position is then advanced to the index of the first character not matched by the regex.
If no rule matches at the current position,
a single-character `unrecognisedCharacter` token is produced,
and the current position is incremented by one.

For instance, to create a simple tokeniser that generates a stream of word tokens separated by whitespace tokens:
```javascript
var lop = require("lop");

var rules = [
    {
        // Runs of non-whitespace characters
        name: "identifier",
        regex: /(\S+)/
    },
    {
        // Runs of whitespace characters
        name: "whitespace",
        regex: /(\s+)/
    }
];
var tokeniser = new lop.RegexTokeniser(rules);
tokeniser.tokenise(input);
```
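
As a rough sketch of the result (the `source` values are elided here), tokenising a short string with these rules should produce tokens shaped like this:

```javascript
var tokens = tokeniser.tokenise("hello world");
// A sketch of the expected shape, not verbatim output:
//   {type: "identifier", value: "hello", source: ...}
//   {type: "whitespace", value: " ", source: ...}
//   {type: "identifier", value: "world", source: ...}
```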
### Custom tokenisers
You can also create your own tokeniser.
For instance, to create a simple tokeniser that generates a stream of word tokens separated by whitespace tokens:
```javascript
var StringSource = require("lop").StringSource;

function tokeniseString(string) {
    return tokenise(new StringSource(string, "raw string"));
}

function tokenise(source) {
    var string = source.asString();
    var whitespaceRegex = /(\s+)/g;
    var result;
    var start = 0;
    var parts = [];

    // Each run of whitespace terminates the word that precedes it
    while ((result = whitespaceRegex.exec(string)) !== null) {
        parts.push({
            type: "word",
            value: string.substring(start, result.index),
            source: source.range(start, result.index)
        });
        parts.push({
            type: "whitespace",
            value: result[1],
            source: source.range(result.index, whitespaceRegex.lastIndex)
        });
        start = whitespaceRegex.lastIndex;
    }
    // The final word runs from the last whitespace to the end of the string
    parts.push({
        type: "word",
        value: string.substring(start),
        source: source.range(start, string.length)
    });
    parts.push({
        type: "end",
        source: source.range(string.length, string.length)
    });
    // Remove the empty words generated by leading and trailing whitespace
    return parts.filter(function(part) {
        return part.type !== "word" || part.value !== "";
    });
}
```
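
A quick usage sketch, assuming the functions above:

```javascript
tokeniseString("hello world");
// Produces: word "hello", whitespace " ", word "world",
// followed by the final end token
```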
lop also defines its own notion of a token. Each instance of `lop.Token` has a type, a value, and a source, similarly to most of the tokens that would be created by the tokeniser above. For instance, instead of:

    {
        type: "word",
        value: value,
        source: source
    }

you could use:

    new Token("word", value, source)
The main advantage of using `lop.Token` is that you can then use the rules `lop.rules.token` and `lop.rules.tokenOfType` (described later). If you don't use `lop.Token`, you must define your own atomic rules, but you can use the other rules without any modifications.
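
For example, here is a minimal sketch (the function name is illustrative) that wraps the plain tokens produced by the custom tokeniser above in `lop.Token` instances:

```javascript
var lop = require("lop");

// A sketch, assuming the tokenise function defined above:
// wrap each plain token in lop.Token so that lop.rules.token
// and lop.rules.tokenOfType can be used directly.
function tokeniseToLopTokens(source) {
    return tokenise(source).map(function(part) {
        return new lop.Token(part.type, part.value, part.source);
    });
}
```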
## Parser
To parse an array of tokens, call the method `parseTokens` on an instance of `lop.Parser`, passing in the parsing rule and the array of tokens. For instance, assuming we already have a `tokenise` function (the one above would do fine):
```javascript
function parseSentence(source) {
    var tokens = tokenise(source);
    var parser = new lop.Parser();
    var parseResult = parser.parseTokens(sentenceRule, tokens);
    if (!parseResult.isSuccess()) {
        throw new Error("Failed to parse: " + describeFailure(parseResult));
    }
    return parseResult.value();
}

function describeFailure(parseResult) {
    return parseResult.errors().map(describeError).join("\n");

    function describeError(error) {
        return error.describe();
    }
}
```
The result of parsing can be success, failure, or error. While failure indicates
that the rule didn't match the input tokens, error indicates that the input
was invalid in some way. In general, rules will backtrack when they
encounter a failure, but will completely abort when they encounter an error.
Each of these results has a number of methods:

```javascript
result.isSuccess() // true for success, false otherwise
result.isFailure() // true for failure, false otherwise
result.isError() // true for error, false otherwise
result.value() // if success, the value that was parsed
result.remaining() // if success, the tokens that weren't consumed by parsing
result.source() // the StringSourceRange containing the consumed tokens
result.errors() // if failure or error, an array of descriptions of the failure/error
```
The final question is then: how do we define rules for the parser, such as the currently undefined `sentenceRule`?
## Rules
Each rule in lop accepts an iterator over tokens, and returns a result, as
described in the previous section.
### lop.rules.token(*tokenType*, *value*)
Success if the next token has type `tokenType` and value `value`, failure
otherwise. Value on success is the value of the token.
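
For instance, a rule matching the keyword `if`, as used by `ifRule` above:

```javascript
lop.rules.token("keyword", "if")
```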
### lop.rules.tokenOfType(*tokenType*)
Success if the next token has type `tokenType`, failure otherwise. Value on
success is the value of the token.
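
For instance, a rule matching any integer token, as used by `integerRule` above:

```javascript
lop.rules.tokenOfType("integer")
```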
### lop.rules.firstOf(*name*, *subRules*)
Tries each rule in `subRules` on the input tokens in turn. We return the result
from the first sub-rule that returns success or error. In other words, we return the
result from the first sub-rule that doesn't return failure. If all sub-rules return
failure, this rule returns failure.
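
For instance, `expressionRule` from the opening example tries `integerRule` first, then backtracks and tries `ifRule` if it fails:

```javascript
lop.rules.firstOf("expression",
    integerRule,
    ifRule
)
```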
### lop.rules.then(*subRule*, *func*)
Try `subRule` on the input tokens, and if successful, map over the result. For
instance:

```javascript
lop.rules.then(
    lop.rules.tokenOfType("integer"),
    function(tokenValue) {
        return parseInt(tokenValue, 10);
    }
)
```
### lop.rules.optional(*subRule*)
Try `subRule` on the input tokens. If the sub-rule is successful with the value
`value`, then return success with the value `options.some(value)`. If the sub-rule fails, return
success with the value `options.none`. If the sub-rule errors, return that error.
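
For instance, a sketch of a rule that consumes an integer token when one is present, using the `options` values described above:

```javascript
lop.rules.optional(lop.rules.tokenOfType("integer"))
// success with options.some(value) if the next token is an integer,
// success with options.none otherwise
```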