regex - Perl 6 Grammar doesn't match the way I think it should

I'm doing Advent of Code day 9:

You sit for a while and record part of the stream (your puzzle input). The characters represent groups – sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group – that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.

Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.

In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.

Of course, this just screams for a Perl 6 Grammar...

grammar Stream
{
    rule TOP { ^ <group> $}

    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar { <-[ !> ]> }
}

This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:

say Stream.parse('{<aa>}');

gives Nil.

Grammar::Tracer doesn't help:

TOP
|  group
|  |  group
|  |  * FAIL
|  |  garbage
|  |  |  garbchar
|  |  |  * MATCH "a"
|  |  * FAIL
|  * FAIL
* FAIL
Nil

Multiple garbignores are no problem:

say Stream.parse('{<!!a!a>}');

gives:

「{<!!a!a>}」
 group => 「{<!!a!a>}」
  garbage => 「<!!a!a>」
   garbignore => 「!!」
   garbchar => 「a」
   garbignore => 「!a」

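Any ideas?

UPD: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice: use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).

My original answer below explores why the rules didn't work. I'll leave it in place for now.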
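As a minimal sketch of that suggestion, here is the grammar from the question with every `rule` changed to `token` and nothing else altered. Whitespace in a token's pattern is not significant, so no token boundary is demanded between consecutive garbage characters. The two inputs are the ones from the question; both should now parse.

```raku
# The question's grammar, with `rule` replaced by `token` throughout.
grammar Stream {
    token TOP        { ^ <group> $ }
    token group      { '{' [ <group> || <garbage> ]* % ',' '}' }
    token garbage    { '<' [ <garbchar> | <garbignore> ]* '>' }
    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

# `so` collapses the result to a Bool: a Match is true, Nil is false.
say so Stream.parse('{<aa>}');     # the input that returned Nil with rules
say so Stream.parse('{<!!a!a>}');  # the input that already parsed
```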

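TL;DR: `<garbchar> |` contains a space. Whitespace that directly follows any atom in a rule indicates a tokenizing break. You can simply remove the inappropriate space, i.e. write `<garbchar>|` instead (or, better still, `<.garbchar>|` if you don't need to capture the garbage) to get the result you seek.

As your original question allowed, this isn't a bug; it's just that your mental model is off.

Your answer correctly identifies the issue: tokenization.

So what we're left with is your follow-up question, which is about your mental model of tokenization, or at least about how Perl 6 tokenizes by default: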
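A sketch of that one-character fix, keeping the rules from the question and changing only the `garbage` rule. It uses the non-capturing `<.garbchar>` form mentioned above, so garbage characters will not appear in the parse tree; use `<garbchar>|` if you want them captured.

```raku
grammar Stream {
    rule TOP   { ^ <group> $ }
    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }

    # No whitespace between <.garbchar> and `|`, so the rule no longer
    # asks for a token boundary after each garbage character.
    rule garbage { '<' [ <.garbchar>| <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

say so Stream.parse('{<aa>}');
```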

why … my second example … goes wrong with two garbchars in a row:

'{<aa>}'

Simplifying, the issue is how to tokenize this:

aa

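The simple high-level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and by default Perl 6 assumes this ordinary definition. This is the issue you're encountering: the space after `<garbchar>` in your rule demands a token boundary at that point, and by default there is no token boundary between the two a's.

You can overrule this ordinary definition to get whatever tokenizing result you care to achieve. But it's seldom necessary to do so, and certainly not in a simple case like this.

I'll provide two redundant paths that I hope will lead folk to the correct mental model:

  • For those who prefer diving straight into the nitty-gritty detail, there's a reddit comment I wrote recently about tokenization in Perl 6.
  • The rest of this SO answer provides a high-level discussion that complements the low-level explanation in my reddit comment.

Excerpting from the “Obstacles” section of the Wikipedia page on tokenization, and interleaving the excerpts with P6-specific discussion: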
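The default behavior can be probed directly. The sketch below calls `<.ws>` explicitly in ordinary regexes, standing in for the significant whitespace of a rule. It assumes the documented default for `ws`, which matches optional whitespace but refuses to match in the middle of a word.

```raku
# Whitespace in a plain regex is not significant, so the only boundary
# check in each pattern is the explicit <.ws> call.
say so 'aa'  ~~ / ^ a <.ws> a   $ /;  # boundary would fall inside a word
say so 'a a' ~~ / ^ a <.ws> a   $ /;  # whitespace separates the two
say so 'a!'  ~~ / ^ a <.ws> '!' $ /;  # punctuation separates the two
```

Only the first match should fail, which mirrors `{<aa>}` failing while `{<!!a!a>}` parses.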
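Purely to illustrate the mechanism, and not as a recommendation for this puzzle, a grammar can supply its own `ws` token. The sketch below keeps the question's rules unchanged and adds a hypothetical `ws` that accepts optional whitespace anywhere, including inside a word. The grammar name `LooseStream` is made up for this example.

```raku
grammar LooseStream {
    # Illustrative override: unlike the default, this ws does not
    # refuse to match between two word characters.
    token ws { \s* }

    rule TOP     { ^ <group> $ }
    rule group   { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

say so LooseStream.parse('{<aa>}');
```

One side effect to be aware of: with this override, whitespace inside garbage can be consumed by `ws` instead of being matched as a `garbchar`, which is one more reason plain tokens are the simpler choice here.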

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a “word”. Often a tokenizer relies on simple heuristics, for example:

  • Punctuation and whitespace may or may not be included in the resulting list of tokens.

In Perl 6, you control what is or isn't included in the parse tree using capture features that are orthogonal to tokenizing.

  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.

  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

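By default, the Perl 6 design embodies an equivalent of these two heuristics.

The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.

I think I'll end my answer here because it's already getting rather long. Please use the comments to help us improve this answer. I hope what I've written so far helps.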
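A small sketch of that distinction, using two made-up grammars (`WithRule` and `WithToken`) that differ only in the declarator used for `TOP`:

```raku
grammar WithRule {
    rule  TOP  { [ <word> ]+ }  # whitespace after <word> is significant
    token word { \w+ }          # one call matches one whole word
}

grammar WithToken {
    token TOP  { [ <word> ]+ }  # whitespace in the pattern is ignored
    token word { \w+ }
}

say WithRule.parse('foo bar baz')<word>.elems;  # number of words captured
say so WithToken.parse('foo bar baz');          # nothing consumes the spaces
say WithRule.parse('foobar')<word>.elems;       # contiguous letters: one word
```

The rule version should see three tokens in 'foo bar baz' and a single token in 'foobar', while the token version has no way to step over the spaces and so should fail to parse the first string.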
