Lexical Structure

Black, December 22, 2024, about an 11-minute read

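An M document is an ordered sequence of Unicode characters. A document consists either of a single expression or of sets of definitions organized into sections. Conceptually, the following steps are used to read an expression from a document:

The document is decoded according to its character encoding scheme into a sequence of Unicode characters.
Lexical analysis is performed, translating the stream of Unicode characters into a stream of tokens. The rest of this chapter covers lexical analysis.
Syntactic analysis is performed, translating the stream of tokens into a form that can be evaluated. Later chapters cover this process.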

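Grammar conventions

The lexical and syntactic grammars are presented using grammar productions. Each grammar production defines a non-terminal symbol and the possible expansions of that non-terminal into sequences of non-terminal or terminal symbols. In grammar productions, non-terminal symbols are shown in italic type and terminal symbols are shown in a fixed-width font.

The first line of a grammar production is the name of the non-terminal symbol being defined, followed by a colon. Each successive indented line contains a possible expansion of the non-terminal, given as a sequence of non-terminal or terminal symbols. For example:

if-expression:
    if if-condition then true-expression else false-expression

defines an if-expression to consist of the token if, followed by an if-condition, followed by the token then, followed by a true-expression, followed by the token else, followed by a false-expression.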

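When a non-terminal symbol has more than one possible expansion, the alternatives are listed on separate lines. For example:

variable-list:
    variable
    variable-list , variable

defines a variable-list to consist either of a variable or of a variable-list followed by a , and a variable. In other words, the definition is recursive and specifies that a variable list consists of one or more variables, separated by commas.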

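The subscripted suffix "_opt" marks an optional symbol. For example:

field-specification:
    optional_opt field-name = field-type

is shorthand for:

field-specification:
    field-name = field-type
    optional field-name = field-type

Both forms define a field-specification that optionally begins with the terminal symbol optional, followed by a field-name, the terminal symbol =, and a field-type.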
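
The "_opt" shorthand can be made concrete with a short sketch. The Python below is only an illustration and is not part of M or of any M tooling. The function name expand_optional and the representation of a production as a list of symbol names are assumptions made for this example.

```python
from itertools import product


def expand_optional(symbols):
    """Expand every symbol ending in `_opt` into explicit alternatives.

    `symbols` is one right-hand side of a production, given as a list of
    symbol names. Each optional symbol is either left out or kept.
    """
    choices = []
    for symbol in symbols:
        if symbol.endswith("_opt"):
            base = symbol[: -len("_opt")]
            choices.append([(), (base,)])  # omitted first, then present
        else:
            choices.append([(symbol,)])
    return [
        [name for part in combo for name in part]
        for combo in product(*choices)
    ]


# The field-specification production from the text above.
for alternative in expand_optional(["optional_opt", "field-name", "=", "field-type"]):
    print(" ".join(alternative))
```

Run as written, the loop prints the two alternatives of the longhand production, first the one without optional and then the one with it.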

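Alternatives are normally listed on separate lines, but when there are many of them, the phrase "one of" may precede a list of expansions given on a single line. This is simply shorthand for listing each alternative on its own line. For example:

decimal-digit: one of
    0 1 2 3 4 5 6 7 8 9

is shorthand for: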

decimal-digit:
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9

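Lexical analysis

The lexical-unit production defines the lexical grammar of an M document. Every valid M document conforms to this grammar.

lexical-unit:
    lexical-elements_opt
lexical-elements:
    lexical-element
    lexical-element lexical-elements
lexical-element:
    whitespace
    token
    comment

At the lexical level, an M document consists of a stream of whitespace, token, and comment elements. Only tokens are significant in the syntactic grammar.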

whitespace:
    Any character with Unicode class Zs
    Horizontal tab character (U+0009)
    Vertical tab character (U+000B)
    Form feed character (U+000C)
    Carriage return character (U+000D) followed by line feed character (U+000A)
    new-line-character
new-line-character:
    Carriage return character (U+000D)
    Line feed character (U+000A)
    Next line character (U+0085)
    Line separator character (U+2028)
    Paragraph separator character (U+2029)
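
As a rough illustration, the whitespace and new-line-character productions can be checked one character at a time. This Python sketch assumes that the tables in Python's unicodedata module are an acceptable stand-in for the Unicode classes named in the grammar. The function and constant names are made up for this example.

```python
import unicodedata

# new-line-character alternatives from the grammar above.
NEW_LINE_CHARACTERS = {"\u000D", "\u000A", "\u0085", "\u2028", "\u2029"}
# Horizontal tab, vertical tab, and form feed.
OTHER_WHITESPACE = {"\u0009", "\u000B", "\u000C"}


def is_new_line_character(ch):
    """True if the single character `ch` is a new-line-character."""
    return ch in NEW_LINE_CHARACTERS


def is_whitespace(ch):
    """True if the single character `ch` can be part of whitespace.

    The CR LF alternative needs no special case here, because both of its
    characters are new-line characters on their own.
    """
    return (
        unicodedata.category(ch) == "Zs"
        or ch in OTHER_WHITESPACE
        or is_new_line_character(ch)
    )
```

For instance, is_whitespace("\u00A0") is true because the no-break space belongs to class Zs, while is_whitespace("a") is false.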

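For compatibility with source code editing tools that add end-of-file markers, the following transformations are applied, in order, to an M document:

If the last character of the document is a Control-Z character (U+001A), this character is deleted.
If the document is non-empty and its last character is not a carriage return (U+000D), a line feed (U+000A), a line separator (U+2028), or a paragraph separator (U+2029), a carriage return (U+000D) is added to the end of the document.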
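
The two transformations above can be sketched in a few lines of Python. This is an illustration of the stated rules only, and the function name normalize_end_of_document is an assumption of this example rather than anything M defines.

```python
# Characters that already terminate the document, as listed in the second rule.
LINE_ENDINGS = ("\u000D", "\u000A", "\u2028", "\u2029")


def normalize_end_of_document(text):
    """Apply the two end-of-document rules, in order."""
    # Rule 1: delete a trailing Control-Z (U+001A).
    if text.endswith("\u001A"):
        text = text[:-1]
    # Rule 2: a non-empty document must end with one of the listed
    # characters; otherwise a carriage return is appended.
    if text and not text.endswith(LINE_ENDINGS):
        text += "\u000D"
    return text
```

Because the rules run in order, a document that contains nothing but Control-Z becomes empty after the first rule and is then left alone by the second.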

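Comments

Two forms of comment are supported: single-line comments and delimited comments (also called multi-line comments).

Single-line comment: starts with // and extends to the end of the line, that is, up to but not including the next new-line character.
Delimited comment: starts with /* and ends at the first */ that follows, whatever characters come in between. It may span multiple lines, and delimited comments do not nest.

// This is the first comment

// This is the second comment

/*
/* 	This is the third comment
* 	This is the third comment
/	This is the third comment
* /	This is the third comment
*/

/* This is the fourth comment	*/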

comment:
    single-line-comment
    delimited-comment
single-line-comment:
    // single-line-comment-characters_opt
single-line-comment-characters:
    single-line-comment-character single-line-comment-characters_opt
single-line-comment-character:
    Any Unicode character except a new-line-character
delimited-comment:
    /* delimited-comment-text_opt asterisks /
delimited-comment-text:
    delimited-comment-section delimited-comment-text_opt
delimited-comment-section:
    /
    asterisks_opt not-slash-or-asterisk
asterisks:
    * asterisks_opt
not-slash-or-asterisk:
    Any Unicode character except * or /
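
The comment grammar can be approximated with two regular expressions. The Python sketch below is a simplification under a stated assumption: it knows nothing about text literals, so a // or /* that appears inside a string would be reported as a comment, which a real lexer would not do. The names find_comments, SINGLE_LINE, and DELIMITED are made up for this example.

```python
import re

# "//" up to, but not including, the next new-line character.
SINGLE_LINE = r"//[^\r\n\u0085\u2028\u2029]*"
# "/*" up to the first "*/" that follows; the match is non-greedy,
# so an inner "/*" does not open a nested comment.
DELIMITED = r"/\*.*?\*/"
COMMENT = re.compile(f"{DELIMITED}|{SINGLE_LINE}", re.DOTALL)


def find_comments(source):
    """Return every comment in `source`, in order of appearance."""
    return [match.group(0) for match in COMMENT.finditer(source)]


sample = "1 // first\n/* a /* b * / c */ + 2 /**/"
print(find_comments(sample))
```

In the sample, the sequences /* and * / inside the delimited comment do not end it, which matches the third comment in the example above.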

Tokens

A token is an identifier, a keyword, a literal, or an operator or punctuator.

Whitespace and comments separate tokens but are not tokens themselves, so they carry no syntactic meaning.

token:
    identifier
    keyword
    literal
    operator-or-punctuator

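Tip

A literal is a piece of source text, made up of letters or symbols, that stands for itself.

In Chinese the term is usually rendered as 字面, 字面值, 字面量, or 字面常量.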

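Character escape sequences

An M document can contain any Unicode character, but some special characters cannot be typed directly and can only be entered as escape sequences.

Inside a text literal (between the pair of double quotes), a character escape sequence starts with #( and ends with ).

The escape sequence list (the content between #( and )) consists of one or more of the following items. Multiple items are separated by commas (,):

# (stands for the # character itself, so that the two characters #( can appear in text)
One of tab, lf, or cr
A 4-digit hexadecimal Unicode escape sequence
An 8-digit hexadecimal Unicode escape sequence

A hexadecimal Unicode escape sequence with fewer than 4 or 8 digits must be padded with leading zeros.

// Equivalent to "#(cr)#(lf)"
"#(cr,lf)"

// Error: no other characters, not even spaces, are allowed
"#(cr, lf)"

// "我" (U+6211)
"#(6211)"

// "+🤩+"
"+#(0001F929)+"

// Equivalent to "#" & "("
"#(#)("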

character-escape-sequence:
    #( escape-sequence-list )
escape-sequence-list:
    single-escape-sequence
    single-escape-sequence , escape-sequence-list
single-escape-sequence:
    long-unicode-escape-sequence
    short-unicode-escape-sequence
    control-character-escape-sequence
    escape-escape
long-unicode-escape-sequence:
    hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit
short-unicode-escape-sequence:
    hex-digit hex-digit hex-digit hex-digit
control-character-escape-sequence:
    control-character
control-character:
    cr
    lf
    tab
escape-escape:
    #
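
To show how the escape-sequence grammar reads in practice, here is a small Python decoder. It is a sketch, not the M implementation: it takes the text between the double quotes, and the names decode_escapes and decode_single are assumptions of this example.

```python
import re

CONTROL_CHARACTERS = {"cr": "\r", "lf": "\n", "tab": "\t"}
# "#(" followed by the escape-sequence-list and a closing ")".
ESCAPE = re.compile(r"#\(([^)]*)\)")
# Exactly 4 or exactly 8 hexadecimal digits.
HEX = re.compile(r"[0-9A-Fa-f]{4}|[0-9A-Fa-f]{8}")


def decode_single(item):
    """Decode one single-escape-sequence; reject anything else."""
    if item == "#":
        return "#"  # escape-escape
    if item in CONTROL_CHARACTERS:
        return CONTROL_CHARACTERS[item]
    if HEX.fullmatch(item):
        return chr(int(item, 16))
    raise ValueError(f"invalid escape sequence: {item!r}")


def decode_escapes(text):
    """Replace every #(...) in `text` with the characters it denotes."""
    return ESCAPE.sub(
        lambda m: "".join(decode_single(i) for i in m.group(1).split(",")),
        text,
    )


print(repr(decode_escapes("#(cr,lf)")))
print(decode_escapes("#(#)("))
```

Items are compared exactly, so "#(cr, lf)" raises ValueError because the second item is " lf" with a leading space. This mirrors the error case in the examples above.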

Literals

A literal is a source code representation of a value.

literal:
    logical-literal
    number-literal
    text-literal
    null-literal
    verbatim-literal

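Null literals

null represents an absent value.

In some contexts null also denotes the null type, so its meaning depends on the surrounding expression.

null-literal:
    null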

Logical literals

logical-literal:
    true
    false

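Number literals

123.456

.123456e3

123456E-3

// 123456
0x1E240

-.2

// Error: digits are required after the decimal point
2.

// Error: digits are required after the decimal point
2.e3

// Error: the exponent must be an integer
2e3.5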

number-literal:
    decimal-number-literal
    hexadecimal-number-literal
decimal-number-literal:
    decimal-digits . decimal-digits exponent-part_opt
    . decimal-digits exponent-part_opt
    decimal-digits exponent-part_opt
decimal-digits:
    decimal-digit decimal-digits_opt
decimal-digit: one of
    0 1 2 3 4 5 6 7 8 9
exponent-part:
    e sign_opt decimal-digits
    E sign_opt decimal-digits
sign: one of
    + -
hexadecimal-number-literal:
    0x hex-digits
    0X hex-digits
hex-digits:
    hex-digit hex-digits_opt
hex-digit: one of
    0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
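
The number-literal grammar translates almost directly into a regular expression. The Python sketch below is an illustration only, and the name is_number_literal is an assumption of this example. Note that the grammar allows a sign only inside the exponent, so in -.2 the leading minus is not part of the literal: the literal is .2 and the minus is an operator applied to it.

```python
import re

# decimal-number-literal: the three alternatives, then an optional exponent.
DECIMAL = r"(?:[0-9]+\.[0-9]+|\.[0-9]+|[0-9]+)(?:[eE][+-]?[0-9]+)?"
# hexadecimal-number-literal: 0x or 0X followed by one or more hex digits.
HEXADECIMAL = r"0[xX][0-9A-Fa-f]+"
NUMBER_LITERAL = re.compile(f"(?:{HEXADECIMAL}|{DECIMAL})")


def is_number_literal(text):
    """True if the whole of `text` is a single number-literal."""
    return NUMBER_LITERAL.fullmatch(text) is not None


for candidate in ["123.456", ".123456e3", "123456E-3", "0x1E240", "2.", "2.e3", "2e3.5"]:
    print(candidate, is_number_literal(candidate))
```

The first four candidates are accepted and the last three are rejected, in line with the examples at the start of this section.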

Text literals

To include a double quote in a text literal, write it twice.

// The text +"+
"+""+"

text-literal:
    " text-literal-characters_opt "
text-literal-characters:
    text-literal-character text-literal-characters_opt
text-literal-character:
    single-text-character
    character-escape-sequence
    double-quote-escape-sequence
single-text-character:
    Any character except " (U+0022) or # (U+0023) followed by ( (U+0028)
double-quote-escape-sequence:
    "" (U+0022, U+0022)

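Verbatim literals

A verbatim literal stores a sequence of Unicode characters that the user entered as code but that cannot be correctly parsed as code.
At runtime, it produces an error.

verbatim-literal:
    #!" text-literal-characters_opt "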

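Identifiers

An identifier is a name used to refer to a value. Identifiers are either regular identifiers or quoted identifiers.

Tip

Informally, identifiers are things like variable names and record field names.

Rules for regular identifiers:

It cannot be a keyword.
It starts with a letter or an underscore.
The remaining characters can be letters, digits, underscores, or certain other characters (connecting, combining, and formatting characters).
A dot cannot appear at the start or at the end.

Tip

"Letter" means a character in the Unicode classes Lu, Ll, Lt, Lm, Lo, or Nl, and the other character kinds are defined by Unicode class in the same way. Because encodings differ between programming environments, the usual advice is to use only English characters in identifiers. M code, however, is normally written only in the editors built into Excel and Power BI, so Chinese characters can be used in identifiers.

English characters here means a-z, A-Z, _, ., and 0-9.

The text between the quotes of a quoted identifier follows the same rules as a text literal.

// A single underscore, several underscores, or a leading underscore
_
_______
_A

// A Chinese character
我

// All of the following are errors
.A
A.
.
123A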

For details on these classes, see the Unicode general category definitions.

identifier:
    regular-identifier
    quoted-identifier
regular-identifier:
    available-identifier
    available-identifier dot-character regular-identifier
available-identifier:
    A keyword-or-identifier that is not a keyword
keyword-or-identifier:
    identifier-start-character identifier-part-characters_opt
identifier-start-character:
    letter-character
    underscore-character
identifier-part-characters:
    identifier-part-character identifier-part-characters_opt
identifier-part-character:
    letter-character
    decimal-digit-character
    underscore-character
    connecting-character
    combining-character
    formatting-character
dot-character:
    . (U+002E)
underscore-character:
    _ (U+005F)
letter-character:
    A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
combining-character:
    A Unicode character of classes Mn or Mc
decimal-digit-character:
    A Unicode character of the class Nd
connecting-character:
    A Unicode character of the class Pc
formatting-character:
    A Unicode character of the class Cf
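
The regular-identifier grammar can be checked with Python's unicodedata module. This sketch assumes that Python's Unicode tables match the classes named in the grammar closely enough for illustration. The function names are made up for this example, and the keyword set is copied from the keyword list later in this chapter, leaving out the keywords that start with #, which can never match because # is not an identifier start character.

```python
import unicodedata

KEYWORDS = {
    "and", "as", "catch", "each", "else", "error", "false", "if", "in", "is",
    "let", "meta", "not", "null", "or", "otherwise", "section", "shared",
    "then", "true", "try", "type",
}
LETTER_CLASSES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Letters plus decimal digits, connecting, combining, and formatting characters.
PART_CLASSES = LETTER_CLASSES | {"Nd", "Pc", "Mn", "Mc", "Cf"}


def is_keyword_or_identifier(text):
    """True if `text` matches the keyword-or-identifier production."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    if first != "_" and unicodedata.category(first) not in LETTER_CLASSES:
        return False
    return all(unicodedata.category(ch) in PART_CLASSES for ch in rest)


def is_regular_identifier(text):
    """True if `text` is dot-separated segments, none of them a keyword."""
    return all(
        is_keyword_or_identifier(part) and part not in KEYWORDS
        for part in text.split(".")
    )


for candidate in ["_", "_A", "我", "Table.SelectRows", ".A", "A.", ".", "123A", "if"]:
    print(candidate, is_regular_identifier(candidate))
```

Splitting on the dot means that a leading or trailing dot produces an empty segment, which is rejected. That is how the sketch enforces the rule that a dot cannot open or close an identifier.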

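A quoted identifier allows any sequence of zero or more Unicode characters to be used as an identifier, including keywords, whitespace, comments, operators, and punctuators.

quoted-identifier:
    #" text-literal-characters_opt "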

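Generalized identifiers

Identifiers used to name and access fields are called generalized identifiers.
A generalized identifier may contain blanks, keywords, or other identifiers.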

[
    as = 
        [
            a b = 1, 
            if = 2 
        ], 
    is = 3
][as][a b]

generalized-identifier:
    generalized-identifier-part
    generalized-identifier separated only by blanks (U+0020) generalized-identifier-part
generalized-identifier-part:
    generalized-identifier-segment
    decimal-digit-character generalized-identifier-segment
generalized-identifier-segment:
    keyword-or-identifier
    keyword-or-identifier dot-character keyword-or-identifier

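Although let...in is syntactic sugar for [...][...], the two forms are not fully interchangeable where identifiers are concerned. A variable name in a let expression is an ordinary identifier, so a keyword used as a name has to be quoted, whereas field names and field selectors accept generalized identifiers.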

let 
    #"as" = 
        [
            a b = 1, 
            if = 2
        ], 
    #"is" = 3
in 
    #"as"[a b]

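Keywords

A keyword is a reserved, identifier-like sequence of characters. It cannot be used as an identifier except through the identifier-quoting mechanism or where a generalized identifier is allowed.

keyword: one of
    and as catch each else error false if in is let meta not null or otherwise section shared then true try type #binary #date #datetime #datetimezone #duration #infinity #nan #sections #shared #table #time

Note: catch was introduced in May 2022.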

Tip

One way to group the keywords:

Types (functions): #binary #date #datetime #datetimezone #duration #table #time
Literals: false null true #infinity #nan
Operators: and as each is meta not or type
Errors: catch error otherwise try
Other: else if in let then
Sections (rarely needed): section shared #sections #shared

Operators and punctuators

Operators are used in expressions to combine operands. Punctuators are used for grouping and separating.

operator-or-punctuator: one of
, ; = < <= > >= <> + - * / & ( ) [ ] { } @ ! ? ?? => .. ...
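
Several entries in this list are prefixes of longer ones, for example . . and ..., or ? and ??. The Python sketch below splits a run of operators by always taking the longest entry that matches. Longest-match is the usual lexer convention and is an assumption of this example, since the text above does not state how M resolves such cases. The function names are also made up for illustration.

```python
OPERATORS = ", ; = < <= > >= <> + - * / & ( ) [ ] { } @ ! ? ?? => .. ...".split()
# Try longer operators first so that "..." wins over ".." and "<=" over "<".
LONGEST_FIRST = sorted(OPERATORS, key=len, reverse=True)


def match_operator(source, pos=0):
    """Return the longest operator-or-punctuator at `pos`, or None."""
    for op in LONGEST_FIRST:
        if source.startswith(op, pos):
            return op
    return None


def split_operators(source):
    """Split a string of operators and blanks into individual operators."""
    result = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        op = match_operator(source, pos)
        if op is None:
            raise ValueError(f"not an operator or punctuator at position {pos}")
        result.append(op)
        pos += len(op)
    return result


print(split_operators("<=<>...??=>.."))
```

A single . is not in the list, so the sketch rejects it. In the grammar the dot only appears inside identifiers and number literals.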