Syntactic parsing is the automatic analysis of the syntactic structure of natural language, especially syntactic relations (in dependency grammar) and labelled spans of constituents (in constituency grammar).^[1] It is motivated by the problem of structural ambiguity in natural language: a sentence can be assigned multiple grammatical parses, so some kind of knowledge beyond computational grammar rules is needed to tell which parse is intended. Syntactic parsing is one of the important tasks in computational linguistics and natural language processing, and has been a subject of research since the mid-20th century with the advent of computers.

Different theories of grammar propose different formalisms for describing the syntactic structure of sentences. For computational purposes, these formalisms can be grouped under constituency grammars and dependency grammars. Parsers for either class call for different types of algorithms, and approaches to the two problems have taken different forms. The creation of human-annotated treebanks using various formalisms (e.g. Universal Dependencies) has proceeded alongside the development of new algorithms and methods for parsing.

Part-of-speech tagging (which resolves some semantic ambiguity) is a related problem, and often a prerequisite for or a subproblem of syntactic parsing. Syntactic parses can be used for information extraction (e.g. event parsing, semantic role labelling, entity labelling) and may be further used to extract formal semantic representations.

Constituency parsing

CKY

The most popular algorithm for constituency parsing is the Cocke–Kasami–Younger algorithm (CKY),^[4]^[5] a dynamic programming algorithm that constructs a parse in worst-case ${\mathcal {O}}\left(n^{3}\cdot \left|G\right|\right)$ time, where $n$ is the number of words in the sentence and $\left|G\right|$ is the size of a CFG given in Chomsky Normal Form.
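As a rough illustration of where the cubic factor comes from, the following is a minimal sketch of a CKY recognizer. The function name `cky_recognize`, the dictionary encoding of the grammar, and the toy grammar itself are assumptions made for this example and are not taken from any particular implementation.

```python
from itertools import product


def cky_recognize(words, lexical, binary, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar.

    lexical: dict mapping a word to the set of nonterminals A with A -> word
    binary:  dict mapping (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(words)
    # chart[i][j] holds every nonterminal that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexical.get(word, ()))
    # Three nested loops over span length, start, and split point: O(n^3)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                # The work done here depends on the grammar size
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((b, c), set())
    return n > 0 and start in chart[0][n]


# Hypothetical toy grammar in Chomsky Normal Form
LEXICAL = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

accepted = cky_recognize(["she", "eats", "fish"], LEXICAL, BINARY)
rejected = cky_recognize(["eats", "she"], LEXICAL, BINARY)
```

A recognizer only answers whether a parse exists; storing back-pointers in each chart cell would be needed to recover the trees themselves.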

Because ambiguity (e.g. preposition-attachment ambiguity in English) leads to multiple acceptable parses, it is necessary to score the probability of parses and pick the most probable one. One way to do this is to use a probabilistic context-free grammar (PCFG), which assigns a probability to each constituency rule, and to modify CKY to maximise probabilities when parsing bottom-up.^[6]^[7]^[8]
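The probability-maximising variant of CKY can be sketched as follows. This is an illustrative sketch only: the function name `viterbi_cky`, the rule encoding, and the toy grammar with its probabilities are invented for the example, which models a prepositional-phrase attachment ambiguity.

```python
import math


def viterbi_cky(words, lexical, binary, start="S"):
    """Return (log_prob, tree) for the most probable parse, or None.

    lexical: dict word -> list of (A, prob) for rules A -> word
    binary:  list of (A, B, C, prob) for rules A -> B C
    """
    n = len(words)
    best = {}  # (i, j, A) -> (log_prob, tree) for the best A over words[i:j]
    for i, word in enumerate(words):
        for a, p in lexical.get(word, []):
            best[(i, i + 1, a)] = (math.log(p), (a, word))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for a, b, c, p in binary:
                    left = best.get((i, k, b))
                    right = best.get((k, j, c))
                    if left is None or right is None:
                        continue
                    score = math.log(p) + left[0] + right[0]
                    # Keep only the highest-scoring analysis per cell and label
                    if (i, j, a) not in best or score > best[(i, j, a)][0]:
                        best[(i, j, a)] = (score, (a, left[1], right[1]))
    return best.get((0, n, start))


# Hypothetical toy PCFG; probabilities for each left-hand side sum to 1
LEXICAL = {
    "she": [("NP", 0.3)], "stars": [("NP", 0.3)], "telescopes": [("NP", 0.2)],
    "saw": [("V", 1.0)], "with": [("P", 1.0)],
}
BINARY = [
    ("S", "NP", "VP", 1.0),
    ("VP", "V", "NP", 0.6),
    ("VP", "VP", "PP", 0.4),   # PP attaches to the verb phrase
    ("NP", "NP", "PP", 0.2),   # PP attaches to the noun phrase
    ("PP", "P", "NP", 1.0),
]

log_prob, tree = viterbi_cky("she saw stars with telescopes".split(), LEXICAL, BINARY)
```

Under these made-up probabilities the verb-phrase attachment scores higher (0.4 × 0.6 for the two VP rules against 0.6 × 0.2 for the noun-phrase attachment), so the returned tree attaches the PP to the VP.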

A further modification is the lexicalized PCFG, which assigns a head to each constituent and encodes a rule for each lexeme in that head slot. Thus, where a PCFG may have a rule "NP → DT NN" (a noun phrase is a determiner and a noun), a lexicalized PCFG will specifically have rules like "NP(dog) → DT NN(dog)" or "NP(person)" etc. In practice this leads to some performance improvements.^[9]^[10]

More recent work uses neural scoring of span probabilities (which, unlike (P)CFGs, can take context into account) as input to CKY, for example by using a recurrent neural network or transformer^[11] on top of word embeddings.

In 2022, Nikita Kitaev et al.^[12] introduced an incremental parser that first learns discrete labels (out of a fixed vocabulary) for each input token given only the left-hand context, which are then the only inputs to a CKY chart parser with probabilities calculated using a learned neural span scorer. This approach is not only linguistically-motivated, but also competitive with previous approaches to constituency parsing. Their work won the best paper award at ACL 2022.

Transition-based

Following the success of $O(n)$ transition-based parsing for dependency grammars, work began on adapting the approach to constituency parsing. The first such work was by Kenji Sagae and Alon Lavie in 2005, which relied on a feature-based classifier to greedily make transition decisions.^[13] This was followed by the work of Yue Zhang and Stephen Clark in 2009, which added beam search to the decoder to make more globally-optimal parses.^[14] The first parser of this family to outperform a chart-based parser was the one by Muhua Zhu et al. in 2013, which took on the problem of length differences of different transition sequences due to unary constituency rules (a non-existent problem for dependency parsing) by adding a padding operation.^[15]

Note that transition-based parsing can be purely greedy (i.e. picking the best option at each time-step of building the tree, leading to potentially non-optimal or ill-formed trees) or use beam search to increase performance while not sacrificing efficiency.

Sequence-to-sequence

Dependency parsing

Transition-based

Many modern approaches to dependency tree parsing use transition-based parsing (the base form of this is sometimes called arc-standard) as formulated by Joakim Nivre in 2003,^[19] which extends shift-reduce parsing by keeping a running stack of tokens and choosing among three operations for the next token encountered:

LeftArc (current token is a child of the top of the stack, is not added to stack)
RightArc (current token is the parent of the top of the stack, replaces top)
Shift (add current token to the stack)

The algorithm can be formulated as comparing the top two tokens of the stack (after adding the next token to the stack) or the top token on the stack and the next token in the sentence.
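A minimal sketch of the "top two tokens of the stack" formulation is given below. Naming and direction conventions for the arc operations vary between presentations; this sketch assumes that `LEFT_ARC` makes the second-from-top stack item a dependent of the top item and that `RIGHT_ARC` makes the top item a dependent of the item beneath it. The function name `apply_transitions` and the example sentence are hypothetical.

```python
def apply_transitions(n_tokens, transitions):
    """Apply a transition sequence and return a dict token -> head.

    Tokens are numbered 1..n_tokens; 0 is an artificial ROOT node.
    """
    stack = [0]
    buffer = list(range(1, n_tokens + 1))
    heads = {}
    for action in transitions:
        if action == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT with an empty buffer")
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            # Assumed convention: second-from-top becomes a dependent of the top
            if len(stack) < 2 or stack[-2] == 0:
                raise ValueError("LEFT_ARC is not applicable")
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "RIGHT_ARC":
            # Assumed convention: top becomes a dependent of the item beneath it
            if len(stack) < 2:
                raise ValueError("RIGHT_ARC is not applicable")
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown transition: {action}")
    return heads


# "She eats fish": tokens 1 = She, 2 = eats, 3 = fish
sequence = ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC", "RIGHT_ARC"]
heads = apply_transitions(3, sequence)
```

Each token is shifted once and attached once, so a sentence of three tokens needs six transitions here; this is the source of the linear running time.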

Training data for such an algorithm is created by using an oracle, which constructs a sequence of transitions from gold trees which are then fed to a classifier. The classifier learns which of the three operations is optimal given the current state of the stack, buffer, and current token. Modern methods use a neural classifier which is trained on word embeddings, beginning with work by Danqi Chen and Christopher Manning in 2014.^[20] In the past, feature-based classifiers were also common, with features chosen from part-of-speech tags, sentence position, morphological information, etc.
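The oracle step can be illustrated with a short sketch of a static oracle for the top-two-of-stack formulation. It assumes the convention that `LEFT_ARC` attaches the second-from-top stack item to the top item and `RIGHT_ARC` attaches the top item to the one beneath it; the function name `static_oracle` and the snapshot format are hypothetical choices for this example.

```python
def static_oracle(heads):
    """Derive (stack, buffer, action) training examples from a gold tree.

    heads: dict token -> gold head for tokens 1..n, with 0 as ROOT.
    Raises ValueError if the tree cannot be built (e.g. it is non-projective).
    """
    n = len(heads)
    # Count how many dependents of each node are still unattached
    pending = {node: 0 for node in range(n + 1)}
    for head in heads.values():
        pending[head] += 1

    stack, buffer, examples = [0], list(range(1, n + 1)), []
    while buffer or len(stack) > 1:
        state = (tuple(stack), tuple(buffer))
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if below != 0 and heads[below] == top:
                examples.append((*state, "LEFT_ARC"))
                stack.pop(-2)
                pending[top] -= 1
                continue
            # Attach the top item only once it has collected all its dependents
            if heads[top] == below and pending[top] == 0:
                examples.append((*state, "RIGHT_ARC"))
                stack.pop()
                pending[below] -= 1
                continue
        if not buffer:
            raise ValueError("gold tree is not reachable (non-projective?)")
        examples.append((*state, "SHIFT"))
        stack.append(buffer.pop(0))
    return examples


# "She eats fish": She <- eats, fish <- eats, eats <- ROOT
gold_heads = {1: 2, 2: 0, 3: 2}
examples = static_oracle(gold_heads)
actions = [action for _, _, action in examples]
```

Each returned triple pairs a parser configuration with the gold action, which is the kind of example a classifier would be trained on after feature extraction or embedding lookup.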

This is an $O(n)$ greedy algorithm, so it does not guarantee the best possible parse or even a necessarily valid parse, but it is efficient.^[21] It is also not necessarily the case that a particular tree will have only one sequence of valid transitions that can reach it, so a dynamic oracle (which may permit multiple choices of operations) will increase performance.^[22]

A modification to this is arc-eager parsing, which adds another operation: Reduce (remove the top token on the stack). Practically, this results in earlier arc-formation.

The transition systems described so far support only projective trees, in which edges do not cross given the token ordering of the sentence. For non-projective trees, Nivre in 2009 modified arc-standard transition-based parsing to add the operation Swap (swap the top two tokens on the stack, assuming the formulation where the next token is always added to the stack first). This increases runtime to $O(n^{2})$ in the worst case, but in practice it remains near-linear.^[23]
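The crossing-edge condition can be checked directly. The sketch below assumes a head-dictionary encoding with an artificial ROOT at position 0 and treats arcs from ROOT like any other arc; the function name `is_projective` is hypothetical.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads: dict token -> head, tokens numbered 1..n, 0 = ROOT.
    """
    # Represent each arc by its (leftmost, rightmost) endpoints
    arcs = [(min(dep, head), max(dep, head)) for dep, head in heads.items()]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one endpoint of the second arc
            # lies strictly inside the first arc
            if a < c < b < d:
                return False
    return True


projective_tree = {1: 2, 2: 0, 3: 2}
crossing_tree = {1: 3, 2: 0, 3: 2}   # arc 3 -> 1 crosses arc ROOT -> 2
```

This pairwise check is quadratic in the number of tokens, which is acceptable for inspecting treebank sentences but is not how a parser enforces projectivity.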

Grammar-based

A chart-based dynamic programming approach to projective dependency parsing was proposed by Michael Collins^[24] in 1996 and further optimised by Jason Eisner^[25] in the same year.^[26] This is an adaptation of CKY (previously mentioned for constituency parsing) to headed dependencies, a benefit being that the only change from constituency parsing is that every constituent is headed by one of its descendant nodes. Thus, one can simply specify which child provides the head for every constituency rule in the grammar (e.g. an NP is headed by its child N) to go from constituency CKY parsing to dependency CKY parsing.

Collins's original adaptation had a runtime of $O(n^{5})$, and Eisner's dynamic programming optimisations reduced runtime to $O(n^{3})$. Eisner suggested three different scoring methods for calculating span probabilities in his paper.
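The $O(n^{3})$ dynamic program can be sketched with the commonly taught split into "complete" and "incomplete" spans. The version below is an arc-factored illustration that maximises a sum of arc scores; the function name `eisner`, the score-matrix encoding, and the toy scores are assumptions of this example, and it allows ROOT to take more than one dependent.

```python
def eisner(scores):
    """Return heads[m] for the best projective tree; heads[0] is -1.

    scores[h][m] is the score of an arc from head h to modifier m,
    with index 0 acting as ROOT.
    """
    n = len(scores)
    neg = float("-inf")
    LEFT, RIGHT = 0, 1  # LEFT: head is the right end; RIGHT: head is the left end
    comp = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    inc = [[[neg, neg] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[None, None] for _ in range(n)] for _ in range(n)]
    inc_bp = [[[None, None] for _ in range(n)] for _ in range(n)]

    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # Incomplete spans: join two complete halves, then add an arc i <-> j
            k = max(range(i, j),
                    key=lambda s: comp[i][s][RIGHT] + comp[s + 1][j][LEFT])
            joined = comp[i][k][RIGHT] + comp[k + 1][j][LEFT]
            inc[i][j][LEFT] = joined + scores[j][i]
            inc[i][j][RIGHT] = joined + scores[i][j]
            inc_bp[i][j][LEFT] = inc_bp[i][j][RIGHT] = k
            # Complete span headed at j
            k = max(range(i, j), key=lambda s: comp[i][s][LEFT] + inc[s][j][LEFT])
            comp[i][j][LEFT] = comp[i][k][LEFT] + inc[k][j][LEFT]
            comp_bp[i][j][LEFT] = k
            # Complete span headed at i
            k = max(range(i + 1, j + 1),
                    key=lambda s: inc[i][s][RIGHT] + comp[s][j][RIGHT])
            comp[i][j][RIGHT] = inc[i][k][RIGHT] + comp[k][j][RIGHT]
            comp_bp[i][j][RIGHT] = k

    heads = [-1] * n

    def backtrack(i, j, direction, complete):
        if i == j:
            return
        if complete:
            k = comp_bp[i][j][direction]
            if direction == LEFT:
                backtrack(i, k, LEFT, True)
                backtrack(k, j, LEFT, False)
            else:
                backtrack(i, k, RIGHT, False)
                backtrack(k, j, RIGHT, True)
        else:
            k = inc_bp[i][j][direction]
            if direction == LEFT:
                heads[i] = j
            else:
                heads[j] = i
            backtrack(i, k, RIGHT, True)
            backtrack(k + 1, j, LEFT, True)

    backtrack(0, n - 1, RIGHT, True)
    return heads


# Hypothetical arc scores for ROOT(0) She(1) eats(2) fish(3)
toy_scores = [[0.0] * 4 for _ in range(4)]
toy_scores[0][2] = 10.0   # ROOT -> eats
toy_scores[2][1] = 9.0    # eats -> She
toy_scores[2][3] = 9.0    # eats -> fish
best_heads = eisner(toy_scores)
```

The chart has $O(n^{2})$ spans and each is filled by scanning $O(n)$ split points, which gives the cubic bound; in a trained parser the scores would come from a learned model rather than a hand-written table.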

Graph-based

Exhaustive search of the possible $n^{2}$ edges in the dependency tree, with backtracking in the case an ill-formed tree is created, gives the baseline $O(n^{3})$ runtime for graph-based dependency parsing. This approach was first formally described by Michael A. Covington in 2001, but he claimed that it was "an algorithm that has been known, in some form, since the 1960s".^[27]

The problem of parsing can also be modelled as finding a maximum-probability spanning arborescence over the graph of all possible dependency edges, and then picking dependency labels for the edges in the resulting tree. Given this, we can use an extension of the Chu–Liu/Edmonds algorithm with an edge scorer and a label scorer. This algorithm was first described by Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič in 2005.^[28] Unlike the arc-standard transition-based parser and CKY, it can handle non-projective trees. As before, the scorers can be neural (trained on word embeddings) or feature-based. This runs in $O(n^{2})$ with Tarjan's extension of the algorithm.^[29]

Evaluation

The performance of syntactic parsers is measured using standard evaluation metrics. Both constituency and dependency parsing approaches can be evaluated for the ratio of exact matches (the percentage of sentences that were perfectly parsed), and for precision, recall, and F1-score, which are calculated from the number of correct constituency or dependency assignments in the parse relative to the total number in the hypothesis parse (precision) or the reference parse (recall). The latter are also known as the PARSEVAL metrics.^[30]
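For constituency parses, the bracket-based computation can be sketched as follows. This is a simplified illustration: it compares sets of labelled spans, so repeated identical brackets are counted once, and it applies none of the normalisation steps that evaluation tools usually perform. The function name `bracket_prf` and the example spans are hypothetical.

```python
def bracket_prf(gold, predicted):
    """Return (precision, recall, f1) over labelled spans (label, start, end)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "she saw stars with telescopes": gold attaches the PP to the verb phrase,
# while the hypothesis attaches it to the noun phrase
gold_spans = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("VP", 1, 3),
              ("NP", 2, 3), ("PP", 3, 5), ("NP", 4, 5)]
pred_spans = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5),
              ("NP", 2, 3), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall, f1 = bracket_prf(gold_spans, pred_spans)
```

Here six of the seven brackets match in each direction, so precision, recall, and F1 are all 6/7, even though the sentence would count as a miss for the exact-match metric.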

Dependency parsing can also be evaluated using attachment score. Unlabelled attachment score (UAS) is the percentage of tokens with correctly assigned heads, while labelled attachment score (LAS) is the percentage of tokens with correctly assigned heads and dependency relation labels.^[31]
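Both attachment scores follow directly from their definitions, as in this small sketch. The function name `attachment_scores`, the `(head, label)` encoding, and the relation labels in the example are assumptions for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as fractions in [0, 1].

    gold, predicted: equal-length lists of (head, label), one entry per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("parses must be non-empty and of equal length")
    pairs = list(zip(gold, predicted))
    # UAS: head matches; LAS: head and label both match
    uas = sum(g[0] == p[0] for g, p in pairs) / len(pairs)
    las = sum(g == p for g, p in pairs) / len(pairs)
    return uas, las


# "She eats fish": the hypothesis has every head right but one label wrong
gold_parse = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred_parse = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold_parse, pred_parse)
```

In this example UAS is 1.0 and LAS is 2/3, since LAS can never exceed UAS for the same pair of parses.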

Conversion between parses

Given that much work on English syntactic parsing depended on the Penn Treebank, which used a constituency formalism, many works on dependency parsing developed ways to deterministically convert the Penn formalism to a dependency syntax, in order to use it as training data. One of the major conversion algorithms was Penn2Malt, which reimplemented previous work on the problem.^[32]

Work in the dependency-to-constituency conversion direction benefits from the faster runtime of dependency parsing algorithms. One approach is using constrained CKY parsing, ignoring spans which obviously violate the dependency parse's structure and thus reducing runtime to $O(n^{2})$ .^[33] Another approach is to train a classifier to find an ordering for all the dependents of every token, which results in a structure isomorphic to the constituency parse.^[34]
