Formal Methods of Language Modelling
Author: Yiannis Charalambous
Reading Time: 10 minutes
Tags: PhD · Notes
The following are notes written for the paper arXiv:2311.04329. I will update this article with future notes. Please let me know if you find any mistakes, as there are probably many.
Please cite this text, or the source material, if you plan on using either.
Chapter 3.2
Symbol | Page | Meaning |
---|---|---|
$$p_\theta$$ | 61 | Current model |
$$p_{\theta^\star}$$ | | Optimal model (can never be known) |
$$\widetilde{p_{\theta^\star}}$$ | 62 | Optimal model approximation from corpus $\mathcal{D}$ |
$$p_{\hat{\theta}}$$ | | Estimated model (after minimizing the loss) |
$$p_M$$ | | Model |
$$\ell(\theta^\star,\theta)$$ | | Loss function (outputs the similarity between two distributions; in this case it takes the parameters of the distributions). Lower == good |
$$H(\widetilde{p_{\theta^\star}}, p_\theta)$$ | 63 | Cross Entropy function (applied loss function) |
$$\mathcal{Y}$$ | | Support of a distribution (all values in the domain that yield a non-zero probability). Note (Edoardo): at (3.56) we sum over the elements of this set, but this is not necessary; we could have summed over all values of the distribution. |
$$\delta_{x'}$$ | | Kronecker delta |
$$L(\theta)$$ | 64 | Likelihood of the corpus $\mathcal{D}$ under the distribution $p_\theta$. In practice we work with the log likelihood $\mathcal{L}(\theta)$ because it is convex and more numerically stable. Finding parameters that maximize the log-likelihood is equivalent to finding those that minimize the cross-entropy. |
$$\hat{\theta}_{\text{MLE}}$$ | | Maximum-likelihood estimate: the parameters that maximize the likelihood of the dataset $\mathcal{D}$ (another measure of similarity to $\mathcal{D}$). Bigger == good. |
$$\hat{y}_{<t}$$ | 66 | Model predictions from previous timesteps during training |
$$\mathcal{D}_{\text{train}} \text{ and } \mathcal{D}_{\text{test}}$$ | 68 | The data $\mathcal{D}$ split into two sets: the training set, used during training of the model, and the test set, used during model evaluation |
$$\mathcal{D}_{\text{val}}$$ | | Subset of the test data set used during training to evaluate whether the model has started to overfit, in which case early stopping will occur |
$$\nabla_{\theta_s}\ell(\theta_s)$$ | 69 | The gradient of the objective with respect to the current model parameters (see the sketch after this table) |
$$S \text{ or } T$$ | | Number of steps during gradient descent |
$$\eta$$ | | Learning rate schedule (one rate per step) |
$$\ell_2$$ | 72 | L2 regularization |
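A minimal sketch of the gradient-descent update behind $\nabla_{\theta_s}\ell(\theta_s)$, $S$, and $\eta$; the objective and the constant learning-rate schedule below are made up for illustration:

```python
import numpy as np

def gradient_descent(grad, theta0, etas):
    """Run gradient descent: theta_{s+1} = theta_s - eta_s * grad(theta_s)."""
    theta = np.asarray(theta0, dtype=float)
    for eta in etas:                 # the learning-rate schedule, one eta per step
        theta = theta - eta * grad(theta)
    return theta

# Illustrative objective l(theta) = ||theta||^2, whose gradient is 2 * theta.
grad = lambda theta: 2.0 * theta
etas = [0.1] * 50                    # constant schedule with S = 50 steps
theta_hat = gradient_descent(grad, theta0=[1.0, -2.0], etas=etas)
print(theta_hat)                     # approaches the minimizer at the origin
```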
Name | Page |
---|---|
Mean-seeking | 65
Exposure bias | 66 |
Data leakage | 68 |
Xavier Init | 71 |
Grokking | |
Label smoothing | 72 |
Read up on coding theory to understand cross-entropy and how it measures the difference between two distributions.
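A small numeric sketch of that connection, with a made-up corpus and made-up model probabilities: the cross-entropy $H(\widetilde{p_{\theta^\star}}, p_\theta)$ equals the average negative log-likelihood of the corpus, so minimizing one is the same as maximizing the other:

```python
import numpy as np

# Toy corpus over a three-symbol support and made-up model probabilities.
corpus = ["a", "b", "a", "c", "a", "b"]
p_model = {"a": 0.5, "b": 0.3, "c": 0.2}

# Empirical distribution (the optimal-model approximation from the corpus).
support, counts = np.unique(corpus, return_counts=True)
p_tilde = counts / counts.sum()

# Cross-entropy H(p_tilde, p_model) = -sum_y p_tilde(y) * log p_model(y).
cross_entropy = -sum(pt * np.log(p_model[y]) for y, pt in zip(support, p_tilde))

# Average negative log-likelihood of the corpus under the model.
avg_nll = -np.mean([np.log(p_model[y]) for y in corpus])

print(cross_entropy, avg_nll)   # identical: maximizing likelihood == minimizing cross-entropy
```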
The model $p_{LM}$ can be expressed in parameterized form as $p_{\theta^\star}$ for certain unknown parameters $\theta^\star\in\Theta^\star$.
$$\xrightarrow[p_\theta \rightarrow p_{\hat{\theta}} \rightarrow p_{\theta^\star}]{\text{Towards Optimal Model}}$$ But we can never reach $p_{\theta^\star}$.
Chapter 4.1
Finite-State Automata
Symbol | Page | Meaning |
---|---|---|
$$\text{FSA or } \mathcal{A}$$ | 76 | Finite-state automaton (see the sketch after this table) |
$$\Sigma$$ | | Alphabet |
$$Q$$ | | Finite set of states |
$$I$$ | | Set of initial states |
$$F$$ | | Set of final or accepting states |
$$\delta$$ | | Finite multiset whose elements are the transitions |
$$q$$ | 77 | Represents a state in the FSA; the state itself does not represent the context, which is represented by the transitions |
$$a$$ | | An individual transition from one state $q_1$ to another state $q_2$. The label of this transition corresponds to the input symbol at the current context |
$$\epsilon$$ | | Empty transition: allows moving from one state to another without consuming an input symbol |
$$L(\mathcal{A})$$ | 78 | The language accepted by an FSA $\mathcal{A}$ |
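A minimal sketch of an FSA as a tuple $(\Sigma, Q, I, F, \delta)$ with an acceptance check. The toy automaton (which accepts strings ending in "ab") is made up, and $\epsilon$-transitions are left out to keep it short:

```python
# FSA A = (Sigma, Q, I, F, delta); this toy automaton accepts strings over {a, b}
# that end in "ab". Epsilon transitions are omitted for brevity.
Sigma = {"a", "b"}
Q = {0, 1, 2}
I = {0}
F = {2}
delta = {(0, "a", 1), (0, "b", 0), (1, "a", 1), (1, "b", 2), (2, "a", 1), (2, "b", 0)}

def accepts(string):
    """Nondeterministic acceptance: track all states reachable after each symbol."""
    states = set(I)
    for symbol in string:
        states = {q2 for (q1, a, q2) in delta if q1 in states and a == symbol}
    return bool(states & F)

print(accepts("abab"))  # True
print(accepts("aba"))   # False
```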
Weighted Finite-State Automata
An extension of FSA
Symbol | Page | Meaning |
---|---|---|
$$\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \mathbb{R} \times Q$$ | 79 | A finite multiset of transitions |
$$\lambda$$ | | Initial weight function of the WFSA |
$$\rho$$ | | Final weight function of the WFSA |
$$I$$ | 80 | Initial states are states given non-zero initial weight |
$$F$$ | 80 | Final states are states given non-zero final weight |
$$\omega( q \xrightarrow{a/w} q' )$$ | | The weight $w$ of a transition |
$$T^{(a)}$$ | | Symbol-specific transition matrix for $a$-labelled transitions |
$$T$$ | | Full transition matrix: the sum of $T^{(a)}$ over all alphabet symbols $a \in \Sigma$ |
$$\pi$$ | 81 | A path: an element of $\delta^\star$ consisting of consecutive transitions |
$$p(\pi) \text{ and } \eta(\pi)$$ | | Origin and destination of a path |
$$s(\pi)$$ | | Yield of the path: the concatenation of the input symbols. Equivalent to $\mathbf{y}$ |
$$\Pi$$ | | The set of paths |
$$\Pi(\mathcal{A})$$ | 82 | The set of paths in automaton $\mathcal{A}$ |
$$\Pi(\mathcal{A}, \mathbf{y})$$ | | The set of paths in automaton $\mathcal{A}$ that yield $\mathbf{y}$ |
$$\Pi(\mathcal{A}, q, q')$$ | | The set of all paths of automaton $\mathcal{A}$ from state $q$ to state $q'$ |
$$\mathbf{w}_I$$ | | Inner path weight |
$$\mathbf{w}$$ | | Path weight |
$$\mathcal{A}(\mathbf{y})$$ | | Stringsum/stringweight/acceptance weight of a string $\mathbf{y}$ under the WFSA $\mathcal{A}$: the sum of the weights of all paths that yield $\mathbf{y}$ |
$$L(\mathcal{A})$$ | 83 | The weighted language of $\mathcal{A}$ |
$$L$$ | | Weighted regular language: $L$ is a weighted regular language if there exists a WFSA $\mathcal{A}$ such that $L=L(\mathcal{A})$ |
$$Z(\mathcal{A},q) \text{ or } \beta(q)$$ | | State-specific allsum / backward value of the state $q$ |
$$Z(\mathcal{A})$$ | | The allsum |
$$p_\mathcal{A}(\mathbf{y}) \stackrel{\text{def}}{=} \mathcal{A}(\mathbf{y})$$ | 86 | The stringsum of a string $\mathbf{y}$ in a probabilistic FSA is the probability of the string $\mathbf{y}$ |
$$p_\mathcal{A}(\mathbf{y}) \stackrel{\text{def}}{=} \frac{\mathcal{A}(\mathbf{y})}{Z(\mathcal{A})}$$ | | String probability in a general weighted FSA; this is tight, as it normalizes the probabilities. |
$$p_{LM_\mathcal{A}}(y) \stackrel{def}{=} p_\mathcal{A}(y)$$ | 87 | Language model induced by $\mathcal{A}$ as the probability distribution over $\Sigma^*$ |
$$T$$ | 88 | Transition matrix of automaton $\mathcal{A}$. Where $T_{ij}$ represents a transition weight between state $i$ and state $j$ |
$$T^0 \text{ or } I$$ | | The identity matrix: all elements are 0, except on the diagonal ($i = j$), where they are 1 |
$$\mathcal{G}$$ | | Weighted directed graph |
$$T^d$$ | | Allsum of all paths of length exactly $d$ |
$$T^{\le d}$$ | | Allsum of all paths of length at most $d$ |
$$T^*$$ | 88-89 | Pairwise pathsums over all possible paths, including paths of unbounded length. The matrix analogue of the geometric series: if $I-T$ has an inverse, then $T^*=(I-T)^{-1}$ (see the sketch after this table) |
$$\mathcal{A}_L$$ | 91 | Tight probabilistic FSA: contains the same transitions as $\mathcal{A}$, but re-weighted |
$$\lambda_{\mathcal{A}_L}$$ | | Initial weight function ($\lambda$) of the tight probabilistic FSA |
$$\rho_{\mathcal{A}_L}$$ | | Final weight function ($\rho$) of the tight probabilistic FSA |
$$w_{\mathcal{A}_L}$$ | | Re-weighted transitions of the tight probabilistic FSA |
$$p_\mathcal{A}(\pi)$$ | 92 | Finally: Probability of a path |
$$f^{\delta}_{\theta},\ f^{\lambda}_{\theta},\ f^{\rho}_{\theta}$$ | 92 | Score functions for a parameterized model used during training: one function each for the transition, initial, and final weights respectively |
$$\mathcal{A}_\theta$$ | | Parameterized automaton |
$$\delta_\theta,\ \lambda_\theta,\ \rho_\theta$$ | | Parameterized transition/initial/final weights |
$$\hat{p}^{\mathcal{A}_\theta}_{\text{GN}}$$ | 93 | Globally normalized energy function, which makes the parameterized FSA globally normalized |
$$\rho_s(M)$$ | 95 | Spectral radius of a matrix $M$ |
$$T'$$ | | Transition sum matrix |
$$\lVert A \rVert_\infty$$ | | Infinity matrix norm |
$$\Lambda(\cdot)$$ | | The set of eigenvalues of the matrix given as the argument "$\cdot$" |
$$y^{t-1}_{t-n+1}$$ | 97 | The context of $y_t$ |
$$SL_n$$ | 101 | Strictly n-local language |
$$SL$$ | | Strictly local language |
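A minimal numpy sketch tying these symbols together on a made-up two-state WFSA. The formulas used, $\mathcal{A}(\mathbf{y}) = \boldsymbol{\lambda}^\top T^{(y_1)} \cdots T^{(y_T)} \boldsymbol{\rho}$ for the stringsum and $Z(\mathcal{A}) = \boldsymbol{\lambda}^\top (I-T)^{-1} \boldsymbol{\rho}$ for the allsum, are my reading of pages 80–89, so treat this as a sketch rather than a reference implementation:

```python
import numpy as np

# Made-up two-state WFSA over Sigma = {a, b}: symbol-specific transition matrices,
# initial weights lambda and final weights rho.
T_a = np.array([[0.2, 0.3],
                [0.0, 0.1]])
T_b = np.array([[0.1, 0.0],
                [0.4, 0.2]])
T = T_a + T_b                        # full transition matrix (sum over Sigma)
lam = np.array([1.0, 0.0])           # initial weights
rho = np.array([0.1, 0.3])           # final weights

def stringsum(y):
    """A(y) = lambda^T T^(y_1) ... T^(y_n) rho (acceptance weight of the string y)."""
    M = {"a": T_a, "b": T_b}
    v = lam
    for sym in y:
        v = v @ M[sym]
    return v @ rho

# Allsum over all paths: Z(A) = lambda^T (I - T)^(-1) rho, valid here because the
# spectral radius of T is below 1, so the matrix geometric series converges.
T_star = np.linalg.inv(np.eye(2) - T)
Z = lam @ T_star @ rho

print(stringsum("ab"), Z)
```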
Transitions
- Transition weight: written on the edge after the symbol $a \equiv y$, separated by a slash (/).
- Final weight: written in the node, after the state number and a slash (/).
- Initial weight: Written on the arrow that points to the starting state.
Inner Path Weight: the product of the transition weights along the path (initial and final weights not included)
Full Path Weight: initial weight of the origin × inner path weight × final weight of the destination node
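A small made-up example: for a path $\pi = q_0 \xrightarrow{a/0.5} q_1 \xrightarrow{b/0.4} q_2$ with $\lambda(q_0)=1$ and $\rho(q_2)=0.1$,
$$\mathbf{w}_I(\pi) = 0.5 \cdot 0.4 = 0.2, \qquad \mathbf{w}(\pi) = \lambda(q_0)\cdot\mathbf{w}_I(\pi)\cdot\rho(q_2) = 1 \cdot 0.2 \cdot 0.1 = 0.02$$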
Accessible States: a state $q$ is accessible if it can be reached from an initial state $q_i$ with $\lambda(q_i) \ne 0$
Co-Accessible States: a state $q$ is co-accessible if it can reach a state $q_\phi$ with $\rho(q_\phi) \ne 0$
Probabilistic Finite State Automata (PFSA) (pg. 84)
$$\sum_{q \in Q}{\lambda(q)} = 1$$
For all $q \in Q$ and all transitions it holds that:
$$\lambda(q) \ge 0$$
$$\rho(q) \ge 0$$
$$w \ge 0$$
$$\sum_{q\xrightarrow{a/w}q'}{w} + \rho(q) = 1$$
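A minimal sketch of checking these PFSA conditions on a made-up automaton (the states, weights, and transitions below are illustrative; the representation is just dictionaries and tuples, mirroring the earlier sketches):

```python
# Check whether a WFSA is probabilistic (PFSA): initial weights form a distribution,
# all weights are non-negative, and each state's outgoing weights plus its final
# weight sum to 1.
lam = {0: 1.0, 1: 0.0}                                # initial weights
rho = {0: 0.1, 1: 0.5}                                # final weights
delta = [(0, "a", 0.4, 1), (0, "b", 0.5, 0),           # (q, symbol, weight, q')
         (1, "a", 0.2, 0), (1, "b", 0.3, 1)]

def is_probabilistic(lam, rho, delta, tol=1e-9):
    if abs(sum(lam.values()) - 1.0) > tol:
        return False                                   # initial weights must sum to 1
    if any(w < 0 for w in lam.values()) or any(w < 0 for w in rho.values()):
        return False
    for q in lam:
        out_weights = [w for (q1, _, w, _) in delta if q1 == q]
        if any(w < 0 for w in out_weights):
            return False
        if abs(sum(out_weights) + rho[q] - 1.0) > tol:
            return False                               # local normalization per state
    return True

print(is_probabilistic(lam, rho, delta))   # True for this toy automaton
```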
Chapter 4.2
Symbol | Page | Meaning |
---|---|---|
$$\mathcal{G} = (\Sigma, \mathcal{N}, S, \mathcal{P})$$ | 108 | Context-free grammar (CFG): the alphabet of terminal symbols, a non-empty set of non-terminal symbols, a designated start non-terminal symbol, and the set of production rules |
$$p$$ | | A rule |
$$Y$$ | | The left-hand side of a production rule (a non-terminal). The whole structure is $p = (Y\rightarrow a)$ |
$$X\stackrel{*}{\Rightarrow}_\mathcal{G}A$$ | 110 | For context-free grammar $\mathcal{G}$, $A$ can be derived from $X$ |
$$L(\mathcal{G})$$ | | The language generated by $\mathcal{G}$ |
$$d$$ | 110 | Derivation tree |
$$s(d)$$ | String yielded by derivation tree | |
$$\mathcal{D_G}(y)$$ | 111 | The string derivation set |
Unambiguous | | A grammar is unambiguous if each string associated with it can only be associated with one derivation tree |
$$\mathcal{D_G}$$ | 112 | Derivation set of a grammar (of the language essentially) |
$$(X\rightarrow \mathscr{a}) \in d$$ | | Refers to a specific production rule in a tree |
$$D(k)$$ | 113 | Dyck-$k$ languages, where $k$ is the number of types of well-nested brackets |
$$\mathbf{w}(d)$$ | 117 | The weight of a single derivation tree (see the sketch after this table) |
$$\mathcal{G}(y)$$ | 118 | The stringsum of a string $y$ |
$$Z(\mathcal{G}, Y)$$ | | The allsum for a non-terminal $Y$ in a grammar $\mathcal{G}$ |
$$Z(\mathcal{G}, a)$$ | | The allsum for a terminal $a \in \Sigma \cup \{\epsilon\}$ equals 1 |
$$Z(\mathcal{G})=Z(\mathcal{G}, S)$$ | | The allsum of a WCFG; this could be 1 if the grammar is pruned, probabilistic, and normalized (I think) |
$$p_\mathcal{G}(y)$$ | 119 | The stringsum/probability of string $y$ in a PCFG |
$$p_{LM_\mathcal{G}}(y) \stackrel{\text{def}}{=} p_\mathcal{G}(y)$$ | 120 | The language model induced by a WCFG as a probability distribution over $\Sigma^*$ |
$$\gamma_n$$ | | The level of a generation sequence in a CFG |
$$g(s_1, \ldots, s_N)$$ | 121 | Production generating function. Represents all the ways a non-terminal can be expanded, along with the probabilities and how often each non-terminal appears in the expansions |
$$\mathcal{P}$$ | 126 | Pushdown automaton (PDA) |
$$\Gamma$$ | | PDA: finite set of stack symbols |
$$\delta$$ | | PDA: multiset transition function |
$$q_i\,\gamma_i$$ | | PDA: initial configuration |
$$q_\phi\,\gamma_\phi$$ | | PDA: final configuration |
$$\tau$$ | 127 | PDA scanning/non-scanning element, $\tau\in\delta$ |
$$\pi$$ | | A run: a sequence of transitions and configurations in a PDA, generalizing a path in an FSA |
$$\Pi(\mathcal{P}, y)$$ | | The set of runs that scan $y$ |
$$\Pi(\mathcal{P})$$ | | The set of all accepting runs of $\mathcal{P}$ |
$$L(\mathcal{P})$$ | | The language recognized by $\mathcal{P}$: the set of all strings $y$ for which there is an accepting run that scans $y$ |
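Back on the WCFG side, here is a minimal sketch of the derivation-tree weight $\mathbf{w}(d)$ and yield $s(d)$, assuming (as in the probabilistic case) that a tree's weight is the product of the weights of the rules it uses; the grammar and tree are made up. The stringsum $\mathcal{G}(y)$ would then sum this weight over every tree in $\mathcal{D_G}(y)$:

```python
from math import prod

# Made-up weighted CFG (probabilistic here): production rule -> weight.
rules = {
    ("S", ("A", "B")): 1.0,
    ("A", ("a",)): 0.6,
    ("A", ("a", "a")): 0.4,
    ("B", ("b",)): 1.0,
}

# A derivation tree as (non-terminal, [children]); leaves are terminal symbols.
d = ("S", [("A", ["a"]), ("B", ["b"])])

def tree_weight(node):
    """w(d): product of the weights of all production rules used in the tree."""
    if isinstance(node, str):                  # terminal leaf contributes weight 1
        return 1.0
    lhs, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rules[(lhs, rhs)] * prod(tree_weight(c) for c in children)

def tree_yield(node):
    """s(d): the string spelled out by the leaves of the derivation tree."""
    if isinstance(node, str):
        return node
    return "".join(tree_yield(c) for c in node[1])

print(tree_weight(d), tree_yield(d))           # 0.6 ab
```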
Chapter 5.1
Symbol | Page | Meaning |
---|---|---|
$$h$$ | 140 | RNN hidden states |
$$\mathcal{R}$$ | 142 | An RNN. Real-valued RNNs use hidden states in $\mathbb{R}^D$, while rational-valued ones use $\mathbb{Q}^D$ |
$$\Sigma$$ | | Alphabet of input symbols |
$$D$$ | | The dimension of $\mathcal{R}$ |
$$\text{f}$$ | | The dynamics map: a function defining the transitions between hidden states. Essentially this encodes the hidden-state transitions in a function, as opposed to a look-up table as we saw with FSAs and context-free grammars (see the sketch after this table) |
$$h_0$$ | | The initial state |
$$\text{enc}_\mathcal{R}(y_{\le t+1})$$ | 143 | The encoding function that uses an RNN to recursively encode strings of arbitrary length, using its recurrent dynamics map |
$$h_t$$ | 144 | The hidden state that describes the state of $\mathcal{R}$ after reading $y_t$ |
$$[[y]]$$ | 146 | One-hot encoded string $y$ |
$$d_n(y)$$ | | Function for the one-hot encoding: it assigns the $n$-th canonical basis vector, where $n$ is a bijection (an ordering of the alphabet assigning an index to each symbol in $\Sigma$) |
$$s|x|$$ | 147 | |
$$\mathbf{U}, \mathbf{V}, \mathbf{b}_h$$ | 150 | For Elman and Jordan networks: $\mathbf{U}$ linearly transforms the previous hidden state (recurrence matrix); $\mathbf{V}$ linearly transforms the representations of the input symbols (input matrix); the output matrix linearly transforms the hidden states before computing the output values $r_t$; $\mathbf{b}_h$ is the hidden bias vector |
$$e'$$ | | Fixed embedding function |
$$\text{g}$$ | 153 | An RNN gate vector; elements can only be 0 or 1 |
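A minimal numpy sketch of an Elman-style recurrence matching these symbols, assuming one-hot input embeddings and a sigmoid dynamics map; the alphabet, dimension $D$, and weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = ["a", "b", "c"]                         # alphabet
D = 4                                           # hidden dimension of R

# Made-up parameters: recurrence matrix U, input matrix V, hidden bias b_h.
U = rng.normal(size=(D, D))
V = rng.normal(size=(D, len(Sigma)))
b_h = rng.normal(size=D)
h0 = np.zeros(D)                                # initial hidden state

def one_hot(symbol):
    """[[y]]: the canonical basis vector for the symbol's index in Sigma."""
    e = np.zeros(len(Sigma))
    e[Sigma.index(symbol)] = 1.0
    return e

def encode(string):
    """enc_R(y): fold the dynamics map f over the string, returning the final hidden state."""
    h = h0
    for y in string:
        h = 1.0 / (1.0 + np.exp(-(U @ h + V @ one_hot(y) + b_h)))   # h_t = f(h_{t-1}, y_t)
    return h

print(encode("abba"))
```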
Chapter 5.2
For HRNN to PFSA (From Lemma 5.2.1)
Symbol | Page | Meaning |
---|---|---|
$$s$$ | 159 | Bidirectional mapping from hidden state to an index |
For PFSA to HRNN (From Lemma 5.2.2: Elman RNNs can encode PFSAs)
Symbol | Page | Meaning |
---|---|---|
$$n(q, y)$$ | 161 | Maps state symbol pair to an integer |
$$m(y)$$ | | Maps the symbol to an integer |
$$\hat{m}(y)$$ | | Same as $m(y)$, but with EOS mapped as well |
$$U$$ | 163 | The entries of each column $U_{:,n(q,y)}$ are set up to indicate the states that are reachable from state $q$. The entry of $\mathbf{U}$ at row $n(q', y')$ of column $U_{:,n(q,y)}$ is set to $1$ if there exists a transition in the automaton that moves from state $q$ to $q'$ upon reading input $y'$. |
$$V$$ | | The input matrix $V$ lives in $\mathbb{B}^{\lvert \Sigma \rvert \lvert Q \rvert \times \lvert \Sigma \rvert}$ and encodes the information about which states can be reached by which symbols (from any state in $\mathcal{A}$) |
Conjunction of $V$ and $U$ | | By combining the encoding of the current state from $U$ and the encoding of the input symbol from $V$, the system produces a single non-zero entry, representing the unique state reachable from $q_t$ using symbol $y_{t+1}$ (see the sketch after this table) |
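The conjunction row above is easiest to see with numbers: it is a logical AND implemented by thresholding. A minimal sketch of that trick with made-up binary vectors (not a full construction of $U$ and $V$):

```python
import numpy as np

def heaviside(x):
    """H(x): 1 where x > 0, else 0 (the activation of the Heaviside Elman RNN)."""
    return (x > 0).astype(float)

# Made-up binary encodings: entries of U @ h_t (candidate next (state, symbol) pairs
# reachable from the current state) and of V @ [[y_{t+1}]] (pairs consistent with
# the next input symbol).
u_part = np.array([1, 1, 0, 0, 1, 0], dtype=float)
v_part = np.array([0, 1, 0, 1, 0, 0], dtype=float)

# Conjunction via thresholding: only positions where both encodings are 1 survive,
# yielding the one-hot encoding of the unique next (state, symbol) pair.
h_next = heaviside(u_part + v_part - 1.0)
print(h_next)   # [0. 1. 0. 0. 0. 0.]
```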
Chapter 5.3
Symbol | Page | Meaning |
---|---|---|
$$\mathcal{T}$$ | 210 | Transformer network $(\Sigma, D, \text{enc}_\mathcal{T})$, where $\Sigma$ is the input alphabet, $D$ is the dimension of $\mathcal{T}$, and $\text{enc}_\mathcal{T}$ is the encoding function |
$$h_t$$ | | Hidden states of the transformer; unlike in RNNs, $h_t = \text{enc}_\mathcal{T}(y_{\le t})$ has no dependency on previous hidden states |
$$a_t$$ | 211 | The attention mechanism value of hidden state $h_t$ |
$$\text{Att}$$ | | The attention mechanism function |
$$q, K, V$$ | | Query, key, and value of the attention mechanism. Each symbol has a query vector assigned to it. |
$$\text{T}$$ | 214 | Transformer layer |
$$\mathbf{X}, \mathbf{Z}$$ | | The input and output of the attention mechanism; $a_t$ is an intermediate value generated in this process. $\mathbf{X}$ is the initial static embedding of the sequence, but after passing through the first transformer layer it becomes a layer's input. $\mathbf{Z}$ is the output of a single transformer layer, so the previous layer's $\mathbf{Z}$ output becomes the next layer's $\mathbf{X}$ input |
$$O(a_t)$$ | | The output transformation from attention values to a transformer layer output value $\mathbf{z}_t$. In the text it is left generic, as it can probably be customized, but it is usually a multi-layer perceptron (MLP) |
Attention layer transformation:
$$\mathbf{X} \xrightarrow[]{\text{Att()} + \text{x}} a \xrightarrow[]{O() + a} \mathbf{Z}$$
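A minimal numpy sketch of one attention head following this pipeline, with made-up dimensions, the usual softmax scoring with $1/\sqrt{D}$ scaling, a causal mask, and a small MLP standing in for the generic $O(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)
T_len, D = 5, 8                                   # sequence length and model dimension
X = rng.normal(size=(T_len, D))                   # input embeddings (one row per symbol)

# Made-up projection matrices for queries, keys, and values.
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    """a_t for every position: softmax(q_t K^T / sqrt(D)) V, with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(D)
    scores = np.where(np.tril(np.ones((T_len, T_len))) == 1, scores, -np.inf)
    return softmax(scores) @ V

def output_map(a):
    """O(a_t): here a one-hidden-layer MLP mapping attention values to z_t."""
    W1, W2 = rng.normal(size=(D, 2 * D)), rng.normal(size=(2 * D, D))
    return np.maximum(a @ W1, 0.0) @ W2

A = attention(X) + X                              # a = Att(X) + x (residual connection)
Z = output_map(A) + A                             # z = O(a) + a (second residual)
print(Z.shape)                                    # (5, 8)
```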
Exercises (Page numbers)
- Proof of allsum $T^d$: 88
- Chapter 4.1: 136