GLMs: Intuition behind the Link function and Derivation of Iteratively Reweighted Least Squares
Intuition behind the Link function, discussion of the various model fitting techniques and their advantages & disadvantages, derivation of IRLS using Newton-Raphson and Fisher Scoring
GLMs - A Natural Extension of the Linear Model
The first model we learn on any statistics-based course is Simple Linear Regression (SLR). Despite placing strong (linear) assumptions on the relationship between the response and covariates, as well as the error distribution if we are interested in statistical inference, the Linear model is a surprisingly useful tool for representing many natural processes. However, when we wish to deal with non-linear random variable generating processes, such as the probability of occurrence of an event from a binary or multinomial distribution, or for the modelling of counts within a given time period, we need to generalize the Linear Model.
Generalized Linear Models enable us to explicitly model the mean of a distribution from the exponential family, such that predictions lie in a feasible range and the relationship between the mean and the variance is appropriate to perform reliable inference. In this brief blog we discuss the intuition behind the Generalized Linear Model and Link function and prove the method of Iteratively ReWeighted Least Squares enables us to fit a GLM.
Generalized Linear Models have 3 components:
1) Random Component Error Structure
\[y_i \sim exponential \; family \; distibution\]
We also typically assume each \(y_i\) is independent and identically distributed, although this assumption can be relaxed through the use of Generalized Estimating Equations (GEE’s).
2) Systematic Component/ Linear Predictor
\[\eta_i = \beta_0 + \sum_{i = 1}^{p}\beta_p x_{i, p}\]
3) Link Function
\[\mathop{\mathbb{E}}[y_i | x_{1, i}, ..., x_{p, i}] = \mu_i\] \[g(\mu_i) = \eta_i\]
It should be made clear that we are not modellig the response \(y_i\) explicitly, but rather modelling the mean of the distibution, \(\mu_i\). Predictions for each observation \(i\) are given by \(\mu_i\), with each \(y_i\) assumed to be centred around \(\mu_i\) in expectation but with an error term that has a distibution specified by the member of the exponential family used. Therefore, the link function does not transform the response \(y_i\) but instead transforms the mean \(\mu_i\). Note that the linear model is a specific type of GLM, where \(y_i \sim Normal\) and \(g(\mu_i) = \mu_i = \eta_i = \beta_0 + \sum_{i = 1}^{p}\beta_p x_{i, p}\). For a Poisson GLM, each \(y_i\) is a r.v. simulated from the Poisson distribution with mean \(\mu_i\), hence \(y_i\) has a Poisson error distribution, the difference between \(y_i\) and \(\hat{y_i} = \mu_i\).
Why Bother with Parametric Assumptions
Advantages - Large amounts of data can be modelled as random variables from the exponential family of distributions - If there are relatively few observations, providing a structure for the model generating process can improve predictive performance - Enables us to carry out inference on model covariates - Simple and intuitive to understand. In some industries (such as insurance), this has huge benefits- it is transparent and can fit into a rate structure
Disadvantages - Validity of inference dependent on assumptions being satisfied - Places a very rigid structure on the relationship between the independent and dependent variables - Typically has worse predictive performance than non-linear models, such as Boosted Trees and Neural Networks, due to linearity and inability to account for complex interactions
Link Functions
So Generalized Linear models explicitly introduce a link function (the link function for the linear model is simply the identity, \(g(\mu) = \mu\)) and through the specification of a mean-variance relationship (the response belongs to a member of the exponential family). Using a link function allows us to transform values of the linear predictor to predictions of the mean, such that these predictions are always contained within the range of possible values for the mean, \(\mu_i\).
When choosing a link function, there is no ‘correct’ choice, however there are a few properties we require to be able to interpret and fit a model:
The link function transforms the linear predictor such that the prediction of \(\mu_i\) for each \(y_i\) is within the range of possible values of \(\mu\).
The link function must be monotonic and therefore have a unique inverse. That is, each value on the linear predictor must be mapped to a unique value of the mean and the link function must preserve the order/ranking of predictions.
The link function must be differentiable, in order to estimate model coefficients.
For OLS, the linear predictor \(X\beta\) can take on any value in the range \((-\infty, \infty)\). For a Poisson model, we require the rate parameter \(\mu\) (equivalent to the commonly used \(\lambda\)), to lie in the range \((0, \infty)\), thus we need a link function that transforms the linear predictor \(\eta\) to lie in this range. The common choice of link function for a Poisson GLM is the log-link (\(\log(\mu_i) = \eta_i\)). Exponentiating the linear predictor results in \(\mu_i \in (0, \infty)\) as required for count data modeled via a Poisson. The log-link also results in a nice interpretation, since it exponentiates the linear predictor resulting in a multiplication of exponentiated coefficients: \(log(\mu_i) = exp(\beta_0 + \beta_1 x_{1, i}) = \exp(\beta_0) \exp(\beta_1 x_{1, i})\). Note that we can’t use the square root function as a link function since it does not have a unique inverse (i.e. \(\sqrt(4) = \pm 2\)).
For the Binomial, we choose a link function that maps \(p_i\), the probability of success to the interval \([0, 1]\). The link function \(g(\mu_i) = log(\frac{p_i}{n - p_i}) = X \beta_i\) is one candidate. This transforms the Linear predictor: \(\frac{p_i}{1 - p_i} = \exp(X \beta_i)\) \(\implies p_i = (\frac{e^{X \beta_i}}{1 + e ^ {X \beta_i}}) \in [0, 1]\) to the required range.
Whilst we are able to choose any link function that satisfies these properties, the usual choice is to select the Canonical link function, which arises from writing the distribution in its exponential form. These link functions have nice mathematical properties and simplify the derivation of the Maximum Likelihood Estimators. We could also use an Information Criteria such as AIC to choose the best-fitting link function, although there is usually little deviation in performance, so the common choice is to use the link function with the most intuitive interpretation (which is often the canonical link function anyway).
Some common distributions and Canonical links are show below:
Family | Canonical Link (\(\eta = g(\mu)\)) | Inverse Link (\(\mu = g^{-1}(\eta)\)) |
---|---|---|
Binomial | Logit: \(\eta = log \left( \frac{\mu}{n - \mu} \right)\) | \(\mu = \frac{n}{1 + e^{-n}}\) |
Gaussian | Identity: \(\eta = \mu\) | \(\mu = \eta\) |
Poisson | Log: \(\eta = log(\mu)\) | \(\mu = e^{\eta}\) |
Gamma | Inverse: \(\eta = \frac{1}{\mu}\) | \(\mu = \frac{1}{\eta}\) |
The Exponential Family of Distributions
A random variable \(y\) has a distribution from the exponential family if its probability density function is of the form:
\[f(y;\theta, \phi) = exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)\]
where:
\(\theta\) is the location parameter of the distribution
\(a(\phi)\) is the scale/dispersion parameter
\(c(y, \phi)\) is a normalization term
It can also be shown that \(\mathop{\mathbb{E}}[y] = b'(\theta)\) and \(Var[y] = b''(\theta) a(\phi)\).
Likelihood Analysis and Newton-Raphson Method
The fitting of any form of Statistical Learning algorithm involves an optimization problem. Optimization refers to the task of either minimizing or maximizing some function \(f(\mathbf{\beta})\) by altering \(\beta\). We usually phrase most optimization problems in terms of minimizing \(f(\beta)\). For the case of parametric learning models (we place distributional assumptions on the target/response \(y_i\), such that they are drawn independently from some probability distribution \(p_{model}(y, \theta)\)), one can also use the process of Maximum Likelihood to find model coefficients. Under this approach, we have the following optimization problem:
\[\mathbf{\beta}^* =arg max(l(y_i | \mathbf{\beta}))\]
where \(l(y_i | \mathbf{\beta})\) is the likelihood function. The likelihood can be interpreted as the probability of seeing the data in our sample given the parameters and we naturally wish to maximize this quantity to obtain a good model fit.
Why Maximum Likelihood?
Simple and intuitive method to find estimates for any parametric model.
For large samples, MLE’s have useful properties, assuming large n and i.i.d (independent and identically distributed) samples.
\(\mathop{\mathbb{E}}[\hat{\theta}_{MLE}] = \theta\) (Unbiased)
\(Var(\hat{\theta}_{MLE}) = \frac{1}{n I(\theta)}\), where \(I(\theta)\) is the Fisher Information in the sample. That is, we can calculate the variance of model coefficients and hence perform inference
The MLE also achieves the Cramer Lower Bound, that is it has the smallest variance of any estimator, thus is Asymptotically Efficient.
However, there are several disadvantages to Maximum Likelihood, particularly if the purpose of modelling is for prediction. Under Maximum Likelihood, we fit exactly to the training dataset, resulting in overfitting and poor generalization to unseen data.
Consider the general form of the probability density function for a member of the exponential family of distributions:
\[f(y;\theta, \phi) = exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)\]
The likelihood is then (assuming independence of observations):
\[L(f(y_i)) = \prod_{i = 1}^n exp \left( \frac{1}{a(\phi)} (y_i \theta_i - b(\theta_i) + c(y_i, \phi) \right)\]
with log-likelihood (since the log of the product of exponentials is the sum of the exponentiated terms):
\[log(L(f(y_i))) = l(f(y_i)) = \sum_{i = 1}^n \frac{1}{a(\phi)}(y_i \theta_i - b(\theta_i)) + c(y_i, \phi)\]
Since the logarithmic function is monotonically increasing, order is preserved hence finding the maximum of the log-likelihood yields the same result as finding the maximum of the likelihood. We wish to maximize this log-likelihood, hence we can differentiate, equate to zero and solve for \(\beta_j\) (We could also ensure the second derivative evaluated at \(\beta_j\) is negative, therefore we have maximized (and not minimized) the log-likelihood). Via the Chain Rule we have:
Equation 1
\[\frac{\partial l(f(y_i))}{\partial \beta_j} = \sum_{i = 1}^n \frac{\partial l(f(y_i))}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta_j} = 0\]
This is also known as the Score function, since it tells us how sensitive the model is to changes in \(\beta\) at a given value of \(\beta\). Since we differentiate with respect to each coefficent, \(\beta_j\) (\(j \in [0, 1, ..., K]\)), we have a system of \((K + 1)\) equations to solve.
Chain Rule
The Chain Rule is often used in Optimization problems. Here we utilize the chain rule by recognizing that, as seen in Equation 1, the likelihood is a function of the location parameter of the distribution, \(\theta\), which in turn is a function of the mean \(\mu\). Via the link function, we have that \(g(\mu_i) = \eta_i\) and via the linear predictor, \(\eta_i = \beta_0 + \sum_{i = 1}^{p}\beta_p x_{i, p}\).
Each of these individual partial derivatives can then be identified:
\[\frac{\partial l_i}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)}\]
\[\frac{\partial \theta_i}{\partial \mu_i} = \frac{1}{V(\mu_i)}\]
since \(\frac{\mu_i}{\theta_i} = b''(\theta_i) = V(\mu_i)\) where V is the variance function of the model as it dictates the mean-variance relationship.
\[\frac{\partial \mu_i}{\partial \eta_i} = \frac{1}{g'(\mu_i)}\]
since \(g(\mu_i) = \eta_i \implies \frac{\partial \eta_i}{\partial \mu_i} = g'(\mu_i)\). Finally:
\[\frac{\partial \eta_i}{\partial \beta_j} = x_j\]
Putting this all together yields the Maximum Likelihood Equations:
Equation 2
\[\sum_{i = 1}^n \frac{(y_i - \mu_i) x_{i, j}}{a(\phi) V(\mu_i) g'(\mu_i)} = 0\]
Least Square Vs Maximum Likelihood
Under Gaussian assumptions, both Ordinary Least Squares (OLS) and Maximum Likelihood yield the same solution (assuming there is no collinearity/near collinearity between features such that the design matrix is invertible). Both methods aim to minimize the Residual Sum of Squares under these assumptions. One could also use numeric optimization, such as Gradient Descent, although this will not yield an exact solution.
Equation 2 does not have a closed form solution, except when we have a Gaussian Linear Model.
To solve the Equation 2, we must find coefficients \(\beta_j\) (which for each observation \(y_i\) affect our prediction of \(\mu_i\), \(g'(\mu_i)\) and \(V(\mu_i)\) via our distributional assumptions for the relationship between the mean and the variance), such that summing these terms over all observations yields 0. To solve this, we use the numerical Newton - Raphson Method, which when applied to the fitting of GLMs is knows as Iteratively Reweighted Least Squares.
Iteratively Reweighted Least Squares (IRLS)
Recall the Newton - Raphson method for a single dimension. We wish to find the root of the function; in this case the value of \(\beta\) such that the derivative of the log-likelihood is 0. In One-Dimension, to find the root of function f we have:
\[x_{t+1} = x_t - \frac{f(x_t)}{f'(x_t)}\]
The closer \(f(x_t)\) is to 0, the closer we are to the root, hence the step change between iterations will be smaller.
Instead of the log-likelihood function \(f(x)\), we want to optimize its derivative:
\[x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)}\]
or for the vector of coefficient estimates at iteration t:
\[\beta_{t+1} = \beta_t - \left( \frac{\partial l}{\partial \beta_j}(\beta_t) \right) \left( \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} \right)^{-1}\]
The Newton-Raphson technique is derived by considering the Taylor Expansion about the solution \(\beta^*\) (that sets \(\frac{\partial l}{\partial \beta}\) to zero).
\[0 = \frac{\partial l}{\partial \beta}(\beta^*) - (\beta - \beta^*) \frac{\partial^2 l}{\partial \beta_j \beta_k} + ...\]
If we ignore all derivative terms higher than \(2^{nd}\) order, we can derive an iterative solution. Generalizing to multiple dimensions we have:
\[\beta_{t + 1} = \beta_t - \mathbf{H}^{-1}_t \mathbf{U}_t\]
where \(\mathbf{U}_t\) is the Score Vector evaluated at \(\beta_t\). \(\mathbf{H}_t\) denotes the (p + 1) x (p + 1) Hessian matrix of second Derivatives.
Given the Score function \(\frac{\partial l}{\beta_j} = \nabla_{\beta} l = \frac{(y_i - \mu_i)}{a(\phi)} \frac{x_{i,j}}{V(\mu_i)}\frac{1}{g'(\mu_i)}\), the Hessian is:
Equation 3
\[\nabla^2_{\beta} = \sum_{i = 1}^n \frac{x_{i,j}}{a(\phi)} \left( (y_i - \mu_i)' \frac{1}{g'(\mu_i)} \frac{1}{V(\mu_i)} \right) + (y_i - \mu_i) \left( \frac{1}{g'(\mu_i)} \frac{1}{V(\mu_i)} \right) ' \]
by Product Rule. \(y_i\) is a data point so does not depend on \(\beta\). The partial derivative of \(\mu_i\) is then:
\[\frac{\partial \mu_i}{\partial \beta_k} = \frac{\mu_i}{\eta_i} \frac{\eta_i}{\beta_k} = \frac{1}{g'(\mu_i)} x_k\]
substituting, the first term of equation 3 becomes:
\[\frac{x_{i,j}}{a(\phi)} \left( - \frac{x_{i, k}}{g'(\mu_i)} \right) \frac{1}{g'(\mu_i)} \frac{1}{V(\mu_i)} = - \frac{x_{i, j} x_{i, k}}{a(\phi)(g'(\mu_i))^2} \frac{1}{V(\mu_i)}\]
Now consider the \(2^{nd}\) term in equation 3:
\[\left( \frac{1}{g'(\mu_i)} \frac{1}{V(\mu_i)} \right)'\]
Using Newton-Raphson, we would need to calculate this derivative. However, we can instead utilize Fisher Scoring, ensuring this derivative term cancels out.
Fisher Scoring
Fisher Scoring is a form of Newton’s Method used in statistics to solve Maximum Likelihood equations numerically. Instead of using the inverse of the Hessian, we use the inverse of the Fisher Information matrix.
\[\beta_{t+1} = \beta_t + J^{-1} \nabla l\] where \(J = \mathop{\mathbb{E}}[- \nabla^2 l]\), the expected value of the negative Hessian.
Taking the negative expected value from equation 3, the first term becomes
\[\mathop{\mathbb{E}} \left( \frac{x_{i,j} x_{i,k}}{a(\phi) (g'(\mu_i))^2} \frac{1}{V(\mu_i)} \right) = \frac{x_{i,j} x_{i,k}}{a(\phi) (g'(\mu_i))^2} \frac{1}{V(\mu_i)}\]
since none of the above values depend on \(y\) hence are all constant. The \(2^{nd}\) term in equation 3 becomes:
\[\mathop{\mathbb{E}} \left( - \frac{x_{i,j}}{a(\phi)}(y_i - \mu_i) \left( \frac{1}{g'(\mu_i) V(\mu_i)} \right) ' \right) \\ = - \frac{x_{i,j}}{a(\phi)}(y_i - \mu_i) \left( \frac{1}{g'(\mu_i) V(\mu_i)} \right)' \mathop{\mathbb{E}}(y_i - \mu_i)\]
but \(\mathop{\mathbb{E}}(y_i - \mu_i) = 0\) hence the second term of equation 3 vanishes. We then have:
\[J = \sum_{i = 1}^n \frac{x_{i,j} x_{i,k}}{a(\phi) (g'(\mu_i))^2 \frac{1}{V(\mu_i)}} = \mathbf{X}^T \mathbf{W} \mathbf{X}\]
where
\[\mathbf{W} = \frac{1}{a(\phi)} \begin{bmatrix} \frac{1}{V(\mu_1) (g'(\mu_1))^2} & & \\ & ... & \\ & & \frac{1}{V(\mu_n)(g'(\mu_n))^2}\\ \end{bmatrix}\]
From Fisher Scoring, we have:
\[\mathbf{\beta}_{t+1} = \mathbf{\beta_t} = \mathbf{J}^{-1} \nabla_{\beta} l (\beta_t)\]
We can rewrite \(\nabla_{\beta}l = \sum_{i = 1}^n \frac{y_i - \mu_i}{a(\phi)} \frac{1}{a(\phi)} \frac{x_{i,j}}{V(\mu_i g'(\mu_i))}\) as \(\mathbf{X}^T \mathbf{D} \mathbf{V}^{-1} (y - \mu)\) where:
\[\mathbf{D} = \begin{bmatrix} \frac{1}{g'(\mu_1)} & & \\ & ... & \\ & & \frac{1}{g'(\mu_n)} \\ \end{bmatrix}\]
\[\mathbf{V}^{-1} = \frac{1}{a(\phi)} \begin{bmatrix} \frac{1}{V(\mu_1)} & & \\ & ... & \\ & & \frac{1}{V(\mu_n)} \\ \end{bmatrix}\]
Then for the Fisher Equation:
\[\mathbf{\beta}_{t+1} = \mathbf{\beta}_t + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{D} \mathbf{V}^{-1} (\mathbf{y} - \mathbf{\mu})\]
We can write this more generally, by noting that \(\mathbf{W}\) is the same as \(\mathbf{D} \mathbf{V}^{-1}\), except we have \(\frac{1}{(g'(\mu_i))^2}\).
\[\implies \mathbf{D} \mathbf{V}^{-1} = \mathbf{W M}\]
where
\[\mathbf{M} = \begin{bmatrix} \frac{1}{g'(\mu_1)} & & \\ & ... & \\ & & \frac{1}{g'(\mu_1)} \\ \end{bmatrix}\]
\[\implies \beta_{t + 1} = \beta_t + (X^T W X)^{-1} X^T W M (y - \mu)\]
Each term can be calculated and used to generate an iterative model-fitting algorithm for updating the esimates \(\beta_t\) at each iteration. We then set a convergence condition such at if the increase in the value of the likelihood function between iterations is arbitrarily small (\(\epsilon\)), the algorithm stops and outputs the values of \(\beta_t\) at iteration \(t\), as estimates for \(\beta\).
This result can be further simplified
Equation 4- Iteratively Reweighted Least Squares
\[\beta_{t+1} = (X^T W X)^{-1} (X^T W X) \beta_t + (X^T W X)^{-1} X^T W M (y - \mu) \\ = (X^T W X)^{-1} X^T W (X \beta_t + M (y - \mu)) \\ = (X^T W X)^{-1} X^T W Z_t\]
where \(Z_t := \eta_t + M(y - \mu)\)
From Equation 4, the step size at each iteration mainly depends on \(Z_t\) and \(W\). Larger \((y - \mu)\) (in a relative sense) indicates a model which is poorly fit hence we require larger adjustments to \(\beta_t\) to converge more quickly to a reasonable fit before we start iterating in a more granular manner.
Recall, we have shown \(\frac{\partial l_i}{\beta_j} \propto \frac{1}{g'(\mu_i)}\). \(g'(\mu_i)\) is the derivative of the link function, giving the rate of change of model predictions w.r.t. \(\beta\). Thus, larger values of \(g'(\mu_i)\) indicate a model whose fit is sensitive to small changes in \(\beta\), so is not close to at least a local (ideally a global) optima. In this case, larger step changes are required.
The Hessian matrix of \(2^{nd}\) derivatives adjusts the step change depending on the curvature of the gradient of the likelihood function at \(\beta_t\), information encapsulated in the matrix \(W\), allowing one to optimize step changes depending on the shape of the likelihood function at \(\beta\). For example, if the \(2^{nd}\) derivative is negative (the gradient of the likelihood function is becoming more negative for a small increase \(\epsilon\) at \(x\)), the likelihood function curves downwards, so the likelihood will decrease by more than anticipated by the \(1^{st}\) derivative alone.
Derivation of Model Fitting Algorithms/Coefficients
This derivation of Iteratively Reweighted Least Squares for GLMs follows a similar procedure to the derivation of any numerical model fitting algorithm. Firstly, we identify an objective function over which to optimize. Typical Machine Learning problems involve minimizing some loss function, which measures the discrepency between actual and predicted target values. Gradient Descent and its variations are often used for solving such optimization problems.
Conclusion
Generalized Linear Models are used in a wide array of fields. GLM’s are particularly useful when we expect the target variable to have been generated by a distribution from the exponential family, with a mean-variance relationship proportional to that of the modelled distribution.
Whilst GLM’s tend to be outperformed by models capable of accounting for non-linearities and multi-dimensional interactions, they are highly appropriate for inference and may outperform more complex models if there is a lack of data available.