Gauss–Markov Theorem and the Proof of BLUE
Digging into what OLS actually guarantees — the 5 assumptions behind linear regression and why they make your estimates BLUE according to the Gauss-Markov theorem.
Remember that earlier post where I was studying derivatives and poked at linear regression in finance? I figured I’d end up linking to the regression part from a million different places, so I yanked it out into its own post…
In that one I handled it pretty lightly. We set the regression line as $y = ax + b$, played the little game of finding the $a$ and $b$ that minimize the sum of squared errors (SSE), and pinned them down with simple differentiation.
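If you want to replay that little game concretely, here is a minimal sketch with a made-up five-point dataset (nothing from the original post), letting sympy do the differentiating and solving:

```python
import sympy as sp

# Toy data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

a, b = sp.symbols("a b")

# SSE as a symbolic function of the slope a and intercept b.
sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# "Simple differentiation": set both partial derivatives to zero and solve.
solution = sp.solve([sp.diff(sse, a), sp.diff(sse, b)], [a, b])
print(solution)  # the least-squares slope and intercept for this toy data
```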
But what does any of that actually mean? What are we quietly assuming? Why are we doing it that way? Let’s dig into all of that here.
Quick notation switch first. From now on, instead of $a, b$, I’m going to write the regression line as
$$y_i = \beta_1 + \beta_2 x_i + e_i,$$

with sample estimates $b_1, b_2$ — like that. Haven’t really mentioned it yet, but $b$ is an estimate and $\beta$ is the parameter being estimated. I’ll come back to the estimator/estimate distinction in a sec.
OK so. The big question. When we do regression — the thing that minimizes SSE — is that even a good thing?
If you press me with “well, if it’s good, why is it good??????”, the honest answer is: “That method isn’t always good ^^. But when certain conditions hold, you can call it the best!!!!” So let me lay out those ‘certain conditions’ first.
1. For each value of $x_i$, the corresponding $y$ is linear: $y_i = \beta_1 + \beta_2 x_i + e_i$.
2. The expected value of the random error $e$ is zero: $E(e_i) = 0$. (Given the model, this is the same as saying $E(y_i) = \beta_1 + \beta_2 x_i$.)
3. The variance of the random error $e$ is constant: $\text{var}(e_i) = \sigma^2$. ($y_i$ and $e_i$ have the same variance, since they differ only by the non-random shift $\beta_1 + \beta_2 x_i$.)
4. For any pair of errors, $\text{cov}(e_i, e_j) = 0$. (i.e. they’re uncorrelated; one error doesn’t drag another along with it.)
5. The variable $x$ is not random, and it has to take at least two distinct values.
6. (Optional.) $e$ is normally distributed, centered at its mean $E(e) = 0$.
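To make those assumptions concrete, here is a minimal simulation sketch (my own toy setup, not anything from the original post): a fixed, non-random $x$, and errors that are mean-zero, homoskedastic, and uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption 5: x is fixed (non-random) and takes more than one distinct value.
x = np.linspace(0, 10, 50)

# True parameters, chosen arbitrarily for the demo.
beta1, beta2, sigma = 1.0, 2.0, 1.5

# Assumptions 2-4 (plus the optional normality): independent N(0, sigma^2) draws,
# so the errors are mean-zero, share the same variance, and are uncorrelated.
e = rng.normal(loc=0.0, scale=sigma, size=x.size)

# Assumption 1: y is linear in x plus the error.
y = beta1 + beta2 * x + e
```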
If you squish those 5 down into pure formula form, you get just 3:
$$E(y_i) = \beta_1 + \beta_2 x_i$$

$$\text{var}(y_i) = \sigma^2$$

$$\text{cov}(y_i, y_j) = 0 \quad (i \neq j)$$

OK so. Here’s the situation I’m about to talk about.
We have some data. Using the method from earlier — least squares, the SSE-minimizer — we get estimates
$$b_1, b_2 \quad \text{for} \quad \beta_1, \beta_2,$$

and a fitted regression line $\hat{y} = b_1 + b_2 x$.
And the doubt creeps in: can we even trust $b_1, b_2$?????
The answer: “If your data satisfies those 5 assumptions, then the regression line $\hat{y} = b_1 + b_2 x$ you drew is plenty trustworthy ^^.”
“It’s the best!”
“It’s BLUE — Best Linear Unbiased Estimator ^^*”
— that’s what we get to say. But why? How? Who’s guaranteeing it?
The thing guaranteeing it is the Gauss-Markov Theorem.
OK let me try to picture those conditions a bit.
First, what does $E(e) = 0$ look like? It means the errors scatter around the regression line in a balanced way, and the line itself sits exactly at the spot where the positive and negative errors cancel out on average.
The second one — $\text{var}(e_i) = \sigma^2$ — is the homoskedasticity condition. The picture: same variance for any $x$. Doesn’t matter where you are on the x-axis, the spread is the same.
And the third — yeah, the third says the errors aren’t correlated with each other… and that’s gotta be important, right? No error drags another one along with it (strictly speaking it’s zero covariance, not full independence). (If this one breaks, we’re walking into autocorrelation-model territory, right?)
So: when the data plays nice with all of those assumptions, then!! the estimators we cooked up out of the method of least squares —
$$b_2 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad b_1 = \bar{y} - b_2 \bar{x}$$— are Best.
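As a sanity check, those closed-form estimates are a couple of lines of numpy. Here is a sketch that reuses the simulated x and y from above (my own toy data) and compares against np.polyfit:

```python
import numpy as np

def least_squares(x, y):
    """Closed-form least-squares estimates: intercept b1 and slope b2."""
    x_bar, y_bar = x.mean(), y.mean()
    b2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b1 = y_bar - b2 * x_bar
    return b1, b2

b1, b2 = least_squares(x, y)   # x, y from the simulation sketch above
print(b1, b2)
print(np.polyfit(x, y, 1))     # returns [slope, intercept]; should agree
```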
OK now since estimator vs estimate is going to start feeling muddled, let me clean it up.
- “The number you get when you plug a specific sample into a formula” — that’s an estimate.
- “The rule or formula you set up to get that number” — that’s an estimator.
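In code terms (continuing my toy snippets from above, so `x`, `rng`, `beta1`, `beta2`, `sigma`, and `least_squares` are all my own illustration), the estimator is the function itself, and an estimate is the particular number it returns for one particular sample:

```python
# The estimator: the rule itself, i.e. the function least_squares.
estimator = least_squares

# Two different samples give two different estimates from the same estimator.
y_sample_1 = beta1 + beta2 * x + rng.normal(0.0, sigma, x.size)
y_sample_2 = beta1 + beta2 * x + rng.normal(0.0, sigma, x.size)

print(estimator(x, y_sample_1))  # one estimate of (beta1, beta2)
print(estimator(x, y_sample_2))  # a different estimate of the same parameters
```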
We’re now going to describe the characteristics of the least-squares estimator.
First: “if the expected value of the estimator lines up with the parameter it’s estimating” → “we call that estimator unbiased.”
I’ll show $b_1, b_2$ are unbiased. It’s easy. We just have to show
$$E(b_1) = \beta_1, \quad E(b_2) = \beta_2.$$

Let me massage $b_2$ a little first.
$$b_2 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \sum w_i \, y_i, \quad w_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2}$$

We rework it to that weighted-sum form first (the $\bar{y}$ part drops out of the numerator because $\sum (x_i - \bar{x}) = 0$, which is what lets us write $b_2 = \sum w_i y_i$). And then
$$b_2 = \sum w_i y_i = \sum w_i (\beta_1 + \beta_2 x_i + e_i) = \beta_2 + \sum w_i e_i$$

— like that. (You can quickly check $\sum w_i = 0$ and $\sum w_i x_i = 1$, which kills the $\beta_1$ term and turns the $\beta_2$ term into just $\beta_2$.)
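In case those two identities don’t feel obvious, each one is a quick check:

$$\sum w_i = \frac{\sum (x_i - \bar{x})}{\sum_j (x_j - \bar{x})^2} = \frac{0}{\sum_j (x_j - \bar{x})^2} = 0,$$

$$\sum w_i x_i = \frac{\sum (x_i - \bar{x})\, x_i}{\sum_j (x_j - \bar{x})^2} = \frac{\sum (x_i - \bar{x})(x_i - \bar{x})}{\sum_j (x_j - \bar{x})^2} = 1,$$

where the second one swaps $x_i$ for $x_i - \bar{x}$ in the numerator, which is free because $\sum (x_i - \bar{x})\bar{x} = 0$.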
Now take expectations of both. In formulas:

$$E(b_2) = \beta_2 + \sum w_i E(e_i) = \beta_2 + 0 = \beta_2,$$

$$E(b_1) = E(\bar{y} - b_2 \bar{x}) = E(\bar{y}) - \bar{x}\, E(b_2) = (\beta_1 + \beta_2 \bar{x}) - \bar{x}\beta_2 = \beta_1.$$

Boom. Confirmed: our estimator is unbiased.
OK now — as the actual basis for why we can trust $b_1, b_2$ — we go compute $\text{var}(b_1), \text{var}(b_2)$. The smaller the variance, the more we get to say “yeah, you can trust it!!” — right????
Let’s go again:

$$\text{var}(b_2) = \text{var}\!\left( \sum w_i y_i \right) = \sum w_i^2 \,\text{var}(y_i) = \sigma^2 \sum w_i^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}.$$

(The cross terms vanish when the variance passes through the sum, because $\text{cov}(y_i, y_j) = 0$.) From here on out I’m going to type stuff in the formula editor as much as I possibly can… please forgive me just this once T_T
So the variance of the estimator $b_2$ (estimating $\beta_2$) looks like that. The real question: is that value actually the minimum???? Let’s prove it!!!
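Before diving into the proof, here is a quick Monte Carlo sanity check (my own sketch, with made-up parameter values) that the simulated $b_2$’s really do center on $\beta_2$ and have variance close to $\sigma^2 / \sum (x_i - \bar{x})^2$:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0, 10, 50)            # fixed, non-random regressor
beta1, beta2, sigma = 1.0, 2.0, 1.5   # made-up true values

b2_draws = []
for _ in range(20_000):
    y = beta1 + beta2 * x + rng.normal(0.0, sigma, x.size)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b2_draws.append(b2)

b2_draws = np.array(b2_draws)
print(b2_draws.mean())                         # ~ 2.0, i.e. unbiased
print(b2_draws.var())                          # empirical variance of b2
print(sigma**2 / np.sum((x - x.mean()) ** 2))  # the formula above; should be close
```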
Suppose there’s some other estimator $b_2^*$…

(Honestly, typing all of this in the formula editor is really something lol…)

Let’s say that thing exists and has the form

$$b_2^* = \sum c_i y_i.$$
For $b_2^*$ to be linear and unbiased, the $c_i$ have to satisfy $\sum c_i = 0$ and $\sum c_i x_i = 1$.
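Why those two conditions? Take the expectation:

$$E(b_2^*) = \sum c_i E(y_i) = \sum c_i (\beta_1 + \beta_2 x_i) = \beta_1 \sum c_i + \beta_2 \sum c_i x_i,$$

and for this to equal $\beta_2$ no matter what $\beta_1$ and $\beta_2$ actually are, we need $\sum c_i = 0$ and $\sum c_i x_i = 1$.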
On the other hand~~ (this came up in the intermediate steps above; pulling that formula back over one more time, here it is):
$$b_2 = \sum w_i y_i, \quad w_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2}$$

With that in hand, let’s set

$$k_i = c_i - w_i.$$

Then
$$\text{var}(b_2^*) = \sigma^2 \sum c_i^2 = \sigma^2 \sum (w_i + k_i)^2 = \sigma^2 \sum w_i^2 + 2\sigma^2 \sum w_i k_i + \sigma^2 \sum k_i^2.$$

You can check $\sum w_i k_i = 0$: the unbiasedness conditions on the $c_i$ give $\sum w_i c_i = \frac{\sum (x_i - \bar{x}) c_i}{\sum_j (x_j - \bar{x})^2} = \frac{\sum c_i x_i - \bar{x} \sum c_i}{\sum_j (x_j - \bar{x})^2} = \frac{1}{\sum_j (x_j - \bar{x})^2} = \sum w_i^2$, so $\sum w_i k_i = \sum w_i c_i - \sum w_i^2 = 0$. Therefore
$$\text{var}(b_2^*) = \text{var}(b_2) + \sigma^2 \sum k_i^2 \;\geq\; \text{var}(b_2),$$with equality only when every $k_i = 0$, i.e. $b_2^* = b_2$. So among all linear unbiased estimators, $b_2$ has the minimum variance. BLUE confirmed.
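If you want to see the “Best” part numerically, here is one more toy comparison (my own sketch): the endpoint slope $(y_n - y_1)/(x_n - x_1)$ is also a linear unbiased estimator of $\beta_2$, but its variance comes out much larger than the least-squares slope’s.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.linspace(0, 10, 50)
beta1, beta2, sigma = 1.0, 2.0, 1.5

ols, endpoints = [], []
for _ in range(20_000):
    y = beta1 + beta2 * x + rng.normal(0.0, sigma, x.size)
    # The least-squares slope (the BLUE one).
    ols.append(np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2))
    # A competing linear unbiased estimator: the slope through the first and last point.
    # Its weights c_i satisfy sum(c_i) = 0 and sum(c_i * x_i) = 1, so it is unbiased.
    endpoints.append((y[-1] - y[0]) / (x[-1] - x[0]))

print(np.mean(ols), np.mean(endpoints))  # both ~ 2.0: both unbiased
print(np.var(ols), np.var(endpoints))    # the least-squares variance is much smaller
```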
We should run the same thing through for $b_1$ and check……
I’ll leave that as homework…………
Then bye bye bye.
Originally written in Korean on my Naver blog (2017-11). Translated to English for gdpark.blog.