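Point Estimation: properties of estimators

Finite-sample properties (CB 7.3)
Large-sample properties (CB 10.1)

1 FINITE-SAMPLE PROPERTIES

How an estimator performs for a finite number of observations $n$.

Estimator: $W$. Parameter: $\theta$.

Criteria for evaluating estimators:

Bias: does $EW = \theta$?
Variance of $W$ (you would like an estimator with a smaller variance).

Example: $X_1,\dots,X_n$ i.i.d. with mean $\mu$ and variance $\sigma^2$. Unknown parameters are $\mu$ and $\sigma^2$. Consider:

$\hat\mu_n \equiv \frac{1}{n}\sum_{i=1}^n X_i$, estimator of $\mu$;
$\hat\sigma^2_n \equiv \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2$, estimator of $\sigma^2$.

Bias: $E\hat\mu_n = \frac{1}{n}\cdot n\mu = \mu$. So $\hat\mu_n$ is unbiased, with $\mathrm{Var}\,\hat\mu_n = \frac{1}{n^2}\, n\sigma^2 = \sigma^2/n$.

$$E\hat\sigma^2_n = \frac{1}{n}E\Big[\sum_{i=1}^n X_i^2 - n\bar X_n^2\Big] = (\mu^2+\sigma^2) - \Big(\mu^2 + \frac{\sigma^2}{n}\Big) = \frac{n-1}{n}\,\sigma^2 .$$

Hence it is biased. To fix this bias, consider the estimator $s_n^2 \equiv \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X_n)^2$, for which $E s_n^2 = \sigma^2$ (unbiased).

Mean-squared error (MSE) of $W$ is $E(W-\theta)^2$. It is a common criterion for comparing estimators. Decompose:

$$\mathrm{MSE}(W) = VW + (EW-\theta)^2 = \text{Variance} + (\text{Bias})^2 .$$

Hence, for an unbiased estimator: $\mathrm{MSE}(W) = VW$.

Example: $X_1,\dots,X_n \sim U[0,\theta]$, so $f(x) = 1/\theta$ for $x\in[0,\theta]$.

Consider the estimator $\hat\theta_n \equiv 2\bar X_n$. Then $E\hat\theta_n = 2\cdot\frac{1}{n}\cdot n\cdot\frac{\theta}{2} = \theta$, so it is unbiased, and

$$\mathrm{MSE}(\hat\theta_n) = V\hat\theta_n = \frac{4}{n^2}\sum_{i=1}^n VX_i = \frac{4}{n}\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}.$$

Consider the estimator $\tilde\theta_n \equiv \max(X_1,\dots,X_n)$. In order to derive moments, start by deriving the CDF:

$$P(\tilde\theta_n \le z) = P(X_1\le z,\dots,X_n\le z) = \prod_{i=1}^n P(X_i\le z) = \begin{cases} (z/\theta)^n & \text{if } 0\le z\le\theta \\ 1 & \text{if } z>\theta. \end{cases}$$

Therefore $f_{\tilde\theta_n}(z) = n\,(z/\theta)^{n-1}\cdot\frac{1}{\theta}$ for $0\le z\le\theta$, and

$$E\tilde\theta_n = \frac{n}{\theta^n}\int_0^\theta z^n\,dz = \frac{n}{n+1}\,\theta, \qquad \mathrm{Bias}(\tilde\theta_n) = -\frac{\theta}{n+1},$$

$$E\tilde\theta_n^2 = \frac{n}{\theta^n}\int_0^\theta z^{n+1}\,dz = \frac{n}{n+2}\,\theta^2 .$$

Hence

$$V\tilde\theta_n = \theta^2\Big(\frac{n}{n+2} - \frac{n^2}{(n+1)^2}\Big) = \frac{n\,\theta^2}{(n+2)(n+1)^2}.$$

Accordingly, $\mathrm{MSE}(\tilde\theta_n) = V\tilde\theta_n + \mathrm{Bias}^2 = \dfrac{2\theta^2}{(n+2)(n+1)}$.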
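The bias and MSE formulas for the uniform example can be checked by simulation. The following is a minimal sketch, not part of the original notes: the function names, the values $\theta = 2$, $n = 10$, the number of replications and the seed are all illustrative assumptions.

```python
import random


def simulate_uniform_estimators(theta=2.0, n=10, reps=20000, seed=0):
    """Monte Carlo (bias, MSE) of 2*mean and max for U[0, theta] samples.

    theta, n, reps and seed are illustrative choices, not from the notes.
    """
    rng = random.Random(seed)
    sum_err = {"2*mean": 0.0, "max": 0.0}
    sum_sq = {"2*mean": 0.0, "max": 0.0}
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        estimates = {"2*mean": 2.0 * sum(xs) / n, "max": max(xs)}
        for name, w in estimates.items():
            sum_err[name] += w - theta          # accumulates the bias
            sum_sq[name] += (w - theta) ** 2    # accumulates the MSE
    return {k: (sum_err[k] / reps, sum_sq[k] / reps) for k in sum_err}


def theoretical_mse(theta, n):
    """Closed-form MSEs derived in the text."""
    return {
        "2*mean": theta ** 2 / (3 * n),
        "max": 2 * theta ** 2 / ((n + 2) * (n + 1)),
    }


results = simulate_uniform_estimators()
theory = theoretical_mse(2.0, 10)
for name in results:
    bias, mse = results[name]
    print(f"{name}: bias={bias:.4f}, MSE={mse:.4f}, theory MSE={theory[name]:.4f}")
```

In this setup the sample maximum is biased downward yet has the smaller MSE, which is the variance-versus-bias trade-off that the MSE decomposition makes explicit.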
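Continue the previous example. Redefine $\tilde\theta_n \equiv \frac{n+1}{n}\max(X_1,\dots,X_n)$. Now both estimators $\hat\theta_n$ and $\tilde\theta_n$ are unbiased. Which is better?

$$V\hat\theta_n = \frac{\theta^2}{3n} = O(1/n), \qquad V\tilde\theta_n = \Big(\frac{n+1}{n}\Big)^2 V\big(\max(X_1,\dots,X_n)\big) = \frac{\theta^2}{n(n+2)} = O(1/n^2).$$

Hence, for $n$ large enough, $\tilde\theta_n$ has a smaller variance, and in this sense it is "better".

Best unbiased estimator: if you choose the best (in terms of MSE) estimator, and restrict yourself to unbiased estimators, then the best estimator is the one with the lowest variance. A best unbiased estimator is also called the "uniform minimum variance unbiased estimator" (UMVUE).

Formally: an estimator $W_n$ is a UMVUE of $\theta$ if it satisfies:

(i) $E_\theta W_n = \theta$, for all $\theta$ (unbiasedness);
(ii) $V_\theta W_n \le V_\theta \tilde W_n$, for all $\theta$, and all other unbiased estimators $\tilde W_n$.

The "uniform" condition is crucial, because it is always possible to find estimators which have zero variance for a specific value of $\theta$.

It is difficult in general to verify that an estimator $W$ is UMVUE, since you have to verify condition (ii) of the definition, that $VW$ is no larger than the variance of every other unbiased estimator. Luckily, we have an important result for the lowest attainable variance of an estimator.

Theorem 7.3.9 (Cramer-Rao Inequality): Let $X_1,\dots,X_n$ be a sample with joint pdf $f(X|\theta)$, and let $W(X)$ be any estimator satisfying

(i) $\dfrac{d}{d\theta}E_\theta W(X) = \displaystyle\int \frac{\partial}{\partial\theta}\big[W(x)\,f(x|\theta)\big]\,dx$;
(ii) $V_\theta W(X) < \infty$.

Then

$$V_\theta W(X) \ \ge\ \frac{\Big(\frac{d}{d\theta}E_\theta W(X)\Big)^2}{E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2\Big]} .$$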
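The RHS above is called the Cramer-Rao Lower Bound (CRLB). Proof: CB, pg. 336.

Note: the LHS of condition (i) above is $\frac{d}{d\theta}\int W(X)\,f(X|\theta)\,dX$, so by Leibniz' rule, this condition basically rules out cases where the support of $X$ depends on $\theta$.

The equality

$$E_\theta\,\frac{\partial}{\partial\theta}\log f(X|\theta) = \int \frac{1}{f(x|\theta)}\frac{\partial f(x|\theta)}{\partial\theta}\, f(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx = \frac{\partial}{\partial\theta}\,1 = 0 \qquad (1)$$

is noteworthy, because $\frac{\partial}{\partial\theta}\log f(X|\theta) = 0$ is the FOC of the maximum likelihood estimation problem. (Alternatively, as in CB, apply condition (i) of the CR result, using $W(X) = 1$.)

In the i.i.d. case, the FOC becomes the sample average

$$\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0 .$$

And by the LLN:

$$\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i|\theta)\ \xrightarrow{p}\ E_{\theta_0}\,\frac{\partial}{\partial\theta}\log f(X|\theta),$$

where $\theta_0$ is the true value of $\theta$. This shows that maximum likelihood estimation of $\theta$ is equivalent to estimation based on the moment condition

$$E_{\theta_0}\,\frac{\partial}{\partial\theta}\log f(X|\theta) = 0,$$

which holds only at the true value $\theta = \theta_0$. (Thus MLE is "consistent" for the true value $\theta_0$, as we'll see later.) (However, note that Eq. (1), in which the expectation and the score use the same $\theta$, holds at all values of $\theta$, not just $\theta_0$.)

What if the model is "misspecified", in the sense that the true density of $X$ is $g(\vec x)$, and that for all $\theta\in\Theta$, $f(\vec x|\theta) \ne g(\vec x)$ (that is, there is no value of the parameter $\theta$ such that the postulated model $f$ coincides with the true model $g$)? Does Eq. (1) still hold? What is MLE looking for?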
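The difference between Eq. (1) and the moment condition, and the misspecification question, can be explored numerically. This is a hedged sketch rather than the notes' own answer: it assumes a postulated model $N(\mu, 1)$ (score $x - \mu$), evaluates expectations by a simple trapezoid rule, and uses a Laplace density centered at 0.5 as a hypothetical "true" density under misspecification. All function names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def laplace_pdf(x, loc=0.5, scale=1.0):
    """Hypothetical 'true' density used only to illustrate misspecification."""
    return math.exp(-abs(x - loc) / scale) / (2 * scale)


def expected_score(mu, true_pdf, lo=-30.0, hi=30.0, steps=60000):
    """Trapezoid approximation of E_true[d/dmu log f(X|mu)] for f = N(mu, 1).

    For the N(mu, 1) model the score is simply (x - mu).
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * (x - mu) * true_pdf(x)
    return total * h


# Eq. (1): expectation and score use the same mu, so the result is ~0 for any mu.
eq1_value = expected_score(0.3, lambda x: normal_pdf(x, 0.3))

# Moment condition: expectation under mu0 = 1, score evaluated at mu = 0.
# The result is mu0 - mu = 1, so the condition fails away from mu0.
moment_off_truth = expected_score(0.0, lambda x: normal_pdf(x, 1.0))

# Misspecified case: the expected score vanishes at the mean of the true density.
misspec_at_mean = expected_score(0.5, laplace_pdf)
misspec_elsewhere = expected_score(0.0, laplace_pdf)

print(eq1_value, moment_off_truth, misspec_at_mean, misspec_elsewhere)
```

Under these assumptions the misspecified MLE moment condition is solved by the mean of the true density, which suggests that MLE targets whichever parameter value sets the expected score to zero under the true distribution.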
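In the i.i.d. case, the CR lower bound can be simplified.

Corollary 7.3.10: if $X_1,\dots,X_n \sim$ i.i.d. $f(X|\theta)$, then

$$V_\theta W(X)\ \ge\ \frac{\Big(\frac{d}{d\theta}E_\theta W(X)\Big)^2}{n\cdot E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2\Big]} ,$$

where $f(X|\theta)$ in the denominator is now the density of a single observation.

Up to this point, the Cramer-Rao results are not that operational for finding a "best" estimator, because the estimator $W(X)$ is on both sides of the inequality. However, for an unbiased estimator, how can you simplify the expression further?

Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. What is the CRLB for an unbiased estimator of $\mu$? Unbiasedness implies that the numerator equals 1. Then

$$\log f(x|\theta) = -\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2, \qquad \frac{\partial}{\partial\mu}\log f(x|\theta) = \frac{x-\mu}{\sigma^2},$$

$$E\Big(\frac{\partial}{\partial\mu}\log f(X|\theta)\Big)^2 = \frac{E(X-\mu)^2}{\sigma^4} = \frac{VX}{\sigma^4} = \frac{1}{\sigma^2}.$$

Hence the CRLB $= \dfrac{1}{n\cdot\frac{1}{\sigma^2}} = \dfrac{\sigma^2}{n}$. This is the variance of the sample mean, so that the sample mean is a UMVUE for $\mu$.

Sometimes we can simplify the denominator of the CRLB further:

Lemma 7.3.11 (Information inequality; since the result is an equality, it is more commonly called the information equality): if $f(X|\theta)$ satisfies

$$(*)\qquad \frac{d}{d\theta}E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X|\theta)\Big] = \int \frac{\partial}{\partial\theta}\Big[\Big(\frac{\partial}{\partial\theta}\log f(x|\theta)\Big) f(x|\theta)\Big]\,dx,$$

then

$$E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2 = -E_\theta\Big(\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\Big).$$

Rough proof: LHS of (*): Using Eq. (1) above, we get that LHS of (*) $= 0$.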
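RHS of (*):

$$\int \frac{\partial}{\partial\theta}\Big[\frac{\partial\log f}{\partial\theta}\, f\Big]dx = \int \frac{\partial^2\log f}{\partial\theta^2}\, f\,dx + \int \frac{1}{f}\Big(\frac{\partial f}{\partial\theta}\Big)^2 dx = E\,\frac{\partial^2\log f}{\partial\theta^2} + E\Big(\frac{\partial\log f}{\partial\theta}\Big)^2 .$$

Putting the LHS and RHS together yields the desired result.

Note: the LHS of the above condition (*) is just $\frac{d}{d\theta}\int \big(\frac{\partial}{\partial\theta}\log f\big) f\,dX$ so that, by Leibniz' rule, the condition (*) just states that the bounds of the integration (i.e., the support of $X$) do not depend on $\theta$. The normal distribution satisfies this (support is always $(-\infty,\infty)$), but $U[0,\theta]$ does not.

Also, the information inequality depends crucially on the equality $E_\theta\frac{\partial}{\partial\theta}\log f(X|\theta) = 0$, which depends on the correct specification of the model. Thus the information inequality can be used as the basis of a "specification test". (How?)

Example: for the previous example, consider the CRLB for an unbiased estimator of $\sigma^2$. We can use the information inequality, because condition (*) is satisfied for the normal distribution. Hence:

$$\frac{\partial}{\partial\sigma^2}\log f(x|\theta) = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}, \qquad \frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6},$$

$$-E\Big[\frac{\partial^2}{\partial(\sigma^2)^2}\log f(X|\theta)\Big] = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^4} = \frac{1}{2\sigma^4}.$$

Hence the CRLB is $\dfrac{2\sigma^4}{n}$.

Example: $X_1,\dots,X_n \sim U[0,\theta]$. Check the conditions for the CRLB for an unbiased estimator $W(X)$ of $\theta$. On one hand, $\frac{d}{d\theta}EW(X) = 1$ (because it is unbiased). On the other hand, the joint pdf is $f(x|\theta) = \theta^{-n}$ on $[0,\theta]^n$, so

$$\int \frac{\partial}{\partial\theta}\big[W(x)\,f(x|\theta)\big]\,dx = -\frac{n}{\theta}\int W(x)\,f(x|\theta)\,dx = -\frac{n}{\theta}\,E_\theta W(X) = -n \ \ne\ 1 = \frac{d}{d\theta}EW(X).$$

Hence, condition (i) of the theorem is not satisfied.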
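As a numerical cross-check of the information inequality for the normal model, the sketch below evaluates both sides, $E[(\partial\log f/\partial\sigma^2)^2]$ and $-E[\partial^2\log f/\partial(\sigma^2)^2]$, by trapezoid-rule integration, and then forms the CRLB $2\sigma^4/n$. The function names, the integration grid and the values $\mu = 1$, $\sigma^2 = 2$, $n = 50$ are illustrative assumptions, not taken from the notes.

```python
import math


def normal_pdf(x, mu, var):
    """Density of N(mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def expectation(g, mu, var, half_width=12.0, steps=24000):
    """Trapezoid approximation of E[g(X)] for X ~ N(mu, var)."""
    sd = math.sqrt(var)
    lo = mu - half_width * sd
    h = 2 * half_width * sd / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(x) * normal_pdf(x, mu, var)
    return total * h


def score_var(x, mu, var):
    """d/d(sigma^2) of log f(x | mu, sigma^2)."""
    return -1.0 / (2 * var) + (x - mu) ** 2 / (2 * var ** 2)


def hessian_var(x, mu, var):
    """Second derivative of log f with respect to sigma^2."""
    return 1.0 / (2 * var ** 2) - (x - mu) ** 2 / var ** 3


mu, var, n = 1.0, 2.0, 50  # illustrative values
info_from_score = expectation(lambda x: score_var(x, mu, var) ** 2, mu, var)
info_from_hessian = -expectation(lambda x: hessian_var(x, mu, var), mu, var)
crlb_sigma2 = 1.0 / (n * info_from_hessian)

print(info_from_score, info_from_hessian, crlb_sigma2)
```

Both information measures should agree with $1/(2\sigma^4)$, so the bound matches the closed form $2\sigma^4/n$ for the chosen values.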
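But when can the CRLB (if it exists) be attained?

Corollary 7.3.15: Let $X_1,\dots,X_n \sim$ i.i.d. $f(X|\theta)$, satisfying the conditions of the CR theorem. Let $L(\theta|X) = \prod_{i=1}^n f(X_i|\theta)$ denote the likelihood function. An estimator $W(X)$ unbiased for $\tau(\theta)$ (with $\tau(\theta)=\theta$ in the examples here) attains the CRLB iff you can write

$$\frac{\partial}{\partial\theta}\log L(\theta|X) = a(\theta)\,\big[W(X) - \tau(\theta)\big]$$

for some function $a(\theta)$.

Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. Consider estimating $\mu$:

$$\frac{\partial}{\partial\mu}\log L(\theta|X) = \frac{\partial}{\partial\mu}\sum_{i=1}^n \log f(X_i|\theta) = \frac{\partial}{\partial\mu}\sum_{i=1}^n\Big[-\log\sqrt{2\pi} - \log\sigma - \frac{(X_i-\mu)^2}{2\sigma^2}\Big] = \sum_{i=1}^n \frac{X_i-\mu}{\sigma^2} = \frac{n}{\sigma^2}\,(\bar X_n - \mu).$$

Hence, the CRLB can be attained (in fact, we showed earlier that the CRLB is attained by $\bar X_n$).

Loss function optimality

Let $X \sim f(X|\theta)$. Consider a loss function $L(\theta, W(X))$, taking values in $[0,+\infty)$, which penalizes you when your estimator $W(X)$ is "far" from the true parameter $\theta$. Note that $L(\theta,W(X))$ is a random variable, since $X$ (and $W(X)$) are random.

Consider estimators which minimize expected loss: that is

$$\min_{W(\cdots)} E_\theta L(\theta,W(X)) \equiv \min_{W(\cdots)} R(\theta,W(\cdots)),$$

where $R(\theta,W(\cdots))$ is the risk function. (Note: the risk function is not a random variable, because $X$ has been integrated out.)

Loss function optimality is a more general criterion than minimum MSE. In fact, because $\mathrm{MSE}(W(X)) = E_\theta\,(W(X)-\theta)^2$, the MSE is actually the risk function associated with the quadratic loss function $L(\theta,W(X)) = (W(X)-\theta)^2$.

Other examples of loss functions:

Absolute error loss: $L(\theta,W(X)) = |W(X)-\theta|$.
Relative quadratic error loss: $L(\theta,W(X)) = \dfrac{(W(X)-\theta)^2}{|\theta|+1}$.

The exercise of minimizing risk takes a given value of $\theta$ as given, so that the minimized risk of an estimator depends on whichever value of $\theta$ you are considering. You might be interested in an estimator which does well regardless of which value of $\theta$ you are considering. (This is analogous to the focus on uniform minimal variance.)

For this different problem, you want to consider a notion of risk which does not depend on $\theta$. Two criteria which have been considered are:

"Average" risk:

$$\min_{W(\cdots)} \int R(\theta,W(\cdots))\,h(\theta)\,d\theta,$$

where $h(\theta)$ is some weighting function across $\theta$. (In a Bayesian interpretation, $h(\theta)$ is a prior density over $\theta$.)

Minmax criterion:

$$\min_{W(\cdots)}\ \max_\theta\ R(\theta,W(\cdots)).$$

Here you choose the estimator $W(\cdots)$ to minimize the maximum risk $\max_\theta R(\theta,W(\cdots))$, where $\theta$ is set to the "worst" value. So the minmax optimizer is the best that can be achieved in a "worst-case" scenario.

Example: $X_1,\dots,X_n \sim$ i.i.d. $N(\mu,\sigma^2)$. The sample mean $\bar X_n$ is: unbiased; minimum MSE among unbiased estimators; UMVUE; attains the CRLB; minimizes expected quadratic loss among unbiased estimators.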
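To make the average-risk and minmax criteria concrete, here is a small sketch under assumptions that go beyond the notes: the estimators are restricted to the hypothetical linear family $W_c = c\,\bar X_n$, the loss is quadratic, the parameter $\mu$ is restricted to $[-B, B]$, and the weighting function is uniform on a grid. Under quadratic loss the risk is $R(\mu, c) = c^2\sigma^2/n + (1-c)^2\mu^2$ (variance plus squared bias). The names and the values $\sigma^2 = 1$, $n = 4$, $B = 1$ are illustrative.

```python
def risk(mu, c, sigma2=1.0, n=4):
    """Quadratic-loss risk of W_c = c * sample mean: variance + bias^2."""
    return c ** 2 * sigma2 / n + (1.0 - c) ** 2 * mu ** 2


B = 1.0                                                 # assumed bound on |mu|
mu_grid = [-B + 2 * B * k / 200 for k in range(201)]    # grid over [-B, B]
c_grid = [k / 100 for k in range(101)]                  # shrinkage factors in [0, 1]


def max_risk(c):
    """Worst-case risk over the mu grid (minmax criterion)."""
    return max(risk(mu, c) for mu in mu_grid)


def average_risk(c):
    """Risk averaged with uniform weights over the mu grid."""
    return sum(risk(mu, c) for mu in mu_grid) / len(mu_grid)


c_minmax = min(c_grid, key=max_risk)
c_average = min(c_grid, key=average_risk)

print("risk of sample mean (c=1):", max_risk(1.0))
print("minmax c:", c_minmax, "worst-case risk:", max_risk(c_minmax))
print("average-risk c:", c_average, "average risk:", average_risk(c_average))
```

Within this restricted family and bounded parameter range, both criteria pick a biased estimator with $c < 1$, and the two criteria generally pick different values of $c$. With an unbounded range for $\mu$, the worst-case risk of any $c \ne 1$ is unbounded, so the comparison depends heavily on the assumed parameter space.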
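2 LARGE SAMPLE PROPERTIES OF ESTIMATORS

It can be difficult to compute MSE, risk functions, etc., for some estimators, especially when the estimator does not resemble a sample average.

Large-sample properties: exploit LLN, CLT.

Consider data $\{X_1, X_2, \dots\}$ by which we construct a sequence of estimators $W_n \equiv \{W_1, W_2, \dots\}$. $W_n$ is a random sequence.

Define: we say that $W_n$ is consistent for a parameter $\theta$ iff the random sequence $W_n$ converges (in some stochastic sense) to $\theta$. Strong consistency obtains when $W_n \xrightarrow{a.s.} \theta$. Weak consistency obtains when $W_n \xrightarrow{p} \theta$.

For estimators like sample means, consistency (either weak or strong) follows easily using a LLN. Consistency can be thought of as the large-sample version of unbiasedness.

Define: an M-estimator of $\theta$ is a maximizer of an objective function $Q_n(\theta)$.

Examples:

MLE: $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta)$.

Least squares: $Q_n(\theta) = -\frac{1}{n}\sum_{i=1}^n [y_i - g(x_i;\theta)]^2$ (written with a minus sign so that the estimator is a maximizer). OLS is the special case when $g(x_i;\theta)$ is linear in $\theta$, e.g. $g(x_i;\theta) = x_i'\theta$.

GMM: $Q_n(\theta) = -G_n(\theta)'\,W_n(\theta)\,G_n(\theta)$ where

$$G_n(\theta) = \Big[\frac{1}{n}\sum_{i=1}^n m_1(x_i;\theta),\ \frac{1}{n}\sum_{i=1}^n m_2(x_i;\theta),\ \dots,\ \frac{1}{n}\sum_{i=1}^n m_M(x_i;\theta)\Big]',$$

an $M\times 1$ vector of sample moment conditions, and $W_n$ is an $M\times M$ weighting matrix.

Notation: Denote the limit objective function $Q_0(\theta) = \operatorname{plim}_{n\to\infty} Q_n(\theta)$ (at each $\theta$). Define $\theta_0 \equiv \arg\max_\theta Q_0(\theta)$.

Consistency of M-estimators

Make the following assumptions: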
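1. $Q_0(\theta)$ is uniquely maximized at some value $\theta_0$ ("identification")
2. Parameter space $\Theta$ is a compact subset of $\mathbb{R}^K$
3. $Q_0(\theta)$ is continuous in $\theta$
4. $Q_n(\theta)$ converges uniformly in probability to $Q_0(\theta)$; that is: $\sup_{\theta\in\Theta} |Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0$.

Theorem (Consistency of M-Estimator): Under assumptions 1, 2, 3, 4, $\hat\theta_n \xrightarrow{p} \theta_0$, where $\hat\theta_n \equiv \arg\max_{\theta\in\Theta} Q_n(\theta)$.

Proof: We need to show: for any arbitrarily small neighborhood $N$ containing $\theta_0$, $P(\hat\theta_n \in N) \to 1$.

For $n$ large enough, the uniform convergence condition implies that, for all $\epsilon,\delta > 0$,

$$P\Big(\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2\Big) > 1-\delta.$$

The event "$\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2$" implies

$$Q_n(\hat\theta_n) - Q_0(\hat\theta_n) < \epsilon/2 \iff Q_0(\hat\theta_n) > Q_n(\hat\theta_n) - \epsilon/2. \qquad (2)$$

Similarly,

$$Q_n(\theta_0) - Q_0(\theta_0) > -\epsilon/2 \implies Q_n(\theta_0) > Q_0(\theta_0) - \epsilon/2. \qquad (3)$$

Since $\hat\theta_n = \arg\max_\theta Q_n(\theta)$, we have $Q_n(\hat\theta_n) \ge Q_n(\theta_0)$, so Eq. (2) implies

$$Q_0(\hat\theta_n) > Q_n(\theta_0) - \epsilon/2. \qquad (4)$$

Hence, adding Eqs. (3) and (4), we have

$$Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon. \qquad (5)$$

So we have shown that

$$\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2 \implies Q_0(\hat\theta_n) > Q_0(\theta_0)-\epsilon,$$

and therefore

$$P\big(Q_0(\hat\theta_n) > Q_0(\theta_0)-\epsilon\big) \ \ge\ P\Big(\sup_{\theta\in\Theta}|Q_n(\theta)-Q_0(\theta)| < \epsilon/2\Big) \to 1.$$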
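Now define $N$ as any open neighborhood of $\theta_0$, and let $N^c$ be the complement of $N$ in $\Theta$. Then $N^c$ is compact (a closed subset of the compact set $\Theta$), so that $\max_{\theta\in N^c} Q_0(\theta)$ exists. Set $\epsilon = Q_0(\theta_0) - \max_{\theta\in N^c} Q_0(\theta)$, which is positive by assumption 1. Then

$$Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon \implies Q_0(\hat\theta_n) > \max_{\theta\in N^c} Q_0(\theta) \implies \hat\theta_n \in N,$$

so that

$$P(\hat\theta_n\in N) \ \ge\ P\big(Q_0(\hat\theta_n) > Q_0(\theta_0)-\epsilon\big) \to 1.$$

Since the argument above holds for any arbitrarily small neighborhood of $\theta_0$, we are done.

In general, the limit objective function $Q_0(\theta) = \operatorname{plim}_{n\to\infty} Q_n(\theta)$ may not be that straightforward to determine. But in many cases, $Q_n(\theta)$ is a sample average of some sort:

$$Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n q(x_i|\theta)$$

(e.g. least squares, MLE). Then by a law of large numbers, we conclude that (for all $\theta$)

$$Q_0(\theta) = \operatorname{plim}\ \frac{1}{n}\sum_{i=1}^n q(x_i|\theta) = E_{x_i}\, q(x_i|\theta),$$

where $E_{x_i}$ denotes expectation with respect to the true (but unobserved) distribution of $x_i$.

Most of the time, $\theta_0$ can be interpreted as a "true value". But if the model is misspecified, then this interpretation doesn't hold (indeed, under misspecification, it is not even clear what the "true value" is). So a more cautious way to interpret the consistency result is that

$$\hat\theta_n \xrightarrow{p} \arg\max_\theta Q_0(\theta),$$

which holds (given the conditions) no matter whether the model is correctly specified.

** Of the four assumptions above, the most "high-level" one is the uniform convergence condition. Sufficient conditions for this condition are:

1. Pointwise convergence: For each $\theta\in\Theta$, $Q_n(\theta) - Q_0(\theta) = o_p(1)$.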
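2. $Q_n(\theta)$ is stochastically equicontinuous: for every $\epsilon>0$, $\eta>0$ there exist a sequence of random variables $\Delta_n(\epsilon,\eta)$ and an $n^*(\epsilon,\eta)$ such that for all $n>n^*$, $P(|\Delta_n|>\epsilon) < \eta$, and for each $\theta$ there is an open set $N$ containing $\theta$ with

$$\sup_{\tilde\theta\in N} |Q_n(\tilde\theta) - Q_n(\theta)| \le \Delta_n, \quad \forall n>n^*.$$

Note that both $\Delta_n$ and $n^*$ do not depend on $\theta$: it is a uniform result. This is an "in probability" version of the deterministic notion of uniform equicontinuity: we say a sequence of deterministic functions $R_n(\theta)$ is uniformly equicontinuous if, for every $\epsilon>0$ there exist $\delta(\epsilon)$ and $n^*(\epsilon)$ such that for all $\theta$

$$\sup_{\tilde\theta:\ \|\tilde\theta-\theta\|<\delta} |R_n(\tilde\theta) - R_n(\theta)| \le \epsilon, \quad \forall n>n^*.$$

Asymptotic normality for M-estimators

Define the "score vector", the gradient of $Q_n$ evaluated at a point $\tilde\theta$:

$$\nabla_\theta Q_n(\tilde\theta) = \Big(\frac{\partial Q_n(\theta)}{\partial\theta_1}\Big|_{\theta=\tilde\theta},\ \dots,\ \frac{\partial Q_n(\theta)}{\partial\theta_K}\Big|_{\theta=\tilde\theta}\Big)'.$$

Similarly, define the $K\times K$ Hessian matrix

$$\big[\nabla_{\theta\theta} Q_n(\tilde\theta)\big]_{i,j} = \frac{\partial^2 Q_n(\theta)}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta=\tilde\theta}, \quad 1\le i,j\le K.$$

Note that the Hessian is symmetric.

Make the following assumptions:

1. $\hat\theta_n = \arg\max_\theta Q_n(\theta) \xrightarrow{p} \theta_0$
2. $\theta_0 \in \text{interior}(\Theta)$
3. $Q_n(\theta)$ is twice continuously differentiable in a neighborhood $N$ of $\theta_0$
4. $\sqrt{n}\,\nabla_\theta Q_n(\theta_0) \xrightarrow{d} N(0,\Sigma)$
5. Uniform convergence of Hessian: there exists a matrix $H(\theta)$ which is continuous at $\theta_0$ and $\sup_{\theta\in N} \|\nabla_{\theta\theta}Q_n(\theta) - H(\theta)\| \xrightarrow{p} 0$.
6. $H(\theta_0)$ is nonsingular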
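Theorem (Asymptotic normality for M-estimator): Under assumptions 1 to 6,

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\big(0,\ H_0^{-1}\Sigma H_0^{-1}\big),$$

where $H_0 \equiv H(\theta_0)$.

Proof (sketch): By Assumptions 1, 2, 3, $\nabla_\theta Q_n(\hat\theta_n) = 0$ (this is the FOC of the maximization problem). Then using the mean-value theorem (with $\bar\theta_n$ denoting the mean value):

$$0 = \nabla_\theta Q_n(\hat\theta_n) = \nabla_\theta Q_n(\theta_0) + \nabla_{\theta\theta}Q_n(\bar\theta_n)\,(\hat\theta_n-\theta_0)$$

$$\implies \sqrt{n}\,(\hat\theta_n-\theta_0) = -\underbrace{\big[\nabla_{\theta\theta}Q_n(\bar\theta_n)\big]^{-1}}_{\xrightarrow{p}\ H_0^{-1}\ \text{(using A5)}}\ \underbrace{\sqrt{n}\,\nabla_\theta Q_n(\theta_0)}_{\xrightarrow{d}\ N(0,\Sigma)\ \text{(using A4)}}$$

$$\implies \sqrt{n}\,(\hat\theta_n-\theta_0) \xrightarrow{d} -H(\theta_0)^{-1} N(0,\Sigma) = N\big(0,\ H_0^{-1}\Sigma H_0^{-1}\big).$$

Note: A5 is a uniform convergence assumption on the sample Hessian. In the analogous step in the Delta method, we used the plim-operator theorem coupled with the "squeeze" result. Here that doesn't suffice, because as $n\to\infty$, both $\bar\theta_n$ as well as the function $G_n(\cdot) = \nabla_{\theta\theta}Q_n(\cdot)$ are changing.

2.1 Maximum likelihood estimation

The consistency of MLE can follow by application of the theorem above for consistency of M-estimators. Essentially, as we noted above, what the consistency theorem showed was that, for any M-estimator sequence $\hat\theta_n$:

$$\operatorname{plim}_{n\to\infty}\ \hat\theta_n = \arg\max_\theta Q_0(\theta).$$

For MLE, there is an argument due to Wald (1949), who shows that, in the i.i.d. case, the "limiting likelihood function" (corresponding to $Q_0(\theta)$) is indeed globally maximized at $\theta_0$, the "true value". Thus, we can directly confirm the identification assumption of the M-estimator consistency theorem. This argument is of interest by itself.

Argument (summary of Amemiya, pp. 141–142):

Define $\hat\theta_{MLE} \equiv \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta)$. Let $\theta_0$ denote the true value.

By LLN: $\frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\log f(x_i|\theta)$, for all $\theta$ (not necessarily the true $\theta_0$).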
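The LLN step can be illustrated with a simple model. The sketch below assumes an exponential density $f(x|\theta) = \theta e^{-\theta x}$, which is not an example from the notes; for it, the average log-likelihood is $Q_n(\theta) = \log\theta - \theta\bar x_n$ and the limit is $Q_0(\theta) = \log\theta - \theta/\theta_0$. The grid $[0.1, 10]$ plays the role of a compact parameter space; the function names, $\theta_0 = 2$, the sample size and the seed are illustrative assumptions.

```python
import math
import random


def avg_loglik_exponential(theta, sample_mean):
    """Q_n(theta) = (1/n) sum_i log f(x_i|theta) for f(x|theta) = theta*exp(-theta*x)."""
    return math.log(theta) - theta * sample_mean


def limit_objective(theta, theta0):
    """Q_0(theta) = E_{theta0} log f(X|theta) = log(theta) - theta/theta0."""
    return math.log(theta) - theta / theta0


def lln_check(n, theta0=2.0, seed=1):
    """Return (sup-gap between Q_n and Q_0 on the grid, grid argmax of Q_n)."""
    rng = random.Random(seed)
    xbar = sum(rng.expovariate(theta0) for _ in range(n)) / n
    grid = [0.1 + 0.01 * k for k in range(991)]  # compact set [0.1, 10]
    gap = max(
        abs(avg_loglik_exponential(t, xbar) - limit_objective(t, theta0))
        for t in grid
    )
    theta_hat = max(grid, key=lambda t: avg_loglik_exponential(t, xbar))
    return gap, theta_hat


gap_large, theta_hat_large = lln_check(20000)
print("sup |Q_n - Q_0| on grid:", gap_large)
print("grid maximizer of Q_n:", theta_hat_large)
```

For a large sample the gap between $Q_n$ and $Q_0$ is small over the whole grid, and the maximizer of $Q_n$ sits close to the assumed $\theta_0$.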
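By Jensen's inequality, for $\theta\ne\theta_0$:

$$E_{\theta_0}\log\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big) < \log E_{\theta_0}\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big).$$

But $E_{\theta_0}\Big(\dfrac{f(x|\theta)}{f(x|\theta_0)}\Big) = \displaystyle\int \frac{f(x|\theta)}{f(x|\theta_0)}\, f(x|\theta_0)\,dx = 1$, since $f(x|\theta)$ is a density function, for all $\theta$. (In this step, note the importance of assumption A3 in CB, pg. 516. If $x$ has support depending on $\theta$, then the integral will not equal 1 for all $\theta$.) Hence:

$$E_{\theta_0}\log\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big) < 0 \implies E_{\theta_0}\log f(x|\theta) < E_{\theta_0}\log f(x|\theta_0), \quad \forall\theta\ne\theta_0,$$

so that $E_{\theta_0}\log f(x|\theta)$ is maximized at the true $\theta_0$. This is the "identification" assumption from the M-estimator consistency theorem.

Analogously, we also know that, for $\delta>0$,

$$\mu_1 \equiv E_{\theta_0}\log\Big(\frac{f(x;\theta_0-\delta)}{f(x;\theta_0)}\Big) < 0; \qquad \mu_2 \equiv E_{\theta_0}\log\Big(\frac{f(x;\theta_0+\delta)}{f(x;\theta_0)}\Big) < 0.$$

By the SLLN, we know that

$$\frac{1}{n}\sum_{i=1}^n \log\Big(\frac{f(x_i;\theta_0-\delta)}{f(x_i;\theta_0)}\Big) = \frac{1}{n}\big[\log L_n(\vec x;\theta_0-\delta) - \log L_n(\vec x;\theta_0)\big] \xrightarrow{a.s.} \mu_1,$$

so that, with probability 1, $\log L_n(\vec x;\theta_0-\delta) < \log L_n(\vec x;\theta_0)$ for $n$ large enough. Similarly, for $n$ large enough, $\log L_n(\vec x;\theta_0+\delta) < \log L_n(\vec x;\theta_0)$ with probability 1. Hence, for large $n$,

$$\hat\theta_n \equiv \arg\max_\theta \log L_n(\vec x;\theta) \in (\theta_0-\delta,\ \theta_0+\delta).$$

That is, the MLE $\hat\theta_n$ is strongly consistent for $\theta_0$. Note that this argument requires weaker assumptions than the M-estimator consistency theorem above.

Now we consider another idea, efficiency, which can be thought of as the large-sample analogue of the "minimum variance" concept.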
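For the sequence of estimators $W_n$, suppose that

$$k(n)\,(W_n - \theta) \xrightarrow{d} N(0,\sigma^2),$$

where $k(n)$ is a polynomial in $n$. Then $\sigma^2$ is denoted the asymptotic variance of $W_n$. In "usual" cases, $k(n) = \sqrt{n}$. For example, by the CLT, we know that $\sqrt{n}\,(\bar X_n - \mu) \xrightarrow{d} N(0,\sigma^2)$. Hence, $\sigma^2$ is the asymptotic variance of the sample mean $\bar X_n$.

Definition 10.1.11: An estimator sequence $W_n$ is asymptotically efficient for $\theta$ if $\sqrt{n}\,(W_n-\theta) \xrightarrow{d} N(0, v(\theta))$, where the asymptotic variance is

$$v(\theta) = \frac{1}{E_{\theta_0}\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2}$$

(the CRLB, up to the factor $1/n$).

By the asymptotic normality result for M-estimators, we know what the asymptotic distribution for the MLE should be. However, it turns out that, given the information inequality, the MLE's asymptotic distribution can be further simplified.

Theorem 10.1.12: Asymptotic efficiency of MLE

Proof (following Amemiya, pp. 143–144): $\hat\theta_{MLE}$ satisfies the FOC of the MLE problem:

$$0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\hat\theta_{MLE}} .$$

Using the mean value theorem (with $\theta^*_n$ denoting the mean value):

$$0 = \frac{\partial\log L(\theta|X_n)}{\partial\theta}\Big|_{\theta=\theta_0} + \frac{\partial^2\log L(\theta|X_n)}{\partial\theta^2}\Big|_{\theta=\theta^*_n}\,(\hat\theta_{MLE}-\theta_0)$$

$$\implies \sqrt{n}\,(\hat\theta_{MLE}-\theta_0) = -\,\frac{\frac{1}{\sqrt n}\sum_{i}\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}}{\frac{1}{n}\sum_{i}\frac{\partial^2\log f(x_i|\theta)}{\partial\theta^2}\Big|_{\theta=\theta^*_n}} \qquad (**)$$

Note that, by the LLN,

$$\frac{1}{n}\sum_i \frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} \xrightarrow{p} E_{\theta_0}\,\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} = \int \frac{\partial f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} dx.$$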
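Using the same argument as in the information inequality result above, the last term is:

$$\int \frac{\partial f}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int f\,dx = 0.$$

Hence, the CLT can be applied to the numerator of (**):

$$\text{numerator of (**)} \xrightarrow{d} N\Big(0,\ E_{\theta_0}\Big(\frac{\partial\log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2\Big).$$

By the LLN, and uniform convergence of the Hessian term:

$$\text{denominator of (**)} \xrightarrow{p} E_{\theta_0}\,\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}.$$

Hence, by the Slutsky theorem:

$$\sqrt{n}\,(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} N\left(0,\ \frac{E_{\theta_0}\Big(\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2}{\Big[E_{\theta_0}\,\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}\Big]^2}\right).$$

By the information inequality:

$$E_{\theta_0}\Big(\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2 = -E_{\theta_0}\,\frac{\partial^2\log f(X|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0},$$

so that

$$\sqrt{n}\,(\hat\theta_{MLE}-\theta_0) \xrightarrow{d} N\left(0,\ \frac{1}{E_{\theta_0}\Big(\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2}\right),$$

so that the asymptotic variance is the CRLB.

Hence, the asymptotic approximation for the finite-sample distribution is

$$\hat\theta_{MLE} \overset{a}{\sim} N\left(\theta_0,\ \frac{1}{n}\cdot\frac{1}{E_{\theta_0}\Big(\frac{\partial\log f(X|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2}\right).$$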
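The asymptotic approximation can be checked by simulation in a simple case. This sketch assumes an exponential model $f(x|\theta) = \theta e^{-\theta x}$, which is not an example from the notes. Its MLE is $\hat\theta = 1/\bar x_n$, and $E(\partial\log f/\partial\theta)^2 = 1/\theta^2$, so the predicted asymptotic variance of $\sqrt{n}(\hat\theta - \theta_0)$ is $\theta_0^2$. The function names, $\theta_0 = 2$, $n = 200$, the number of replications and the seed are illustrative assumptions.

```python
import math
import random
import statistics


def exponential_mle(sample):
    """MLE of the rate for f(x|theta) = theta*exp(-theta*x): 1 / sample mean."""
    return len(sample) / sum(sample)


def scaled_errors(theta0=2.0, n=200, reps=3000, seed=2):
    """Draws of sqrt(n) * (theta_hat - theta0) across simulated samples."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = [rng.expovariate(theta0) for _ in range(n)]
        draws.append(math.sqrt(n) * (exponential_mle(sample) - theta0))
    return draws


errors = scaled_errors()
simulated_variance = statistics.pvariance(errors)
crlb_asymptotic_variance = 2.0 ** 2  # 1 / I(theta0), with I(theta) = 1 / theta^2

print("simulated variance of sqrt(n)(theta_hat - theta0):", simulated_variance)
print("asymptotic (CRLB) variance:", crlb_asymptotic_variance)
```

At a finite $n$ the simulated variance should be close to, but not exactly equal to, the asymptotic value, since the normal approximation is only a large-sample statement.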