Multicollinearity, identification, and estimable functions

Simen Gaure

Abstract. Since there is quite a lot of confusion here and there about what happens when factors are collinear, here is a walkthrough of the identification problems which may arise in models with many dummies, and how lfe handles them. (Or, at the very least, attempts to handle them.)

1. Context

The lfe package is used for ordinary least squares estimation, i.e. models which conceptually may be estimated by lm as

    lm(y ~ x1 + x2 + ... + xm + f1 + f2 + ... + fn)

where f1, f2, ..., fn are factors. The standard method is to introduce a dummy variable for each level of each factor. This is too much, as it introduces multicollinearities in the system. Conceptually, the system may still be solved, but there are many different solutions. In all of them, the differences between the coefficients within each factor will be the same.

The ambiguity is typically resolved by removing a single dummy variable for each factor; this is termed a reference. This is like forcing the coefficient for this dummy variable to zero, and the other levels are then seen as relative to this zero. Other ways to solve the problem are to force the sum of the coefficients to be zero, or to enforce some other constraint, typically via the contrasts argument to lm. The default in lm is to have a reference level in each factor, and a common intercept term.

In lfe the same estimation can be performed by

    felm(y ~ x1 + x2 + ... + xm | f1 + f2 + ... + fn)

Since felm conceptually does exactly the same as lm, the contrasts approach may work there too. Or rather, it is actually not necessary that felm handles it at all; it is only necessary if one needs to fetch the coefficients for the factor levels with getfe.

lfe is intended for very large datasets, with factors with many levels. Then the approach with a single constraint for each factor may sometimes not be sufficient. The standard example in the econometrics literature (see e.g. [2]) is the case with two factors, one for individuals, and one for the firms these individuals work for, changing jobs now and then. What happens in practice is that the labour market may be disconnected, so that one set of individuals moves between one set of firms, and
another (disjoint) set of individuals moves between some other firms. This happens for no obvious reason, and is data dependent, not intrinsic to the model. There may be several such components. I.e. there are more multicollinearities in the system than the obvious ones. In such a case, there is no way to compare coefficients from different connected components; a single individual reference is not sufficient. The problem may be phrased in graph theoretic terms (see e.g. [1,3,4]), and it can be shown that it is sufficient with one reference level in each of the connected components. This is what lfe does: in the case with two factors it identifies these components, and forces one level to zero in one of the factors.

In the examples below, rather small randomly generated datasets are used. lfe is hardly the best solution for these problems; they are solely used to illustrate some concepts. I can assure the reader that no CPUs, sleeping patterns, romantic relationships, trees or cats, nor animals in general, were harmed during data collection and analysis.

2. Identification with two factors

In the case with two factors, identification is well-known. getfe will partition the dataset into connected components, and introduce a reference level in each component:

    library(lfe)
    ## Loading required package: Matrix
    set.seed(42)
    # 20 observations, two factors with up to 8 levels each
    x1 <- rnorm(20)
    f1 <- sample(8, length(x1), replace = TRUE) / 10
    f2 <- sample(8, length(x1), replace = TRUE) / 10
    e1 <- sin(f1) + 0.02 * f2^2 + rnorm(length(x1))
    y <- 2.5 * x1 + (e1 - mean(e1))
    summary(est <- felm(y ~ x1 | f1 + f2))
    ##
    ## Call:
    ##    felm(formula = y ~ x1 | f1 + f2)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -0.7331 -0.1751  0.0000  0.1139  0.7331
    ##
    ## Coefficients:
    ##    Estimate Std. Error t value Pr(>|t|)
    ## x1   1.9609     0.2854   6.872 0.000998 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.8097 on 5 degrees of freedom
    ## Multiple R-squared(full model): 0.9849   Adjusted R-squared: 0.9425
    ## Multiple R-squared(proj model): 0.9042   Adjusted R-squared: 0.6361
    ## F-statistic(full model): 23.23 on 14 and 5 DF, p-value: 0.001318
    ## F-statistic(proj model): 47.22 on 1 and 5 DF, p-value: 0.0009982

We examine the estimable function produced by efactory.
    ef <- efactory(est)
    is.estimable(ef, est$fe)
    ## [1] TRUE
    getfe(est)
    ##             effect obs comp fe idx
    ## f1.0.1  0.84230453   1    2 f1 0.1
    ## f1.0.2  0.42366575   4    1 f1 0.2
    ## f1.0.3  0.60409852   2    2 f1 0.3
    ## f1.0.4  0.90166835   4    1 f1 0.4
    ## f1.0.5  0.67425996   2    2 f1 0.5
    ## f1.0.6  1.08737618   2    1 f1 0.6
    ## f1.0.7 -1.18563165   2    1 f1 0.7
    ## f1.0.8  0.38769504   3    1 f1 0.8
    ## f2.0.1 -2.17762453   2    1 f2 0.1
    ## f2.0.2  0.00000000   6    1 f2 0.2
    ## f2.0.3  0.44013166   1    1 f2 0.3
    ## f2.0.4 -0.93754073   1    1 f2 0.4
    ## f2.0.5  0.00000000   3    2 f2 0.5
    ## f2.0.6 -0.59598343   4    1 f2 0.6
    ## f2.0.7 -0.16807961   2    2 f2 0.7
    ## f2.0.8 -0.02478903   1    1 f2 0.8

As we can see from the comp entry, there are two components, the second one with f1=0.1, f1=0.3, f1=0.5, and f2=0.5 and f2=0.7. A reference is introduced in each of the components, i.e. f2.0.2=0 and f2.0.5=0. If we look at the dataset, the component structure becomes clearer:

    data.frame(f1, f2, comp = est$cfactor)
    ##     f1  f2 comp
    ## 1  0.3 0.5    2
    ## 2  0.7 0.2    1
    ## 3  0.6 0.2    1
    ## 4  0.2 0.6    1
    ## 5  0.8 0.6    1
    ## 6  0.4 0.2    1
    ## 7  0.4 0.4    1
    ## 8  0.6 0.3    1
    ## 9  0.2 0.6    1
    ## 10 0.5 0.5    2
    ## 11 0.4 0.2    1
    ## 12 0.5 0.7    2
    ## 13 0.4 0.6    1
    ## 14 0.2 0.2    1
    ## 15 0.8 0.2    1
    ## 16 0.2 0.8    1
    ## 17 0.3 0.5    2
    ## 18 0.8 0.1    1
    ## 19 0.7 0.1    1
    ## 20 0.1 0.7    2

Observations 1, 10, 12, 17, and 20 belong to component 2; no other observation has f1 %in% c(0.1, 0.3, 0.5) or f2 %in% c(0.5, 0.7), thus it is clear that coefficients for these cannot be compared to other coefficients. lm is silent about this component structure, hence its coefficients are hard to interpret. The predictive properties and the residuals are nevertheless the same:
    f1 <- factor(f1); f2 <- factor(f2)
    summary(lm(y ~ x1 + f1 + f2))
    ##
    ## Call:
    ## lm(formula = y ~ x1 + f1 + f2)
    ##
    ## Residuals:
    ##          1          2          3          4          5          6          7
    ##  2.095e-01  7.331e-01  8.327e-17 -4.393e-01 -6.495e-01 -5.678e-01  1.457e-16
    ##          8          9         10         11         12         13         14
    ##  6.939e-18  6.029e-01  3.469e-17  8.197e-02  7.633e-17  4.859e-01 -1.636e-01
    ##         15         16         17         18         19         20
    ## -8.366e-02 -9.021e-17 -2.095e-01  7.331e-01 -7.331e-01 -4.857e-16
    ##
    ## Coefficients: (1 not defined because of singularities)
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   0.6742     0.8930   0.755 0.484267
    ## x1            1.9609     0.2854   6.872 0.000998 ***
    ## f10.2        -2.4282     1.4826  -1.638 0.162390
    ## f10.3        -0.2382     1.5798  -0.151 0.886042
    ## f10.4        -1.9502     1.6137  -1.208 0.280892
    ## f10.5        -0.1680     1.1778  -0.143 0.892114
    ## f10.6        -1.7645     1.6753  -1.053 0.340448
    ## f10.7        -4.0375     1.5625  -2.584 0.049189 *
    ## f10.8        -2.4642     1.5037  -1.639 0.162193
    ## f20.2         2.1776     1.0249   2.125 0.086978 .
    ## f20.3         2.6178     1.4837   1.764 0.137940
    ## f20.4         1.2401     1.6508   0.751 0.486366
    ## f20.5         0.1681     1.3269   0.127 0.904134
    ## f20.6         1.5816     1.1080   1.428 0.212781
    ## f20.7             NA         NA      NA       NA
    ## f20.8         2.1528     1.3808   1.559 0.179716
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.8097 on 5 degrees of freedom
    ## Multiple R-squared: 0.9849, Adjusted R-squared: 0.9425
    ## F-statistic: 23.23 on 14 and 5 DF, p-value: 0.001318

3. Identification with three or more factors

In the case with three or more factors, there is no general intuitive theory (yet) for handling identification problems. lfe resorts to the simple-minded approach that non-obvious multicollinearities arise among the first two factors, and assumes it is sufficient with a single reference level for each of the remaining factors, i.e. that they in principle could be specified as ordinary dummies. In other words, the order of the factors in the model specification is important. A typical example would be 3 factors; individuals, firms and education:

    est <- felm(logwage ~ x1 + x2 | id + firm + edu)
    getfe(est)

This will result in the same number of references as if using the model

    logwage ~ x1 + x2 + edu | id + firm

though it may run faster (or slower). Alternatively, one could specify the model as
    logwage ~ x1 + x2 | firm + edu + id

This would not account for a partitioning of the labour market along individual/firm, but along firm/education, using a single reference level for the individuals. In this example, there is some reason to suspect that it is not sufficient, depending on how edu is specified. There exists no general scheme that sets up suitable reference groups when there are more than two factors. It may happen that the default is sufficient. The function getfe will check whether this is so, and it will yield a warning about 'non-estimable function' if not. With some luck it may be possible to rearrange the order of the factors to avoid this situation.

There is nothing special with lfe in this respect. You will meet the same problem with lm: it will remove a reference level (or dummy variable) in each factor, but the system will still contain multicollinearities. You may remove reference levels until all the multicollinearities are gone, but there is no obvious way to interpret the resulting coefficients.

To illustrate, the classical example is when you include a factor for age (in years), a factor for observation year, and a factor for year of birth. You pick a reference individual, e.g. age=50, year=2013 and birth=1963, but this is not sufficient to remove all the multicollinearities. If you analyze this problem (see e.g. [6]) you will find that the coefficients are only identified up to linear trends. You may force the linear trend between birth=1963 and birth=1990 to zero, by removing the reference level birth=1990, and the system will be free of multicollinearities. In this case the birth coefficients have the interpretation as being deviations from a linear trend between 1963 and 1990, though you do not know which linear trend. The age and year coefficients are also relative to this same unknown trend.

In the above case, the multicollinearity is obviously built into the model, and it is possible to remove it and find some intuitive interpretation of the coefficients. In the general case, when either lm or getfe reports a handful of non-obvious spurious multicollinearities between factors with many levels, you probably will not be able to find any reasonable way to interpret coefficients. Of course, certain linear
combinations of coefficients will be unique, i.e. estimable, and these may be found by e.g. the procedures in [5,8], but the general picture is muddy.

lfe does not provide a solution to this problem; however, getfe will still provide a vector of coefficients which results from finding a non-unique solution to a certain set of equations. To get any sense from this, an estimable function must be applied. The simplest one is to pick a reference for each factor, subtract this coefficient from each of the other coefficients in the same factor, and add it to a common intercept. If this does not result in an estimable function, you are out of luck.

If you for some reason believe that you know of an estimable function, you may provide this to getfe via the ef argument. There is an example in the getfe documentation. You may also test it for estimability with the function is.estimable; this is a probabilistic test which almost never fails (see [4, Remark 6.2]).

4. Specifying an estimable function

A model of the type

    y ~ x1 + x2 + f1 + f2 + f3

may be written in matrix notation as

(1)  $y = X\beta + D\alpha + \epsilon$,

where $X$ is a matrix with columns x1 and x2 and $D$ is a matrix of dummies constructed from the levels of the factors f1, f2, f3. Formally, an estimable function in our context is a matrix operator whose row space is contained in the row space of $D$. That is, an estimable function may be written as a matrix, like the contrasts argument to lm. However, the lfe package uses an R function instead. That is, felm is called first; it uses the Frisch-Waugh-Lovell theorem to project out the $D\alpha$ term from (1) (see [4, Remark 3.2]):

    est <- felm(y ~ x1 + x2 | f1 + f2 + f3)

This yields the parameters for x1 and x2, i.e. $\hat\beta$. To find $\hat\alpha$, the parameters for the levels of f1, f2, f3, getfe solves a certain linear system (see [4, eq. (14)]):

(2)  $D\gamma = \rho$

where the vector $\rho$ can be computed when we have $\hat\beta$. This does not identify $\gamma$ uniquely; we have to apply an estimable function to $\gamma$. The estimable function $F$ is characterized by the property that $F\gamma_1 = F\gamma_2$ whenever $\gamma_1$ and $\gamma_2$ are solutions to equation (2). Rather than coding $F$ as a matrix, lfe codes it as a function. It is of course possible to let the function apply a matrix, so this is not a material distinction.

So, let's look at an example of how an estimable function may be made:
    library(lfe)
    # 100 observations, three factors with 7, 8 and 10 levels
    x1 <- rnorm(100)
    f1 <- sample(7, 100, replace = TRUE)
    f2 <- sample(8, 100, replace = TRUE) / 8
    f3 <- sample(10, 100, replace = TRUE) / 10
    e1 <- sin(f1) + 0.02 * f2^2 + 0.17 * f3^3 + rnorm(100)
    y <- 2.5 * x1 + (e1 - mean(e1))
    summary(est <- felm(y ~ x1 | f1 + f2 + f3))
    ##
    ## Call:
    ##    felm(formula = y ~ x1 | f1 + f2 + f3)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -2.18822 -0.55222  0.09278  0.62858  2.31181
    ##
    ## Coefficients:
    ##    Estimate Std. Error t value Pr(>|t|)
    ## x1   2.5538     0.1026   24.88   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.9963 on 76 degrees of freedom
    ## Multiple R-squared(full model): 0.9086   Adjusted R-squared: 0.8809
    ## Multiple R-squared(proj model): 0.8907   Adjusted R-squared: 0.8576
    ## F-statistic(full model): 32.84 on 23 and 76 DF, p-value: < 2.2e-16
    ## F-statistic(proj model): 619.2 on 1 and 76 DF, p-value: < 2.2e-16
    ## *** Standard errors may be too high due to more than 2 groups and exactDOF=FALSE

In this case, with 3 factors, we cannot be certain that it is sufficient with a single reference in two of the factors, but we try it as an exercise. (lfe does not include an intercept, it is subsumed in one of the factors, so it should tentatively be sufficient with a reference for the two others.)

The input to our estimable function is a solution $\gamma$ of equation (2). The argument addnames is a logical, set to TRUE when the function should add names to the resulting vector. The coefficients are ordered the same way as the levels in the factors. We should pick a single reference in each of the factors f2 and f3, subtract these, and add the sum to the first factor:

    ef <- function(gamma, addnames) {
      # coordinates 1:7 are f1, 8:15 are f2, 16:25 are f3
      ref2 <- gamma[[8]]
      ref3 <- gamma[[16]]
      gamma[1:7] <- gamma[1:7] + ref2 + ref3
      gamma[8:15] <- gamma[8:15] - ref2
      gamma[16:25] <- gamma[16:25] - ref3
      if (addnames) {
        names(gamma) <- c(paste('f1', 1:7, sep = '.'),
                          paste('f2', 1:8, sep = '.'),
                          paste('f3', 1:10, sep = '.'))
      }
      gamma
    }
    is.estimable(ef, fe = est$fe)
    ## [1] TRUE
    getfe(est, ef = ef)
    ##             effect
    ## f1.1  -0.013634682
    ## f1.2   0.727611420
    ## f1.3  -0.521386749
    ## f1.4  -0.646496809
    ## f1.5  -1.568204155
    ## f1.6  -0.151511048
    ## f1.7   0.286980841
    ## f2.1   0.000000000
    ## f2.2  -0.289569658
    ## f2.3   0.168627982
    ## f2.4  -0.658310494
    ## f2.5   0.253613291
    ## f2.6   0.427094488
    ## f2.7  -0.249330433
    ## f2.8  -0.772323808
    ## f3.1   0.000000000
    ## f3.2  -0.004888500
    ## f3.3  -0.205494033
    ## f3.4   0.449689498
    ## f3.5   0.729926376
    ## f3.6   0.697845803
    ## f3.7   0.569065140
    ## f3.8   0.583417051
    ## f3.9   0.113820998
    ## f3.10  0.005328265

We may compare this to the default estimable function, which picks a reference in each connected component as defined by the first two factors.

    getfe(est)
    ##               effect obs comp fe   idx
    ## f1.1     -0.74124610  16    1 f1     1
    ## f1.2      0.00000000  19    1 f1     2
    ## f1.3     -1.24899817  15    1 f1     3
    ## f1.4     -1.37410823  12    1 f1     4
    ## f1.5     -2.29581558  10    1 f1     5
    ## f1.6     -0.87912247  16    1 f1     6
    ## f1.7     -0.44063058  12    1 f1     7
    ## f2.0.125  1.29667656  11    1 f2 0.125
    ## f2.0.25   1.00710691  15    1 f2  0.25
    ## f2.0.375  1.46530454  14    1 f2 0.375
    ## f2.0.5    0.63836607  11    1 f2   0.5
    ## f2.0.625  1.55028985  12    1 f2 0.625
    ## f2.0.75   1.72377105  12    1 f2  0.75
    ## f2.0.875  1.04734613  14    1 f2 0.875
    ## f2.1      0.52435275  11    1 f2     1
    ## f3.0.1   -0.56906514   8    2 f3   0.1
    ## f3.0.2   -0.57395364  11    2 f3   0.2
    ## f3.0.3   -0.77455917  10    2 f3   0.3
    ## f3.0.4   -0.11937564   7    2 f3   0.4
    ## f3.0.5    0.16086124  11    2 f3   0.5
    ## f3.0.6    0.12878066   7    2 f3   0.6
    ## f3.0.7    0.00000000  14    2 f3   0.7
    ## f3.0.8    0.01435191  13    2 f3   0.8
    ## f3.0.9   -0.45524415   5    2 f3   0.9
    ## f3.1     -0.56373688  14    2 f3     1

We see that the default has some more information. It uses the level names, and some more information, added like this:
    efactory(est)
    ## function(v, addnames)
    ## {
    ##     esum <- sum(v[extrarefs])
    ##     df <- v[refsubs]
    ##     sub <- ifelse(is.na(df), 0, df)
    ##     df <- v[refsuba]
    ##     add <- ifelse(is.na(df), 0, df + esum)
    ##     v <- v - sub + add
    ##     if (addnames) {
    ##         names(v) <- nm
    ##         attr(v, "extra") <- list(obs = obs, comp = comp, fe = fef,
    ##             idx = idx)
    ##     }
    ##     v
    ## }
    ## <bytecode: 0x556a0d0de320>
    ## <environment: 0x556a0cd58200>

I.e. when asked to provide level names, it is also possible to add additional information as a list (or data.frame) as an attribute 'extra'. The vectors extrarefs, refsubs, refsuba etc. are precomputed by efactory for speed.

Here is the above example, but we create an intercept instead, and don't report the zero coefficients, so that it closely resembles the output from lm:

    f1 <- factor(f1); f2 <- factor(f2); f3 <- factor(f3)
    ef <- function(gamma, addnames) {
      ref1 <- gamma[[1]]
      ref2 <- gamma[[8]]
      ref3 <- gamma[[16]]
      # put the intercept in the first coordinate
      gamma[[1]] <- ref1 + ref2 + ref3
      gamma[2:7] <- gamma[2:7] - ref1
      # drop the two zero coefficients by shifting f2 and f3 down
      gamma[8:14] <- gamma[9:15] - ref2
      gamma[15:23] <- gamma[17:25] - ref3
      length(gamma) <- 23
      if (addnames) {
        names(gamma) <- c('(Intercept)',
                          paste('f1', levels(f1)[2:7], sep = ''),
                          paste('f2', levels(f2)[2:8], sep = ''),
                          paste('f3', levels(f3)[2:10], sep = ''))
      }
      gamma
    }
    getfe(est, ef = ef, bN = 1000, se = TRUE)
    ##                   effect        se
    ## (Intercept) -0.013634682 0.5114435
    ## f12          0.741246102 0.3160189
    ## f13         -0.507752065 0.3373377
    ## f14         -0.632862126 0.3572938
    ## f15         -1.554569472 0.3811126
    ## f16         -0.137876368 0.3402928
    ## f17          0.300615524 0.3485067
    ## f20.25      -0.289569657 0.4047196
    ## f20.375      0.168627982 0.3873691
    ## f20.5       -0.658310496 0.4563872
    ## f20.625      0.253613289 0.4170575
    ## f20.75       0.427094489 0.4318387
    ## f20.875     -0.249330434 0.4006819
    ## f21         -0.772323808 0.4004357
    ## f30.2       -0.004888501 0.4376401
    ## f30.3       -0.205494035 0.4464411
    ## f30.4        0.449689499 0.4995956
    ## f30.5        0.729926375 0.4339349
    ## f30.6        0.697845804 0.4643069
    ## f30.7        0.569065141 0.4218616
    ## f30.8        0.583417050 0.4345362
    ## f30.9        0.113820992 0.5315648
    ## f31          0.005328265 0.4159621
    # compare with lm
    summary(lm(y ~ x1 + f1 + f2 + f3))
    ##
    ## Call:
    ## lm(formula = y ~ x1 + f1 + f2 + f3)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -2.18822 -0.55222  0.09278  0.62858  2.31181
    ##
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -0.013635   0.579452  -0.024 0.981289
    ## x1           2.553781   0.102627  24.884  < 2e-16 ***
    ## f12          0.741246   0.385522   1.923 0.058264 .
    ## f13         -0.507752   0.393773  -1.289 0.201151
    ## f14         -0.632862   0.420832  -1.504 0.136768
    ## f15         -1.554569   0.444675  -3.496 0.000792 ***
    ## f16         -0.137876   0.387709  -0.356 0.723111
    ## f17          0.300616   0.417071   0.721 0.473257
    ## f20.25      -0.289570   0.455406  -0.636 0.526785
    ## f20.375      0.168628   0.457298   0.369 0.713340
    ## f20.5       -0.658310   0.516023  -1.276 0.205934
    ## f20.625      0.253613   0.475630   0.533 0.595440
    ## f20.75       0.427094   0.503638   0.848 0.399090
    ## f20.875     -0.249330   0.458179  -0.544 0.587913
    ## f21         -0.772324   0.460361  -1.678 0.097524 .
    ## f30.2       -0.004889   0.491509  -0.010 0.992091
    ## f30.3       -0.205494   0.510660  -0.402 0.688513
    ## f30.4        0.449689   0.567468   0.792 0.430565
    ## f30.5        0.729926   0.504571   1.447 0.152113
    ## f30.6        0.697846   0.546266   1.277 0.205320
    ## f30.7        0.569065   0.466883   1.219 0.226667
    ## f30.8        0.583417   0.473972   1.231 0.222152
    ## f30.9        0.113821   0.584693   0.195 0.846172
    ## f31          0.005328   0.467270   0.011 0.990932
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.9963 on 76 degrees of freedom
    ## Multiple R-squared: 0.9086, Adjusted R-squared: 0.8809
    ## F-statistic: 32.84 on 23 and 76 DF, p-value: < 2.2e-16

5. Non-estimability

We consider another example. To ensure spurious relations, there are almost as many factor levels as there are observations, and it will be hard to find enough estimable functions to interpret all the coefficients. The coefficient for x1 is still estimated, but with a large standard error. Note that this is an illustration of non-obvious non-estimability which may occur in much larger datasets; the author does not endorse using this kind of model for the kind of data you find below.

    set.seed(55)
    # 25 observations, 9 + 8 + 8 = 25 potential factor levels
    x1 <- rnorm(25)
    f1 <- sample(9, length(x1), replace = TRUE)
    f2 <- sample(8, length(x1), replace = TRUE)
    f3 <- sample(8, length(x1), replace = TRUE)
    e1 <- sin(f1) + 0.02 * f2^2 + 0.17 * f3^3 + rnorm(length(x1))
    y <- 2.5 * x1 + (e1 - mean(e1))
    summary(est <- felm(y ~ x1 | f1 + f2 + f3))
    ##
    ## Call:
    ##    felm(formula = y ~ x1 | f1 + f2 + f3)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.43725 -0.09946  0.00000  0.05047  0.38973
    ##
    ## Coefficients:
    ##    Estimate Std. Error t value Pr(>|t|)
    ## x1   0.9735     1.3111   0.743    0.593
    ##
    ## Residual standard error: 1.146 on 1 degrees of freedom
    ## Multiple R-squared(full model): 0.9999   Adjusted R-squared: 0.9977
    ## Multiple R-squared(proj model): 0.3554   Adjusted R-squared: -14.47
    ## F-statistic(full model): 447.3 on 23 and 1 DF, p-value: 0.03731
    ## F-statistic(proj model): 0.5514 on 1 and 1 DF, p-value: 0.5934
    ## *** Standard errors may be too high due to more than 2 groups and exactDOF=FALSE

The default estimable function fails, and the coefficients from getfe are not usable. getfe yields a warning in this case.

    ef <- efactory(est)
    is.estimable(ef, est$fe)
    ## Warning in is.estimable(ef, est$fe): non-estimable function, largest error 2e-04 in coordinate 4 ("f1.4")
    ## [1] FALSE

Indeed, the rank deficiency is larger than expected. There are more spurious relations between the factors than what can be accounted for by looking at components in the first two factors. In this low-dimensional example we may find the matrix $D$ of equation (2), and its (column) rank deficiency is larger than 2.
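For two factors only, the column rank deficiency of the dummy matrix equals the number of connected components, which is why a larger deficiency signals extra, non-obvious relations. The following is a minimal base-R sketch of that baseline on a hypothetical toy dataset (the names toy_f1, toy_f2, dummies and Dtoy are illustrative and are not part of lfe):

```r
# Hypothetical toy data: two factors whose levels split into two
# disconnected groups, {a, b} x {p, q} and {c} x {r}.
toy_f1 <- factor(c("a", "a", "b", "b", "c", "c"))
toy_f2 <- factor(c("p", "q", "p", "q", "r", "r"))

# Full dummy matrix: one 0/1 column per level of each factor, no intercept.
dummies <- function(f) {
  m <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  colnames(m) <- levels(f)
  m
}
Dtoy <- cbind(dummies(toy_f1), dummies(toy_f2))

# Column rank deficiency = number of columns minus the numerical rank.
toy_deficiency <- ncol(Dtoy) - qr(Dtoy)$rank
toy_deficiency
```

Here the deficiency is 2, one for each of the two disconnected groups built into the toy data. With three factors, as in the example of this section, no such simple count is available.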
    f1 <- factor(f1); f2 <- factor(f2); f3 <- factor(f3)
    D <- makeDmatrix(list(f1, f2, f3))
    dim(D)
    ## [1] 25 25
    ncol(D) - as.integer(rankMatrix(D))
    ## [1] 3

Alternatively we can use an internal function in lfe for finding the rank deficiency directly.

    lfe:::rankDefic(list(f1, f2, f3))
    ## [1] 3

This rank deficiency also has an impact on the standard errors computed by felm. If we ignore the rank deficiency, the standard errors are scaled upwards: only slightly if the rank deficiency is small relative to the degrees of freedom, but if it is large, the impact on the standard errors can be substantial. The above mentioned rank-computation procedure can be activated by specifying exactDOF=TRUE in the call to felm, but it may be time-consuming if the factors have many levels. Computing the rank does not in itself help us find estimable functions for getfe.

    summary(est <- felm(y ~ x1 | f1 + f2 + f3, exactDOF = TRUE))
    ##
    ## Call:
    ##    felm(formula = y ~ x1 | f1 + f2 + f3, exactDOF = TRUE)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.43725 -0.09946  0.00000  0.05047  0.38973
    ##
    ## Coefficients:
    ##    Estimate Std. Error t value Pr(>|t|)
    ## x1   0.9735     0.9271    1.05    0.404
    ##
    ## Residual standard error: 0.8105 on 2 degrees of freedom
    ## Multiple R-squared(full model): 0.9999   Adjusted R-squared: 0.9988
    ## Multiple R-squared(proj model): 0.3554   Adjusted R-squared: -6.735
    ## F-statistic(full model): 935.2 on 22 and 2 DF, p-value: 0.001069
    ## F-statistic(proj model): 1.103 on 1 and 2 DF, p-value: 0.4038

We can get an idea of what happens if we keep the dummies for f3. In this case, with 2 factors, lfe will partition the dataset into connected components and account for all the multicollinearities among the factors f1 and f2 just as above, but this is not sufficient. The interpretation of the resulting coefficients is not straightforward.
    summary(est <- felm(y ~ x1 + f3 | f1 + f2, exactDOF = TRUE))
    ## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either rank-deficient or indefinite
    ## Warning in chol.default(mat, pivot = TRUE, tol = tol): the matrix is either rank-deficient or indefinite
    ##
    ## Call:
    ##    felm(formula = y ~ x1 + f3 | f1 + f2, exactDOF = TRUE)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.43725 -0.09946  0.00000  0.05047  0.38973
    ##
    ## Coefficients:
    ##     Estimate Std. Error t value Pr(>|t|)
    ## x1    0.9735     0.9271   1.050 0.403842
    ## f32   0.4317     1.0394   0.415 0.718239
    ## f33   5.1696     1.1358   4.552 0.045034 *
    ## f34   9.8295     2.1756   4.518 0.045659 *
    ## f35  19.0791     1.3493  14.140 0.004964 **
    ## f36  34.7134     2.3414  14.826 0.004519 **
    ## f37  55.0727     1.4197  38.791 0.000664 ***
    ## f38       NA         NA      NA       NA
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.8105 on 2 degrees of freedom
    ## Multiple R-squared(full model): 0.9999   Adjusted R-squared: 0.9988
    ## Multiple R-squared(proj model): 0.9994   Adjusted R-squared: 0.9929
    ## F-statistic(full model): 935.2 on 22 and 2 DF, p-value: 0.001069
    ## F-statistic(proj model): 425 on 8 and 2 DF, p-value: 0.002349
    getfe(est)
    ##           effect obs comp fe idx
    ## f1.1 -24.1184347   5    1 f1   1
    ## f1.2 -25.6317181   4    1 f1   2
    ## f1.3 -24.5567621   3    1 f1   3
    ## f1.4  55.0365226   1    1 f1   4
    ## f1.5 -27.5445226   2    1 f1   5
    ## f1.6 -22.7734034   2    1 f1   6
    ## f1.7 -24.3570518   2    1 f1   7
    ## f1.8 -24.6884849   3    1 f1   8
    ## f1.9 -26.3355630   3    1 f1   9
    ## f2.1  -0.2701644   2    1 f2   1
    ## f2.2  -0.5039046   3    1 f2   2
    ## f2.3   3.6656524   1    1 f2   3
    ## f2.4  -4.0858605   1    1 f2   4
    ## f2.5  -1.3292328   4    1 f2   5
    ## f2.6   0.0000000   6    1 f2   6
    ## f2.7   1.0658610   6    1 f2   7
    ## f2.8   3.3786353   2    1 f2   8

In this particular example, we may use a different order of the factors, and we see that by partitioning the dataset on the factors f1, f3 instead of f1, f2, there are 2 connected components (the factor f2 gets its own comp-code, but this is not a graph theoretic component number, it merely indicates that there is a separate reference among these).
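That the number of connected components depends on which pair of factors is used can be checked without lfe. The following is a minimal base-R sketch, not lfe's implementation; the helper components2 and the toy vectors g1, g2, g3 are hypothetical. It labels the components of the graph whose vertices are factor levels and whose edges are observations, by repeatedly propagating the smallest label within each level.

```r
# One component label per observation for the two-factor graph of a and b.
components2 <- function(a, b) {
  a <- as.integer(factor(a))
  b <- as.integer(factor(b))
  comp <- seq_along(a)                 # start: every observation on its own
  repeat {
    old <- comp
    comp <- ave(comp, a, FUN = min)    # share the smallest label within each a-level
    comp <- ave(comp, b, FUN = min)    # then within each b-level
    if (all(comp == old)) break        # stop when nothing changes
  }
  as.integer(factor(comp))
}

# Hypothetical toy data with three factors.
g1 <- c(1, 1, 2, 2, 3)
g2 <- c(1, 2, 1, 2, 3)
g3 <- c(1, 2, 2, 3, 3)

n12 <- max(components2(g1, g2))   # components when partitioning on g1, g2
n13 <- max(components2(g1, g3))   # components when partitioning on g1, g3
c(n12 = n12, n13 = n13)
```

In this toy data the pair g1, g2 gives two components while g1, g3 gives one, so the number of references introduced by a components-based scheme changes with the order of the factors.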
    summary(est <- felm(y ~ x1 | f1 + f3 + f2, exactDOF = TRUE))
    ##
    ## Call:
    ##    felm(formula = y ~ x1 | f1 + f3 + f2, exactDOF = TRUE)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.43725 -0.09946  0.00000  0.05047  0.38973
    ##
    ## Coefficients:
    ##    Estimate Std. Error t value Pr(>|t|)
    ## x1   0.9735     0.9271    1.05    0.404
    ##
    ## Residual standard error: 0.8105 on 2 degrees of freedom
    ## Multiple R-squared(full model): 0.9999   Adjusted R-squared: 0.9988
    ## Multiple R-squared(proj model): 0.3554   Adjusted R-squared: -6.735
    ## F-statistic(full model): 935.2 on 22 and 2 DF, p-value: 0.001069
    ## F-statistic(proj model): 1.103 on 1 and 2 DF, p-value: 0.4038
    is.estimable(efactory(est), est$fe)
    ## [1] TRUE
    getfe(est)
    ##           effect obs comp fe idx
    ## f1.1   0.0000000   5    1 f1   1
    ## f1.2  -1.5132833   4    1 f1   2
    ## f1.3  -0.4383273   3    1 f1   3
    ## f1.4   0.0000000   1    2 f1   4
    ## f1.5  -3.4260877   2    1 f1   5
    ## f1.6   1.3450313   2    1 f1   6
    ## f1.7  -0.2386171   2    1 f1   7
    ## f1.8  -0.5700503   3    1 f1   8
    ## f1.9  -2.2171283   3    1 f1   9
    ## f3.1 -24.1184347   3    1 f3   1
    ## f3.2 -23.6867840   2    1 f3   2
    ## f3.3 -18.9488113   4    1 f3   3
    ## f3.4 -14.2889719   1    1 f3   4
    ## f3.5  -5.0393265   5    1 f3   5
    ## f3.6  10.5949811   4    1 f3   6
    ## f3.7  30.9542674   5    1 f3   7
    ## f3.8  55.0365226   1    2 f3   8
    ## f2.1  -0.2701643   2    3 f2   1
    ## f2.2  -0.5039045   3    3 f2   2
    ## f2.3   3.6656524   1    3 f2   3
    ## f2.4  -4.0858606   1    3 f2   4
    ## f2.5  -1.3292328   4    3 f2   5
    ## f2.6   0.0000000   6    3 f2   6
    ## f2.7   1.0658611   6    3 f2   7
    ## f2.8   3.3786355   2    3 f2   8

Below is the same estimation in lm. We see that the coefficient for x1 is identical to the one from felm, but there is no obvious relation between e.g. the coefficients for f1; the difference f14-f15 is not the same for lm and felm. Since these are in different components, they are not comparable. But of course, if we compare in the same component, e.g. f16-f17 or take a combination which actually occurs in the dataset, it is unique (estimable):

    data.frame(f1, f2, f3)[1, ]
    ##   f1 f2 f3
    ## 1  2  6  3

I.e. if we add the coefficients f1.2 + f2.6 + f3.3 and include the intercept for lm, we will get the same number for both lm and felm. That is, for predicting the actual dataset, estimability plays no role, we obtain the same residuals anyway. It is only for predicting outside of the dataset estimability is important.
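The point that fitted values do not depend on which solution of the level equations is picked can be illustrated on a tiny system. This is a base-R sketch under stated assumptions: the matrix Dsmall and the vectors sol1, sol2 and shift are hypothetical and unrelated to the dataset above.

```r
# Hypothetical toy design: levels a, b of one factor and p, q of another,
# with every combination observed once.
Dsmall <- rbind(c(1, 0, 1, 0),
                c(1, 0, 0, 1),
                c(0, 1, 1, 0),
                c(0, 1, 0, 1))
colnames(Dsmall) <- c("a", "b", "p", "q")

sol1  <- c(a = 1, b = 3, p = 0, q = 2)     # one solution of the level equations
shift <- c(a = 5, b = 5, p = -5, q = -5)   # Dsmall %*% shift is zero
sol2  <- sol1 + shift                      # another, equally valid solution

fit1 <- drop(Dsmall %*% sol1)
fit2 <- drop(Dsmall %*% sol2)

same_fit  <- isTRUE(all.equal(fit1, fit2))                        # fitted values agree
same_diff <- (sol1[["b"]] - sol1[["a"]]) == (sol2[["b"]] - sol2[["a"]])  # within-factor difference agrees
same_a    <- sol1[["a"]] == sol2[["a"]]                           # a single level does not
c(same_fit = same_fit, same_diff = same_diff, same_a = same_a)
```

Sums of coefficients that occur as rows of the dummy matrix, and differences within a connected component, come out the same for both solutions; an individual level coefficient does not, which is exactly what an estimable function has to take care of.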
    summary(est <- lm(y ~ x1 + f1 + f2 + f3))
    ##
    ## Call:
    ## lm(formula = y ~ x1 + f1 + f2 + f3)
    ##
    ## Residuals:
    ##          1          2          3          4          5          6          7
    ##  3.883e-01 -2.873e-01 -4.899e-02  1.485e-01  3.378e-01  1.388e-17 -5.047e-02
    ##          8          9         10         11         12         13         14
    ##  5.047e-02 -4.372e-01  3.883e-01 -3.407e-01 -5.047e-02 -9.714e-17  3.393e-01
    ##         15         16         17         18         19         20         21
    ##  4.163e-17 -4.163e-17  4.899e-02 -1.485e-01 -4.163e-17  3.897e-01 -4.899e-02
    ##         22         23         24         25
    ## -2.398e-01 -3.393e-01 -4.163e-17 -9.946e-02
    ##
    ## Coefficients: (1 not defined because of singularities)
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -24.3886     1.1202 -21.772 0.002103 **
    ## x1            0.9735     0.9271   1.050 0.403842
    ## f12          -1.5133     1.2712  -1.190 0.356003
    ## f13          -0.4383     1.0229  -0.429 0.710016
    ## f14          79.1550     2.2569  35.073 0.000812 ***
    ## f15          -3.4261     1.2614  -2.716 0.113027
    ## f16           1.3450     2.8879   0.466 0.687194
    ## f17          -0.2386     0.9916  -0.241 0.832255
    ## f18          -0.5701     2.0710  -0.275 0.808947
    ## f19          -2.2171     1.1201  -1.979 0.186330
    ## f22          -0.2337     2.2869  -0.102 0.927917
    ## f23           3.9358     2.7071   1.454 0.283177
    ## f24          -3.8157     3.1342  -1.217 0.347584
    ## f25          -1.0591     1.2320  -0.860 0.480585
    ## f26           0.2702     0.9701   0.278 0.806791
    ## f27           1.3360     1.1119   1.202 0.352520
    ## f28           3.6488     1.4917   2.446 0.134276
    ## f32           0.4317     1.0394   0.415 0.718239
    ## f33           5.1696     1.1358   4.552 0.045034 *
    ## f34           9.8295     2.1756   4.518 0.045659 *
    ## f35          19.0791     1.3493  14.140 0.004964 **
    ## f36          34.7134     2.3414  14.826 0.004519 **
    ## f37          55.0727     1.4197  38.791 0.000664 ***
    ## f38               NA         NA      NA       NA
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.8105 on 2 degrees of freedom
    ## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9988
    ## F-statistic: 935.2 on 22 and 2 DF, p-value: 0.001069

6. Weeks-Williams partitions

There is a partial solution to the non-estimability problem in [8]. Their idea is to partition the dataset into components in which all differences between factor levels are estimable. The components are connected components of a subgraph of an e-dimensional grid graph where e is the number of factors. That is, a graph is constructed with the observations as vertices; two observations are adjacent (in a graph theoretic sense) if they differ in at most one of the factors. The dataset is then partitioned into (graph theoretic) connected components. It is a finer partitioning than the above, and consequently introduces more reference levels than is necessary for identification. I.e. it does not find all estimable functions, but in some cases (e.g. in [7]) the largest component will be sufficiently large for proper analysis. It is of course always a question whether such an endogenous selection of observations will yield a dataset which results in unbiased coefficients.

This partitioning can be done by the compfactor function with argument WW=TRUE:

    fe <- list(f1, f2, f3)
    wwcomp <- compfactor(fe, WW = TRUE)

It has more levels than the rank deficiency

    lfe:::rankDefic(fe)
    ## [1] 3
    nlevels(wwcomp)
    ## [1] 17

and each of its components is contained in one of the previously considered components, no matter which two factors we consider. For the case of two factors, the concepts coincide.
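The adjacency rule described above can be sketched directly in base R. This is only an illustration of the definition, not how compfactor is implemented, and it builds an n-by-n matrix, so it is meant for small data. The helper ww_components is hypothetical; the seven rows of toy data are the factor values of observation 1 and of observations 2, 3, 5, 11, 14 and 20 as printed in this document.

```r
# Observations are vertices; two observations are adjacent when they differ
# in at most one factor. Returns one component label per observation.
ww_components <- function(...) {
  X <- cbind(...)                        # one column per factor
  n <- nrow(X)
  ndiff <- matrix(0L, n, n)              # number of factors in which i and j differ
  for (k in seq_len(ncol(X))) {
    ndiff <- ndiff + outer(X[, k], X[, k], "!=")
  }
  adj <- ndiff <= 1
  comp <- seq_len(n)
  repeat {                               # propagate the smallest label to neighbours
    new <- vapply(seq_len(n), function(i) min(comp[adj[i, ]]), numeric(1))
    if (all(new == comp)) break
    comp <- new
  }
  as.integer(factor(comp))               # labels are arbitrary, numbered from 1
}

# Factor values of observations 1, 2, 3, 5, 11, 14, 20 (in that order).
h1 <- c(2, 2, 3, 1, 3, 1, 3)
h2 <- c(6, 7, 6, 7, 7, 8, 7)
h3 <- c(3, 7, 7, 7, 7, 7, 6)
ww <- ww_components(h1, h2, h3)
ww
```

Observation 1 differs from each of the other six in at least two factors, so it ends up in a component of its own, while the other six can be reached from each other by changing one factor at a time. The numbering of the labels here is arbitrary and need not match the levels returned by compfactor.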
    nlevels(interaction(compfactor(fe), wwcomp))
    ## [1] 17
    # pick the largest component:
    wwdata <- data.frame(y, x1, f1, f2, f3)[wwcomp == 1, ]
    print(wwdata)
    ##           y           x1 f1 f2 f3
    ## 2  28.45513 -1.812376850  2  7  7
    ## 3  30.61452  0.151582984  3  6  7
    ## 5  32.35977  0.001908206  1  7  7
    ## 11 31.19345 -0.048910950  3  7  7
    ## 14 34.32095 -0.360763148  1  8  7
    ## 20 12.57960  0.993657777  3  7  6

That is, we can start in one of the observations and travel through all of them by changing just one of f1, f2, f3 at a time. In this particular example, though, there are more parameters than there are observations, so an estimation would not be feasible.

efactory cannot easily be modified to produce an estimable function corresponding to WW components. The reason is that efactory, and the logic in getfe, work on partitions of factor levels, not on partitions of the dataset; these are the same for the two-factor case.

WW partitions have the property that if you pick any two of the factors and partition a WW-component into the previously mentioned non-WW partitions, there will be only one component, hence you may use any of the estimable functions from efactory on each partition. That is, a way to use WW partitions with lfe is to do the whole analysis on the largest WW-component. felm may still be used on the whole dataset, and it may yield different results than what you get by analysing the largest WW-component.

Here is a larger example:

    set.seed(135)
    # 10000 observations; f2 and f3 are built from f1 so that the factors are related
    x <- rnorm(10000)
    f1 <- sample(1000, length(x), replace = TRUE)
    f2 <- (f1 + sample(18, length(x), replace = TRUE)) %% 500
    f3 <- (f2 + sample(9, length(x), replace = TRUE)) %% 500
    y <- x + 1e-4 * f1 + sin(f2^2) + cos(f3)^3 + 0.5 * rnorm(length(x))
    dataset <- data.frame(y, x, f1, f2, f3)
    summary(est <- felm(y ~ x | f1 + f2 + f3, data = dataset, exactDOF = TRUE))
    ##
    ## Call:
    ##    felm(formula = y ~ x | f1 + f2 + f3, data = dataset, exactDOF = TRUE)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -1.63055 -0.29857 -0.00236  0.30599  1.79423
    ##
    ## Coefficients:
    ##   Estimate Std. Error t value Pr(>|t|)
    ## x 0.998552   0.005548     180   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.4957 on 8001 degrees of freedom
    ## Multiple R-squared(full model): 0.9058   Adjusted R-squared: 0.8822
    ## Multiple R-squared(proj model): 0.8019   Adjusted R-squared: 0.7524
    ## F-statistic(full model): 38.49 on 1998 and 8001 DF, p-value: < 2.2e-16
    ## F-statistic(proj model): 3.239e+04 on 1 and 8001 DF, p-value: < 2.2e-16

We count the number of connected components in f1, f2, and see that this is sufficient to ensure estimability

    nlevels(est$cfactor)
    ## [1] 1
    is.estimable(efactory(est), est$fe)
    ## [1] TRUE
    nrow(alpha <- getfe(est))
    ## [1] 2000

It has rank deficiency one less than the number of factors:

    lfe:::rankDefic(est$fe)
    ## [1] 2

Then we analyse the largest WW-component

    wwcomp <- compfactor(est$fe, WW = TRUE)
    nlevels(wwcomp)
    ## [1] 933
    wwset <- wwcomp == 1
    sum(wwset)
    ## [1] 3129
    summary(wwest <- felm(y ~ x | f1 + f2 + f3, data = dataset, subset = wwset, exactDOF = TRUE))
    ##
    ## Call:
    ##    felm(formula = y ~ x | f1 + f2 + f3, data = dataset, exactDOF = TRUE, subset = wwset)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -1.3765 -0.2784  0.0000  0.2791  1.5951
    ##
    ## Coefficients:
    ##   Estimate Std. Error t value Pr(>|t|)
    ## x 0.994390   0.009889   100.6   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.4858 on 2314 degrees of freedom
    ## Multiple R-squared(full model): 0.9182   Adjusted R-squared: 0.8894
    ## Multiple R-squared(proj model): 0.8138   Adjusted R-squared: 0.7483
    ## F-statistic(full model): 31.91 on 814 and 2314 DF, p-value: < 2.2e-16
    ## F-statistic(proj model): 1.011e+04 on 1 and 2314 DF, p-value: < 2.2e-16

We see that we get essentially the same coefficient for x in this case (0.994 against 0.999). This is not surprising; there is no obvious reason to believe that our selection of observations is skewed in this randomly created dataset. This one has the same rank deficiency:

    lfe:::rankDefic(wwest$fe)
    ## [1] 2

but a smaller number of identifiable coefficients.
    nrow(wwalpha <- getfe(wwest))
    ## [1] 816

We may compare effects which are common to the two methods:

    head(wwalpha)
    ##          effect obs comp fe idx
    ## f1.35 1.9324241   1    1 f1  35
    ## f1.38 0.8049655   3    1 f1  38
    ## f1.40 0.2392413   3    1 f1  40
    ## f1.41 1.0896624   2    1 f1  41
    ## f1.42 0.6428771   4    1 f1  42
    ## f1.43 1.4268411   4    1 f1  43
    alpha[c(35, 38, 40:43), ]
    ##          effect obs comp fe idx
    ## f1.35 0.9581561  10    1 f1  35
    ## f1.38 0.6367390   9    1 f1  38
    ## f1.40 0.8802633  12    1 f1  40
    ## f1.41 0.8586244  13    1 f1  41
    ## f1.42 0.8983646  13    1 f1  42
    ## f1.43 1.2634717  12    1 f1  43

but there is no obvious relation between e.g. the differences f1.35 - f1.38 (about 1.13 in the WW-component estimation against about 0.32 in the full estimation); they are very different in the two estimations. The coefficients are from different datasets, and the standard errors are large (about 0.7) with this few observations for each factor level.

The number of identified coefficients for each factor varies (these figures contain the two references):

    table(wwalpha[, 'fe'])
    ##
    ##  f1  f2  f3
    ## 417 198 201

References

[1] J. M. Abowd, R. H. Creecy, and F. Kramarz, Computing Person and Firm Effects Using Linked Longitudinal Employer-Employee Data, Tech. Report TP-2002-06, U.S. Census Bureau, 2002.
[2] J. M. Abowd, F. Kramarz, and D. N. Margolis, High Wage Workers and High Wage Firms, Econometrica 67 (1999), no. 2, 251–333.
[3] J. A. Eccleston and A. Hedayat, On the Theory of Connected Designs: Characterization and Optimality, Ann. Statist. 2 (1974), 1238–1255.
[4] S. Gaure, OLS with Multiple High Dimensional Category Variables, Computational Statistics and Data Analysis 66 (2013), 8–18.
[5] J. D. Godolphin and E. J. Godolphin, On the connectivity of row-column designs, Util. Math. 60 (2001), 51–65.
[6] L. L. Kupper, J. M. Janis, I. A. Salama, C. N. Yoshizawa, and B. G. Greenberg, Age-Period-Cohort Analysis: An Illustration of the Problems in Assessing Interaction in One Observation Per Cell Data, Commun. Statist.-Theor. Meth. 12 (1983), no. 23, 2779–2807.
[7] S. M. Torres, P. Portugal, J. T. Addison, and P. Guimarães, The Sources of Wage Variation: A Three-Way High-Dimensional Fixed Effects Regression Model, IZA Discussion Paper 7276, Institute for the Study of Labor (IZA), March 2013.
[8] D. L. Weeks and D. R. Williams, A Note on the Determination of Connectedness in an N-Way Cross Classification, Technometrics 6 (1964), no. 3, 319–324.

Ragnar Frisch Centre for Economic Research, Oslo, Norway