Lab 18 - PCA in Python
April 25, 2016

This lab on Principal Components Analysis is a Python adaptation of p. 401-404, 408-410 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Original adaptation by J. Warmenhoven, updated by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).

In [ ]:
    import pandas as pd
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    %matplotlib inline

1 10.4: Principal Components Analysis

In this lab, we perform PCA on the USArrests data set. The rows of the data set contain the 50 states, in alphabetical order:

In [ ]:
    df = pd.read_csv('USArrests.csv', index_col=0)
    df.head()

The columns of the data set contain four variables relating to various crimes:

In [ ]:
    df.info()

Let's start by taking a quick look at the column means of the data:

In [ ]:
    df.mean()

We see right away that the data have vastly different means. We can also examine the variances of the four variables:

In [ ]:
    df.var()

Not surprisingly, the variables also have vastly different variances: the UrbanPop variable measures the percentage of the population in each state living in an urban area, which is not a comparable number to the number of crimes committed in each state per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the Assault variable, since it has by far the largest mean and variance.

Thus, it is important to standardize the variables to have mean zero and standard deviation 1 before performing PCA. We can do this using the scale() function from sklearn:

In [ ]:
    from sklearn.preprocessing import scale

    X = pd.DataFrame(scale(df), index=df.index, columns=df.columns)
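
As a quick check of what the standardization step does, here is a minimal sketch on a small made-up table. The values and the column names `big` and `small` are invented for illustration and are not taken from USArrests. It confirms that each scaled column has mean zero and standard deviation one. One detail worth knowing: scale() divides by the population standard deviation (ddof=0), so the pandas default .std(), which uses ddof=1, will not return exactly 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical toy data: two columns on very different scales
toy = pd.DataFrame({'big': [100.0, 200.0, 300.0, 400.0],
                    'small': [0.1, 0.2, 0.4, 0.3]})

# Same pattern as in the lab: scale, then restore the index and column labels
toy_scaled = pd.DataFrame(scale(toy), index=toy.index, columns=toy.columns)

# scale() centers each column and divides by the population std (ddof=0)
col_means = toy_scaled.mean()
col_stds = toy_scaled.std(ddof=0)

print(col_means.round(6))
print(col_stds.round(6))
```

After this step both columns contribute on an equal footing, regardless of their original units.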

Now we'll use the PCA() function from sklearn to compute the loading vectors:

In [ ]:
    from sklearn.decomposition import PCA

    pca_loadings = pd.DataFrame(PCA().fit(X).components_.T, index=df.columns,
                                columns=['V1', 'V2', 'V3', 'V4'])
    pca_loadings

We see that there are four distinct principal components. This is to be expected because there are in general min(n - 1, p) informative principal components in a data set with n observations and p variables.

Using the fit_transform() function, we can get the principal component scores of the original data. We'll take a look at the first few states:

In [ ]:
    # Fit the PCA model and transform X to get the principal components
    pca = PCA()
    df_plot = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4'], index=X.index)
    df_plot.head()

We can construct a biplot of the first two principal components using our loading vectors:

In [ ]:
    fig, ax1 = plt.subplots(figsize=(9, 7))

    ax1.set_xlim(-3.5, 3.5)
    ax1.set_ylim(-3.5, 3.5)

    # Plot Principal Components 1 and 2
    for i in df_plot.index:
        ax1.annotate(i, (-df_plot.PC1.loc[i], -df_plot.PC2.loc[i]), ha='center')

    # Plot reference lines
    ax1.hlines(0, -3.5, 3.5, linestyles='dotted', colors='grey')
    ax1.vlines(0, -3.5, 3.5, linestyles='dotted', colors='grey')

    ax1.set_xlabel('First Principal Component')
    ax1.set_ylabel('Second Principal Component')

    # Plot Principal Component loading vectors, using a second y-axis.
    ax2 = ax1.twinx().twiny()

    ax2.set_ylim(-1, 1)
    ax2.set_xlim(-1, 1)
    ax2.set_xlabel('Principal Component loading vectors', color='red')

    # Plot labels for vectors. Variable 'a' is a small offset parameter to separate arrow tip and text.
    a = 1.07
    for i in pca_loadings[['V1', 'V2']].index:
        ax2.annotate(i, (-pca_loadings.V1.loc[i] * a, -pca_loadings.V2.loc[i] * a), color='red')

    # Plot vectors (use .iloc for positional access, since the index holds variable names)
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[0], -pca_loadings.V2.iloc[0])
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[1], -pca_loadings.V2.iloc[1])
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[2], -pca_loadings.V2.iloc[2])
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[3], -pca_loadings.V2.iloc[3])
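
The biplot code negates both the scores and the loadings. That is harmless because principal components are only defined up to a sign flip. The sketch below illustrates this on randomly generated data (an assumption for illustration; it does not use USArrests, and the names `X_demo`, `pca_demo`, `scores_by_hand`, `recon` and `recon_flipped` are made up here). It checks that the scores are the centered data projected onto the loading vectors, and that flipping the sign of both scores and loadings reconstructs the same data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
X_demo = scale(rng.normal(size=(50, 4)))  # made-up data, standardized as in the lab

pca_demo = PCA().fit(X_demo)
loadings = pca_demo.components_.T    # columns are the loading vectors
scores = pca_demo.transform(X_demo)  # principal component scores

# Scores are the centered data projected onto the loading vectors
scores_by_hand = (X_demo - X_demo.mean(axis=0)) @ loadings

# Flipping the sign of both scores and loadings gives the same reconstruction
recon = scores @ loadings.T
recon_flipped = (-scores) @ (-loadings).T

print(np.allclose(scores, scores_by_hand))
print(np.allclose(recon, recon_flipped))
```

So a plot drawn with negated scores and loadings shows the same structure, just mirrored through the origin.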

The PCA() function also outputs the variance explained by each principal component. We can access these values as follows:

In [ ]:
    pca.explained_variance_

We can also get the proportion of variance explained:

In [ ]:
    pca.explained_variance_ratio_

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE explained by each component as follows:

In [ ]:
    plt.figure(figsize=(7, 5))
    plt.plot([1, 2, 3, 4], pca.explained_variance_ratio_, '-o')
    plt.ylabel('Proportion of Variance Explained')
    plt.xlabel('Principal Component')
    plt.xlim(0.75, 4.25)
    plt.ylim(0, 1.05)
    plt.xticks([1, 2, 3, 4])

We can also use the function cumsum(), which computes the cumulative sum of the elements of a numeric vector, to plot the cumulative PVE:

In [ ]:
    plt.figure(figsize=(7, 5))
    plt.plot([1, 2, 3, 4], np.cumsum(pca.explained_variance_ratio_), '-s')
    plt.ylabel('Proportion of Variance Explained')
    plt.xlabel('Principal Component')
    plt.xlim(0.75, 4.25)
    plt.ylim(0, 1.05)
    plt.xticks([1, 2, 3, 4])

2 10.6: NCI60 Data Example

Let's return to the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines:

In [ ]:
    df2 = pd.read_csv('NCI60.csv').drop('Unnamed: 0', axis=1)
    df2.columns = np.arange(df2.columns.size)
    df2.info()

In [ ]:
    # Read in the labels to check our work later
    y = pd.read_csv('NCI60_y.csv', usecols=[1], skiprows=1, names=['type'])

3 10.6.1 PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to have standard
deviation one, although one could reasonably argue that it is better not to scale the genes:

In [ ]:
    # Scale the data
    X = pd.DataFrame(scale(df2))
    X.shape

    # Fit the PCA model and transform X to get the principal components
    pca2 = PCA()
    df2_plot = pd.DataFrame(pca2.fit_transform(X))

We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other:

In [ ]:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    color_idx = pd.factorize(y.type)[0]
    cmap = mpl.cm.hsv

    # Left plot
    ax1.scatter(df2_plot.iloc[:, 0], df2_plot.iloc[:, 1], c=color_idx, cmap=cmap, alpha=0.5, s=50)
    ax1.set_ylabel('Principal Component 2')

    # Right plot
    ax2.scatter(df2_plot.iloc[:, 0], df2_plot.iloc[:, 2], c=color_idx, cmap=cmap, alpha=0.5, s=50)
    ax2.set_ylabel('Principal Component 3')

    # Custom legend for the classes (y) since we do not create scatter plots per class
    # (which could have their own labels).
    handles = []
    labels = pd.factorize(y.type.unique())
    norm = mpl.colors.Normalize(vmin=0.0, vmax=14.0)

    for i, v in zip(labels[0], labels[1]):
        handles.append(mpl.patches.Patch(color=cmap(norm(i)), label=v, alpha=0.5))

    ax2.legend(handles=handles, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    # xlabel for both plots
    for ax in fig.axes:
        ax.set_xlabel('Principal Component 1')

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.

We can generate a summary of the proportion of variance explained (PVE) of the first few principal components:

In [ ]:
    # .to_numpy() replaces the as_matrix() method, which recent pandas versions no longer provide
    pd.DataFrame([df2_plot.iloc[:, :5].std(axis=0, ddof=0).to_numpy(),
                  pca2.explained_variance_ratio_[:5],
                  np.cumsum(pca2.explained_variance_ratio_[:5])],
                 index=['Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion'],
                 columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])

Using the plot() function, we can also plot the variance explained by the first few principal components:

In [ ]:
    df2_plot.iloc[:, :10].var(axis=0, ddof=0).plot(kind='bar', rot=0)
    plt.ylabel('Variances')

However, it is generally more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little tweaking:

In [ ]:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    # Left plot
    ax1.plot(pca2.explained_variance_ratio_, '-o')
    ax1.set_ylabel('Proportion of Variance Explained')
    ax1.set_ylim(bottom=-0.01)

    # Right plot
    ax2.plot(np.cumsum(pca2.explained_variance_ratio_), '-ro')
    ax2.set_ylabel('Cumulative Proportion of Variance Explained')
    ax2.set_ylim(top=1.05)

    for ax in fig.axes:
        ax.set_xlabel('Principal Component')
        ax.set_xlim(-1, 65)

We see that together, the first seven principal components explain around 40% of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by further principal components. That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (phew! even examining seven principal components may be difficult).
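
To make the PVE figures more concrete, the following sketch computes the PVE directly from the singular values of a centered data matrix and then counts how many components are needed to reach a chosen cumulative share. The data are randomly generated, and the 90% cutoff and all variable names are assumptions made for this example; none of them come from the lab or from the NCI60 data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 64 observations, 10 variables with decreasing spread
X_demo = rng.normal(size=(64, 10)) * np.linspace(3.0, 0.5, 10)
X_centered = X_demo - X_demo.mean(axis=0)

# Squared singular values are proportional to the variance along each component
singular_values = np.linalg.svd(X_centered, compute_uv=False)
pve = singular_values**2 / np.sum(singular_values**2)
cumulative_pve = np.cumsum(pve)

# Smallest number of components whose cumulative PVE reaches the cutoff
threshold = 0.90  # hypothetical cutoff chosen for this example
n_components = int(np.searchsorted(cumulative_pve, threshold) + 1)

print(pve.round(3))
print(n_components)
```

A fixed cutoff like this is only one possible rule of thumb; looking for an elbow in the scree plot, as done above, is a more informal alternative.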