Lab 18 - PCA in Python

April 25, 2016

This lab on Principal Components Analysis is a python adaptation of p. 401-404, 408-410 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Original adaptation by J. Warmenhoven, updated by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).

In [ ]:
    import pandas as pd
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt

    %matplotlib inline

1 10.4: Principal Components Analysis

In this lab, we perform PCA on the USArrests data set. The rows of the data set contain the 50 states, in alphabetical order:

In [ ]:
    df = pd.read_csv('USArrests.csv', index_col=0)
    df.head()

The columns of the data set contain four variables relating to various crimes:

In [ ]:
    df.info()

Let's start by taking a quick look at the column means of the data:

In [ ]:
    df.mean()

We see right away that the data have vastly different means. We can also examine the variances of the four variables:

In [ ]:
    df.var()

Not surprisingly, the variables also have vastly different variances: the UrbanPop variable measures the percentage of the population in each state living in an urban area, which is not a comparable number to the number of crimes committed in each state per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the Assault variable, since it has by far the largest mean and variance.

Thus, it is important to standardize the variables to have mean zero and standard deviation 1 before performing PCA. We can do this using the scale() function from sklearn:

In [ ]:
    from sklearn.preprocessing import scale
    X = pd.DataFrame(scale(df), index=df.index, columns=df.columns)

Now we'll use the PCA() function from sklearn to compute the loading vectors:

In [ ]:
    from sklearn.decomposition import PCA

    pca_loadings = pd.DataFrame(PCA().fit(X).components_.T, index=df.columns, columns=['V1', 'V2', 'V3', 'V4'])
    pca_loadings

We see that there are four distinct principal components. This is to be expected because there are in general min(n - 1, p) informative principal components in a data set with n observations and p variables.

Using the fit_transform() function, we can get the principal component scores of the original data. We'll take a look at the first few states:

In [ ]:
    # Fit the PCA model and transform X to get the principal components
    pca = PCA()
    df_plot = pd.DataFrame(pca.fit_transform(X), columns=['PC1', 'PC2', 'PC3', 'PC4'], index=X.index)
    df_plot.head()

We can construct a biplot of the first two principal components using our loading vectors:

In [ ]:
    fig, ax1 = plt.subplots(figsize=(9, 7))

    ax1.set_xlim(-3.5, 3.5)
    ax1.set_ylim(-3.5, 3.5)

    # Plot Principal Components 1 and 2
    for i in df_plot.index:
        ax1.annotate(i, (-df_plot.PC1.loc[i], -df_plot.PC2.loc[i]), ha='center')

    # Plot reference lines
    ax1.hlines(0, -3.5, 3.5, linestyles='dotted', colors='grey')
    ax1.vlines(0, -3.5, 3.5, linestyles='dotted', colors='grey')

    ax1.set_xlabel('First Principal Component')
    ax1.set_ylabel('Second Principal Component')

    # Plot Principal Component loading vectors, using a second y-axis.
    ax2 = ax1.twinx().twiny()

    ax2.set_ylim(-1, 1)
    ax2.set_xlim(-1, 1)
    ax2.set_xlabel('Principal Component loading vectors', color='red')

    # Plot labels for vectors. Variable 'a' is a small offset parameter to separate arrow tip and text.
    a = 1.07
    for i in pca_loadings[['V1', 'V2']].index:
        ax2.annotate(i, (-pca_loadings.V1.loc[i]*a, -pca_loadings.V2.loc[i]*a), color='red')

    # Plot vectors (.iloc selects each variable by position)
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[0], -pca_loadings.V2.iloc[0])
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[1], -pca_loadings.V2.iloc[1])
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[2], -pca_loadings.V2.iloc[2])
    ax2.arrow(0, 0, -pca_loadings.V1.iloc[3], -pca_loadings.V2.iloc[3])

The PCA() function also outputs the variance explained by each principal component. We can access these values as follows:

In [ ]:
    pca.explained_variance_

We can also get the proportion of variance explained:

In [ ]:
    pca.explained_variance_ratio_

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE explained by each component as follows:

In [ ]:
    plt.figure(figsize=(7, 5))
    plt.plot([1, 2, 3, 4], pca.explained_variance_ratio_, '-o')
    plt.ylabel('Proportion of Variance Explained')
    plt.xlabel('Principal Component')
    plt.xlim(0.75, 4.25)
    plt.ylim(0, 1.05)
    plt.xticks([1, 2, 3, 4])

We can also use the function cumsum(), which computes the cumulative sum of the elements of a numeric vector, to plot the cumulative PVE:

In [ ]:
    plt.figure(figsize=(7, 5))
    plt.plot([1, 2, 3, 4], np.cumsum(pca.explained_variance_ratio_), '-s')
    plt.ylabel('Proportion of Variance Explained')
    plt.xlabel('Principal Component')
    plt.xlim(0.75, 4.25)
    plt.ylim(0, 1.05)
    plt.xticks([1, 2, 3, 4])

2 10.6: NCI60 Data Example

Let's return to the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines:

In [ ]:
    df2 = pd.read_csv('NCI60.csv').drop('Unnamed: 0', axis=1)
    df2.columns = np.arange(df2.columns.size)
    df2.info()

In [ ]:
    # Read in the labels to check our work later
    y = pd.read_csv('NCI60_y.csv', usecols=[1], skiprows=1, names=['type'])

3 10.6.1 PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes:

In [ ]:
    # Scale the data
    X = pd.DataFrame(scale(df2))
    X.shape

    # Fit the PCA model and transform X to get the principal components
    pca2 = PCA()
    df2_plot = pd.DataFrame(pca2.fit_transform(X))

We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other:

In [ ]:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

    color_idx = pd.factorize(y.type)[0]
    cmap = mpl.cm.hsv

    # Left plot
    ax1.scatter(df2_plot.iloc[:, 0], df2_plot.iloc[:, 1], c=color_idx, cmap=cmap, alpha=0.5, s=50)
    ax1.set_ylabel('Principal Component 2')

    # Right plot
    ax2.scatter(df2_plot.iloc[:, 0], df2_plot.iloc[:, 2], c=color_idx, cmap=cmap, alpha=0.5, s=50)
    ax2.set_ylabel('Principal Component 3')

    # Custom legend for the classes (y) since we do not create scatter plots per class (which could have their own labels).
    handles = []
    labels = pd.factorize(y.type.unique())
    norm = mpl.colors.Normalize(vmin=0.0, vmax=14.0)

    for i, v in zip(labels[0], labels[1]):
        handles.append(mpl.patches.Patch(color=cmap(norm(i)), label=v, alpha=0.5))

    ax2.legend(handles=handles, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    # xlabel for both plots
    for ax in fig.axes:
        ax.set_xlabel('Principal Component 1')

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.

We can generate a summary of the proportion of variance explained (PVE) of the first few principal components:

In [ ]:
    # .to_numpy() replaces the as_matrix() method, which newer pandas versions no longer provide
    pd.DataFrame([df2_plot.iloc[:, :5].std(axis=0, ddof=0).to_numpy(),
                  pca2.explained_variance_ratio_[:5],
                  np.cumsum(pca2.explained_variance_ratio_[:5])],
                 index=['Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion'],
                 columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])

Using the plot() function, we can also plot the variance explained by the first few principal components:

In [ ]:
    df2_plot.iloc[:, :10].var(axis=0, ddof=0).plot(kind='bar', rot=0)
    plt.ylabel('Variances')

However, it is generally more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little tweaking:

In [ ]:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    # Left plot
    ax1.plot(pca2.explained_variance_ratio_, '-o')
    ax1.set_ylabel('Proportion of Variance Explained')
    ax1.set_ylim(bottom=-0.01)

    # Right plot
    ax2.plot(np.cumsum(pca2.explained_variance_ratio_), '-ro')
    ax2.set_ylabel('Cumulative Proportion of Variance Explained')
    ax2.set_ylim(top=1.05)

    for ax in fig.axes:
        ax.set_xlabel('Principal Component')
        ax.set_xlim(-1, 65)

We see that together, the first seven principal components explain around 40% of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by further principal components. That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (phew! even examining seven principal components may be difficult).
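
The cumulative PVE reasoning used in this lab (how many leading components are needed to reach a given share of the variance) can be sketched without any plotting. This is only a minimal illustration using the standard library: the helper names cumulative_pve and components_needed are hypothetical, and the input list is a stand-in for pca.explained_variance_ratio_. Its first two values are the 62.0% and 24.7% quoted in the lab; the last two are illustrative placeholders chosen so that the ratios sum to 1.

```python
from itertools import accumulate


def cumulative_pve(pve):
    """Running total of the proportion of variance explained (PVE)."""
    return list(accumulate(pve))


def components_needed(pve, threshold):
    """Smallest number of leading components whose cumulative PVE reaches threshold.

    Returns None if the threshold is never reached.
    """
    for k, total in enumerate(accumulate(pve), start=1):
        if total >= threshold:
            return k
    return None


# First two values are the ones quoted in the lab for USArrests;
# the last two are illustrative placeholders so the ratios sum to 1.
pve = [0.620, 0.247, 0.089, 0.044]

cum = cumulative_pve(pve)            # same role as np.cumsum(...) in the lab
k_85 = components_needed(pve, 0.85)  # components needed for 85% of the variance
```

With a fitted model, the same helper could presumably be called as components_needed(pca2.explained_variance_ratio_, 0.40) to check the statement about the first seven components of the NCI60 data.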

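The lab's pipeline (standardize the columns, then extract loadings, scores and PVE) can also be checked by hand. The sketch below swaps in NumPy's singular value decomposition for sklearn's PCA(), so it is not the lab's own method; the function names standardize and pca_svd are hypothetical, and the data are random numbers generated only for illustration, with deliberately mismatched column scales to mimic the USArrests situation.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit standard deviation (ddof=0)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca_svd(Z):
    """PCA of a column-centered matrix Z via the SVD Z = U S V^T.

    Returns (loadings, scores, pve): loadings are the columns of V,
    scores are U * S (equal to Z @ V), and pve is each component's
    share of the total variance.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt.T
    scores = U * s
    variances = s ** 2 / (Z.shape[0] - 1)
    pve = variances / variances.sum()
    return loadings, scores, pve


# Synthetic data (an assumption, not the lab's CSV files): 50 rows, 4 columns
# on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 5.0])

Z = standardize(raw)
loadings, scores, pve = pca_svd(Z)
```

Because the sign of each loading vector is arbitrary, results from this sketch may differ from sklearn's by a factor of -1 per component, which is also why the biplot code in the lab is free to negate the scores and loadings.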