Received 24 June 1998; received in revised form 29 April 1999; accepted 29 April 1999
Pattern Recognition 33 (2000) 1455

Genetic algorithm-based clustering technique

Ujjwal Maulik, Sanghamitra Bandyopadhyay

Department of Computer Science, Government Engineering College, Kalyani, Nadia, India
Machine Intelligence Unit, Indian Statistical Institute, 203, B. T. Road, Calcutta-700 035, India

Abstract

A genetic algorithm-based clustering technique, called GA-clustering, is proposed in this article. The searching capability of genetic algorithms is exploited in order to search for appropriate cluster centres in the feature space such that a similarity metric of the resulting clusters is optimized. The chromosomes, which are represented as strings of real numbers, encode the centres of a fixed number of clusters. The superiority of the GA-clustering algorithm over the commonly used K-means algorithm is extensively demonstrated for four artificial […]

Corresponding author. Present address: School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia; Tel.: 00-61-2-9385-3975; fax: 00-61-2-
E-mail addresses: hotmail.com (U. Maulik), san-cse.unsw.edu.au (S. Bandyopadhyay).
On leave from Indian Statistical Institute.

[…]over and […] are applied on these strings to yield […]
[…] clustering problem. We have implemented elitism at each generation by preserving the best string seen up to that generation in a location outside the population. Thus on termination, this location contains the centres of the […]

The next section provides the results of implementation of the GA-clustering algorithm, along with its comparison with the performance of the K-means algorithm for several artificial and real-life data sets.

4. Implementation results

The experimental results comparing the GA-clustering algorithm with the K-means algorithm are provided for four artificial data sets (Data 1, Data 2, Data 3 and Data 4) and three real-life data sets ([…] Crude Oil), respectively. These are first described below:

4.1. Artificial data sets

Data 1: This is a nonoverlapping two-dimensional data set where the number of clusters is two. It has 10 points. The value of K is chosen to be 2 for this data set.

Data 2: This is a nonoverlapping two-dimensional data set where the number of clusters is three. It has 76 points. The clusters are shown in Fig. 2. The value of K is chosen to be 3 for this data set.

Data 3: This is an overlapping two-dimensional triangular distribution of data points having nine classes where all the classes are assumed to have equal a priori probabilities (= 1/9). It has 900 data points. The ranges for the nine classes are as follows:
Class 1: [-3.3, -0.7] × [0.7, 3.3],
Class 2: [-1.3, 1.3] × [0.7, 3.3],
Class 3: [0.7, 3.3] × [0.7, 3.3],
Class 4: [-3.3, -0.7] × [-1.3, 1.3],

Fig. 2. Data 2.

Fig. 3. Data 3 (points from class 1, points from class […], points from class 9).
Fig. 4. Triangular distribution along the […]

Class 5: [-1.3, 1.3] × [-1.3, 1.3],
Class 6: [0.7, 3.3] × [-1.3, 1.3],
Class 7: [-3.3, -0.7] × [-3.3, -0.7],
Class 8: [-1.3, 1.3] × [-3.3, -0.7],
Class 9: [0.7, 3.3] × [-3.3, -0.7].

Thus the domain for the triangular distribution for each class and for each axis is 2.6. Consequently, the height will be 1/1.3 (since 1/2 × 2.6 × height = 1). The data set is shown in Fig. 3. The value of K is chosen to be 9 for this data set.

Data 4: This is an overlapping ten-dimensional data set generated using a triangular distribution of the form shown in Fig. 4 for two classes, 1 and 2. It has 1000 data points. The value of K is chosen to be 2 for this data set. The range for class 1 is [0, 2] × [0, 2] × [0, 2] ... 10 times, and that for class 2 is [1, 3] × [0, 2] × [0, 2] ... 9 times, with the corresponding peaks at (1, 1) and (2, 1). The […]

Table 1
M obtained by K-means algorithm for five different initial configurations for Data 1

Initial configuration    K-means
1                        5.383132
2                        2.225498
3                        2.225498
4                        5.383132
5                        2.225498

Table 2
M obtained by GA-clustering algorithm for five different initial populations for Data 1 after 100 iterations when K = 2

Initial population       GA-clustering
1                        2.225498
2                        2.225498
3                        2.225498
4                        2.225498
5                        2.225498

Table 3
M obtained by K-means algorithm for five different initial configurations for Data 2

Initial configuration    K-means
1                        51.013294
2                        64.646739
3                        67.166768
4                        51.013294
5                        64.725676
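The construction of Data 3 described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' generator: the class ranges are read from the text with the garbled minus signs restored, the peak of each triangle is assumed to sit at the midpoint of its interval, and exactly 100 points are drawn per class (the text only says that the nine classes have equal a priori probabilities and that there are 900 points). The names `CLASS_RANGES`, `peak_height` and `sample_data3` are hypothetical.

```python
import random

# Axis intervals used by the nine classes (each has width 2.6).
LOW, MID, HIGH = (-3.3, -0.7), (-1.3, 1.3), (0.7, 3.3)

# (x-range, y-range) for classes 1..9, laid out row by row as in the text:
# classes 1-3 share the top y-range, 4-6 the middle one, 7-9 the bottom one.
CLASS_RANGES = [
    (x_range, y_range)
    for y_range in (HIGH, MID, LOW)
    for x_range in (LOW, MID, HIGH)
]


def peak_height(lo, hi):
    """Height of a triangular density on [lo, hi]: 1/2 * base * height = 1."""
    return 2.0 / (hi - lo)


def sample_data3(points_per_class=100, seed=0):
    """Return (x, y, class_label) triples drawn from symmetric triangles.

    Assumption: the mode of each triangle is the midpoint of its interval.
    """
    rng = random.Random(seed)
    samples = []
    for label, ((x_lo, x_hi), (y_lo, y_hi)) in enumerate(CLASS_RANGES, start=1):
        for _ in range(points_per_class):
            # random.triangular takes (low, high, mode), in that order.
            x = rng.triangular(x_lo, x_hi, (x_lo + x_hi) / 2)
            y = rng.triangular(y_lo, y_hi, (y_lo + y_hi) / 2)
            samples.append((x, y, label))
    return samples


samples = sample_data3()
```

Because neighbouring intervals such as [-1.3, 1.3] and [0.7, 3.3] share the strip [0.7, 1.3], adjacent classes overlap, which is what makes this data set harder than Data 1 and Data 2.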
The results of implementation of the K-means algorithm and GA-clustering algorithm are shown, respectively, in Tables 1 and 2 for Data 1, Tables 3 and 4 for Data 2, Tables 5 and 6 for Data 3, Tables 7 and 8 for Data 4, Tables 9 and 10 for […], Tables 11 and 12 for […], and Tables 13 and 14 for Crude Oil. Both the algorithms were run for 100 simulations. For the purpose of demonstration, five different initial configurations of the K-means algorithm and five different initial populations of the GA-clustering algorithm are shown in the tables.

For Data 1 (Tables 1 and 2) it is found that the GA-clustering algorithm provides the optimal value of 2.225498 in all the runs. The K-means algorithm also attains this value most of the time (87% of the total runs). However, in the other cases, it gets stuck at a value of 5.383132. For Data 2 (Tables 3 and 4), GA-clustering attains the best value of 51.013294 in all the runs. K-means, on the other hand, attains this value in 51% of the total runs, while in other runs it gets stuck at different sub-optimal values. Similarly, for Data 3 (Tables 5 and 6) and Data 4 (Tables 7 and 8) the GA-clustering algorithm attains the best values of 966.312576 and 1246.218355 in 20% and 85% of the total runs, respectively. The best […]

Table 4
M obtained by GA-clustering algorithm for five different initial populations for Data 2 after 100 iterations when K = 3

Initial population       GA-clustering
1                        51.013294
2                        51.013294
3                        51.013294
4                        51.013294
5                        51.013294

Table 5
M obtained by K-means algorithm for five different initial configurations for Data 3

Initial configuration    K-means
1                        976.235607
2                        976.378990
3                        976.378990
4                        976.564189
5                        976.378990

Table 6
M obtained by GA-clustering algorithm for five different initial populations for Data 3 after 100 iterations when K = 9

Initial population       GA-clustering
1                        966.350481
2                        966.381601
3                        966.350485
4                        966.312576
5                        966.354085

Table 7
M obtained by K-means algorithm for five different initial configurations for Data 4

Initial configuration    K-means
1                        1246.239153
2                        1246.239153
3                        1246.236680
4                        1246.239153
5                        1246.237127

Table 15
M obtained by GA-clustering algorithm for five different initial populations for […] after 500 iterations when […]

Initial population       GA-clustering
1                        149344.229245
2                        149370.762900
3                        149342.990377
4                        149352.289363
5                        149362.661869
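The numbers in these tables are values of the clustering metric, which the paper's discussion section describes as the sum of the Euclidean distances of each point from its respective cluster centre. The Python sketch below computes such a metric and shows, on a four-point toy data set that is not taken from the paper, how K-means can sit at a poorer value depending on its initial centres. Assigning each point to its nearest centre, and the names `clustering_metric` and `kmeans_update`, are assumptions made for this illustration.

```python
import math


def clustering_metric(points, centres, squared=False):
    """Sum of distances from each point to its nearest centre.

    With squared=True the squared Euclidean distance is summed instead,
    the alternative criterion mentioned in the paper's discussion.
    """
    total = 0.0
    for p in points:
        d = min(math.dist(p, c) for c in centres)
        total += d * d if squared else d
    return total


def kmeans_update(points, centres):
    """One K-means iteration: assign to nearest centre, recompute the means."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
        clusters[nearest].append(p)
    return [
        tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centre
        for cluster, centre in zip(clusters, centres)
    ]


# Toy data: two tight vertical pairs, far apart horizontally.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = [(0.0, 0.5), (4.0, 0.5)]   # one centre per pair
poor = [(2.0, 0.0), (2.0, 1.0)]   # centres splitting the data by row

metric_good = clustering_metric(points, good)
metric_poor = clustering_metric(points, poor)
```

Both configurations are left unchanged by `kmeans_update`, so K-means started at `poor` never leaves it even though its metric is four times larger. This is the kind of behaviour reported above for Data 1, where some runs of K-means stop at 5.383132 instead of 2.225498.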
[…] (reached 60% of the times) and 279.484810 (reached 30% of the times), respectively. From Tables 9 and 10 for […], it is found that unlike the other cases, the GA-clustering algorithm attains one value (149406.851288) that is poorer than the best value of the K-means algorithm (149373.097180). In order to investigate whether the GA-clustering algorithm can improve its clustering performance, it was executed up to 500 iterations (rather than 100 iterations as was done previously). The results are shown in Table 15. As expected, it is found that the performance of GA-clustering improves. The best value that it now attains is 149342.990377 and the worst is 149370.762900, both of which are better than those obtained after 100 iterations. Moreover, now its performance in all the runs is better than the performance of the K-means algorithm for any of the 100 runs.

5. Discussion and conclusions

A genetic algorithm-based clustering algorithm, called GA-clustering, has been developed in this article. Genetic algorithm has been used to search for the cluster centres which minimize the clustering metric. In order to demonstrate the effectiveness of the GA-clustering algorithm in providing optimal clusters, several artificial and real-life data sets with the number of dimensions ranging from two to ten and the number of clusters ranging from two to nine have been considered. The results show that the GA-clustering algorithm provides a performance that is significantly superior to that of the K-means algorithm, a very widely used clustering technique.

Floating-point representation of chromosomes has been adopted in this article, since it is conceptually closest to the problem space and provides a straightforward way of mapping from the encoded cluster centres to the actual ones. In this context, a binary representation may be implemented for the same problem, and the results may be compared with the present floating-point form. Such an investigation is currently being performed.

Note that the clustering metric that the GA attempts to minimize is given by the sum of the absolute Euclidean distances of each point from their respective cluster centres. We have also implemented the same algorithm by using the sum of the squared Euclidean distances as the minimizing criterion. The same conclusions as obtained in this article are still found to hold good.

It has been proved in Ref. [23] that an elitist model of GAs will definitely provide the optimal string as the number of iterations goes to infinity, provided the probability of going from any population to the one containing the optimal string is greater than zero. Note that this has been proved for nonzero mutation probability values and is independent of the probability of crossover. However, since the rate of convergence to the optimal string will definitely depend on these parameters, a proper choice of these values is imperative for the good performance of the algorithm. Note that the mutation operator as used in this article also allows nonzero probability of going from any string to any other string. Therefore, our GA-clustering algorithm will also provide the optimal clusters as the number of iterations goes to infinity. Such a formal theoretical proof is currently being developed that will effectively serve as a theoretical proof of the optimality of the clusters provided by the GA-clustering algorithm. However, it is imperative to once again realize that for practical purposes a proper choice of the genetic parameters, which may possibly be kept adaptive, is crucial for a good performance of the algorithm. In this context, one may note that although the K-means algorithm got stuck at sub-optimal solutions, even for the simple data sets, the GA-clustering algorithm did not exhibit any such unwanted behaviour.

6. Summary

Clustering is an important unsupervised classification technique where a set of patterns, usually vectors in a multi-dimensional space, are grouped into clusters in such a way that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in the same sense. For this it is necessary to define a measure of similarity which will establish a rule for assigning patterns to the domain of a particular cluster centre. One such measure of similarity may be the Euclidean distance between two patterns. The smaller the distance between two patterns, the greater the similarity between the two, and vice versa.

An intuitively simple and effective clustering technique is the well-known K-means algorithm. However, it is known that the K-means algorithm may get stuck at suboptimal solutions, depending on the choice of the initial cluster centres. In this article, we propose a solution to the clustering problem where genetic algorithms (GAs) are used for searching for the appropriate cluster […]

About the Author: UJJWAL MAULIK did his Bachelors in Physics and Computer Science in 1986 and 1989, respectively. Subsequently, he did his Masters and Ph.D in Computer Science in 1991 and 1997, respectively, from Jadavpur University, India. Dr. Maulik has visited Center for Adaptive Systems Applications, Los Alamos, New Mexico, USA in 1997. He is currently the Head of the Department of Computer Science, Kalyani Engineering College, India. His research interests include Parallel Processing and Interconnection Networks, Natural Language Processing, Evolutionary Computation and Pattern Recognition.

About the Author: SANGHAMITRA BANDYOPADHYAY did her Bachelors in Physics and Computer Science in 1988 and 1991, respectively. Subsequently, she did her Masters in Computer Science from Indian Institute of Technology, Kharagpur in 1993 and Ph.D in Computer Science from Indian Statistical Institute, Calcutta in 1998. Dr. Bandyopadhyay is the recipient of Dr. Shanker Dayal Sharma Gold Medal and Institute Silver Medal for being adjudged the best all round postgraduate performer in 1993. She has visited Los Alamos National Laboratory in 1997. She is currently on a postdoctoral assignment in University of New South Wales, Sydney, Australia. Her research interests include Evolutionary Computation, Pattern Recognition, Parallel Processing and Interconnection […]
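The elitist strategy described in the paper, in which the best string seen so far is preserved in a location outside the population, can be illustrated with a minimal Python sketch. Only two ingredients are taken from the text: chromosomes are strings of real numbers encoding K cluster centres, and the elite is stored outside the population. Everything else is an assumption chosen to keep the example short and is not claimed to match the authors' operators or parameter values: binary tournament selection, single-point crossover, Gaussian mutation with standard deviation 0.5, and the names `metric` and `ga_clustering_sketch`.

```python
import math
import random


def metric(points, chromosome, k):
    """Sum of Euclidean distances from each point to its nearest encoded centre."""
    dim = len(chromosome) // k
    centres = [tuple(chromosome[i * dim:(i + 1) * dim]) for i in range(k)]
    return sum(min(math.dist(p, c) for c in centres) for p in points)


def ga_clustering_sketch(points, k, pop_size=20, generations=50, p_mut=0.1, seed=0):
    """Toy elitist GA; returns the elite chromosome and its metric per generation."""
    rng = random.Random(seed)

    def fitness(chromosome):
        return metric(points, chromosome, k)

    # Each chromosome is a flat list of k centres, initialised from k data points.
    population = [
        [coord for p in rng.sample(points, k) for coord in p]
        for _ in range(pop_size)
    ]
    best = min(population, key=fitness)[:]   # elite, kept outside the population
    history = [fitness(best)]

    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Binary tournament selection (an assumption of this sketch).
            a = min(rng.sample(population, 2), key=fitness)
            b = min(rng.sample(population, 2), key=fitness)
            cut = rng.randrange(1, len(a))   # single-point crossover
            child = a[:cut] + b[cut:]
            # Gaussian mutation applied gene by gene with probability p_mut.
            child = [g + rng.gauss(0.0, 0.5) if rng.random() < p_mut else g
                     for g in child]
            offspring.append(child)
        population = offspring

        # Elitism: the stored elite is replaced only by a strictly better string.
        candidate = min(population, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate[:]
        history.append(fitness(best))
    return best, history


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
best, history = ga_clustering_sketch(points, k=2)
```

Because the elite is overwritten only when a better string appears, `history` can never increase from one generation to the next. That monotonicity is what the elitist model guarantees in practice; how quickly the metric falls still depends on the choice of genetic parameters, as the discussion section stresses.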