
Trolling, Aggression and Cyberbullying (TRAC-2018), pages 1–11.
Kwok, I. and Wang, Y. (2013). Locate the hate: Detecting tweets against blacks. In Twenty-Seventh AAAI Conference on Artificial Intelligence.
Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., Van Der Laak, J. A., Van Ginneken, B., and Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88.
Liu, P., Li, W., and Zou, L. (2019). NULI at SemEval-2019 task 6: Transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 87–91, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.
Mahata, D., Zhang, H., Uppal, K., Kumar, Y., Shah, R. R., Shahid, S., Mehnaz, L., and Anand, S. (2019). MIDAS at SemEval-2019 task 6: Identifying offensive posts and targeted offense from Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 683–690, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Mohaouchane, H., Mourhir, A., and Nikolov, N. S. (2019). Detecting offensive language on Arabic social media using deep learning. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 466–471. IEEE.
Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.
Nikhil, N., Pahwa, R., Nirala, M. K., and Khilnani, R. (2018). LSTMs with attention for aggression detection. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 52–57.
Pelicon, A., Martinc, M., and Kralj Novak, P. (2019). Embeddia at SemEval-2019 task 6: Detecting hate with neural network and transfer learning approaches. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 604–610, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Ramiandrisoa, F. and Mothe, J. (2018). IRIT at TRAC 2018. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 19–27, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
Ritesh Kumar, Atul Kr. Ojha, S. M. and Zampieri, M. (2020). Evaluating aggression identification in social media. In Ritesh Kumar, et al., editors, Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020), Paris, France, May. European Language Resources Association (ELRA).
Ross, B., Rist, M., Carbonell, G., Cabrera, B., Kurowsky, N., and Wojatzki, M. (2017). Measuring the reliability of hate speech annotations: The case of the European refugee crisis. arXiv preprint arXiv:1701.08118.
Roy, A., Kapil, P., Basak, K., and Ekbal, A. (2018). An ensemble approach for aggression identification in English and Hindi text. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 66–73.
Samghabadi, N. S., Mave, D., Kar, S., and Solorio, T. (2018). RiTUAL-UH at TRAC 2018 shared task: Aggression identification. arXiv preprint arXiv:1807.11712.
Schmidt, A. and Wiegand, M. (2017). A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, Valencia, Spain. Association for Computational Linguistics.

Soliman, A. B., Eissa, K., and El-Beltagy, S. R. (2017). AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science, 117:256–265.
Spertus, E. (1997). Smokey: Automatic recognition of hostile messages. In AAAI/IAAI, pages 1058–1065.
Su, H.-P., Huang, Z.-J., Chang, H.-T., and Lin, C.-J. (2017). Rephrasing profanity in Chinese text. In Proceedings of the First Workshop on Abusive Language Online, pages 18–24.
Swamy, S. D., Jamatia, A., Gambäck, B., and Das, A. (2019). NIT Agartala NLP Team at SemEval-2019 task 6: An ensemble approach to identifying and categorizing offensive language in Twitter social media corpora. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 696–703, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. (2019a). Predicting the type and target of offensive posts in social media. In Proceedings of NAACL.
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. (2019b). SemEval-2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval).

Figure 2: Sub-task EN-A, the confusion matrix for the XGB-USE approach
Figure 3: Sub-task EN-B, the confusion matrix for the XGB-USE approach

…best-reported results: sub-task EN-A achieved 0.6075 F1 (weighted), and sub-task EN-B achieved 0.8567 F1 (weighted). In future work, we will use several features and analyze them to find the best features for aggression detection. Moreover, we will study the impact of different types of data augmentation on the performance of various ML models.

Acknowledgements
This research is partially funded by Jordan University of Science and Technology.

6. Bibliographical References
Al-Hassan, A. and Al-Dossari, H. (2019). Detection of hate speech in social networks: A survey on multilingual corpus. In 6th International Conference on Computer Science and Information Technology.
Aroyehun, S. T. and Gelbukh, A. (2018). Aggression detection in social media: Using deep neural networks, data augmentation, and pseudo labeling. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 90–97.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
Bhattacharya, S., Singh, S., Kumar, R., Bansal, A., Bhagat, A., Dawer, Y., Lahiri, B., and Ojha, A. K. (2020). Developing a multilingual annotated corpus of misogyny and aggression.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., and Bhamidipati, N. (2015). Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, pages 29–30.
Fiser, D., Erjavec, T., and Ljubesić, N. (2017). Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene. In Proceedings of the First Workshop on Abusive Language Online, pages 46–51.
Fortuna, P. and Nunes, S. (2018). A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR), 51(4):85.

Han, J., Wu, S., and Liu, X. (2019). jhan014 at SemEval-2019 task 6: Identifying and categorizing offensive language in social media. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 652–656, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.
Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. (2018). Benchmarking aggression identification in social media. In Proceedings of the First Workshop on

4. Results

4.1. Evaluation Measures
To evaluate the implemented approach, we use the weighted F1 score, as specified by the shared task. Accuracy is also reported for comparison.

4.2. Discussion
Focusing on sub-tasks EN-A and EN-B for the English language, Table 7 presents the results of our proposed approaches on the sub-task EN-A aggression identification shared task. The best results were achieved with the XGB-USE approach using the hyper-parameters discussed above: 0.6075 F1 (weighted) and 0.6833 accuracy. The second approach, XGB-USE-PLL (the same main approach with pseudo-label testing), achieved 0.5965 F1 (weighted) and 0.6758 accuracy. The last approach, fine-tuned BERT embeddings with a GRU, achieved 0.5461 F1 (weighted) and 0.6392 accuracy. XGB-USE clearly gave the best results in both F1 (weighted) and accuracy.

System        F1 (weighted)  Accuracy
XGB-USE       0.6075         0.6833
XGB-USE-PLL   0.5965         0.6758
BERT-GRU      0.5461         0.6392
Table 7: Results for sub-task EN-A (PLL: pseudo-label testing; USE: Universal Sentence Encoder; XGB: XGBoost classifier)

For the sub-task EN-B misogynistic aggression identification shared task, Table 8 presents the results of our proposed approaches. The best results were again achieved with the XGB-USE approach using the hyper-parameters discussed above: 0.8567 F1 (weighted) and 0.8758 accuracy. The second approach, XGB-USE-PLL (the same main approach with pseudo-label testing as a feature), achieved 0.8547 F1 (weighted) and 0.8825 accuracy. The last approach, fine-tuned BERT embeddings with a GRU, achieved 0.8180 F1 (weighted) and 0.8433 accuracy. XGB-USE again gave the best F1 (weighted).

System        F1 (weighted)  Accuracy
XGB-USE       0.8567         0.8758
XGB-USE-PLL   0.8547         0.8825
BERT-GRU      0.8180         0.8433
Table 8: Results for sub-task EN-B (PLL: pseudo-label testing; USE: Universal Sentence Encoder; XGB: XGBoost classifier)

4.3. Results and Findings
Table 9 shows the results of the top teams for sub-task EN-A. The best result, 0.8029 F1 (weighted), was achieved by team Julian, compared to 0.6075 F1 (weighted) for our team (SAJA).

Team                F1 (weighted)
Julian              0.8029
sdhanshu            0.7592
Ms8qQxMbnjJMgYcw    0.7568
zhixuan             0.7393
SAJA                0.6075
Table 9: Results for sub-task EN-A compared to other teams

For sub-task EN-B, Table 10 shows the results of the top teams. The best result, 0.8715 F1 (weighted), was achieved by team Ms8qQxMbnjJMgYcw, compared to 0.8566 F1 (weighted) for our team (SAJA). All the results are close to each other; we ranked sixth in this sub-task.

Team                F1 (weighted)
Ms8qQxMbnjJMgYcw    0.8715
abaruah             0.8701
na14                0.8579
sdhanshu            0.8578
SAJA                0.8566
Table 10: Results for sub-task EN-B compared to other teams

Figure 2 shows the confusion matrix of our best model for sub-task EN-A over all three classes; the XGB-USE model clearly performs well at classifying non-aggressive (NAG) inputs compared to the other classes. Figure 3 shows the confusion matrix for sub-task EN-B; the XGB-USE model is better at detecting the non-gendered 'NGEN' class than the gendered 'GEN' class.

5. Conclusion
In this paper, we presented our participation in the TRAC 2020 shared task on aggression identification in the English language, for both sub-tasks EN-A and EN-B. A combination of transformers was developed to solve the provided problem; XGB-USE was used as the main approach of this paper, extracting USE embeddings to perform transfer learning with an XGBoost classifier. We ranked fourteenth out of sixteen teams for sub-task EN-A.
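As a concrete illustration of the evaluation measure used throughout Tables 7–10, weighted F1 is the per-class F1 averaged with each class's support as its weight. The sketch below is a minimal pure-Python illustration with toy labels (not the shared-task data), assuming the standard definition of support-weighted F1:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        pred_pos = sum(p == cls for p in y_pred)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        total += n * f1  # weight the class F1 by its support
    return total / len(y_true)

# Toy example using the sub-task EN-A label scheme.
y_true = ["NAG", "NAG", "OAG", "CAG"]
y_pred = ["NAG", "OAG", "OAG", "CAG"]
print(round(weighted_f1(y_true, y_pred), 4))  # → 0.75
```

This matches `sklearn.metrics.f1_score(..., average="weighted")`; because the weights are the class supports, the frequent NAG class dominates the score, which is why accuracy is reported alongside it.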

For sub-task EN-B, we ranked sixth out of fifteen teams, an encouraging result, especially since the difference between our results and the top-ranked teams is very small. This paper shows that the developed model produced strong results compared to deep learning approaches and transfer learning with BERT transformers. We used the reference dataset provided for the TRAC 2020 shared task on aggression identification in multiple languages.

…the meaning of the sentence. Moreover, Google has released a pre-trained USE embedding as a TensorFlow Hub Module [1] to extract the embeddings directly and find the semantic similarities of the provided sentences. The proposed model is based on a transfer-learning architecture of the kind commonly used in image classification and computer vision (Litjens et al., 2017). Moreover, as mentioned earlier, the applied transformers show significant results compared to deep learning approaches. For instance, the USE developers created several versions of the pre-trained models, such as multilingual USE, to represent the semantic relationships among texts; these can also be applied as independent classifiers in different NLP domains (e.g., aggressive identification). The embedding dimension extracted by USE is 512. In this research, we used the USE [2] pre-trained model to extract sentence embeddings, based on a transfer-learning architecture, to tackle the shared-task problem.

The XGBoost distributed gradient boosting library (XGB) classifier (Chen and Guestrin, 2016) was built to be highly efficient. XGB was used as a text transfer-learning model powered by the USE embeddings; XGB is considered a powerful classifier compared to other machine learning classifiers as well as to deep learning, and it has become a popular method for solving NLP tasks. XGB was chosen for the following reasons: a) it is a regularized boosting technique designed to prevent overfitting, b) it has a structure for handling missing values, and c) it is fast compared to other gradient boosting implementations. As mentioned above, for this research the XGB classifier approach was developed based on transfer learning with the Universal Sentence Encoder (XGB-USE). This approach was applied to solve aggressive identification for the English language. USE embeddings of 512 dimensions were extracted from the pre-trained model for each input sentence before the training step with XGB. Table 6 provides the value of each parameter used during the training step, representing the best parameters found. The XGB-USE architecture is depicted in Figure 1.

For the BERT-GRU training procedure, we fine-tuned BERT by excluding the last 3 layers and adding a Gaussian noise layer, followed by a GRU (Chung et al., 2014) layer of 300 neurons and a global average pooling layer that extracts the discriminative features from the previous layer and passes them on to the next. L2 regularization and dropout were used to prevent overfitting. The last layer predicts the outputs with a dense layer of 1 neuron, a sigmoid activation function, and truncated-normal kernel initializers. We trained on the TRAC 2020 dataset without any external resources; in the future, we will try an external dataset in the experimentation step. The best parameters are as follows: batch size = 16, optimizer = Adam, learning rate = 2e-5, and BERT max length = 40.

[1] https://www.tensorflow.org/hub/api_docs/python/hub/Module

Parameter                   Value
Embedding dimension (USE)   512
# of estimators             3000
Sub-sample                  0.3
Max depth                   5
Gamma                       0.2
Objective                   multi:softmax (sub-task EN-A) / binary:logistic (sub-task EN-B)
Booster                     gbtree
Num class                   3 (sub-task EN-A)
Table 6: The XGB classifier parameters selected by grid search in the training phase

Figure 1: The architecture of our system (XGB-USE)

Dataset file     Total  Count of labels
Train set        4263   OAG=435, CAG=453, NAG=3375
Validation set   1066   OAG=113, CAG=117, NAG=836
Test set         1200   -
Table 2: Distribution of the provided English dataset for sub-task EN-A

Dataset file     Total  Count of labels
Train set        4263   NGEN=3954, GEN=309
Validation set   1066   NGEN=993, GEN=73
Test set         1200   -
Table 3: Distribution of the provided English dataset for sub-task EN-B

…URLs, and user mentions.

A pre-processing step is required to improve the analysis process applied to the raw tweets. We performed various pre-processing steps to obtain a clean version of the provided dataset: each tweet was normalized and then tokenized. Normalization is a necessary process since some words are written in shortcut form (e.g., 'dont' is returned to 'do not'). Finally, numbers and non-English characters were also removed. Examples of the pre-processing step on the provided dataset are shown in Table 5.

3.4. Embeddings
Recently, word embeddings have been widely used in NLP applications and research. Word embeddings aim to obtain a vector representation of the input text, turning it into numeric input for deep neural networks. Word embeddings tend to capture the semantic features of each word and the linguistic relationships among them; these embeddings help improve system performance in several NLP domains (e.g., text mining). Since 2003, when Bengio et al. (2003) started generating word embeddings using a neural probabilistic language model, there followed Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), AraVec (Soliman et al., 2017), and the recent models: ELMo embeddings (Peters et al., 2018), BERT contextual embeddings (Devlin et al., 2018), and the Universal Sentence Encoder (USE) (Cer et al., 2018). The distributional linguistic hypothesis is the main intuition behind the word embedding idea; each model has its own way of capturing semantic meaning and its own training regime. Consequently, each model can capture different semantic attributes compared to other models.

ID       Original text                                        Label sub-task EN-A  Label sub-task EN-B
C68.872  Nice video..                                         NAG                  NGEN
C10.689  She is a traitor of India                            OAG                  NGEN
C32.128  "Wrong message for youth. Fight, dont be a coward"   CAG                  NGEN
C65.70   Hot                                                  NAG                  GEN
Table 4: Examples from the dataset for both sub-tasks

ID       Original text                                        Processed text                                     Label EN-A  Label EN-B
C68.872  Nice video..                                         nice video                                         NAG         NGEN
C10.689  She is a traitor of India                            she is a traitor of india                          OAG         NGEN
C32.128  "Wrong message for youth. Fight, dont be a coward"   wrong message for youth fight do not be a coward   CAG         NGEN
C65.70   Hot                                                  hot                                                NAG         GEN
Table 5: Data pre-processing performed on the available dataset for both sub-tasks

In this research, we depend on pre-trained USE sentence embeddings to train the developed model. USE is a language representation model and sentence embedding provided by Google which extracts sentence embeddings from the provided dataset, and it has become one of the state-of-the-art models for much NLP research.

3.5. Proposed Model (XGB-USE)
Transformers and contextual embeddings have added much progress to the NLP research area; in addition, they outperform deep learning approaches, according to the promising results achieved. The transformer is an encoder-decoder architecture applied to attention-mechanism tasks. More particularly, Google released the Universal Sentence Encoder (USE) (Cer et al., 2018), which maps an input sentence to a vector representation; this kind of representation aims to capture…

…used is Arabic YouTube comments. The best-performing model was a CNN with an 84.05% F1-score. Liu et al. (2019) proposed a fine-tuning technique for the Bidirectional Encoder Representations from Transformers (BERT) to solve the shared task of identifying and categorizing offensive language in social media at SemEval 2019; several features were used, such as word unigrams and word2vec. Pelicon et al. (2019) added an LSTM neural network architecture to perform fine-tuning of BERT; several automatically and manually crafted features were used, namely word embeddings, TF-IDF, POS sequences, BOW, the length of the longest punctuation sequence, and the sentiment of the tweets. Mahata et al. (2019) proposed an ensemble technique consisting of a CNN, a bidirectional LSTM with attention, and a bidirectional LSTM + bidirectional GRU to tackle the shared SemEval 2019 Task 6, Identifying and Categorizing Offensive Language; the training data was used to obtain a set of heuristics as features. Han et al. (2019) presented two approaches, namely a bidirectional GRU and a probabilistic modified sentence offensiveness calculation (MSOC) model, for the same task of identifying and categorizing offensive language, with word2vec embeddings used as a feature.
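The pre-processing described in Section 3.3 (normalization, expanding shortcut forms, removing punctuation, numbers, and non-English characters, as in Table 5) can be sketched as below. The `SHORTCUTS` map is an illustrative assumption, not the authors' actual list:

```python
import re

# Hypothetical shortcut-expansion table; the paper only cites 'dont' -> 'do not'.
SHORTCUTS = {"dont": "do not", "cant": "can not", "wont": "will not"}

def preprocess(tweet: str) -> str:
    """Normalize a tweet as in Section 3.3 / Table 5 (sketch)."""
    text = tweet.lower()
    # Drop punctuation, digits, and non-English characters.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenize and expand shortcut forms.
    tokens = [SHORTCUTS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(preprocess('"Wrong message for youth. Fight, dont be a coward"'))
# → wrong message for youth fight do not be a coward
```

On the Table 5 examples, this reproduces the processed text shown there (e.g., "Nice video.." becomes "nice video").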

Swamy et al. (2019) introduced an ensemble approach consisting of L1-regularised logistic regression, L2-regularised logistic regression, a linear SVC, and an LSTM neural network. Several features were used, for instance GloVe embeddings, word/character n-grams with TF-IDF, POS tags, sentiment scores, and count features for URLs, mentions, hashtags, and punctuation marks.

In 2018, the TRAC-1 shared task was released. Ramiandrisoa and Mothe (2018) developed an approach to detect aggressive language for English: three approaches based on machine learning and deep learning models, with several features (i.e., part-of-speech tags, emoticon sentiment frequency, and logistic regression built on document vectorization using Doc2vec). Samghabadi et al. (2018) discussed lexical and semantic features for the English and Hindi languages. Roy et al. (2018) presented an ensemble solution depending on a CNN and an SVM for the English and Hindi languages; word embeddings, n-grams, and TF-IDF vectors were discussed. Nikhil et al. (2018) demonstrated an LSTM approach with an attention unit, following the remarkable results of this approach in NLP tasks; it performs for the English and Hindi languages as well. Moreover, Aroyehun and Gelbukh (2018) presented an investigation comparing deep learning and traditional machine learning (i.e., NB and SVM) to find the most efficient model; notably, to improve overall performance, augmented data and a pseudo-labeling method were used during the training step.

3. Data and Methodology
The shared task on aggression identification focused on the English language, aiming to identify aggressive language in social media content.

3.1. Task Description
The TRAC 2020 shared task (Ritesh Kumar and Zampieri, 2020; Bhattacharya et al., 2020) is a multilingual task that provides two sub-tasks. Sub-task A, aggression identification, is a classification task that assigns a given text to one of three classes: (1) Overtly Aggressive, representing human behavior meant to hurt a community through a verbal, physical, or psychological attitude; (2) Covertly Aggressive, representing a hidden aggressive attack consisting of negative ironic emotions; and (3) Non-aggressive. Table 1 presents the aggression type cases. Sub-task B, misogynistic aggression identification, is a binary classification that aims to classify a given text as gendered or non-gendered.

Type                        Cases
Overtly Aggressive (OAG)    Verbal attack directly pointed towards any group or individuals; abusive words or comparing in a derogatory manner; supporting a false attack.
Covertly Aggressive (CAG)   Focus on figurative words that aim to attack (an individual, nation, or religion); praising someone by criticizing a group, irrespective of being right or wrong.
Non-Aggressive (NAG)        The absence of the intention to be aggressive.
Table 1: The aggression classes and their cases for sub-task EN-A

3.2. Dataset
This shared task provides a multilingual dataset (Bhattacharya et al., 2020) which contains three languages, namely English, Bangla, and Hindi. In this paper, we participate in the English language track for both sub-tasks A and B. The shared task provides three files (train, validation, and test) consisting of 5000 annotated rows from social media, used for both sub-tasks. Tables 2 and 3 provide more details about the distribution of the provided dataset. Table 4 presents examples of the provided dataset for both sub-tasks.

3.3. Data Pre-processing
The pre-processing step on text is a crucial process, especially for social network datasets such as Facebook and Twitter, where posts and tweets are noisy and contain a lot of slang. To obtain a clean version of the provided dataset, we remove unnecessary noise, for instance special characters and punctuation marks (*, ˆ, @, #, -, (, —),

SAJA at TRAC 2020 Shared Task: Transfer Learning for Aggressive Identification with XGBoost
Saja Khaled Tawalbeh, Mahmoud Hammad, Mohammad AL-Smadi
Jordan University of Science and Technology, Irbid, Jordan
sajatawalbeh91@gmail.com, {m-hammad, masmadi}@just.edu.jo

Abstract
This paper describes the participation of the SAJA team in the TRAC 2020 shared task on aggression identification in English text.

We developed a system based on a transfer-learning technique depending on Universal Sentence Encoder (USE) embeddings, which are used to train our developed model with an XGBoost classifier to identify aggressive text in English content. A reference dataset was provided by TRAC 2020 to evaluate the developed approach. The developed approach achieved 60.75% F1 (weighted) in sub-task EN-A, ranking fourteenth out of sixteen teams, and 85.66% F1 (weighted) in sub-task EN-B, ranking sixth out of fifteen teams.
Keywords: aggression identification, social media, NLP, USE, transfer learning, XGBoost

1. Introduction
In today's time, advances in the web and communication technologies are one of the main reasons for the increased impact of nasty content on social media, blogs, and other websites. Detecting aggressive and insulting content has gained recent attention due to its negative effects on users. For instance, demeaning comments, incidents of aggression, trolling, cyberbullying, hate speech, insults, and toxic utterances all have a negative impact on users. Unfortunately, in recent years the percentage of toxic utterances has increased, leading to problems affecting real societies.

In 2018, the first shared task on aggression identification was announced (Kumar et al., 2018). Davidson et al. (2017) presented work on aggression classification by applying a logistic regression classifier depending on several hand-crafted features. Djuric et al. (2015) focused on embeddings learnt from input text using paragraph2vec (Le and Mikolov, 2014) to train a logistic regression classifier. In 2013, Kwok and Wang (2013) developed a Naive Bayes classifier based on unigram features. The second shared task on aggression identification (Bhattacharya et al., 2020) was held at Trolling, Aggression, and Cyberbullying (TRAC 2020), focusing on three languages as a multilingual shared task. It aims to classify social media posts into one of three labels (Overtly Aggressive 'OAG', Covertly Aggressive 'CAG', Non-Aggressive 'NAG') and, as a binary classification, into gendered 'GEN' or non-gendered 'NGEN'.

The major contribution of this paper is to describe the participation of the SAJA team in the TRAC 2020 shared task on aggression identification; more precisely, we participate in the English language track. We developed a system based on a transfer-learning technique depending on Universal Sentence Encoder (USE) embeddings, used to train our developed model with an XGBoost classifier to identify aggressive text in English content. Several approaches were tried to solve the provided task; we report the best results from the evaluation step. A reference dataset was provided by TRAC 2020 to evaluate the developed approach, which achieved 60.75% F1 (weighted) in sub-task EN-A and 85.66% F1 (weighted) in sub-task EN-B.

We discuss the problem statement in Section 2. Section 3 contains details about our methodology and the dataset used. In Section 4, we discuss the results, and Section 5 concludes our work.

2. Related Work
Micro-blogging is considered one of the most popular social network applications. Recent years have seen a rapid rise in the use of social media to express users' feelings and share their ideas; on the other hand, the use of aggressive, hate speech, and offensive language has obviously increased gradually. Comprehensive studies of hate speech detection are presented by Schmidt and Wiegand (2017) and Fortuna and Nunes (2018), and Davidson et al. (2017) present the Hate Speech Detection dataset. Additionally, Spertus (1997), one of the earliest efforts in hate speech detection, presented a decision-tree-based classifier with 88.2% accuracy. Moreover, offensive identification for sentences has been tried for several languages beyond English, such as Arabic (Mubarak et al., 2017; Al-Hassan and Al-Dossari, 2019), German (Ross et al., 2017), and others (Fiser et al., 2017; Su et al., 2017). In particular, the OLID dataset (Zampieri et al., 2019a) was presented last year for offensive language detection (Zampieri et al., 2019b). Mohaouchane et al. (2019) present several neural networks on the Arabic language, namely: (i) CNN, (ii) Bi-LSTM, (iii) Bi-LSTM with attention mechanism, and (iv) combined CNN and LSTM. Moreover, the dataset has b…