Making Pre-trained Language Models Better Few-shot Learners (LM-BFF)


Task | Auto template | Auto label words

SST-2 (positive/negative)
  <S1> A [MASK] one. | irresistible/pathetic
  <S1> A [MASK] piece. | wonderful/bad
  <S1> All in all [MASK]. | delicious/bad

SST-5 (very positive/positive/neutral/negative/very negative)
  <S1> The movie is [MASK]. | wonderful/remarkable/hilarious/better/awful
  <S1> The music is [MASK]. | wonderful/perfect/hilarious/better/awful
  <S1> But it is [MASK]. | unforgettable/extraordinary/good/better/terrible

MR (positive/negative)
  It was [MASK]! <S1> | epic/terrible
  <S1> It's [MASK]. | epic/awful
  <S1> A [MASK] piece of work. | exquisite/horrible

CR (positive/negative)
  <S1> It's [MASK]! | fantastic/horrible
  <S1> The quality is [MASK]. | neat/pointless
  <S1> That is [MASK]. | magnificent/unacceptable

MPQA (positive/negative)
  <S1> is [MASK]. | important/close
  <S1> , [MASK]! | needed/bad
  <S1> . [MASK]. | unexpected/shocking

Subj (subjective/objective)
  <S1> It's all [MASK]. | everywhere/tragic
  <S1> It's [MASK]. | everywhere/horrifying
  <S1> Is it [MASK]? | something/surreal

TREC (abbreviation/entity/description/human/location/numeric)
  Q: [MASK]: <S1> | Application/Advisor/Discussion/Culture/Assignment/Minute
  <S1> Why [MASK]? | Production/AE/Context/Artist/Assignment/Minute
  <S1> Answer: [MASK]. | Personality/Advisor/Conclusion/Hum/Assignment/Minute

CoLA (grammatical/not_grammatical)
  <S1> You are [MASK]. | one/proof
  It is [MASK]. <S1> | wrong/sad
  I am [MASK]. <S1> | misleading/disappointing

MNLI (entailment/neutral/contradiction)
  <S1> . [MASK], you are right, <S2> | Fine/Plus/Otherwise
  <S1> . [MASK] you're right <S2> | There/Plus/Otherwise
  <S1> . [MASK]! <S2> | Meaning/Plus/Otherwise

SNLI (entailment/neutral/contradiction)
  <S1> . [MASK], no, <S2> | Alright/Watch/Except
  <S1> . [MASK], in this case <S2> | Hi/Watch/Worse
  <S1> . [MASK] this time <S2> | Regardless/Fortunately/Unless

QNLI (entailment/not_entailment)
  <S1> ? [MASK]. Yes, <S2> | Okay/Nonetheless
  <S1> ? [MASK]. It is known that <S2> | Notably/Yet
  <S1> ? [MASK], however, <S2> | Specifically/Notably

RTE (entailment/not_entailment)
  <S1> . [MASK], I believe <S2> | Clearly/Yet
  <S1> . [MASK], I think that <S2> | Accordingly/meanwhile
  <S1> . [MASK], I think <S2> | So/Meanwhile

MRPC (equivalent/not_equivalent)
  <S1> . [MASK]! <S2> | Rather/Alas
  <S1> . [MASK]. This is the first time <S2> | At/Thus
  <S1> . [MASK]. That's right. <S2> | Instead/Moreover

QQP (equivalent/not_equivalent)
  <S1> ? [MASK], but <S2> | Me/Since
  <S1> ? [MASK], please, <S2> | Um/Best
  <S1> ? [MASK], I want to know <S2> | Ironically/Beyond

STS-B (y_u / y_l)
  <S1> . [MASK] sir <S2> | Note/Next
  <S1> . [MASK], it is not. <S2> | Yesterday/meanwhile
  <S1> . [MASK]. It is <S2> | Yeah/meanwhile

Table E.1: Top 3 automatically generated templates and label words for all tasks, based on one split of K = 16 training examples. Note that automatic template results are based on manual label words and automatic label word results are based on manual templates provided in Table 1.

Dataset | |Y| | L | #Train | #Test | Type | Labels (classification tasks)

Single-sentence tasks:
  SST-2 | 2 | 19 | 6,920 | 872 | sentiment | positive, negative
  SST-5 | 5 | 18 | 8,544 | 2,210 | sentiment | v. pos., positive, neutral, negative, v. neg.
  MR | 2 | 20 | 8,662 | 2,000 | sentiment | positive, negative
  CR | 2 | 19 | 1,775 | 2,000 | sentiment | positive, negative
  MPQA | 2 | 3 | 8,606 | 2,000 | opinion polarity | positive, negative
  Subj | 2 | 23 | 8,000 | 2,000 | subjectivity | subjective, objective
  TREC | 6 | 10 | 5,452 | 500 | question cls. | abbr., entity, description, human, loc., num.
  CoLA | 2 | 8 | 8,551 | 1,042 | acceptability | grammatical, not_grammatical

Sentence-pair tasks:
  MNLI | 3 | 22/11 | 392,702 | 9,815 | NLI | entailment, neutral, contradiction
  SNLI | 3 | 14/8 | 549,367 | 9,842 | NLI | entailment, neutral, contradiction
  QNLI | 2 | 11/30 | 104,743 | 5,463 | NLI | entailment, not_entailment
  RTE | 2 | 49/10 | 2,490 | 277 | NLI | entailment, not_entailment
  MRPC | 2 | 22/21 | 3,668 | 408 | paraphrase | equivalent, not_equivalent
  QQP | 2 | 12/12 | 363,846 | 40,431 | paraphrase | equivalent, not_equivalent
  STS-B | R | 11/11 | 5,749 | 1,500 | sent. similarity | -

Table B.1: The datasets evaluated in this work. |Y|: number of classes for classification tasks (with one exception: STS-B is a real-valued regression task over the interval [0, 5]). L: average number of words in the input sentence(s). Note that we only sample D_train and D_dev of K x |Y| examples from the original training set in our few-shot experiments (§3).
BERT-large | SST-2 | SNLI | TREC | MRPC
  Fine-tuning | 79.5 | 51.4 | 80.3 | 74.4
  Prompt-based FT | 85.6 | 59.2 | 79.0 | 66.8
  + demo (1-seg) | 87.5 | 50.4 | 77.2 | 68.5
  + demo (2-seg) | 86.1 | 61.3 | 77.9 | 73.2
  + demo (n-seg) | 86.4 | 58.6 | 79.6 | 71.0

RoBERTa-large | SST-2 | SNLI | TREC | MRPC
  Fine-tuning | 81.4 | 48.4 | 88.8 | 76.6
  Prompt-based FT | 92.7 | 77.2 | 84.8 | 74.5
  + demonstrations | 92.6 | 79.7 | 87.5 | 77.8

Table D.1: A comparison of BERT-large vs RoBERTa-large. We use manual prompts in these experiments.

D  Comparisons of BERT vs RoBERTa

Table D.1 compares the results of BERT-large (uncased) and RoBERTa-large in our settings. Pre-trained BERT provides two segment embeddings (A/B) for different parts of the input.

The common practice when fine-tuning BERT is to use only segment A for single-sentence tasks, and segments A/B for the two sentences in sentence-pair tasks. In our case of incorporating demonstrations, however, we have more than two sentences. Thus we explore the following different strategies for segments: (1) using the A segment for all sentences (1-seg); (2) using the A segment for the original input and the B segment for the demonstrations (2-seg); (3) using different segment embeddings for each sentence (n-seg), e.g., for SNLI, we use different segments for each premise and hypothesis in both the original input and the demonstrations, which leads to a total of 8 segment embeddings. This introduces new segment embeddings (randomly initialized and learned during fine-tuning), as the pre-trained BERT only has two.

Table D.1 shows that prompt-based fine-tuning with demonstrations also works for BERT, and 2-seg works the best when incorporating demonstrations. Still, we take RoBERTa-large as our main model, for RoBERTa performs much better than BERT and RoBERTa saves the trouble of tuning the usage of segment embeddings.

E  Generated Prompts

We demonstrate the top 3 automatically generated templates and label words for all tasks in Table E.1. In general, most automatic templates are reasonable and grammatically correct. For the label words, the generated results look intuitive for most single-sentence tasks. For other tasks, the automatic ones can be counter-intuitive in some cases. It is still unclear why the language model picks these words, and sometimes they actually work well. We leave this for future study.

A  Impact of Development Sets

Table A.1 shows how the size of the development sets can affect the final performance of the model. For "No Ddev", we take the same hyper-parameters from Schick and Schütze (2021a,b): batch size = 16, learning rate = 1e-5 and training steps = 250. We also experiment with a variant in which we sample a development set 10 times larger than the training set. We can see that using larger development sets leads to better performance, and this is why we stick to |Dtrain| = |Ddev| in our few-shot setting.
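
In code, the three strategies described in Appendix D above reduce to how token_type_ids are assigned over the concatenated pieces of the input. A minimal sketch, assuming each piece (the templated input and each demonstration) has already been tokenized to a list of token ids; the helper below is illustrative and not from the released implementation:

    # Sketch of the 1-seg / 2-seg / n-seg strategies from Appendix D.
    # pieces[0] is the original (templated) input, pieces[1:] are demonstrations.
    def build_segment_ids(pieces, strategy="2-seg"):
        token_type_ids = []
        for i, piece in enumerate(pieces):
            if strategy == "1-seg":       # segment A for everything
                seg = 0
            elif strategy == "2-seg":     # A for the input, B for all demonstrations
                seg = 0 if i == 0 else 1
            elif strategy == "n-seg":     # a distinct segment per piece; segments
                seg = i                   # beyond A/B need new, randomly initialized
            else:                         # segment embeddings
                raise ValueError(strategy)
            token_type_ids.extend([seg] * len(piece))
        return token_type_ids

    # e.g. build_segment_ids([[101, 2023], [2003, 102], [2307, 102]], "2-seg")
    #      -> [0, 0, 1, 1, 1, 1]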
Fine-tuning | SST-2 | SNLI | TREC | MRPC
  No Ddev | 79.5 | 49.2 | 83.9 | 77.8
  |Ddev| = |Dtrain| | 81.4 | 48.4 | 88.8 | 76.6
  |Ddev| = 10 |Dtrain| | 83.5 | 52.0 | 89.4 | 79.6

Prompt-based FT | SST-2 | SNLI | TREC | MRPC
  No Ddev | 92.1 | 75.3 | 84.8 | 70.2
  |Ddev| = |Dtrain| | 92.7 | 77.2 | 84.8 | 74.5
  |Ddev| = 10 |Dtrain| | 93.0 | 79.7 | 89.3 | 80.9

Table A.1: Impact of different sizes of development sets. Standard deviations are omitted here to save space. For No Ddev, we use the same set of hyper-parameters as Schick and Schütze (2021a,b).

B  Datasets

For SNLI (Bowman et al., 2015) and datasets from GLUE (Wang et al., 2019), including SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), MRPC (Dolan and Brockett, 2005), QQP (footnote 12: https://www.quora.com/q/quoradata/) and STS-B (Cer et al., 2017), we follow Zhang et al. (2021) and use their original development sets for testing. For datasets which require a cross-validation evaluation, namely MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), MPQA (Wiebe et al., 2005) and Subj (Pang and Lee, 2004), we simply randomly sample 2,000 examples as the test set and leave them out of training. For SST-5 (Socher et al., 2013) and TREC (Voorhees and Tice, 2000), we use their official test sets. We show dataset statistics in Table B.1.

C  Experimental Details

C.1  Hyper-parameter selection

For grid search, we take learning rates from {1e-5, 2e-5, 5e-5} and batch sizes from {2, 4, 8}. These numbers are picked by pilot experiments on the SST-2 and SNLI datasets. We also use early stopping to avoid overfitting. For each trial, we train the model for 1,000 steps, validate the performance every 100 steps, and take the best checkpoint.
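
The grid search in C.1 is a small loop over learning rates and batch sizes with periodic validation. A schematic sketch; train_and_evaluate is a stand-in for the actual fine-tuning routine and is not part of the paper's code:

    import itertools

    LEARNING_RATES = [1e-5, 2e-5, 5e-5]
    BATCH_SIZES = [2, 4, 8]

    def grid_search(train_and_evaluate, d_train, d_dev):
        # train_and_evaluate is assumed to train for 1,000 steps, evaluate on
        # d_dev every 100 steps, and return the best dev score it observed
        # (i.e., early stopping by keeping the best checkpoint).
        best_score, best_config = float("-inf"), None
        for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES):
            score = train_and_evaluate(d_train, d_dev, learning_rate=lr,
                                       batch_size=bs, max_steps=1000,
                                       eval_every=100)
            if score > best_score:
                best_score, best_config = score, {"lr": lr, "batch_size": bs}
        return best_score, best_config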

C.2  Prompt-based fine-tuning

Table 1 shows all the manual templates and label words we use in our experiments. For automatic template generation, we take the T5-3B model (footnote 13), which is the largest publicly available one that can fit on a single GPU. For automatically searching label words, we set k to 100 for all tasks except SST-5 and TREC. For SST-5 we set a smaller k = 30, as it is a 5-way classification task. For TREC, we observe that filtering Vc using conditional likelihood alone is still noisy, thus we set k = 1,000, and then re-rank Vc by the nearest neighbors of the original manual label words and take the top 30 per class. We set n to 100 in all experiments. Due to the large number of trials in automatic search, we take a fixed set of hyper-parameters in this part: a batch size of 8 and a learning rate of 1e-5.

(Footnote 13: We take the T5 1.0 checkpoint, which is trained on both unsupervised and downstream task data. We compared it to T5 1.1 (without downstream task data) and did not find a significant difference in generated templates.)

Since the idea of prompt-based fine-tuning is to make the input and output distribution close to pre-training, the implementation details are crucial. For templates, we put an extra space before a sentence if it is not at the beginning of the input. Also, we lowercase the first letter of a sentence if it is concatenated with a prefix (e.g., <S2> in Table 1). Also, if one sentence is appended with any punctuation (e.g., <S1> in Table 1), then the last character of the original sentence is discarded. Finally, we prepend a space for label words in M(Y). For example, we use " great" instead of "great" in the RoBERTa vocabulary, where " " stands for the space.

C.3  Fine-tuning with demonstrations

When using demonstrations, we sample 16 different sets of demonstrations for each input and average the predicted log probability for each class during inference. We find that further increasing the number of samples does not bring substantial improvement. Additionally, we have tried different aggregation methods, such as taking the result with the maximum confidence, and we did not find a meaningful improvement. For selective demonstrations, we take roberta-large-nli-stsb-mean-tokens (footnote 14: https://github.com/UKPLab/sentence-transformers) from Reimers and Gurevych (2019) as our sentence embedding model.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2021. Revisiting few-sample BERT fine-tuning. In International Conference on Learning Representations (ICLR).

Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is [MASK]: Learning vs. learning to recall. In North American Chapter of the Association for Computational Linguistics (NAACL).
Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Association for Computational Linguistics (ACL).

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Association for Computational Linguistics (ACL).

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Empirical Methods in Natural Language Processing (EMNLP).

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. The Journal of Machine Learning Research (JMLR), 21(140).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In International Conference on Computational Linguistics (COLING).

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze questions for few-shot text classification and natural language inference. In European Chapter of the Association for Computational Linguistics (EACL).

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In North American Chapter of the Association for Computational Linguistics (NAACL).

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Automatic prompt construction for masked language models. In Empirical Methods in Natural Language Processing (EMNLP).

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP).

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics - on what language model pre-training captures. Transactions of the Association of Computational Linguistics (TACL), 8.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association of Computational Linguistics (TACL), 7.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3).

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems (NeurIPS), 33.

Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Universal natural language processing with limited annotations: Try few-shot textual entailment as a start. In Empirical Methods in Natural Language Processing (EMNLP).

Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In North American Chapter of the Association for Computational Linguistics (NAACL).
References

Trapit Bansal, Rishikesh Jha, and Andrew McCallum. 2020a. Learning to few-shot learn across diverse natural language classification tasks. In International Conference on Computational Linguistics (COLING).

Trapit Bansal, Rishikesh Jha, Tsendsuren Munkhdalai, and Andrew McCallum. 2020b. Self-supervised meta-learning for few-shot natural language classification tasks. In Empirical Methods in Natural Language Processing (EMNLP).

Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. 2020. Few-shot text classification with distributional signatures. In International Conference on Learning Representations (ICLR).

Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document Transformer. arXiv:2004.05150.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS).

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In the 11th International Workshop on Semantic Evaluation (SemEval-2017).

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Association for Computational Linguistics (ACL).

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment.

Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pre-trained models. In Empirical Methods in Natural Language Processing (EMNLP).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL).

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In the Third International Workshop on Paraphrasing (IWP2005).

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Empirical Methods in Natural Language Processing (EMNLP).

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Association for Computational Linguistics (ACL).

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association of Computational Linguistics (TACL).

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective regularization to finetune large-scale pretrained language models. In International Conference on Learning Representations (ICLR).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Pascal Mettes, Elise van der Pol, and Cees Snoek. 2019. Hyperspherical prototype networks. In Advances in Neural Information Processing Systems (NeurIPS).

Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. In International Conference on Learning Representations (ICLR).
SST-2 | SNLI | TREC | MRPC
  Prompt-based FT | 92.7 | 77.2 | 84.8 | 74.5
  Uniform sampling | 92.3 | 78.8 | 85.6 | 70.9
  + RoBERTa sel. | 92.7 | 79.5 | 83.4 | 76.6
  + SBERT sel. | 92.6 | 79.7 | 87.5 | 77.8

Table 7: Impact of demonstration sampling strategies. Uniform sampling randomly samples demonstrations, while selective (sel.) sampling only takes the top sentences measured by the sentence encoders (§6).

7.3  Analysis of demonstration sampling

Table 7 compares the performance of demonstrations using uniform sampling to selective sampling by SBERT. We acknowledge that SBERT is trained on the SNLI and MNLI datasets, thus we also tried a simple sentence encoder using mean pooling of hidden representations from RoBERTa-large. We find that in either case, using selective sampling outperforms uniform sampling, highlighting the importance of sampling similar examples for incorporating demonstrations in context.

7.4  Sample efficiency

Figure 3 illustrates how standard fine-tuning and our LM-BFF compare as K increases. For a simple task such as SST-2 (also see MR, CR and MPQA in Table 3), despite using only 32 total examples, LM-BFF has already nearly saturated its performance and is comparable to standard fine-tuning over the entire dataset. On the harder task of SNLI, LM-BFF continues to improve as K increases while still maintaining a performance gap over standard fine-tuning, until the two converge around K = 256.

8  Discussion

Reformulating NLP tasks as MLM has exciting implications for few-shot learning, but also has limitations. First, while LM-BFF greatly outperforms standard fine-tuning, Table 3 shows that, overall, the performance still substantially lags behind fine-tuning with thousands of examples, especially for harder tasks. Additionally, just like standard fine-tuning, our results also suffer from high variance. As described in §2, several recent studies have tried to counter instability in few-shot fine-tuning, and we expect these methods to also help here. With respect to automatic prompt generation, despite its effectiveness, we still find it practically challenging to expand the search space, or to generalize well based on only approximately 32 examples.

Figure 3: Standard fine-tuning vs our LM-BFF as a function of K (number of instances per class). For lower K, our method consistently outperforms standard fine-tuning.

This is partly due to our lingering reliance on some manual design: either manual templates (for label word search) or manual label words (for template search), which allows us to get our search off the ground, but does also bias it towards areas of the search space that we might have already imagined. Finally, it is important to clarify that LM-BFF favors certain tasks which (1) can be naturally posed as a "fill-in-the-blank" problem; (2) have relatively short input sequences; and (3) do not contain many output classes. Issues (2) and (3) might be ameliorated with longer-context language models (e.g., Beltagy et al., 2020). For tasks that are not straightforward to formulate in prompting, such as structured prediction, issue (1) is more fundamental. We leave it as an open question for future work.

9  Conclusion

In this paper we presented LM-BFF, a set of simple but effective techniques for fine-tuning language models using only a few examples. Our approach proposes to (1) use prompt-based fine-tuning with automatically searched prompts; and (2) include selected task demonstrations (training examples) as part of the input context. We show that our method outperforms vanilla fine-tuning by up to 30% (and 11% on average). We concluded by discussing the limitations of our approach, and posed open questions for future study.

Acknowledgements

We thank the members of the Princeton, MIT and Tsinghua NLP groups and the anonymous reviewers for their valuable feedback. TG is supported by a Graduate Fellowship at Princeton University and AF is supported by an NSF Graduate Research Fellowship. This research is also partly supported by a Google Research Scholar Award.

Prompt-based Fine-tuning | MNLI | RTE
  Our single manual P | 68.3 (2.3) | 69.1 (3.6)
  P_PET | 71.9 (1.5) | 69.2 (4.0)
  P_ours, |P_ours| = |P_PET| | 70.4 (3.1) | 73.0 (3.2)
  + demonstrations | 74.0 (1.9) | 71.9 (4.6)
  P_ours, |P_ours| = 20 | 72.7 (2.5) | 73.1 (3.3)
  + demonstrations | 75.4 (1.6) | 72.3 (4.5)

Table 4: Ensemble models using manual prompts from PET (Schick and Schütze, 2021a,b) and our automatic templates. PET uses 4 prompts for MNLI and 5 for RTE. We also use an equal number of templates in |P_ours| = |P_PET| for a fair comparison.
SST-2 | SNLI | TREC | MRPC
  Manual | 92.7 | 77.2 | 84.8 | 74.5
  Auto T | 92.3 | 77.1 | 88.2 | 76.2
  Auto L | 91.5 | 75.6 | 87.0 | 77.2
  Auto T + L | 92.1 | 77.0 | 89.2 | 74.0

Table 5: Comparison between manual prompts and different automatic prompt generation methods: auto-generated templates (Auto T), auto-generated label words (Auto L), and their combination (Auto T + L).

Second, prompt-based fine-tuning can greatly outperform standard fine-tuning, both when using a manual prompt and when using a generated one. CoLA is one interesting exception, as the input may be a non-grammatical sentence which is out of the distribution of L. Generally, our automatically searched templates can achieve comparable or even higher results than manual ones, especially for tasks in which constructing strong manual templates is less intuitive (e.g., TREC, QNLI and MRPC). Finally, using demonstrations in context leads to consistent gains in a majority of tasks. In summary, our combined solution, fine-tuning with automatically searched templates and sampled demonstration sets, achieves a 30% gain on SNLI compared to standard fine-tuning, and an 11% gain on average.

Ensemble results. An advantage of automatic prompt search is that we can generate as many prompts as we want, train individual models, and create large ensembles. PET (Schick and Schütze, 2021a,b) also ensembles multiple models trained with manual prompts (footnote 10).

In Table 4, we make a direct comparison of our searched prompts and PET's manual prompts on MNLI and RTE, the two datasets that we evaluate in common (footnote 11). As the results show, an ensemble with multiple templates always improves performance. An ensemble of the same number of automatic templates achieves comparable or better performance than the ensemble of PET's manual prompts. Increasing the number of automatic templates brings further gains.

(Footnote 10: They then use unlabeled data and distillation to get a single model, which is outside of our scope.)
(Footnote 11: In the PET NLI templates, the hypothesis is put before the premise, which we actually found to be suboptimal. In our experiments, we swap the two and get better results.)

7.2  Analysis of generated prompts

Table 5 gives the results of using manual vs automatic prompts. For automatic prompts, we compare template search (Auto T), label word search (Auto L), and a joint variant (Auto T + L) in which we start from manual label words, apply Auto T, and then Auto L. In most cases, Auto T achieves comparable or higher performance than the manual ones, and it is consistently the best variant. Auto L outperforms manual prompts on TREC and MRPC, but is considerably worse on SNLI. Auto T + L is often better than Auto L, but only sometimes better than Auto T. Table 6 shows examples from Auto T and Auto L (a full list is in Appendix E). Auto T templates generally fit the context and label words well, but can contain biased peculiarities (e.g., "{Yes/No}, no" in SNLI). For Auto L words, things are mixed: while most look intuitively reasonable, there are also some mysterious abnormalities (e.g., "Hi" for the "entailment" class in SNLI).

SST-2 (positive/negative)
  Auto T, M(Y) = {great, terrible}
    #1. <S1> A [MASK] one.
    #2. <S1> A [MASK] piece.
    #3. <S1> All in all [MASK].
  Auto L, T(x_in) = <S1> It was [MASK].
    #1. irresistible/pathetic
    #2. wonderful/bad
    #3. delicious/bad

SNLI (entailment/neutral/contradiction)
  Auto T, M(Y) = {Yes, Maybe, No}
    #1. <S1> . [MASK], no, <S2>
    #2. <S1> . [MASK], in this case <S2>
    #3. <S1> . [MASK] this time <S2>
  Auto L, T(x_in) = <S1> ? [MASK], <S2>
    #1. Alright/Watch/Except
    #2. Hi/Watch/Worse
    #3. Regardless/Fortunately/Unless

Table 6: Examples of our automatically generated templates (Auto T) and label words (Auto L).

  | SST-2 (acc) | SST-5 (acc) | MR (acc) | CR (acc) | MPQA (acc) | Subj (acc) | TREC (acc) | CoLA (Matt.)
  Majority† | 50.9 | 23.1 | 50.0 | 50.0 | 50.0 | 50.0 | 18.8 | 0.0
  Prompt-based zero-shot‡ | 83.6 | 35.0 | 80.8 | 79.5 | 67.6 | 51.4 | 32.0 | 2.0
  "GPT-3" in-context learning | 84.8 (1.3) | 30.6 (0.9) | 80.5 (1.7) | 87.4 (0.8) | 63.8 (2.1) | 53.6 (1.0) | 26.2 (2.4) | -1.5 (2.4)
  Fine-tuning | 81.4 (3.8) | 43.9 (2.0) | 76.9 (5.9) | 75.8 (3.2) | 72.0 (3.8) | 90.8 (1.8) | 88.8 (2.1) | 33.9 (14.3)
  Prompt-based FT (man) | 92.7 (0.9) | 47.4 (2.5) | 87.0 (1.2) | 90.3 (1.0) | 84.7 (2.2) | 91.2 (1.1) | 84.8 (5.1) | 9.3 (7.3)
  + demonstrations | 92.6 (0.5) | 50.6 (1.4) | 86.6 (2.2) | 90.2 (1.2) | 87.0 (1.1) | 92.3 (0.8) | 87.5 (3.2) | 18.7 (8.8)
  Prompt-based FT (auto) | 92.3 (1.0) | 49.2 (1.6) | 85.5 (2.8) | 89.0 (1.4) | 85.8 (1.9) | 91.2 (1.1) | 88.2 (2.0) | 14.0 (14.1)
  + demonstrations | 93.0 (0.6) | 49.5 (1.7) | 87.7 (1.4) | 91.0 (0.9) | 86.5 (2.6) | 91.4 (1.8) | 89.4 (1.7) | 21.8 (15.9)
  Fine-tuning (full)† | 95.0 | 58.7 | 90.8 | 89.4 | 87.8 | 97.0 | 97.4 | 62.6

  | MNLI (acc) | MNLI-mm (acc) | SNLI (acc) | QNLI (acc) | RTE (acc) | MRPC (F1) | QQP (F1) | STS-B (Pear.)
  Majority† | 32.7 | 33.0 | 33.8 | 49.5 | 52.7 | 81.2 | 0.0 | -
  Prompt-based zero-shot‡ | 50.8 | 51.7 | 49.5 | 50.8 | 51.3 | 61.9 | 49.7 | -3.2
  "GPT-3" in-context learning | 52.0 (0.7) | 53.4 (0.6) | 47.1 (0.6) | 53.8 (0.4) | 60.4 (1.4) | 45.7 (6.0) | 36.1 (5.2) | 14.3 (2.8)
  Fine-tuning | 45.8 (6.4) | 47.8 (6.8) | 48.4 (4.8) | 60.2 (6.5) | 54.4 (3.9) | 76.6 (2.5) | 60.7 (4.3) | 53.5 (8.5)
  Prompt-based FT (man) | 68.3 (2.3) | 70.5 (1.9) | 77.2 (3.7) | 64.5 (4.2) | 69.1 (3.6) | 74.5 (5.3) | 65.5 (5.3) | 71.0 (7.0)
  + demonstrations | 70.7 (1.3) | 72.0 (1.2) | 79.7 (1.5) | 69.2 (1.9) | 68.7 (2.3) | 77.8 (2.0) | 69.8 (1.8) | 73.5 (5.1)
  Prompt-based FT (auto) | 68.3 (2.5) | 70.1 (2.6) | 77.1 (2.1) | 68.3 (7.4) | 73.9 (2.2) | 76.2 (2.3) | 67.0 (3.0) | 75.0 (3.3)

  + demonstrations | 70.0 (3.6) | 72.0 (3.1) | 77.5 (3.5) | 68.5 (5.4) | 71.1 (5.3) | 78.1 (3.4) | 67.7 (5.8) | 76.4 (6.2)
  Fine-tuning (full)† | 89.8 | 89.5 | 92.6 | 93.3 | 80.9 | 91.4 | 81.7 | 91.9

Table 3: Our main results using RoBERTa-large. †: the full training set is used (see dataset sizes in Table B.1); ‡: no training examples are used; otherwise we use K = 16 (per class) for few-shot experiments. We report mean (and standard deviation) performance over 5 different splits (§3). Majority: majority class; FT: fine-tuning; man: manual prompt (Table 1); auto: automatically searched templates (§5.2); "GPT-3" in-context learning: using the in-context learning proposed in Brown et al. (2020) with RoBERTa-large (no parameter updates).

To address this issue, we devise a simple strategy in which we only sample examples that are semantically close to x_in. Specifically, we use a pre-trained SBERT (Reimers and Gurevych, 2019) model to obtain embeddings for all input sentences (for sentence-pair tasks, we use the concatenation of the two sentences). Here we just feed the raw sentences without the templates into SBERT. For each query x_in and each label c ∈ Y, we sort all training instances with that label, x ∈ D_train^c, by their similarity score to the query, cos(e(x_in), e(x)), and only sample from the top r = 50% instances for each class to use as demonstrations.

7  Experiments

We present our main results, and address several research questions pertaining to our LM-BFF approach. Implementation details are in Appendix C.

7.1  Main results

We use a RoBERTa-large model and set K = 16 in our experiments. A comparison of using RoBERTa vs BERT can be found in Appendix D. For automatic prompt search, in our main table we report automatic template search only (which consistently performs the best, see Table 5). To put our results in perspective, we compare to a number of baselines, namely (1) standard fine-tuning in our few-shot setting; (2) standard fine-tuning using the full training set; (3) simply taking the most frequent class (measured on the full training set); (4) prompt-based zero-shot prediction, where we take our manual prompts and use L "out-of-the-box" without using any training examples; and (5) "GPT-3" in-context learning, where we use the same prompt-based zero-shot setting, but augment the context with 32 randomly sampled demonstrations (and still use RoBERTa-large, not GPT-3).

Single-prompt results. Table 3 shows our main results using a single prompt, either from our manually designed ones (Table 1), or the best generated ones. First, prompt-based zero-shot prediction achieves much better performance than the majority class, showing the pre-encoded knowledge in RoBERTa. Also, "GPT-3" in-context learning does not always improve over zero-shot prediction, likely because smaller language models are not expressive enough to use off-the-shelf like GPT-3.
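
A sketch of the selective demonstration sampling described in §6.2 above, assuming the sentence-transformers package and the SBERT checkpoint named in Appendix C.3; function and variable names are illustrative:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")

    def demonstration_pool(query_text, class_examples, ratio=0.5):
        # class_examples: raw training sentences of one class c (no template applied).
        embs = encoder.encode([query_text] + class_examples)
        query, cands = embs[0], embs[1:]
        # cosine similarity between the query and every candidate of this class
        sims = cands @ query / (np.linalg.norm(cands, axis=1) * np.linalg.norm(query))
        order = np.argsort(-sims)                        # most similar first
        keep = max(1, int(len(class_examples) * ratio))  # top r = 50% per class
        return [class_examples[i] for i in order[:keep]]

    # Demonstrations are then drawn from demonstration_pool(...) instead of from
    # the full class, both at training and at test time.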
Figure 2: Our approach for template generation.

T5 is pre-trained to fill in missing spans (replaced by T5 mask tokens, e.g., <X> or <Y>) in its input. For example, given the input "Thank you <X> me to your party <Y> week", T5 is trained to generate "<X> for inviting <Y> last <Z>", meaning that "for inviting" is the replacement for <X> and "last" is the replacement for <Y>. This is well suited for prompt generation: we can simply take input sentences from D_train and let the T5 model construct the template T, without having to specify a pre-defined number of tokens for it.

Given an input example (x_in, y) ∈ D_train, we consider the following simple conversions, denoted as T_g(x_in, y), for formulating the T5 model inputs (footnote 7):

  <S1>        ->  <X> M(y) <Y> <S1>
  <S1>        ->  <S1> <X> M(y) <Y>
  <S1>, <S2>  ->  <S1> <X> M(y) <Y> <S2>
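
These conversions are plain string manipulations around the T5 sentinel tokens. A minimal sketch, using the Hugging Face sentinel spellings <extra_id_0>/<extra_id_1> in place of <X>/<Y>; the helper name is illustrative:

    # Forming T5 inputs T_g(x_in, y) for template generation (§5.2).
    X, Y = "<extra_id_0>", "<extra_id_1>"

    def t5_inputs(s1, label_word, s2=None):
        if s2 is None:
            # single-sentence task: label word placed before or after the input
            return [f"{X} {label_word} {Y} {s1}",
                    f"{s1} {X} {label_word} {Y}"]
        # sentence-pair task: label word placed between the two sentences
        return [f"{s1} {X} {label_word} {Y} {s2}"]

    # t5_inputs("No reason to watch it.", "terrible")
    #   -> ["<extra_id_0> terrible <extra_id_1> No reason to watch it.",
    #       "No reason to watch it. <extra_id_0> terrible <extra_id_1>"]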

As shown in Figure 2, we rely on the T5 model to fill in the placeholders. When decoding, our goal here is to find an output that can work well for all examples in D_train, i.e., the output template T that maximizes \sum_{(x_in, y) \in D_train} \log P_{T5}(T \mid T_g(x_in, y)), where P_{T5} denotes the output probability distribution of T5. It can be decomposed according to:

  \sum_{j=1}^{|T|} \sum_{(x_in, y) \in D_train} \log P_{T5}(t_j \mid t_1, \ldots, t_{j-1}, T_g(x_in, y)),   (4)

where (t_1, ..., t_{|T|}) are the template tokens.

We use beam search to decode multiple template candidates. Concretely, we use a wide beam width (e.g., 100) to cheaply obtain a large set of diverse templates. We then fine-tune each generated template on D_train and use D_dev to either pick the single template with the best performance (Table 3), or the top k templates to use as an ensemble (Table 4). Though it might appear to be expensive to fine-tune the model on each individual template, this is fast in practice due to the small size of D_train, and is also fully automated, making it easy to use compared to manually tuning prompts for each dataset.

(Footnote 7: We consider putting the label word both before and after the input sentence for single-sentence tasks. However, we find that it is always better to put the label words in the middle (between the two sentences) for sentence-pair tasks.)

6  Fine-tuning with Demonstrations

In this section, we study whether we can leverage demonstrations when fine-tuning medium-sized LMs, and find better ways to exploit them.

6.1  Training examples as demonstrations

GPT-3's naive approach to in-context learning simply involves concatenating the input with up to 32 examples randomly drawn from the training set. This approach is suboptimal as (1) the number of available demonstrations is bounded by the model's maximum input length (footnote 8); and (2) mixing numerous random examples from different classes together creates extremely long contexts which can be hard to leverage, especially for a smaller model. To address these issues, we propose a simpler solution: at each training step, we randomly sample one (footnote 9) example (x_in^(c), y^(c)) ∈ D_train from each class, convert it into T(x_in^(c)) with [MASK] replaced by M(y^(c)), which we denote as T~(x_in^(c), y^(c)), and then concatenate them with x_in (Figure 1(c)):

  T(x_in) ⊕ T~(x_in^(1), y^(1)) ⊕ ... ⊕ T~(x_in^(|Y|), y^(|Y|)).

Here ⊕ denotes concatenation of input sequences. During both training and inference we sample multiple demonstration sets for each x_in. Note that both x_in and the demonstration examples are sampled from the same set D_train during training. At test time, we still sample demonstration sets from D_train and ensemble predictions across all sets.

6.2  Sampling similar demonstrations

We observe that controlling the construction of the demonstration examples {(x_in^(c), y^(c))} is crucial for good final performance. For example, if the set of contrastive demonstrations x_in^(c) are all dramatically different, from each other or from the query x_in, then it becomes challenging for the language model to decipher meaningful patterns. As a result, the model may simply ignore the context, or even get confused by the additional examples.

(Footnote 8: GPT-3 uses a context size of 2,048 while most smaller language models (e.g., RoBERTa) have a context size of 512.)
(Footnote 9: We also explored sampling multiple examples per class, but did not observe any improvements.)
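
Building one input with demonstrations (§6.1) amounts to filling the template with [MASK] for the query and with the label word for one sampled example per class, then concatenating. A hedged sketch with illustrative helper names:

    import random

    # T(x_in) ⊕ T~(x(1), y(1)) ⊕ ... ⊕ T~(x(|Y|), y(|Y|)) from §6.1.
    # template(text) is assumed to return e.g. "<text> It was [MASK]." and
    # label_word[c] maps class c to its (manual or searched) label word.
    def with_demonstrations(x_in, train_by_class, template, label_word):
        parts = [template(x_in)]                  # the query keeps its [MASK]
        for c, examples in train_by_class.items():
            x_c = random.choice(examples)         # one example per class
            demo = template(x_c).replace("[MASK]", label_word[c])
            parts.append(demo)                    # demonstrations are "solved"
        return " ".join(parts)

    # Several such demonstration sets are sampled per input, and the predicted
    # log-probabilities are averaged (Appendix C.3).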
Template | Label words | Accuracy: mean (std)

SST-2 (positive/negative)
  <S1> It was [MASK]. | great/terrible | 92.7 (0.9)
  <S1> It was [MASK]. | good/bad | 92.5 (1.0)
  <S1> It was [MASK]. | cat/dog | 91.5 (1.4)
  <S1> It was [MASK]. | dog/cat | 86.2 (5.4)
  <S1> It was [MASK]. | terrible/great | 83.2 (6.9)
  Fine-tuning | - | 81.4 (3.8)

SNLI (entailment/neutral/contradiction)
  <S1> ? [MASK], <S2> | Yes/Maybe/No | 77.2 (3.7)
  <S1> . [MASK], <S2> | Yes/Maybe/No | 76.2 (3.3)
  <S1> ? [MASK] <S2> | Yes/Maybe/No | 74.9 (3.0)

  <S1> <S2> [MASK] | Yes/Maybe/No | 65.8 (2.4)
  <S2> ? [MASK], <S1> | Yes/Maybe/No | 62.9 (4.1)
  <S1> ? [MASK], <S2> | Maybe/No/Yes | 60.6 (4.8)
  Fine-tuning | - | 48.4 (4.8)

Table 2: The impact of templates and label words on prompt-based fine-tuning (K = 16).

M: {y_l, y_u} -> V, and model p(y_u | x_in) the same as Eq. (1). We fine-tune L to minimize the KL-divergence between the inferred p(y_u | x_in) and the observed mixture weight, (y - v_l) / (v_u - v_l).

4.3  Manual prompts: the good and the bad

The key challenge is to construct the template T and label words M(Y); we refer to these two together as a prompt P. Previous works (Schick and Schütze, 2021a,b) hand-craft both the templates and label words, which usually requires domain expertise and trial-and-error. Table 1 summarizes the manual templates and label words chosen for each dataset in our experiments. These templates and label words were designed by intuition, and by considering formats used in previous literature.

To better understand what constitutes a good template or label word, we conduct a pilot study on SST-2 and SNLI. Table 2 shows that different prompts can lead to substantial differences in final accuracy. Specifically, when a template is fixed, the better the label words match the "semantic classes", the better the final accuracy is (great/terrible > good/bad > cat/dog). In extreme cases where we swap plausible label words (e.g., terrible/great), we achieve the worst overall performance (footnote 6). Furthermore, with the same set of label words, even a small change in the template can make a difference. For example, for SNLI, if we put [MASK] at the end, or swap the sentence order, we observe a more than 10% drop. The above evidence clearly underlines the importance of selecting good templates and label words. Searching for prompts, however, is hard, as the search space can be very large, especially for the template. Even worse, we only have a few examples to use to guide our search, which can easily overfit. We will address these issues next.

(Footnote 6: It is unclear, however, why RoBERTa thinks that "cat" is more positive than "dog". The authors tend to disagree.)

5  Automatic Prompt Generation

We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2). Our goals are to reduce the human involvement required to design prompts, and to find more optimal settings than those that we manually choose. Here, we assume a classification task, but the process for regression is analogous.

5.1  Automatic selection of label words

We first study how to construct a label word mapping M that maximizes accuracy on D_dev after fine-tuning, given a fixed template T. Naively searching all possible assignments, however, is (1) generally intractable, as the search space is exponential in the number of classes; and (2) prone to overfitting, as we will tend to uncover spurious correlations given only a few annotations. As a simple solution, for each class c ∈ Y, we construct a pruned set V^c of the top k vocabulary words based on their conditional likelihood using the initial L. That is, let D_train^c be the subset of all examples of class c. We take V^c as

  \text{Top-}k_{v \in V} \left\{ \sum_{x_in \in D_train^c} \log P_L([MASK] = v \mid T(x_in)) \right\},   (3)

where P_L denotes the output probability distribution of L.
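
A sketch of the pruning step in Eq. (3): score every vocabulary item by its summed log-probability at the [MASK] position over one class's examples, then keep the top k. The model and tokenizer calls follow the Hugging Face API; everything else is illustrative:

    import torch
    from transformers import RobertaForMaskedLM, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

    def pruned_label_words(class_sentences, template, k=100):
        # Accumulates sum_x log P_L([MASK] = v | T(x)) for every v in the vocabulary.
        total = torch.zeros(model.config.vocab_size)
        for sent in class_sentences:
            text = template(sent)                 # must contain tokenizer.mask_token
            enc = tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits      # (1, seq_len, vocab_size)
            pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
            total += torch.log_softmax(logits[0, pos], dim=-1)
        top = total.topk(k).indices
        return [tokenizer.convert_ids_to_tokens(i.item()) for i in top]   # V^c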
To further narrow down the search space, we find the top n assignments over the pruned space that maximize zero-shot accuracy on D_train (both n and k are hyper-parameters, see Appendix C.2). Then we fine-tune all top n assignments, and re-rank to find the best one using D_dev. This approach is similar to the automatic verbalizer search methods in Schick and Schütze (2021a) and Schick et al. (2020), except that we use a much simpler search process (brute-force) and also apply re-ranking, which we find to be quite helpful.

5.2  Automatic generation of templates

Next, we study how to generate a diverse set of templates {T} automatically from a fixed set of label words M(Y). To address this challenging problem, we propose to use T5 (Raffel et al., 2020), a large pre-trained text-to-text Transformer.

Task | Template | Label words
  SST-2 | <S1> It was [MASK]. | positive: great, negative: terrible
  SST-5 | <S1> It was [MASK]. | v.positive: great, positive: good, neutral: okay, negative: bad, v.negative: terrible
  MR | <S1> It was [MASK]. | positive: great, negative: terrible
  CR | <S1> It was [MASK]. | positive: great, negative: terrible
  Subj | <S1> This is [MASK]. | subjective: subjective, objective: objective
  TREC | [MASK]: <S1> | abbreviation: Expression, entity: Entity, description: Description, human: Human, location: Location, numeric: Number
  CoLA | <S1> This is [MASK]. | grammatical: correct, not_grammatical: incorrect
  MNLI | <S1> ? [MASK], <S2> | entailment: Yes, neutral: Maybe, contradiction: No
  SNLI | <S1> ? [MASK], <S2> | entailment: Yes, neutral: Maybe, contradiction: No
  QNLI | <S1> ? [MASK], <S2> | entailment: Yes, not_entailment: No
  RTE | <S1> ? [MASK], <S2> | entailment: Yes, not_entailment: No
  MRPC | <S1> [MASK], <S2> | equivalent: Yes, not_equivalent: No
  QQP | <S1> [MASK], <S2> | equivalent: Yes, not_equivalent: No
  STS-B | <S1> [MASK], <S2> | y_u: Yes, y_l: No

Table 1: Manual templates and label words that we used in our experiments. STS-B is a regression task (§4.2).

For downstream classification tasks with a label space Y, we train a task-specific head, softmax(W_o h_[CLS]), by maximizing the log-probability of the correct label, where h_[CLS] is the hidden vector of [CLS], and W_o ∈ R^{|Y| x d} is a set of randomly initialized parameters introduced at the start of fine-tuning. Similarly, for a regression task, we can introduce w_o ∈ R^d and optimize the mean squared error between w_o · h_[CLS] and the gold label. In either case, the number of new parameters can be substantial; for example, a simple binary classification task will introduce 2,048 new parameters for a RoBERTa-large model, making it challenging to learn from a small amount of annotated data (e.g., 32 examples).

An alternative approach to solving this problem is prompt-based fine-tuning, in which L is directly tasked with "auto-completing" natural language prompts. For instance, we can formulate a binary sentiment classification task using a prompt with input x1 (e.g., "No reason to watch it.") as:

  x_prompt = [CLS] x1 It was [MASK]. [SEP]

and let L decide whether it is more appropriate to fill in "great" (positive) or "terrible" (negative) for [MASK]. We now formalize this approach for classification and regression (§4.1 and §4.2), and discuss the importance of prompt selection (§4.3).

4.1  Classification

Let M: Y -> V be a mapping from the task label space to individual words (footnote 5) in the vocabulary V of L. Then for each x_in, let the manipulation x_prompt = T(x_in) be a masked language modeling (MLM) input which contains one [MASK] token. In this way, we can treat our task as an MLM, and model the probability of predicting class y ∈ Y as:

  p(y \mid x_in) = p([MASK] = M(y) \mid x_prompt) = \frac{\exp(w_{M(y)} \cdot h_{[MASK]})}{\sum_{y' \in Y} \exp(w_{M(y')} \cdot h_{[MASK]})},   (1)

where h_[MASK] is the hidden vector of [MASK] and w_v denotes the pre-softmax vector corresponding to v ∈ V. When supervised examples {(x_in, y)} are available, L can be fine-tuned to minimize the cross-entropy loss.

(Footnote 5: More generally, we can consider a one-to-many mapping M: Y -> 2^|Y| in which we map labels to sets of words. However, we did not find significant gains in our experiments.)
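
Concretely, Eq. (1) is a softmax over the masked-LM scores restricted to the label words. A minimal sketch against the Hugging Face API; the label-word mapping shown is only an example:

    import torch
    from transformers import RobertaForMaskedLM, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
    model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

    def class_probabilities(x_prompt, label_words):
        # label_words, e.g. {"positive": " great", "negative": " terrible"}
        # (note the leading space; see Appendix C.2).
        enc = tokenizer(x_prompt, return_tensors="pt")
        pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
        with torch.no_grad():
            scores = model(**enc).logits[0, pos]       # w_v . h_[MASK] for all v
        ids = [tokenizer.encode(w, add_special_tokens=False)[0]
               for w in label_words.values()]
        probs = torch.softmax(scores[ids], dim=-1)     # Eq. (1), label words only
        return dict(zip(label_words.keys(), probs.tolist()))

    # class_probabilities("No reason to watch it. It was <mask>.",
    #                     {"positive": " great", "negative": " terrible"})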

It is important to note that this approach re-uses the pre-trained weights w_v and does not introduce any new parameters. It also reduces the gap between pre-training and fine-tuning, making it more effective in few-shot scenarios.

4.2  Regression

We assume the same basic setup as in classification, but treat the label space Y as a bounded interval [v_l, v_u]. Inspired by Mettes et al. (2019), we model the problem as an interpolation between two opposing poles, {y_l, y_u}, with values v_l and v_u respectively. For instance, we can formulate our previous sentiment analysis task as a regression problem in the range [0, 1], where we slide between "terrible" (v_l = 0) and "great" (v_u = 1). In this way, we can express y as a mixture model:

  y = v_l \cdot p(y_l \mid x_in) + v_u \cdot p(y_u \mid x_in),   (2)

where p(y_u | x_in) is the probability of y_u, and p(y_l | x_in) = 1 - p(y_u | x_in). Then we define

operate in limited domains, such as finding patterns to express specific relations (Jiang et al., 2020), or require a large number of examples for gradient-guided search (Shin et al., 2020; Zhong et al., 2021). Our approach aims to develop general-purpose search methods that rely only on a few annotations.

Fine-tuning of language models. A number of recent studies have focused on better methods for fine-tuning language models (Howard and Ruder, 2018; Dodge et al., 2020; Lee et al., 2020; Zhang et al., 2021). These works mainly focus on optimization and regularization techniques to stabilize fine-tuning. Here we use standard optimization techniques, and instead mainly focus our efforts on better prompt-based fine-tuning in a more extreme few-shot setting. We anticipate that the results of these studies are largely complementary to ours.

Few-shot learning. Broadly speaking, our setting is also connected to other few-shot learning paradigms in NLP, including (1) semi-supervised learning (Miyato et al., 2017; Xie et al., 2020; Chen et al., 2020), where a set of unlabeled examples are given; (2) meta-learning (Yu et al., 2018; Han et al., 2018; Bansal et al., 2020a,b; Bao et al., 2020), where a set of auxiliary tasks are given; and (3) intermediate training (Phang et al., 2018; Yin et al., 2020), where a related, intermediate task is given. We deviate from these settings by making minimal assumptions about available resources: we only assume a few annotated examples and a pre-trained language model. Our focus is on understanding how far we can push without any other advantages.

3  Problem Setup

Task formulation. In this work, we assume access to a pre-trained language model L that we wish to fine-tune on a task D with a label space Y. For the task, we only assume K training examples per class (footnote 3) for the task's training set D_train, such that the total number of examples is K_tot = K x |Y|, and D_train = {(x_in^i, y^i)}, i = 1..K_tot. Our goal is then to develop task-agnostic learning strategies that generalize well to an unseen test set (x_in^test, y^test) ~ D_test. For model selection and hyper-parameter tuning, we assume a development set D_dev of the same size as the few-shot training set, i.e., |D_dev| = |D_train|. This distinction is important: using a larger development set confers a significant advantage (see our experiments in Appendix A), and subverts our initial goal of learning from limited data (footnote 4). For all of the following experiments (unless specified otherwise), we take L = RoBERTa-large and K = 16.

(Footnote 3: For regression, we partition the data into two "classes" according to being above or below the median value.)

Evaluation datasets. We conduct a systematic study across 8 single-sentence and 7 sentence-pair English tasks, including 8 tasks from the GLUE benchmark (Wang et al., 2019), SNLI (Bowman et al., 2015), and 6 other popular sentence classification tasks (SST-5, MR, CR, MPQA, Subj, TREC). All of the dataset details are provided in Appendix B. For single-sentence tasks, the goal is to make a prediction based on an input sentence x_in = x1, such as whether a movie review is positive or not. For sentence-pair tasks, the goal is to take a pair of input sentences x_in = (x1, x2) and predict the relationship between them. We also interchangeably refer to the inputs as <S1> or (<S1>, <S2>).
Note that we mainly use SST-2 and SNLI for pilot experiments and model development, making it close to a true few-shot setting, at least for all the other datasets we evaluate on.
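
In code, the few-shot setting of §3 is a per-class subsample of the original training set, with an equally sized development set drawn from the remaining examples. A schematic sketch (dataset loading is assumed to happen elsewhere):

    import random
    from collections import defaultdict

    def few_shot_split(examples, K=16, seed=0):
        # examples: list of (text, label) pairs from the original training set.
        # Returns D_train and D_dev with K examples per class each, so that
        # |D_dev| = |D_train| = K * |Y|.
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for x, y in examples:
            by_class[y].append((x, y))
        d_train, d_dev = [], []
        for items in by_class.values():
            rng.shuffle(items)
            d_train.extend(items[:K])
            d_dev.extend(items[K:2 * K])
        return d_train, d_dev

    # The paper repeats this with 5 different random seeds and reports the mean
    # and standard deviation over the resulting splits.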

Evaluation protocol. Systematically evaluating few-shot performance can be tricky. It is well known that fine-tuning on small datasets can suffer from instability (Dodge et al., 2020; Zhang et al., 2021), and results may change dramatically given a new split of data. To account for this, we measure average performance across 5 different randomly sampled D_train and D_dev splits. This issue has also been discussed in Schick and Schütze (2021b): they suggest using a fixed set of training examples. We argue that sampling multiple splits gives a more robust measure of performance, and a better estimate of the variance. We also observe that hyper-parameters can make a significant difference, thus we sweep multiple hyper-parameters for each data sample, and take the best setting as measured on the D_dev of that sample (see Appendix C.1).

4  Prompt-based Fine-tuning

Given a masked language model L, we first convert the input x_in to a token sequence x~, and the language model L then maps x~ to a sequence of hidden vectors {h_k ∈ R^d}. During standard fine-tuning, we usually take x~_single = [CLS] x1 [SEP] or x~_pair = [CLS] x1 [SEP] x2 [SEP].

(Footnote 4: In contrast, Schick and Schütze (2021a,b) do not use a development set, and adopt a set of hyper-parameters based on practical considerations. This is akin to "shooting in the dark" on a setting that we show can have unintuitive outcomes.)

Figure 1: An illustration of (a) masked language model (MLM) pre-training, (b) standard fine-tuning, and (c) our proposed LM-BFF using prompt-based fine-tuning with demonstrations. The underlined text is the task-specific template, and colored words are label words.

Second, we adopt the idea of incorporating demonstrations as additional context. GPT-3's naive "in-context learning" paradigm picks up to 32 randomly sampled examples, and concatenates them with the input. This method is not guaranteed to prioritize the most informative demonstrations, and mixing random examples from different classes together creates long contexts which can be hard to learn from. Additionally, the number of usable demonstrations is bounded by the model's maximum input length. We develop a more refined strategy, where, for each input, we randomly sample a single example at a time from each class to create multiple, minimal demonstration sets. We also devise a novel sampling strategy that pairs inputs with similar examples, thereby providing the model with more discriminative comparisons.

We present a systematic evaluation for analyzing few-shot performance on 8 single-sentence and 7 sentence-pair NLP tasks. We observe that given a small number of training examples, (1) prompt-based fine-tuning largely outperforms standard fine-tuning; (2) our automatic prompt search method matches or outperforms manual prompts; and (3) incorporating demonstrations is effective for fine-tuning, and boosts few-shot performance. Together, these simple-yet-effective methods contribute towards a dramatic improvement across the tasks we evaluate on, and we obtain gains up to 30% absolute improvement (11% on average) compared to standard fine-tuning. For instance, we find that a RoBERTa-large model achieves around 90% accuracy on most binary sentence classification tasks, while only relying on 32 training examples. We refer to our approach as LM-BFF, better few-shot fine-tuning of language models: a strong, task-agnostic method for few-shot learning.
2  Related Work

Language model prompting. The GPT series (Radford et al., 2018, 2019; Brown et al., 2020) fueled the development of prompt-based learning, and we follow many of its core concepts. We are also greatly inspired by the recent PET work (Schick and Schütze, 2021a,b), although they mainly focus on a semi-supervised setting where a large set of unlabeled examples are provided. We only use a few annotated examples as supervision, and also explore automatically generated prompts and fine-tuning with demonstrations. Furthermore, we deviate from their evaluation by providing a more rigorous framework, as we will discuss in §3. Finally, there is a large body of work on prompting for mining knowledge from pre-trained models (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019; Talmor et al., 2020, inter alia). Different from these works, we focus on leveraging prompting for fine-tuning on downstream tasks.

Automatic prompt search. Schick and Schütze (2021a) and Schick et al. (2020) explore ways of identifying label words automatically; however, none of these results lead to better performance compared to hand-picked ones. In contrast, our method searches over both templates and label words, and is able to match or outperform our manual prompts. Several other attempts have been made in addition, yet these approaches either

Making Pre-trained Language Models Better Few-shot Learners

Tianyu Gao (Princeton University), Adam Fisch (Massachusetts Institute of Technology), Danqi Chen (Princeton University)
{tianyug,danqic}@cs.princeton.edu, fisch@csail.mit.edu

Abstract

The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF, better few-shot fine-tuning of language models (footnote 1): a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low-resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning (footnote 2).

1  Introduction

The GPT-3 model (Brown et al., 2020) has made waves in the NLP community by demonstrating astounding few-shot capabilities on myriad language understanding tasks. Given only a natural language prompt and a few demonstrations of the task, GPT-3 is able to make accurate predictions without updating any of the weights of its underlying language model.

(* The first two authors contributed equally.)
(Footnote 1: Alternatively, language models' best friends forever.)
(Footnote 2: Our implementation is publicly available at https://github.com/princeton-nlp/LM-BFF.)

However, while remarkable, GPT-3 consists of 175B parameters, which makes it challenging to use in most real-world applications. In this work, we study a more practical scenario in which we only assume access to a moderately-sized language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019), and a small number of examples (i.e., a few-shot setting), which we can use to fine-tune the weights of the language model. This setting is appealing as (1) such models can be trained on typical research hardware; (2) few-shot settings are realistic, as it is generally both easy to acquire a few annotations (e.g., 32 examples) and efficient to train on them; and (3) updating parameters typically leads to better performance. Inspired by GPT-3's findings, we propose several novel strategies for expanding its few-shot learning abilities to our setting, considering both classification and, for the first time, regression.

First, we follow the route of prompt-based prediction, first developed by the GPT series (Radford et al., 2018, 2019; Brown et al., 2020) for zero-shot prediction and recently studied by PET (Schick and Schütze, 2021a,b) for fine-tuning. Prompt-based prediction treats the downstream task as a (masked) language modeling problem, where the model directly generates a textual response (referred to as a label word) to a given prompt defined by a task-specific template (see Figure 1(c)). Finding the right prompts, however, is an art, requiring both domain expertise and an understanding of the language model's inner workings. Even if significant effort is invested, manual prompts are likely to be suboptimal. We address this issue by introducing automatic prompt generation, including a pruned brute-force search to identify the best working label words, and a novel decoding objective to automatically generate templates using the generative T5 model (Raffel et al., 2020), all of which only require the few-shot training data. This allows us to cheaply obtain effective prompts that match or outperform our manually chosen ones.
