Extending Multilingual BERT to Low-Resource Languages



Presentation Transcript

1. Extending Multilingual BERT to Low-Resource Languages
Zihan Wang
University of California, San Diego
ziw224@ucsd.edu

2. Content
Task: Cross-lingual zero-shot transfer
Our method: Extend
Experiments and ablation studies
Related work
Conclusion and future work

3. Limited Annotations
It is hard to obtain annotations for low-resource languages. A widely used approach is to transfer knowledge from high-resource languages (Source) to low-resource languages (LRL, Target).
Source supervision: [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
Target (Odia, raw text): ବାର୍ସେଲୋନା ଫୁଟବଲ କ୍ଲବ ବା ଫୁଟବଲ କ୍ଲବ ବାର୍ସେଲୋନା (ସ୍ପେନୀୟ ଭାଷାରେ Futbol Club Barcelona) ସ୍ପେନର ବାର୍ସେଲୋନା ସହରର ଏକ ପ୍ରସିଦ୍ଧ ଫୁଟବଲ କ୍ଲବ । ("Barcelona Football Club, or Football Club Barcelona (in Spanish, Futbol Club Barcelona), is a famous football club in the city of Barcelona, Spain.")

4. Problem Setup
Cross-lingual zero-shot transfer: supervision is required only on the source language.
Earlier approaches: language adversarial training, domain adaptation.
Pre-trained language models make things easy:

Method | Bilingual supervision | en | de | zh | es | fr | it | ja | ru | Average
Schwenk and Li (2018) | Dictionary | 92.2 | 81.2 | 74.7 | 72.5 | 72.4 | 69.4 | 67.6 | 60.8 | 73.9
Artetxe and Schwenk (2018) | Parallel text | 89.9 | 84.8 | 71.9 | 77.3 | 78.0 | 69.4 | 60.3 | 67.8 | 74.9
M-BERT | None | 94.2 | 80.2 | 76.9 | 72.6 | 72.6 | 68.9 | 56.5 | 73.7 | 74.5

[Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT; Wu et al. '19]

5. Problem: No Language Model for LRLs
There are a lot of low-resource languages in the world! About 4,000 have a developed writing system.
Multilingual models typically cover only the top ~100 languages.

Model | # languages | Data source
M-BERT | 104 | Wikipedia
XLM | 100 | Wikipedia
XLM-R | 100 | Common Crawl (CCNet)
mBART | 25 | Common Crawl (CC25)
MARGE | 26 | Wikipedia or CC-News
mT5 | 101 | Common Crawl (mC4)

How to apply M-BERT to languages it was not pretrained on?
Use it as-is and hope for overlapping word-pieces.
Retrain a Multilingual (Bilingual) BERT.
Extend M-BERT to the target LRL.
[mT5 paper: Xue et al. '20]

6. Our Solution: Extend
Continue the pretraining task on the target language with raw text, which is easy to obtain.
Accommodate new vocabulary.
A simple but effective method: improved performance on cross-lingual NER for languages both in M-BERT and out of M-BERT.
"In M-BERT": 16 languages that are among the 104 training languages of M-BERT.
"Out of M-BERT": 11 low-resource languages that M-BERT was not trained on.
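The recipe on this slide (learn new word-pieces on target-language raw text, add them to M-BERT's vocabulary, and continue masked-LM pretraining) could be sketched roughly as below. This is an illustrative sketch, not the authors' code: the Hugging Face libraries, the file name target_lrl_raw.txt, and the data handling are assumptions; only the batch size, learning rate, step count, and the 30k-piece default come from slides 7 and 13.

```python
# A minimal sketch of Extend, assuming the Hugging Face "tokenizers",
# "transformers", and "datasets" libraries and a hypothetical raw-text file
# target_lrl_raw.txt -- not the authors' original training code.
from tokenizers import BertWordPieceTokenizer
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

# 1. Learn word-pieces on the target-language raw text and keep those M-BERT lacks.
wp = BertWordPieceTokenizer()
wp.train(files=["target_lrl_raw.txt"], vocab_size=30000)  # ~30k new pieces (slide 13 default)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
new_pieces = [p for p in wp.get_vocab() if p not in tokenizer.get_vocab()]
tokenizer.add_tokens(new_pieces)  # approximation: added as whole tokens, not true WordPieces

# 2. Grow M-BERT's embedding matrix so the new pieces get randomly initialized vectors.
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.resize_token_embeddings(len(tokenizer))

# 3. Continue masked-language-model pretraining on the target raw text only.
dataset = load_dataset("text", data_files="target_lrl_raw.txt")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="extended_mbert",
                           per_device_train_batch_size=32,  # slide 7: batch size 32
                           learning_rate=2e-5,              # slide 7: learning rate 2e-5
                           max_steps=500_000),              # slide 7: 500k iterations
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
tokenizer.save_pretrained("extended_mbert")  # keep the extended vocab with the model
```

Note that add_tokens treats the new pieces as whole added tokens rather than genuine WordPiece continuations, so this only approximates a proper vocabulary extension.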

7. Experiment Setting
Dataset: NER corpus from LORELEI.
NER model: Bi-LSTM-CRF from AllenNLP; representations from the language model are fixed (i.e., not fine-tuned). Performance is averaged over 5 runs.
BERT training: when extending, batch size 32, learning rate 2e-5, 500k iterations; when training from scratch, batch size 32, learning rate 1e-4, 2M iterations.
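For concreteness, a rough PyTorch sketch of such a tagger (frozen BERT features feeding a Bi-LSTM-CRF) follows; the original experiments used AllenNLP's implementation, and the pytorch-crf package, hidden size, and class name here are assumptions.

```python
# Rough sketch of an NER tagger over frozen (E-)M-BERT features: Bi-LSTM + CRF.
# Assumptions: the pytorch-crf package (pip install pytorch-crf) and these shapes;
# the experiments described above used AllenNLP's Bi-LSTM-CRF instead.
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BiLstmCrfTagger(nn.Module):
    def __init__(self, bert_name: str, num_tags: int, hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():  # representations are fixed, not fine-tuned
            p.requires_grad = False
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        with torch.no_grad():  # frozen feature extractor
            feats = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.proj(self.lstm(feats)[0])
        mask = attention_mask.bool()
        if tags is not None:  # training: negative log-likelihood of the gold tag sequence
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequences

# Hypothetical usage: tagger = BiLstmCrfTagger("extended_mbert", num_tags=9)
```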

8. Languages in M-BERT
About 6% F1 increase on average for the 16 languages that appear in M-BERT.

9. Languages out of M-BERT
About 23% F1 increase on average for the 11 languages that are out of M-BERT.

10. (Figure-only slide; no transcript text.)

11. Ablation on Extend
Performance on languages in M-BERT: 50.7 -> 56.6.
Where could the improvement come from?
Additional data used for training
Additional vocabulary created
More focused training on the target language by Extend

12. Impact of Extra Data
Instead of using LORELEI data to extend, we use the Wikipedia data that M-BERT was trained on.
Whether Wiki data or LORELEI data is used does not affect the performance much.

Language | M-BERT | Extend w/ Wiki data | Extend w/ LORELEI data
Russian | 56.56 | 55.70 | 56.64
Thai | 22.46 | 40.99 | 38.35
Hindi | 48.31 | 62.72 | 62.77

13. Impact of Vocabulary Increase
Including extra vocabulary is important for low-resource languages (zul, uig), since there may be word-pieces that do not exist in M-BERT's vocabulary.
It still brings improvements for languages in M-BERT: M-BERT's vocabulary might not be enough.
Improvements on hin, tha, and uig are significant even without any vocabulary increase.

Language | M-BERT | Increase vocab: 0 | 5,000 | 10,000 | 20,000 | 30,000 (default)
hin | 48.31 | 57.44 | 62.58 | 61.76 | 60.27 | 62.72
tha | 22.46 | 41.00 | 52.22 | 46.39 | 45.12 | 40.99
zul | 15.82 | 19.21 | 44.28 | 44.18 | 44.41 | 39.65
uig | 3.59 | 34.05 | 41.19 | 41.47 | 41.21 | 42.98
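One way to see why the extra vocabulary matters is to measure how often M-BERT's tokenizer falls back to [UNK] on target-language text before and after extending. The sketch below is an assumed diagnostic, not an experiment from the paper; the raw-text file names are placeholders, and "extended_mbert" refers to the directory saved in the earlier sketch.

```python
# Assumed diagnostic: fraction of word-pieces that come out as [UNK] when
# tokenizing raw target-language text, before vs. after extending the vocabulary.
from transformers import BertTokenizerFast

def unk_rate(tokenizer, path: str) -> float:
    """Share of produced word-pieces equal to the tokenizer's [UNK] token."""
    unk, total = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            pieces = tokenizer.tokenize(line.strip())
            unk += sum(p == tokenizer.unk_token for p in pieces)
            total += len(pieces)
    return unk / max(total, 1)

mbert_tok = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
extended_tok = BertTokenizerFast.from_pretrained("extended_mbert")  # from the earlier sketch
for lang, path in [("zul", "zul_raw.txt"), ("uig", "uig_raw.txt")]:  # placeholder files
    print(lang, unk_rate(mbert_tok, path), unk_rate(extended_tok, path))
```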

14. Extend Improves Performance over M-BERT
On languages both out of M-BERT and in M-BERT.
Increasing the vocabulary is very useful, especially for low-resource languages.
Even without expanding the vocabulary, improvements still exist.
Focusing on the target language is useful.
Alternative: one can also train a Bilingual BERT on Source and Target, with appropriate super-sampling and sub-sampling applied; retraining can be expected to cost more time.
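The super-/sub-sampling mentioned above is typically done with exponentially smoothed sampling probabilities, as in M-BERT's own multilingual pretraining. A small sketch follows; the corpus sizes are made-up placeholders, and the exponent 0.7 is the value M-BERT's authors report, not necessarily what was used here.

```python
# Exponentially smoothed language sampling: up-samples small corpora and
# down-samples large ones. Corpus sizes below are made-up placeholders.
def smoothed_sampling_probs(corpus_sizes, alpha=0.7):
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# English has 100x more raw text than the target language in this toy example:
print(smoothed_sampling_probs({"eng": 1_000_000, "uig": 10_000}))
# -> roughly {"eng": 0.96, "uig": 0.04}: English is sampled below its 99% raw share.
```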

15. Performance against Bilingual BERT
To transfer from English to a low-resource language, we can always train a Bilingual BERT on the two languages.

Low-resource language | Bilingual BERT | Extended Multilingual BERT
Uyghur | 21.94 | 42.98
Amharic | 38.66 | 43.70
Zulu | 44.08 | 39.65
Somali | 51.18 | 53.63
Akan | 48.00 | 49.02
... | ... | ...
Average | 31.70 | 33.97

About 2% F1 increase on average for the 11 low-resource languages.

16. Convergence speed against B-BERT

17. E-MBERT Is Superior to Training B-BERT
In both performance and convergence. Why?
Better on the source language (English)? No clear advantage in understanding English: on English NER, E-MBERT scores 76.78 and B-BERT scores 79.60.
Better at Source -> Target learning? Likely true: the extra languages in M-BERT might be helping the transfer.

18. Related Work

19. Cross-Lingual Ability of Multilingual BERT: An Empirical Study
Why does multilingual BERT work?
Language similarity:
Is word-piece overlap important? No.
Is word order important? Yes.
Is word frequency enough? No.
Model structure:
Which is more important, depth or width? Depth.

20. X-Class: Text Classification with Extremely Weak Supervision
Uses only the class name as supervision.
Estimating class-oriented representations -> Clustering -> Training.
Rivals and outperforms models that require expert-given seed words.

21. Conclusions and Future Work
Extend: a simple but effective method that can be applied to Multilingual BERT for cross-lingual transfer learning.
It improves over languages both in and out of M-BERT.
It is superior to training a Bilingual BERT from scratch.
Future work: more careful design of the vocabulary; accounting for multiple languages.
Thanks for listening. Any questions?
https://github.com/ZihanWangKi/extend_bert

Cross-Lingual Ability of Multilingual BERT: An Empirical Study
Why does M-BERT work? Word-piece overlap is not the key.
https://github.com/CogComp/mbert-study

X-Class: Text Classification with Extremely Weak Supervision
The class names themselves as the only supervision. Representation learning -> Clustering -> Training.
https://github.com/ZihanWangKi/XClass