Mikel Artetxe

Publications

2024

Linguini: A benchmark for language-agnostic linguistic reasoning

Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

arXiv preprint

Abstract BibTex PDF Code

We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.

@misc{sanchez2024linguini
    title = {Linguini: A benchmark for language-agnostic linguistic reasoning},
    author = {Sánchez, Eduardo and Alastruey, Belen and Ropers, Christophe and Stenetorp, Pontus and Artetxe, Mikel and Costa-jussà, Marta R.},
    year = {2024},
    month = {09},
    publisher = {arXiv},
    url = {https://arxiv.org/abs/2409.12126},
}

Latxa: An Open Language Model and Evaluation Suite for Basque

Julen Etxaniz, Oscar Sainz, Naiara Miguel, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

ACL 2024

Abstract BibTex PDF Code

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,046 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

@inproceedings{etxaniz2024latxa
    title = {Latxa: An Open Language Model and Evaluation Suite for Basque},
    author = {Etxaniz, Julen and Sainz, Oscar and Miguel, Naiara and Aldabe, Itziar and Rigau, German and Agirre, Eneko and Ormazabal, Aitor and Artetxe, Mikel and Soroa, Aitor},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages = {14952-14972},
    year = {2024},
    month = {08},
    address = {Bangkok, Thailand},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2024.acl-long.799},
    url = {https://aclanthology.org/2024.acl-long.799},
}

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

ACL 2024

Abstract BibTex PDF Code

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the FLORES-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and findings, notably that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

@inproceedings{bandarkar2024belebele
    title = {The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants},
    author = {Bandarkar, Lucas and Liang, Davis and Muller, Benjamin and Artetxe, Mikel and Shukla, Satya Narayan and Husa, Donald and Goyal, Naman and Krishnan, Abhinandan and Zettlemoyer, Luke and Khabsa, Madian},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages = {749-775},
    year = {2024},
    month = {08},
    address = {Bangkok, Thailand},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2024.acl-long.44},
    url = {https://aclanthology.org/2024.acl-long.44},
}

Do Multilingual Language Models Think Better in English?

Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lacalle, Mikel Artetxe

NAACL 2024

Abstract BibTex PDF Code

Translate-test is a popular technique to improve the performance of multilingual language models. This approach works by translating the input into English using an external machine translation system before running inference. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel data not seen by the language model. In this work, we introduce a new approach called self-translate that leverages the few-shot translation capabilities of multilingual language models. This allows us to analyze the effect of translation in isolation. Experiments over 5 tasks show that self-translate consistently outperforms direct inference, demonstrating that language models are unable to leverage their full multilingual potential when prompted in non-English languages. Our code is available at https://github.com/juletx/self-translate.

@inproceedings{etxaniz2024multilingual
    title = {Do Multilingual Language Models Think Better in English?},
    author = {Etxaniz, Julen and Azkune, Gorka and Soroa, Aitor and Lacalle, Oier and Artetxe, Mikel},
    booktitle = {Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)},
    pages = {550-564},
    year = {2024},
    month = {06},
    address = {Mexico City, Mexico},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2024.naacl-short.46},
    url = {https://aclanthology.org/2024.naacl-short.46},
}

Improving Factuality in Clinical Abstractive Multi-Document Summarization by Guided Continued Pre-training

Ahmed Elhady, Khaled Elsayed, Eneko Agirre, Mikel Artetxe

NAACL 2024

Abstract BibTex PDF

Factual accuracy is an important property of neural abstractive summarization models, especially in fact-critical domains such as the clinical literature. In this work, we introduce a guided continued pre-training stage for encoder-decoder models that improves their understanding of the factual attributes of documents, which is followed by supervised fine-tuning on summarization. Our approach extends the pre-training recipe of BART to incorporate 3 additional objectives based on PICO spans, which capture the population, intervention, comparison, and outcomes related to a clinical study. Experiments on multi-document summarization in the clinical domain demonstrate that our approach is competitive with prior work, improving the quality and factuality of the summaries and achieving the best-published results in factual accuracy on the MSLR task.

@inproceedings{elhady2024improving
    title = {Improving Factuality in Clinical Abstractive Multi-Document Summarization by Guided Continued Pre-training},
    author = {Elhady, Ahmed and Elsayed, Khaled and Agirre, Eneko and Artetxe, Mikel},
    booktitle = {Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)},
    pages = {755-761},
    year = {2024},
    month = {06},
    address = {Mexico City, Mexico},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2024.naacl-short.66},
    url = {https://aclanthology.org/2024.naacl-short.66},
}

BertaQA: How Much Do Language Models Know About Local Culture?

Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

arXiv preprint

Abstract BibTex PDF Code

Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to the Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models' performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a low-resource to a high-resource language. Our analysis sheds light on the complex interplay between language and knowledge, and reveals that some prior findings do not fully hold when reassessed on local topics. Our dataset and evaluation code are available under open licenses at https://github.com/juletx/BertaQA.

@misc{etxaniz2024bertaqa
    title = {BertaQA: How Much Do Language Models Know About Local Culture?},
    author = {Etxaniz, Julen and Azkune, Gorka and Soroa, Aitor and de Lacalle, Oier Lopez and Artetxe, Mikel},
    year = {2024},
    month = {06},
    publisher = {arXiv},
    url = {https://arxiv.org/abs/2406.07302},
}

Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay

arXiv preprint

Abstract BibTex PDF Code

We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. Vibe-Eval is open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, our hard set contains >50% questions that all frontier models answer incorrectly. We explore the nuances of designing, evaluating, and ranking models on ultra challenging prompts. We also discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates to human judgment. We offer free API access for the purpose of lightweight evaluation and plan to conduct formal human evaluations for public models that perform well on the Vibe-Eval's automatic scores. We release the evaluation code and data, see https://github.com/reka-ai/reka-vibe-eval

@misc{padlewski2024vibeeval
    title = {Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models},
    author = {Padlewski, Piotr and Bain, Max and Henderson, Matthew and Zhu, Zhongkai and Relan, Nishant and Pham, Hai and Ong, Donovan and Aleksiev, Kaloyan and Ormazabal, Aitor and Phua, Samuel and Yeo, Ethan and Lamprecht, Eugenie and Liu, Qi and Wang, Yuqi and Chen, Eric and Fu, Deyu and Li, Lei and Zheng, Che and de Masson d'Autume, Cyprien and Yogatama, Dani and Artetxe, Mikel and Tay, Yi},
    year = {2024},
    month = {05},
    publisher = {arXiv},
    url = {https://arxiv.org/abs/2405.02287},
}

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie

arXiv preprint

Abstract BibTex PDF

We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized values for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively to other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at http://chat.reka.ai . A showcase of non cherry picked qualitative examples can also be found at http://showcase.reka.ai .

@misc{rekateam2024reka
    title = {Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models},
    author = {Reka Team and Ormazabal, Aitor and Zheng, Che and de Masson d'Autume, Cyprien and Yogatama, Dani and Fu, Deyu and Ong, Donovan and Chen, Eric and Lamprecht, Eugenie and Pham, Hai and Ong, Isaac and Aleksiev, Kaloyan and Li, Lei and Henderson, Matthew and Bain, Max and Artetxe, Mikel and Relan, Nishant and Padlewski, Piotr and Liu, Qi and Chen, Ren and Phua, Samuel and Yang, Yazheng and Tay, Yi and Wang, Yuqi and Zhu, Zhongkai and Xie, Zhihui},
    year = {2024},
    month = {04},
    publisher = {arXiv},
    url = {https://arxiv.org/abs/2404.12387},
}

2023

Improving Language Plasticity via Pretraining with Active Forgetting

Yihong Chen, Kelly Marchisio, Roberta Raileanu, David Ifeoluwa Adelani, Pontus Stenetorp, Sebastian Riedel, Mikel Artetxe

NeurIPS 2023

Abstract BibTex PDF Code

Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data and compute inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability of learning new embeddings within limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English. Code will be available at https://github.com/facebookresearch/language-model-plasticity.

@inproceedings{chen2023improving
    title = {Improving Language Plasticity via Pretraining with Active Forgetting},
    author = {Chen, Yihong and Marchisio, Kelly and Raileanu, Roberta and Adelani, David Ifeoluwa and Stenetorp, Pontus and Riedel, Sebastian and Artetxe, Mikel},
    booktitle = {Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)},
    pages = {31543-31557},
    year = {2023},
    month = {12},
    publisher = {Curran Associates, Inc.},
    url = {https://papers.nips.cc/paper_files/paper/2023/hash/6450ea28ebbc8437bc38775157818172-Abstract-Conference.html},
}

Revisiting Machine Translation for Cross-lingual Classification

Mikel Artetxe, Vedanuj Goswami, Shruti Bhosale, Angela Fan, Luke Zettlemoyer

EMNLP 2023

Abstract BibTex PDF

Machine Translation (MT) has been widely used for cross-lingual classification, either by translating the test set into English and running inference with a monolingual model (translate-test), or translating the training set into the target languages and finetuning a multilingual model (translate-train). However, most research in the area focuses on the multilingual models rather than the MT component. We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed. The optimal approach, however, is highly task dependent, as we identify various sources of cross-lingual transfer gap that affect different tasks and approaches differently. Our work calls into question the dominance of multilingual models for cross-lingual classification, and prompts to pay more attention to MT-based baselines.

@inproceedings{artetxe2023revisiting
    title = {Revisiting Machine Translation for Cross-lingual Classification},
    author = {Artetxe, Mikel and Goswami, Vedanuj and Bhosale, Shruti and Fan, Angela and Zettlemoyer, Luke},
    booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
    pages = {6489-6499},
    year = {2023},
    month = {12},
    address = {Singapore},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2023.emnlp-main.399},
    url = {https://aclanthology.org/2023.emnlp-main.399},
}

CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models

Aitor Ormazabal, Mikel Artetxe, Eneko Agirre

EMNLP 2023

Abstract BibTex PDF Code

Methods for adapting language models (LMs) to new tasks and domains have traditionally assumed white-box access to the model, and work by modifying its parameters. However, this is incompatible with a recent trend in the field, where the highest quality models are only available as black-boxes through inference APIs. Even when the model weights are available, the computational cost of fine-tuning large LMs can be prohibitive for most practitioners. In this work, we present a lightweight method for adapting large LMs to new domains and tasks, assuming no access to their weights or intermediate activations. Our approach fine-tunes a small white-box LM and combines it with the large black-box LM at the probability level through a small network, learned on a small validation set. We validate our approach by adapting a large LM (OPT-30B) to several domains and a downstream task (machine translation), observing improved performance in all cases, of up to 9%, while using a domain expert 23x smaller.

@inproceedings{ormazabal2023comblm
    title = {CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models},
    author = {Ormazabal, Aitor and Artetxe, Mikel and Agirre, Eneko},
    booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
    pages = {2961-2974},
    year = {2023},
    month = {12},
    address = {Singapore},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2023.emnlp-main.180},
    url = {https://aclanthology.org/2023.emnlp-main.180},
}

A taxonomy and review of generalization research in NLP

Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin

Nature Machine Intelligence

Abstract BibTex PDF Code

The ability to generalize well is one of the primary desiderata for models of natural language processing (NLP), but what ‘good generalization’ entails and how it should be evaluated is not well understood. In this Analysis we present a taxonomy for characterizing and understanding generalization research in NLP. The proposed taxonomy is based on an extensive literature review and contains five axes along which generalization studies can differ: their main motivation, the type of generalization they aim to solve, the type of data shift they consider, the source by which this data shift originated, and the locus of the shift within the NLP modelling pipeline. We use our taxonomy to classify over 700 experiments, and we use the results to present an in-depth analysis that maps out the current state of generalization research in NLP and make recommendations for which areas deserve attention in the future.

@article{hupkes2023taxonomy
    title = {A taxonomy and review of generalization research in NLP},
    author = {Hupkes, Dieuwke and Giulianelli, Mario and Dankers, Verna and Artetxe, Mikel and Elazar, Yanai and Pimentel, Tiago and Christodoulopoulos, Christos and Lasri, Karim and Saphra, Naomi and Sinclair, Arabella and Ulmer, Dennis and Schottmann, Florian and Batsuren, Khuyagbaatar and Sun, Kaiser and Sinha, Koustuv and Khalatbari, Leila and Ryskina, Maria and Frieske, Rita and Cotterell, Ryan and Jin, Zhijing},
    journal = {Nature Machine Intelligence},
    volume = {5},
    number = {10},
    pages = {1161-1174},
    year = {2023},
    month = {10},
    issn = {2522-5839},
    doi = {10.1038/s42256-023-00729-y},
    url = {https://doi.org/10.1038/s42256-023-00729-y},
}

Gender-specific Machine Translation with Large Language Models

Eduardo Sánchez, Pierre Andrews, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà

arXiv preprint

Abstract BibTex PDF

While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only Large Language Models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompts. In this study, we leverage this flexibility to explore LLaMa's capability to produce gender-specific translations. Our results indicate that LLaMa can generate gender-specific translations with translation accuracy and gender bias comparable to NLLB, a state-of-the-art multilingual NMT system. Furthermore, our experiments reveal that LLaMa's gender-specific translations rely on coreference resolution to determine gender, showing higher gender variance in gender-ambiguous datasets but maintaining consistency in less ambiguous contexts. This research investigates the potential and challenges of using LLMs for gender-specific translations as an instance of the controllability of outputs offered by LLMs.

@misc{sanchez2023genderspecific
    title = {Gender-specific Machine Translation with Large Language Models},
    author = {Sánchez, Eduardo and Andrews, Pierre and Stenetorp, Pontus and Artetxe, Mikel and Costa-jussà, Marta R.},
    year = {2023},
    month = {09},
    publisher = {arXiv},
    url = {https://arxiv.org/abs/2309.03175},
}

Evaluation of Faithfulness Using the Longest Supported Subsequence

Anirudh Mittal, Timo Schick, Mikel Artetxe, Jane Dwivedi-Yu

arXiv preprint

Abstract BibTex PDF Code

As increasingly sophisticated language models emerge, their trustworthiness becomes a pivotal issue, especially in tasks such as summarization and question-answering. Ensuring their responses are contextually grounded and faithful is challenging due to the linguistic diversity and the myriad of possible answers. In this paper, we introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context, which we refer to as the Longest Supported Subsequence (LSS). Using a new human-annotated dataset, we finetune a model to generate LSS. We introduce a new method of evaluation and demonstrate that these metrics correlate better with human ratings when LSS is employed, as opposed to when it is not. Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset. Our metric consistently outperforms other metrics on a summarization dataset across six different models. Finally, we compare several popular Large Language Models (LLMs) for faithfulness using this metric. We release the human-annotated dataset built for predicting LSS and our fine-tuned model for evaluating faithfulness.

@misc{mittal2023evaluation
    title = {Evaluation of Faithfulness Using the Longest Supported Subsequence},
    author = {Mittal, Anirudh and Schick, Timo and Artetxe, Mikel and Dwivedi-Yu, Jane},
    year = {2023},
    month = {08},
    publisher = {arXiv},
    url = {https://arxiv.org/abs/2308.12157},
}

Training Trajectories of Language Models Across Scales

Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Veselin Stoyanov

ACL 2023

Abstract BibTex PDF Code

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022)—from 125M to 175B parameters—on next-token prediction, sequence-level generation and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior (Nakkiran et al., 2020); 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.

@inproceedings{xia2023training
    title = {Training Trajectories of Language Models Across Scales},
    author = {Xia, Mengzhou and Artetxe, Mikel and Zhou, Chunting and Lin, Xi Victoria and Pasunuru, Ramakanth and Chen, Danqi and Zettlemoyer, Luke and Stoyanov, Veselin},
    booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages = {13711-13738},
    year = {2023},
    month = {07},
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2023.acl-long.767},
    url = {https://aclanthology.org/2023.acl-long.767},
}

Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training

Kelly Marchisio, Patrick Lewis, Yihong Chen, Mikel Artetxe

Findings of ACL 2023

Abstract BibTex PDF

Prior work shows that it is possible to expand pretrained Masked Language Models (MLMs) to new languages by learning a new set of embeddings, while keeping the transformer body frozen. Despite learning a small subset of parameters, this approach is not compute-efficient, as training the new embeddings requires a full forward and backward pass over the entire model. We propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model’s parameters. New language-specific embeddings can then be efficiently trained over the mini-model and plugged into the aligned large model for rapid cross-lingual transfer. We explore two approaches to learn mini-models: MINIJOINT, which jointly pretrains the primary model and the mini-model using a single transformer with a secondary MLM head at a middle layer; and MINIPOST, where we start from a regular pretrained model, build a mini-model by extracting and freezing a few layers, and learn a small number of parameters on top. Experiments on XNLI, MLQA and PAWS-X show that mini-model adaptation matches the performance of the standard approach using up to 2.3x less compute on average.

@inproceedings{marchisio2023minimodel
    title = {Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training},
    author = {Marchisio, Kelly and Lewis, Patrick and Chen, Yihong and Artetxe, Mikel},
    booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
    pages = {5474-5490},
    year = {2023},
    month = {07},
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2023.findings-acl.338},
    url = {https://aclanthology.org/2023.findings-acl.338},
}

On the Role of Parallel Data in Cross-lingual Transfer Learning

Machel Reid, Mikel Artetxe

Findings of ACL 2023

Abstract BibTex PDF

While prior work has established that the use of parallel data is conducive for cross-lingual learning, it is unclear if the improvements come from the data itself, or if it is the modeling of parallel interactions that matters. Exploring this, we examine the usage of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) as well as the task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.

@inproceedings{reid2023role
    title = {On the Role of Parallel Data in Cross-lingual Transfer Learning},
    author = {Reid, Machel and Artetxe, Mikel},
    booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
    pages = {5999-6006},
    year = {2023},
    month = {07},
    address = {Toronto, Canada},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2023.findings-acl.372},
    url = {https://aclanthology.org/2023.findings-acl.372},
}

2022

Efficient Large Scale Language Modeling with Mixtures of Experts

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giridharan Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeffrey Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Veselin Stoyanov

EMNLP 2022

Abstract BibTex PDF Code

Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.

@inproceedings{artetxe2022efficient
    title = {Efficient Large Scale Language Modeling with Mixtures of Experts},
    author = {Artetxe, Mikel and Bhosale, Shruti and Goyal, Naman and Mihaylov, Todor and Ott, Myle and Shleifer, Sam and Lin, Xi Victoria and Du, Jingfei and Iyer, Srinivasan and Pasunuru, Ramakanth and Anantharaman, Giridharan and Li, Xian and Chen, Shuohui and Akin, Halil and Baines, Mandeep and Martin, Louis and Zhou, Xing and Koura, Punit Singh and O’Horo, Brian and Wang, Jeffrey and Zettlemoyer, Luke and Diab, Mona and Kozareva, Zornitsa and Stoyanov, Veselin},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {11699-11732},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.804},
    url = {https://aclanthology.org/2022.emnlp-main.804},
}

Does Corpus Quality Really Matter for Low-Resource Languages?

Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

EMNLP 2022

Abstract BibTex PDF Code

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with \textless33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by the quality of the data, and other factors like corpus size and domain coverage can play a more important role.

@inproceedings{artetxe2022corpus
    title = {Does Corpus Quality Really Matter for Low-Resource Languages?},
    author = {Artetxe, Mikel and Aldabe, Itziar and Agerri, Rodrigo and Perez-de-Viñaspre, Olatz and Soroa, Aitor},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {7383-7390},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.499},
    url = {https://aclanthology.org/2022.emnlp-main.499},
}

Don’t Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Mozes van de Kar, Mengzhou Xia, Danqi Chen, Mikel Artetxe

EMNLP 2022

Abstract BibTex PDF

Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead of prompting language models, we use regular expressions to mine labeled examples from unlabeled corpora, which can optionally be filtered through prompting, and used to finetune a pretrained model. Our method is more flexible and interpretable than prompting, and outperforms it on a wide range of tasks when using comparable templates. Our results suggest that the success of prompting can partly be explained by the model being exposed to similar examples during pretraining, which can be directly retrieved through regular expressions.

@inproceedings{vandekar2022dont
    title = {Don’t Prompt, Search! Mining-based Zero-Shot Learning with Language Models},
    author = {van de Kar, Mozes and Xia, Mengzhou and Chen, Danqi and Artetxe, Mikel},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {7508-7520},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.509},
    url = {https://aclanthology.org/2022.emnlp-main.509},
}

Multilingual Machine Translation with Hyper-Adapters

Christos Baziotis, Mikel Artetxe, James Cross, Shruti Bhosale

EMNLP 2022

Abstract BibTex PDF

Multilingual machine translation suffers from negative interference across languages. A common solution is to relax parameter sharing with language-specific modules like adapters. However, adapters of related languages are unable to transfer information, and their total number of parameters becomes prohibitively expensive as the number of languages grows. In this work, we overcome these drawbacks using hyper-adapters – hyper-networks that generate adapters from language and layer embeddings. While past work had poor results when scaling hyper-networks, we propose a rescaling fix that significantly improves convergence and enables training larger hyper-networks. We find that hyper-adapters are more parameter efficient than regular adapters, reaching the same performance with up to 12 times less parameters. When using the same number of parameters and FLOPS, our approach consistently outperforms regular adapters. Also, hyper-adapters converge faster than alternative approaches and scale better than regular dense networks. Our analysis shows that hyper-adapters learn to encode language relatedness, enabling positive transfer across languages.

@inproceedings{baziotis2022multilingual
    title = {Multilingual Machine Translation with Hyper-Adapters},
    author = {Baziotis, Christos and Artetxe, Mikel and Cross, James and Bhosale, Shruti},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {1170-1185},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.77},
    url = {https://aclanthology.org/2022.emnlp-main.77},
}

Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Mengzhou Xia, Mikel Artetxe, Jingfei Du, Danqi Chen, Veselin Stoyanov

EMNLP 2022

Abstract BibTex PDF Code

Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. How- ever, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.

@inproceedings{xia2022prompting
    title = {Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models},
    author = {Xia, Mengzhou and Artetxe, Mikel and Du, Jingfei and Chen, Danqi and Stoyanov, Veselin},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {11351-11361},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.780},
    url = {https://aclanthology.org/2022.emnlp-main.780},
}

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer

EMNLP 2022

Abstract BibTex PDF Code

Large language models (LMs) are able to in-context learn—perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations are in fact not required—randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multi-choce tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of endtask performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence. Together, our analysis provides a new way of understanding how and why in-context learning works, while opening up new questions about how much can be learned from large language models through inference alone.

@inproceedings{min2022rethinking
    title = {Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?},
    author = {Min, Sewon and Lyu, Xinxi and Holtzman, Ari and Artetxe, Mikel and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {11048-11064},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.759},
    url = {https://aclanthology.org/2022.emnlp-main.759},
}

Few-shot Learning with Multilingual Generative Language Models

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

EMNLP 2022

Abstract BibTex PDF Code

Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples.

@inproceedings{lin2022fewshot
    title = {Few-shot Learning with Multilingual Generative Language Models},
    author = {Lin, Xi Victoria and Mihaylov, Todor and Artetxe, Mikel and Wang, Tianlu and Chen, Shuohui and Simig, Daniel and Ott, Myle and Goyal, Naman and Bhosale, Shruti and Du, Jingfei and Pasunuru, Ramakanth and Shleifer, Sam and Koura, Punit Singh and Chaudhary, Vishrav and O’Horo, Brian and Wang, Jeff and Zettlemoyer, Luke and Kozareva, Zornitsa and Diab, Mona and Stoyanov, Veselin and Li, Xian},
    booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
    pages = {9019-9052},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.emnlp-main.616},
    url = {https://aclanthology.org/2022.emnlp-main.616},
}

On the Role of Bidirectionality in Language Model Pre-Training

Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Veselin Stoyanov

Findings of EMNLP 2022

Abstract BibTex PDF

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.

@inproceedings{artetxe2022role
    title = {On the Role of Bidirectionality in Language Model Pre-Training},
    author = {Artetxe, Mikel and Du, Jingfei and Goyal, Naman and Zettlemoyer, Luke and Stoyanov, Veselin},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022},
    pages = {3973-3985},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.findings-emnlp.293},
    url = {https://aclanthology.org/2022.findings-emnlp.293},
}

PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, Eneko Agirre

Findings of EMNLP 2022

Abstract BibTex PDF Code

Formal verse poetry imposes strict constraints on the meter and rhyme scheme of poems. Most prior work on generating this type of poetry uses existing poems for supervision, which are difficult to obtain for most languages and poetic forms. In this work, we propose an unsupervised approach to generate poems that follow any given meter and rhyme scheme, without requiring any poetic text for training. Our method works by splitting a regular, non-poetic corpus into phrases, prepending control codes that describe the length and end rhyme of each phrase, and training a transformer language model in the augmented corpus. The transformer learns to link the structure descriptor with the control codes to the number of lines, their length and their end rhyme. During inference, we build control codes for the desired meter and rhyme scheme, and condition our language model on them to generate formal verse poetry. Experiments in Spanish and Basque show that our approach is able to generate valid poems, which are often comparable in quality to those written by humans.

@inproceedings{ormazabal2022poelm
    title = {PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation},
    author = {Ormazabal, Aitor and Artetxe, Mikel and Agirrezabal, Manex and Soroa, Aitor and Agirre, Eneko},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022},
    pages = {3655-3670},
    year = {2022},
    month = {12},
    address = {Abu Dhabi, United Arab Emirates},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.findings-emnlp.268},
    url = {https://aclanthology.org/2022.findings-emnlp.268},
}

Lifting the Curse of Multilinguality by Pre-training Modular Transformers

Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe

NAACL 2022

Abstract BibTex PDF Code Video

Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.

@inproceedings{pfeiffer2022lifting
    title = {Lifting the Curse of Multilinguality by Pre-training Modular Transformers},
    author = {Pfeiffer, Jonas and Goyal, Naman and Lin, Xi Victoria and Li, Xian and Cross, James and Riedel, Sebastian and Artetxe, Mikel},
    booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages = {3479-3495},
    year = {2022},
    month = {07},
    address = {Seattle, United States},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.naacl-main.255},
    url = {https://aclanthology.org/2022.naacl-main.255},
}

PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining

Machel Reid, Mikel Artetxe

NAACL 2022

Abstract BibTex PDF Code Video

Despite the success of multilingual sequence-to-sequence pretraining, most existing approaches rely on monolingual corpora and do not make use of the strong cross-lingual signal contained in parallel data. In this paper, we present PARADISE (PARAllel &Denoising Integration in SEquence-to-sequence models), which extends the conventional denoising objective used to train these models by (i) replacing words in the noised sequence according to a multilingual dictionary, and (ii) predicting the reference translation according to a parallel corpus instead of recovering the original sequence. Our experiments on machine translation and cross-lingual natural language inference show an average improvement of 2.0 BLEU points and 6.7 accuracy points from integrating parallel data into pretraining, respectively, obtaining results that are competitive with several popular models at a fraction of their computational cost.

@inproceedings{reid2022paradise
    title = {PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining},
    author = {Reid, Machel and Artetxe, Mikel},
    booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages = {800-810},
    year = {2022},
    month = {07},
    address = {Seattle, United States},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.naacl-main.58},
    url = {https://aclanthology.org/2022.naacl-main.58},
}

Principled Paraphrase Generation with Parallel Corpora

Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

ACL 2022

Abstract BibTex PDF Code

Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.

@inproceedings{ormazabal2022principled
    title = {Principled Paraphrase Generation with Parallel Corpora},
    author = {Ormazabal, Aitor and Artetxe, Mikel and Soroa, Aitor and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages = {1621-1638},
    year = {2022},
    month = {05},
    address = {Dublin, Ireland},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2022.acl-long.114},
    url = {https://aclanthology.org/2022.acl-long.114},
}

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

arXiv preprint

Abstract BibTex PDF Code

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

@misc{zhang2022opt
    title = {OPT: Open Pre-trained Transformer Language Models},
    author = {Zhang, Susan and Roller, Stephen and Goyal, Naman and Artetxe, Mikel and Chen, Moya and Chen, Shuohui and Dewan, Christopher and Diab, Mona and Li, Xian and Lin, Xi Victoria and Mihaylov, Todor and Ott, Myle and Shleifer, Sam and Shuster, Kurt and Simig, Daniel and Koura, Punit Singh and Sridhar, Anjali and Wang, Tianlu and Zettlemoyer, Luke},
    year = {2022},
    month = {05},
    publisher = {arXiv},
    doi = {10.48550/arXiv.2205.01068},
    url = {https://arxiv.org/abs/2205.01068},
}

Multilingual Autoregressive Entity Linking

Nicola De Cao, Ledell Wu, Kashyap Popat, Mikel Artetxe, Naman Goyal, Mikhail Plekhanov, Luke Zettlemoyer, Nicola Cancedda, Sebastian Riedel, Fabio Petroni

TACL

Abstract BibTex PDF Code

We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem -- the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode mention string and entity names to capture more interactions than the standard dot product between mention and entity vectors. It also enables fast search within a large KB even for mentions that do not appear in mention tables and with no need for large-scale vector indices. While prior MEL works use a single representation for each entity, we match against entity names of as many languages as possible, which allows exploiting language connections between source input and target name. Moreover, in a zero-shot setting on languages with no training data at all, mGENRE treats the target language as a latent variable that is marginalized at prediction time. This leads to over 50% improvements in average accuracy. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks where mGENRE establishes new state-of-the-art results. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

@article{decao2022multilingual
    title = {Multilingual Autoregressive Entity Linking},
    author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {274-290},
    year = {2022},
    month = {03},
    publisher = {MIT Press},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00460},
    url = {https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00460/110051/Multilingual-Autoregressive-Entity-Linking},
}

Efficient Language Modeling with Sparse all-MLP

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

arXiv preprint

Abstract BibTex PDF

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2$\times$ improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.

@misc{yu2022efficient
    title = {Efficient Language Modeling with Sparse all-MLP},
    author = {Yu, Ping and Artetxe, Mikel and Ott, Myle and Shleifer, Sam and Gong, Hongyu and Stoyanov, Ves and Li, Xian},
    year = {2022},
    month = {03},
    publisher = {arXiv},
    doi = {10.48550/arXiv.2203.06850},
    url = {https://arxiv.org/abs/2203.06850},
}

2021

Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring

Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

ACL 2021

Abstract BibTex PDF

Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.

@inproceedings{ormazabal2021beyond
    title = {Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring},
    author = {Ormazabal, Aitor and Artetxe, Mikel and Soroa, Aitor and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    pages = {6479-6489},
    year = {2021},
    month = {08},
    address = {Online},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2021.acl-long.506},
    url = {https://aclanthology.org/2021.acl-long.506},
}

Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders

Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

EACL 2021

Abstract BibTex PDF

State-of-the-art multilingual machine translation relies on a universal encoder-decoder, which requires retraining the entire system to add new languages. In this paper, we propose an alternative approach that is based on language-specific encoder-decoders, and can thus be more easily extended to new languages by learning their corresponding modules. So as to encourage a common interlingua representation, we simultaneously train the N initial languages. Our experiments show that the proposed approach outperforms the universal encoder-decoder by 3.28 BLEU points on average, while allowing to add new languages without the need to retrain the rest of the modules. All in all, our work closes the gap between shared and language-specific encoderdecoders, advancing toward modular multilingual machine translation systems that can be flexibly extended in lifelong learning settings.

@inproceedings{escolano2021multilingual
    title = {Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders},
    author = {Escolano, Carlos and Costa-jussà, Marta R. and Fonollosa, José A. R. and Artetxe, Mikel},
    booktitle = {Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume},
    pages = {944-948},
    year = {2021},
    month = {04},
    address = {Online},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2021.eacl-main.80},
    url = {https://aclanthology.org/2021.eacl-main.80},
}

2020

Translation Artifacts in Cross-lingual Transfer Learning

Mikel Artetxe, Gorka Labaka, Eneko Agirre

EMNLP 2020

Abstract BibTex PDF Code Video

Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such translation process can introduce subtle artifacts that have a notable impact in existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are highly sensitive to. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in the light of this phenomenon. Based on the gained insights, we also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.

@inproceedings{artetxe2020translation
    title = {Translation Artifacts in Cross-lingual Transfer Learning},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    pages = {7674-7684},
    year = {2020},
    month = {11},
    address = {Online},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2020.emnlp-main.618},
    url = {https://aclanthology.org/2020.emnlp-main.618},
}

Do all roads lead to Rome? Understanding the role of initialization in iterative back-translation

Mikel Artetxe, Gorka Labaka, Noe Casas, Eneko Agirre

Knowledge-Based Systems

Abstract BibTex

Back-translation provides a simple yet effective approach to exploit monolingual corpora in Neural Machine Translation (NMT). Its iterative variant, where two opposite NMT models are jointly trained by alternately using a synthetic parallel corpus generated by the reverse model, plays a central role in unsupervised machine translation. In order to start producing sound translations and provide a meaningful training signal to each other, existing approaches rely on either a separate machine translation system to warm up the iterative procedure, or some form of pre-training to initialize the weights of the model. In this paper, we analyze the role that such initialization plays in iterative back-translation. Is the behavior of the final system heavily dependent on it? Or does iterative back-translation converge to a similar solution given any reasonable initialization? Through a series of empirical experiments over a diverse set of warmup systems, we show that, although the quality of the initial system does affect final performance, its effect is relatively small, as iterative back-translation has a strong tendency to convergence to a similar solution. As such, the margin of improvement left for the initialization method is narrow, suggesting that future research should focus more on improving the iterative mechanism itself.

@article{artetxe2020roads
    title = {Do all roads lead to Rome? Understanding the role of initialization in iterative back-translation},
    author = {Artetxe, Mikel and Labaka, Gorka and Casas, Noe and Agirre, Eneko},
    journal = {Knowledge-Based Systems},
    volume = {206},
    pages = {106401},
    year = {2020},
    month = {10},
    issn = {0950-7051},
    doi = {10.1016/j.knosys.2020.106401},
    url = {https://www.sciencedirect.com/science/article/pii/S0950705120305335},
}

On the Cross-lingual Transferability of Monolingual Representations

Mikel Artetxe, Sebastian Ruder, Dani Yogatama

ACL 2020

Abstract BibTex PDF Code Video

State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages giving rise to deep multilingual abstractions. We evaluate this hypothesis by designing an alternative approach that transfers a monolingual model to new languages at the lexical level. More concretely, we first train a transformer-based masked language model on one language, and transfer it to a new language by learning a new embedding matrix with the same masked language modeling objective, freezing parameters of all other layers. This approach does not rely on a shared vocabulary or joint training. However, we show that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD). Our results contradict common beliefs of the basis of the generalization ability of multilingual models and suggest that deep monolingual models learn some abstractions that generalize across languages. We also release XQuAD as a more comprehensive cross-lingual benchmark, which comprises 240 paragraphs and 1190 question-answer pairs from SQuAD v1.1 translated into ten languages by professional translators.

@inproceedings{artetxe2020crosslingual
    title = {On the Cross-lingual Transferability of Monolingual Representations},
    author = {Artetxe, Mikel and Ruder, Sebastian and Yogatama, Dani},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    pages = {4623-4637},
    year = {2020},
    month = {07},
    address = {Online},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2020.acl-main.421},
    url = {https://aclanthology.org/2020.acl-main.421},
}

A Call for More Rigor in Unsupervised Cross-lingual Learning

Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre

ACL 2020

Abstract BibTex PDF Video

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world’s languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.

@inproceedings{artetxe2020call
    title = {A Call for More Rigor in Unsupervised Cross-lingual Learning},
    author = {Artetxe, Mikel and Ruder, Sebastian and Yogatama, Dani and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    pages = {7375-7388},
    year = {2020},
    month = {07},
    address = {Online},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2020.acl-main.658},
    url = {https://aclanthology.org/2020.acl-main.658},
}

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

ACL Student Research Workshop 2020

Abstract BibTex PDF

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.

@inproceedings{kvapilikova2020unsupervised
    title = {Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining},
    author = {Kvapilíková, Ivana and Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko and Bojar, Ondřej},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop},
    pages = {255-262},
    year = {2020},
    month = {07},
    address = {Online},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2020.acl-srw.34},
    url = {https://aclanthology.org/2020.acl-srw.34},
}

Training Multilingual Machine Translation by Alternately Freezing Language-Specific Encoders-Decoders

Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Mikel Artetxe

arXiv preprint

Abstract BibTex PDF

We propose a modular architecture of language-specific encoder-decoders that constitutes a multilingual machine translation system that can be incrementally extended to new languages without the need for retraining the existing system when adding new languages. Differently from previous works, we simultaneously train $N$ languages in all translation directions by alternately freezing encoder or decoder modules, which indirectly forces the system to train in a common intermediate representation for all languages. Experimental results from multilingual machine translation show that we can successfully train this modular architecture improving on the initial languages while falling slightly behind when adding new languages or doing zero-shot translation. Additional comparison of the quality of sentence representation in the task of natural language inference shows that the alternately freezing training is also beneficial in this direction.

@misc{escolano2020training
    title = {Training Multilingual Machine Translation by Alternately Freezing Language-Specific Encoders-Decoders},
    author = {Escolano, Carlos and Costa-jussà, Marta R. and Fonollosa, José A. R. and Artetxe, Mikel},
    year = {2020},
    month = {05},
    publisher = {arXiv},
    doi = {10.48550/arXiv.2006.01594},
    url = {https://arxiv.org/abs/2006.01594},
}

2019

Contextualized Translations of Phrasal Verbs with Distributional Compositional Semantics and Monolingual Corpora

Pablo Gamallo, Susana Sotelo, José Ramom Pichel, Mikel Artetxe

Computational Linguistics

Abstract BibTex PDF Code

This article describes a compositional distributional method to generate contextualized senses of words and identify their appropriate translations in the target language using monolingual corpora. Word translation is modeled in the same way as contextualization of word meaning, but in a bilingual vector space. The contextualization of meaning is carried out by means of distributional composition within a structured vector space with syntactic dependencies, and the bilingual space is created by means of transfer rules and a bilingual dictionary. A phrase in the source language, consisting of a head and a dependent, is translated into the target language by selecting both the nearest neighbor of the head given the dependent, and the nearest neighbor of the dependent given the head. This process is expanded to larger phrases by means of incremental composition. Experiments were performed on English and Spanish monolingual corpora in order to translate phrasal verbs in context. A new bilingual data set to evaluate strategies aimed at translating phrasal verbs in restricted syntactic domains has been created and released.

@article{gamallo2019contextualized
    title = {Contextualized Translations of Phrasal Verbs with Distributional Compositional Semantics and Monolingual Corpora},
    author = {Gamallo, Pablo and Sotelo, Susana and Pichel, José Ramom and Artetxe, Mikel},
    journal = {Computational Linguistics},
    volume = {45},
    number = {3},
    pages = {395-421},
    year = {2019},
    month = {09},
    address = {Cambridge, MA},
    publisher = {MIT Press},
    issn = {0891-2017},
    doi = {10.1162/coli_a_00353},
    url = {https://direct.mit.edu/coli/article/45/3/395/93371/Contextualized-Translations-of-Phrasal-Verbs-with},
}

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Mikel Artetxe, Holger Schwenk

TACL

Abstract BibTex PDF Code

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI data set), cross-lingual document classification (MLDoc data set), and parallel corpus mining (BUCC data set) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low- resource languages. Our implementation, the pre-trained encoder, and the multilingual test set are available at https://github.com/facebookresearch/LASER.

@article{artetxe2019massively
    title = {Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond},
    author = {Artetxe, Mikel and Schwenk, Holger},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {7},
    pages = {597-610},
    year = {2019},
    month = {09},
    address = {Cambridge, MA},
    publisher = {MIT Press},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00288},
    url = {https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00288/43523/Massively-Multilingual-Sentence-Embeddings-for},
}

Unsupervised Neural Machine Translation, a new paradigm solely based on monolingual text

Mikel Artetxe, Gorka Labaka, Eneko Agirre

SEPLN

Abstract BibTex PDF

This article presents UnsupNMT, a 3-year project of which the first year has already been completed. UnsupNMT proposes a radically different approach to machine translation: unsupervised translation, that is, translation based on monolingual data alone with no need for bilingual resources. This method is based on deep learning of temporal sequences and uses cutting-edge interlingual word representations in the form of cross-lingual word embeddings. This project is not only a highly innovative proposal but it also opens a new paradigm in machine translation which branches out to other disciplines, such us transfer learning. Despite the current limitations of unsupervised machine translation, the techniques developed are expected to have great repercussions in areas where machine translation achieves worse results, such as translation between languages which have little contact, e.g. German and Russian.

@article{artetxe2019unsupervised
    title = {Unsupervised Neural Machine Translation, a new paradigm solely based on monolingual text},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    journal = {Procesamiento del Lenguaje Natural},
    volume = {63},
    pages = {151-154},
    year = {2019},
    month = {09},
    issn = {1989-7553},
    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6107},
}

An Effective Approach to Unsupervised Machine Translation

Mikel Artetxe, Gorka Labaka, Eneko Agirre

ACL 2019

Abstract BibTex PDF Code Video

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014.

@inproceedings{artetxe2019effective
    title = {An Effective Approach to Unsupervised Machine Translation},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    pages = {194-203},
    year = {2019},
    month = {07},
    address = {Florence, Italy},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/P19-1019},
    url = {https://aclanthology.org/P19-1019},
}

Analyzing the Limitations of Cross-lingual Word Embedding Mappings

Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, Eneko Agirre

ACL 2019

Abstract BibTex PDF Video

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with parallel corpora, which allows us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields to more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal.

@inproceedings{ormazabal2019analyzing
    title = {Analyzing the Limitations of Cross-lingual Word Embedding Mappings},
    author = {Ormazabal, Aitor and Artetxe, Mikel and Labaka, Gorka and Soroa, Aitor and Agirre, Eneko},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    pages = {4990-4995},
    year = {2019},
    month = {07},
    address = {Florence, Italy},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/P19-1492},
    url = {https://aclanthology.org/P19-1492},
}

Bilingual Lexicon Induction through Unsupervised Machine Translation

Mikel Artetxe, Gorka Labaka, Eneko Agirre

ACL 2019

Abstract BibTex PDF Code Video

A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase-table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques. As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resource besides the monolingual corpus used to train the embeddings. When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art in the standard MUSE dataset.

@inproceedings{artetxe2019bilingual
    title = {Bilingual Lexicon Induction through Unsupervised Machine Translation},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    pages = {5002-5007},
    year = {2019},
    month = {07},
    address = {Florence, Italy},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/P19-1494},
    url = {https://aclanthology.org/P19-1494},
}

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

Mikel Artetxe, Holger Schwenk

ACL 2019

Abstract BibTex PDF Code

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.

@inproceedings{artetxe2019marginbased
    title = {Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings},
    author = {Artetxe, Mikel and Schwenk, Holger},
    booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    pages = {3197-3203},
    year = {2019},
    month = {07},
    address = {Florence, Italy},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/P19-1309},
    url = {https://aclanthology.org/P19-1309},
}

2018

Unsupervised Statistical Machine Translation

Mikel Artetxe, Gorka Labaka, Eneko Agirre

EMNLP 2018

Abstract BibTex PDF Code Video

While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses.

@inproceedings{artetxe2018unsupervised
    title = {Unsupervised Statistical Machine Translation},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
    pages = {3632-3642},
    year = {2018},
    month = {10},
    address = {Brussels, Belgium},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/D18-1399},
    url = {https://aclanthology.org/D18-1399},
}

Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation

Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

CoNLL 2018

Abstract BibTex PDF Code

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations in downstream tasks is higher for unsupervised systems than for supervised ones.

@inproceedings{artetxe2018uncovering
    title = {Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation},
    author = {Artetxe, Mikel and Labaka, Gorka and Lopez-Gazpio, Iñigo and Agirre, Eneko},
    booktitle = {Proceedings of the 22nd Conference on Computational Natural Language Learning},
    pages = {282-291},
    year = {2018},
    month = {10},
    address = {Brussels, Belgium},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/K18-1028},
    url = {https://aclanthology.org/K18-1028},
}

A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

Mikel Artetxe, Gorka Labaka, Eneko Agirre

ACL 2018

Abstract BibTex PDF Code Slides Video

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap.

@inproceedings{artetxe2018robust
    title = {A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages = {789-798},
    year = {2018},
    month = {07},
    address = {Melbourne, Australia},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/P18-1073},
    url = {https://aclanthology.org/P18-1073},
}

Unsupervised neural machine translation

Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho

ICLR 2018

Abstract BibTex PDF Code

In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely remove the need of parallel data and propose a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. Our model builds upon the recent work on unsupervised embedding mappings, and consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and backtranslation. Despite the simplicity of the approach, our system obtains 15.56 and 10.21 BLEU points in WMT 2014 French-to-English and German-to-English translation. The model can also profit from small parallel corpora, and attains 21.81 and 15.24 points when combined with 100,000 parallel sentences, respectively. Our implementation is released as an open source project.

@inproceedings{artetxe2018unsupervised
    title = {Unsupervised neural machine translation},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko and Cho, Kyunghyun},
    booktitle = {Proceedings of the Sixth International Conference on Learning Representations},
    year = {2018},
    month = {04},
    url = {https://openreview.net/forum?id=Sy2ogebAW},
}

Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations

Mikel Artetxe, Gorka Labaka, Eneko Agirre

AAAI 2018

Abstract BibTex PDF Code

Using a dictionary to map independently trained word embeddings to a shared space has shown to be an effective approach to learn bilingual word embeddings. In this work, we propose a multi-step framework of linear transformations that generalizes a substantial body of previous work. The core step of the framework is an orthogonal transformation, and existing methods can be explained in terms of the additional normalization, whitening, re-weighting, de-whitening and dimensionality reduction steps. This allows us to gain new insights into the behavior of existing methods, including the effectiveness of inverse regression, and design a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction. The corresponding software is released as an open source project.

@inproceedings{artetxe2018generalizing
    title = {Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence},
    pages = {5012-5019},
    year = {2018},
    month = {02},
    url = {https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16935},
}

2017

Learning bilingual word embeddings with (almost) no bilingual data

Mikel Artetxe, Gorka Labaka, Eneko Agirre

ACL 2017

Abstract BibTex PDF Code Slides Video

Most methods to learn bilingual word embeddings rely on large parallel corpora, which is difficult to obtain for most language pairs. This has motivated an active research line to relax this requirement, with methods that use document-aligned corpora or bilingual dictionaries of a few thousand words instead. In this work, we further reduce the need of bilingual resources using a very simple self-learning approach that can be combined with any dictionary-based mapping technique. Our method exploits the structural similarity of embedding spaces, and works with as little bilingual evidence as a 25 word dictionary or even an automatically generated list of numerals, obtaining results comparable to those of systems that use richer resources.

@inproceedings{artetxe2017learning
    title = {Learning bilingual word embeddings with (almost) no bilingual data},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    pages = {451-462},
    year = {2017},
    month = {07},
    address = {Vancouver, Canada},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/P17-1042},
    url = {https://aclanthology.org/P17-1042},
}

2016

Learning principled bilingual mappings of word embeddings while preserving monolingual invariance

Mikel Artetxe, Gorka Labaka, Eneko Agirre

EMNLP 2016

Abstract BibTex PDF Code Video

Mapping word embeddings of different languages into a single space has multiple applications. In order to map from a source space into a target space, a common approach is to learn a linear mapping that minimizes the distances between equivalences listed in a bilingual dictionary. In this paper, we propose a framework that generalizes previous work, provides an efficient exact method to learn the optimal linear transformation and yields the best bilingual results in translation induction while preserving monolingual performance in an analogy task.

@inproceedings{artetxe2016learning
    title = {Learning principled bilingual mappings of word embeddings while preserving monolingual invariance},
    author = {Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
    booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
    pages = {2289-2294},
    year = {2016},
    month = {11},
    address = {Austin, Texas},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/D16-1250},
    url = {https://aclanthology.org/D16-1250},
}

Adding syntactic structure to bilingual terminology for improved domain adaptation

Mikel Artetxe, Gorka Labaka, Chakaveh Saedi, João Rodrigues, João Silva, António Branco, Eneko Agirre

DMTW 2016

Abstract BibTex PDF

Deep-syntax approaches to machine translation have emerged as an alternative to phrase-based statistical systems. TectoMT is an open source framework for transfer-based MT which works at the deep tectogrammatical level and combines linguistic knowledge and statistical techniques. When adapting to a domain, terminological resources improve results with simple techniques, e.g. force-translating domain-specific expressions. In such approaches, multiword entries are translated as if they were a single token-with-spaces, failing to represent the internal structure which makes TectoMT a powerful translation engine. In this work we enrich source and target multiword terms with syntactic structure, and seamlessly integrate them in the tree-based transfer phase of TectoMT. Our experiments on the IT domain using the Microsoft terminological resource show improvement in Spanish, Basque and Portuguese.

@inproceedings{artetxe2016adding
    title = {Adding syntactic structure to bilingual terminology for improved domain adaptation},
    author = {Artetxe, Mikel and Labaka, Gorka and Saedi, Chakaveh and Rodrigues, João and Silva, João and Branco, António and Agirre, Eneko},
    booktitle = {Proceedings of the 2nd Deep Machine Translation Workshop},
    pages = {39-46},
    year = {2016},
    month = {10},
    address = {Lisbon, Portugal},
    publisher = {ÚFAL MFF UK},
    url = {https://aclanthology.org/W16-6405},
}

2015

EHU at TweetMT: Adapting MT Engines for Formal Tweets

Iñaki Alegria, Mikel Artetxe, Gorka Labaka, Kepa Sarasola

TweetMT 2015

Abstract BibTex PDF

This paper describes the participation of the IXA group from the UPV/EHU (University of the Basque Country) in the TweetMT shared task at the SEPLN-2015 conference. We have adapted existing MT engines for the es-eu and eu-es pairs, obtaining good results (better than other experiments reported in previous work). Three main aspects are described: resource compilation, engine adaptation and results.

@inproceedings{alegria2015ehu
    title = {EHU at TweetMT: Adapting MT Engines for Formal Tweets},
    author = {Alegria, Iñaki and Artetxe, Mikel and Labaka, Gorka and Sarasola, Kepa},
    booktitle = {Proceedings of the Tweet Translation Workshop},
    pages = {20-24},
    year = {2015},
    month = {09},
    address = {Alicante, Spain},
    url = {http://ceur-ws.org/Vol-1445/tweetmt-3-alegria.pdf},
}

Lexical semantics, Basque and Spanish in QTLeap: Quality Translation by Deep Language Engineering Approaches

Eneko Agirre, Iñaki Alegria, Nora Aranberri, Mikel Artetxe, Ander Barrena, António Branco, Arantza Díaz de Ilarraza, Koldo Gojenola, Gorka Labaka, Arantxa Otegi, Kepa Sarasola

SEPLN

Abstract BibTex PDF

The goal of this FP7 European project is to contribute for the advancement of quality machine translation by pursuing an approach that further relies on semantics, deep parsing and linked open data.

@article{agirre2015lexical
    title = {Lexical semantics, Basque and Spanish in QTLeap: Quality Translation by Deep Language Engineering Approaches},
    author = {Agirre, Eneko and Alegria, Iñaki and Aranberri, Nora and Artetxe, Mikel and Barrena, Ander and Branco, António and Díaz de Ilarraza, Arantza and Gojenola, Koldo and Labaka, Gorka and Otegi, Arantxa and Sarasola, Kepa},
    journal = {Procesamiento del Lenguaje Natural},
    volume = {55},
    pages = {169-172},
    year = {2015},
    month = {09},
    issn = {1989-7553},
    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/5231},
}

Analyzing English-Spanish Named-Entity enhanced Machine Translation

Mikel Artetxe, Eneko Agirre, Iñaki Alegria, Gorka Labaka

SSST 2015

Abstract BibTex PDF

Translation of named-entities (NEs) is an issue in SMT. In this paper we analyze the errors when translating NEs with a SMT system from English to Spanish. We train on Europarl and test on News Commentary, focusing on entities correctly recognized by an automatic NE recognition system. The automatic systems translate around 85% NEs correctly, leaving a small margin for improving performance. In addition, we implement a purpose-build NE translator and integrate it in the SMT system, yielding a small but significant improvement in BLEU score. Our analysis shows that, contrary to similar systems translating from Chinese to English, there was no improvement in NE translation, prompting further work.

@inproceedings{artetxe2015analyzing
    title = {Analyzing English-Spanish Named-Entity enhanced Machine Translation},
    author = {Artetxe, Mikel and Agirre, Eneko and Alegria, Iñaki and Labaka, Gorka},
    booktitle = {Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation},
    pages = {52-54},
    year = {2015},
    month = {06},
    address = {Denver, Colorado, USA},
    publisher = {Association for Computational Linguistics},
    doi = {10.3115/v1/W15-1007},
    url = {https://aclanthology.org/W15-1007},
}

Building hybrid machine translation systems by using an EBMT preprocessor to create partial translations

Mikel Artetxe, Gorka Labaka, Kepa Sarasola

EAMT 2015

Abstract BibTex PDF

This paper presents a hybrid machine translation framework based on a preprocessor that translates fragments of the input text by using example-based machine translation techniques. The preprocessor resembles a translation memory with named-entity and chunk generalization, and generates a high quality partial translation that is then completed by the main translation engine, which can be either rule-based (RBMT) or statistical (SMT). Results are reported for both RBMT and SMT hybridization as well as the preprocessor on its own, showing the effectiveness of our approach.

@inproceedings{artetxe2015building
    title = {Building hybrid machine translation systems by using an EBMT preprocessor to create partial translations},
    author = {Artetxe, Mikel and Labaka, Gorka and Sarasola, Kepa},
    booktitle = {Proceedings of the 18th Annual Conference of the European Association for Machine Translation},
    pages = {11-18},
    year = {2015},
    month = {05},
    address = {Antalya, Turkey},
    url = {https://aclanthology.org/W15-4902},
}

Mikel Artetxe

Experience

Students

Current

Alumni

Prospective

Publications

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015