EusCrawl is a high-quality corpus for Basque comprising 12.5 million documents and 423 million tokens, totalling 2.1 GiB of uncompressed text. EusCrawl was built using ad-hoc scrapers to extract text from 33 Basque websites with high-quality content, resulting in cleaner text than general-purpose approaches produce. We only use content released under a Creative Commons license. The corpus is available in two formats: JSONL and TXT.
Euscrawl kalitate handiko euskal corpusa osatzen duten dokumentuak modu librean bana daitezke, Creative Commons lizentzia libreekin. 12.5 milioi dokumentu eta 423 milioi hitzez osatuta dago, eta eskuz aukeratutako Interneteko hainbat webgunetatik dokumentuak xurgatuz (crawl ingelesez) osatu da. Corpusa bi formatu ezberdinetan dago eskuragarri: JSONL eta TXT.
Structured data format, including metadata information for each document (license, date of publication, URL...)
Datu formatu egituratua, dokumentu bakoitzeko metadatuak (lizentzia, argitalpen data, URL...) barne.
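As a rough illustration of how the JSONL version might be consumed, the sketch below reads one JSON object per line and tallies documents by license. The field names (`url`, `license`, `text`) and the sample records are assumptions made for the example; the actual keys used in the corpus files are not specified here and should be checked against the downloaded data.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_license(docs: Iterable[dict], key: str = "license") -> Counter:
    """Tally documents by license; `key` is an assumed field name."""
    return Counter(doc.get(key, "unknown") for doc in docs)


# Two made-up records standing in for real corpus lines (hypothetical fields).
sample = [
    '{"url": "https://example.org/a", "license": "cc-by", "text": "Kaixo"}',
    "",
    '{"url": "https://example.org/b", "license": "cc-by-sa", "text": "Agur"}',
]

counts = count_by_license(iter_documents(sample))
print(counts)
```

Because `iter_documents` accepts any iterable of strings, an open file handle (opened with `encoding="utf-8"`) can be passed in place of `sample`, so the corpus is streamed line by line rather than loaded into memory at once.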
Download - Jaitsi
Plain text corpus extracted from the JSONL version
JSONL bertsiotik ateratako testu corpusa
Download - Jaitsi
We release RoBERTa models trained with fairseq under the CC-BY-NC license.
Fairseq-ekin entrenatutako RoBERTa ereduak banatzen ditugu, CC-BY-NC lizentziarekin.
You can find more details about EusCrawl, including an empirical comparison with CommonCrawl-based corpora, in our academic publication.
EusCrawli buruzko datu gehiago gure argitalpen akademikoan aurkituko dituzu.
If you use our corpus or models for academic research, please cite the following paper:
Ikerketarako gure corpus edo ereduak erabiltzen badituzu, mesedez ondoko artikuluari egin erreferentzia:
@misc{artetxe2022euscrawl,
title={Does corpus quality really matter for low-resource languages?},
author={Mikel Artetxe and Itziar Aldabe and Rodrigo Agerri and
Olatz Perez-de-Viñaspre and Aitor Soroa},
year={2022},
eprint={2203.08111},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
For questions, please contact Aitor Soroa (UPV/EHU) by email at a.soroa[at]ehu.eus
Edozein zalantza izanez gero, idatzi mezu bat Aitor Soroari (UPV/EHU): a.soroa[at]ehu.eus