EusCrawl

A free, high-quality corpus for Basque

Corpus

EusCrawl is a high-quality corpus for Basque comprising 12.5 million documents and 423 million tokens, totalling 2.1 GiB of uncompressed text. It was built using ad-hoc scrapers to extract text from 33 hand-selected Basque websites with high-quality content, which yields cleaner text than general-purpose crawling approaches. We only use content released under Creative Commons licenses, so the documents can be freely redistributed under those licenses. The corpus is available in two formats: JSONL and TXT.
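As a quick sanity check on the corpus figures, the following Python sketch derives two averages from the numbers quoted in the description. It assumes that the 2.1 GiB figure refers to the text itself (the description does not say which format was measured) and that a GiB is 2^30 bytes; the variable names are illustrative only.

```python
# Figures quoted in the corpus description.
NUM_DOCUMENTS = 12_500_000
NUM_TOKENS = 423_000_000
SIZE_GIB = 2.1

# Assumption: 1 GiB = 2**30 bytes, and the size covers the text only.
size_bytes = SIZE_GIB * 1024**3
tokens_per_document = NUM_TOKENS / NUM_DOCUMENTS
bytes_per_token = size_bytes / NUM_TOKENS

print(f"{tokens_per_document:.1f} tokens per document on average")
print(f"{bytes_per_token:.1f} bytes of text per token on average")
```

Under these assumptions the averages come out at roughly 34 tokens per document and a little over 5 bytes per token (whitespace included).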

JSONL format

Structured data format that includes metadata for each document (license, publication date, URL, etc.).

Download
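A JSONL file stores one JSON object per line, so it can be streamed without loading the whole corpus into memory. The sketch below is a minimal reader, not an official loader: the field names `url`, `license`, and `text`, as well as the license value `cc-by-sa`, are assumptions made for illustration, so check the downloaded file for the actual schema. A small in-memory sample stands in for the real file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for the downloaded file.
# The field names and values here are hypothetical.
sample = io.StringIO(
    '{"url": "https://example.org/a", "license": "cc-by-sa", "text": "Kaixo mundua"}\n'
    '{"url": "https://example.org/b", "license": "cc-by", "text": "Egun on"}\n'
)

docs = list(iter_documents(sample))

# Example of filtering on a metadata field.
by_sa = [d for d in docs if d.get("license") == "cc-by-sa"]
print(len(docs), len(by_sa))
```

For the real corpus, the same function can be applied to a file opened with `open(path, encoding="utf-8")`, iterating lazily instead of building a list.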

TXT format

Plain-text corpus extracted from the JSONL version.

Download
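To illustrate how a plain-text version can be derived from the JSONL one, here is a hedged Python sketch. It is not the procedure used to produce the released TXT file: the `text` field name and the choice of a blank line as document separator are both assumptions.

```python
import io
import json
from typing import TextIO


def jsonl_to_text(src: TextIO, dst: TextIO, text_field: str = "text") -> int:
    """Write the text of each JSONL record to dst and return the number written.

    Records without a non-empty `text_field` are skipped. Documents are
    separated by a blank line (an assumed layout).
    """
    written = 0
    for line in src:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line).get(text_field, "")
        if text:
            dst.write(text.strip() + "\n\n")
            written += 1
    return written


# In-memory stand-ins for the input and output files.
src = io.StringIO(
    '{"text": "Kaixo mundua"}\n'
    '\n'
    '{"text": "Egun on"}\n'
    '{"url": "https://example.org/c"}\n'
)
dst = io.StringIO()
count = jsonl_to_text(src, dst)
print(count)
```

Keeping the metadata in JSONL and deriving the text on demand means the plain-text file never has to be the source of truth, for example when filtering by license before extraction.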

Models

We release RoBERTa models trained with fairseq under the CC-BY-NC license.

Publication

You can find more details about EusCrawl, including an empirical comparison with CommonCrawl-based corpora, in our academic publication.

Citation and contact

Citation

If you use our corpus or models for academic research, please cite the following paper:

@misc{artetxe2022euscrawl,
  title={Does corpus quality really matter for low-resource languages?},
  author={Mikel Artetxe and Itziar Aldabe and Rodrigo Agerri and
          Olatz Perez-de-Viñaspre and Aitor Soroa},
  year={2022},
  eprint={2203.08111},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Contact

For questions, please contact Aitor Soroa (UPV/EHU) by email at a.soroa[at]ehu.eus