CommonMorph's Datasets
Morphological Datasets
We are building an open-source, multilingual dataset of morphological information that is freely available for training subword models. The datasets also play a vital role in documenting and preserving languages. All datasets are available in the UniMorph format.
eng English (UK): 5091 inflected forms, 1272 lemmas
hac Hawrami (Tawele): 75951 inflected forms, 1020 lemmas
tur Turkish (Standard): 3216 inflected forms, 115 lemmas
spa Spanish (Spain): 3057 inflected forms, 94 lemmas
fas Farsi (Tehran): 3734 inflected forms, 86 lemmas
ara Arabic (MSA): 3276 inflected forms, 84 lemmas
ckb Central Kurdish (Sine): 3517 inflected forms, 80 lemmas
ckb Central Kurdish (Standard): 4210 inflected forms, 79 lemmas
fas Farsi (Standard): 3426 inflected forms, 78 lemmas
sdh Southern Kurdish (Pahle): 3528 inflected forms, 72 lemmas
ckb Central Kurdish (Mehabad): 386 inflected forms, 69 lemmas
lki Laki Kurdish (Kakawand): 2584 inflected forms, 68 lemmas
deu German (Standard): 56 inflected forms, 60 lemmas
glk Gilaki (Rasht): 77 inflected forms, 54 lemmas
kmr Northern Kurdish: 499 inflected forms, 31 lemmas
sdh Southern Kurdish (AliSherwan): 469 inflected forms, 28 lemmas
arz Arabic (Egyptian): 4 inflected forms, 16 lemmas
hac Hawrami (Jawero): 19 inflected forms, 16 lemmas
rus Russian: 13 inflected forms, 6 lemmas
kat Georgian: 14 inflected forms, 5 lemmas
sdh Southern Kurdish (Chardawel): 4 lemmas
lat Latin: 16 inflected forms, 2 lemmas
fra French (Standard): 6 inflected forms, 2 lemmas
diq Southern Zazaki: 1 lemma
mon Mongolian (Khalkha): 6 inflected forms, 1 lemma
swc Swahili (Congo): 1 inflected form, 1 lemma
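
For readers who want to work with these files, here is a minimal Python sketch of reading data in the UniMorph format and counting lemmas and inflected forms. It assumes the usual UniMorph layout of one entry per line with three tab-separated fields (lemma, inflected form, semicolon-separated feature tags). The function names, the sample rows, and the counting rule (one inflected form per row) are illustrative assumptions and may not match how the counts listed above were computed.

```python
def parse_unimorph(text):
    """Parse UniMorph-style text into (lemma, form, features) triples.

    Assumes each non-empty line has exactly three tab-separated fields;
    lines that do not match are skipped rather than raising an error.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        lemma, form, tags = parts
        entries.append((lemma, form, tags.split(";")))
    return entries


def summarize(entries):
    """Count distinct lemmas and total rows (assumed: one row per inflected form)."""
    lemmas = {lemma for lemma, _, _ in entries}
    return {"lemmas": len(lemmas), "inflected_forms": len(entries)}


# Hypothetical sample rows, not taken from any CommonMorph dataset.
sample = "walk\twalks\tV;PRS;3;SG\nwalk\twalked\tV;PST\nrun\tran\tV;PST\n"
entries = parse_unimorph(sample)
stats = summarize(entries)
print(stats)
```

To summarize a downloaded file, read it with open(path, encoding="utf-8").read() and pass the result to parse_unimorph.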