Judit Ács
BME, SZTAKI, ELTE
MILAB-NLU Seminar
February 25, 2021
He is running a marathon.
type of probing data: word, sentence, sentence pair
probing location: token, token span (edge probing), sentence
probed information: morphology, POS, syntax, semantics
Model | contextual | pretrained | tokenization |
---|---|---|---|
M-BERT | yes | yes | wordpiece |
SLSTM | yes | no | character |
fastText | no | yes | n-gram |
WLSTM | no | no | character |
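
The n-gram tokenization listed for fastText in the table above can be illustrated with a minimal sketch. It assumes the usual fastText convention of wrapping the word in `<` and `>` boundary markers and collecting all character n-grams within a length range. The function name `char_ngrams` is hypothetical, and the sketch leaves out details of the real library, such as hashing the n-grams into buckets and adding the whole word as an extra unit.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"  # '<' and '>' mark the beginning and end of the word
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


# Trigrams only, to keep the output short.
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because the n-grams are shared across words, an unseen word form still receives a representation from its parts, which is what makes this kind of tokenization relevant for morphological probing.
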
Czech | 23 | Croatian | 10 | Norwegian_Bokmal | 7 | Lithuanian | 4 |
Russian | 20 | Romanian | 10 | Norwegian_Nynorsk | 7 | Urdu | 4 |
Polish | 19 | Ukrainian | 9 | Turkish | 7 | Portuguese | 4 |
Finnish | 16 | Slovenian | 8 | Arabic | 6 | Basque | 3 |
Latvian | 16 | French | 8 | Serbian | 6 | Hungarian | 3 |
Latin | 13 | Swedish | 8 | Hindi | 6 | English | 3 |
Slovak | 12 | Catalan | 8 | Hebrew | 5 | Armenian | 3 |
German | 11 | Italian | 8 | Greek | 5 | Persian | 1 |
Bulgarian | 11 | Spanish | 7 | Danish | 5 | Afrikaans | 1 |
Estonian | 11 | Albanian | 7 | Dutch | 4 |
$ \text{BOW} \approx 1.22 \cdot B_2 + 1.42 $
World Atlas of Language Structures (WALS)
“a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.”
linguistic typology with 100+ fields for each language, many missing values
Model | Languages | Training data | Details |
---|---|---|---|
huBERT | Hungarian | Webcorpus 2.0 | BERT-base |
HILBERT | Hungarian | MNSZ, JSI, NOL, OS, KM | BERT-large |
mBERT | 100+ | Wikipedia | BERT-base |
XLM-RoBERTa | 100 | CommonCrawl | BERT-base |
XLM-MLM-100 | 100 | Larger version of XLM-RoBERTa | |
distil-mBERT | 100+ | Distilled version of mBERT | |
Morph tag | POS | # of values | Values |
---|---|---|---|
Case | NOUN | 18 | Abl, Acc, ..., Ter, Tra |
Degree | ADJ | 3 | Cmp, Pos, Sup |
Mood | VERB | 4 | Cnd, Imp, Ind, Pot |
Number[psor] | NOUN | 2 | Sing, Plur |
Number | ADJ | 2 | Sing, Plur |
Number | NOUN | 2 | Sing, Plur |
Number | VERB | 2 | Sing, Plur |
Person[psor] | NOUN | 3 | 1, 2, 3 |
Person | VERB | 3 | 1, 2, 3 |
Tense | VERB | 2 | Pres, Past |
VerbForm | VERB | 2 | Inf, Fin |
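
Each row of the table above defines one probing task as a (morphological tag, POS) pair. A minimal sketch of how a label for such a task could be read off an annotated token follows. It assumes the annotations use CoNLL-U style feature strings such as `Case=Acc|Number=Sing`; the helper names `parse_feats` and `probing_label` and the sample annotations are hypothetical, not taken from the talk.

```python
from typing import Dict, Optional


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U style FEATS string such as 'Case=Acc|Number=Sing'."""
    if feats in ("", "_"):  # '_' marks an empty feature set in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def probing_label(upos: str, feats: str, tag: str, pos: str) -> Optional[str]:
    """Return the value of `tag` if the token has the probed POS, else None."""
    if upos != pos:
        return None
    return parse_feats(feats).get(tag)


# Hypothetical annotations, used only to show the lookup.
print(probing_label("NOUN", "Case=Acc|Number=Sing", "Case", "NOUN"))  # Acc
print(probing_label("VERB", "Mood=Ind|Tense=Pres", "Case", "NOUN"))   # None
```

Tokens for which the function returns `None` would simply not be used as examples for that task.
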
POS tagging
NER tagging
Statistic | huBERT | HILBERT | mBERT | XLM-RoBERTa | XLM-MLM-100 |
---|---|---|---|---|---|
Vocab size | 32k | 64k | 120k | 250k | 200k |
Word length in wordpieces | 2.77 | 2.96 | 3.95 | 3.17 | 3.46 |
Same as emtsv | 16% | 6.7% | 5% | 14% | 8% |
Last wordpiece same | 43% | 31% | 41% | 47% | 39% |
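
The row on word length in wordpieces can be read as the average number of wordpieces a tokenizer produces per word; that reading is an assumption here. The sketch below shows how such a statistic could be computed with a greedy longest-match-first segmenter in the style of BERT's WordPiece, where `##` marks a continuation piece. The vocabulary `TOY_VOCAB` and the function names are hypothetical and far smaller than any real model vocabulary.

```python
from typing import Iterable, List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def avg_pieces_per_word(words: Iterable[str], vocab: Set[str]) -> float:
    """Average number of wordpieces per word."""
    counts = [len(wordpiece(w, vocab)) for w in words]
    return sum(counts) / len(counts)


# Toy vocabulary, for illustration only.
TOY_VOCAB = {"he", "is", "a", "run", "##ning", "marathon"}
words = ["he", "is", "running", "a", "marathon"]

print(wordpiece("running", TOY_VOCAB))  # ['run', '##ning']
print(avg_pieces_per_word(words, TOY_VOCAB))
```

A larger value means words are split into more, shorter pieces, which tends to happen when a vocabulary has to be shared across many languages.
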