IR-NLP Lab - CSUI

Tools & Resources

Rashel et al. (2014)

An Indonesian POS tagger with a tagset of 23 labels. Github | paper

Aksara

Tokenizer, Lemmatizer, POS Tagger, and Morphological Features Analyzer that conforms to the UD v2 annotation guidelines. Github | paper

Ibrohim and Budi (2019). Dataset for multi-label hate speech and abusive language detection on Indonesian Twitter. This dataset consists of 13,169 tweets. Github | paper
Ibrohim and Budi (2018). Dataset for abusive language detection on Indonesian Twitter. This dataset consists of 2,016 tweets. Github | paper
Alfina et al. (2017). Dataset for Indonesian hate speech detection with two labels: HS (hate speech) and Non_HS (non-hate-speech). This dataset consists of 713 tweets. Github | paper

UD-Indonesian-CSUI - Alfina et al. (2020). A dependency treebank that conforms to UD v2 and was converted automatically from the Kethu treebank. This treebank consists of 1,030 sentences. Github | paper
UD-Indonesian-PUD - Alfina et al. (2019, 2020). A gold standard dependency treebank that conforms to UD v2. It is a revised version of the original UD-Indonesian-PUD treebank and consists of 1,000 sentences. Github | paper
Kethu - Arwidarasti et al. (2019, 2020). An Indonesian constituency treebank that conforms to the Penn Treebank format. It was converted automatically, with manual correction, from the Dinakaramani et al. (2015) treebank. This treebank consists of 1,030 sentences. Github | paper
Dinakaramani et al. (2015). A manually tagged constituency treebank. This treebank consists of 1,030 sentences. Github | slide
Dinakaramani et al. (2014). A POS tagging dataset of about 250K tokens and 23 tags. Github | paper
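
UD v2 treebanks such as the two listed above are normally distributed as CoNLL-U files: one token per line, ten tab-separated columns, and a blank line between sentences. The following is a minimal sketch of reading that format with the Python standard library only. The function name `read_conllu` and the sample sentence are illustrative assumptions and are not taken from any of the repositories; the actual file names and contents should be checked in each repository.

```python
from typing import Dict, Iterable, Iterator, List

# The ten CoNLL-U columns, in the order defined by the UD v2 format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries.

    Hypothetical helper: comment lines, multiword-token ranges (e.g. "1-2")
    and empty nodes (e.g. "1.1") are skipped for simplicity.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level metadata such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token range or empty node
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Made-up sample sentence, used only to illustrate the format.
SAMPLE_ROWS = [
    ["1", "Saya", "saya", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "makan", "makan", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "nasi", "nasi", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Saya makan nasi.\n" + "\n".join("\t".join(r) for r in SAMPLE_ROWS) + "\n"

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

To read a real treebank file, the same function can be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to a `.conllu` file downloaded from the corresponding repository.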

Singgalang dataset, Alfina et al. (2017). An automatically generated NER dataset of 48K sentences. Github | paper
Gultom and Wibowo (2017). An NER dataset of 2,125 sentences. Github | paper

Jannati et al. (2018). Dataset for stance classification towards political figures in blog writing (stance detection dataset). Github | paper

Saputri et al. (2018). This dataset contains 4,403 Indonesian tweets, each labeled with one of five emotion classes: love, anger, sadness, joy, and fear. Github | paper
Alfina et al. (2017). Indonesian Twitter dataset for sentiment analysis in the political domain. It consists of 4,600 tweets. Github | paper