Tools & Resources

NLP Tools

Rashel et al. (2014)

An Indonesian POS tagger with a tagset of 23 labels. Github | paper

Aksara

A tokenizer, lemmatizer, POS tagger, and morphological features analyzer that conforms to the UD v2 annotation guidelines. Github | paper

Dataset

Hate Speech or Abusive Language Dataset

  • Ibrohim and Budi (2019). Dataset for multi-label hate speech and abusive language detection on Indonesian Twitter. This dataset consists of 13,169 tweets. Github | paper
  • Ibrohim and Budi (2018). Dataset for abusive language detection on Indonesian Twitter. This dataset consists of 2,016 tweets. Github | paper
  • Alfina et al. (2017). Dataset for Indonesian hate speech detection with two labels: HS (hate speech) and Non_HS (non-hate-speech). This dataset consists of 713 tweets. Github | paper

Morphology, POS Tagging, or Syntactic Parsing Dataset

  • UD-Indonesian-CSUI - Alfina et al. (2020). A dependency treebank that conforms to UD v2 and was converted automatically from the Kethu treebank. This treebank consists of 1,030 sentences. Github | paper
  • UD-Indonesian-PUD - Alfina et al. (2019, 2020). A gold standard dependency treebank that conforms to UD v2. It is a revised version of the original UD-Indonesian-PUD treebank. It consists of 1,000 sentences. Github | paper
  • Kethu - Arwidarasti et al. (2019, 2020). An Indonesian constituency treebank that conforms to the Penn Treebank format. It was converted automatically, with manual correction, from the Dinakaramani et al. (2015) treebank. This treebank consists of 1,030 sentences. Github | paper
  • Dinakaramani et al. (2015). A manually tagged constituency treebank. This treebank consists of 1,030 sentences. Github | slide
  • Dinakaramani et al. (2014). A manually tagged POS tagging dataset of about 10K sentences (250K tokens) with a tagset of 23 tags. Github | paper
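
The UD treebanks listed above follow the Universal Dependencies convention of storing sentences in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with "#", and a blank line between sentences. A minimal sketch of reading such a file in plain Python might look like the following. The function name read_conllu and the two-token sample are illustrative assumptions, not taken from any of the treebanks or from an existing library.

```python
from typing import Dict, Iterable, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Group CoNLL-U token lines into sentences (lists of token dicts)."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            cols = line.split("\t")
            # Keep only well-formed rows; multiword-token ranges such as
            # "1-2" also have ten columns and are kept as-is.
            if len(cols) == len(FIELDS):
                current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token example, not drawn from any of the treebanks.
SAMPLE = (
    "# text = Saya makan\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

parsed = read_conllu(SAMPLE.splitlines())
print([(tok["form"], tok["upos"]) for tok in parsed[0]])
```

To read a downloaded treebank file, the same function can be given an open file handle (opened with encoding="utf-8") instead of the sample lines.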

Named Entity Recognition (NER) Dataset

  • Singgalang dataset, Alfina et al. (2017). An automatically generated NER dataset of 48K sentences. Github | paper
  • Gultom and Wibowo (2017). An NER dataset of 2,125 sentences. Github | paper

Semantic Analysis Dataset

  • Jannati et al. (2018). Dataset for stance classification towards political figures in blog writing (stance detection dataset). Github | paper

Sentiment Analysis or Emotion Detection Dataset

  • Saputri et al. (2018). This dataset contains 4,403 Indonesian tweets, each labeled with one of five emotion classes: love, anger, sadness, joy, and fear. Github | paper
  • Alfina et al. (2017). Indonesian Twitter dataset for sentiment analysis in the political domain. It consists of 4,600 tweets. Github | paper