The results indicate that BERTopic can produce informative topics even when applied to partially preprocessed short text, provided the parameters are set appropriately. When the same parameters are used in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. BERTopic outperforms LDA and NMF by offering more informative topics, and it provides new insights when the number of topics is not limited. The study's findings could be significant for researchers working with short text in other morphologically rich, low-resource languages.
Key takeaways:
- The paper presents the first application of BERTopic, a topic modeling technique, to short text in a morphologically rich language.
- BERTopic was applied with three multilingual embedding models at two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian.
- The results show that BERTopic can yield informative topics even when applied to partially preprocessed short text, with minimal performance drop compared to fully preprocessed text.
- Compared to LDA and NMF, BERTopic offers more informative topics and provides novel insights when the number of topics is not limited.