The results indicate that BERTopic can produce informative topics even when applied to partially preprocessed short text, provided the parameters are set appropriately. When the same parameters are used in both preprocessing scenarios, the performance drop on partially preprocessed text is minimal. BERTopic outperforms LDA and NMF by offering more informative topics, and it provides new insights when the number of topics is not limited. The study's findings could be significant for researchers working with short text in other morphologically rich, low-resource languages.
Key takeaways:
- The paper presents the first application of BERTopic, a topic modeling technique, to short text in a morphologically rich language.
- BERTopic was applied with three multilingual embedding models at two levels of text preprocessing (partial and full) to evaluate its performance on partially preprocessed short text in Serbian.
- The results show that BERTopic can yield informative topics even when applied to partially preprocessed short text, with minimal performance drop compared to fully preprocessed text.
- Compared to LDA and NMF, BERTopic offers more informative topics and provides novel insights when the number of topics is not limited.