Leila Rabiei

and 3 more

Introduction: Part-of-Speech (POS) tagging, the task of assigning each word its part of speech (e.g., verb, noun), is fundamental to many Natural Language Processing (NLP) applications. It serves as a preprocessing step for tasks such as machine translation, question answering, and sentiment analysis. However, existing Persian POS corpora consist predominantly of formal texts, such as news articles and academic publications. Consequently, POS tagging tools and models developed on these corpora often struggle to accurately process the colloquial language found in social media and other informal contexts.

Method: This paper introduces the "Colloquial Persian POS" (CPPOS) corpus, designed to support POS tagging of colloquial Persian text. The corpus comprises over 520,000 labeled tokens of both formal and informal text, collected from diverse domains, including political, social, and commercial content, on platforms such as Telegram, Twitter, and Instagram. Following a year-long collection period, we preprocessed the data with normalization and with sentence and word tokenization tailored to social media text. A team of linguistic experts then manually annotated and verified the tokens and sentences. This study also establishes comprehensive guidelines for POS tagging annotation.

Results: To assess the quality of the CPPOS corpus, we trained several deep learning models, including models from the RNN family. A comparison with the well-established "Bijankhan" corpus and the Hazm POS tool, in both cases with models trained on Bijankhan, showed that our model trained on the CPPOS corpus significantly outperforms these benchmarks. Specifically, combining the new corpus with a BiLSTM deep neural model yielded a 14% improvement over previous datasets.
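
To illustrate what word tokenization tailored to social media text can involve, the following is a rough sketch only. It is not the CPPOS preprocessing pipeline: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names introduced here, and the rules (keep URLs, @mentions, and #hashtags whole; keep the zero-width non-joiner inside Persian words) are assumptions about what such a tokenizer might reasonably do.

```python
import re

# Hypothetical token pattern (an assumption, not the CPPOS tokenizer).
# Alternatives are tried left to right, so URLs, mentions, and hashtags
# take priority over plain words.
TOKEN_RE = re.compile(
    r"https?://\S+"          # URLs, up to the next whitespace
    r"|[@#]\w+"              # @mentions and #hashtags
    r"|\w+(?:\u200c\w+)*"    # words, keeping the zero-width non-joiner inside
    r"|[^\w\s]"              # any other single symbol (punctuation, emoji)
)


def tokenize(text: str) -> list[str]:
    """Split a social media post into tokens using TOKEN_RE."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    # The zero-width non-joiner (U+200C) joins the two parts of one word.
    print(tokenize("@ali \u0645\u06cc\u200c\u0631\u0648\u0645 #news https://t.me/x !"))
```

Keeping the zero-width non-joiner inside a token matters for Persian because it commonly separates a prefix or suffix from its stem within a single word; a tokenizer that treats it as a delimiter would split such words in two. This sketch does not handle diacritics or multi-codepoint emoji, which a production tokenizer would need to address.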