Sentiment Analysis on a Large Indonesian Product Review Dataset
Background: Publicly available large datasets play an important role in the development of natural language processing and computational linguistics research. However, only a few large Indonesian-language datasets are currently accessible for research purposes. This includes datasets for sentiment analysis, which is considered the most popular task.
Objective: This work presents sentiment analysis on a large Indonesian product review dataset, employing various features and methods. Two tasks were carried out: classifying reviews into three classes (positive, negative, neutral) and predicting ratings.
Methods: Sentiment analysis was conducted on the FDReview dataset, comprising over 700,000 reviews. The analysis treated sentiment as a classification problem, employing the following methods: Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), LSTM, and BiLSTM.
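To make the classification framing concrete, the following is a minimal sketch of a bag-of-words Multinomial Naïve Bayes classifier written with the Python standard library only. It is not the authors' implementation. The rating thresholds in `rating_to_label`, the whitespace tokenizer, the add-one smoothing, and the toy reviews are all assumptions introduced for illustration; the abstract does not specify any of them.

```python
import math
from collections import Counter, defaultdict


def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment class (thresholds are assumed)."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


class MultinomialNB:
    """Bag-of-words multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()        # naive whitespace tokenizer
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy reviews invented for illustration only (not taken from FDReview).
reviews = [
    ("produk ini bagus sekali", 5),
    ("sangat suka hasilnya bagus", 4),
    ("biasa saja tidak istimewa", 3),
    ("lumayan biasa saja", 3),
    ("jelek bikin kulit iritasi", 1),
    ("kecewa produk jelek", 2),
]
texts = [text for text, _ in reviews]
labels = [rating_to_label(rating) for _, rating in reviews]
model = MultinomialNB().fit(texts, labels)
prediction = model.predict("produk bagus suka")
```

The same class could be trained on the raw ratings instead of the three derived labels to approximate the rating prediction task, since both tasks are framed as classification.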
Results: Among the conventional methods, MNB outperformed SVM in rating prediction, whereas SVM performed better in the review classification task. The BiLSTM method outperformed all other methods in both tasks. This study also includes experiments conducted on balanced and unbalanced small-sized sample datasets.
Conclusion: Analysis of the experimental results revealed that the deep learning-based method performed better only in the large dataset setting. Results from the small balanced dataset indicate that conventional machine learning methods exhibit competitive performance compared to deep learning approaches.
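One common way to obtain a small balanced sample from a larger, skewed dataset is random undersampling, sketched below. The abstract does not say how the balanced and unbalanced samples were drawn, so the function `balanced_sample`, its parameters, and the label distribution in the example are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(items, labels, per_class, seed=0):
    """Randomly draw `per_class` items from every class (undersampling)."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    sample = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} items")
        sample.extend((item, label) for item in rng.sample(pool, per_class))
    rng.shuffle(sample)  # avoid blocks of identical labels
    return sample


# Hypothetical skewed label distribution, for illustration only.
labels = ["positive"] * 80 + ["neutral"] * 15 + ["negative"] * 5
items = [f"review_{i}" for i in range(len(labels))]
subset = balanced_sample(items, labels, per_class=5)
counts = Counter(label for _, label in subset)
```

Here every class contributes five reviews to `subset`, so a classifier trained on it cannot gain accuracy simply by favoring the majority class.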
Keywords: Indonesian review dataset, Large dataset, Rating prediction, Sentiment analysis
Copyright (c) 2024 The Authors. Published by Universitas Airlangga.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
All accepted papers will be published under a Creative Commons Attribution 4.0 International (CC BY 4.0) License. Authors retain copyright and grant the journal the right of first publication. The CC BY license lets others Share (copy and redistribute the material in any medium or format) and Adapt (remix, transform, and build upon the material for any purpose, even commercially).