Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features

Budhi, Gregorius Satia; Chiong, Raymond; Wang, Zuli

doi:10.1007/s11042-020-10299-5

Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features

Published: 13 January 2021

Volume 80, pages 13079–13097, (2021)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Gregorius Satia Budhi^1,2,
Raymond Chiong¹ &
Zuli Wang³

1740 Accesses
Explore all metrics

Abstract

Fraudulent online sellers often collude with reviewers to garner fake reviews for their products. This act undermines the trust of buyers in product reviews, and potentially reduces the effectiveness of online markets. Being able to accurately detect fake reviews is, therefore, critical. In this study, we investigate several preprocessing and textual-based featuring methods along with machine learning classifiers, including single and ensemble models, to build a fake review detection system. Given the nature of product review data, where the number of fake reviews is far less than that of genuine reviews, we look into the results of each class in detail in addition to the overall results. We recognise from our preliminary analysis that, owing to imbalanced data, there is a high imbalance between the accuracies for different classes (e.g., 1.3% for the fake review class and 99.7% for the genuine review class), despite the overall accuracy looking promising (around 89.7%). We propose two dynamic random sampling techniques that are possible for textual-based featuring methods to solve this class imbalance problem. Our results indicate that both sampling techniques can improve the accuracy of the fake review class—for balanced datasets, the accuracies can be improved to a maximum of 84.5% and 75.6% for random under and over-sampling, respectively. However, the accuracies for genuine reviews decrease to 75% and 58.8% for random under and over-sampling, respectively. We also discover that, for smaller datasets, the Adaptive Boosting ensemble model outperforms other single classifiers; whereas, for larger datasets, the performance improvement from ensemble models is insignificant compared to the best results obtained by single classifiers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection

Article 04 August 2017

A class imbalance-aware review rating prediction using hybrid sampling and ensemble learning

Article 23 October 2020

Detection of Fake Reviews in Yelp Dataset Using Machine Learning and Chain Classifier Approach

References

Akram AU, Khan HU, Iqbal S, Iqbal T, Munir EU, Shafi M (2018) Finding rotten eggs: a review spam detection model using diverse feature sets. KSII Trans Internet Inform Syst 12(10):5120–5142. https://doi.org/10.3837/tiis.2018.10.026
Bajaj S, Garg N, Singh SK (2017) A novel user-based spam review detection. Procedia Comput Sci 122:1009–1015
Article Google Scholar
Barbado R, Araque O, Iglesias CA (2019) A framework for fake review detection in online consumer electronics retailers. Inf Process Manag 56(4):1234–1244. https://doi.org/10.1016/j.ipm.2019.03.002
Article Google Scholar
Birchall G (2018) TripAdvisor denies claims one in three reviews ‘faked’. https://www.news.com.au/technology/online/social/tripadvisor-denies-claims-one-in-three-reviews-faked/news-story/55243de188cc7f1fb2abb52fee3bac45. Accessed October 03 2019
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140. https://doi.org/10.1007/bf00058655
Article MATH Google Scholar
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/a:1010933404324
Article MATH Google Scholar
Budhi GS, Adipranata R (2014) Java characters recognition using evolutionary neural network and combination of Chi2 and backpropagation neural network. Int J Appl Eng Res 9(22):18025–18036
Google Scholar
Budhi GS, Chiong R, Pranata I, Hu Z (2017) Predicting rating polarity through automatic classification of review texts. In: Proceedings of the 2017 IEEE Conference on Big Data and Analytics, Kuching, Malaysia, pp 19–24. https://doi.org/10.1109/ICBDAA.2017.8284101
Budhi GS, Chiong R, Hu Z, Pranata I, Dhakal S (2018) Multi-PSO based classifier selection and parameter optimisation for sentiment polarity prediction. Proceedings of the 2018 IEEE Conference on Big Data and Analytics, Langkawi Island, Malaysia, pp 68–73. https://doi.org/10.1109/ICBDAA.2018.8629593
Budhi GS, Chiong R, Pranata I, Hu Z (2020) Using machine learning to predict the sentiment of online reviews: a new framework for comparative analysis. Arch Computation Methods Eng. https://doi.org/10.1007/s11831-020-09464-8
Campbell C, Ying Y (2011) Learning with support vector machines. Morgan & Claypool
Cardoso EF, Silva RM, Almeida TA (2018) Towards automatic filtering of fake reviews. Neurocomputing 309:106–116. https://doi.org/10.1016/j.neucom.2018.04.074
Article Google Scholar
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27. https://doi.org/10.1145/1961189.1961199
Article Google Scholar
Darzi MRK, Niaki STA, Khedmati M (2019) Binary classification of imbalanced datasets: the case of CoIL challenge 2000. Expert Syst Appl 128:169–186. https://doi.org/10.1016/j.eswa.2019.03.024
Article Google Scholar
Deng X, Li Y, Weng J, Zhang J (2019) Feature selection for text classification: a review. Multimed Tools Appl 78(3):3797–3816. https://doi.org/10.1007/s11042-018-6083-5
Article Google Scholar
Dobson AJ, Barnett AG (2008) An introduction to generalized linear models, 3rd edn. CRC Press, Boca Raton
Book Google Scholar
D'Onfro J (2013) A whopping 20% of Yelp reviews are fake. https://www.businessinsider.com.au/20-percent-of-yelp-reviews-fake-2013-9). Accessed Oktober 02 2019
Dunteman GH, Ho M-HR (2011) Generalized Linear Models. In: An introduction to generalized linear models. SAGE Publications, Inc., pp 2–6
Ellson A (2018) A third of TripAdvisor reviews are fake as cheats buy five stars. The Times. https://www.thetimes.co.uk/article/hotel-and-caf-cheats-are-caught-trying-to-buy-tripadvisor-stars-027fbcwc8. Accessed Oktober 02 2019
Etaiwi W, Naymat G (2017) The impact of applying different preprocessing steps on review spam detection. Procedia Comput Sci 113:273–279. https://doi.org/10.1016/j.procs.2017.08.368
Felbermayr A, Nanopoulos A (2016) The role of emotions for the perceived usefulness in online customer reviews. J Interact Mark 36:60–76. https://doi.org/10.1016/j.intmar.2016.05.004
Fernandez A, Garcıa S, Chawla FHNV (2018) SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905
Freeman LL (2016) How to spot fake online reviews. Money 45(6):30–30
Google Scholar
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, pp 249–256
Hastie T, Tibshirani R (1990) Generalized additive models. Chapman and Hall/CRC,
Hazim M, Anuar NB, Ab Razak MF, Abdullah NA (2018) Detecting opinion spams through supervised boosting approach. PLoS One 13(6):e0198884. https://doi.org/10.1371/journal.pone.0198884
Article Google Scholar
Hernández Fusilier D, Montes-y-Gómez M, Rosso P, Guzmán Cabrera R (2015) Detecting positive and negative deceptive opinions using PU-learning. Inf Process Manag 51(4):433–443. https://doi.org/10.1016/j.ipm.2014.11.001
Article Google Scholar
Heydari A, Ma T, Salim N, Heydari Z (2015) Detection of review spam: a survey. Expert Syst Appl 42(7):3634–3642. https://doi.org/10.1016/j.eswa.2014.12.029
Article Google Scholar
Hu Z, Chiong R, Pranata I, Susilo W, Bao Y (2016) Identifying malicious web domains using machine learning techniques with online credibility and performance data. In: Proceedings of the IEEE Congress on Evolutionary Computation, Vancouver, Canada, pp 5186–5194. https://doi.org/10.1109/CEC.2016.7748347
Hu Z, Chiong R, Pranata I, Bao Y, Lin Y (2019) Malicious web domain identification using online credibility and performance data by considering the class imbalance issue. Ind Manag Data Syst 119(3):676–696. https://doi.org/10.1108/IMDS-02-2018-0072
Article Google Scholar
Imran M, Latif S, Mehmood D, Shah MS (2019) Student academic performance prediction using supervised learning techniques. Int J Emerg Technol Learn 14(14):92–104. https://doi.org/10.3991/ijet.v14i14.10310
Ivanova O, Scholz M (2017) How can online marketplaces reduce rating manipulation? A new approach on dynamic aggregation of online ratings. Decis Support Syst 104:64–78. https://doi.org/10.1016/j.dss.2017.10.003
Article Google Scholar
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations. San Diego, USA, pp 1–15
Ko T, Lee JH, Cho H, Cho S, Lee W, Lee M (2017) Machine learning-based anomaly detection via integration of manufacturing, inspection and after-sales service data. Ind Manag Data Syst 117(5):927–945. https://doi.org/10.1108/imds-06-2016-0195
Kumar N, Venugopal D, Qiu L, Kumar S (2018) Detecting review manipulation on online platforms with hierarchical supervised learning. J Manag Inf Syst 35(1):350–380. https://doi.org/10.1080/07421222.2018.1440758
Article Google Scholar
Li L, Qin B, Ren W, Liu T (2017) Document representation and feature combination for deceptive spam review detection. Neurocomputing 254:33–41. https://doi.org/10.1016/j.neucom.2016.10.080
Article Google Scholar
Li H, Fei G, Wang S, Liu B, Shao W, Mukherjee A, Shao J (2017) Bimodal distribution and co-bursting in review spam detection. In: Proceedings of the 26th International Conference on World Wide Web. Perth, Australia, pp 1063–1072. https://doi.org/10.1145/3038912.3052582
Luca M, Zervas G (2016) Fake it till you make it: reputation, competition, and yelp review fraud. Manag Sci 62(12):3412–3427. https://doi.org/10.1287/mnsc.2015.2304
Article Google Scholar
Malbon J (2013) Taking fake online consumer reviews seriously. J Consum Policy 36(2):139–157. https://doi.org/10.1007/s10603-012-9216-7
Article Google Scholar
Menard S (2010) Logistic regression: from introductory to advanced concepts and applications. SAGE, Los Angeles
Book Google Scholar
Munzel A (2016) Assisting consumers in detecting fake reviews: the role of identity information disclosure and consensus. J Retail Consum Serv 32:96–108. https://doi.org/10.1016/j.jretconser.2016.06.002
Article Google Scholar
Nelder JA, Wedderburn RWM (1972) Generalized linear models. J R Stat Soc Ser A 135(3):370–384. https://doi.org/10.2307/2344614
NLTK (2019) Nltk Package. http://www.nltk.org/api/nltk.html. Accessed 25 Jan 2019
Norvig P (2016) How to write a spelling corrector. https://norvig.com/spell-correct.html. Accessed June 01 2018
O'Neill S (2018) A peddler of fake reviews on TripAdvisor gets jail time. https://skift.com/2018/09/12/fake-reviews-tripadvisor-jail-italy/. Accessed October 03 2019
Picchi A (2019) Buyer beware: scourge of fake reviews hitting Amazon, Walmart and other major retailers. CBS News. https://www.cbsnews.com/news/buyer-beware-a-scourge-of-fake-online-reviews-is-hitting-amazon-walmart-and-other-major-retailers/. Accessed 2 Oct 2019
Rahman M, Carbunar B, Ballesteros J, Chau DH (2015) To catch a fake: curbing deceptive yelp ratings and venues. Statistic Anal Data Min 8(3):147–161. https://doi.org/10.1002/sam.11264
Article MathSciNet MATH Google Scholar
Rathore S, Loia V, Park JH (2018) SpamSpotter: an efficient spammer detection framework based on intelligent decision support system on Facebook. Appl Soft Comput 67:920–932. https://doi.org/10.1016/j.asoc.2017.09.032
Article Google Scholar
Rayana S, Akoglu L (2015) Collective opinion spam detection: Bridging review networks and metadata. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, pp 985–994. https://doi.org/10.1145/2783258.2783370
Ren Y, Ji D (2017) Neural networks for deceptive opinion spam detection: an empirical study. Inf Sci 385-386:213–224. https://doi.org/10.1016/j.ins.2017.01.015
Rodola G (2020) psutil 5.7.2. https://pypi.org/project/psutil/. Accessed August 5 2020
Rout JK, Singh S, Jena SK, Bakshi S (2016) Deceptive review detection using labeled and unlabeled data. Multimed Tools Appl 76(3):3187–3211. https://doi.org/10.1007/s11042-016-3819-y
Article Google Scholar
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol 1. MIT Press, pp 318–362
Salehan M, Kim DJ (2016) Predicting the performance of online consumer reviews: a sentiment mining approach to big data analytics. Decis Support Syst 81:30–40. https://doi.org/10.1016/j.dss.2015.10.006
Savage D, Zhang X, Yu X, Chou P, Wang Q (2015) Detection of opinion spam based on anomalous rating deviation. Expert Syst Appl 42(22):8650–8657. https://doi.org/10.1016/j.eswa.2015.07.019
Article Google Scholar
Scikit-learn (2019) API Reference. https://scikit-learn.org/stable/modules/classes.html. Accessed 19 Mar 2019
Shu C (2019) FTC brings its first case against fake paid reviews on Amazon. https://techcrunch.com/2019/02/26/ftc-brings-its-first-case-against-fake-paid-reviews-on-amazon/. Accessed October 03 2019
Smithers R (2019) Facebook still flooded with fake reviews, says which? The Guardian. https://www.theguardian.com/business/2019/aug/06/facebook-fake-reviews-which. Accessed October 03 2019
Sun C, Du Q, Tian G (2016) Exploiting product related review features for fake review detection. Math Probl Eng 2016:1–7. https://doi.org/10.1155/2016/4935792
Article Google Scholar
Wahyuni ED, Djunaidy A (2016) Fake review detection from a product review using modified method of iterative computation framework. Proceed MATEC Web Confer 58:03003. https://doi.org/10.1051/matec
Article Google Scholar
Wang X, Liu K, Zhao J (2017) Handling cold-start problem in review spam detection by jointly embedding texts and behaviors. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp 366–376. https://doi.org/10.18653/v1/P17-1034
Wu Q, Ye Y, Zhang H, Ng MK, Ho SS (2014) ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowl-Based Syst 67:105–116. https://doi.org/10.1016/j.knosys.2014.06.004
Wu Y, Ngai EWT, Wu P, Wu C (2020) Fake online reviews: literature review, synthesis, and directions for future research. Decis Support Syst 132:113280. https://doi.org/10.1016/j.dss.2020.113280
Zhang D, Zhou L, Kehoe JL, Kilic IY (2016) What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews. J Manag Inf Syst 33(2):456–481. https://doi.org/10.1080/07421222.2016.1205907
Article Google Scholar
Zhang W, Du Y, Yoshida T, Wang Q (2018) DRI-RCNN: an approach to deceptive review identification using recurrent convolutional neural network. Inf Process Manag 54(4):576–592. https://doi.org/10.1016/j.ipm.2018.03.007
Zhu J, Zou H, Rosset S, Hastie T (2009) Multi-class AdaBoost. Stat Interface 2(3):349–360. https://doi.org/10.4310/SII.2009.v2.n3.a8

Download references

Acknowledgements

The first author would like to acknowledge financial support from the Indonesian Endowment Fund for Education (LPDP), Ministry of Finance, and the Directorate General of Higher Education (DIKTI), Ministry of Education and Culture, Republic of Indonesia.

Author information

Authors and Affiliations

School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, NSW, 2308, Australia
Gregorius Satia Budhi & Raymond Chiong
Informatics Department, Petra Christian University, Surabaya, 60236, Indonesia
Gregorius Satia Budhi
School of Cybersecurity, Chengdu University of Information Technology, Chengdu, 610225, China
Zuli Wang

Authors

Gregorius Satia Budhi
View author publications
You can also search for this author inPubMed Google Scholar
Raymond Chiong
View author publications
You can also search for this author inPubMed Google Scholar
Zuli Wang
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding authors

Correspondence to Raymond Chiong or Zuli Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Budhi, G.S., Chiong, R. & Wang, Z. Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features. Multimed Tools Appl 80, 13079–13097 (2021). https://doi.org/10.1007/s11042-020-10299-5

Download citation

Received: 05 December 2019
Revised: 30 September 2020
Accepted: 22 December 2020
Published: 13 January 2021
Issue Date: April 2021
DOI: https://doi.org/10.1007/s11042-020-10299-5

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Improving detection of untrustworthy online reviews using ensemble learners combined with feature selection

A class imbalance-aware review rating prediction using hybrid sampling and ensemble learning

Detection of Fake Reviews in Yelp Dataset Using Machine Learning and Chain Classifier Approach

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now