Text Embedding and TF-IDF + N-gram Models to Improve the Performance of Binary Classifier Algorithms in Fake SMS Classification

Authors

  • Sutriawan Universitas Muhammadiyah Bima
  • Siti Mutmainnah Universitas Muhammadiyah Bima
  • Teguh Ansyor Lorosae Universitas Muhammadiyah Bima
  • Sahrul Ramadhan Universitas Muhammadiyah Bima

DOI:

https://doi.org/10.53513/jursi.v4i1.10582

Keywords:

SMS classification, Naive Bayes, TF-IDF, Word2Vec, binary classifier algorithms

Abstract

As SMS usage grows, detecting fake (spam) SMS has become a challenge for keeping communication secure. Text classification algorithms such as Naive Bayes, Logistic Regression, and Random Forest perform differently depending on the text feature representation used. This study evaluates the performance of binary classifier algorithms on fake SMS classification using three feature representations: TF-IDF, TF-IDF + N-gram, and Word2Vec. The algorithms tested are Naive Bayes, Logistic Regression, Random Forest, and Decision Tree, evaluated with accuracy, precision, recall, and F1-score. The results show that Naive Bayes with TF-IDF reached an accuracy of 91.26%, while Random Forest with Word2Vec reached 89.08%. Logistic Regression with TF-IDF + N-gram produced lower results. These findings underline the importance of choosing an appropriate feature representation to improve the accuracy of fake SMS classification.
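As a rough illustration of the TF-IDF + N-gram representation and the four evaluation metrics named in the abstract, the following is a minimal standard-library sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, and the toy messages are all assumptions made for the example.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return word n-grams of length 1..n_max (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, n_max=2):
    """Map each document to {n-gram: raw count * smoothed IDF}.

    Assumed IDF variant: ln((1 + N) / (1 + df)) + 1, without length
    normalization. The paper may use a different weighting.
    """
    tokenized = [word_ngrams(doc, n_max) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for grams in tokenized:
        doc_freq.update(set(grams))  # count each n-gram once per document
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [
        {g: count * idf[g] for g, count in Counter(grams).items()}
        for grams in tokenized
    ]


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy messages and labels (1 = fake/spam, 0 = legitimate).
messages = ["free prize claim now", "see you at noon"]
vectors = tfidf_vectors(messages, n_max=2)
print(sorted(vectors[0].items()))
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

Setting `n_max=1` reduces the features to plain unigram TF-IDF, so the same sketch covers the first two representations compared in the study. In practice the resulting vectors would be fed to a classifier such as Naive Bayes or Logistic Regression, which is outside the scope of this example.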

Author Biographies

Sutriawan, Universitas Muhammadiyah Bima

Computer Science

Siti Mutmainnah, Universitas Muhammadiyah Bima

Computer Science

Teguh Ansyor Lorosae, Universitas Muhammadiyah Bima

Computer Science

Sahrul Ramadhan, Universitas Muhammadiyah Bima

Computer Science

Published

2025-01-15