A machine learning system that detects toxic comments on YouTube using Natural Language Processing (NLP) and three classification algorithms: Logistic Regression, Random Forest, and Naive Bayes.
- The dataset used is Jigsaw Toxic Comment Classification (Kaggle)
- The dataset consists of 21,825+ comments labeled across 6 categories: Toxic, Severe Toxic, Obscene, Threat, Insult, Identity Hate
- Python 3.x
- Pandas & NumPy - Data manipulation
- Matplotlib & Seaborn - Data visualization
- NLTK - Text preprocessing
- Scikit-learn - Machine Learning
- Google Colaboratory - Development environment
- Data Loading & Exploration
  - Load dataset and analyze distribution
  - Visualize toxicity patterns
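
A minimal sketch of the loading and distribution step, assuming the Kaggle CSV layout with a `comment_text` column and one 0/1 column per label. The inline rows are made-up placeholders so the snippet runs without the dataset; in practice `pd.read_csv` would point at the downloaded training file.

```python
import io

import pandas as pd

# Assumed column names, following the Kaggle Jigsaw training CSV layout.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Placeholder rows standing in for the real file (not actual dataset content).
SAMPLE_CSV = io.StringIO(
    "comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "thanks for the helpful video,0,0,0,0,0,0\n"
    "you are an idiot,1,0,0,0,1,0\n"
    "great explanation,0,0,0,0,0,0\n"
)

df = pd.read_csv(SAMPLE_CSV)

# Number of positive examples per label.
label_counts = df[LABELS].sum()

# Share of comments carrying at least one label.
toxic_share = (df[LABELS].sum(axis=1) > 0).mean()

print(label_counts)
print(f"Share of comments with any label: {toxic_share:.2f}")
```

The per-label counts can then be plotted, for example with `label_counts.plot(kind="bar")`, to see how unevenly the labels are distributed.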
- Data Preprocessing
  - Text cleaning (remove URLs, special characters)
  - Stopword removal
  - Tokenization
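
The preprocessing steps above can be sketched as follows. This is an illustration under assumptions, not the project's exact code: `clean_comment` is a hypothetical helper, the stopword set is a tiny stand-in for NLTK's English list, and tokenization is a plain whitespace split. The sketch tokenizes before filtering stopwords, since the filter operates on tokens.

```python
import re
from typing import List

# Small illustrative stopword set (an assumption). The project uses NLTK,
# whose English stopword list is considerably larger.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "at", "this", "you"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_comment(text: str) -> List[str]:
    """Lowercase, strip URLs and special characters, tokenize, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs
    text = NON_LETTER_RE.sub(" ", text)  # remove digits, punctuation, symbols
    tokens = text.split()                # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


print(clean_comment("Check this out!!! https://example.com You are the WORST :("))
# → ['check', 'out', 'worst']
```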
- Feature Engineering
  - TF-IDF vectorization
  - Convert text to numerical features
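
As a rough illustration of the TF-IDF step, the sketch below uses scikit-learn's `TfidfVectorizer` on three placeholder comments. The `max_features` cap is an assumed value; the project's actual vectorizer settings are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder comments, already cleaned (not taken from the dataset).
comments = [
    "thanks helpful video",
    "worst video ever",
    "helpful clear explanation",
]

# max_features is an assumed cap on vocabulary size.
vectorizer = TfidfVectorizer(max_features=5000)

# Sparse matrix: one row per comment, one column per vocabulary term.
X = vectorizer.fit_transform(comments)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of `X` is the numerical representation of one comment and can be passed directly to a scikit-learn classifier.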
- Model Training
  - Logistic Regression
  - Random Forest Classifier
  - Naive Bayes
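
One way to train the three models side by side is sketched below. Several things here are assumptions: the task is reduced to a single binary toxic/non-toxic label, Naive Bayes is taken to be `MultinomialNB`, the hyperparameters are illustrative, and the training texts are made-up placeholders rather than dataset rows.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data (not from the dataset): 1 = toxic, 0 = not toxic.
texts = [
    "you are an idiot", "stupid worthless idiot", "shut up you moron",
    "what a stupid moron", "thanks for the great video", "very helpful explanation",
    "great tutorial thanks", "clear and helpful video",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hyperparameters are assumptions, not the project's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": MultinomialNB(),
}

# Fit one TF-IDF + classifier pipeline per model.
fitted = {}
for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(texts, labels)
    fitted[name] = pipeline

for name, pipeline in fitted.items():
    print(name, pipeline.predict(["you stupid idiot", "helpful video thanks"]))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted only on training text, which avoids leaking test data into the features.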
- Model Evaluation
  - Accuracy comparison
  - Confusion matrices
  - Performance metrics
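
The evaluation metrics listed above can be computed with `sklearn.metrics`, as in this sketch. The labels and predictions are placeholders chosen for illustration, not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder labels and predictions (not project results): 1 = toxic, 0 = not toxic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"Accuracy: {accuracy:.2f}")
print(cm)
# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred, target_names=["non-toxic", "toxic"]))
```

When toxic comments are a small minority, accuracy alone can look high even if few toxic comments are caught, so the per-class precision and recall in the report are worth checking alongside it.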
- Best Model: Logistic Regression
- Accuracy: 95%+
- Training Time: <10 seconds