TikTok Comment Sentiment Analysis Using TextCNN

Project Overview

This project is a Final Project that implements a TextCNN to perform sentiment analysis on TikTok comments in Indonesian language. The system is designed to classify comments into two categories: Cyberbullying (insult/embarrass content comments) and Non-Cyberbullying (normal/clear content comments).

Institution: Institut Teknologi Sumatera (ITERA)
Study Program: Informatics Engineering
Author: Nikola Arinanda
Year: 2026


Abstract

This project uses TextCNN architecture to analyze YouTube comments sentiment. Initially, the data will go through preprocessing stages such as data division (80:20), case folding, text cleaning, augmentation (AEDA, random swap character, random delete character), tokenization and stopword removal to prepare the data. The next step is model training using k-fold cross-validation as many as 5 fold to ensure robustness and good generalization. Final step is model evaluation using confussion matrix such as accuracy, precision, recall and F1 score.


Project Structure

tugas-akhir-main/
├── dataset/
│ ├── k_fold.json           # k-fold cross-validation dictionary
│ └── cyberbullying.csv     # Original dataset
├── code/
│ ├── datareader.py         # Data loader and preprocessing
│ ├── model.py              # Model architecture
│ └── train.py              # Main script for model training
├── model_outputs/
│ ├── run_YYYYMMDD_HHMMSS/
│ │ ├── fold_1_model.pth    # Model output
│ │ ├── fold_2_model.pth    # ...
│ │ ├── fold_3_model.pth
│ │ ├── fold_4_model.pth
│ │ ├── fold_5_model.pth
│ │ └── ...
│ └── ...
└── report/
│ └── thesis.pdf            # Documentation and reports
└── requirements.txt        # Python dependencies

Environment Setup

Prerequisites

This project requires:

  • Python: 3.8 or higher (tested with Python 3.9+)
  • CUDA: Optional (for GPU acceleration)

System Requirements

  • RAM: Minimum 8 GB (recommended 16 GB)
  • Storage: Minimum 10 GB (for model and dataset)
  • GPU: Optional, but highly recommended for faster training

Dependencies

All dependencies are listed in the requirements.txt file. Main libraries:

| Library        | Version  | Purpose                              |
|----------------|----------|--------------------------------------|
| torch          | >=2.0.0  | Deep learning framework              |
| pandas         | >=1.5.0  | Data manipulation                    |
| numpy          | >=1.23.0 | Numerical computing                  |
| matplotlib     | >=3.7.0  | Data visualization                   |
| seaborn        | >=0.12.0 | Statistical visualization            |
| scikit-learn   | >=1.2.0  | Machine learning utilities           |
| transformers   | >=4.30.0 | NLP models (IndoBERT, etc.)          |
| nltk           | >=3.8.0  | Text preprocessing                   |
| tqdm           | >=4.65.0 | Progress bar                         |
| wandb          | >=0.15.0 | Experiment tracking                  |

For the complete list, see requirements.txt


Installation & Setup

Step 1: Clone Repository

git clone https://github.com/nikolaarinanda/tugas-akhir.git
cd tugas-akhir

Step 2: Create Virtual Environment

It is highly recommended to use a virtual environment to avoid dependency conflicts. Using venv (built-in python):

# Linux/Mac
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate

Using conda:

conda create -n youtube-sentiment python=3.9
conda activate youtube-sentiment

Step 3: Install Dependencies

# Upgrade pip to the latest version
pip install --upgrade pip

# Install all requirements
pip install -r requirements.txt

Note for PyTorch with GPU: If you want to use GPU, install the CUDA-specific version of PyTorch:

# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Dataset Information

The dataset consists of TikTOk comments in Indonesian language which comes from this research with labels:

  • Cyberbullying (-1): Insult/embarrass content comments
  • Non-cyberbullying (1): Normal/clear content comments

Dataset Format

The dataset cyberbullying.csv has the following columns:

Column Type Description
sentiment Integer label (-1 cyberbullying, 1 non-cyberbullying)
comment String Comment content

Model Architecture

TextCNN Text Classifier

SEDepthwise TextCNN Text Classifier

Key Components:

  • Embedding: Converts token IDs into dense vectors (IndoBERT tokenizer compatible)
  • Transpose: Adjusts tensor shape for Conv1D input (embedding_dim → channel dimension)
  • Depthwise Separable Convolution:
    • Depthwise Conv (kernel sizes = 3, 4)
    • Pointwise Conv (channel mixing)
  • Activation: ReLU for non-linearity
  • Pooling: Global Max Pooling to extract dominant features
  • Concatenation: Combines features from multiple convolution branches
  • SE Block (Squeeze-and-Excitation): Channel-wise attention to recalibrate feature importance
  • Output Layer: Fully connected layer for classification (num_classes)

How to Run

1. Training With Default COnfiguration

py train.py

Result:

  • Create fold indices as manys as 5 fold using k-fold cross-validation (if it doesn’t exist yet)
  • Training model Training model on 5 folds sequentially
  • Save the training result model in model_outputs/run_YYYYMMDD_HHMMSS/
  • Metrics plot and model checkpoints in Wandb

2. Training dengan custom parameter

python train.py \
    --max_length 128 \
    --dropout 0.3 \
    --batch_size 50 \
    --optimizer_name Muon \
    --embed_dim 100 \
    --conv_filters 50 \
    --kernel_size 3 4 \
    --epochs 100 \
    --lr 5e-4 \

Command Line Arguments

Argument Type Default Description
–seed int 01012001 Random seed for reproducibility
–dataset_path str ’../dataset/cyberbullying.csv’ Path to dataset file
–max_length int 128 Maximum sequence length
–tokenizer str ‘indobenchmark/indobert-base-p1’ Tokenizer name
–dropout float 0.5 Dropout rate
–batch_size int 50 Batch size for embedding
–embed_dim int 100 Embedding dimension for CNN
–num_classes int 2 Number of classes
–conv_filters int 50 Number of filters for CNN
–kernel_size int [3, 4] Kernel sizes for CNN
–n_folds int 5 Fold number for cross-validation
–epochs int 100 Number of epochs
–lr float 52-4 Learning rate
–output_model flag True Save model after training
–output_dir str ‘model_outputs’ Directory to save model outputs
–use_wandb flag False Enable Weights & Biases logging
–wandb_group str ‘Light TextCNN’ Create group for Weights & Biases runs
–wandb_note str ‘Light TextCNN Note’ Add Weights & Biases notes
–patience int 5 Patience for early stopping (epochs to wait after no improvement)

Output and Results

Output Structure

 model_outputs/
├── run_YYYYMMDD_HHMMSS/
│ ├── fold_1_model.pth    # Model output
│ ├── fold_2_model.pth    # ...
│ ├── fold_3_model.pth
│ └── fold_4_model.pth
│ └── fold_5_model.pth
│ └── ...
└── ...

Wandb Key Matrics

The model produces the following metrics:

  • Accuracy: Percentage of correct predictions
  • Precision: Accuracy for positive predictions
  • Recall: Ability to find all positive samples
  • F1-Score: Harmonic mean of precision and recall
  • Loss: Cross-entropy loss

Data Augmentation

Augmentation techniques are applied during training to improve robustness:

  • AEDA (An Easy Data Augmentation): augmentation that works by inserting punctuation marks “.”, ”;”, ”?”, ”:”,”!”,”,” randomly into the text
  • Random Swap Character: Randomly swap positions of two words in the text
  • Random Delete Character: Randomly delete character from the text\
  • Augmentation Probability: Default 0.5 (50% of data is selected for augmentation)

Example:

Original: "makannya segentong buset"
AEDA: "makannya segentong buset!"
Random swap Character: "makannya segetnong buset"
Random delete Character: "makanya segentong buset"

Troubleshooting

Issue: CUDA out of memory

If you encounter issues CUDA out of memory:

# Reduce batch size
python train.py --batch_size 8

# Reduce embedding dimension
python train.py --embedding_dim 64

Issue: Module not found

If you encounter issues module not found:

# Make sure virtual environment is activated
# Reinstall dependencies
pip install -r requirements.txt --force-reinstall

Issue: Dataset file not found

If you encounter issues dataset file not found:

  • Make sure dataset_youtube_comment.xlsx is in the root directory
  • Check file permissions (must be readable)

Issue: Transformers model cache

If you encounter issues downloading IndoBERT:

# Manual download
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')"

How to Cite

If you use or adapt this code/model in your research or publication, please use one of the following citation formats:

BibTeX Format

@thesis{arinanda2026cyberbullying,
  title={Sentiment Analysis of Cyberbullying Comments on TikTok Social Media Using TextCNN Architecture},
  author={Nikola Arinanda},
  year={2026},
  school={Institut Teknologi Sumatera (ITERA)},
  type={Final Project},
  address={Lampung, Indonesia}
}

APA Format

Arinanda, N. (2026). Sentiment Analysis of Cyberbullying Comments on TikTok Social Media Using TextCNN Architecture [Final Project]. Institut Teknologi Sumatera (ITERA).

MLA Format

Arinanda, Nikola. “Sentiment Analysis of Cyberbullying Comments on TikTok Social Media Using TextCNN Architecture.” Final Project, Institut Teknologi Sumatera (ITERA), 2026.

Chicago Format

Arinanda, Nikola. “Sentiment Analysis of Cyberbullying Comments on TikTok Social Media Using TextCNN Architecture.” Final Project, Institut Teknologi Sumatera (ITERA), 2026.

IEEE Format

N. Arinanda, “Sentiment Analysis of Cyberbullying Comments on TikTok Social Media Using TextCNN Architecture”, Final Project, Institut Teknologi Sumatera (ITERA), 2026.


Author Information

  • Name: Nikola Arinanda
  • Study Program: Informatics Engineering
  • Institution: Institut Teknologi Sumatera (ITERA)
  • Year: 2026
  • Github: larinand

Contact and Support

For questions or issues about this project, please:

  1. Create an issue on the GitHub repository
  2. Contact the author via university email
  3. See documentation in the laporan/ folder

Last Update: April 2026 Status: Active Development