Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

arXiv cs.CL 04/21/26, 04:00 AM Papers

authorship-attribution nlp threat-intelligence bert stylometry japanese-nlp

Summary

A foundational study on applying stylometric authorship attribution to threat intelligence, using Japanese Rakuten reviews to compare TF-IDF+LR, BERT embedding, BERT fine-tuning, and metric learning methods. BERT-FT performed best overall, but TF-IDF+LR proved more stable and efficient when scaling to hundreds of authors.

arXiv:2604.16376v1. Abstract: This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and compared four methods: TF-IDF with logistic regression (TF-IDF+LR), BERT embeddings with logistic regression (BERT-Emb+LR), BERT fine-tuning (BERT-FT), and metric learning with $k$-nearest neighbors (Metric+kNN). Results showed that BERT-FT achieved the best performance; however, training became unstable as the number of authors scaled to several hundred, where TF-IDF+LR proved superior in terms of accuracy, stability, and computational cost. Furthermore, Top-$k$ evaluation demonstrated the utility of candidate screening, and error analysis revealed that boilerplate text, topic dependency, and short text length were primary factors causing misclassification.
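
As a rough illustration of the Top-$k$ evaluation mentioned in the abstract, the sketch below counts a prediction as correct when the true author appears among the $k$ highest-scoring candidates. It is not the paper's code: the function names, the author IDs, and the score values are all hypothetical, and the scores are assumed to come from any classifier that assigns one number per candidate author (for example, class probabilities from TF-IDF+LR).

```python
from typing import Dict, List, Sequence


def top_k_authors(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k candidate authors with the highest scores, best first."""
    # Sort by descending score; break ties by author ID so results are stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [author for author, _ in ranked[:k]]


def top_k_accuracy(
    all_scores: Sequence[Dict[str, float]],
    true_authors: Sequence[str],
    k: int,
) -> float:
    """Fraction of texts whose true author is among the top-k candidates."""
    if len(all_scores) != len(true_authors):
        raise ValueError("scores and labels must have the same length")
    if not all_scores:
        raise ValueError("need at least one scored text")
    hits = sum(
        truth in top_k_authors(scores, k)
        for scores, truth in zip(all_scores, true_authors)
    )
    return hits / len(all_scores)


# Hypothetical classifier scores for three reviews over four authors.
example_scores = [
    {"a1": 0.6, "a2": 0.2, "a3": 0.1, "a4": 0.1},
    {"a1": 0.3, "a2": 0.4, "a3": 0.2, "a4": 0.1},
    {"a1": 0.1, "a2": 0.2, "a3": 0.3, "a4": 0.4},
]
example_truth = ["a1", "a1", "a2"]  # hypothetical ground-truth authors

top1 = top_k_accuracy(example_scores, example_truth, k=1)
top2 = top_k_accuracy(example_scores, example_truth, k=2)
top3 = top_k_accuracy(example_scores, example_truth, k=3)
```

In this toy data the accuracy rises as $k$ grows (one, two, then all three reviews are covered), which is the sense in which a Top-$k$ list can serve as a candidate shortlist for an analyst even when the single best guess is wrong.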

Similar Articles

Fusing Stylometric and Embedding Systems to Estimate Authorship Likelihood Ratios in Japanese

arXiv cs.CL

This paper applies the likelihood ratio framework for forensic authorship attribution to Japanese texts, fusing stylometric features with embedding-based systems to improve discrimination and calibration.

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

arXiv cs.CL

This paper introduces a text-based causal inference methodology using an enhanced CausalBERT to disentangle the effects of individual aspects (e.g., school administration, academic performance) on overall online review ratings, validated on 600K+ U.S. K-12 school reviews. Key improvements include temperature scaling, hyperparameter optimization, and interpretability methods to reduce confounding bias.

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

arXiv cs.CL

This paper compares multiple machine learning and transformer models for sentiment classification on movie reviews, finding RoBERTa achieves 93.02% accuracy, and a soft voting ensemble improves performance.

READER: Robust Evidence-based Authorship Decoding via Extracted Representations

arXiv cs.AI

Introduces READER, a lightweight framework for dynamic black-box LLM provenance that uses a frozen proxy LLM to extract authorship evidence from responses and performs Bayesian evidence accumulation across multiple queries, achieving high accuracy on the Agent500 dataset.

LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance

arXiv cs.CL

This paper analyzes how different fine-tuning strategies (FFT, LoRA, quantized LoRA) and model scales affect LLM interpretive behavior for automated code compliance tasks using perturbation-based attribution analysis. The findings show FFT produces more focused attribution patterns than parameter-efficient methods, and larger models develop specific interpretive strategies with diminishing performance returns beyond 7B parameters.

