Tag
This paper demonstrates that careful preprocessing—especially context length selection, normalization, and regularization—can make simple linear models like Ridge regression competitive with or superior to large Transformer, MLP, and CNN models on time-series forecasting benchmarks.
This paper systematically investigates the optimal order of preprocessing techniques for sentiment analysis on Twitter data, finding that tokenisation is most impactful and spelling correction least, with the best order being tokenisation, cleaning, stemming, then stopword removal.
The author describes how running multiple persistent AI agent profiles under Hermes led to high API costs, and how tiered model policies per profile, input pre-processing, and an API gateway for cost visibility reduced daily costs from $14-18 to $7-10.
This paper introduces Triadic Suffix Tokenization (TST), a deterministic tokenization scheme that partitions digits into three-digit triads with explicit magnitude markers to improve numerical reasoning in large language models. The method addresses inconsistent number fragmentation in standard tokenizers by providing transparent order-of-magnitude relationships at the token level, with two implementation variants offering scalable vocabulary expansion.
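To make the triad idea in the last summary concrete, here is a minimal sketch of splitting an integer into three-digit groups with a magnitude marker attached to each. The summary does not specify the token format or the two variants, so the function name `triad_tokens` and the `<Mk>` marker syntax are assumptions chosen for illustration and should not be read as the paper's actual scheme.

```python
def triad_tokens(number: str) -> list[str]:
    """Split a non-negative integer string into triads with magnitude suffixes.

    Hypothetical format: each triad is followed by <Mk>, where k counts
    triads from the right (k=0 is units, k=1 is thousands, and so on).
    """
    if not number.isdigit():
        raise ValueError("expected a string of decimal digits")

    # Drop leading zeros, but keep a single "0" for the value zero.
    digits = number.lstrip("0") or "0"

    # Peel off three digits at a time from the right-hand side.
    groups = []
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    groups.reverse()

    # Attach the magnitude marker: the leftmost triad has the largest index.
    top = len(groups) - 1
    return [f"{group}<M{top - i}>" for i, group in enumerate(groups)]


print(triad_tokens("1234567"))
```

With this layout the same digit string always fragments the same way, and the order of magnitude of each group is visible in the token itself instead of depending on its position in the sequence.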