This article explores using different AI models as unpredictable opponents in games, specifically a Baseball Manager game. The author tests 8 models and finds they exhibit different decision-making patterns, suggesting that model origin and training influence behavior, enabling varied AI personalities for more engaging gameplay.
I am working on a Baseball Manager game. One of the things I want to incorporate is AI agents as opponents.

One major issue I see in single-player games is that the opponents are predictable. Because of this, almost everyone eventually figures the game out. You know how to play the opponent to win, and the game is effectively solved.

The usual solution is multiplayer. Human opponents are unpredictable, sometimes brilliantly so, sometimes horribly so. However, human players bring their own issues. The biggest is probably reliability. You can't start a multi-season football game and trust that others won't drop out after 2 seasons when their team doesn't do well. You also have to wait days for people to take their turns. And that doesn't even touch the toxicity found in many multiplayer games.

I believe the solution is to let AI agents take the opponent's spot in a game. Once you have AI agents in a game, your opponents are no longer predictable. And if you play a team-based multiplayer game like League of Legends, an AI agent would be the perfect teammate. No more random players on your team who do the opposite of what they should, but teammates who know how to play and listen to instructions.

To test this I ran a scenario with 8 different AI models. I sent the following prompt to each model 4 times:

> an old-school baseball bench coach character with full identity (career history, personality tags, relationships, anti-examples), publicly overruled by his manager on national TV. Four decision options: decline (refuse comment), measured (diplomatic statement), shade (subtle undermining), open (direct criticism).

That is a summary; the actual prompt was much longer.
|Model|Origin|Measured|Shade|Decline|Open|
|:-|:-|:-|:-|:-|:-|
|Llama 3.1 8B Q8|Meta (US)|3|1|0|0|
|DeepSeek-R1 14B|DeepSeek (CN)|3|1|0|0|
|Mistral|Mistral (EU)|1|3|0|0|
|Claude Haiku 4.5|Anthropic (US)|4|0|0|0|
|Claude Sonnet 4.5|Anthropic (US)|1|0|3|0|
|Claude Opus 4.7|Anthropic (US)|3|0|1|0|
|Copilot (GPT-4 family)|Microsoft (US)|4|0|0|0|
|Gemini (web chat)|Google (US)|format failure 0/4|n/a|n/a|n/a|

Five different decision distributions across the 7 models that produced valid output (8 tested). Same prompt, same character, same scenario.

Things I noticed:

* Mistral inverted the distribution. Being EU/French-trained, it leans "principled-assertive": it reads "principled man stands up for himself" more readily than the American- and Chinese-trained models, which read "respect the office."
* Haiku 4.5 was the most consistent at measured. Its emphasis on cautious/professional output shows up as 4-for-4 measured.
* Sonnet 4.5 surfaced a decision category that no smaller model picked in 16 prior runs. With larger reasoning capacity, Sonnet identified that "the play worked" + "I said I wouldn't undermine to the press" + "my word means something" combine into principled silence. The smaller models treated those constraints as flexible.
* Opus 4.7 split 3 measured / 1 decline. Even with more capacity than Sonnet, Opus didn't lock onto the same path; it saw both as legitimate and varied contextually. Bigger model ≠ deeper character lock; bigger model = more capable of seeing all legitimate options.
* Copilot matched Haiku exactly. Different provider, similar objective (cautious-professional), similar behavior. Training objective matters as much as training-data nationality.
* Gemini failed format compliance in 4/4 runs. Important caveat: this was the consumer web chat, not the API. The web product has middleware (safety filters, possibly ad/promo injection) that the API path doesn't. The API likely behaves very differently. Methodology lesson: test the surface you'll deploy.
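For anyone who wants to repeat this kind of test, here is a rough Python sketch of the bookkeeping: classify each reply as one of the four options or a format failure, tally per model, and count how many distinct distributions show up. The counts are copied from the table above. The `DECISION: <option>` reply format and the helper names (`parse_decision`, `tally`, `group_by_distribution`) are assumptions made for illustration, not the format or code actually used in the test.

```python
from collections import Counter
from typing import Optional

OPTIONS = ("measured", "shade", "decline", "open")

# Counts copied from the results table (4 runs per model).
# Gemini is left out because all 4 of its runs failed format compliance.
RESULTS = {
    "Llama 3.1 8B Q8": {"measured": 3, "shade": 1},
    "DeepSeek-R1 14B": {"measured": 3, "shade": 1},
    "Mistral": {"measured": 1, "shade": 3},
    "Claude Haiku 4.5": {"measured": 4},
    "Claude Sonnet 4.5": {"measured": 1, "decline": 3},
    "Claude Opus 4.7": {"measured": 3, "decline": 1},
    "Copilot (GPT-4 family)": {"measured": 4},
}


def parse_decision(reply: str) -> Optional[str]:
    """Return the chosen option, or None on a format failure.

    Assumes (hypothetically) the prompt asked the model to put
    'DECISION: <option>' on a line of its own.
    """
    for line in reply.splitlines():
        if line.strip().upper().startswith("DECISION:"):
            choice = line.split(":", 1)[1].strip().lower()
            return choice if choice in OPTIONS else None
    return None


def tally(replies):
    """Count decisions over repeated runs; unparseable replies are failures."""
    counts = Counter()
    for reply in replies:
        decision = parse_decision(reply)
        counts[decision if decision is not None else "format_failure"] += 1
    return counts


def group_by_distribution(results):
    """Group models that produced exactly the same decision counts."""
    groups = {}
    for model, counts in results.items():
        vector = tuple(counts.get(option, 0) for option in OPTIONS)
        groups.setdefault(vector, []).append(model)
    return groups


groups = group_by_distribution(RESULTS)
print(len(groups), "distinct distributions")
for vector, models in groups.items():
    print(dict(zip(OPTIONS, vector)), models)
```

With only 4 runs per model, a 3/1 versus 1/3 split can easily be noise, so a script like this is mostly useful for rerunning the same scenario with many more samples per model.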
What I learned from this is that you can use different models as different personalities that make different choices. So as opponent A you can take an American-thinking AI agent, as opponent B a French-thinking AI agent, and as opponent C a Chinese-thinking AI agent.

Has anyone tested cross-model decision variance more carefully? I'm curious what holds up with a larger number of models.