Tag
Summary of Lecture 19 on efficient distributed training for AI, covering data, pipeline, tensor, and sequence parallelism, with notes on memory and communication bottlenecks.
This paper introduces a parallelization strategy and an adaptive steering mechanism for the Baymex algorithm, which learns discretized Bayesian network classifiers for clinical data. It reports speedups of over 54x on a 16-core CPU and predictive performance comparable to or better than traditional models, while maintaining explainability.
The article describes five key workflow patterns for building agentic AI systems in enterprise settings, as summarized by Anthropic: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. It advises preferring simpler workflows before moving to full agents.
This paper presents a parallelization framework for counterfactual regret minimization (CFR) algorithms using linear algebra operations, achieving speedups of up to four orders of magnitude on GPU over CPU implementations.
OpenAI researchers discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training across a wide range of tasks. They found that more complex tasks and more powerful models tolerate larger batch sizes, suggesting future AI systems can scale further through increased parallelization.
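As a rough illustration of the metric in the last summary, the "simple" gradient noise scale is commonly written as B_simple = tr(Sigma) / |G|^2, where G is the true gradient and Sigma is the covariance of per-example gradients. The sketch below estimates it from a list of per-example gradients in plain Python. The function name `simple_noise_scale`, the synthetic gradients, and the bias correction are assumptions made for this example, not code from the summarized work.

```python
import random


def simple_noise_scale(per_example_grads):
    """Estimate B_simple = tr(Sigma) / |G|^2 from per-example gradients.

    per_example_grads: list of equal-length lists of floats, one per example.
    (Hypothetical helper for illustration only.)
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    mean = [sum(g[j] for g in per_example_grads) / n for j in range(dim)]

    # tr(Sigma): unbiased per-coordinate variance, summed over coordinates.
    trace_sigma = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_example_grads) / (n - 1)
        for j in range(dim)
    )

    # |mean|^2 overestimates |G|^2 by tr(Sigma) / n, so subtract that bias.
    grad_norm_sq = sum(m * m for m in mean) - trace_sigma / n
    return trace_sigma / grad_norm_sq


# Synthetic check: a known true gradient plus isotropic Gaussian noise.
rng = random.Random(0)
true_grad = [1.0, -2.0, 0.5]  # |G|^2 = 5.25
noise_std = 3.0               # tr(Sigma) = 3 * 9 = 27
grads = [[g + rng.gauss(0.0, noise_std) for g in true_grad] for _ in range(20000)]
estimate = simple_noise_scale(grads)
print(estimate)  # should land near 27 / 5.25
```

A larger estimate means the per-example gradients are noisier relative to the signal, which is the regime where larger batches are expected to keep helping.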