Posts
Dec, 15
Accurate Models of NVIDIA Tensor Cores
Matrix multiplication is a fundamental operation in for both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput over the software-based matrix multiplication, the multipliers are increasingly used outside of AI, to accelerate various applications in scientific computing. However, matrix […]
Dec, 15
TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization
High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with concise code, achieving expert-level performance still requires deep understanding of GPU architectures and low-level performance trade-offs. We present TritonForge, a profiling-guided framework for automated Triton […]
Dec, 15
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically […]
Dec, 15
Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
Training large language models (LLMs) efficiently requires a deep understanding of how modern GPU systems behave under real-world distributed training workloads. While prior work has focused primarily on kernel-level performance or single-GPU microbenchmarks, the complex interaction between communication, computation, memory behavior, and power management in multi-GPU LLM training remains poorly characterized. In this work, we […]
Dec, 15
Accelerating Molecular Simulations with Triton: Fused GPU Kernels for TensorNet Neural Potentials
Molecular dynamics (MD) simulations are essential for understanding molecular behavior in biology and chemistry, but remain computationally expensive at the scales required for drug discovery and materials design. Machine learning force fields (MLFFs), particularly TensorNet-based architectures, have shown promise in accelerating simulations while maintaining physical accuracy. However, these models still face significant performance bottlenecks in […]
Dec, 7
Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles
Ensuring the correctness of compiler optimizations is critical, but existing fuzzers struggle to test optimizations effectively. First, most fuzzers use optimization pipelines (heuristics-based, fixed sequences of passes) as their harness. The phase-ordering problem can enable or preempt transformations, so pipelines inevitably miss optimization interactions; moreover, many optimizations are not scheduled, even at aggressive levels. Second, […]
Dec, 7
tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection
We present tritonBLAS, a fast and deterministic analytical model that uses architectural parameters like the cache hierarchy, and relative code and data placement to generate performant GPU GEMM kernels. tritonBLAS explicitly models the relationship between architectural topology, matrix shapes, and algorithmic blocking behavior to predict near-optimal configurations without runtime autotuning. Based on this model, we […]
Dec, 7
Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis
As GPU architectures rapidly evolve to meet the overcoming demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA’s Blackwell (B200) generation introduce significant architectural advances including the 5th generation tensor cores, tensor memory (TMEM), decompression engine (DE), and dual chips; however systematic methodologies for […]
Dec, 7
hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware
We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of deep learning […]
Dec, 7
Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels
Machine-learning (ML) applications frequently utilize high-performance ML kernels to execute tensor operations like matrix product and softmax. An ML kernel can be decomposed into two components: the implicit algorithm, which defines the tensor operation that computes the output tensor, and the schedule, which defines how the operation is implemented. The schedule of an ML kernel […]
Nov, 30
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate […]
Nov, 30
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small […]

