
Posts

May, 3

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, […]
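The tiling and shared-memory staging named above can be illustrated with a minimal, pure-Python sketch (an illustration of the general technique, not code from the ARGUS paper): multiply matrices block by block so that each staged tile of A and B is reused many times, which is what shared-memory staging buys on a GPU.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply dense matrices A (n x k) and B (k x m) tile by tile.

    The three innermost loops touch only one tile of A and B at a time,
    mimicking the staging of a tile into fast (shared) memory.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for p0 in range(0, k, tile):  # reduction-dimension tile
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

On real hardware the payoff comes from coordinating the tile size with the memory hierarchy, which is exactly the coupled-optimization problem the abstract describes.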
May, 3

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute hand-written CUDA or CUTLASS, demanding expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents target kernel generation in raw CUDA, forcing rediscovery of optimizations already encoded in mature libraries. […]
May, 3

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA’s closed-source userspace driver. As a result, the translation from high-level CUDA APIs to low-level hardware commands remains opaque, limiting both software understanding and performance attribution. This paper makes that […]
May, 3

A Human–Machine Collaborative Tuning Framework for Triton Kernel Optimization on SIMD Platforms

Single Instruction, Multiple Data (SIMD) technology enhances performance through parallel data processing on CPUs. SIMD platforms are widely adopted across domains ranging from high-performance computing to AI inference. As modern AI workloads increasingly rely on Python-based kernel frameworks to maintain usability and benefit from automatic tuning, Triton has emerged as a representative solution. However, Triton’s […]
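The machine half of such a human-machine tuning loop can be sketched as a search over candidate launch configurations: the human prunes the candidate list, the machine measures the rest and keeps the fastest. The config dictionaries and cost callable below are illustrative assumptions, not Triton's actual autotuning API.

```python
def autotune(run, configs):
    """Return (best_config, best_cost) over a list of config dicts.

    `run` executes the kernel under one configuration and returns its
    measured cost (e.g. milliseconds); the loop keeps the cheapest.
    """
    best_cfg, best_cost = None, float("inf")
    for cfg in configs:
        cost = run(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost
```

In practice `run` would launch and time a compiled kernel; here it is any cost function, which is what makes the human-supplied candidate pruning the interesting part.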
May, 3

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA’s CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs […]
Apr, 13

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, […]
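The evaluation-driven evolutionary loop with an archive of top candidates can be sketched in a few lines (a generic evolutionary-search skeleton under assumed interfaces, not Kernel-Smith's implementation): mutate a member of the archive, score the child with execution feedback, and keep only the best-scoring programs.

```python
import random

def evolve(seed, mutate, score, generations=20, archive_size=4, rng=None):
    """Evolutionary search: keep an archive of the best-scoring
    candidates and repeatedly mutate one of them to propose the next."""
    rng = rng or random.Random(0)
    archive = [(score(seed), seed)]
    for _ in range(generations):
        _, parent = rng.choice(archive)       # pick a parent from the archive
        child = mutate(parent, rng)           # propose a new candidate
        archive.append((score(child), child)) # score with execution feedback
        archive.sort(key=lambda t: t[0], reverse=True)
        del archive[archive_size:]            # keep only the top performers
    return archive[0][1]
```

In the kernel setting, `score` would compile and benchmark a candidate (returning a failure score for kernels that miscompile or produce wrong results), and `mutate` would be an LLM edit rather than a random perturbation.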
Apr, 13

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing […]
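The host-memory-centric pattern described above can be sketched as a toy: parameters live in a large "host" store, and each layer's weights are staged into a small transient buffer only for the duration of that layer's computation. The names (`host_params`, `device_buffer`) and the scalar layers are illustrative, not MegaTrain's data structures.

```python
def stream_forward(host_params, x):
    """Run a chain of scalar affine layers, staging one layer at a time.

    host_params maps layer id -> {"w": weight, "b": bias}; only one
    layer's parameters are resident in the 'device' buffer at once.
    """
    activations = [x]
    for layer_id in sorted(host_params):
        device_buffer = host_params[layer_id]            # stream one layer in
        x = device_buffer["w"] * x + device_buffer["b"]  # compute on 'device'
        activations.append(x)
        del device_buffer                                # release 'device' memory
    return x, activations
```

The point of the pattern is that peak device-memory use is bounded by one layer, not the whole model, at the cost of host-device transfer per layer.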
Apr, 13

DVM: Real-Time Kernel Generation for Dynamic AI Models

Dynamism is common in AI computation, e.g., dynamic tensor shapes and dynamic control flow in models. Existing runtime compilation degrades model efficiency because of its long compilation times, while offline compilers either suffer from the long compilation time and device-memory footprint needed to cover all possible execution instances of a […]
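One common way to tame shape dynamism (a general technique, not necessarily DVM's mechanism) is to specialize and cache a compiled kernel per observed shape, so recurring shapes pay the compilation cost only once:

```python
class ShapeCache:
    """Shape-specialized kernel cache: compile once per distinct shape."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn   # shape -> callable kernel
        self.cache = {}
        self.compilations = 0

    def run(self, shape, *args):
        kernel = self.cache.get(shape)
        if kernel is None:
            kernel = self.compile_fn(shape)   # slow path, once per shape
            self.cache[shape] = kernel
            self.compilations += 1
        return kernel(*args)                  # fast path thereafter
```

The abstract's tension is visible here: caching amortizes compilation only if shapes recur, and covering many shapes inflates the cache, which is why real-time generation for genuinely dynamic workloads is hard.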
Apr, 13

Agentic Code Optimization via Compiler-LLM Cooperation

Generating performant executables from high-level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high-level reasoning about a program’s purpose. Recent work has proposed […]
Apr, 13

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness […]
Mar, 26

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose […]
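A reinforcement-learning recipe for kernel generation needs a scalar reward per generated kernel; a common shape for it (an illustrative assumption, not necessarily DRTriton's reward) is zero for incorrect kernels and the measured speedup over the reference otherwise:

```python
def kernel_reward(correct, baseline_ms, candidate_ms):
    """Reward = speedup over the reference if correct, else 0.

    correct:      did the kernel pass numerical checks?
    baseline_ms:  reference implementation latency
    candidate_ms: generated kernel latency
    """
    if not correct or candidate_ms <= 0:
        return 0.0
    return baseline_ms / candidate_ms
```

Gating the reward on correctness keeps the policy from being rewarded for fast-but-wrong kernels, which is the failure mode sparse pass/fail feedback alone invites.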
Mar, 26

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton […]
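The Amdahl's-law ranking step mentioned above can be sketched directly from the formula: given each kernel's fraction f of total runtime and an assumed local speedup s from rewriting it, the whole-model speedup bound is S = 1 / ((1 - f) + f / s), and bottlenecks are ranked by that bound. The profile format and `assumed_speedup` parameter are illustrative, not AutoKernel's interface.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup if `fraction` of runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def rank_bottlenecks(profile, assumed_speedup=2.0):
    """profile: {kernel_name: runtime_fraction}.

    Returns kernel names sorted by descending whole-model impact, so the
    agent spends its optimization budget where Amdahl's law says it pays.
    """
    return sorted(profile,
                  key=lambda k: amdahl_speedup(profile[k], assumed_speedup),
                  reverse=True)
```

For a fixed assumed local speedup this ordering coincides with sorting by runtime fraction; the bound becomes informative when different kernels have different achievable speedups.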


* * *


HGPU group © 2010-2026 hgpu.org

All rights belong to the respective authors
