OpenAI Debuts Triton for GPU-Powered Neural Networks

AI research firm OpenAI has released Triton 1.0, an open-source, Python-like programming language that lets developers write GPU code for machine learning without deep CUDA expertise. The release comes just weeks after Microsoft unveiled the technical preview of GitHub Copilot, which is powered by technology from OpenAI.

OpenAI scientist Philippe Tillet first presented Triton two years ago, in an academic paper he wrote with his advisors, H. T. Kung and David Cox, while he was a graduate student at Harvard University. Triton 1.0 offers further optimizations designed to speed up enterprise machine learning projects.

Triton for AI

Unlike CUDA, Nvidia’s own GPU programming platform, where optimizing code by hand can be difficult, Triton performs many of the optimizations that AI code needs automatically, saving developers time.

In a post on the OpenAI blog, Tillet wrote: “Triton makes it possible to reach peak hardware performance with relatively little effort… it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS… in under 25 lines of code.”

The cuBLAS library implements GPU-accelerated basic linear algebra subroutines (BLAS) and is highly optimized for Nvidia GPUs. It is generally considered to deliver performance that many GPU programming experts cannot match with hand-written code.

Triton automates three key steps of GPU optimization: coalescing memory transfers from DRAM into large transactions that make full use of the memory bus width, managing data staged in SRAM so as to minimize shared memory bank conflicts, and partitioning and scheduling computations to maximize parallelism and make use of special-purpose arithmetic logic units (ALUs) such as tensor cores.

Tillet noted that doing these well can be challenging even for seasoned CUDA programmers with years of experience.
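
To make the first two of these optimizations concrete, the short Python sketch below models a single warp of 32 threads and counts how many memory transactions and bank conflicts a given access stride would cause. It is a toy model, not a description of any particular GPU: the 128-byte transaction size, the 32 four-byte-wide shared memory banks, and the function names dram_transactions and bank_conflict_degree are all assumptions made for illustration.

```python
# Toy model only: the sizes below are illustrative assumptions, not measurements.
WARP_SIZE = 32        # threads that issue a memory access together
WORD_BYTES = 4        # each thread reads one 32-bit word
SEGMENT_BYTES = 128   # assumed size of one DRAM transaction
NUM_BANKS = 32        # assumed number of shared memory banks, one word wide


def dram_transactions(stride_words):
    """Count the 128-byte segments touched when thread t reads word t * stride."""
    segments = {
        (t * stride_words * WORD_BYTES) // SEGMENT_BYTES
        for t in range(WARP_SIZE)
    }
    return len(segments)


def bank_conflict_degree(stride_words):
    """Return the largest number of distinct words that land in the same bank."""
    words_per_bank = {}
    for t in range(WARP_SIZE):
        word = t * stride_words
        # Threads reading the same word are served together, so track distinct words.
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride, dram_transactions(stride), bank_conflict_degree(stride))
```

Under these assumptions, a stride of 1 needs a single transaction and causes no conflicts, while a stride of 32 words needs 32 separate transactions and produces a 32-way bank conflict. Padding the stride to 33 removes the conflict, which is the kind of detail CUDA programmers typically have to tune by hand.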

“The purpose of Triton is to fully automate these optimizations so that developers can better focus on the high-level logic of their parallel code. Triton aims to be broadly applicable, and therefore does not automatically schedule work across [streaming multiprocessors] – leaving some important algorithmic considerations… to the discretion of developers.”
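
The tiling decisions that Tillet describes as left to developers can be illustrated without a GPU. The plain Python sketch below is not Triton code. It splits a matrix multiplication into independent output tiles, with each tile standing in for one program instance. The function names matmul_tile and tiled_matmul and the block sizes are hypothetical, chosen only for illustration.

```python
# Plain Python illustration of tiling; not Triton code and not optimized.
def matmul_tile(a, b, c, row0, col0, block_m, block_n, block_k):
    """Accumulate one block_m x block_n tile of c = a @ b, starting at (row0, col0)."""
    m, k, n = len(a), len(b), len(b[0])
    rows = range(row0, min(row0 + block_m, m))
    cols = range(col0, min(col0 + block_n, n))
    # Walk the shared dimension in chunks of block_k, adding each chunk's
    # partial products into the output tile.
    for k0 in range(0, k, block_k):
        for i in rows:
            for j in cols:
                acc = 0
                for p in range(k0, min(k0 + block_k, k)):
                    acc += a[i][p] * b[p][j]
                c[i][j] += acc


def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2):
    """Multiply two list-of-lists matrices one output tile at a time."""
    m, n = len(a), len(b[0])
    c = [[0] * n for _ in range(m)]
    # Each (row0, col0) pair plays the role of one independent program instance.
    for row0 in range(0, m, block_m):
        for col0 in range(0, n, block_n):
            matmul_tile(a, b, c, row0, col0, block_m, block_n, block_k)
    return c


if __name__ == "__main__":
    print(tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

In this sketch the choice of block sizes and the mapping of tiles to program instances belong to the developer, while the work inside each tile is the part that a compiler such as Triton aims to optimize automatically.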

With Triton, even teams without extensive CUDA experience can write GPU code that runs efficiently. They can also shorten development time, since far less effort goes into hand-optimizing the code.

“Our goal is for it to become a viable alternative to CUDA for Deep Learning. [Triton] is for machine learning researchers and engineers who are unfamiliar with GPU programming despite having good software engineering skills,” Tillet told ZDNet.

Standard CPUs and AMD GPUs are not currently supported, though Tillet says community contributions to address this limitation are welcome. Triton is open source and available on GitHub.
