Groq AI Chip Delivers Blistering Inference
- By Paul Mah
- February 21, 2024
AI hardware startup Groq has built a new chip designed to deliver blistering inference performance for large language models (LLMs).
It achieves this with a processing unit known as the Tensor Streaming Processor (TSP), designed to deliver deterministic performance for AI computations while eschewing GPUs entirely.
In comparison, GPUs are optimized for parallel graphics processing across a large number of cores; a single Nvidia H100, for instance, packs 14,592 CUDA cores.
Faster than before
Groq calls the resulting chip a “Language Processing Unit”, or LPU.
“The LPU's architecture is a departure from the SIMD (Single Instruction, Multiple Data) model used by GPUs and favors a more streamlined approach that eliminates the need for complex scheduling hardware. This design allows every clock cycle to be utilized effectively, ensuring consistent latency and throughput,” explained Jay Scambler, the managing director of AI firm Kapstone Bridge, in a LinkedIn post.
The LPU is more efficient partly because it does away with the overhead of managing multiple threads and avoids leaving cores underutilized. That frees up compute capacity, allowing sequences of text to be generated much faster.
Crucially, TSPs can also be linked together without the traditional bottlenecks of GPU clusters, so performance scales linearly as more LPUs are added.
According to a report on Tom’s Hardware, Groq says users are already running LLMs through its engine and API at speeds up to 10 times faster than GPU-based alternatives. When I tested Groq, I got over 300 tokens per second – faster than GPT-4 on ChatGPT.
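For readers who want to gauge throughput themselves, the sketch below times a single request against an OpenAI-compatible chat completions endpoint and divides the completion tokens by the elapsed wall-clock time. The endpoint URL, model name, and GROQ_API_KEY environment variable are assumptions based on common OpenAI-compatible conventions; check Groq's documentation for the current values.

```python
# Rough sketch: measure generation throughput against an
# OpenAI-compatible chat completions endpoint (Groq exposes one).
# The URL, model name, and GROQ_API_KEY are assumptions -- verify
# against Groq's current documentation before running.
import os
import time
import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
API_KEY = os.environ["GROQ_API_KEY"]  # hypothetical environment variable

payload = {
    "model": "mixtral-8x7b-32768",  # example model name; may differ
    "messages": [
        {"role": "user", "content": "Explain what an LPU is in about 200 words."}
    ],
}

start = time.time()
resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
elapsed = time.time() - start

data = resp.json()
completion_tokens = data["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"~ {completion_tokens / elapsed:.0f} tokens/second")
```

Note that this times the full round trip, including network latency, so it will slightly understate the raw generation speed of the hardware itself.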
The search for faster AI inference
Faster inference can alleviate the AI hardware crunch that we are experiencing as large technology firms and governments scramble to get their hands on GPUs.
Others are searching for new ways to speed up inference. As I wrote last year, Laurence Liew, the director of AI Singapore, and his team have been experimenting with neuroscience-based techniques and algorithms on that front.
According to Liew, initial efforts have shown promise, with up to a 10-fold performance improvement for basic LLMs. LLM inference at a tenth of the cost and 100 times the performance on modern-day CPUs could well be possible, he noted.
Groq currently supports standard machine learning (ML) frameworks such as PyTorch, TensorFlow, and ONNX for inference; it does not yet support ML training. It runs Meta’s Llama 2, Mixtral 8x7B, and Mistral 7B, and can be accessed for free here.
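As a general illustration of the framework side of that pipeline (not Groq's own toolchain, which is not detailed here), a PyTorch model can be exported to the ONNX interchange format with the standard torch.onnx.export call. The model and file names below are hypothetical.

```python
# Generic illustration: producing an ONNX graph from a PyTorch model.
# This is only the framework-side step; compiling the result for any
# particular accelerator goes through that vendor's own tools.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A stand-in model; any traceable PyTorch module works the same way."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 128)  # example input shape used for tracing

torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",                 # hypothetical output file
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size
)
```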
Image credit: iStockphoto/Adrian Vidal
Paul Mah
Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.