Apple has started a project in collaboration with Nvidia to accelerate inferencing in large language models (LLMs). These models generate text by predicting, token by token, how a sequence continues. Inferencing means running a pre-trained model to produce output, typically on AI accelerators.
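To make that concrete, here is a minimal sketch of token-by-token (autoregressive) generation with greedy decoding, the setting used in the benchmarks below. The tiny vocabulary and `toy_model` are purely illustrative stand-ins, not code from Apple or Nvidia:

```python
# A minimal sketch of autoregressive (token-by-token) LLM inference.
# `toy_model` is a hypothetical stand-in for a real LLM: given a token
# sequence, it returns a score for every token in a tiny vocabulary.

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def toy_model(tokens: list[int]) -> list[float]:
    # Illustrative scoring only: always favor the next token in the vocabulary.
    nxt = (tokens[-1] + 1) % len(VOCAB)
    return [1.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def generate(prompt: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)  # one full forward pass per token
        next_token = max(range(len(scores)), key=scores.__getitem__)  # greedy: take the argmax
        tokens.append(next_token)
        if VOCAB[next_token] == "<eos>":
            break
    return tokens

print([VOCAB[t] for t in generate([1], 4)])  # ['the', 'cat', 'sat', 'on', 'mat']
```

The loop illustrates why inference is expensive: each new token costs a full forward pass through the model, and the passes cannot be parallelized because every step depends on the previous one.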
In November, Apple released open-source software called Recurrent Drafter, or ReDrafter, on GitHub. Nvidia has already integrated ReDrafter into its own TensorRT-LLM framework. According to Nvidia, it is an “innovative speculative decoding technique” that helps developers significantly boost workload performance on Nvidia GPUs.
Tests on a production model with tens of billions of parameters showed that ReDrafter combined with TensorRT-LLM increases tokens generated per second by a factor of 2.7 under greedy decoding. According to Apple, “The benchmark results indicate that this technology could significantly reduce latency perceived by users.” The approach also saves compute and energy.
Speculative decoding, as Nvidia describes it, accelerates LLM inference by generating multiple tokens in parallel: “Smaller ‘draft’ modules are used to predict future tokens, which are then verified by the main model.” Output quality remains unchanged, “while response times, especially during low traffic, are significantly reduced,” and available hardware is utilized more efficiently.
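The following hedged sketch shows one way this draft-and-verify loop can work under greedy decoding, reusing `VOCAB` and `toy_model` from the earlier example. It is a simplified illustration of the general technique, not ReDrafter’s actual algorithm (ReDrafter uses a recurrent draft head, and real systems verify all draft positions in a single batched forward pass rather than one at a time):

```python
def greedy_next(model, tokens: list[int]) -> int:
    # Pick the highest-scoring next token (greedy decoding).
    scores = model(tokens)
    return max(range(len(scores)), key=scores.__getitem__)

def speculative_generate(target, draft, prompt: list[int],
                         max_new_tokens: int, k: int = 4) -> list[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes up to k future tokens.
        proposal = []
        for _ in range(k):
            proposal.append(greedy_next(draft, tokens + proposal))
        # 2. The large target model verifies each proposed position.
        accepted: list[int] = []
        for tok in proposal:
            expected = greedy_next(target, tokens + accepted)
            if tok == expected:
                accepted.append(tok)        # draft guessed correctly
            else:
                accepted.append(expected)   # correct the mismatch and stop
                break
        else:
            # Every draft token matched, so the target's own next-token
            # prediction comes along as a free bonus token.
            accepted.append(greedy_next(target, tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens]

# With a draft that always agrees with the target, each verification round
# yields k + 1 tokens instead of one.
print([VOCAB[t] for t in speculative_generate(toy_model, toy_model, [1], 4)])
```

The key property is that the accepted sequence is exactly what the target model alone would have produced under greedy decoding, which is why output quality is unchanged; the speedup comes from accepting several tokens per expensive target-model round whenever the draft guesses well.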
Apple emphasizes that alongside its server-side work with Nvidia GPUs, it is also working on accelerating LLM inference on Apple Silicon devices. The iPhone maker, like competitors Meta and OpenAI, appears to rely heavily on Nvidia technology for training its own LLMs. The work of Apple’s AI team is likely to benefit the rest of the industry as well: on open-source models, ReDrafter reportedly achieved up to 3.5 tokens per generation step, more than previous speculative decoding methods.
The latest version of the TensorRT-LLM framework bundles the necessary drafting and validation logic in a single engine, minimizing overhead. The collaboration with Apple has made TensorRT-LLM “more powerful and flexible,” according to Nvidia.