YuE: An Open-Source AI Music Generator by M-A-P and HKUST

Recently, the Chinese-American research collective Multimodal Art Projection (M-A-P) released an open-source AI music generator. The project, titled “YuE: Open Music Foundation Models for Full-Song Generation,” was developed in collaboration with the Hong Kong University of Science and Technology (HKUST). The name is a play on the Chinese character 乐 (樂), which means “music” when read yuè and “happiness” when read lè.

YuE generates complete songs, including both vocals and instrumental accompaniment, from supplied lyrics, and covers a wide range of genres, languages, and vocal techniques. The demo songs remain coherent even over several minutes, although the output is currently mono only; competing AI music services such as Udio and Suno produce stereo audio.

Unlike those cloud services, YuE runs offline on local hardware, but it requires substantial computational resources. Generating a 30-second audio clip takes about 150 seconds on an Nvidia H800 and around 360 seconds on a GeForce RTX 4090. For full songs, the developers recommend at least 80 GB of GPU memory, which is only available on data-center accelerators such as the H800 or A100, or by combining several RTX 4090 cards. For shorter segments such as a verse plus chorus, 24 GB of VRAM should suffice.
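
Those figures correspond to real-time factors of roughly 5x (H800) and 12x (RTX 4090). A short back-of-the-envelope sketch, using only the numbers quoted above and assuming throughput scales linearly with clip length (an assumption; longer contexts may generate more slowly), shows what that means for a full song:

```python
# Rough generation-time estimates derived from the figures quoted above.
# The per-GPU rates are the only inputs; everything else is arithmetic.

BENCHMARKS = {
    "H800": 150 / 30,       # seconds of compute per second of audio
    "RTX 4090": 360 / 30,
}

def estimate(seconds_of_audio: float) -> None:
    """Print the estimated wall-clock time per GPU for a clip of given length."""
    for gpu, rtf in BENCHMARKS.items():
        wall = seconds_of_audio * rtf
        print(f"{gpu}: {seconds_of_audio:.0f} s of audio ≈ {wall / 60:.1f} min "
              f"(real-time factor {rtf:.0f}x)")

estimate(30)    # the benchmark clip
estimate(180)   # a typical three-minute song
```

Under that linear assumption, a three-minute song would occupy an RTX 4090 for roughly 36 minutes of wall-clock time.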

YuE’s models are based on Meta’s LLaMA architecture and were trained in three stages to ensure scalability, musicality, and controllability through lyrics. A semantically enhanced audio tokenizer was used to reduce training costs. M-A-P has released versions with 1 and 7 billion parameters covering English, Chinese (Mandarin and Cantonese), Japanese, and Korean, along with an upsampler model that brings the generated music to CD quality at 44.1 kHz.
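
Because the stage-1 model is a LLaMA-style causal language model, it can in principle be driven with standard Hugging Face tooling. The sketch below is purely illustrative: the checkpoint ID, prompt format, and token budget are placeholders, not YuE’s actual interface, and the project’s own inference scripts should be used in practice.

```python
# Illustrative sketch only: the checkpoint ID and prompt format below are
# placeholders, not YuE's real interface. The project ships its own
# inference scripts, tokenizer, and a separate upsampler stage.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "m-a-p/YuE-stage1-placeholder"  # hypothetical checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Stage 1: condition the language model on genre tags and lyrics; it emits
# discrete audio tokens from the semantically enhanced codec vocabulary.
prompt = "[genre] jazz, female vocal\n[lyrics] Moonlight spills across the floor"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
audio_tokens = model.generate(**inputs, max_new_tokens=2048)

# Stage 2 (not shown) refines these coarse tokens, and the released
# upsampler model converts the result to 44.1 kHz audio.
```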

Numerous demo songs are available on the project page, covering different styles in English, Chinese, Japanese, and Korean. Examples include an English rap track and a jazz piece in which the AI begins to “improvise” at the end, once the lyrics run out, as well as a song that code-switches between Korean, English, and Japanese.

The models are available for free download on GitHub and can be used in commercial projects, provided the songs are credited as generated with M-A-P’s AI. Musicians and creatives are encouraged to reuse and monetize works produced with YuE.

Recently, the developers enhanced the models with “in-context learning,” which lets YuE adopt the style of a reference song; as a demonstration, they produced a Billie Eilish imitation singing about OpenAI. Future updates are to include BPM control and a user-friendly interface. By porting the models to GGML, the tensor library for machine learning, the M-A-P team hopes to reduce memory requirements.
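
Most of the hoped-for savings from a GGML port would come from weight quantization. A rough estimate (the 7-billion-parameter figure comes from the model sizes above; the precision levels are generic GGML-style options, not a confirmed YuE roadmap) illustrates the effect on weight memory alone:

```python
# Back-of-the-envelope weight-memory footprint of a 7B-parameter model at
# different precisions. Activations, KV cache, and overhead are ignored,
# so real requirements are higher; this only shows why quantization helps.
PARAMS = 7e9

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit (GGML-style)", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{label:>20}: ~{gib:.1f} GiB for weights alone")
```

The roughly fourfold reduction from fp16 (about 13 GiB) to 4-bit weights (about 3.3 GiB) shows why such a port could bring generation closer to consumer GPUs.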

The developers are aiming for a breakthrough in AI music generation comparable to what Stable Diffusion achieved for image generation and Meta’s LLaMA for language models. To improve the models and extend them to more languages, the YuE team is seeking support, including partners to create and curate training data for fine-tuning and to evaluate results.

The researchers plan to publish a scientific paper on YuE soon; for now, only an abstract and an overview graphic are available on the project page.