Using diffusion models to generate audio is not a new idea. Yet the approach has been of limited use for music generation because diffusion models are typically trained to produce output of a fixed size: a model trained on 20-second audio pieces can only produce 20-second pieces. Real-world applications call for music of varying durations, and when training audio is arbitrarily truncated to fit the model's window, the resulting segments often cut off in the middle of a musical phrase.
Stable Audio is a new latent diffusion model presented by Stability AI. During training, the model is conditioned not only on text metadata but also on the duration and start time of the audio files. Thanks to this extra conditioning, the model can produce music of a specified length, up to a certain limit. Specifically, when the system is trained on a piece of audio, it keeps track of two things: the point in time where that segment starts within the original recording and the total length of the original audio file. This timing information is turned into embeddings and fed into the model, which lets users dictate how long the output will be.
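As a rough illustration of this timing conditioning, the sketch below turns a crop's start time and the source file's total length into learned embeddings that could be appended to the text conditioning. All names and dimensions here are illustrative assumptions, not Stable Audio's actual API.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Hypothetical module: maps (seconds_start, seconds_total) to
    embeddings that can be concatenated with the text embeddings
    before conditioning the diffusion model."""

    def __init__(self, embed_dim: int = 768, max_seconds: int = 512):
        super().__init__()
        # one learned embedding per discrete second value
        self.start_embed = nn.Embedding(max_seconds, embed_dim)
        self.total_embed = nn.Embedding(max_seconds, embed_dim)

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor):
        # returns shape (batch, 2, embed_dim): one token for the crop's
        # start time, one for the original file's total length
        return torch.stack(
            [self.start_embed(seconds_start), self.total_embed(seconds_total)],
            dim=1,
        )

# During training, a fixed-length crop is taken from a longer file, and the
# crop's position plus the file's length become the conditioning signal:
total_len_s = 190   # e.g. a 190-second song
crop_start_s = 30   # the crop happens to start 30 seconds in
cond = TimingConditioner()(
    torch.tensor([crop_start_s]), torch.tensor([total_len_s])
)
print(cond.shape)  # torch.Size([1, 2, 768])
```

At inference time, the same embeddings can be set to a start time of 0 and the desired output duration, which is how the user's length request reaches the model.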
The core of Stable Audio is a diffusion model with over 900 million parameters that iteratively removes noise from the input, guided by the text and timing embeddings, to generate high-quality output. It is designed to work efficiently even with longer sequences, making it a practical tool for audio generation.
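The "cleaning up" at the core of any diffusion model is an iterative denoising loop: start from pure noise and repeatedly ask the network for a cleaner estimate, with the conditioning embeddings steering every step. The sketch below is a deliberately crude generic sampler, not Stability AI's actual sampling procedure; the shapes and the simple update rule are assumptions for illustration.

```python
import torch

def sample(denoiser, cond, steps: int = 50, latent_shape=(1, 64, 1024)):
    """Generic diffusion sampling sketch: `denoiser` predicts the noise
    present in latent `x` at timestep `t`, conditioned on `cond`
    (text + timing embeddings). Real samplers use carefully derived
    update rules; this uses a naive fixed-size step for clarity."""
    x = torch.randn(latent_shape)                  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((latent_shape[0],), i)      # current timestep index
        noise_pred = denoiser(x, t, cond)          # model's noise estimate
        x = x - noise_pred / steps                 # crude denoising step
    return x                                       # latent to decode to audio

# Usage with a stand-in denoiser (a real one would be the 900M-parameter model):
dummy_denoiser = lambda x, t, cond: 0.1 * x
latent = sample(dummy_denoiser, cond=None, steps=10)
```

In a latent diffusion system like Stable Audio, the result of this loop is not a waveform but a compressed latent, which a separate decoder turns back into audio; operating in latent space is what keeps longer sequences tractable.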
Stable Audio offers three pricing plans: a no-cost option allowing individuals to generate up to 45 seconds of audio across 20 tracks monthly; a Professional package priced at $11.99 per month, which accommodates the creation of 500 tracks of no more than 90 seconds; and an Enterprise plan with negotiable pricing. It should be noted that commercial use of the audio produced with the free tier is prohibited.
In the future, Stability AI's research lab plans to release open-source models based on Stable Audio.
Summary
The Purpose
To be able to produce music of different lengths
The Idea
Train diffusion models on music pieces of various lengths with timing conditioning
Further Possibilities
1. Music-to-text translation to help people understand music
Reverse the technology so that classical music comes with a textual description.
2. Use the new technology for customizing music therapy
Users may personalize the music used in therapy.
3. Personalize music in videos, on demand
Users may change the music in videos on the fly.
4. Create audio branding elements for marketing
Create unique jingles for marketing campaigns.
5. AI DJ
Develop an AI DJ that can read the room’s mood and adjust the music accordingly during live events.
Questions
1. What are the consequences of an explosion in music composition?
2. How would this technology be used in video production?
3. How might you use text-to-music in your life?