AI-generated music is already an innovative enough concept, but Riffusion takes it to another level with a clever, weird approach that produces weird and compelling music using not audio but images of sound.
It sounds strange, and it is strange. But if it works, it works. And it does work! Sort of.
Diffusion is a machine learning technique for generating images that has supercharged the AI world over the past year. DALL-E 2 and Stable Diffusion are the two most prominent of these models, and they work by gradually replacing visual noise with what the AI thinks a prompt should look like.
The method has proven powerful in many contexts and is very amenable to fine-tuning, where you give the mostly trained model a lot of a specific kind of content in order to have it specialize in producing more examples of that content. For example, you could fine-tune it on watercolors or photos of cars, and it would prove more capable of reproducing either of those things.
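For a sense of how such a model is driven in practice, here is a minimal sketch of text-to-image generation with a pretrained Stable Diffusion checkpoint, assuming the Hugging Face diffusers library (the library, model ID, and prompt are illustrative choices, not details from the article); fine-tuning on a specific kind of content would be a separate training step on top of this.

```python
# Minimal sketch: generate an image from a text prompt with a pretrained
# Stable Diffusion pipeline (assumes the Hugging Face diffusers library
# and a GPU; the model ID and prompt are illustrative choices).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline starts from pure noise and denoises it step by step toward
# something that matches the prompt.
image = pipe("a watercolor painting of an old car",
             num_inference_steps=50).images[0]
image.save("watercolor_car.png")
```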
What Seth Forsgren and Hayk Martiros did for their hobby project Riffusion was fine-tune Stable Diffusion on spectrograms.
“Hayk and I play together in a little band, and we started the project just because we love music and weren’t sure if it would even be possible for Stable Diffusion to create a spectrogram image with enough fidelity to be converted into audio,” Forsgren told TechCrunch. “At every step along the way we’ve been more and more impressed by what is possible, and one idea leads to the next.”
What are spectrograms, you ask? They are visual representations of audio that show the amplitude of different frequencies over time. You’ve probably seen waveforms, which display volume over time and make audio look like a series of hills and valleys; imagine if, instead of the total volume, they showed the volume of each frequency, from the low end to the high end.
Here’s one I made of part of a song (“Marconi’s Radio” by The Secret Machines, if you’re wondering):

Image Credits: Devin Coldewey
You can see how it gets louder in all frequencies as the song builds, and you can even spot individual notes and instruments if you know what to look for. The process is not inherently perfect or lossless, but it is an accurate and systematic representation of sound. And you can convert it back to sound by performing the same process in reverse.
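To make that round trip concrete, here is a rough sketch using the librosa library (this is not Riffusion’s own audio code; the file name and analysis parameters are placeholders):

```python
import numpy as np
import librosa
import soundfile as sf

# Load a short clip (the file name is a placeholder).
y, sr = librosa.load("clip.wav", sr=22050, mono=True)

# Magnitude spectrogram: rows are frequency bins, columns are time frames.
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Convert to decibels; this log-scaled image is what you would plot
# or hand to an image model.
S_db = librosa.amplitude_to_db(S, ref=np.max)

# Go back to audio. Griffin-Lim has to estimate the phase that the magnitude
# spectrogram discards, so the reconstruction is close but not lossless.
y_rec = librosa.griffinlim(S, hop_length=512)
sf.write("reconstructed.wav", y_rec, sr)
```

Because the magnitude spectrogram throws away phase, that reconstruction step is approximate, which is part of why spectrogram-based audio tends to sound a little lo-fi.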
Forsgren and Martiros made spectrograms of a bunch of music and tagged the resulting images with relevant terms, like “blues guitar,” “jazz piano,” “afrobeat,” stuff like that. Feeding the model this collection gave it a good idea of what certain sounds “look like” and how it might recreate or combine them.
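The training data itself is, in effect, just pairs of spectrogram images and short text tags. One plausible way to lay that out (a guess at the general shape, not the authors’ actual format) is an image folder with a caption metadata file:

```python
import json
from pathlib import Path

# Hypothetical layout: spectrogram PNGs already rendered into train/, each
# paired with the genre tag we want to use as its text prompt.
clips = [
    ("blues_guitar_001.png", "blues guitar"),
    ("jazz_piano_014.png", "jazz piano"),
    ("afrobeat_007.png", "afrobeat"),
]

# One common convention (e.g. a Hugging Face "imagefolder" dataset) is a
# metadata.jsonl file mapping each image file to its caption.
train_dir = Path("train")
train_dir.mkdir(exist_ok=True)
with open(train_dir / "metadata.jsonl", "w") as f:
    for file_name, caption in clips:
        f.write(json.dumps({"file_name": file_name, "text": caption}) + "\n")
```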
Here’s what the diffusion process looks like if you sample it while it refines the image:

Image Credits: Seth Forsgren / Hayk Martiros
And indeed, the model has proven capable of producing spectrograms that, when converted to sound, match prompts like “funky piano,” “jazzy sax,” and the like quite well. Here’s an example:

Image Credits: Seth Forsgren / Hayk Martiros
But of course, a square spectrogram (512 × 512 pixels, Stable Diffusion’s standard resolution) represents only a short clip; a three-minute song would be a much, much wider rectangle. No one wants to listen to music five seconds at a time, but the limitations of the system they had created meant they couldn’t just make a spectrogram 512 pixels tall and 10,000 wide.
After trying a few things, they took advantage of the fundamental structure of large models like Stable Diffusion, which have a lot of “latent space.” It’s a bit like the no man’s land between more well-defined nodes. For example, if you had an area of the model representing cats and another representing dogs, what’s “between” them is latent space that, if you simply told the AI to draw from it, would produce some kind of dog-cat or cat-dog, even though there’s no such thing.
Incidentally, latent space stuff gets a lot weirder than that:
No spooky nightmare worlds for Project Riffusion, though. Instead, they found that if you have two prompts, like “church bells” and “electronic beats,” you can kind of step from one to the other a bit at a time, and the result fades gradually and surprisingly naturally from one to the other, staying on the beat:
It’s a strange and interesting sound, though obviously not particularly complex or high fidelity; remember, they weren’t even sure diffusion models could do this at all, so the ease with which this one turns bells into beats, or typewriter taps into piano and bass, is pretty remarkable.
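The trick behind those transitions is interpolating between the two prompts’ representations rather than jumping from one to the other. A sketch of the general technique, spherical interpolation between two embedding vectors, might look like this (toy vectors stand in for the real text embeddings or latents; this is not Riffusion’s exact code):

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-7):
    """Spherical interpolation between vectors v0 and v1, with t in [0, 1]."""
    v0_unit = v0 / np.linalg.norm(v0)
    v1_unit = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0_unit, v1_unit), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:
        # Vectors are nearly parallel: plain linear interpolation is fine.
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

# Toy stand-ins for the embeddings of "church bells" and "electronic beats";
# in a real pipeline these would come from the text encoder (or be latent noise).
emb_bells = np.random.default_rng(0).normal(size=768)
emb_beats = np.random.default_rng(1).normal(size=768)

# Step between the two prompts; each intermediate embedding would be fed to
# the diffusion model to render one spectrogram clip, and the resulting clips
# are converted to audio and strung together.
steps = [slerp(t, emb_bells, emb_beats) for t in np.linspace(0.0, 1.0, num=8)]
```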
Producing longer clips is possible, but still theoretical:
“We didn’t really try to create a classic three-minute song with repeated choruses and verses,” Forsgren said. “I think it could be done with some clever tricks, like building a higher-level model for the song structure and then using the lower-level model for the individual clips. Alternatively, you could deeply train our model with much higher-resolution images of full songs.”
Where does it go from here? Other groups are attempting to create AI-generated music in a variety of ways, ranging from text-to-speech models to specially trained audio models like Dance Diffusion.
Riffusion is more of a “wow, check it out” demo than a grand plan to reinvent music, and Forsgren said he and Martiros are just happy to see people engaging with their work, having fun, and riffing on it:
“There are many directions we could take this from here, and we’re excited to keep learning along the way. It was fun to see other people already building their own ideas on top of our code this morning, too. One of the amazing things about the Stable Diffusion community is how quickly people build on top of things in directions the original authors can’t predict.”
You can test it out in a live demo on Riffusion.com, but you might have to wait a bit for your clip to render – it drew a bit more attention than the creators anticipated. The code is fully available through the About page, so feel free to run your own too, if you have the chips for it.