Google’s new AI model can generate entirely new music from text prompts. Here’s what the results sound like.
Listen to Episode 1 of this 3-part Science, Quickly fascination here.
Transcript
Allison Parshall: As a kid, I was something of a composer. Please brace yourself for Mozart-level brilliance.
[CLIP: Parshall family home video of toddler Allison playing a tiny keyboard and singing, mostly gibberish]
Parshall: As you can hear, I was a lyrical genius. And I did stick with it.
In high school I made a song entirely with homemade instruments that involved water, including hitting half-full wine glasses with chopsticks and water drumming—you know, when you get in the bathtub in a swimsuit and have your mom hold a microphone while you smack the surface of the water?
[CLIP: Sample of Allison’s weird water composition from high school in the background]
Water drumming.
But no amount of amateur water drumming arranged in GarageBand could be even half as weird—and fun—as some of the music that AI can make these days.
[CLIP: Theme music]
Welcome back to Science, Quickly. I’m Allison Parshall. This is part two of our three-part series on music-making artificial intelligence.
In the first episode, we met the winner of the 2022 AI Song Contest, who used machine learning to bring Western instruments into the world of Thai tuning.
Today we’re diving way deeper into that technology. And—as it turns out—scientists are on the cusp of something gigantic in music AI.
Shelly Palmer: The MusicLM product, it’s so impressive. What it is is so impressive…. the amount of things that have to be true for this to be what it is are unbelievable, like, unbelievable.
Parshall: That was Shelly Palmer, a composer, and he’s talking about MusicLM, a music-making AI model published in January by Google Research. The “LM” in its name stands for “language model” because it harnesses some of the advances in AI language processing to create music. And boy, does it create music.
You might remember Dall-E 2, that AI tool that can turn your weird text prompts into surreal images totally unconstrained by reality.
This is like that—except it spits out music.
[CLIP: Soundscape of digital music from MusicLM samples]
Parshall: To understand why this is so cool, we’re going to have to step back in time a bit.
The desire to outsource some of the creative process to algorithms is old, older than AI, even older than computers.
Akito van Troyer: People were definitely already thinking about algorithmic music before computers existed, right?
Parshall: That’s Akito van Troyer, a professor at the Berklee College of Music who teaches a class on machine learning and music.
Van Troyer: For example, Mozart composed a piece of music he called “Dice Music.” What he did was to compose a melodic motif, a bunch of them. And he will choose which motifs to stitch together based on throwing a dice…
[CLIP: Sample from one possible playing of Wolfgang Amadeus Mozart’s Musikalisches Würfelspiel version #1]
…and that’s algorithmic music because you’re determining the result of the music based on the throwing of the dice kind of situation, right?
[CLIP: Sample from one possible playing of Wolfgang Amadeus Mozart’s Musikalisches Würfelspiel version #2]
And then computers just came to be the ideal platform to experiment on. As soon as computers were accessible to musicians they were already using it.
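As a rough illustration of the dice procedure Van Troyer describes, here is a minimal Python sketch. It assumes two six-sided dice and a table of precomposed motifs for each slot in the piece. The motif labels, the number of slots and the function names are hypothetical placeholders, not Mozart’s actual measures.

```python
import random

# Hypothetical motif table: for each slot in the piece, eleven precomposed
# motifs, one for each possible total of two dice (2 through 12).
# The labels are placeholders, not Mozart's actual measures.
NUM_SLOTS = 8
MOTIF_TABLE = [
    {total: f"slot{slot}-motif{total}" for total in range(2, 13)}
    for slot in range(NUM_SLOTS)
]

def roll_two_dice(rng):
    """Return the total of two six-sided dice."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def compose(rng):
    """Pick one motif per slot according to the dice."""
    return [MOTIF_TABLE[slot][roll_two_dice(rng)] for slot in range(NUM_SLOTS)]

piece = compose(random.Random(0))
print(piece)
```

The composer still writes every motif by hand; the dice only decide the order in which they are stitched together.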
Parshall: Brad Garton was one of those computer musicians back in the early days, when computers were the size of rooms and whatnot. He’s a composer and former director of Columbia University’s Computer Music Center.
Garton: I learned how to program on punch cards. People don’t even remember what those are. We’d type in these lines painstakingly by hand and then submit them to the mainframe computer, an IBM 3081, on campus. I think it had two [kilobytes] of RAM, or something. [laughs]
And then, uh, we would have to wait for two hours. We’d get a special tape; we’d take it over to the computer lab, where they had the special hardware to turn it into sound. Then you’d realize you’d mistyped one parameter, and your masterpiece sounded like this [sound of a click]. And that was it [laughs].
And it was really fun—I mean, because a lot of times, I wouldn’t quite know what I would get. But I’d listen to it and go, “Whoa, that’s pretty cool. It’s got a better sense of rhythm than I do!” [laughs]
Parshall: Eventually, making computer music became easier and quicker. AI musicians moved from punch cards to MIDI files, a digital file format that contains sequences of musical notes in text form.
MIDI stands for “musical instrument digital interface,” and it’s a pretty limited way to represent music. It strips out the richness and expressivity of a performance since it includes just the names of notes and when they’re played. It can turn a song that sounds like this …
[CLIP: A recording of a grand piano playing an expressive piece]
… into this.
[CLIP: A MIDI piano version of the same piece, sounding robotic and unexpressive]
So MIDI is restrictive, but it worked well enough with these early AI models. What you would do was: you’d give a model the text versions of a whole bunch of music, and it would pick up patterns to generate text versions of new music, which could be converted into sound using programs like, say, GarageBand.
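To make the “pick up patterns, then generate” idea concrete, here is a minimal Python sketch. It is not how MuseNet or any model mentioned here works. It swaps in a first-order Markov chain, a much simpler technique that only remembers which note tends to follow which. The training melodies and function names are invented for illustration.

```python
import random
from collections import defaultdict

# Toy "text version" of music: note names only, echoing the transcript's
# description of MIDI. These melodies are made up for illustration.
training_melodies = [
    ["C4", "D4", "E4", "C4", "D4", "E4", "G4"],
    ["E4", "G4", "E4", "D4", "C4", "D4", "E4"],
]

def learn_transitions(melodies):
    """Record which notes follow which in the training melodies."""
    transitions = defaultdict(list)
    for melody in melodies:
        for current, following in zip(melody, melody[1:]):
            transitions[current].append(following)
    return transitions

def generate(transitions, start, length, rng):
    """Build a new note sequence by sampling observed continuations."""
    notes = [start]
    while len(notes) < length:
        options = transitions.get(notes[-1])
        if not options:  # dead end: this note was never followed by anything
            break
        notes.append(rng.choice(options))
    return notes

transitions = learn_transitions(training_melodies)
new_melody = generate(transitions, "C4", 8, random.Random(1))
print(new_melody)
```

The output is still just a list of note names, so it would need a separate program to turn it into sound.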
Up until the past few years, this was still the most efficient way of making AI-generated music. One relatively recent model called MuseNet was made public in 2019 by OpenAI, the makers of the now famous chatbot ChatGPT.
With MuseNet, you could make unlikely mashups, such as this example.
[CLIP: MuseNet improvises from Wolfgang Amadeus Mozart’s Rondo alla Turca in the style of Lady Gaga]
Parshall: You can kind of hear how robotic it sounds—that’s because it’s using MIDI. But soon after MuseNet was published, the OpenAI team got to work changing that. The researchers’ next project was an AI composer that could actually work with raw audio files, not just their text-based approximations.
Christine McLeavey: In theory, raw audio can sound like anything. It doesn’t even have to be only music. You could literally do any sound effects you can imagine.
[CLIP: Saw cranes at Målsjön in Kristdala, Sweden]
Parshall: That’s Christine McLeavey, a pianist—and one of the architects of OpenAI’s music models.
McLeavey: If you think of a good quality recording, it’s maybe, like, 44 kilohertz, which is, like, 44,000 samples per second. It’s a lot—a lot—of information, and you have to get each one of those numbers correct. There’s just so many ways you can get off by a little bit.
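The scale McLeavey is describing is easy to check with a little arithmetic. The sketch below uses the rounded 44,000 figure from the interview and assumes a single (mono) channel. The clip lengths are arbitrary examples.

```python
# Rounded figure from the interview; CD audio uses 44,100 samples per second.
SAMPLE_RATE_HZ = 44_000

def samples_needed(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of amplitude values in a mono recording of the given length."""
    return seconds * sample_rate_hz

twelve_seconds = samples_needed(12)   # a short clip
three_minutes = samples_needed(180)   # a typical song length
print(twelve_seconds)  # 528000
print(three_minutes)   # 7920000
```

Every one of those numbers has to be generated, which is why working on raw audio is so much heavier than working on a list of notes.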
Parshall: In order to work with these massive files, engineers like McLeavey need to compress all that information down into a small space. Their approach isn’t that different from how you compress a file on your computer.
Basically, they break up the audio into small pieces they call “tokens,” like breaking a sentence into words. Each token contains important information about a portion of the audio waveform compressed in a smaller amount of space.
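Here is a toy Python sketch of the tokenizing idea, under heavy assumptions. The real systems learn their compressed representations from data. This version just chops a made-up signal into fixed-size frames and replaces each frame with the index of the nearest entry in a tiny hand-picked “codebook” of loudness levels. The frame size, codebook and test signal are all hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8_000   # low rate to keep the example small (hypothetical)
FRAME_SIZE = 400         # samples per token: 50 ms at 8,000 samples per second
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]  # hand-picked loudness levels

def frame_loudness(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def tokenize(samples):
    """Replace each full frame with the index of its nearest codebook entry."""
    tokens = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        level = frame_loudness(samples[start:start + FRAME_SIZE])
        nearest = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - level))
        tokens.append(nearest)
    return tokens

# One second of a made-up signal: a 440 Hz tone that fades in.
samples = [
    (i / SAMPLE_RATE_HZ) * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE_HZ)
    for i in range(SAMPLE_RATE_HZ)
]
tokens = tokenize(samples)
print(len(samples), "samples ->", len(tokens), "tokens:", tokens)
```

Each token here keeps only a crude summary of its frame, so a great deal is thrown away. A learned tokenizer aims to keep enough information to rebuild audio that is worth listening to.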
McLeavey: If you’re a teacher who’s going to give a lecture, usually you wouldn’t write down literally every single word you’re going to say….you have a sort of higher-level plan of what’s going to happen.
And in the same way, we’re kind of doing that to the music. We’re trying to generate this, like, higher-level representation first, and then, from that, then generate the, the sort of fleshed-out version, where we get the full sound that we can listen to.
Parshall: Honestly, this part of the technology is like magic to me. But it works.
The resulting program, called Jukebox, came together through a few months of trial and error.
This is the algorithm in August 2019, trying its best to make a pop song.
[CLIP: Jukebox sample entitled “Hints of Pop”]
Then, one month later, after being trained on a wider diversity of music, it could do an okay-ish Bob Marley impression.
[CLIP: Jukebox sample in the style of Bob Marley]
A few months after that, the researchers improved the quality of the audio and prompted the model with the first 12 seconds of “Despacito,” letting it run wild with the rest.
[CLIP: Jukebox sample: “Despacito” continuation]
Then, in the pinnacle of their achievement, they primed it with 12 seconds of everyone’s favorite song and set it loose on the rest of the lyrics.
[CLIP: Jukebox sample: continuation of “Never Gonna Give You Up,” by Rick Astley]
McLeavey: Working with raw audio, the sky is the limit in terms of what you can create. Literally, you could say, “What if I take this whale sound …
[CLIP: whale call]
… and blend it with, I don’t know, the Beatles?”
Parshall: Flash forward, now, this year: Google publishes MusicLM, that new music language model, where you can probably type, “Whale song blended with the Beatles” into a text box and get a really unique piece of music out of it.
I’d do just that and play it for you, but unfortunately, MusicLM is not available to the public yet. Still, the samples that I got from its publication were enough to set my head spinning.
Palmer: The MusicLM product is early days.
Parshall: That’s Shelly Palmer again.
Palmer: When it’s right, it’s kinda right. When it’s wrong, it’s really wrong.
Parshall: I asked Shelly to walk me through some of the highlights and lowlights of the samples.
Palmer: Main soundtrack of an arcade game….
[CLIP: MusicLM sample: arcade game]
Palmer: Fast-paced, upbeat, catchy guitar riff. That’s pretty much spot on for what they say it’s supposed to be.
Rising synth playing an arpeggio with a lot of reverb is backed by pads … Let’s listen.
[CLIP: MusicLM sample: reverb arpeggio]
Palmer: It’s not doing the first thing it was asked for. Rising synth playing an arpeggio, but there’s no arpeggio. It’s delightful. It’s got the reverb they’re talking about. The pads are backing it, so yeah ….
Let’s see what this is … [a] slow-tempo, bass and drums, reggae song.
[CLIP: MusicLM sample: reggae]
Palmer: So, that’s actually exactly backwards from a reggae rhythm.
This one is …
[CLIP: MusicLM sample: swing]
Palmer: So, they call that swing—except that that’s not swing. This is “swung,” but it’s not swing.
Parshall: [laughs]
Palmer: Again, it’s early days. I’m not… It’s so impressive. What it is is so impressive. There’s no way that I’m going to sit here and take a shot at these guys. Oh, my goodness. This is, like, amazing. And I know where it’s going, and I’m excited about it, I really am. I don’t see anything to be scared of.
The first time I was this excited musically… and it was the first time I heard a fully synthesized piece of music.
Parshall: I have to agree—it’s hard not to get excited listening to these. My personal favorites are the painting examples where they fed the AI a written description of famous paintings. Here we’ve got Salvador Dalí’s The Persistence of Memory, aka his melting clocks painting.
[CLIP: MusicLM sample: The Persistence of Memory Narrator: “His melting-clock imagery mocks the rigidity of chronometric time. The watches themselves look like soft cheese—indeed, by Dali’s own account, they were inspired by hallucinations after eating Camembert cheese. In the center of the picture, under one of the watches, is a distorted human face in profile. The ants on the plate represent decay.”]
Parshall: Honestly, the Dalí painting feels appropriate, given how surreal listening to these samples seems. And it’s only possible because the Google team figured out how to compress raw audio even further than OpenAI’s Jukebox could.
Those little tokens, the ones that represent pieces of the music—typically, each would capture about 50 milliseconds of audio, about this much time.
[CLIP: Fifty-millisecond bleep]
Jesse Engel: What MusicLM does, on top of that, is uses an even more coarse representation ….
Parshall: That’s Jesse Engel, one of the engineers who designed MusicLM.
Engel: …at the level of, like, three seconds or so.
[CLIP: Three-second bleep]
And that’s one that we can control in all sorts of new kinds of ways. You can use it to look at a certain piece of music and say, okay, these are the very high-level features of this music on the level of seconds. Now, “Please generate some more music that has some of these types of features.”
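The difference between the two time scales mentioned above is easier to feel as a count. This sketch takes the approximate 50-millisecond and three-second figures from the interview and applies them to a hypothetical 30-second clip. The clip length and function name are assumptions for illustration.

```python
def tokens_for(clip_ms, token_ms):
    """How many whole tokens of a given duration fit in a clip."""
    return clip_ms // token_ms

CLIP_MS = 30_000  # a hypothetical 30-second clip

fine = tokens_for(CLIP_MS, 50)       # about 50 ms per token
coarse = tokens_for(CLIP_MS, 3_000)  # about 3 s per token
print(fine, coarse, fine // coarse)  # prints: 600 10 60
```

At the coarser level, the model has 60 times fewer items to keep track of when describing the overall shape of the music.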
Parshall: You may notice that the audio quality is, admittedly, still pretty rough.
Engel: They are, in essence, the same thing that happens when you take a guitar and you stick it up to an amplifier, and it causes feedback.
[CLIP: Feedback]
Engel: We’re taking the outputs of the model, and we’re sticking it back into the inputs of the model…. It’s just feedback that, instead of a single tone, is turning into music in these beautiful ways.
But when it fails, it does sort of actually kind of sound a bit like feedback and has all these different tones sort of take on unnatural sorts of timbres to them.
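The feedback loop Engel describes, in which outputs go back in as inputs, is often called autoregressive generation. Below is a minimal Python sketch of just the loop. The `next_token` rule is a hypothetical stand-in for a trained model: a fixed bit of arithmetic chosen so the example runs, with no musical knowledge in it.

```python
def next_token(context):
    """Hypothetical stand-in for a trained model: a fixed rule on the last two tokens."""
    return (context[-1] + context[-2]) % 12  # 12 values, chosen arbitrarily

def generate(prompt, num_new_tokens):
    """Feedback loop: each output is appended and becomes input for the next step."""
    sequence = list(prompt)
    for _ in range(num_new_tokens):
        sequence.append(next_token(sequence))
    return sequence

result = generate([0, 7], 6)
print(result)  # [0, 7, 7, 2, 9, 11, 8, 7]
```

Because every step builds on earlier outputs, an early mistake is carried forward, which fits Engel’s point that failures can sound like feedback.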
Parshall: But you can already hear how the quality has gotten better in just the past few years.
McLeavey: For Jukebox, the audio quality itself was okay but not amazing. And in some ways, that’s actually why I always felt okay with it as a musician, because I was like, “Okay, this is, like, good enough to be cool and interesting and maybe spark people’s creativity but not quite good enough to, like, feel threatening as a musician.”
Parshall: But that reality seems to be changing—and quick.
Palmer: It’s early, they just … this is the sample, for goodness’s sake. Like, wait till this thing gets real, like, when this thing is production ready…. we’re so close. So close. And it’s gonna piss off so many people.
Parshall: Next time on Science, Quickly …
Parshall: We’re talking about the elephant in the room: When machines make music—what happens to musicians?
Science, Quickly is produced by Jeff DelViscio, Tulika Bose and Kelso Harper. Our theme music was composed by Dominic Smith.
Don’t forget to subscribe to Science, Quickly wherever you get your podcasts. For more in-depth science news and features, go to ScientificAmerican.com.
For Scientific American’s Science, Quickly, I’m Allison Parshall.