Science at the service of creativity

Can AI compose music now?

with Geoffroy Peeters, Professor of Data Science at Télécom Paris (IP Paris)
On February 12th, 2025 | 6 min reading time
Key takeaways
  • Today, algorithms for classifying, indexing and analysing music have enough data at their disposal to operate autonomously.
  • With advances in deep learning, music can now be analysed as a set of distinct elements (vocals, drums, bass, etc.).
  • This ability to extract the elements that make up music has made it possible to recontextualize, modify or even clone them in other content.
  • It is now possible for certain models to generate their own music, although this remains a major technical challenge.
  • One of the challenges of these practices is to enable these models to generate genuinely new content, and not simply reproduce what they have already learned.

In 1957, a computer wrote a musical score for the first time. ILLIAC I – designed by Lejaren Hiller and Leonard Isaacson at the University of Illinois – composed a string quartet¹. From then on, the promise of a computer program capable of generating music was rooted in reality. After all, music is all about structures, rules and mathematics. Nothing unknown to a computer program… except for one detail: creativity.

The fascinating thing about this piece – the Illiac Suite – is that it was composed by a computer, following a probabilistic model surprisingly similar to those used today². However, it was created according to rules established by a human composer, then revised and performed by an orchestra. The result: a rigid application of the rules, leaving little room for artistic innovation.

Today, technology has evolved radically: anyone can play at being a composer from their computer. And thanks to deep learning algorithms and the rise of generative AI, musical AI has taken an interesting turn. For a machine to really produce a musical work from scratch, it first had to understand music, not imitate it. Therein lies the challenge of a scientific quest begun over twenty years ago: not to make machines compose, but to teach them how to listen – recognising a style, classifying a work, analysing a musical structure…

Long before the explosion of AI-assisted music generation, researchers were already trying to get machines to hear music. Among them was Geoffroy Peeters, a professor at Télécom Paris and previously Research Director at IRCAM. His work on the subject could help us answer the question: can a machine truly understand music, even before it claims to create it?

Understanding music

“In the early 2000s, the international standardisation of the .mp3 format (MPEG‑1 Audio Layer III) led to the digitisation of music libraries (today’s streaming platforms), giving users access to a vast catalogue of music, and hence the need to classify and index each piece of music in it,” explains Geoffroy Peeters.

This gave rise to a new field of research: how do you develop a music search engine? “These music analysis technologies are based on audio analysis and signal processing and were initially ‘human-driven’. Learning was based on human-input rules,” he adds. Music is not simply a series of random sounds, but a structure organised according to a rigorous grammar – sometimes as strong as, or even stronger than, that of language. A style of music is determined by a certain type of chord, a certain tempo, a harmonic structure, and so on: “teaching these different rules to a machine didn’t seem all that complicated”.

“What defines Blues music, for example, is the repetition of a 12-bar grid based on the sequence of three specific chords,” elaborates the professor. “These rules, which we know very well, can be encoded in a computer, so that it can classify the music according to genre.” That said, music is not only defined by its genre: it can also convey a mood or be more suited to a particular context – be it for sports, or for meditation. In short, there are many elements whose rules are more diffuse than those determining genre.
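
As an illustration of this ‘human-driven’ approach, such a rule can be written down directly. The sketch below is purely hypothetical and works on symbolic chord labels rather than audio, but it shows the idea of encoding an expert rule – here, checking that a chord sequence repeats the classic 12-bar blues grid built on the I, IV and V chords:

```python
# Hypothetical sketch of a hand-written ("human-driven") rule:
# does a symbolic chord sequence follow the classic 12-bar blues grid?
# Real systems of the period worked on audio features, but the
# principle of encoding expert knowledge is the same.

# The canonical 12-bar blues progression, expressed in scale degrees.
TWELVE_BAR_BLUES = ["I", "I", "I", "I",
                    "IV", "IV", "I", "I",
                    "V", "IV", "I", "I"]

def looks_like_blues(chords: list[str]) -> bool:
    """Return True if the sequence repeats the 12-bar blues grid."""
    if len(chords) < 12 or len(chords) % 12 != 0:
        return False
    # Every successive block of 12 bars must match the grid.
    return all(chords[i:i + 12] == TWELVE_BAR_BLUES
               for i in range(0, len(chords), 12))

# Example: two choruses of a standard blues, in any key.
progression = TWELVE_BAR_BLUES * 2
print(looks_like_blues(progression))  # True
```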

“In an attempt to address this complexity, Pandora Music, the largest music streaming platform in the U.S., created the ‘Music Genome Project’, asking human beings to annotate over 1 million tracks based on 200 different criteria.” This colossal task accumulated enough data to enable the development of so-called data-driven approaches (in which knowledge is learned by the machine from the analysis of data). Among machine learning techniques, deep learning algorithms quickly emerged as the most powerful, enabling dazzling advances in the 2010s. “Rather than making human-driven models with complex mathematics, like signal processing, and manual decision rules, we can now learn everything completely automatically from data,” adds Geoffroy Peeters.
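
By contrast, a data-driven model learns its own decision rules from annotated examples. The following minimal sketch assumes PyTorch and mel-spectrogram inputs; the layer sizes, number of tags and the random ‘annotations’ are illustrative placeholders, not those of any production system:

```python
import torch
import torch.nn as nn

# Minimal data-driven tagger: a small CNN that maps a mel-spectrogram
# to music tags (genre, mood, context...). All sizes are illustrative.
N_MELS, N_FRAMES, N_TAGS = 128, 256, 50

class MusicTagger(nn.Module):
    def __init__(self, n_tags: int = N_TAGS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d(1),             # global pooling over time/frequency
        )
        self.classifier = nn.Linear(32, n_tags)  # one logit per tag

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        x = self.features(spectrogram)
        return self.classifier(x.flatten(1))

# One training step on dummy data: the labels (random here) play the role
# of the human annotations collected by projects like the Music Genome Project.
model = MusicTagger()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
specs = torch.randn(8, 1, N_MELS, N_FRAMES)        # batch of spectrograms
labels = torch.randint(0, 2, (8, N_TAGS)).float()  # multi-label annotations
loss = nn.BCEWithLogitsLoss()(model(specs), labels)
loss.backward()
optimizer.step()
```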

Over time, these trained models have enabled the development of classification and recommendation algorithms for online music platforms such as Deezer and Spotify.

Learning to listen

Deep learning also brought about a paradigm shift. Whereas music used to be considered as a whole, it can now be analysed as a composite of elements. “Until 2010, we were unable to separate vocals, drums and bass from a mix in a clean – usable – way,” he points out. But if the voice could be extracted, the sung melody could be precisely recognised, characterised and more finely analysed. “Deep learning makes this possible by training systems that take a ‘mixed song’ as input with all the sources mixed (vocals, drums, bass, etc.), and then output the various sources demixed, or separated.” To train such a system, however, you need data – lots of it. In the early days, some training could be carried out thanks to access, often limited, to demixed recordings from record companies, until Spotify, with its huge catalogue of data, came up with a convincing source separation algorithm. This was followed by a host of new models, each more impressive than the last, including the French Spleeter model from Deezer, which is open source³, and Demucs from Meta-AI in Paris.
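
For reference, Deezer’s open-source Spleeter exposes a simple Python interface for this kind of demixing. A minimal sketch along the lines of its documentation (the file paths are placeholders, and the exact arguments may differ between versions):

```python
from spleeter.separator import Separator

# Load a pre-trained 4-stem model (vocals / drums / bass / other).
separator = Separator('spleeter:4stems')

# Demix a mixed song into its separated sources, written as one
# audio file per stem in the output directory.
separator.separate_to_file('mixed_song.mp3', 'separated/')
```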

This individual analysis of each element that makes up a piece of music has turned AI training on its head. “All this has opened the door to many things, including the generative AI being developed in music today. For example, with the ability to separate the voice and analyse it in detail, it becomes entirely possible to re-contextualise it (reinserting Edith Piaf’s voice in the film ‘La Môme’, or John Lennon’s in the Beatles’ ‘Now and Then’), to modify it (pitch correction is widely used), to recreate it (the voice of General de Gaulle pronouncing the call of June 18th), but also to clone it. Recent events show just how far the latter use can go, with concerns in the world of film dubbing, the fear of ‘deepfakes’, but also a previously unreleased track featuring Drake and The Weeknd, which was nevertheless not sung by them.”

Becoming a composer

Early research in musical AI had well-defined objectives: to classify, analyse and segment music and, why not, to assist the composer in his or her creation. But with the emergence of generative models, this work became the basis for a whole new approach: the generation of a piece of music (and therefore its audio signal) from nothing, or from just a textual “prompt”. “The first player to position itself in music generation from scratch was OpenAI, with Jukebox,” notes Geoffroy Peeters. “In a way, they’ve recycled what they were doing for ChatGPT: using a Large Language Model (LLM), a so-called autoregressive model, trained to predict the next word based on the previous ones.”

Transposing this principle to music is a major technical challenge. Unlike text, audio is not made up of distinct words that the AI can treat as tokens. “We had to translate the audio signal into a form that the model could understand,” he explains. “This is possible with quantised auto-encoders, which learn to project the signal into a quantised space, the space of ‘tokens’, and to reconstruct the audio signal from these ‘tokens’. All that remains is to model the temporal sequence of tokens in a piece of music, which is done using an LLM. The LLM is then used to generate a new sequence of ‘tokens’ (the most likely sequence), which is then converted into audio by the quantised auto-encoder’s decoder.”
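
A toy sketch of the two ingredients described here, assuming PyTorch and dimensions far smaller than anything used by Jukebox or its successors: a quantiser that snaps encoder outputs to their nearest codebook entry to produce discrete ‘tokens’, and an autoregressive step that predicts the next token from the previous ones:

```python
import torch
import torch.nn as nn

# Toy vector quantisation: snap each encoder output vector to the
# nearest entry of a learned codebook, yielding discrete token ids.
CODEBOOK_SIZE, DIM = 512, 64
codebook = nn.Embedding(CODEBOOK_SIZE, DIM)

def quantise(latents: torch.Tensor) -> torch.Tensor:
    """latents: (time, DIM) -> token ids: (time,)."""
    distances = torch.cdist(latents, codebook.weight)  # (time, CODEBOOK_SIZE)
    return distances.argmin(dim=-1)                    # nearest codebook entry

# Toy autoregressive model over tokens: predict the next token id
# from the previous ones (a stand-in for the LLM in the quote).
token_lm = nn.LSTM(input_size=DIM, hidden_size=128, batch_first=True)
to_logits = nn.Linear(128, CODEBOOK_SIZE)

def next_token(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (1, time) -> most likely next token id."""
    embedded = codebook(tokens)                 # (1, time, DIM)
    hidden, _ = token_lm(embedded)              # (1, time, 128)
    return to_logits(hidden[:, -1]).argmax(-1)  # predict from the last step

# 1) "Encoder" output for 100 audio frames (random here) -> tokens.
latents = torch.randn(100, DIM)
tokens = quantise(latents).unsqueeze(0)         # (1, 100)
# 2) Generate one more token; a real system would loop, then feed the
#    full token sequence to the auto-encoder's decoder to get audio.
print(next_token(tokens))
```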

Models with even more impressive results followed, such as Stable Audio from Stability AI. This type of model uses the principle of diffusion (popularised for the generation of very high-quality images, as in Midjourney or Stable Diffusion), but the idea remains the same: to transform the audio signal into quantised data readable by their diffusion model.
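
In broad strokes, a diffusion model learns to remove noise that has been added to a compressed representation of the audio. The toy training step below assumes PyTorch and ignores the noise schedules, architectures and conditioning used by real systems such as Stable Audio:

```python
import torch
import torch.nn as nn

# Toy diffusion-style training step on a compressed audio representation:
# add noise to a clean latent, then train a network to predict that noise.
LATENT_DIM = 64
denoiser = nn.Sequential(nn.Linear(LATENT_DIM + 1, 256), nn.ReLU(),
                         nn.Linear(256, LATENT_DIM))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

clean = torch.randn(32, LATENT_DIM)        # stand-in for encoded audio
t = torch.rand(32, 1)                      # random noise level per sample
noise = torch.randn_like(clean)
noisy = (1 - t) * clean + t * noise        # simple linear noising schedule

predicted = denoiser(torch.cat([noisy, t], dim=-1))  # condition on noise level
loss = nn.functional.mse_loss(predicted, noise)      # learn to predict the noise
loss.backward()
optimizer.step()
# Generation then runs the reverse process: start from pure noise and
# repeatedly remove the predicted noise, before decoding back to audio.
```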

Providing a minimum of control over the resulting music requires “conditioning” the generative models on text; this text is either a description of the audio signal (its genre, mood, instrumentation), or its lyrics. To achieve this, the training of the models also takes into account a text corresponding to a given music input. This is why the Suno model can be “prompted” with text. However, this is where the limits of their creative capacity and questions of intellectual property come into play. “These models suffer a lot from memorisation,” warns Geoffroy Peeters. “For example, when Suno was asked in a prompt to make music accompanied by the lyrics of ‘Bohemian Rhapsody’, it ended up generating music very close to the original. This poses copyright problems for the newly created music, because the rights to it belong to the human behind the prompt, whereas the rights to the music used to train the model were never obtained.” [Editor’s note: today, Suno refuses this type of generation, as it no longer complies with its terms of use.]
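
Returning to the conditioning mechanism mentioned at the start of this section: conceptually, it can be as simple as feeding an embedding of the text prompt to the generator alongside the audio tokens. The sketch below is purely illustrative (a crude bag-of-words ‘text encoder’ and a tiny recurrent generator, nothing like Suno’s actual architecture):

```python
import torch
import torch.nn as nn

# Purely illustrative text conditioning: embed the prompt and prepend it
# to the audio-token sequence, so generation depends on the description.
VOCAB, DIM, CODEBOOK_SIZE = 1000, 64, 512
text_embedding = nn.EmbeddingBag(VOCAB, DIM)           # crude prompt encoder
token_embedding = nn.Embedding(CODEBOOK_SIZE, DIM)
generator = nn.LSTM(DIM, 128, batch_first=True)
to_logits = nn.Linear(128, CODEBOOK_SIZE)

prompt_ids = torch.tensor([[12, 57, 803]])             # e.g. "calm piano ballad"
audio_tokens = torch.randint(0, CODEBOOK_SIZE, (1, 50))

# The prompt embedding is prepended as a conditioning "token".
condition = text_embedding(prompt_ids).unsqueeze(1)    # (1, 1, DIM)
sequence = torch.cat([condition, token_embedding(audio_tokens)], dim=1)
hidden, _ = generator(sequence)
next_token_logits = to_logits(hidden[:, -1])           # prediction now depends on the prompt
```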

“So, there’s a real need to turn these tools into models that generate new content, not just reproduce what they’ve learned,” concludes the professor. “Today’s models generate music, but do they create new music? Unlike audio synthesisers (which made it possible to create new sounds), music is an organisation of sounds (notes or otherwise) based on rules. Models are undoubtedly capable of understanding these rules, but are they capable of inventing new ones? Are they still at the stage of ‘stochastic parrots’, as is often said?”

Pablo Andres
1. Illiac Suite – Hiller, L., & Isaacson, L. (1959). Experimental Music: Composition with an Electronic Computer. McGraw-Hill.
2. Timeline of the use of AI in musical composition – IRCAM (2023). Une brève chronologie subjective de l’usage de l’intelligence artificielle en composition musicale. – Agon, C. (1998). Analyse de l’utilisation de l’IA en musique.
3. WIPO report on AI and musical intellectual property – World Intellectual Property Organization (WIPO) (2021). Artificial Intelligence and Intellectual Property: A Literature Review.
