Science at the service of creativity

Can AI compose music now?

Geoffroy Peeters, Professor of Data Science at Télécom Paris (IP Paris)
On February 12th, 2025 | 6 min reading time
Key takeaways
  • Today, algorithms for classifying, indexing and analysing music have access to enough data to operate autonomously.
  • With advances in deep learning, music can now be analysed as a set of distinct elements (vocals, drums, bass, etc.).
  • This ability to extract the elements that make up music has made it possible to recontextualize, modify or even clone them in other content.
  • It is now possible for certain models to generate their own music, although this remains a major technical challenge.
  • One of the challenges of these practices is to enable these models to generate genuinely new content, and not simply reproduce what they have already learned.

In 1957, a computer wrote a musical score for the first time. ILLIAC I – designed by Lejaren Hiller and Leonard Isaacson at the University of Illinois – composed a string quartet1. From then on, the promise of a computer program capable of generating music was rooted in reality. After all, music is all about structures, rules and mathematics. Nothing unknown to a computer program… except for one detail: creativity.

The fascinating thing about this suite is that it was composed by a computer, following a probabilistic model surprisingly similar to those used today2. However, it was created according to rules established by a human composer, then revised and performed by an orchestra. The result: a rigid application of the rules, leaving little room for artistic innovation.

Today, technology has evolved radically: anyone can play at being a composer from their computer. And thanks to deep learning algorithms and the rise of generative AI, musical AI has taken an interesting turn. For a machine to truly produce a musical work from scratch, it first had to understand music, not merely imitate it. Therein lies the challenge of a scientific quest begun over twenty years ago: not to make machines compose, but to teach them how to listen. Recognising a style, classifying a work, analysing a musical structure…

Long before the explosion of AI-assisted music generation, researchers were already trying to get machines to hear music. Among them was Geoffroy Peeters, a professor at Télécom Paris and previously Research Director at IRCAM. His work on the subject could help us answer the question: can a machine truly understand music, even before it claims to create it?

Understanding music

“In the early 2000s, the international standardisation of the .mp3 format (MPEG‑1 Audio Layer III) led to the digitisation of music libraries (today’s streaming platforms), giving users access to a vast catalogue of music, and hence the need to classify and index each piece of music in it,” explains Geoffroy Peeters.

This gave rise to a new field of research: how to develop a music search engine? “These music analysis technologies are based on audio analysis and signal processing and were initially ‘human-driven’. Learning was based on human-input rules,” he adds. Music is not simply a series of random sounds, but a structure organised according to a rigorous grammar – sometimes as strong as, or even stronger than, that of language. A style of music is determined by a certain type of chord, a certain tempo, a harmonic structure, and so on: “teaching these different rules to a machine didn’t seem all that complicated”.

“What defines Blues music, for example, is the repetition of a 12-bar grid based on the sequence of three specific chords,” elaborates the professor. “These rules, which we know very well, can be encoded in a computer so that it can classify music according to genre.” That said, music is not only defined by its genre: it can also convey a mood or be more suited to a particular context – be it for sport or for meditation. In short, many elements follow rules that are more diffuse than those determining genre.
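
To make the idea concrete, here is a minimal illustrative sketch in Python – not taken from any real system – that hard-codes the 12-bar blues grid described above and checks whether a chord sequence follows it, in the spirit of the early “human-driven” rules:

    # Toy "human-driven" rule: a 12-bar blues in C cycles through the I, IV
    # and V chords in a fixed grid. Real systems used far richer
    # signal-processing features, but the principle is the same.
    TWELVE_BAR_BLUES_IN_C = [
        "C", "C", "C", "C",   # bars 1-4:  I
        "F", "F", "C", "C",   # bars 5-8:  IV IV I I
        "G", "F", "C", "G",   # bars 9-12: V IV I V (turnaround)
    ]

    def looks_like_twelve_bar_blues(chords):
        """Return True if the chord grid repeats the 12-bar blues pattern."""
        if not chords or len(chords) % 12 != 0:
            return False
        return all(chords[i:i + 12] == TWELVE_BAR_BLUES_IN_C
                   for i in range(0, len(chords), 12))

    print(looks_like_twelve_bar_blues(TWELVE_BAR_BLUES_IN_C * 3))   # True
    print(looks_like_twelve_bar_blues(["C", "G", "Am", "F"] * 3))   # False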

“In an attempt to address this complexity, Pandora Music, the largest music streaming platform in the U.S., created the ‘Music Genome Project’, asking human beings to annotate over 1 million tracks based on 200 different criteria.” This colossal task accumulated enough data to enable the development of so-called data-driven approaches (in which knowledge is learned by the machine from the analysis of data). Among machine learning techniques, deep learning algorithms quickly emerged as the most powerful, and in the 2010s enabled dazzling advances. “Rather than making human-driven models with complex mathematics, like signal processing, and manual decision rules, we can now learn everything completely automatically from data,” adds Geoffroy Peeters.
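
By way of contrast with the rule-based sketch above, here is a minimal data-driven sketch in Python/PyTorch – with random tensors standing in for real audio features and human annotations – in which a small network learns the mapping from features to genre labels directly from examples:

    import torch
    from torch import nn

    # Data-driven sketch: no hand-written rules; a small network learns the
    # mapping from audio features to genre labels from annotated examples.
    # Random tensors replace real features (e.g. spectrogram statistics)
    # and real human annotations.
    n_tracks, n_features, n_genres = 1024, 128, 10
    features = torch.randn(n_tracks, n_features)        # placeholder features
    labels = torch.randint(0, n_genres, (n_tracks,))    # placeholder annotations

    model = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                          nn.Linear(256, n_genres))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
        print(f"epoch {epoch}: loss {loss.item():.3f}")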

Over time, these trained models have enabled the development of classification and recommendation algorithms for online music platforms such as Deezer and Spotify.

Learning to listen

Deep learning also brought about a paradigm shift. Whereas music used to be considered as a whole, it can now be analysed as a composite of elements. “Until 2010, we were unable to separate vocals, drums and bass from a mix in a clean – usable – way,” he points out. But if the voice could be extracted, the sung melody could be precisely recognised, characterised and more finely analysed. “Deep learning made this possible by training systems that take a ‘mixed song’ as input, with all the sources mixed (vocals, drums, bass, etc.), and then output the various sources demixed, or separated.” To train such a system, however, you need data – lots of it. In the early days, some training could be carried out with access, often limited, to demixed recordings from record companies – until Spotify, with its huge catalogue of data, came up with a convincing source separation algorithm. This was followed by a host of new models, each more impressive than the last, including the French Spleeter model from Deezer, which is open source3, and Demucs from Meta-AI in Paris.
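
As an illustration, Deezer’s open-source Spleeter can be driven from a few lines of Python; a typical usage sketch (assuming the package is installed, and with “song.mp3” standing in for any local mix) looks roughly like this:

    # Sketch of source separation with Deezer's open-source Spleeter
    # (pip install spleeter). "song.mp3" is a placeholder for any local mix;
    # the pretrained "4stems" model splits it into vocals, drums, bass, other.
    from spleeter.separator import Separator

    separator = Separator("spleeter:4stems")           # loads a pretrained model
    separator.separate_to_file("song.mp3", "output/")
    # output/song/ should then contain vocals.wav, drums.wav, bass.wav, other.wav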

This individual analysis of each element that makes up a piece of music has turned AI training on its head. “All this has opened the door to many things, including the generative AI developed today in music. For example, with the ability to separate the voice and analyse it in detail, it becomes entirely possible to recontextualize it (reinserting Edith Piaf’s voice in the film ‘La Môme’, or John Lennon’s in the Beatles’ ‘Now and Then’), to modify it (pitch correction is widely used), to recreate it (the voice of General de Gaulle pronouncing the call of June 18th), but also to clone it. Recent events show just how far the latter use can go, with concerns in the world of film dubbing, the fear of ‘deepfakes’, but also a previously unreleased track featuring Drake and The Weeknd, which was nevertheless not sung by them.”

Becoming a composer

Early research in musical AI had well-defined objectives: to classify, analyse and segment music and, why not, to assist the composer in his or her creation. But with the emergence of generative models, this work became the basis for a whole new approach: the generation of a piece of music (and therefore its audio signal) from nothing, or from just a textual “prompt”. “The first player to position itself on music generation from scratch was OpenAI, with Jukebox,” notes Geoffroy Peeters. “In a way, they’ve recycled what they were doing for ChatGPT: using a Large Language Model (LLM), a so-called autoregressive model, trained to predict the next word based on the previous ones.”

Transposing this principle to music is a major technical challenge. Unlike text, audio is not made up of distinct words that the AI can treat as tokens. “We had to translate the audio signal into a form that the model could understand,” he explains. “This is possible with quantised auto-encoders, which learn to project the signal into a quantised space, the space of ‘tokens’, and to reconstruct the audio signal from these ‘tokens’. All that remains is to model the temporal sequence of tokens in a piece of music, which is done using an LLM. The LLM is then used again to generate a new sequence of ‘tokens’ (the most likely sequence), which are then converted into audio by the quantised auto-encoder’s decoder.”
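
A highly simplified Python/PyTorch sketch of this two-stage idea – a quantised auto-encoder that turns audio frames into discrete tokens, then an autoregressive model over those tokens – with toy dimensions and random data in place of a real corpus (none of this reflects Jukebox’s actual architecture):

    import torch
    from torch import nn

    # Stage 1: a quantised auto-encoder maps audio frames to a small codebook
    # of discrete "tokens". Stage 2: a sequence model predicts the next token.
    # Toy sizes and random data; a real autoregressive model would also use a
    # causal attention mask and far larger networks.
    frame_dim, codebook_size, latent_dim = 64, 256, 32
    encoder = nn.Linear(frame_dim, latent_dim)
    decoder = nn.Linear(latent_dim, frame_dim)
    codebook = nn.Embedding(codebook_size, latent_dim)

    def quantise(frames):
        """Map each encoded frame to the index of its nearest codebook vector."""
        latents = encoder(frames)                         # (T, latent_dim)
        dists = torch.cdist(latents, codebook.weight)     # (T, codebook_size)
        return dists.argmin(dim=-1)                       # token ids, shape (T,)

    # Stage 2: "language model" over the token sequence
    lm = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
    to_logits = nn.Linear(latent_dim, codebook_size)

    tokens = quantise(torch.randn(128, frame_dim))        # fake 128-frame clip
    hidden = lm(codebook(tokens).unsqueeze(0))            # (1, T, latent_dim)
    next_token = to_logits(hidden)[0, -1].argmax()        # greedy continuation
    audio_frame = decoder(codebook(next_token))           # back to the audio space
    print(int(next_token), audio_frame.shape)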

Models with even more impressive results followed, such as Stable Audio from Stability AI. This type of model uses the principle of diffusion (popularised for the generation of very high-quality images, as in Midjourney or Stable Diffusion), but the idea remains the same: to transform the audio signal into quantised data readable by the diffusion model.
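
For intuition only, here is a very rough Python/PyTorch sketch of the diffusion principle itself – corrupt a latent representation with noise, train a network to recover the clean version, then generate by refining pure noise step by step – with toy data unrelated to any actual product:

    import torch
    from torch import nn

    # Rough sketch of the diffusion principle: blend latents with noise, train
    # a network to recover the clean latent, then generate by starting from
    # pure noise and refining it at ever lower noise levels. Toy dimensions;
    # real audio-diffusion models are far more elaborate.
    latent_dim, steps = 64, 50
    denoiser = nn.Sequential(nn.Linear(latent_dim + 1, 256), nn.ReLU(),
                             nn.Linear(256, latent_dim))
    optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

    clean = torch.randn(512, latent_dim)            # stand-in for audio latents
    for _ in range(200):                            # training loop
        t = torch.rand(512, 1)                      # noise level in [0, 1]
        noisy = (1 - t) * clean + t * torch.randn_like(clean)
        loss = ((denoiser(torch.cat([noisy, t], dim=1)) - clean) ** 2).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Generation: predict a clean latent, re-noise it to a slightly lower
    # level, and repeat until the noise level reaches zero.
    x = torch.randn(1, latent_dim)
    for step in reversed(range(steps)):
        t = torch.full((1, 1), (step + 1) / steps)
        pred_clean = denoiser(torch.cat([x, t], dim=1))
        x = (1 - step / steps) * pred_clean + (step / steps) * torch.randn_like(x)
    print(x.shape)  # a freshly generated latent, to be decoded back into audio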

Providing a minimum of control over the generated music requires “conditioning” the generative models on text; this text is either a description of the audio signal (its genre, mood, instrumentation) or its lyrics. To achieve this, the models are also trained on a text corresponding to a given music input. This is why the Suno model can be “prompted” with text. However, this is where the limits of their creative capacity and questions of intellectual property come into play. “These models suffer a lot from memorisation,” warns Geoffroy Peeters. “For example, by asking Suno in a prompt to make music accompanied by the lyrics of ‘Bohemian Rhapsody’, it ended up generating music very close to the original. This poses copyright problems for the newly created music, because the rights to it belong to the human behind the prompt, even though the model was trained on music whose rights they did not hold.” [Editor’s note: Today, Suno refuses this type of generation, as it no longer complies with its terms of use.]
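
A purely illustrative Python/PyTorch sketch of what such text conditioning can look like – embed the words of a prompt, prepend them to the audio-token sequence, and let the model attend to both when predicting the next audio token (the vocabulary and weights are invented, and this mirrors no specific product):

    import torch
    from torch import nn

    # Toy text conditioning: prompt words are embedded and prepended to the
    # audio-token sequence, so next-token predictions depend on the description.
    # Vocabulary, tokeniser and weights are invented placeholders.
    text_vocab = {"calm": 0, "piano": 1, "fast": 2, "blues": 3}
    audio_codebook_size, d_model = 256, 32

    text_embed = nn.Embedding(len(text_vocab), d_model)
    audio_embed = nn.Embedding(audio_codebook_size, d_model)
    backbone = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    to_audio_logits = nn.Linear(d_model, audio_codebook_size)

    def next_audio_token(prompt_words, audio_tokens):
        """Predict the next audio token given a text prompt and the tokens so far."""
        text_ids = torch.tensor([[text_vocab[w] for w in prompt_words]])
        audio_ids = torch.tensor([audio_tokens])
        sequence = torch.cat([text_embed(text_ids), audio_embed(audio_ids)], dim=1)
        hidden = backbone(sequence)               # text and audio attend jointly
        return to_audio_logits(hidden[:, -1]).argmax(dim=-1).item()

    print(next_audio_token(["calm", "piano"], [12, 87, 3]))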

“So, there’s a real need to turn these tools into models that generate new content, not just reproduce what they’ve learned,” concludes the professor. “Today’s models generate music, but do they create new music? Unlike audio synthesisers (which made it possible to create new sounds), music is an organisation of sounds (notes or otherwise) based on rules. Models are undoubtedly capable of understanding these rules, but are they capable of inventing new ones? Are they still at the stage of ‘stochastic parrots’, as is often said?”

Pablo Andres
1 Illiac Suite — Hiller, L., & Isaacson, L. (1959). Experimental Music: Composition with an Electronic Computer. McGraw-Hill.
2 Timeline of the use of AI in musical composition — IRCAM (2023). Une brève chronologie subjective de l’usage de l’intelligence artificielle en composition musicale; Agon, C. (1998). Analyse de l’utilisation de l’IA en musique.
3 WIPO report on AI and musical intellectual property — World Intellectual Property Organization (WIPO) (2021). Artificial Intelligence and Intellectual Property: A Literature Review.
