For more than 20 years, researchers have been using artificial intelligence (AI) on sound signals. These sound signals can be speech, music, or environmental sounds. Recent advances in algorithms are opening the door to new fields of research and new applications.
How can artificial intelligence be used to process sound signals?
Firstly, AI can be used for sound analysis. In other words, based on a recording, the machine can recognise the sounds (which instrument is playing, which machine or object is generating which noise, etc.) and the recording conditions (live, studio, outside, etc.). For example, Shazam is a fairly simple but very well-known music recognition AI.
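To make this concrete, here is a minimal, hypothetical Python sketch of the idea behind many recognition systems: compute a spectrogram, keep only its strongest peaks, and treat that "constellation" of peaks as a compact fingerprint of the recording. The robust hashing and large-scale database lookup used by real systems such as Shazam are deliberately omitted.

```python
# Simplified, hypothetical sketch of spectrogram-peak fingerprinting.
# Real systems hash pairs of peaks and match them against a large index.
import numpy as np
from scipy.signal import spectrogram

def fingerprint(audio, sr=44100, n_peaks=20):
    """Return (time, frequency) coordinates of the strongest
    spectrogram peaks: a crude 'constellation' of the recording."""
    freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=2048)
    flat = np.argsort(sxx, axis=None)[-n_peaks:]      # strongest bins
    f_idx, t_idx = np.unravel_index(flat, sxx.shape)
    return sorted(zip(times[t_idx], freqs[f_idx]))

# Toy example: one second made of two sine waves.
sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
print(fingerprint(audio, sr)[:5])
```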
AI can also be used to transform sound. This involves, for example, separating the different sources of a sound recording so that they can be remixed differently (as in karaoke applications). It is also possible to transfer the musical style of a given sound recording, or to change its acoustic conditions (for example by removing the reverberation while keeping the content intact). Finally, the third major area of sound processing using generative AI is synthesis. Given a musical extract or certain instructions, the machine can generate music in the style of the extract. It can also be asked to generate music from a text or an image.
I’m currently working on a major research project funded by the European Research Council (ERC) called HI-Audio, or “Hybrid and Interpretable Deep neural audio machines”. The term “hybrid” means that instead of learning solely from large quantities of data, we incorporate a priori information derived from our knowledge into our learning models. We already have certain knowledge about sound: the type of musical instruments present, the level of reverberation in a room, and so on. The idea is to use this knowledge as the basis for relatively simple models that describe these phenomena. We then insert them into neural networks and more complex models that allow us to learn and describe what we don’t know. The result is models that combine interpretability and controllability.
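As an illustration only (this is not the actual HI-Audio code), the hybrid idea can be sketched in a few lines: a small neural network predicts the parameters of a simple, interpretable signal model, here a hypothetical damped sinusoid, and the two parts can be trained together. The predicted frequency and decay remain physically meaningful and controllable.

```python
# Minimal sketch of a hybrid model (hypothetical, for illustration):
# a neural network predicts the parameters of an explicit signal model.
import torch
import torch.nn as nn

class HybridTone(nn.Module):
    def __init__(self, n_features=64, sr=16000, n_samples=16000):
        super().__init__()
        # Black-box part: maps an input feature vector to two parameters.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 2)
        )
        self.register_buffer("t", torch.arange(n_samples) / sr)

    def forward(self, features):
        params = self.encoder(features)
        freq = 100 + 900 * torch.sigmoid(params[:, 0:1])  # Hz, kept in 100-1000
        decay = torch.exp(params[:, 1:2])                  # 1/s, kept positive
        # Knowledge-based part: an explicit damped-sinusoid model.
        return torch.exp(-decay * self.t) * torch.sin(2 * torch.pi * freq * self.t)

model = HybridTone()
features = torch.randn(4, 64)   # dummy batch of 4 analysis vectors
audio = model(features)         # (4, 16000) synthesised waveforms
print(audio.shape)
```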
What are the specific features of AI algorithms applied to sound?
A sound signal is a temporal signal (a sequence of data ordered in time) that can be more or less periodic. First of all, each sound signal has its own specific characteristics. Recognising the instruments and notes in a musical recording requires advanced source separation techniques, making it possible to distinguish and isolate each sound element. Unlike speech, where a single instrument (the voice) conveys a linguistic message, musical analysis must manage the simultaneity and harmony of the instruments.
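A toy example of one common family of separation techniques, time-frequency masking, may help: the mixture is transformed to the frequency domain and a mask keeps only the bins belonging to one source. In this sketch the mask is hand-made because the two synthetic sources are far apart in frequency; real systems estimate the mask with a neural network.

```python
# Toy illustration of separation by frequency-domain masking.
import numpy as np

sr = 44100
t = np.arange(sr) / sr                       # one second of signal
low = np.sin(2 * np.pi * 220 * t)            # "bass" source
high = np.sin(2 * np.pi * 1760 * t)          # "melody" source
mixture = low + high

spectrum = np.fft.rfft(mixture)
freqs = np.fft.rfftfreq(len(mixture), d=1 / sr)
mask = freqs < 1000                          # keep only bins below 1 kHz
low_estimate = np.fft.irfft(spectrum * mask, n=len(mixture))

# The masked reconstruction is very close to the original low source.
print(np.max(np.abs(low_estimate - low)))
```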
Another specificity of music is the length of the recordings. In principle, this type of AI is trained in much the same way as it is for images or text. But unlike an image, a sound signal is a series of numbers, positive or negative, that vary over time around a reference value. For a CD-quality recording, there are 44,100 values per second of music, so one minute of recording already represents 2,646,000 values (44,100 × 60 seconds). Data volumes are therefore very high even for a short duration. This calls for AI methods specific to sound, as well as very powerful computing resources to process this volume of data.
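The arithmetic quoted above can be written out directly:

```python
# CD-quality audio is sampled 44,100 times per second (per channel).
sample_rate = 44_100            # samples per second
print(sample_rate * 1)          # 1 second  -> 44100 values
print(sample_rate * 60)         # 1 minute  -> 2646000 values
```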
Which application sectors could benefit from these developments in sound processing?
Sound signal processing, or more generally AI applied to sound, is already used in a variety of fields. First of all, there are industrial applications. Speech is very sensitive to reverberation, which can quickly affect intelligibility, so the sound signal needs to be “cleaned” of environmental noise, particularly for telephone communications. Another area not to be overlooked is the usefulness of synthesised sound environments in the audiovisual industry. Recreating ambient sound allows you to suggest what is off-screen. Let’s imagine a film scene on a café terrace. We probably won’t know where the café is located: in the town centre, in a residential area, near a park, etc. Depending on the direction taken, sound can help immerse the viewer in a richer atmosphere. The same applies to video games and virtual reality. Hearing is one of the five senses, so we are very sensitive to sound. Adding sound enhancements increases realism and immersion in a virtual environment.
With the development of AI applied to sound, new fields of application can be envisaged. I’m thinking particularly of predictive maintenance, meaning that we could use sound to detect when an object is starting to malfunction. Understanding the sound environment could also be useful in the development of self-driving cars: in addition to the information captured by its cameras, a vehicle could steer itself according to the surrounding sounds, such as bicycle bells or pedestrians’ reactions.
Let’s not forget that processing sound signals can become a tool for helping people. In the future, we can imagine an AI translating the sound environment into another modality, enabling deaf people to “hear” the world around them. Sound analysis could also help protect people at home by detecting and characterising normal, abnormal, and alarming noises. And that’s just a non-exhaustive list of possible applications!
What are the main challenges and issues linked to the development and use of AI in general and more specifically in the field of sound?
One of the main dilemmas is the ecological impact of such systems. The performance of generative AI in general is correlated with the amount of data ingested and the computing power used. Although so-called “frugal” approaches exist, the environmental and economic repercussions of these tools are non-negligible. This is where my research project comes in, as it explores hybrid AI as an alternative, more frugal approach.
Another concern for sound processing is access to music databases because of copyright issues. Overall, regulations can be an obstacle to the development of AI in France. In the United States, the notion of fair use allows a degree of flexibility in the use of copyrighted works. In Europe, we are juggling several approaches. All the same, there are a few public databases containing royalty-free compositions written specifically for research purposes. We also sometimes work with companies such as Deezer, which offer restricted access to their catalogues for specific projects.
AI applied to sound also poses certain specific ethical problems. In particular, there is the question of the music generated by the machine and the potential for plagiarism, since the machine may have been trained on well-known, protected music. Who owns the copyright to the music generated by the machine? What is the price of this automatically generated music? How transparent is the music creation process? Finally, there is the question of the controllability of AI or, more precisely, its explainability. We need to be able to explain the decisions taken by the machine. Let’s go back to our example of the autonomous car: we need to be able to determine why it chooses to turn at a given moment. “It was the most likely action” is not a sufficient answer, particularly in the event of an accident. In my opinion, it is vital to integrate human knowledge into these AI systems and to ensure transparency in their use.
More generally, we need to build a legal framework for these constantly evolving technologies. But France and Europe sometimes tend to overregulate, hampering innovation and our international competitiveness. We need to identify and protect ourselves against the risks of abuse and the ethical risks of AI, which are real, but we also need to avoid overregulation.
Do you think AI will have an impact on musicians and the sound industry?
AI will have an impact everywhere: in all professions, all companies, and all environments, including jobs in the music sector. Yes, this raises concerns and questions, for instance among musicians and film sound engineers who fear being replaced. Some jobs may disappear, but others will be created.
In my view, AI is more a tool than a threat. It will open up a new range of possibilities. By making it possible to play together remotely, AI will be able to bring together communities of musicians across the planet. It can also help to democratise music learning by creating fun, personalised remote “training courses”. It is also a fairly sophisticated composition tool that can stimulate artists’ creativity.
AI in itself is not creative. It reproduces and reshapes but creates nothing. Similarly, in my opinion, AI does not make art. It is almost conceptually impossible for a machine to make art. Art, even if it is not clearly defined, is deeply personal; it is a form of human communication. Today, AI, particularly AI applied to sound processing, is not capable of that.