- Posted by Daitan Innovation Team
- On November 18, 2020
- AI, Artificial Intelligence, Audio, Audio De-noiser
Sharing our experiences building an audio denoiser using GANs
An article by Jacob Boness, Jamie Thomassen, and Colton Davenport
One of the main goals of the Innovation team at Daitan is to keep our eyes open to emerging technology that can positively impact our clients. Voice-based applications are undoubtedly one such technology. In the last few years, large and mid-sized companies have relied more and more on this kind of application to address problems ranging from recognition and identification to enhancement. In this piece, we expand on our previous work on noise removal from audio signals. We believe that exploring different strategies to solve a still-open problem can increase our understanding and offer our clients better-informed recommendations. Enjoy the reading.
For over 30 years, the Camosun College Information and Computer Systems Technology (ICS) Program has successfully matched up industry, government, and not-for-profit organizations with senior students to complete a final software project, the Capstone Project. This is an opportunity for students to work on a real-world project sponsored by an entity or individual in the community.
In 2020, Daitan Labs of Canada submitted a project proposal focused on audio denoising and machine learning. This article presents the results of this Capstone Project, outlining the educational opportunities, challenges, and accomplishments experienced by three students from Camosun College’s ICS Program.
In 2019, the Innovation team at Daitan created an audio denoiser by training a Convolutional Neural Network (CNN). Based on his research, Thalles Santos Silva, the author of that study, proposed using a network commonly used for image-to-image translation in the denoising process as it would fit the signal-to-signal translation inherent to denoising. This network is known as a Generative Adversarial Network (GAN).
In the summer of 2020, Daitan selected us, a group of three Information and Computer Systems students, to test this hypothesis. We spent the summer learning about TensorFlow and neural networks, building a GAN based on what we discovered, and comparing it to Daitan’s CNN for our end-of-degree Capstone project.
Over the course of the project, we learned many lessons. This article covers the roadblocks we faced, the results of our comparison study, a reflection on our process, and the implications for future study. Hopefully, the information provided will inform further research and development using GANs.
There are various definitions of audio denoising. For the purposes of this project, we interpret audio denoising to be the removal of any sound other than the primary speaker’s voice. Thalles Santos Silva covers the mathematical concepts behind denoising and the CNN in his 2019 article. He also provides background information about the two datasets involved (the Mozilla Common Voice English dataset and the UrbanSound8k dataset). We used Daitan’s dataset creator to mix clean-speech audio from the Mozilla dataset with noise audio from the UrbanSound8k dataset. The dataset creator then translated the audio into spectrograms that our networks could learn to manipulate.
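As a rough illustration of the mixing step, the sketch below adds a noise clip to a clean clip at a chosen signal-to-noise ratio. It is not Daitan’s dataset creator: the function names, the toy signals, and the idea of targeting a specific SNR are all assumptions made for this example, and plain Python lists stand in for real audio buffers.

```python
import math

def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mix has the requested SNR in dB.

    Hypothetical helper for illustration only.
    """
    # Loop the noise clip so it covers the whole clean clip.
    tiled = [noise[i % len(noise)] for i in range(len(clean))]
    # Noise level that yields the requested clean-to-noise ratio.
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(tiled)
    return [c + gain * n for c, n in zip(clean, tiled)]

# Toy data: a 440 Hz tone stands in for speech, a short pattern for noise.
sample_rate = 8000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]
noise = [1.0, -0.5, 0.25, -1.0, 0.5]
noisy = mix_at_snr(clean, noise, snr_db=0.0)
```

At 0 dB the added noise is as loud as the speech; higher values make the noise quieter. Each noisy clip stays paired with its clean original, which is what gives a denoising network a target to learn from.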
Simply put, the GAN (as seen in Figure 1) pits two neural networks against each other. The first network, the generator, takes input and attempts to create realistic output. The second network, the discriminator, learns to distinguish between the generator’s output and real input, and produces a single probability called a prediction.
We based our Generative Adversarial Network on a conditional GAN called Pix2Pix. A regular GAN produces random output that can pass as a real piece of the dataset. Many regular GANs, such as the one described here, can produce a random, realistic, and fake face. This lack of concern for input would not work well for denoising a specific piece of audio. A conditional GAN (as seen in Figure 2) uses its input more intelligently, as a guide for what it should produce. Both GANs have similar discriminators, but the conditional GAN differs in how it trains the generator. Pictured below is a high-level diagram of a conditional GAN found in Manish Nayak’s very informative article “An Introduction to Conditional GANs (CGANs)”.
As previously mentioned, a Generative Adversarial Network pits two networks against each other. A discriminator attempts to train to a prediction of 100% when given non-generated real input, and 0% when given the generator’s fake output. We defined a loss function that evaluates the accuracy of the discriminator after each step of each epoch. The discriminator is then modified based on this evaluation.
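To make the discriminator’s objective concrete, here is a minimal sketch of one common choice of loss, binary cross-entropy, written in plain Python. The article does not state which loss function the project used, so the formula and every name below are assumptions for illustration.

```python
import math

def binary_cross_entropy(prediction, label, eps=1e-7):
    """Penalty for predicting `prediction` when the truth is `label` (0 or 1)."""
    p = min(max(prediction, eps), 1.0 - eps)  # keep log() away from zero
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def discriminator_loss(real_predictions, fake_predictions):
    """Low when real inputs score near 1 and generated inputs score near 0."""
    real_loss = sum(binary_cross_entropy(p, 1.0) for p in real_predictions)
    fake_loss = sum(binary_cross_entropy(p, 0.0) for p in fake_predictions)
    return real_loss / len(real_predictions) + fake_loss / len(fake_predictions)

# A discriminator that separates real from fake well gets a small loss...
confident = discriminator_loss([0.90, 0.95], [0.10, 0.05])
# ...while one that the generator is fooling gets a larger loss.
fooled = discriminator_loss([0.40, 0.50], [0.60, 0.70])
```

In a real training loop, the gradient of this value with respect to the discriminator’s weights drives the update at each step.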
A regular GAN only trains its generator to fool the discriminator, while a conditional GAN trains the generator to create output that fools the discriminator and resembles a target. To this end, the generator has two loss functions. One loss function judges the difference between the generator’s output and the target output, and the second checks how well the denoised spectrogram fooled the discriminator. Daitan’s original CNN denoiser only had the first of these two loss functions. As the GAN trains it is learning a loss function. This ever-improving loss function, theoretically, could extend how long the denoising generator can be trained.
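The two generator losses can be combined as in the sketch below. It follows the recipe from the Pix2Pix paper, which adds an L1 distance to the adversarial term with a weight of 100. Whether our project used the same distance or weight is not stated in this article, so treat both, along with the function names, as assumptions.

```python
import math

def adversarial_loss(fake_predictions, eps=1e-7):
    """Low when the discriminator scores the generator's output near 1."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in fake_predictions]
    return sum(-math.log(p) for p in clipped) / len(clipped)

def l1_loss(output, target):
    """Mean absolute difference between generated and target values."""
    return sum(abs(o - t) for o, t in zip(output, target)) / len(target)

def generator_loss(fake_predictions, output, target, l1_weight=100.0):
    """Conditional-GAN generator loss: fool the discriminator AND match the target."""
    return adversarial_loss(fake_predictions) + l1_weight * l1_loss(output, target)

# Toy spectrogram values, flattened into short lists.
target = [0.0, 1.0, 0.5]
close_output = [0.0, 1.0, 0.5]
far_output = [0.3, 0.6, 0.9]

loss_close = generator_loss([0.5], close_output, target)
loss_far = generator_loss([0.5], far_output, target)
```

Setting `l1_weight` to a large value keeps the output anchored to the target, while the adversarial term pushes it towards whatever the discriminator currently accepts as real. A standalone CNN denoiser corresponds to keeping only the distance term.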
Although beyond the scope of our project, theoretically, if the discriminator continues to raise its standards, the generator may be able to train effectively much longer than Daitan’s single CNN could on its own.
As the technology and theory of neural networks were almost completely unknown to us, the Innovation team at Daitan suggested we investigate TensorFlow, a major Python library used for deep learning. We worked our way through an online MIT introductory course on deep learning methods that featured free lectures and labs. Though this course dives more into hard theory than we needed, the initial lectures and labs provided a solid foundation to base our decisions on. While researching audio denoising, we discovered SEGAN, a speech enhancement generative adversarial network, and thought it might be applicable to our project. However, it was not directly translatable, as it represents audio as waveforms rather than spectrograms.
Our Development and Roadblocks
We faced a few obstacles early on. The first involved our initial GAN which used the architecture described in the face generator mentioned earlier. This was not the right choice and we reached a point where we were uncertain how to proceed. Daitan’s data scientists directed us towards the conditional GAN Pix2Pix, which they conceived as the next step for Daitan’s denoiser model.
Pix2Pix is a conditional GAN designed to translate an image of simple blocks into a realistic image based on the content it was trained on. This translation method seemed well suited for changing a noisy audio source to a clean audio output. For this GAN, we used the Innovation team’s CNN as the generator. This saved time and allowed for a fair comparison between a standalone network and one housed in competition. The discriminator was a modification of Pix2Pix’s discriminator, tweaked for the shape of our input.
Naturally, we ran into roadblocks during development and training. As our project manager Jacob said: “Google is the author of 90% of our misfortune.” We tried to upload the Mozilla Common Voice dataset (MCVD) to Google Drive for processing, but the Google API has a strict call limit of 100 calls in 100 seconds. To comply, we wrote a custom FTP client to upload at the exact rate allowed by Google. Though we no longer needed to manually upload files a hundred at a time, the upload limit meant the dataset took 6 days to transfer.
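A limit of 100 calls in 100 seconds works out to one call per second on average, or at most 86,400 calls per day. The sketch below shows one simple way to pace uploads at that rate. It is not our actual client: `RateLimiter`, `upload_all`, and the `upload_one` callback are hypothetical names, and no real Google API call is made.

```python
import time

class RateLimiter:
    """Space calls evenly so at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval


def upload_all(paths, upload_one, limiter):
    """Upload each path in turn, pausing as needed to respect the limit."""
    for path in paths:
        limiter.wait()
        upload_one(path)


# Demonstration with a fake clock so the example finishes instantly.
fake_now = [0.0]
pauses = []

def fake_sleep(seconds):
    pauses.append(seconds)
    fake_now[0] += seconds

uploaded = []
limiter = RateLimiter(max_calls=100, period=100,
                      clock=lambda: fake_now[0], sleep=fake_sleep)
upload_all(["clip_a.mp3", "clip_b.mp3", "clip_c.mp3"], uploaded.append, limiter)
```

With the default `clock` and `sleep` arguments, the same class paces real calls. Even spacing is the most conservative reading of the limit; a sliding-window limiter could send short bursts but is more complex to get right.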
We then tried to use the uploaded files with Google Colab but had problems accessing a folder with over a million files in it. This meant we had to process clips locally, connecting Colab to a Jupyter server. As a stopgap measure, we used the MCVD Welsh dataset with a size of 4GB, much smaller than the Mozilla Common Voice English dataset’s 50GB. The Welsh dataset was converted to TFRecords locally for uploading to Google Drive, allowing development to continue while trying to process the English dataset.
Once both networks were able to train with the Welsh records, we began training the networks with the English dataset. Unfortunately, Colab could not handle working with a few large files any better than it had with many small files. We discovered that a paid license for Colab exists, which would have allowed us to circumvent this issue, but it was only available in the USA.
With Colab unable to process the data, we made the decision to abandon Google and begin training the networks locally. We ran Python files, dropping notebooks entirely. We knew that Colab used GPUs, which sped up the process of training the networks, but we did not realize how large a performance increase they provided. Without a GPU, the CNN took about 5 minutes to finish one epoch of 200 steps with a batch size of 768. Once we were able to install the required technology for accessing the GPU, that same epoch processed in approximately 15 seconds, a roughly 20-fold speedup.
With everything set up, we could train our networks in a limited fashion. Daitan’s Innovation Team set their CNN to train for an arbitrarily large number of epochs (9,999). Instead of relying on an explicit epoch limit, the CNN has an early-stopping callback that evaluates the network at the end of each epoch. If the network stops improving, the callback halts training and keeps the weights from the point at which they were most effective. Due to the limitations of our technology, we could not run each resource-intensive network for an indeterminate amount of time. Thus, our research does not represent the full extent of the networks; rather, it measures how the networks compare after equal training periods of 200 epochs.
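The early-stopping idea can be sketched in a few lines of plain Python. The `patience` parameter (how many epochs without improvement are tolerated) and the loss values are assumptions for illustration; the article does not give the settings of Daitan’s callback. Keras provides an `EarlyStopping` callback with `patience` and `restore_best_weights` options that offers this behaviour out of the box.

```python
def run_with_early_stopping(validation_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a sequence of per-epoch losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs. A real loop would save the weights at each new best.
    """
    best_loss = float("inf")
    best_epoch = 0
    stopped_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best: checkpoint here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no progress for `patience` epochs in a row
    return best_epoch, stopped_epoch

# Made-up losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
best_epoch, stopped_epoch = run_with_early_stopping(losses, patience=3)
```

In this toy run, training halts at epoch 6 and the weights from epoch 3 are kept, so the later improvement at epoch 7 is never seen. That trade-off is why the patience value matters.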
With the two networks trained, we began our comparison study. We approached the comparison from two points of view: objective and subjective.
The subjective view was easy to determine. We created a survey containing an unbiased sample of equivalent audio clips and sent it to employees at Daitan and students in our program at Camosun College. Participants were presented with pairs of audio clips, one denoised with the CNN and the other with the GAN. The survey asked questions about which of the pair was clearer, which had less background noise, and which sounded less muffled.
The objective view was more difficult, as there is no objective way to determine audio quality. However, we applied several techniques that gave insight into how much introduced noise affected clean audio. Using the free software Sonic Visualiser, we produced detailed waveform models and spectrograms.
Our survey consisted of eleven sections. The first ten sections asked identical questions about pairs of audio clips. Each pair consisted of the same noisy clip denoised by each of our two networks. The last section asked for personal information about the participants, including their primary language and how they listened to the audio clips. To see the questionnaire, click here.
For each pair of denoised clips we asked participants to choose:
- which one was clearer,
- which had less background noise, and
- which was more muffled.
In addition, we asked what the participants thought the speaker was saying and how they would rate each clip for overall intelligibility.
The survey participants also rated each clip in a pair on a scale from 1 to 10. The ratings seemed to depend on the quality of the noisy file prior to denoising, rather than the quality of the network itself. A clip featuring a clearly enunciating speaker with soft street sounds as background noise [noisy_13] received the highest average ratings, 7.92 and 7.43. The lowest-rated clip, with averages of 2.73 and 2.70, had a mumbling speaker and harsh drilling as the background noise [noisy_29]. The widest gap within a pair was 0.76, with the GAN scoring 5.56 against the CNN’s 4.80 [noisy_33].
Based on the survey responses, there was no clear winner between the two networks in terms of clarity. Participants found the networks performed similarly regarding how muffled their output was. However, they found the CNN produced decidedly less background noise than the GAN. Aside from one pair of denoised clips that had evenly split votes, the CNN outperformed the GAN on this question by a margin of at least 15%.
In the following section, we describe two ways of analyzing audio data: waveforms and spectrograms. A waveform shows the sum of the wave amplitudes at a given time. The farther the amplitudes are from zero, the louder the sound is at that moment in time. A spectrogram can be considered a “heatmap” of which frequencies are present and when they occur. It is a three-dimensional graph with time on the x-axis, frequency on the y-axis, and power on the z-axis, which is shown as color. If a frequency is louder (higher power) at a point in time, it will appear “hotter” on the spectrogram.
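For readers who want to see how a spectrogram is derived from a waveform, here is a small sketch using a short-time Fourier transform written with only the Python standard library. The frame size, hop, and Hann window are arbitrary choices for this example and are not the settings used by Daitan’s dataset creator or by Sonic Visualiser; real pipelines would use an FFT library rather than this slow direct transform.

```python
import cmath
import math

def hann(n):
    """Hann window: tapers each frame's edges to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Power in each frequency bin of one frame (direct DFT, slow but simple)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        powers.append(abs(total) ** 2)
    return powers

def spectrogram(signal, frame_size=64, hop=32):
    """Return power values indexed as result[time_frame][frequency_bin]."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_size)]
        frames.append(power_spectrum(frame))
    return frames

# A pure 1000 Hz tone sampled at 8000 Hz should light up a single bin.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
loudest_hz = loudest_bin * sample_rate / 64  # bin width = sample_rate / frame_size
```

Each inner list is one vertical slice of the heatmap: the position in the list is the frequency (y-axis), the index of the slice is the time (x-axis), and the value is the power that a plotting tool maps to color.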
This image shows a closeup of a clean audio waveform that is 1/10th of a second long. Since the original recording is low quality, we can see some noise. A basic waveform is depicted; we are looking at overall shape in this waveform, not specifics.
This image shows a different point in time, another 1/10th of a second in length (the CNN is pictured on top, the GAN below). You can see there are higher peaks and more erratic frequencies in the GAN output. Though there is some noise in the CNN output, the general shape of the waveform is easier to see. The erratic noise in the GAN output seems to be generated background static that permeates all of its denoising attempts, a common occurrence in GAN-based audio denoising. To address this, we could train the GAN to also detect that background noise and remove it. While this may not remove it completely, we hypothesize that it would make a difference.
Here is a comparison of all three audio clips. The top (blue) waveform was denoised by the CNN, the middle (red) by the GAN, and the bottom is the clean original audio. We can see that the GAN generally removes less of the noise introduced to the voice, as shown by the higher peaks where the gunshot noise is. It is worth noting that the CNN seems to have difficulty keeping the peaks down on certain higher-frequency words such as “style” and seems to exacerbate any small amount of static or noise caused by the recording device. The second picture below highlights the areas in question.
Above, the recorded word “style” is highlighted in green. The CNN seems to peak much higher here, meaning there is more noise compared to the original recording. Highlighted in yellow is a moment where the speaker is not saying anything, but their microphone is still creating ambient noise. The CNN seems confused by this and ends up making it louder than the GAN does, despite the GAN’s inherent, constant background noise. Highlighted in pink is a gunshot noise at a moment where there is no speech. Here, we can see the CNN does a better job of reducing the noise than the GAN.
We can only tell so much by looking at the waveform, so we must also look at the spectrogram of each of these sound clips.
Above is the spectrogram of the original “clean” audio. There is a very faint background noise that permeates when the individual speaks due to the quality of the audio equipment used.
These are spectrograms of the outputs of the two networks, with the CNN on top and the GAN on the bottom. Throughout both clips there is background static shown by the heat at all frequencies, though it is much more dramatic in the GAN. Interestingly, at the anomalous points outside the spoken frequencies (created by the noise we added), the GAN has a far lower range of frequencies that it distributes that noise over, but the noise itself is louder. This tracks with what we saw in the waveform.
This also provides some more interesting results in the actual spoken section. Based on our observations, the CNN leaves more noise at higher frequencies, which it is unable to remove, while the GAN keeps the general shape of the voice audio distribution. It is hard to determine the cause of this pattern, but it suggests that the GAN outperforms the CNN at catching high-frequency noise, whereas the CNN is better at removing the louder noise of the gunshot.
The three voice files used to produce the diagrams above can be listened to at the links below.
To improve this project and show the full potential of the networks, the first and easiest modification would be to train the networks without a fixed epoch limit, relying on early-stopping callbacks to halt them. Further, we would invest more time in planning and focused research. Although our main setback was our reliance on Google Suite, we could have minimized this by researching other available options (e.g., AWS SageMaker).
Regardless of whether the GAN or the CNN proves more successful, both networks could use refinement. Future integration with a network that trains to recognize words might give the denoising network another useful input. However, this might create its own problems due to the lack of accuracy common to speech-recognition software. These networks could also be personalized. Imagine a phone app that learns a single voice, and with that knowledge learns to better denoise it. There are ethical implications to that line of thinking, as deep fakes remain a concern.
While there are still challenges with denoising audio, there are many applications that benefit from the current technology. This study allowed us to examine the benefits one type of network offers over another in improving a denoiser. While the results of this study are marginally in favour of the CNN, our networks were subjected to a limited amount of training. Due to the structure of the GAN and the theory behind it, it is reasonable to believe that the GAN could train longer and would continue to refine itself. We believe this should be the next step for research in this area.
Our team thanks Daitan for sponsoring this project and for providing the opportunity to learn about a bleeding-edge technology. We also thank Camosun College and our instructors for their insights, support, and words of encouragement. A special thanks to João Paulo Tavares Músico, Cleosson José Pirani de Souza, and Thalles Santos Silva of Daitan, as well as Saryta Schaerer, Lynda Robbins, Jonas Bambi, and Benjamin Leather of Camosun College.
This work has been proudly sponsored by Daitan Labs of Canada
Portions of this page are reproduced from work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License. For more information on the structure of Generative Adversarial Networks please see here.