Posted by

ITEM TILE – File Size: 8.1K

Code for the paper Music Source Separation in the Waveform Domain

Music Source Separation in the Waveform Domain

We provide an implementation of Demucs and Conv-Tasnet for music source separation on the MusDB dataset.They can separate drums, bass and vocals from the rest with state-of-the-art results, surpassing previous waveform or spectrogram based methods.The architecture and results obtained are detailed in our paperMusic Source Separation in the waveform domain.

Demucs is based on U-Net convolutional architecture inspired by Wave-U-Net andSING, with GLUs, a BiLSTM between the encoder and decoder, specific initialization of weightsand transposed convolutions in the decoder.

Conv-Tasnetis a separation model developed for speech which predicts a mask on a learnt over-complete linear representationusing a purely convolutional model with stride of 1 and dilated convolutional blocks.We reused the code from the kaituoxu/Conv-TasNetrepository and added support for multiple audio channels.

When trained only on MusDB, Conv-Tasnet achieves higher SDR than Demucs (5.7 vs 5.6).However, the audio it generates has significant artifacts as measured by human evaluations(MOS is 3.2 for Demucs, 2.9 for Conv-Tasnet). When trained with extra training data,Demucs and Conv-Tasnet obtain the same SDR. See our paper Section 6 for more details or listen to ouraudio samples .

Schema representing the structure of Demucs,    with a convolutional encoder, a BiLSTM, and a decoder based on transposed convolutions.

Comparison with other models

An audio comparison of Demucs and Conv-Tasnet with other state-of-the-art methods such as Wave-U-Net, OpenUnmix orMMDenseLSTM is available on the audio comparison page.We provide hereafter a summary of the different metrics presented in the paper.You can also compare Spleeter, Open-Unmix, Demucs and Conv-Tasnet on one of my favoritesongs on our soundcloud playlist.

Comparison of accuracy

Overall SDR is the mean of the SDR for each of the 4 sources, MOS Quality is a rating from 1 to 5of the naturalness and absence of artifacts given by human listeners (5 = no artifacts), MOS Contaminationis a rating from 1 to 5 with 5 being zero contamination by other sources. We refer the reader to our paper, Section 5 and 6,for more details.

| Model | Domain | Extra data? | Overall SDR | MOS Quality | MOS Contamination || ————- |————-| —–:|——:|—-:|—-:|| Open-Unmix | spectrogram | no | 5.3 | 3.0 | 3.3 || Wave-U-Net | waveform | no | 3.2 | – | – || Demucs (this) | waveform | no | 5.6 | 3.2 | 3.3 || Conv-Tasnet (this) | waveform | no | 5.7 | 2.9 | 3.4 || Demucs (this) | waveform | 150 songs | 6.3 | – | – || Conv-Tasnet (this) | waveform | 150 songs | 6.3 | – | – || MMDenseLSTM | spectrogram | 804 songs | 6.0 | – | – || Spleeter | spectrogram | undisclosed | 5.9 | – | – |


If you have anaconda installed, you can run from the root of this repository:

conda env update -f environment-cpu.yml # if you don't have GPUsconda env update -f environment-cuda.yml # if you have GPUsconda activate demucs

This will create a demucs environmnent with all the dependencies installed.

Using Windows

If you are using Windows, replace python3 by python.exe in all the commands provided hereafter :)Parts of the code are untested on Windows (in particular, training a new model). Please open an issue in case you have a problem.

Separating tracks

In order to try Demucs or Conv-Tasnet on your tracks, simply run from the root of this repository

“`bashpython3 -m demucs.separate –dl -n demucs PATHTOAUDIOFILE1 [PATHTOAUDIOFILE2 …] # for Demucspython3 -m demucs.separate –dl -n tasnet PATHTOAUDIOFILE1 … # for Conv-Tasnet

Demucs with randomized equivariant stabilization (10x slower, suitable for GPU, 0.2 extra SDR)

python3 -m demucs.separate –dl -n demucs –shifts=10 PATHTOAUDIOFILE1“`

The --dlflag will automatically download a pretrained model into ./models. There will be one folderper audio file, reusing the name of the track without the extension. Each folder will contain four stereo wav files sampled at 44.1 kHz: drums.wav, bass.wav,other.wav, vocals.wav.Those folders will be placed in ./separated/MODEL_NAME.

Any stereo audio file supported by ffmpeg will work. It will be resampled to 44.1 kHz on the flyif necessary. If multiple streams (i.e. a stems file) are present in the audio file,the first one will be used.

Other pre-trained models can be selected with the -n flag and downloaded with the --dl flag.The models will be stored in the models folder. The list of pre-trained models is:- demucs: Demucs trained on MusDB,- demucs_extra: Demucs trained with extra training data,- tasnet: Conv-Tasnet trained on MusDB,- tasnet_extra: Conv-Tasnet trained with extra training data.

The --shifts=SHIFTS performs multiple predictions with random shifts (a.k.a randomizedequivariant stabilization) of the input and average them. This makes prediction SHIFTS timesslower but improves the accuracy of Demucs by 0.2 points of SDR.It has limited impact on Conv-Tasnet as the model is by nature almost time equivariant.The value of 10 was used on the original paper, although 5 yields mostly the same gain.It is deactivated by default.

Examining the results from the paper experiments

The metrics for our experiments are stored in the results folder. In particularmuseval json evaluations are stored in results/evals/EXPERIMENT NAME/results.You can aggregate and display the results usingbashpython3 -p # show valid loss, aggregated with multiple random seedspython3 -p # show SDR on test set, aggregated with multiple random seedspython3 -p SIR # also SAR, ISR, show other metricsThe std column shows the standard deviation divided by the square root of the number of runs.

Training Demucs and evaluating on the MusDB dataset

If you want to train Demucs from scrath, you will need a copy of the MusDB dataset.It can be obtained on the MusDB website.To start training on a single GPU or CPU, use:bashpython3 -m demucs -b 4 --musdb MUSDB_PATH # Demucspython3 -m demucs -b 4 --musdb MUSDB_PATH --tasnet --samples=80000 --split_valid # Conv-TasnetThe -b 4 flag will set the batch size to 4. The default is 4 and will crash on a single GPU.Demucs was trained on 8 V100 with 32GB of RAM.The default parameters (batch size, number of channels etc)might not be suitable for 16GB GPUs.To train on all available GPUs, use:bashpython3 --musdb MUSDB_PATH [EXTRA_FLAGS]

This will launch one process per GPU and report the output of the first one. When interruptingsuch a run, it is possible some of the children processes are not killed properly, be mindful of that.If you want to use only some of the available GPUs, export the CUDA_VISIBLE_DEVICES variable toselect those.

To see all the possible options, use python3 -m demucs --help.

About checkpointing

Demucs will automatically generate an experiment name from the command line flags you provided.It will checkpoint after every epoch. If a checkpoint already exist for the combination of flagsyou provided, it will be automatically used. In order to ignore/delete a previous checkpoint,run with the -R flag.The optimizer state, the latest model and the best model on valid are stored. At the end of eachepoch, the checkpoint will erase the one from the previous epoch.By default, checkpoints are stored in the ./checkpoints folder. This can be changed using the--checkpoints CHECKPOINT_FOLDER flag.

Not all options will impact the name of the experiment. For instance --workers is notshown in the name, therefore, changing this parameter will not impact the checkpoint fileused. Refer to for more details.

Test set evaluations

Test set evaluations computed with museval will be stored underevals/EXPERIMENT NAME/results. The experiment nameis the first thing printed when running python3 or python3 -m demucs. If you usedthe flag --save, there will also be a folder evals/EXPERIMENT NAME/wavs containingall the extracted waveforms.

Running on a cluster

If you have a cluster available with Slurm, you can set the as the target of aslurm job, using as many nodes as you want and a single task per node. willcreate one process per GPU and run in a distributed manner. Multinode training is supported.

Extracting Raw audio for faster loading

We observed that loading from compressed mp4 audio lead to unreliable speed, sometimes reducing bya factor of 2 the number of iterations per second. It is possible to extract all datato raw PCM f32e format. If you wish to store the raw data under RAW_PATH, run the followingcommand first:

bashpython3 -m demucs.raw [--workers=10] MUSDB_PATH RAW_PATH

You can then train using the --raw RAW_PATH flag, for instance:bashpython3 --raw RAW_PATH --musdb MUSDB_PATHYou still need to provide the path to the MusDB dataset as we always load the test setfrom the original MusDB.

Results reproduction

To reproduce the performance of the main Demucs model in our paper:“`bash

Extract raw waveforms. This is optional


Train models with default parameters and multiple seeds

python3 –seed 42 # for Demucspython3 –seed 42 –tasnet –X=10 –samples=80000 –epochs=180 –split_valid # for Conv-Tasnet

Repeat for –seed = 43, 44, 45 and 46


You can visualize the results aggregated on multiple seeds usingbashpython3 # compare validation lossespython3 # compare test SDRpython3 SIR # compare test SIR, also available ISR, and SAR

You can look at our exploration file to see the exact flagsfor all experiments (grid search and ablation study). If you have a Slurm cluster,you can also try adapting it to run on your own.

Environment variables

If you do not want to always specify the path to MUSDB, you can export the following variables:“`bashexport DEMUCS_MUSDB=PATH TO MUSDB

Optionally, if you extracted raw pcm data



How to cite

@techreport{music_separation_waveform, title = {{Music Source Separation in the Waveform Domain}}, author = {D{'e}fossez, Alexandre and Usunier, Nicolas and Bottou, L{'e}on and Bach, Francis}, year = {2019}, number = {02379796v1}, institution = {HAL},}


Demucs is released under Creative Commons Attribution-NonCommercial 4.0 International(CC BY-NC 4.0) license, as found in the LICENSE file.

The file demucs/ is adapted from the kaituoxu/Conv-TasNet repository.It was originally released under the MIT License updated to support multiple audio channels.

To restore the repository download the bundle


and run:

 git clone facebookresearch-demucs_-_2019-11-29_12-57-06.bundle 

Uploader: facebookresearch
Upload date: 2019-11-29