Evaluate the Performance¶
Usage¶
After creating the tensors used to train and test the neural network, training can begin. Different models and optimizers can be selected through command-line arguments. Additionally, the class_weight feature from Keras can be activated; it was implemented to account for the uneven distribution of samples across instrument labels. Since it did not yield improvements on the IRMAS data set, this feature is optional in argument parsing.
Available models are: model_baseline, model_leaky, model_two_branch, and model_multi_res. Available optimizers are: adam and sgd.
To train the network, use the following command:
python3 evaluate.py -m <model_name> -o <optimizer_name> -O <1/2/3>
Depending on the operation mode, follow the instructions to select the split.
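For example, to train and evaluate the baseline model with the adam optimizer in operation mode 2:
python3 evaluate.py -m model_baseline -o adam -O 2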
Operation Mode 1¶
This operation mode is used to extract confusion matrices. Since confusion matrices are only valid for single-labeled data, there are two possibilities depending on the data sets used:
- IRMAS and IRMAS Wind data sets: since their testing data are multi-labeled, a new model is trained on 50% of the training data set and the remainder is used for confusion matrix calculation.
- Monotimbral and Jazz data sets: since these data sets are single-labeled in both the training and testing data, confusion matrices can be extracted directly from the pure test data set.
To plot the confusion matrices:
cd <MODEL_PATH>/<model_name>
python3 plot_conf_mat.py
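The plotting script itself ships with the repository; purely as an illustration of the underlying computation, a row-normalized confusion matrix can be produced as in this sketch (the label arrays are hypothetical):
# Minimal sketch of confusion matrix plotting, assuming predictions and
# ground-truth labels have already been collected as integer class ids.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 2, 1, 1, 0])   # hypothetical ground-truth class ids
y_pred = np.array([0, 2, 0, 1, 0])   # hypothetical predicted class ids

cm = confusion_matrix(y_true, y_pred)
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)  # row-normalize

plt.imshow(cm, cmap="Blues")
plt.colorbar()
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()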
Operation Mode 2¶
This operation mode evaluates global and class-wise performance metrics while varying the identification threshold from 0 to 1. Following the original authors, the development test data set is used to calculate these metrics.
Before the final performance metrics can be extracted, their behaviour with respect to the identification threshold must be plotted:
cd <MODEL_PATH>/<model_name>
python3 plot_perf_metrics.py
Do not skip this step before extracting the final performance metrics.
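As an illustration only (the arrays below are randomly generated stand-ins for the Evaluator's predictions and labels), the curves produced by this step correspond roughly to the following sketch:
# Sketch: sweep the identification threshold from 0 to 1 and plot
# micro- and macro-averaged f-scores for multi-label predictions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 11))   # hypothetical multi-hot labels
y_score = rng.random(size=(100, 11))          # hypothetical sigmoid outputs

thresholds = np.linspace(0.0, 1.0, 101)
micro, macro = [], []
for t in thresholds:
    y_pred = (y_score >= t).astype(int)
    micro.append(f1_score(y_true, y_pred, average="micro", zero_division=0))
    macro.append(f1_score(y_true, y_pred, average="macro", zero_division=0))

plt.plot(thresholds, micro, label="micro f-score")
plt.plot(thresholds, macro, label="macro f-score")
plt.xlabel("identification threshold")
plt.ylabel("f-score")
plt.legend()
plt.show()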
Operation Mode 3¶
Finally, the final performance metrics are extracted by applying the optimal identification thresholds (for both the global and class-wise cases) to the pure test data set.
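A minimal sketch of this step, assuming the optimal per-class thresholds have already been read off the curves from operation mode 2 (all values below are hypothetical):
# Sketch: apply class-wise optimal thresholds to pure test set predictions.
import numpy as np

class_thresholds = np.array([0.40, 0.55, 0.30])   # hypothetical optimal thresholds
y_score = np.array([[0.70, 0.20, 0.40],
                    [0.10, 0.60, 0.90]])          # hypothetical sigmoid outputs

# Broadcasting compares each column against its own class threshold.
y_pred = (y_score >= class_thresholds).astype(int)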
Documentation¶
class evaluate.Evaluator(model_str, optimizer_str, num_classes, iter_num, op_mode)¶
Evaluator uses the testing data set to reproduce the results of the original neural network design and to evaluate the improvements made by audio source separation and the different proposed experiments. This testing algorithm supports audio excerpts of variable length. A usage sketch follows the parameter list.
Parameters:
- model_str (str) – model to be evaluated.
- optimizer_str (str) – optimizer that was used in training.
- num_classes (int) – number of classes that were trained in the model.
- iter_num (int) – number of training iterations to evaluate.
- op_mode (int) – operation mode: [1] evaluate confusion matrix, [2] evaluate class-wise and global performance metrics, [3] extract final performance metrics using the pure test data set.
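A usage sketch under stated assumptions (that evaluate.py is importable from the working directory; num_classes=11 and iter_num=5 are illustrative values, not prescribed ones):
from evaluate import Evaluator

# Evaluate the baseline model trained with adam, in operation mode 2
# (class-wise and global performance metrics).
evaluator = Evaluator("model_baseline", "adam", num_classes=11,
                      iter_num=5, op_mode=2)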
aggregate_predictions(strategy, full_predictions)¶
This method applies different aggregation strategies to obtain a final prediction for each sample (complete audio excerpt) in the test data set; an illustrative sketch follows the parameter list.
Parameters:
- strategy (str) – name of the aggregation strategy to evaluate.
- full_predictions (list) – complete predictions over all excerpts in the test data set.
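The strategies themselves are defined in the source; as an illustration, window-level predictions are commonly combined by averaging or by taking the maximum (the strategy names below are assumptions, not the project's):
import numpy as np

# window_scores: hypothetical per-window sigmoid outputs for one excerpt,
# shape (num_windows, num_classes).
window_scores = np.array([[0.2, 0.9],
                          [0.4, 0.7],
                          [0.1, 0.8]])

def aggregate(strategy, scores):
    # Combine window-level scores into one excerpt-level prediction.
    if strategy == "mean":
        return scores.mean(axis=0)
    if strategy == "max":
        return scores.max(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

excerpt_pred = aggregate("mean", window_scores)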
best_classwise_global_performance_metrics(strategy, iteration)¶
This method calculates the global performance metrics of the system using a novel approach: applying the optimal class-wise thresholds to the pure test data set. It calculates precision, recall, and f-score for micro and macro averaging. It also evaluates different performance metrics using a variable threshold for each instrument separately.
Parameters:
- strategy (str) – name of the aggregation strategy to store.
- iteration (int) – id of the iteration being evaluated.
classwise_performance_metrics(strategy, iteration)¶
This method calculates the class-wise performance metrics of the system. It calculates precision, recall, and f-score for micro and macro averaging. It also evaluates different performance metrics using a variable threshold for each instrument separately. An illustrative sketch follows the parameter list.
Parameters:
- strategy (str) – name of the aggregation strategy to store.
- iteration (int) – id of the iteration being evaluated.
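For illustration, per-class precision, recall, and f-score can be computed as in this sketch (the arrays are hypothetical):
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])  # hypothetical multi-hot labels
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])  # hypothetical thresholded outputs

# average=None returns one value per class instead of a single average.
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)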
evaluate(predictions, model_str, optimizer_str, iteration)¶
This method evaluates the corresponding trained model i with different aggregation strategies for each sample in the data set. It saves a dictionary with the corresponding performance metrics, depending on the operation mode of the evaluator and the incoming predicted values.
Parameters:
- predictions (list) – calculated predictions for model i and the selected operation mode.
- model_str (str) – name of the model to load.
- optimizer_str (str) – name of the optimizer used in training.
- iteration (int) – id of the iteration being evaluated.
global_performance_metrics(strategy, iteration)¶
This method calculates the global performance metrics of the system. It calculates precision, recall, and f-score for micro and macro averaging. It also evaluates different performance metrics using a variable global threshold for all instruments.
Parameters:
- strategy (str) – name of the aggregation strategy to store.
- iteration (int) – id of the iteration being evaluated.
load_model(model_str, optimizer_str)¶
This method loads the model and the weights obtained in training. Each model is identified by the activation function used and the optimizer used in backpropagation. It also calculates the predictions for every melspectrogram in the testing data sets (validation, development, and pure test data sets). A loading sketch follows the parameter list.
Parameters:
- model_str (str) – name of the model to load.
- optimizer_str (str) – name of the optimizer used in training.
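A minimal sketch of loading a serialized Keras model and its trained weights; the file names below are assumptions about how this project stores its models, not its actual layout:
# Sketch: load a Keras architecture from JSON and restore trained weights.
from tensorflow.keras.models import model_from_json

with open("model_baseline_adam.json") as f:      # hypothetical file name
    model = model_from_json(f.read())
model.load_weights("model_baseline_adam.h5")     # hypothetical file name

# Predictions for a batch of melspectrograms (hypothetical array X_test):
# scores = model.predict(X_test)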
load_multi_test_data()¶
This method loads the multiple-input development test data (melspectrogram arrays) and the corresponding file ids into memory. It also applies one-hot encoding and a multi-label binarizer to the labels in the test data set. The loaded data set depends on the operation mode selected for the evaluation: [1] loads the 50-50 train-validation split, [2] loads the variable train-validation split, [3] loads the pure testing data set.
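For illustration, the multi-label binarization mentioned above works roughly as follows (the instrument names are examples only):
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multi-label annotations for three excerpts.
labels = [("piano", "violin"), ("flute",), ("piano",)]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)   # multi-hot matrix, one column per instrument
print(mlb.classes_)             # column order: ['flute' 'piano' 'violin']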
load_test_data()¶
This method loads the single-input development test data (melspectrogram arrays) and the corresponding file ids into memory. It also applies one-hot encoding and a multi-label binarizer to the labels in the test data set. The loaded data set depends on the operation mode selected for the evaluation: [1] loads the 50-50 train-validation split, [2] loads the variable train-validation split, [3] loads the pure testing data set.
save_results(model_str, optimizer_str)¶
This method saves the results of the evaluation depending on the operation mode of the evaluator.
Parameters:
- model_str (str) – name of the model.
- optimizer_str (str) – name of the optimizer.