Onset detection is the first step in many higher-level music analysis tasks, yet comparatively little research addresses onset detection for string instruments. Given the strong demand for automated systems, this project tackles string-instrument onset detection with a convolutional neural network (CNN). The project is written in Python, and the network is built with Keras (TensorFlow). The dataset contains more than 250 minutes of string music recordings and more than 2,000 manually annotated onsets, and it is used for training, validation, and testing.
The CNN takes MFCCs and delta MFCCs as its two input channels. These feed two convolutional layers and one max-pooling layer, which extract the feature maps. The feature maps are then passed to a fully connected layer. The output is one-hot encoded: (1,0) is non-onset; (0,1) is onset.
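The input and output layout described above can be sketched in plain Python. This is only an illustration: the helper names (`delta`, `stack_channels`, `one_hot_onset_labels`), the first-order-difference delta, and the `hop_seconds` frame spacing are assumptions, not the project's actual preprocessing.

```python
def delta(features):
    """First-order difference along time (assumed, simplified delta).

    `features` is a list of frames, each a list of MFCC coefficients.
    The first frame has no predecessor, so its delta is all zeros.
    """
    out = [[0.0] * len(features[0])]
    for prev, cur in zip(features, features[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def stack_channels(mfcc, mfcc_delta):
    """Combine two (frames x coefficients) matrices into channels-last form."""
    return [[[m, d] for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(mfcc, mfcc_delta)]


def one_hot_onset_labels(onset_times, num_frames, hop_seconds):
    """Return one label per frame: (1, 0) for non-onset, (0, 1) for onset."""
    labels = [(1, 0)] * num_frames
    for t in onset_times:
        frame = int(round(t / hop_seconds))  # nearest frame to the annotation
        if 0 <= frame < num_frames:
            labels[frame] = (0, 1)
    return labels


# Toy data: two frames with two coefficients each (hypothetical values).
mfcc = [[1.0, 2.0], [3.0, 5.0]]
stacked = stack_channels(mfcc, delta(mfcc))
labels = one_hot_onset_labels([0.02, 0.05], num_frames=6, hop_seconds=0.01)
```

Here `stacked` has shape (frames, coefficients, 2), which matches the `channels_last` layout used by the network, and `labels` marks frames 2 and 5 as onsets.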
(Network Structure Diagram)
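from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, data_format='channels_last'))  # input_shape is (height, width, 2): MFCC and delta MFCC channels
model.add(Conv2D(64, (3, 3), activation='relu', data_format='channels_last'))  # input shape is inferred from the previous layer
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))  # num_classes is 2: non-onset and onset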
(Network Structure Code)
Although Keras has a built-in EarlyStopping callback for validation, I replaced it with a custom function that uses the F-measure as the validation criterion, in order to make detection more accurate. Stopping and resuming training are implemented by saving and loading the model.
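The stopping logic can be sketched independently of Keras. This is a minimal sketch under stated assumptions: the `patience` rule, the function name, and the four callables (`train_one_epoch`, `evaluate_f_measure`, `save_model`, `load_model`) are hypothetical stand-ins, not the project's actual function.

```python
def train_with_f_measure_stopping(train_one_epoch, evaluate_f_measure,
                                  save_model, load_model,
                                  max_epochs=100, patience=5):
    """Train until the validation F-measure stops improving.

    All four callables are hypothetical hooks supplied by the caller.
    Returns (best F-measure, epoch at which it was reached).
    """
    best_f, best_epoch, waited = -1.0, -1, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        f = evaluate_f_measure()
        if f > best_f:
            best_f, best_epoch, waited = f, epoch, 0
            save_model()              # checkpoint the best model so far
        else:
            waited += 1
            if waited >= patience:    # no improvement for `patience` epochs
                break
    if best_epoch >= 0:
        load_model()                  # restore the best checkpoint
    return best_f, best_epoch


# Toy run with made-up validation scores instead of a real network.
scores = iter([0.5, 0.7, 0.6, 0.65, 0.64])
events = []
result = train_with_f_measure_stopping(
    train_one_epoch=lambda: None,
    evaluate_f_measure=lambda: next(scores),
    save_model=lambda: events.append("save"),
    load_model=lambda: events.append("load"),
    patience=2,
)
```

With a Keras model, the save and load hooks could wrap `model.save(path)` and `keras.models.load_model(path)`; the checkpoint path is left to the caller.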
(Custom Function Logic Structure Diagram)
After tuning the hyperparameters (MFCC frame size, number of MFCC filterbanks, number of kernels, batch size, and dropout rate), the highest F-measure was 0.86.
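For reference, an onset F-measure can be computed along the following lines. The write-up does not say how detections were matched to annotations, so the one-to-one matching and the ±50 ms tolerance window (a common convention in onset detection evaluation) are assumptions, and `onset_f_measure` is a hypothetical helper.

```python
def onset_f_measure(detected, annotated, tolerance=0.05):
    """Return (precision, recall, F-measure) for onset times in seconds.

    A detection counts as correct if it lies within `tolerance` seconds of
    an annotation that has not already been matched (assumed convention).
    """
    detected = sorted(detected)
    annotated = sorted(annotated)
    i = j = true_pos = 0
    while i < len(detected) and j < len(annotated):
        diff = detected[i] - annotated[j]
        if abs(diff) <= tolerance:
            true_pos += 1             # matched pair
            i += 1
            j += 1
        elif diff < 0:
            i += 1                    # detection too early: false positive
        else:
            j += 1                    # annotation missed: false negative
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: three of four detections match three of four annotations.
precision, recall, f_measure = onset_f_measure(
    detected=[0.51, 1.00, 1.80, 2.52],
    annotated=[0.50, 1.02, 2.50, 3.00],
)
```

In this toy example, precision, recall, and F-measure are all 0.75: the detection at 1.80 s is a false positive and the annotation at 3.00 s is missed.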
Final parameters:
(The red line represents the onsets detected by the network; the blue line represents the manually annotated onsets.)