Negative Sampling Rate (NR)

The negative sampling rate (NR) is a parameter that controls the number of negative samples relative to the number of positive samples. For example, if 1,000 known pairs are used as positive samples and NR is set to 2, then 2,000 negative samples are randomly generated for the constructed dataset. These experiments are meant to simulate the unbalanced datasets encountered in practice.
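A minimal sketch of how NR could translate into a sampled dataset. The source does not say how negatives are drawn, so this assumes they are unknown pairs sampled uniformly without replacement; the function name `sample_negatives` and the toy item lists are hypothetical.

```python
import itertools
import random


def sample_negatives(positive_pairs, left_items, right_items, nr, seed=0):
    """Draw round(nr * n_positives) random pairs that are not known positives.

    Assumption: every (left, right) pair that is not a known positive is a
    valid negative candidate. Enumerating all candidates is only practical
    for small illustrations like this one.
    """
    positives = set(positive_pairs)
    n_negatives = int(round(nr * len(positives)))
    candidates = [
        pair
        for pair in itertools.product(left_items, right_items)
        if pair not in positives
    ]
    if n_negatives > len(candidates):
        raise ValueError("not enough candidate pairs for the requested NR")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(candidates, n_negatives)


# Toy data: 10 known pairs out of a 20 x 20 grid of possible pairs.
positives = [(i, i) for i in range(10)]
left_items = list(range(20))
right_items = list(range(20))

negatives = sample_negatives(positives, left_items, right_items, nr=2, seed=0)
print(len(positives), len(negatives))
```

With NR = 2 the sampled negative set is twice the size of the positive set, mirroring the 1,000 to 2,000 example above at a smaller scale.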

NR ranged from 0.5 to 16 (0.5, 1, 2, 4, 8, 16). The negative sampling procedure was repeated three times. For each dataset, we conducted five-fold cross-validation three times.
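The protocol above can be sketched as a nested loop. This is one reading of the text, not a confirmed implementation: it assumes each of the three sampled datasets per NR goes through three repeats of five-fold cross-validation, and that folds are plain (unstratified) shuffled splits. The helper names `kfold_indices` and `experiment_runs` are hypothetical.

```python
import random

NR_VALUES = [0.5, 1, 2, 4, 8, 16]
N_SAMPLING_SEEDS = 3  # negative sampling repeated three times
N_CV_REPEATS = 3      # cross-validation repeated three times
N_FOLDS = 5           # five-fold cross-validation


def kfold_indices(n_samples, n_folds, seed):
    """Yield (train, test) index lists for a shuffled, unstratified k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def experiment_runs(n_positives):
    """Enumerate every (NR, sampling seed, CV repeat, fold) combination."""
    for nr in NR_VALUES:
        n_samples = n_positives + int(round(nr * n_positives))
        for sampling_seed in range(N_SAMPLING_SEEDS):
            for cv_repeat in range(N_CV_REPEATS):
                splits = kfold_indices(n_samples, N_FOLDS, seed=cv_repeat)
                for fold, (train, test) in enumerate(splits):
                    yield nr, sampling_seed, cv_repeat, fold, train, test


runs = list(experiment_runs(n_positives=100))
print(len(runs))  # 6 NR values x 3 seeds x 3 repeats x 5 folds
```

Under this reading, each NR value yields 45 train/test splits per dataset, which is where the spread in the reported metrics would come from.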

AUC, AUPR, F1, and MCC were calculated for comparison.
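For reference, here is a small pure-Python sketch of three of these metrics using their standard definitions. The 0.5 decision threshold and the toy labels and scores are assumptions for illustration only. AUPR is left out because its value depends on the interpolation convention; in practice a library routine such as scikit-learn's `average_precision_score` would typically be used.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, tn, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so suitable only for small illustrations.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy unbalanced example: 2 positives, 4 negatives (NR = 2).
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

counts = confusion_counts(y_true, y_pred)
f1_value = f1_score(*counts)
mcc_value = mcc(*counts)
auc_value = auc(y_true, scores)
print(counts, f1_value, mcc_value, auc_value)
```

AUC depends only on how the scores rank positives against negatives, while F1 and MCC depend on the thresholded predictions, which is one reason the metrics can respond differently as NR grows.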

Dataset: LC

Variation of the metrics across negative sampling rates (NR). NR (x-axis) was set from 0.5 to 16. The y-axis shows the metric values. Negative sampling used three random seeds, and five-fold cross-validation was repeated three times. The dataset is LC. For AUC, RFCNN outperformed the other methods as NR increased. Regarding AUC, F1, and MCC, RF degraded quickly, while CNN and RFCNN degraded relatively slowly. CNN is slightly better than RFCNN in AUPR, F1, and MCC. However, the gain in AUC can compensate for the slight reduction in AUPR, F1, and MCC.
Variation of the metrics across negative sampling rates (NR). NR (x-axis) was set from 0.5 to 16. The y-axis shows the metric values. Negative sampling used three random seeds, and five-fold cross-validation was repeated three times. The dataset is IT.
Variation of the metrics across negative sampling rates (NR). NR (x-axis) was set from 0.5 to 16. The y-axis shows the metric values. Negative sampling used three random seeds, and five-fold cross-validation was repeated three times. The dataset is LCIT. The results show the same trend across datasets (LC, IT, LCIT). The AUC of RFCNN is higher than that of RF and CNN for all NRs. For the remaining metrics, RFCNN is slightly inferior to CNN. RF alone is more sensitive to the negative sampling rate, and introducing the CNN helps mitigate this sensitivity. These figures indicate that the ensemble method can increase AUC without significantly degrading the other metrics.

These figures were generated from the results files:

LC_results_NR IT_results_NR LCIT_results_NR