Random Seeds Exploration

Random seeds on CV

We further examined which variables contribute to variation in model performance: the composition of the dataset (LC, IT, LCIT), the random seed for negative sampling (File_rs), and the random seed for splitting the dataset into five folds (Divide_rs). For each variable, we computed the standard deviation of each metric for the RF, CNN, and RFCNN models, sorted the values, and plotted them in the figure below:

Fig 5. Standard deviation (Std) across datasets and random seeds. The x-axis shows the model and the corresponding metric; for example, RF_AUC means that the downstream model is Random Forest and that the standard deviation is computed over its AUC values. The last entry is the Std averaged over all models and metrics. The y-axis shows the standard deviation computed on the negative sampling rate exploration experiments. The three colors denote the variables: dataset (red), negative sampling random seed (file random seed, Frs; dark blue), and cross-validation random seed (divide random seed, Drs; light blue).

As the figure shows, the dataset is the factor that produces the most variation, which suggests that adding more experimentally validated known pairs may improve performance. Frs and Drs produce less variation as the negative sampling rate changes. Across models, RF is less affected by Frs and Drs but strongly affected by the dataset, whereas CNN is more sensitive to the random seeds used for cross-validation and negative sampling. Across metrics, F1 and MCC are more easily affected by the random seeds, while AUC and AUPR are more stable regardless of the downstream model.
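To make the comparison of factors concrete, the sketch below shows one plausible way to obtain a per-factor standard deviation: average a metric within each level of a factor, then take the standard deviation across the level means. This is an assumption about the procedure, not necessarily how Fig 5 was produced. The record layout, the function name std_by_factor, and all scores are hypothetical and synthetic.

```python
from collections import defaultdict
from itertools import product
from statistics import mean, pstdev

# Synthetic scores for illustration only (NOT results from the experiments).
# Each record: (dataset, file_rs, divide_rs, model_metric, value).
base = {"LC": 0.88, "IT": 0.90, "LCIT": 0.93}   # hypothetical dataset effect
file_shift = {1: 0.000, 2: 0.004}               # hypothetical File_rs effect
divide_shift = {1: 0.000, 2: 0.002}             # hypothetical Divide_rs effect

records = [
    (d, f, s, "RF_AUC", base[d] + file_shift[f] + divide_shift[s])
    for d, f, s in product(base, file_shift, divide_shift)
]

def std_by_factor(records, factor_index, model_metric):
    """Std across the per-level mean scores of one factor.

    factor_index selects the factor: 0 = dataset, 1 = File_rs, 2 = Divide_rs.
    """
    by_level = defaultdict(list)
    for rec in records:
        if rec[3] == model_metric:
            by_level[rec[factor_index]].append(rec[4])
    level_means = [mean(values) for values in by_level.values()]
    return pstdev(level_means)  # population std (an assumption)

dataset_std = std_by_factor(records, 0, "RF_AUC")
file_std = std_by_factor(records, 1, "RF_AUC")
divide_std = std_by_factor(records, 2, "RF_AUC")
print(dataset_std, file_std, divide_std)
```

With these synthetic numbers the dataset factor yields the largest value, which mirrors the qualitative pattern described for RF; the actual magnitudes in Fig 5 come from the experiments, not from this sketch.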

Random Seeds on Ensemble methods

We further conducted experiments on ensemble methods. We trained two more models, with random seeds 4 and 5, then randomly selected three of the five models, recomputed the results, and collected them.
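The selection step can be sketched as follows. The source does not state how the three selected models are combined, so averaging their predicted scores (soft voting) is an assumption here. The helper name ensemble_scores and the toy per-seed scores are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean

seeds = [1, 2, 3, 4, 5]

# All ways to pick three of the five trained models.
all_triples = list(combinations(seeds, 3))

def ensemble_scores(per_seed_scores, triple):
    """Average the per-sample scores of the selected models.

    Soft voting is an assumption; the combination rule is not given in the text.
    """
    selected = [per_seed_scores[s] for s in triple]
    return [mean(sample_scores) for sample_scores in zip(*selected)]

# Toy predicted scores for four samples per seed (made up for illustration).
per_seed_scores = {
    1: [0.90, 0.20, 0.70, 0.40],
    2: [0.80, 0.30, 0.60, 0.50],
    3: [0.85, 0.25, 0.65, 0.35],
    4: [0.75, 0.15, 0.80, 0.45],
    5: [0.95, 0.10, 0.55, 0.30],
}

# Randomly keep six of the possible triples; a fixed RNG seed keeps this reproducible.
rng = random.Random(0)
chosen = rng.sample(all_triples, 6)

ensembles = {triple: ensemble_scores(per_seed_scores, triple) for triple in chosen}
for triple, scores in ensembles.items():
    print(triple, [round(s, 3) for s in scores])
```

Each ensemble's scores would then be evaluated with the same metric as the single models, for example AUC.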

rs_c	RF-only	CNN-only	RFCNN-only	RF-ensemble	CNN-ensemble	RFCNN-ensemble	Mean	M_std
1,2,3	0.927	0.891	0.926	0.935	0.905	0.934	0.919667	0.016183
1,2,4	0.918	0.883	0.918	0.925	0.901	0.926	0.911833	0.015269
1,2,5	0.926	0.887	0.925	0.926	0.897	0.926	0.9145	0.016174
2,3,4	0.923	0.884	0.924	0.926	0.901	0.927	0.914167	0.016139
2,3,5	0.932	0.887	0.931	0.928	0.898	0.928	0.917333	0.017904
3,4,5	0.927	0.889	0.927	0.93	0.899	0.93	0.917	0.016563
Std	0.004272	0.002734	0.003891	0.003399	0.002609	0.002814
Mean	0.9255	0.886833	0.925167	0.928333	0.900167	0.9285
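The summary statistics in the table can be recomputed from its entries. The short check below uses only the values reported above and Python's standard library. The population standard deviation (statistics.pstdev) is used because it reproduces the reported Std and M_std values; the variable names are illustrative.

```python
from statistics import mean, pstdev

columns = ["RF-only", "CNN-only", "RFCNN-only",
           "RF-ensemble", "CNN-ensemble", "RFCNN-ensemble"]

# AUC values copied from the table, keyed by seed combination (rs_c).
auc = {
    "1,2,3": [0.927, 0.891, 0.926, 0.935, 0.905, 0.934],
    "1,2,4": [0.918, 0.883, 0.918, 0.925, 0.901, 0.926],
    "1,2,5": [0.926, 0.887, 0.925, 0.926, 0.897, 0.926],
    "2,3,4": [0.923, 0.884, 0.924, 0.926, 0.901, 0.927],
    "2,3,5": [0.932, 0.887, 0.931, 0.928, 0.898, 0.928],
    "3,4,5": [0.927, 0.889, 0.927, 0.930, 0.899, 0.930],
}

# Row-wise statistics: the Mean and M_std columns.
row_stats = {combo: (mean(vals), pstdev(vals)) for combo, vals in auc.items()}

# Column-wise statistics: the Mean and Std rows at the bottom of the table.
col_values = list(zip(*auc.values()))
col_stats = {name: (mean(vals), pstdev(vals))
             for name, vals in zip(columns, col_values)}

for combo, (m, s) in row_stats.items():
    print(f"{combo}\tmean={m:.6f}\tstd={s:.6f}")
for name, (m, s) in col_stats.items():
    print(f"{name}\tmean={m:.6f}\tstd={s:.6f}")
```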

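In the table above, rs_c is the random seed combination, Mean and M_std are the mean and standard deviation of each row, and the last two rows give the standard deviation and mean of each column. The table shows that the seed combination 1,2,3 achieves the highest AUC for the RFCNN ensemble (0.934) and also the highest mean AUC (0.919667) among the tested seed combinations. Comparing the mean values of the ensemble policies, the RFCNN ensemble performs marginally better than the RF ensemble (0.9285 vs. 0.928333), a difference smaller than the standard deviation across seed combinations (about 0.003).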

The following are visualizations of the table above.