More Experimental Results

*Negative Sampling Rate (NR)*

*Comparison with SOTA methods (Features)*

*Random Seeds Exploration*

*models-dimensions-and-kernels*

This supplementary page collects additional results that are of interest but could not be included in the manuscript because of page limits.

As more datasets are included in training, the AUC/AUPR of the deep-learning ensemble appears to decrease, while the AUC/AUPR of the machine-learning ensemble increases.
As the figure shows, ensemble methods such as the RF-ensemble and CNN-ensemble outperform their sub-models on each dataset. The RF-ensemble achieves the highest AUC and AUPR in the third task, and the RF-CNN-ensemble ranks second in this task.
Using ITLC to predict the HC dataset. As we can see, combining different datasets, such as IT + LC, can increase the predictive ability. For this reason, we also used the LC+IT dataset to train the models for the Precision@K tests. The LCIT-HC setting is also the closest to our real use case: in practice, the number of pairs to be predicted is huge, while the training dataset is limited and should be as large as possible.

When the ROC curve is restricted to a small FPR range (because we want to predict the top-ranked relations), the RF and RF-CNN-ensemble curves rise sharply at small FPR on the LC dataset. On the IT dataset, the RF-CNN-ensemble achieves the highest AUC while FPR is below 0.01. On the larger datasets, RF and RF-CNN consistently have a higher TPR when FPR is small, which means that these methods may recover more true positives for a given small increase in the false positive rate.
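The low-FPR comparison described above can be made concrete by asking which TPR a model reaches before its FPR exceeds a chosen limit. The following is a minimal sketch under stated assumptions: the labels and scores are toy values rather than data from this study, and the helper names `roc_points` and `tpr_at_fpr` are illustrative.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), one per distinct score threshold."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        # Samples with the same score are processed as one threshold step.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def tpr_at_fpr(labels, scores, max_fpr):
    """Highest TPR reached while FPR stays at or below max_fpr."""
    return max(tpr for fpr, tpr in roc_points(labels, scores) if fpr <= max_fpr)


# Toy example (assumed values): 4 positives and 6 negatives.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(tpr_at_fpr(toy_labels, toy_scores, 0.0))  # 0.5
print(tpr_at_fpr(toy_labels, toy_scores, 0.2))  # 0.75
```

With hundreds of millions of candidate pairs, only the far left of the ROC curve corresponds to the top-ranked predictions, so comparing models by `tpr_at_fpr` at a small limit reflects that use case more directly than the full AUC.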

Zoomed Figure

We have 12417 lncRNAs and 16127 genes, and the total number of candidate pairs is 200,000,619. Thus, even a small increase in the true positive rate can improve the predictive ability considerably.

This may also explain why the top-100 Precision@K values (Table 1) for the RF-CNN-ensemble are much higher than those for the CNN-ensemble alone and the RF-ensemble alone.

Table 1. Prediction ability of the different ensemble strategies

| Precision@K | RF ensemble (training included) | CNN ensemble (training included) | RF-CNN ensemble (training included) |
|---|---|---|---|
| 10 | 0.1 | 0 | 0.5 |
| 20 | 0.15 | 0.05 | 0.65 |
| 30 | 0.17 | 0.07 | 0.57 |
| 50 | 0.1 | 0.06 | 0.48 |
| 70 | 0.07 | 0.07 | 0.47 |
| 90 | 0.06 | 0.08 | 0.46 |
| 100 | 0.05 | 0.11 | 0.47 |
The predictors were trained on the IT+LC dataset. We then scored all possible pairs and took the top 100 to conduct the Precision@K experiments. Each hit is a lncRNA-gene pair that co-occurs in a PubMed search. The more papers mention a lncRNA and a gene together, the more likely the pair is to be a ‘real_positive’ pair.
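The Precision@K measure used here is the fraction of the K highest-ranked pairs that count as hits. A minimal sketch follows, assuming the ranked list and the set of literature-supported pairs are already available; the pair names are placeholders, not results from this study.

```python
def precision_at_k(ranked_pairs, confirmed, k):
    """Fraction of the top-k ranked pairs that appear in the confirmed set."""
    top_k = ranked_pairs[:k]
    hits = sum(1 for pair in top_k if pair in confirmed)
    return hits / k


# Placeholder ranking, ordered from highest to lowest predicted score.
ranked = [
    ("lncA", "gene1"),
    ("lncA", "gene2"),
    ("lncB", "gene1"),
    ("lncB", "gene3"),
    ("lncC", "gene2"),
]
# Placeholder set of pairs with a literature co-occurrence hit.
confirmed = {("lncA", "gene1"), ("lncB", "gene1"), ("lncC", "gene2")}

for k in (1, 2, 5):
    print(k, precision_at_k(ranked, confirmed, k))
```

Under this definition, a value of 0.5 at K = 10 corresponds to 5 supported pairs among the top 10.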

If we take AUPR (PR curves) as another example, the CNN shows a slight decrease on the IT dataset when recall is small (0.02). The AUPR of the CNN decreases considerably when the LCIT dataset is used to predict HC. Adding a machine-learning model to the system, as in RF-CNN, makes the curve smoother, and the same holds for the ensemble methods (such as the RF-CNN-ensemble).

Using the ITLC dataset to predict the case-study dataset. Although the RF-CNN ensemble achieves a higher AUC and AUPR, the case-study dataset contains only 12 pairs, which is not enough data to conclude with confidence that it is better.

Understanding (Details) of RF-CNN only and RF-CNN-ensemble

Take LCIT (union set) predicting the case study as an example. We use random seeds 1-5 to generate the IT+LC negative samples; each resulting dataset contains 1382 positive samples and 1382 negative samples. These datasets are denoted LCIT_1, 2, 3, 4, 5.

We likewise use random seeds 1-5 to generate the negative samples for Case_study_6 (6 means that there are 6 positive samples); each resulting dataset contains 6 positive samples and 6 negative samples. These datasets are denoted Case_study_6_1, 2, 3, 4, 5.

Thus, to calculate the RF-only AUC, we concatenate 12 samples * 5 (IT+LC) random seeds * 5 case_study_6 random seeds.

In total, this gives 300 samples. To avoid data leakage, the scores for the i-th LCIT dataset and the j-th case_study_6 dataset are not counted when i equals j. That is, if the negative-sample seeds for the training set and the test set are the same, the predicted scores are excluded.

12 samples * [(5 (IT+LC) * 5 case_study_6) - 5 (i=j)] = 12 * 20 = 240 samples

Thus, in total, we have 240 samples for calculating the CNN-only, RF-only, and RF-CNN-only metrics.

CNN-only, 240 samples
RF-only, 240 samples
RF-CNN-only, 240 samples
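The sample counts above can be checked by enumerating the seed combinations directly. This is a small sketch of the arithmetic only; the constant and variable names are illustrative.

```python
from itertools import product

N_TEST_SAMPLES = 12        # 6 positive + 6 negative pairs per Case_study_6 dataset
TRAIN_SEEDS = range(1, 6)  # seeds used for LCIT_1..5
TEST_SEEDS = range(1, 6)   # seeds used for Case_study_6_1..5

# Every (training seed i, test seed j) combination.
all_combos = list(product(TRAIN_SEEDS, TEST_SEEDS))
# Combinations with i == j are dropped to avoid data leakage.
kept_combos = [(i, j) for i, j in all_combos if i != j]

total_before_filter = N_TEST_SAMPLES * len(all_combos)
total_pooled = N_TEST_SAMPLES * len(kept_combos)

print(total_before_filter)  # 300
print(total_pooled)         # 240
```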

The ensemble is different: it adds the scores from the 5 predictors (in practice 4 predictors, excluding the i = j case) into a single score. That is, for each pair in the independent dataset, the scores are summed before the AUC/AUPR is calculated, so the number of samples used in the calculation is only 12 samples * 5 case_study_6 random seeds = 60 samples in total.

CNN-ensemble, 60 samples
RF-ensemble, 60 samples
RF-CNN-ensemble, 60 samples
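The score summation behind the ensemble counts can be sketched as follows. The scores here are random toy values standing in for real predictor outputs, and the dictionary layout and the function name `ensemble_scores` are assumptions made for illustration.

```python
import random

N_PAIRS = 12         # pairs per Case_study_6 dataset
SEEDS = range(1, 6)  # seeds 1..5 for both training and test sets
rng = random.Random(0)

# scores[(i, j)]: toy scores from the predictor trained on LCIT_i,
# applied to the 12 pairs of Case_study_6_j.
scores = {
    (i, j): [rng.random() for _ in range(N_PAIRS)]
    for i in SEEDS
    for j in SEEDS
}


def ensemble_scores(scores, j):
    """Sum, per pair, the scores of all predictors with training seed i != j."""
    members = [scores[(i, j)] for i in SEEDS if i != j]
    return [sum(per_pair) for per_pair in zip(*members)]


ensemble = {j: ensemble_scores(scores, j) for j in SEEDS}
n_ensemble_samples = sum(len(v) for v in ensemble.values())
print(n_ensemble_samples)  # 60
```

Each test dataset keeps one summed score per pair, so the ensemble metrics are computed over 60 scored samples rather than 240.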

Thus, for our controlled deep learning methods, we actually added all predictors’ scores (3 RFs + 3 NNs) together to meet the requirements of the RF-CNN-ensemble.

Other predictive results

  1. Top-predicted results for a selected group of lncRNAs

We further selected some lncRNAs as a case study and searched for each keyword pair in PubMed. The more papers report a pair, the more likely the pair is to be related. We selected the lncRNAs MALAT1, XIST, and DANCR, which have been widely studied experimentally [1], and the lncRNAs HOXA-AS2, NEAT1, TUG1, and SNHG1 because of their high rankings in the prior discussion.
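A keyword-pair search of this kind can be scripted. The page does not say how the searches were run, so the sketch below is only one possible approach: it assumes the NCBI E-utilities `esearch` endpoint and a plain `AND` query, and it only builds the request URL without sending it. The function name `cooccurrence_query_url` is illustrative.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the original search
# procedure is not described).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def cooccurrence_query_url(lncrna, gene):
    """Build a PubMed search URL for papers mentioning both names."""
    params = {
        "db": "pubmed",
        "term": f"{lncrna} AND {gene}",
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = cooccurrence_query_url("MALAT1", "ATG5")
print(url)
```

A plain `AND` query can match papers that mention both names without reporting an interaction, so hits found this way still need manual checking.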

The second column shows that, for most of the chosen lncRNAs, more than ten of the top 20 results are pairs reported in papers from PubMed. One exception is HOXA-AS2, which has the fewest reported articles (74 results) among the selected lncRNAs.

We summarized the results and regarded them as evidence supporting our hypothesis (Table 2). We list the unknown target (training-set pairs removed), the PMID, and the publication year, together with the mediating miRNAs. For instance, lncRNA MALAT1 can regulate ATG5 in gastric cancer through miR-30e [2], and exogenous lncRNA XIST sponges miR-153 in osteosarcoma cells, where SNAI1 is a direct messenger RNA target of miR-153 [3]. These data further support the predictive capability of the proposed approach in the context of lncRNAs acting as ceRNAs to control target genes.

Table 2. Top unknown targets for a selected set of lncRNAs

| lncRNA | Top-20 hits found | Top predicted (unknown) | PMID / Year | miRNA involved |
|---|---|---|---|---|
| MALAT1 | 16 | ATG5 | 31926239 [2] / 2020 | miR-30e |
| XIST | 14 | SNAI1 | 32515520 [3] / 2020 | miR-153 |
| XIST | 14 | HMGB1 | 31418997 / 2019 | miR-29b |
| DANCR | 10 | MAPK1 | 34515297 / 2021 | miR-19a-3p |
| DANCR | 10 | CASP3 | 32581581 / 2020 | miR-758-3p |
| TUG1 | 12 | STAT3 | 34080023 / 2021 | miR-204-5p |
| TUG1 | 12 | DAPK1 | 30182733 / 2018 | miR-153-3p |
| TUG1 | 12 | EZR | 35012433 / 2022 | miR-377-3p |
| HOXA-AS2 | 2 | MMP9 | 29310118 / 2018 | miR-373 |
| NEAT1 | 16 | AKT1 | 33336058 / 2020 | miR-1294 |
| SNHG1 | 13 | CTNNB1 | 32239719 / 2020 | miR-4436a |

[1] Yi W, Li J, Zhu X, et al. CRISPR-assisted detection of RNA–protein interactions in living cells[J]. Nature methods, 2020, 17(7): 685-688.
[2] Zhang Y F, Li C S, Zhou Y, et al. Propofol facilitates cisplatin sensitivity via lncRNA MALAT1/miR-30e/ATG5 axis through suppressing autophagy in gastric cancer[J]. Life sciences, 2020, 244: 117280.
[3] Wen J F, Jiang Y Q, Li C, et al. LncRNA-XIST promotes the oxidative stress-induced migration, invasion, and epithelial-to-mesenchymal transition of osteosarcoma cancer cells through miR-153-SNAI1 axis[J]. Cell Biology International, 2020, 44(10): 1991-2001.