&more.
As our code are highly related to folder directory,
thus, it is better to download needed files when run the code:
For different tasks, we have uploaded different zip bag that contains needed file to run the program.
‘.\Results\temp\’ folder stored the files need for training and testing (which is constructed embedding dataset, with frs_as random seed that leveraged in the negative sampling that can be applied directly to machine learning algorithms):
If the file is not existed, then the program would try to generate a embedding file by itself, which needed the lncRNA and gene sequence-based embedding generated, and also an constructed dataset.
Results would be generated to the folder of ‘.\Results\’
Also we uploaded some demo with needed file for the experiments, after unzip those file, you can directly run the code without other files. To save the space, only one constructed embedding is put into the demos.
RF-Type1:
GOOGLE DRIVE: RF-TYPE1-LC+IT dataset
RF-Type2:
GOOGLE DRIVE: RF-TYPE2-IT dataset
RF-Type3:
CNN-Type1:
Tested Cuda/Cudnn version:
CUDA 11.2 Cudnn 8.1.1 tensorflow-gpu-2.5.0 Gtx2080ti 11G
or CUDA 10.1 Cudnn 7.6.0 tensorflow-gpu-2.3.0 gtx1060 6G
CNN requirements.txt
CNN-Type2:
CNN-Type3:
Transfer Predict (RF):
Transfer Predict (CNN):
CNN-ITLC predict Case_study 6 (without trained model)
CNN-ITLC predict Case_study_6 (with trained model)
Each model is 1.5 Giga bytes.
For HC data, it is extremely big, 10 Gigabytes, thus, it is a better choice to generate that file using constructed dataset rather than download.
RF-CNN-ENSEMBLE:
RF-CNNensemble- ITLC-case (small)
RF-CNN ensemble- ALL experiments (big)
Tips for code:
For those code, there are something in common:
- frs means the file random seeds, control the negative sampling process.
- Rs means the random seed to control the repetition,
- the constructed embedding dataset are name like this:
Constructed_{‘your dataset’}_rs{negative sampling rs}_mkPCA{dimension}_mkPCA{dimension}.csv
- You can name it as you wish but you need to change the dir in the code: (try to search this in the code)
X_train,y_train,X_test,y_test,X,Y=file_to_dataset( your file dir)
your file dir is the constructed embedding dataset’s directory.
Tips2: generate constructed embedding dataset
If you don’t have the constructed embedding dataset, then the code default is to generate one:
Which needed the lncRNA embedding and gene embedding. To do so, you need two folders that have the following path created in your current folder (where the code exists):
Change those clauses,
- fileA=’.\\lnc\lncRNA_features\\mix_kpca\\lnc-palindormic_mix_kernalpolyPCA{}).csv’.format(mp)
- fileB=’.\\gene\\mix_kpca\\gene mix_kernelpolyPCA{}).csv’.format(gp)
- dataset=’.\\dataset_construction\\constructed_LC(refine_493)_rs{}.csv’.format(frs)
the same, mp gp is set to 4096, frs is the random forest that controls the random seeds. By Changing those files, you can generate embedding according to your own constructed datasets.
constructed_LC(refine_493)_rs1_mkPCA4096mer_mkPCA4096.csv
Also, you can visit the code repo:
https://github.com/Xshelton/LncRNA-TOP
For More of the source code.