Feature types:
If you want to generate the features from the very beginning, here are some hints:
Gene Features:
Put all gene feature files into one folder, then run folder_pca.py to generate the KPCA features. Here is a demo. Because the k-mer features we generated cover k = 3 to 8 and the files are very large, use the code and the gene_max_sequence16127.csv file to generate them yourself, then put them into the gene folder to run KPCA.
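The k-mer step can be sketched as follows. This is a minimal illustration, not the repository's own script: the function names `kmer_frequencies` and `kmer_profile` are hypothetical, and normalizing counts to frequencies is an assumption (the original code may keep raw counts).

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return k-mer frequencies for one sequence, in lexicographic ACGT order."""
    sequence = sequence.upper().replace("U", "T")  # treat RNA input as DNA
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows with N or other ambiguous bases
            counts[kmer] += 1
            total += 1
    # Assumption: normalize to frequencies; use raw counts if preferred.
    return [counts[m] / total if total else 0.0 for m in kmers]


def kmer_profile(sequence, ks=range(3, 9)):
    """Concatenate the k-mer vectors for every k in ks (3-mer to 8-mer by default)."""
    profile = []
    for k in ks:
        profile.extend(kmer_frequencies(sequence, k))
    return profile


vec = kmer_frequencies("ACGTACGT", 3)
print(len(vec))  # 4**3 entries for k = 3
```

For k = 3 to 8 the concatenated vector has 4^3 + ... + 4^8 = 87,360 entries per sequence, which is why the raw k-mer files are large and are reduced with KPCA afterwards.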
iLearn [1] is a platform that generates sequence-based features. We feed it the FASTA file, generate the corresponding features on the platform, and then apply KPCA to those features to obtain the features we used. (Note that the iLearn features are only one part of our method; the k-mer data from 3-mer to 8-mer is still needed for mix-kpca.)
[1] Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data[J]. Briefings in bioinformatics, 2020, 21(3): 1047-1057.
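As a rough illustration of the KPCA step, the sketch below reduces a feature matrix with scikit-learn's `KernelPCA`. It is not the content of folder_pca.py: the RBF kernel, the number of components, and the random placeholder matrix are all assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Placeholder data: 20 sequences x 64 features (for example, 3-mer frequencies).
# In practice this matrix would be loaded from the k-mer or iLearn feature files.
rng = np.random.default_rng(0)
features = rng.random((20, 64))

# Assumption: RBF kernel and 4 components, for illustration only.
kpca = KernelPCA(n_components=4, kernel="rbf")
reduced = kpca.fit_transform(features)

print(reduced.shape)
```

Each row of `reduced` is the low-dimensional embedding of one sequence; the same call applies to gene and lncRNA feature matrices alike.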
Gene KPCA: Examples
LncRNA FASTA, iLearn KPCA: Examples
The features below are the transformed features, which are ready to use for building the constructed embedding dataset for machine learning and deep learning models.
LncRNA: LncRNA Features
Gene: Gene Features
If the download speed is low, or you just want to run the downstream experiments, you can download the embeddings we used for the downstream tasks:
Using the features provided, we can build the constructed embedding datasets, which are the input for the machine learning methods.
First, generate the constructed embedding dataset from the original dataset:
Original dataset: here
Constructed dataset: here
Constructed embedding dataset generation samples (please unzip lncRNA mix kpca 4096 and Gene mix kpca 4096 into the same root folder to run this demo)
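The idea behind the constructed embedding dataset can be sketched as below. This is a hedged illustration rather than the repository's demo: it assumes the original dataset is a list of (lncRNA ID, gene ID, label) pairs and that each row of the constructed dataset is the lncRNA embedding followed by the gene embedding and the label. The function name `build_embedding_dataset` and all IDs are hypothetical.

```python
def build_embedding_dataset(pairs, lnc_emb, gene_emb):
    """Turn (lncRNA ID, gene ID, label) pairs into rows of concatenated embeddings."""
    rows = []
    for lnc_id, gene_id, label in pairs:
        if lnc_id not in lnc_emb or gene_id not in gene_emb:
            continue  # skip pairs that have no embedding for either side
        rows.append(lnc_emb[lnc_id] + gene_emb[gene_id] + [label])
    return rows


# Toy embeddings with 2 dimensions each; the real ones are much wider.
lnc_emb = {"lnc1": [0.1, 0.2], "lnc2": [0.3, 0.4]}
gene_emb = {"geneA": [1.0, 2.0]}
pairs = [("lnc1", "geneA", 1), ("lnc2", "geneA", 0), ("lnc3", "geneA", 1)]

dataset = build_embedding_dataset(pairs, lnc_emb, gene_emb)
for row in dataset:
    print(row)
```

Pairs whose IDs are missing from either embedding table are dropped here; whether to drop or raise an error in that case is a design choice left to the reader.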