I’m new to R and computational work in general, so please bear with me as my questions are probably pretty basic.
I’m trying to analyze the transcriptomics of colorectal cancer patients with vs. without perineural invasion (PNI). After digging into cBioPortal, I found 232 patients who have PNI annotations. I have their barcodes saved in patient_barcode and used GDCquery() to extract the RNA-seq data:
    query_PNI <- GDCquery(
      project = c("TCGA-READ", "TCGA-COAD"),
      data.category = "Transcriptome Profiling",
      experimental.strategy = "RNA-Seq",
      workflow.type = "STAR - Counts",
      sample.type = c("Primary Tumor", "Solid Tissue Normal"),
      data.type = "Gene Expression Quantification",
      barcode = patient_barcode
    )
    GDCdownload(query = query_PNI)
    tcga_PNI <- GDCprepare(query_PNI)
However, I get 239 entries instead of 232, because seven patients are duplicated: "TCGA-AZ-6598", "TCGA-AZ-6599", "TCGA-AZ-6603", "TCGA-AZ-6605", "TCGA-AH-6643", "TCGA-AZ-6600", "TCGA-AZ-6601".
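To illustrate what "duplicated patients" means here, the following is a minimal sketch on toy data of how repeated patient IDs could be listed together with their sample types. The data frame `col_info` and the IDs `PATIENT-A` to `PATIENT-C` are hypothetical stand-ins; with the real object, something like `as.data.frame(colData(tcga_PNI))` would presumably take its place, assuming it carries `patient` and `sample_type` columns. Whether my seven duplicates really come from patients having both a tumor and a normal sample is a guess I have not confirmed.

```r
# Toy stand-in for the sample annotation of tcga_PNI (hypothetical IDs and values).
col_info <- data.frame(
  barcode     = c("S1", "S2", "S3", "S4"),
  patient     = c("PATIENT-A", "PATIENT-A", "PATIENT-B", "PATIENT-C"),
  sample_type = c("Primary Tumor", "Solid Tissue Normal",
                  "Primary Tumor", "Primary Tumor"),
  stringsAsFactors = FALSE
)

# Patients that occur in more than one column of the expression matrix.
dup_patients <- unique(col_info$patient[duplicated(col_info$patient)])

# All rows belonging to those patients, to see why they are repeated.
dup_rows <- col_info[col_info$patient %in% dup_patients, ]

# Cross-tabulate patient against sample type.
print(table(dup_rows$patient, dup_rows$sample_type))
```

If the table shows one tumor and one normal sample per repeated patient, the extra columns would be explained by the two values passed to `sample.type` in the query.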
I also have a clinical dataset, extracted with GDCquery_clinic, which has 232 patients. Because the two matrices differ in size, I can’t run the downstream analysis.
I have tried the code below to filter out those patients, but none of the attempts worked:
    tcga_filtered <- tcga_PNI[, tcga_PNI$patient %in% patient_barcode]

    tcga_filtered <- tcga_PNI %>%
      filter(!patient %in% c("TCGA-AZ-6598", "TCGA-AZ-6599", "TCGA-AZ-6603",
                             "TCGA-AZ-6605", "TCGA-AH-6643", "TCGA-AZ-6600",
                             "TCGA-AZ-6601"))

    rows_to_keep <- !rowData(tcga_PNI)$patient %in%
      c("TCGA-AZ-6598", "TCGA-AZ-6599", "TCGA-AZ-6603", "TCGA-AZ-6605",
        "TCGA-AH-6643", "TCGA-AZ-6600", "TCGA-AZ-6601")
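To make the goal concrete, here is a minimal sketch on toy data of the result I am after: one expression column per patient, in the same order as the clinical table. It assumes that only tumor samples are wanted and that sample annotation lives on the columns rather than the rows. The names `col_info`, `counts`, `clinical_ids` and the `PATIENT-*` IDs are hypothetical. With the real object, the column subset would presumably be `tcga_PNI[, keep_idx]`, but I have not verified that.

```r
# Toy stand-ins (hypothetical): sample annotation, count matrix, clinical patient IDs.
col_info <- data.frame(
  barcode     = c("S1", "S2", "S3", "S4"),
  patient     = c("PATIENT-A", "PATIENT-A", "PATIENT-B", "PATIENT-C"),
  sample_type = c("Primary Tumor", "Solid Tissue Normal",
                  "Primary Tumor", "Primary Tumor"),
  stringsAsFactors = FALSE
)
counts <- matrix(1:8, nrow = 2,
                 dimnames = list(c("gene1", "gene2"), col_info$barcode))
clinical_ids <- c("PATIENT-A", "PATIENT-B", "PATIENT-C")

# Keep tumor columns only, then the first remaining column per patient.
tumor_idx <- which(col_info$sample_type == "Primary Tumor")
keep_idx  <- tumor_idx[!duplicated(col_info$patient[tumor_idx])]

counts_dedup <- counts[, keep_idx, drop = FALSE]
info_dedup   <- col_info[keep_idx, ]

# Reorder columns to follow the clinical table; stop if a patient has no tumor column.
ord <- match(clinical_ids, info_dedup$patient)
stopifnot(!anyNA(ord))
counts_dedup <- counts_dedup[, ord, drop = FALSE]
info_dedup   <- info_dedup[ord, ]

print(dim(counts_dedup))
```

Filtering by patient ID alone would drop both copies of a repeated patient, which is why the sketch selects by sample type first and only then removes any remaining repeats.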
Any suggestions?