regex - R extract specific text inside a string

Question

asked Feb 17, 2021 in Technique by 深蓝 (71.8m points)

I have a data.table with 1 million rows, where each cell looks like this:

ENST00000408384 // ENSEMBL // ncrna:miRNA chromosome:GRCh37:1:30366:30503:1 gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000469289 // ENSEMBL // havana:known chromosome:GRCh38:1:30267:31109:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000473358 // ENSEMBL // havana:known chromosome:GRCh38:1:29554:31097:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002840 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002841 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0

I need to extract what comes immediately after "gene_biotype:" (in this case it would be "miRNA"). How can I do that?

I tried to find a solution with stringr and regex and gave up after several hours. I'd appreciate your help. Thanks.

1 Answer

深蓝 · Answer 1 · 2021-02-16T17:34:24+0000

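You could try regmatches with regexpr, using a Perl-style lookbehind (hence perl=TRUE) so that only the text after "gene_biotype:" is returned. Note that regexpr reports only the first match in each string.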

regmatches(x, regexpr("(?<=gene_biotype:)\\w*", x, perl=TRUE))
# [1] "miRNA"
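
For a whole column rather than a single string, one thing to watch is that regmatches silently drops elements with no match, so the result can end up shorter than the input. Below is a minimal sketch of one way around that, assuming base R only. The helper names first_biotype and all_biotypes and the small cells vector are made up for illustration and are not part of the question's data.

```r
# Hypothetical helper: first gene_biotype value per element,
# NA where the element is NA or has no "gene_biotype:" in it.
first_biotype <- function(x) {
  m <- regexpr("(?<=gene_biotype:)\\w+", x, perl = TRUE)
  hit <- !is.na(m) & m > -1L           # elements that actually matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches returns matches only
  out
}

# Hypothetical helper: every gene_biotype value per element, as a list
# (gregexpr finds all matches, not just the first).
all_biotypes <- function(x) {
  regmatches(x, gregexpr("(?<=gene_biotype:)\\w+", x, perl = TRUE))
}

# Shortened, made-up cells in the same layout as the question's data.
cells <- c(
  "gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA /// gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA",
  "no annotation here"
)

first <- first_biotype(cells)   # one value per cell, same length as input
every <- all_biotypes(cells)    # list with one character vector per cell
```

Because first_biotype returns a vector of the same length as its input, it should be usable directly in a data.table assignment such as dt[, biotype := first_biotype(col)], where dt and col stand in for your own table and column names. If you prefer stringr, str_extract(x, "(?<=gene_biotype:)\\w+") is a comparable option that also gives NA for non-matching elements.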

Data:

x <- "
ENST00000408384 // ENSEMBL // ncrna:miRNA chromosome:GRCh37:1:30366:30503:1 gene:ENSG00000221311 gene_biotype:miRNA transcript_biotype:miRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000469289 // ENSEMBL // havana:known chromosome:GRCh38:1:30267:31109:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000473358 // ENSEMBL // havana:known chromosome:GRCh38:1:29554:31097:1 gene:ENSG00000243485 gene_biotype:lincRNA transcript_biotype:lincRNA // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002840 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002841 // Havana transcript // novel transcript[gene_biotype:lincRNA transcript_biotype:lincRNA] // chr1 // 100 // 100 // 0 // --- // 0
"
