I have too many lines like this:
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
TTAACTATAATCCCACTGCCTATTTTTTTATTTCTAAAAATATCATAAAAAGACACAAAA
the first line(starting with >
) is identifier and other lines are sequence and also each identifier has its own sequence. in the mentioned example, ENSG00000100206
is name and ENST00000216024
is isoform. in my file there are some identifier lines with the same name but everything else is different.
I would like to get the longest sequence for each name and make a new file. meaning there would be only one repeat of each name (but with the longest sequence).
for the above example the results would be like this:
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC
do you guys know how to do that in python?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…