This is my first question, so sorry if I do this wrong, and sorry for it being so long...
I have a table of genomes from an entire genus that I would like to compare at a smaller level, such as within one or more species. My table contains 3 columns: p1, p2, and percent identity. Each row is one pairwise comparison.
Both p1 and p2 contain genome names. Within a row, the genome with the lower number goes in p1 and the one with the higher number goes in p2. The genome names are in the format 1_1_1, so p1 may be 1_1_1 and p2 may be 2_1_1200, but in the next row p1 could be 2_1_1200 if p2 is 3_1_23. The third column is the percent identity between them, but I don't think it is relevant here.
Multiple genomes belong to the same species, but they are not in any kind of order. For example, 42, 54, 210, and 694 are the same species. I would like to find only the rows where both p1 and p2 contain these numbers, so 42 to 54, 54 to 210, etc, but not 1 to 42. This species only has 4 genomes, but some have as many as 582 to compare.
So far:
They are bacterial genomes, so the genes are not in the same order, and the third number in each name corresponds to the gene position, so I've been using "^42_" to match 42_1_622, for example. I don't want 642_1, so I anchored the 42 to the beginning. All middle numbers are 1.
subset_species_1 <- rbind(x[grep("^42_", x$p1), ],
                          x[grep("^42_", x$p2), ],
                          x[grep("^54_", x$p1), ],
                          x[grep("^54_", x$p2), ],
                          x[grep("^210_", x$p1), ],
                          x[grep("^210_", x$p2), ],
                          x[grep("^694_", x$p1), ],
                          x[grep("^694_", x$p2), ])
This is obviously tedious, and it gives me all of the rows with any of these genomes in either column, not only rows with these genomes in both columns.
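To make the goal concrete, here is a minimal sketch of the "both columns" filter I am after. It assumes the genome number is everything before the first underscore; the toy table and the `species_1` vector are made up for illustration, and this is only one possible approach, not something I know to be the best one.

```r
# Hypothetical toy table in the layout described above (values are invented)
x <- data.frame(
  p1 = c("1_1_5", "42_1_622", "54_1_10", "42_1_622", "210_1_7", "642_1_3"),
  p2 = c("42_1_622", "54_1_10", "210_1_7", "642_1_3", "694_1_88", "694_1_88"),
  identity = c(71.2, 98.4, 97.9, 70.5, 99.1, 69.8),
  stringsAsFactors = FALSE
)

# Genome numbers that belong to the species of interest
species_1 <- c("42", "54", "210", "694")

# Drop everything from the first underscore onward, leaving the genome number.
# Comparing whole numbers avoids matching 642 when looking for 42.
id1 <- sub("_.*$", "", x$p1)
id2 <- sub("_.*$", "", x$p2)

# Keep only rows where BOTH genomes belong to the species
subset_species_1 <- x[id1 %in% species_1 & id2 %in% species_1, ]
```

With this toy table, only the 42 to 54, 54 to 210, and 210 to 694 rows should survive; the rows involving genome 1 or genome 642 are dropped.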
In addition, each table only represents one gene, and ideally I'd like to use the same subsets for every table, of which there are thousands.
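As a rough sketch of what reusing the same subsets could look like, the filter can be wrapped in a function and applied over a named list of species and a list of tables. The names `genome_id`, `subset_species`, `species`, and `tables` are hypothetical, the second species is invented, and the tables here are small in-memory stand-ins for the real per-gene tables.

```r
# Hypothetical helper: genome number = text before the first underscore
genome_id <- function(name) sub("_.*$", "", name)

# Hypothetical helper: keep rows where both p1 and p2 belong to `ids`
subset_species <- function(tab, ids) {
  keep <- genome_id(tab$p1) %in% ids & genome_id(tab$p2) %in% ids
  tab[keep, , drop = FALSE]
}

# One character vector of genome numbers per species (species_2 is invented)
species <- list(
  species_1 = c("42", "54", "210", "694"),
  species_2 = c("1", "2", "3")
)

# Stand-ins for the per-gene tables; real ones would be read from files
tables <- list(
  gene_a = data.frame(p1 = c("1_1_1", "42_1_622", "2_1_1200"),
                      p2 = c("2_1_1200", "54_1_9", "42_1_622"),
                      identity = c(95.0, 98.1, 70.3)),
  gene_b = data.frame(p1 = c("54_1_30", "1_1_4"),
                      p2 = c("210_1_12", "3_1_23"),
                      identity = c(97.5, 96.2))
)

# result[[species_name]][[gene_name]] is the filtered table for that pair
result <- lapply(species, function(ids) {
  lapply(tables, subset_species, ids = ids)
})
```

Defining each species once as a vector of genome numbers means the same definition is reused for every gene table, rather than rewriting a set of grep calls per table.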
Thank you in advance, I need all the help I can get!
Edited to add: I'm doing this in R/RStudio
question from:
https://stackoverflow.com/questions/65922772/how-do-you-select-multiple-values-for-grep-across-multiple-columns-in-r