r - Find and break on repeated runs

Question

Welcome To Ask or Share your Answers For Others

r - Find and break on repeated runs

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - Find and break on repeated runs

I have a vector with repeating patterns within it. I want to break any where the repeating pattern of n length changes. Here's the data:

x <- c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))

##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 5 6 5 6 5 6 1 4 7 1 4 7 1 4 7 1 4 7 1 4 7 1 5 7 2 3 4 2 3 4 2 3 4

I want to be able to find those places the pattern changes so it breaks like this:

I think rle may be of use but don't see how.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:32:11+0000

Here's a function to do it. By the way, this is a problem in genetics - finding tandem repeats. Here's a link to an algorithm paper that is a much better treatment than this, but much more complicated to implement.

The output is a vector of groups to split x into.

First a helper function:

factorise <- function(x) {
  x <- length(x)
  if(x == 1){return(1)}
  todivide <- seq(from = 2, to = x)
  out <- todivide[x %% todivide == 0L]
  return(out)
}

Now the main function:

findreps <- function(x, counter = NULL){
  if(is.null(counter)){
    counter <- c()
    maxcounter <- 0
  } else {
    maxcounter <- max(counter)
  }
  holding <- lapply(1:length(x), function(y){x[1:y]})
  factors <- lapply(holding, factorise)
  repeats <- sapply(1:length(factors), function(index) {any(sapply(1:length(factors[[index]]), function(zz) {all((rep(holding[[index]][1:(length(holding[[index]])/factors[[index]][zz])], factors[[index]][zz]))==holding[[index]])}))})
  holding <- holding[max(which(repeats))][[1]]
  if(length(holding) == length(x)){
    return(c(counter, rep(maxcounter + 1, length(x))))
  } else {
    counter <- c(counter, rep(maxcounter + 1, length(holding)))
    return(findreps(x[(length(holding) + 1):length(x)], counter))
  }
}

How it works: It's a recursive function that runs, cuts off the biggest repeats group it can find from the start of the vector, and then runs until they are all gone.

First we make a counter for the final output.

Next, we split x into each subset starting from 1 into a list, holding.

Then we find all factors of the size of a group, except 1.

Then is the worst part. We take each subset of the biggest subset, and check if it is equal to the biggest subset in its group after being repeated the sensible amount of times.

findreps(x)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[37] 3 3 3 3 3 4 5 6 7 7 7 7 7 7 7 7 7

If you want non-repeats to be grouped, we can use a little dplyr and tidyr:

library(dplyr)
library(tidyr)

z <- data.frame(x = x, y = findreps(x))

z %>% mutate(y = ifelse(duplicated(y) | rev(duplicated(rev(y))), y, NA),
             holding = c(0, y[2:n()])) %>%
      fill(holding) %>%
      mutate(y = ifelse(is.na(y), holding +1, y)) %>%
      select(-holding)

Which gives:

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 7 7 7 7 7 7 7 7
[53] 7

Categories

r - Find and break on repeated runs

r - Find and break on repeated runs

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags