r - Efficient way to Fill Time-Series per group

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am looking for a way to fill a time-series data set by time, per group. The very inefficient approach I was using was to split the data set by group and apply a custom time-series fill function (create a sequence between the min and max, then merge) to every element of the resulting list. Needless to say, this operation would not even get past the splitting step.
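For reference, the split-and-merge approach described above might look roughly like the sketch below. This is only an illustration in base R: the helper name fill_one and the toy data are hypothetical, not the actual function or data set.

```r
# Hypothetical helper: pad one group's series to a complete hourly grid.
fill_one <- function(df) {
  grid <- data.frame(
    source = df$source[1],
    grp = seq(min(df$grp), max(df$grp), by = "hour"),
    stringsAsFactors = FALSE
  )
  # Left join: keep every hour of the grid, NA where the group has no row.
  merge(grid, df, by = c("source", "grp"), all.x = TRUE)
}

# Toy data with the same columns as d1 (values made up for illustration).
toy <- data.frame(
  source = c("a", "a", "b"),
  grp = as.POSIXct(c("2017-06-06 13:00:00", "2017-06-06 16:00:00",
                     "2017-06-04 02:00:00"), tz = "UTC"),
  cnt = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Split by group, fill each piece, and bind the pieces back together.
filled <- do.call(rbind, lapply(split(toy, toy$source), fill_one))
```

With many groups, the split, the per-group merge, and the final rbind each copy data, which is where this pattern tends to become slow.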

My dataset looks like this:

    source                 grp cnt
 1:     83 2017-06-06 13:00:00   1
 2:     83 2017-06-06 23:00:00   1
 3:     83 2017-06-07 03:00:00   1
 4:     83 2017-06-07 07:00:00   2
 5:     83 2017-06-07 13:00:00   1
 6:     83 2017-06-07 19:00:00   1
 7:     83 2017-06-08 00:00:00   1
 8:     83 2017-06-08 14:00:00   1
 9:     83 2017-06-08 15:00:00   1
10:     83 2017-06-08 20:00:00   1
11:    137 2017-06-04 02:00:00   1
12:    137 2017-06-04 05:00:00   1
13:    137 2017-06-04 23:00:00   1
...

My attempt was to use tidyverse methods by utilising the complete function, i.e.

library(tidyverse)

d1 %>% 
 group_by(source) %>% 
 complete(source, grp = seq(min(grp), max(grp), by = 'hour'))

However, after about 40-45 seconds a progress bar appeared (apparently a neat feature of some tidyverse functions; I suspect complete in this case), which estimated 9 hours to completion. My dataset is very big and this is not the lightest operation, so I am looking for something really efficient.

DATA

#dput(d1)
structure(list(source = c("83", "83", "83", "83", "83", "83", 
"83", "83", "83", "83", "137", "137", "137", "137", "137", "137", 
"137", "137", "137", "137", "137", "137", "137", "137"), grp = structure(c(1496743200, 
1496779200, 1496793600, 1496808000, 1496829600, 1496851200, 1496869200, 
1496919600, 1496923200, 1496941200, 1496530800, 1496541600, 1496606400, 
1496617200, 1496649600, 1496696400, 1496808000, 1496844000, 1496876400, 
1496962800, 1497880800, 1497888000, 1497978000, 1497996000), class = c("POSIXct", 
"POSIXt"), tzone = ""), cnt = c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
)), .Names = c("source", "grp", "cnt"), row.names = c(NA, -24L
), class = "data.frame")

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:23:33+0000

It appears that data.table is much faster than the tidyverse option. Merely translating the above into data.table (compliments of @Frank) completed the operation in a little under 3 minutes.

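library(data.table)

# Build the full hourly grid per source (setDT converts d1 to a data.table by reference)
mDT = setDT(d1)[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]
# Join d1 onto the grid; hours that are missing from d1 get NA in cnt
new_D <- d1[mDT, on = names(mDT)]

# If needed: set the NA counts to 0 (0L keeps cnt an integer column; := updates in place)
new_D[, cnt := replace(cnt, is.na(cnt), 0L)]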
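To make the join easier to follow, here is a small self-contained run of the same pattern on toy data. The names toy, grid and filled and the values are made up for illustration; only the grid-then-join technique comes from the answer above.

```r
library(data.table)

# Toy data with the same columns as d1 (values are illustrative only).
toy <- data.table(
  source = c("a", "a", "b"),
  grp = as.POSIXct(c("2017-06-06 13:00:00", "2017-06-06 16:00:00",
                     "2017-06-04 02:00:00"), tz = "UTC"),
  cnt = c(1L, 2L, 1L)
)

# One row per hour between each group's first and last timestamp.
grid <- toy[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]

# x[i, on = ...] keeps every row of i (the grid) and looks up matches in x,
# so hours absent from toy come back with cnt = NA.
filled <- toy[grid, on = .(source, grp)]

# Optional: treat missing hours as zero counts.
filled[is.na(cnt), cnt := 0L]
```

Group "a" spans 13:00 to 16:00, so it contributes four rows, two of them filled in. Group "b" has a single timestamp and stays at one row.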
