I have a Pandas DataFrame that looks like this:
ipdb> df[:4]
                  name   thread_id  state     time  uuid
0     CB:Component_Loc   829946510      1  2683817     0
1     CB:Component_Loc   829946510      0  2683874     0
2  CB:Component_Fusion  3025005749      1  2683683     1
3  CB:Component_Fusion  3025005749      0  2683882     1
thread_id is not important here. The data represents the time points at which an event (identified by uuid) was started (state = 1) and stopped (state = 0).
From that, I want to compute each event's duration and its time (defined as the mean of the start and stop times). I can achieve this with the following code:
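# One row per uuid: the event name, the midpoint time and the duration.
df = df.groupby("uuid").apply(lambda g:
    pd.Series({
        "name": g.iloc[0]["name"],
        "time": g["time"].mean(),
        "duration": g["time"].diff().iat[1]
    }))
# One column per event name, indexed by the midpoint time.
df = df.pivot(columns="name", values="duration", index="time")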
which works fine, but is very slow for the number of events I have. Furthermore, it is not really what I consider elegant.
How can I improve this code, primarily for performance?
EDIT: Some additional information, as requested:
name is not unique, i.e., there can be many events named CB:Component_Loc. However, uuid is unique for a start/stop cycle.
What I want is the duration (time when state = 0 minus time when state = 1) for each uuid.
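To make that definition concrete, here is a minimal sketch of the same computation without groupby.apply. It assumes that every uuid has exactly one start row and exactly one stop row. The names start, stop, events and result are illustrative only, and the frame is rebuilt from the four rows shown above. Avoiding a Python function call per group may be faster, but I have not measured it on the real data.

```python
import pandas as pd

# Sample rows copied from the output shown in the question.
df = pd.DataFrame({
    "name": ["CB:Component_Loc", "CB:Component_Loc",
             "CB:Component_Fusion", "CB:Component_Fusion"],
    "thread_id": [829946510, 829946510, 3025005749, 3025005749],
    "state": [1, 0, 1, 0],
    "time": [2683817, 2683874, 2683683, 2683882],
    "uuid": [0, 0, 1, 1],
})

# Assumption: each uuid has exactly one start row (state == 1)
# and exactly one stop row (state == 0).
start = df[df["state"] == 1].set_index("uuid")
stop = df[df["state"] == 0].set_index("uuid")

# Series arithmetic aligns on the uuid index, so no per-group function is needed.
events = pd.DataFrame({
    "name": start["name"],
    "time": (start["time"] + stop["time"]) / 2,  # midpoint of start and stop
    "duration": stop["time"] - start["time"],    # stop minus start
})

# Same final shape as before: one column per name, indexed by midpoint time.
result = events.pivot(index="time", columns="name", values="duration")
```

On the sample rows this gives a duration of 57 for uuid 0 and 199 for uuid 1.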
Thanks!