Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
446 views
in Technique by (71.8m points)

apache spark - Datatype error when comparing rows of a dataframe (Python)

I have a dataframe with the following schema:

root
 |-- distanceValue: integer (nullable = true)
 |-- timeOfMeasurement: timestamp (nullable = true)
 |-- EventProcessedUtcTime: timestamp (nullable = true)
 |-- latency: interval (nullable = true)

And the dataframe looks something like this:

distance |timeOfMeasurement           |EventProcessedUtcTime       |latency
---------+----------------------------+----------------------------+---------------------------------
15       |2021-01-04T07:07:45.098+0000|2021-01-04T07:07:45.676+0000|{"months": 0, "days": 0, "microseconds": 578885}
26       |2021-01-04T07:07:46.098+0000|2021-01-04T07:07:46.301+0000|{"months": 0, "days": 0, "microseconds": 203909}
23       |2021-01-04T07:07:47.113+0000|2021-01-04T07:07:47.353+0000|{"months": 0, "days": 0, "microseconds": 240287}

When trying to compare the distance with the distance from the previous row:

import pandas as pd
df['same'] = df.distance.eq(df.distance.shift()) 
    
# OR

import numpy as np
df['same'] = np.where(df.distance == df.distance.shift(), True, False)

I get an error saying: Could not parse datatype: interval

The distance, however, is an integer... Is the value getting mixed up with the latency, which is an interval? Thank you for any help.



1 Answer

0 votes
by (71.8m points)

Your df is a Spark DataFrame, not a pandas one, so pandas operations won't work on it directly (and the "Could not parse datatype: interval" error suggests the conversion to pandas chokes on the interval column, not on distance). You can instead use the lag window function in Spark:

from pyspark.sql import functions as F, Window

df = df.withColumn(
    'same',
    F.col('distance') == F.lag('distance').over(Window.orderBy('timeOfMeasurement'))
)
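Note that lag returns null for the first row, so same will be null there; wrap the comparison in F.coalesce(..., F.lit(False)) if you want a plain boolean. If you really need pandas, drop the interval column first (e.g. df.drop('latency').toPandas()) so the conversion succeeds; the row-to-row comparison itself works fine in plain pandas. A minimal sketch using the sample distances from the question (no Spark needed, and the trailing duplicate 23 is added just to show a True match):

```python
import pandas as pd

# Sample distances from the question, with one repeated value appended
pdf = pd.DataFrame({'distance': [15, 26, 23, 23]})

# shift() moves each value down one row; the first row compares
# against NaN and therefore yields False
pdf['same'] = pdf['distance'].eq(pdf['distance'].shift())
# → same is [False, False, False, True]
```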
