I have a CSV file that is structured this way:
Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"
I have two problems in reading this file.
- I want to ignore the header and the blank row.
- The commas inside the quoted values are not separators.
Here is what I tried:
df = (sc.textFile("myFile.csv")
      .map(lambda line: line.split(","))      # split on every comma
      .filter(lambda line: len(line) == 2)    # meant to drop the header and the blank row
      .collect())
However, this did not work: the commas inside the quoted values were treated as separators, so len(line)
returned 4 instead of 2 for the data rows.
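To illustrate the difference, here is a minimal sketch in plain Python (no Spark) using one data row from the file above. It compares a naive split with the standard library's csv module, which treats commas inside double quotes as part of the value. The variable names are only illustrative.

```python
import csv

line = '"1,200","1,456"'

# Naive split: every comma is a separator, including those inside quotes.
naive = line.split(",")

# csv.reader honours the double quotes, so each quoted value stays whole.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The same csv.reader call could presumably be used inside the map step in place of line.split(","), though I have not verified that on a cluster.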
I tried an alternate approach:
data = sc.textFile("myFile.csv")
headers = data.take(2) #First two rows to be skipped
The idea was to then use filter to exclude those header rows. But when I printed headers, I got what looks like encoded values:
[x00Ax00Yx00 x00Jx00ux00lx00yx00 x002x000x001x006x00]
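The x00 between every character looks like a NUL byte, which would suggest the file is stored as UTF-16 rather than UTF-8. That is only a guess from the output. A small sketch in plain Python of what UTF-16 bytes look like and how decoding restores the text (the sample string is read off the output above):

```python
# Sample text reconstructed from the printed header; assumed, not confirmed.
text = "AY July 2016"

# Encoding as UTF-16 adds a byte-order mark and uses two bytes per character,
# so ASCII characters end up padded with NUL (0x00) bytes.
raw = text.encode("utf-16")
print(raw)

# Decoding with the matching codec recovers the original string.
decoded = raw.decode("utf-16")
print(decoded)
```

If that guess is right, the file would need to be decoded as UTF-16 (or converted to UTF-8) before splitting it into lines and fields.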
What is the correct way to read a CSV file and skip the first two rows?
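For reference, this is the behaviour I am after, sketched in plain Python on an in-memory copy of the file. The helper name read_skipping is made up for the example; it skips a fixed number of leading lines by position and parses the rest with the csv module.

```python
import csv
import io
from itertools import islice

# In-memory stand-in for myFile.csv: header, blank row, column names, data.
sample = 'Header\n\n"Col1","Col2"\n"1,200","1,456"\n"2,000","3,450"\n'

def read_skipping(text, skip=2):
    """Drop the first `skip` lines, then parse the remainder as CSV."""
    lines = islice(io.StringIO(text), skip, None)
    return list(csv.reader(lines))

rows = read_skipping(sample)
for row in rows:
    print(row)
```

On an RDD, something like zipWithIndex() followed by a filter on the index might give the same positional skip, but I am not sure that is the recommended way.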