I'm using Ruby 1.9.2.
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$")
the elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no idea which encoding to use when reading the file.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error:
ArgumentError: invalid byte sequence in UTF-8:
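This error is consistent with the file not being UTF-8 at all: in UTF-8, the byte 0xE9 opens a three-byte sequence and must be followed by two continuation bytes, so 0xE9 followed by "c" is rejected. A minimal sketch of that check, using a hand-written sample string rather than the real file, and assuming the file may instead be ISO-8859-1 (where 0xE9 on its own is "é"):

```ruby
# Sample bytes as they appear in the question (hand-written, not read from csvfile.csv).
sample = "sp\xE9cifi\xE9".dup

# Relabel the same bytes as UTF-8: 0xE9 is not followed by continuation bytes.
as_utf8 = sample.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?

# Relabel the same bytes as ISO-8859-1 (an assumption about the file's real encoding).
as_latin1 = sample.dup.force_encoding("ISO-8859-1")
puts as_latin1.valid_encoding?
```

The first check reports the bytes as invalid and the second as valid. That does not prove the file is ISO-8859-1 (every byte sequence is valid in that encoding), but it does rule out UTF-8.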
If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this: "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this:
"Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
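One way around this, sketched here under the assumption that the bytes are really ISO-8859-1 (Windows-1252 would be another candidate), is to first tell Ruby what the bytes are with force_encoding, which only relabels the string, and then transcode with encode:

```ruby
# Bytes as CSV returned them: tagged ASCII-8BIT (binary), 0xE9 standing for "e acute".
raw = "Non sp\xE9cifi\xE9".dup.force_encoding("ASCII-8BIT")

# ASSUMPTION: the source file was written as ISO-8859-1 (Latin-1).
# force_encoding changes only the label; encode rewrites the bytes as UTF-8.
utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8
puts utf8.encoding
puts utf8.valid_encoding?
```

If the assumed source encoding is wrong, this still runs without an error but produces the wrong characters, so it is worth checking a few known words after converting.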
Questions:
- Can I get CSV to read my file in the appropriate encoding? If so, how?
- How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
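For the first question, a possible approach is to give CSV.read both the external and the internal encoding as "external:internal", so the fields arrive already transcoded. The sketch below assumes the file is ISO-8859-1, and it writes a small hypothetical sample file to a temporary directory in place of the real csvfile.csv:

```ruby
require "csv"
require "tmpdir"

# Hypothetical stand-in for csvfile.csv: "$"-separated, written as raw ISO-8859-1 bytes.
path = File.join(Dir.tmpdir, "csvfile_latin1_sample.csv")
File.open(path, "wb") { |f| f.write("id$label\n1$Non sp\xE9cifi\xE9\n") }

# ASSUMPTION: the on-disk encoding is ISO-8859-1.
# "ISO-8859-1:UTF-8" reads ISO-8859-1 from disk and hands back UTF-8 Strings.
file_contents = CSV.read(path, col_sep: "$", encoding: "ISO-8859-1:UTF-8")

label = file_contents[1][1]
puts label
puts label.encoding
```

With the fields already in UTF-8, they can be passed on to MySQL as they are, provided the connection and the table are configured for UTF-8 as well.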