string - Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8

Question


asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I'm using Ruby 1.9.2.

I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.

When I read the lines from the CSV file,

file_contents = CSV.read("csvfile.csv", col_sep: "$")

The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then not saved properly into my MySQL database.

Yehuda Katz says that ASCII-8BIT is really "binary" data, meaning that CSV has no way of knowing which encoding the file actually uses.

So, if I try to make CSV force the encoding like this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")

I get the following error:

ArgumentError: invalid byte sequence in UTF-8:

If I go back to my original ASCII-8BIT encoded Strings and examine the String that CSV read as ASCII-8BIT, it looks like "Non sp\xE9cifi\xE9" instead of "Non spécifié".

I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing "Non sp\xE9cifi\xE9".encode("UTF-8")

because I get this error:

Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,

which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
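Both symptoms can be reproduced without the CSV file. This is only a minimal sketch: the sample string is typed in by hand and assumed to match the bytes in the file, and the variable names are made up for illustration.

```ruby
# Bytes as CSV hands them back: 0xE9 is "é" in ISO-8859-1, tagged here as binary.
raw = "Non sp\xE9cifi\xE9".dup.force_encoding('ASCII-8BIT')

# Relabelling the bytes as UTF-8 does not help: a lone 0xE9 is not valid UTF-8,
# which is why reading the file with encoding: "UTF-8" fails.
utf8_ok = raw.dup.force_encoding('UTF-8').valid_encoding?

# Transcoding from binary fails too, because ASCII-8BIT has no character
# mapping for bytes above 0x7F.
error = begin
  raw.encode('UTF-8')
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

puts utf8_ok       # false
puts error.class   # Encoding::UndefinedConversionError
```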

Questions:

Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?


1 Answer

深蓝 · Answer 1 · 2021-10-17T02:45:54+0000

deceze is right: that is ISO-8859-1 (a.k.a. Latin-1) encoded text. Try this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
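If you want UTF-8 strings straight out of CSV, the encoding option should also accept an "external:internal" pair. Here is a rough sketch of that, assuming your file really is Latin-1; the temp file and its contents are invented so the example runs on its own.

```ruby
require 'csv'
require 'tempfile'

# Write a tiny Latin-1 sample file (hypothetical data standing in for csvfile.csv).
sample = Tempfile.new(['latin1_sample', '.csv'])
sample.binmode
sample.write("id$label\n1$Non sp\xE9cifi\xE9\n".dup.force_encoding('ASCII-8BIT'))
sample.close

# "ISO-8859-1:UTF-8" reads the file as ISO-8859-1 and transcodes each field to UTF-8.
rows = CSV.read(sample.path, col_sep: "$", encoding: "ISO-8859-1:UTF-8")

label = rows[1][1]
puts label.encoding   # UTF-8
puts label            # Non spécifié
```

With this form the fields are already UTF-8 when they reach your MySQL code, so no per-string fix-up should be needed.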

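And if that doesn't work, you can use Iconv (part of the standard library in Ruby 1.9, removed from it in Ruby 2.0 and since then available as the iconv gem) to fix up the individual strings with something like this: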

require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:

utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)

With newer Rubies, where Iconv is no longer bundled, you can use String#force_encoding and String#encode instead:

utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')

where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
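To apply that to everything CSV.read returned, something along these lines should do. It is a sketch, not a drop-in: latin1_to_utf8 is a made-up helper name, the rows are sample data, and it assumes every cell really holds Latin-1 bytes (running it on strings that are already correct UTF-8 would double-encode any accented characters).

```ruby
# Hypothetical helper: relabel the bytes as ISO-8859-1, then transcode to UTF-8.
# dup keeps the caller's string untouched, since force_encoding mutates its receiver.
def latin1_to_utf8(str)
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Stand-in for what CSV.read returned: bytes tagged ASCII-8BIT, plus an empty cell.
binary = "Non sp\xE9cifi\xE9".dup.force_encoding('ASCII-8BIT')
rows = [["1", binary], ["2", nil]]

# Convert every cell, leaving nil (empty CSV fields) alone.
utf8_rows = rows.map do |row|
  row.map { |cell| cell && latin1_to_utf8(cell) }
end

puts utf8_rows[0][1]            # Non spécifié
puts utf8_rows[0][1].encoding   # UTF-8
```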
