command line - How to use Ruby's readlines.grep for UTF-16 files?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Given the following two files created by the following commands:

$ printf "foo\nbar\nbaz\n" | iconv -t UTF-8 > utf-8.txt
$ printf "foo\nbar\nbaz\n" | iconv -t UTF-16 > utf-16.txt
$ file utf-8.txt utf-16.txt
utf-8.txt:  ASCII text
utf-16.txt: Little-endian UTF-16 Unicode text

Using Ruby, I'd like to find lines matching a pattern in the UTF-16 file the same way as in the UTF-8 file.

Here is a working example for the UTF-8 file:

$ ruby -e 'puts File.open("utf-8.txt").readlines.grep(/foo/)'
foo

However, it doesn't work for the UTF-16LE file:

$ ruby -e 'puts File.open("utf-16.txt").readlines.grep(/foo/)'
Traceback (most recent call last):
    3: from -e:1:in `<main>'
    2: from -e:1:in `grep'
    1: from -e:1:in `each'
-e:1:in `===': invalid byte sequence in US-ASCII (ArgumentError)

I've tried to convert the file, based on this post, with:

$ ruby -e 'puts File.open("utf-16.txt", "r").read.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)' 
?tfoo
bar
baz

but it prints some invalid characters (?t) before foo. Also, I don't know how to use the grep method after the conversion (it is reported as an undefined method).

How can I use the readlines.grep() method on a UTF-16 file? Or is there some other simple way? My goal is to print the lines that match a specific regex pattern.

Ideally it should fit in one line, so the command can be used in CI tests.

Here is a real-world scenario:

ruby -e 'if File.readlines("utf-16.log").grep(/[1-9] error/) {exit 1}; end'

but the command doesn't work because the log file is encoded as UTF-16.

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:33:57+0000

While the answer by Viktor is technically correct, recoding the whole file from UTF-16LE into UTF-8 is unnecessary and might hurt performance. All you actually need is to build the regexp in the same encoding as the file:

puts File.open(
  "utf-16.txt", mode: "rb:BOM|UTF-16LE"
).readlines.grep(
  Regexp.new("foo".encode(Encoding::UTF_16LE))
)
#=> foo
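
For comparison, the transcoding approach that this answer argues against can be sketched as follows. This is only an illustration, not Viktor's actual code: the helper name grep_utf16 and the temporary sample file are assumptions made for the example. It opens the file with an external encoding of UTF-16LE (dropping a leading BOM) and an internal encoding of UTF-8, so an ordinary regexp literal can be used at the cost of converting every line.

```ruby
require "tmpdir"

# Hypothetical helper (not from the original post): returns the lines of a
# UTF-16LE file that match `pattern`, transcoded to UTF-8 while reading.
def grep_utf16(path, pattern)
  # "BOM|UTF-16LE" skips a leading byte order mark if one is present;
  # the trailing ":UTF-8" sets the internal encoding, so each line
  # arrives as a UTF-8 string and a plain regexp literal can match it.
  File.open(path, "rb:BOM|UTF-16LE:UTF-8") do |file|
    file.readlines.grep(pattern)
  end
end

# Build a small UTF-16LE sample with a BOM, similar to utf-16.txt above.
sample_path = File.join(Dir.tmpdir, "utf-16-sample.txt")
File.binwrite(sample_path, "\uFEFFfoo\nbar\nbaz\n".encode(Encoding::UTF_16LE))

puts grep_utf16(sample_path, /foo/)
```

For the CI scenario from the question, the same idea might be condensed to something like ruby -e 'exit 1 if File.open("utf-16.log", "rb:BOM|UTF-16LE:UTF-8", &:readlines).grep(/[1-9] error/).any?', which exits with status 1 only when at least one line matches.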
