string - sys.argv as bytes in Python 3k


asked Oct 24, 2021 in Technique by 深蓝

Because Python 3k draws a strict distinction between strings and bytes, the command-line arguments in sys.argv are presented as strings. Sometimes it is necessary to treat the arguments as bytes, e.g. when passing a path, which on Unix need not be in any particular character encoding.

Let's see an example. A brief Python 3k program argv.py follows:

import sys

print(sys.argv[1])
print(b'bytes')

When it is executed as python3.1 argv.py français it produces the expected output:

français
b'bytes'

Note that the argument français is in my locale encoding. However, when we pass the argument in a different encoding, we get an error: python3.1 argv.py `echo français|iconv -t latin1`

Traceback (most recent call last):
  File "argv.py", line 3, in <module>
    print(sys.argv[1])
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce7' in position 4: surrogates not allowed

How can we pass binary data to a Python 3k program via command-line arguments? One use case is passing the path to a file that belongs to a user with a different locale.


1 Answer

深蓝 · Answer 1 · 2021-10-23T17:55:57+0000

Note that the error is a UnicodeEncodeError rather than a UnicodeDecodeError. Python preserves the exact bytes passed on the command line by decoding them with the PEP 383 surrogateescape error handler, so the undecodable byte 0xE7 becomes the lone surrogate U+DCE7. A lone surrogate cannot be encoded as UTF-8, so the failure happens only when print tries to write the string to the console.
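The round trip described above can be reproduced without a command line at all. The following is a minimal sketch, assuming the locale encoding is UTF-8 (as in the traceback); the variable names are illustrative only:

```python
# "français" encoded as Latin-1; the byte 0xE7 is not valid UTF-8 here.
raw = b"fran\xe7ais"

# Decoding with surrogateescape maps the undecodable byte 0xE7 to U+DCE7.
text = raw.decode("utf-8", "surrogateescape")

# A strict UTF-8 encode (roughly what print does for a UTF-8 console)
# rejects the lone surrogate with UnicodeEncodeError.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same error handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
```

The string itself is intact; only the strict encode step fails, which is why the original bytes remain recoverable.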

The best way to deal with this is to use application-level knowledge of the correct encoding to reinterpret the command-line argument inside the application, as in the following example:

$ python3.2 -c "import os, sys; print(os.fsencode(sys.argv[1]).decode('latin-1'))" `echo français|iconv -t latin1`
français

The os.fsencode function invocation reverses the transformation Python applied automatically when processing the command line arguments. The decode('latin-1') method invocation then performs the correct conversion in order to get a properly decoded string.

Python 3.2 added os.fsencode specifically to make this kind of problem easier to deal with.

For Python 3.1, the equivalent construct for os.fsencode(sys.argv[1]) is sys.argv[1].encode(sys.getfilesystemencoding(), 'surrogateescape').
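The two variants can be wrapped in one small helper. This is only a sketch for POSIX systems: the names arg_to_bytes and reinterpret_arg are hypothetical, not part of the standard library, and the caller is assumed to know the argument's real encoding:

```python
import os
import sys


def arg_to_bytes(arg):
    """Return the original bytes behind a str taken from sys.argv."""
    try:
        # Python 3.2+: reverses the decoding applied to the command line.
        return os.fsencode(arg)
    except AttributeError:
        # Python 3.1 fallback, as described above.
        return arg.encode(sys.getfilesystemencoding(), "surrogateescape")


def reinterpret_arg(arg, encoding):
    """Decode a command-line argument using an application-chosen encoding."""
    return arg_to_bytes(arg).decode(encoding)
```

A script would then call reinterpret_arg(sys.argv[1], 'latin-1') when it knows the argument is Latin-1, or use arg_to_bytes(sys.argv[1]) directly when it only needs the raw bytes, for example to open a file by its byte path.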

Edit Feb 2013: updated for Python 3.2+, and to avoid assuming that Python autodetected "UTF-8" as the command line encoding
