Update
As @ikegami suggested, I reported this as a bug.
Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output
Consider the following C and Perl programs, which both write the UTF-8 encoding of the string "αβγ" to standard output:
C version:
#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}
Output:
C:…> chcp 65001
Active code page: 65001
C:…> cttt.exe
αβγ
Perl version:
C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
?
From what I can tell, the last octet, 0xb3, is being output again on another line, where it is translated to U+FFFD.
Note that redirecting output eliminates this effect.
I can also verify that it is the last octet being repeated:
C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
z
On the other hand, syswrite avoids this problem.
C:…> perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.
I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.
Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:
C:…> chcp 437
Active code page: 437
C:…> cttt.exe
╬▒╬▓╬│
C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
╬▒╬▓╬│
What is causing the last octet to be output twice when printing from a Perl program in cmd.exe with the code page set to 65001?
PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.
PPS: Leaving out the \n results in something even more interesting:
C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
αβγxyzxyz
C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
αβγ?γ?