sql server - Unicode to Non-Unicode conversion

Question

Welcome To Ask or Share your Answers For Others

sql server - Unicode to Non-Unicode conversion

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

sql server - Unicode to Non-Unicode conversion

I have some Unicode characters in an NVarchar field named "PostalCode". When I convert them to Varchar, there is a ? in the result.

My code is:

select PostalCode, cast((PostalCode) as varchar)) as val from  table

and the result is:

PostalCode       |   val
053000           | 053000?

Here I am getting a ? in the result. Is there any way to remove such special characters?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T20:07:09+0000

There are a few things to note here:

If you want to see exactly which character is there, you can convert the value to VARBINARY which will give you the hex / binary value of all characters in the string and there is no concept of "hidden" characters in hex:
```
DECLARE @PostalCode NVARCHAR(20);
SET @PostalCode = N'053000'+ NCHAR(0x2008); -- 0x2008 = "Punctuation Space"
SELECT @PostalCode AS [NVarCharValue],
       CONVERT(VARCHAR(20), @PostalCode) AS [VarCharValue],
       CONVERT(VARCHAR(20), RTRIM(@PostalCode)) AS [RTrimmedVarCharValue],
       CONVERT(VARBINARY(20), @PostalCode) AS [VarBinaryValue];
```
Returns:
```
NVarCharValue   VarCharValue   RTrimmedVarCharValue   VarBinaryValue
053000          053000?        053000?                0x3000350033003000300030000820
```
NVARCHAR data is stored as UTF-16 which works in 2-byte sets. Looking at the last 4 hex digits to see what the hidden 2-byte set is, we see "0820". Since Windows and SQL Server are UTF-16 Little Endian (i.e. UTF-16LE), the bytes are in reverse order. Flipping the final 2 bytes -- 08 and 20 -- we get "2008", which is the "Punctuation Space" that we added via NCHAR(0x2008).

Also, please note that RTRIM did not help at all here.
Simplistically, you can just replace the question marks with nothing:
```
SELECT REPLACE(CONVERT(VARCHAR(20), [PostalCode]), '?', '');
```
More importantly, you should convert the [PostalCode] field to VARCHAR so that it doesn't store these characters. No country uses letters that are not represented in the ASCII character set and that are not valid for the VARCHAR datatype, at least as far as I have ever read about (see bottom section for references). In fact, what is allowed is a rather small subset of ASCII, which means you can easily filter on the way in (or just do the same REPLACE as shown above when inserting or updating):
```
ALTER TABLE [table] ALTER COLUMN [PostalCode] VARCHAR(20) [NOT]? NULL;
```
Be sure to check the current NULL / NOT NULL setting for the column and make it the same in the ALTER statement above, else it could be changed as the default is NULL if not specified.
If you cannot change the schema of the table and need to do a periodic "cleansing" of the bad data, you can run the following:
```
;WITH cte AS
(
   SELECT *
   FROM   TableName
   WHERE  [PostalCode] <>
                  CONVERT(NVARCHAR(50), CONVERT(VARCHAR(50), [PostalCode]))
)
UPDATE cte
SET    cte.[PostalCode] = REPLACE(CONVERT(VARCHAR(50), [PostalCode]), '?', '');
```
Please keep in mind that the above query is not meant to work efficiently if the table has millions of rows. At that point it would need to be handled in smaller sets via a loop.

For reference, here is the wikipedia article for Postal code, which currently states that the only characters ever used are:

The arabic numerals "0" to "9"

Letters of the ISO basic Latin alphabet

Spaces, hyphens

And regarding the max size of the field, here is the Wikipedia List of postal codes

Categories

sql server - Unicode to Non-Unicode conversion

sql server - Unicode to Non-Unicode conversion

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags