Velvet Star Monitor

Standout celebrity highlights with iconic style.

updates

Is it safe to replace CP850 with UTF-8 encoding

Writer Olivia Zamora

I have an old project reading files with CP850 encoding. But it handles accent characters wrong (e.g., Montréal becomes MontrÚal). I want to replace CP850 with UTF-8. The question is:

Is it safe? In other word, can we assume UTF-8 is a super set and Encoding the same way as CP850 encoding characters?

Thanks

I tried hexdump, below is the sample of my csv file, is it UTF-8?

000000d0 76 20 64 65 20 4d 61 72 6c 6f 77 65 2c 2c 4d 6f |v de Marlowe,,Mo|
000000e0 6e 74 72 c3 a9 61 6c 2c 51 43 2c 48 34 41 20 20 |ntr..al,QC,H4A |
1

1 Answer

If by superset you mean does UTF-8 include all the characters of CP850, then trivially yes, since UTF-8 can encode all valid Unicode code points using a variable-length encoding (1–4 bytes).

If you mean are characters encoded the same way, then as you've seen this is not the case, since é (U+00E9) is encoded as 82 in CP850 and C3 A9 in UTF-8.

I cannot see a character set / code page that encodes Ú as 82, but Ú is encoded as E9 in CP850, which is the ISO-8859-1 representation of é, so it's possible you've got your conversion the wrong way around (i.e. you're converting your file from ISO-8859-1 to CP850, and you want to convert from CP850 to UTF-8).

Here's an example using hd and iconv:

hd test.cp850.txt
00000000 4d 6f 6e 74 72 82 61 6c |Montr.al|
00000008
iconv --from cp850 --to utf8 test.cp850.txt > test.utf8.txt
hd test.utf8.txt
00000000 4d 6f 6e 74 72 c3 a9 61 6c |Montr..al|
00000009
9

Your Answer

Sign up or log in

Sign up using Google Sign up using Facebook Sign up using Email and Password

Post as a guest

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.