Is it safe to replace CP850 with UTF-8 encoding
Olivia Zamora
I have an old project reading files with CP850 encoding. But it handles accent characters wrong (e.g., Montréal becomes MontrÚal). I want to replace CP850 with UTF-8. The question is:
Is it safe? In other word, can we assume UTF-8 is a super set and Encoding the same way as CP850 encoding characters?
Thanks
I tried hexdump, below is the sample of my csv file, is it UTF-8?
000000d0 76 20 64 65 20 4d 61 72 6c 6f 77 65 2c 2c 4d 6f |v de Marlowe,,Mo|
000000e0 6e 74 72 c3 a9 61 6c 2c 51 43 2c 48 34 41 20 20 |ntr..al,QC,H4A | 1 1 Answer
If by superset you mean does UTF-8 include all the characters of CP850, then trivially yes, since UTF-8 can encode all valid Unicode code points using a variable-length encoding (1–4 bytes).
If you mean are characters encoded the same way, then as you've seen this is not the case, since é (U+00E9) is encoded as 82 in CP850 and C3 A9 in UTF-8.
I cannot see a character set / code page that encodes Ú as 82, but Ú is encoded as E9 in CP850, which is the ISO-8859-1 representation of é, so it's possible you've got your conversion the wrong way around (i.e. you're converting your file from ISO-8859-1 to CP850, and you want to convert from CP850 to UTF-8).
Here's an example using hd and iconv:
hd test.cp850.txt
00000000 4d 6f 6e 74 72 82 61 6c |Montr.al|
00000008
iconv --from cp850 --to utf8 test.cp850.txt > test.utf8.txt
hd test.utf8.txt
00000000 4d 6f 6e 74 72 c3 a9 61 6c |Montr..al|
00000009 9