unicode - How to programmatically identify the character set of a file?

- August 15, 2010

This question already has an answer here:

How to detect the character encoding of a text file? (8 answers)

From a detailed perspective, how does one identify the character set of a file? The information I found involved checking the magic number of the file, but other articles I found strayed away from this.

I have tried opening different files encoded in different character sets (ASCII and UTF-8, for example) with hexdump, and there is no file identifier indicating which character set the file is in.

It is practically impossible to identify arbitrary character sets just by looking at a raw byte dump. Some character sets show typical patterns that can be identified, but that still doesn't make for a clear match. The best you can typically do is guess by exclusion, starting with the character sets that have strict rules. If the file is not valid in UTF-8, try Shift-JIS, Big-5, etc.

The problem is that any file is valid in Latin-1 and other single-byte encodings. That's what makes it fundamentally impossible: you virtually cannot distinguish one single-byte charset from another single-byte charset. In the end you'd have to employ text analysis to determine whether the decoded piece of text appears to make sense, or whether it looks like gibberish and hence the encoding is incorrect.
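As a rough illustration of guessing by exclusion, here is a minimal Python sketch. The function name `guess_encodings` and the candidate list are assumptions made for this example, not part of any library; the only real API used is `bytes.decode` with Python's built-in codec names.

```python
# Hypothetical candidate list (an assumption for this sketch):
# stricter encodings first, the always-valid Latin-1 last.
CANDIDATES = ["ascii", "utf-8", "shift_jis", "big5", "latin-1"]


def guess_encodings(data: bytes, candidates=CANDIDATES):
    """Return every candidate encoding in which `data` decodes without error.

    This only excludes encodings; it cannot prove which one is correct.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # byte sequence is invalid here, so rule this one out
        survivors.append(name)
    return survivors


if __name__ == "__main__":
    sample = "héllo".encode("utf-8")
    # More than one name may be printed: surviving is not the same as being right.
    print(guess_encodings(sample))
```

Note that `latin-1` survives for every possible input, which is exactly why this approach can narrow the field but never settle it. Telling the survivors apart would still take some analysis of whether the decoded text makes sense.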

In short: there's no foolproof way to detect character sets, period. You should really have metadata that specifies the charset.
