unicode - How to programmatically identify the character set of a file?



From a detailed perspective, how does one identify the character set of a file? The only information I found was about checking the magic number of a file, and the other articles I found strayed away from this.

I have tried opening different files encoded in different character sets (ASCII and UTF-8, for example) in hexdump, and there is no identifier in the file that says which character set it uses.
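To make that observation concrete, here is a small Python sketch (the sample strings are made up for illustration). It shows that plain ASCII text produces the same bytes whether it is saved as ASCII or UTF-8, and that encodings only differ in how they represent non-ASCII characters, with no header saying which one was used.

```python
# Sample strings chosen only for illustration.
plain = "hello"
accented = "caf\u00e9"  # "café"

# ASCII-only text: both encodings produce byte-for-byte identical output.
ascii_bytes = plain.encode("ascii")
utf8_bytes = plain.encode("utf-8")
print(ascii_bytes.hex(" "))
print(ascii_bytes == utf8_bytes)

# Non-ASCII text: the bytes differ, but neither result carries a marker
# that names the encoding. There is only the encoded text itself.
latin1_accented = accented.encode("latin-1")
utf8_accented = accented.encode("utf-8")
print(latin1_accented.hex(" "))
print(utf8_accented.hex(" "))
```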

It is practically impossible to identify arbitrary character sets by looking at a raw byte dump. Some character sets show typical patterns that can be identified, but that still doesn't make for a clear match. The best you can typically do is guess by exclusion, starting with the character sets that have strict rules: if the file is not valid UTF-8, try Shift-JIS, Big5, and so on. The problem is that any file is valid in Latin-1 and in many other single-byte encodings, and that is what makes the task fundamentally impossible: it is virtually impossible to distinguish one single-byte charset from another. In the end you would have to employ text analysis to determine whether the decoded text appears to make sense or looks like gibberish, in which case the encoding is incorrect.
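The guess-by-exclusion idea can be sketched in Python roughly as follows. The function name `guess_encoding` and the candidate list are assumptions made for this illustration, not an established API, and the result is only a guess: it reports the first candidate that decodes without error, which does not prove the file was written in that encoding.

```python
# Hypothetical candidate list, ordered from strict to lenient encodings.
CANDIDATES = ["utf-8", "shift_jis", "big5"]


def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # Excluded: the bytes break this encoding's rules.
        return name
    return None


# Latin-1 assigns a character to every byte value, so it can never be
# excluded this way. That is why it is not a useful candidate.
all_bytes = bytes(range(256))
decoded_as_latin1 = all_bytes.decode("latin-1")
```

For example, `guess_encoding("日本語".encode("shift_jis"))` returns `"shift_jis"` because those bytes are not valid UTF-8, while any pure ASCII input is reported as `"utf-8"` simply because that candidate is tried first.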

In short: there's no foolproof way to detect character sets, period. You should always have metadata that specifies the charset.

