python - UnicodeDecodeError in TextBlob tutorial
I'm trying to run through the TextBlob tutorial on Windows (using the Git Bash shell) with Python 3.3.
I've installed textblob and nltk, as well as their dependencies.
The Python code is:
from text.blob import TextBlob
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
tags = wiki.tags
I'm getting the following error:
Traceback (most recent call last):
  File "textblob.py", line 4, in <module>
    tags = wiki.tags
  File "c:\python33\lib\site-packages\text\decorators.py", line 18, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "c:\python33\lib\site-packages\text\blob.py", line 357, in pos_tags
    for word, t in self.pos_tagger.tag(self.raw)
  File "c:\python33\lib\site-packages\text\taggers.py", line 40, in tag
    return pattern_tag(sentence, tokenize)
  File "c:\python33\lib\site-packages\text\en.py", line 115, in tag
    for sentence in parse(s, tokenize, True, False, False, False, encoding).split():
  File "c:\python33\lib\site-packages\text\en.py", line 99, in parse
    return parser.parse(unicode(s), *args, **kwargs)
  File "c:\python33\lib\site-packages\text\text.py", line 1213, in parse
    s[i] = self.find_tags(s[i], **kwargs)
  File "c:\python33\lib\site-packages\text\en.py", line 49, in find_tags
    return _parser.find_tags(self, tokens, **kwargs)
  File "c:\python33\lib\site-packages\text\text.py", line 1161, in find_tags
    map = kwargs.get( "map", None))
  File "c:\python33\lib\site-packages\text\text.py", line 967, in find_tags
    tagged.append([token, lexicon.get(token, i==0 and lexicon.get(token.lower()) or None)])
  File "c:\python33\lib\site-packages\text\text.py", line 98, in get
    return self._lazy("get", *args)
  File "c:\python33\lib\site-packages\text\text.py", line 79, in _lazy
    self.load()
  File "c:\python33\lib\site-packages\text\text.py", line 367, in load
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
  File "c:\python33\lib\site-packages\text\text.py", line 367, in <genexpr>
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path) if x.strip()))
  File "c:\python33\lib\site-packages\text\text.py", line 346, in _read
    for line in f:
  File "c:\python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>
Any idea what is wrong here? Adding a 'u' prefix before the string didn't help.
Release 0.7.1 fixes this issue, which means it's time to upgrade:
$ pip install -U textblob
The problem was that the en-lexicon.txt file used for part-of-speech tagging was opened using Windows' default platform encoding, cp1252. The file apparently contained characters that Python could not decode with that encoding. This was fixed by explicitly opening the file in UTF-8 mode.
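For anyone who wants to see the failure mode in isolation, here is a minimal sketch that does not touch TextBlob at all. The sample text and temporary file are made up for illustration (this is not the real en-lexicon.txt); the only assumption is that the lexicon contained some UTF-8 character whose encoding includes the byte 0x9d, such as a right double quotation mark.

```python
import os
import tempfile

# Hypothetical lexicon-like line. U+201D (right double quotation mark)
# encodes to b"\xe2\x80\x9d" in UTF-8, and byte 0x9d is undefined in cp1252.
sample = "quote\u201d NN\n"

# Write the line as UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Reading with cp1252 (the Windows default in the traceback) fails.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        print("cp1252 failed:", exc)

    # Naming the encoding explicitly makes the read platform-independent.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()
    print("utf-8 ok:", repr(decoded))
finally:
    os.remove(path)
```

Passing encoding="utf-8" to open() is the general way to avoid depending on the platform default when a file's encoding is known.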