Python - Split word containing unicode character
I am working on an NLP project involving emojis in tweets.
An example of a tweet is given here:
"sometimes wish wa octopus slap 8 people @ once๐"
My problem is that once๐ is considered as one word,
and I would like to split this unique word in two so that the tweet looks like this:
"sometimes wish wa octopus slap 8 people @ once ๐"
Note that I have a compiled regexp containing each of the emojis!
I am looking for an efficient way of doing this, since I have hundreds of thousands of tweets, but I can't figure out where to start.
Thank you
Can't you just do this:
>>> import re
>>> s = "sometimes wish wa octopus slap 8 people @ once๐"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'wish', 'wa', 'octopus', 'slap', '8', 'people', '@', 'once', '๐']
If you need them as a single space-delimited string again, join them:
>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes wish wa octopus slap 8 people @ once ๐'
Edit: fixed.
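Since the question mentions an already compiled regexp of emojis, another option might be to let that pattern do the splitting directly. The following is only a rough sketch for Python 3: `EMOJI_RE` and `split_emojis` are names made up here, and the two code points in the pattern are placeholders standing in for the real emoji list.

```python
import re

# Hypothetical stand-in for the asker's compiled emoji regexp.
# The two code points below are illustrative placeholders only.
EMOJI_RE = re.compile("([\U0001F602\U0001F60D])")


def split_emojis(tweet):
    """Surround each emoji match with spaces, then collapse extra whitespace."""
    spaced = EMOJI_RE.sub(r" \1 ", tweet)
    # str.split() with no argument drops the doubled and trailing spaces.
    return " ".join(spaced.split())


# A list comprehension keeps this cheap enough to run over many tweets.
tweets = ["slap 8 people @ once\U0001F602", "no emoji here"]
cleaned = [split_emojis(t) for t in tweets]
```

Unlike the `\w` based pattern, this only touches characters that the emoji pattern matches, so a word such as "can't" stays in one piece.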