python - Split word containing unicode character -


I am working on an NLP project involving emojis in tweets.

An example tweet is given here:
"sometimes i wish i wa an octopus so i could slap 8 people at once🐙"

My problem is that once🐙 is considered one word, so I want to split that single word in two so that my tweet looks like this:
"sometimes i wish i wa an octopus so i could slap 8 people at once 🐙"

Note that I already have a compiled regexp containing each of the emojis!

I am looking for an efficient way of doing this, since I have hundreds of thousands of tweets, but I can't figure out where to start.

Thank you.

Can't you just do this:

>>> import re
>>> s = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'i', 'wish', 'i', 'wa', 'an', 'octopus', 'so', 'i', 'could', 'slap', '8', 'people', 'at', 'once', '🐙']

If you need them as a single space-delimited string again, just join them:

>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes i wish i wa an octopus so i could slap 8 people at once 🐙'

edit: fixed.
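
Since the question mentions an already compiled emoji regexp, another option is to reuse it with `sub` and pad each match with spaces. This is only a rough sketch: `EMOJI_RE` and `split_emojis` are hypothetical names, and the character ranges are an assumed stand-in that covers only part of the emoji code points, so swap in your own pattern.

```python
import re

# Hypothetical stand-in for the asker's compiled emoji pattern.
# These ranges are an assumption and do not cover every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def split_emojis(tweet):
    """Surround every emoji with spaces, then collapse repeated whitespace."""
    padded = EMOJI_RE.sub(lambda m: " " + m.group(0) + " ", tweet)
    return " ".join(padded.split())


tweets = ["sometimes i wish i wa an octopus so i could slap 8 people at once\U0001F419"]
cleaned = [split_emojis(t) for t in tweets]
print(cleaned[0])
```

Unlike the `findall` version, this leaves ordinary punctuation attached to its word and separates consecutive emojis from each other, which may or may not be what you want. The pattern is compiled once and applied per tweet, so it should cope with a few hundred thousand tweets.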

