python - Split word containing unicode character

- April 15, 2013

I am working on an NLP project involving emojis in tweets.

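An example of a tweet is given here:
"sometimes i wish i wa an octopus so i could slap 8 people at once🐙"

My problem is that once🐙 is considered one word. I would like to split it into two separate words, so that the tweet looks like this:
"sometimes i wish i wa an octopus so i could slap 8 people at once 🐙"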

Note that I already have a compiled regexp containing each of the emojis!

I am looking for an efficient way of doing this, since I have hundreds of thousands of tweets, but I can't figure out where to start.

Thank you.

Can't you just do this:

>>> import re
>>> s = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'i', 'wish', 'i', 'wa', 'an', 'octopus', 'so', 'i', 'could', 'slap', '8', 'people', 'at', 'once', '🐙']

If you need them as a single space-delimited string again, just join them:

>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes i wish i wa an octopus so i could slap 8 people at once 🐙'

Edit: fixed.
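Since the question mentions an already compiled emoji regexp, another option is to use that pattern directly with a substitution, which leaves punctuation attached to words and only separates the emojis. The following is a minimal sketch, assuming Python 3: `EMOJI_RE` is a hypothetical stand-in for the asker's pattern (the character ranges are illustrative, not a complete emoji list), and `split_emojis` is a helper name made up for this example.

```python
import re

# Hypothetical stand-in for the asker's compiled emoji regexp.
# The ranges below are an assumption and do not cover every emoji.
EMOJI_RE = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")


def split_emojis(tweet):
    """Return the tweet with every matched emoji as its own token."""
    # Put a space on both sides of each emoji...
    spaced = EMOJI_RE.sub(r" \1 ", tweet)
    # ...then collapse the extra whitespace that this may introduce.
    return " ".join(spaced.split())


tweets = [
    "sometimes i wish i wa an octopus so i could slap 8 people at once🐙",
    "no emoji here",
]
cleaned = [split_emojis(t) for t in tweets]
print(cleaned[0])
```

The pattern is compiled once and reused for every tweet, so the per-tweet cost is a single substitution plus a split and join.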
