R - Grouping words that are similar

- September 15, 2010

companyname <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa', 'gm', 'general motors', 'the dow chemical company', 'dow')

I want either:

    companyname2
    kraft
    kraft
    kraft
    nestle
    nestle
    general motors
    general motors
    dow
    dow

but I would be absolutely fine with:

    companyname2
    1
    1
    1
    2
    2
    3
    3

I see algorithms for getting the distance between two words, so if I had just one weird name I could compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to sort them all into groups.

I do not know anything about Elasticsearch. Would one of the functions in the elastic package, or some other function, help me out here?

I'm sorry there's no programming here, I know. This is way out of my normal area of expertise.

Solution: use string distance

You're on the right track. Here is some R code to get you started:

    install.packages("stringdist")  # install the package (only needed once)
    library("stringdist")

    companyname <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa', 'gm', 'general motors', 'the dow chemical company', 'dow')
    companyname = tolower(companyname)  # otherwise case matters

    # calculate the string distance matrix; LCS is just one option
    ?"stringdist-metrics"  # see the others
    sdm = stringdistmatrix(companyname, companyname, useNames=TRUE, method="lcs")

Let's take a look. These are the calculated distances between the strings, using the longest common subsequence (LCS) metric; try others too, e.g. cosine or Levenshtein. In essence, they all measure how many characters the strings have in common, and a smaller distance means more similar strings. Their pros and cons go beyond this Q&A. You might look for a measure that gives a higher similarity value to two strings that contain the exact same substring (like dow).

    sdm[1:5,1:5]
                kraft kraft foods kfraft nestle nestle usa
    kraft           0           6      1      9         13
    kraft foods     6           0      7     15         15
    kfraft          1           7      0     10         14
    nestle          9          15     10      0          4
    nestle usa     13          15     14      4          0
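To make the numbers in this matrix less opaque, here is a minimal base-R sketch of an LCS distance. It assumes the "lcs" method counts the characters left unmatched once the longest common subsequence is removed from both strings, which is consistent with the values shown above. The helper name `lcs_dist` is hypothetical and not part of stringdist.

```r
# Hypothetical helper (not from stringdist): LCS distance via dynamic programming.
lcs_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # len[i + 1, j + 1] = LCS length of the first i chars of a and first j chars of b
  len <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      len[i + 1, j + 1] <- if (x[i] == y[j]) {
        len[i, j] + 1L
      } else {
        max(len[i, j + 1], len[i + 1, j])
      }
    }
  }
  # Distance = characters in either string that are not part of the LCS
  length(x) + length(y) - 2L * len[length(x) + 1, length(y) + 1]
}

lcs_dist("kraft", "kfraft")       # 1, matching the matrix
lcs_dist("kraft", "kraft foods")  # 6
lcs_dist("nestle", "nestle usa")  # 4
```

This also shows why an extra word is costly under this metric: every character of " foods" counts as unmatched, even though "kraft" is contained in full.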

Some visualization

    # hierarchical clustering
    sdm_dist = as.dist(sdm)  # convert to a dist object (you basically already have the distances calculated)
    plot(hclust(sdm_dist))
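The dendrogram is only a picture. To get the numeric group labels the question asks for, one option is to cut the tree with `cutree()` from base R. This is a sketch rather than a tuned solution: `k = 4` is an assumption based on the four companies in the toy data, and with thousands of names a height cutoff (`h = ...`) may be easier to choose than a fixed number of groups.

```r
library("stringdist")

companyname <- c("kraft", "kraft foods", "kfraft", "nestle", "nestle usa",
                 "gm", "general motors", "the dow chemical company", "dow")

# Same LCS distance matrix as in the answer
sdm <- stringdistmatrix(companyname, companyname, useNames = TRUE, method = "lcs")
hc <- hclust(as.dist(sdm))

# Assumption: k = 4, one group per company in this toy example
groups <- cutree(hc, k = 4)

# One row per name with its group number
data.frame(companyname, companyname2 = groups, row.names = NULL)
```

Abbreviations such as gm share very few characters with general motors, so a purely character-based distance may not put them in the same group; cases like that may need a manual lookup table.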

If you want to group them explicitly into k groups, use k-medoids.

    library("cluster")
    clusplot(pam(sdm_dist, 5), color=TRUE, shade=FALSE, labels=2, lines=0)
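Besides the plot, the object returned by `pam()` carries the labels themselves, which maps onto both outputs the question asks for. The sketch below assumes `k = 4` (an illustrative choice for this toy data) and uses each cluster's medoid, that is, its most central name, as the representative name for the group.

```r
library("stringdist")
library("cluster")

companyname <- c("kraft", "kraft foods", "kfraft", "nestle", "nestle usa",
                 "gm", "general motors", "the dow chemical company", "dow")

sdm <- stringdistmatrix(companyname, companyname, useNames = TRUE, method = "lcs")

# Assumption: k = 4 groups; in practice k has to be chosen or tuned
fit <- pam(as.dist(sdm), k = 4)

group_id  <- fit$clustering                     # integer group label per name
canonical <- companyname[fit$id.med][group_id]  # medoid name of each name's group

data.frame(companyname, group_id, canonical, row.names = NULL)
```

The medoid is always one of the original names, so `canonical` never contains an invented string, but it is not guaranteed to be the spelling you would pick by hand.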
