R - Grouping words that are similar
companyname <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa', 'gm', 'general motors', 'the dow chemical company', 'dow')
I want either:
companyname2 kraft kraft kraft nestle nestle general motors general motors dow dow
but I'm absolutely fine with:
companyname2 1 1 1 2 2 3 3
I see algorithms for getting the distance between two words, so if I had just one weird name I could compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about Elasticsearch. Would one of the functions in the elastic package, or some other function, help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist")  # install this package
library("stringdist")
companyname <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa', 'gm', 'general motors', 'the dow chemical company', 'dow')
companyname = tolower(companyname)  # otherwise case matters
# calculate the string distance matrix; LCS is just one option
?"stringdist-metrics"  # see the others
sdm = stringdistmatrix(companyname, companyname, useNames=TRUE, method="lcs")
Let's take a look. These are the calculated distances between the strings, using the longest common subsequence (LCS) metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look for something that gives a higher similarity value to two strings that contain the exact same substring (like dow).
sdm[1:5,1:5]
            kraft kraft foods kfraft nestle nestle usa
kraft           0           6      1      9         13
kraft foods     6           0      7     15         15
kfraft          1           7      0     10         14
nestle          9          15     10      0          4
nestle usa     13          15     14      4          0
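To see where the numbers in the matrix above come from, here is a minimal base R sketch of the LCS distance: the count of characters left over in both strings once their longest common subsequence has been matched. The function name `lcs_distance` is made up for illustration and is not part of stringdist; in practice you would keep using `stringdistmatrix`.

```r
# Hypothetical helper (not from stringdist): LCS distance via dynamic programming.
lcs_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # len[i + 1, j + 1] holds the LCS length of the first i characters of a
  # and the first j characters of b; row 1 and column 1 stand for "empty".
  len <- matrix(0L, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
      } else {
        len[i + 1, j + 1] <- max(len[i, j + 1], len[i + 1, j])
      }
    }
  }
  # Every character outside the common subsequence counts as one unit of distance.
  n + m - 2L * len[n + 1, m + 1]
}

lcs_distance("kraft", "kfraft")       # 1
lcs_distance("kraft", "kraft foods")  # 6
lcs_distance("nestle", "nestle usa")  # 4
```

The three results agree with the corresponding cells of the matrix. They also show why a short abbreviation and its long form end up far apart: the extra characters of the long name all count against the pair.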
Some visualization:
# hierarchical clustering
sdm_dist = as.dist(sdm)  # convert to a dist object (you basically already have the distances calculated)
plot(hclust(sdm_dist))
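The dendrogram is only a picture. If you want the group numbers asked for in the question, one option is to cut the tree with `cutree` from base R. This is a sketch that rebuilds the same LCS distance matrix; `k = 4` is my assumption about how many companies are in the sample (kraft, nestle, gm, dow), and `group_id` is just an illustrative name.

```r
library(stringdist)

companyname <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa',
                 'gm', 'general motors', 'the dow chemical company', 'dow')

# Same LCS distance matrix as before, turned into a dist object for hclust
sdm <- stringdistmatrix(companyname, companyname, useNames = TRUE, method = "lcs")
hc <- hclust(as.dist(sdm))

# Cut the tree into k groups; k = 4 is an assumption, not something the data tells you
group_id <- cutree(hc, k = 4)

# One row per name with its group number (the "companyname2" column from the question)
data.frame(companyname, companyname2 = group_id, row.names = NULL)
```

Do check the result by eye. Under LCS, 'gm' is at distance 5 from 'dow' but at distance 12 from 'general motors', so short abbreviations will tend to group with each other rather than with their long forms. With thousands of names you would probably pick the cut with `h =` (a distance threshold) instead of a fixed `k`.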
If you want to group explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=FALSE, labels=2, lines=0)
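Beyond the plot, the `pam` object also carries the group assignment itself, which can be turned into the first output format from the question (each name replaced by a representative name). A minimal sketch, reusing `k = 5` from the call above; the variable names `fit` and `representative` are only for illustration.

```r
library(stringdist)
library(cluster)

companyname <- c('kraft', 'kraft foods', 'kfraft', 'nestle', 'nestle usa',
                 'gm', 'general motors', 'the dow chemical company', 'dow')

sdm <- stringdistmatrix(companyname, companyname, useNames = TRUE, method = "lcs")
sdm_dist <- as.dist(sdm)

# k-medoids on the distance object; k = 5 mirrors the clusplot call and is a guess
fit <- pam(sdm_dist, k = 5)

fit$clustering  # integer group number for every name
fit$id.med      # index of the medoid (most central name) of every group

# Replace each name by the medoid of its group
representative <- companyname[fit$id.med][fit$clustering]
data.frame(companyname, companyname2 = representative)
```

The medoid is always one of the original names, so this gives you a label per group without having to invent one. Whether it is the label you would have picked (say 'kraft' rather than 'kfraft') depends on the distances, so treat it as a starting point.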