Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

string - Group together factor levels with similar names in R

I have a factor variable q with many levels. Some of the levels actually refer to the same thing but were recorded with inconsistent spelling or capitalization.

> length(q)
[1] 13490
> levels(q)
  [1] ""                          " "                        
  [3] "?"                         "."                        
  [5] "Activelle"                 "CERACETT"                 
  [7] "CERACETTE"                 "CERASETTE"                
  [9] "cerazette"                 "Cerazette"                
 [11] "CERAZETTE"                 "CERAZETTI"                
 [13] "CEVAZETTE"                 "cilest"                   
 [15] "Cilest"                    "Cileste"                  
 [17] "Conludag"                  "COPALETTA?"               
 [19] "DEPO..."                   "Depo-Provera"             
 [21] "Depo. Pro Vera"            "DEPOPROVERA"              
 [23] "DEPO PROVERA"              "depoprovin"               
 [25] "DEPROVERA"                 "DESOLETT"                 
 [27] "desorelle"                 "Diane"                    
 [29] "Diane mite"                "Divana"                   
 [31] "ENDEVINA"                  "Estradot"                 
 [33] "ETHISYLESTRA,LEVONORGESTR" "Evra"                     
 [35] "EXCLUTENA"                 "EXKLUTENA"                
 [37] "EXLUENTA 0,5MG"            "EXLUTENA"                 
 [39] "Femanest"                  "femenest"                 
 [41] "gastonette"                "Harmonet"                 
 [43] "hormon"                    "Hormonspiral"             
 [45] "IMPLANON"                  "INPLANON"                 
 [47] "KOMMER EJ IH\xc5G"        "LEBONOVA"                 
 [49] "LEMINOVA"                  "lemonora"                 
 [51] "LENONOVA"                  "LENOR"                    
 [53] "lenova"                    "Lenova"                   
 [55] "LENOVA"                    "LENOVA?"                  
 [57] "Leonova"                   "Levanova"                 
 [59] "LEVENOVA"                  "LEVINA"                   
 [61] "Levinova"                  "LEVINOVA"                 
 [63] "LEVIONOVA"                 "Levnova"                  
 [65] "levonova"                  "Levonova"                 
 [67] "LEVONOVA"                  "Levonova hormonspiral"    
 [69] "Levonova lykkja"           "Lindinette"               
 [71] "lindynette"                "Lindynette"               
 [73] "loette"                    "lyndynette"               
 [75] "malonetta"                 "Marvelon"                 
 [77] "Meniva"                    "Mercilon"                 
 [79] "Mereilom"                  "merivan"                  
 [81] "Microgyn"                  "microgynon"               
 [83] "Microgynon"                "Mikrogyn"                 
 [85] "Milvane"                   "MINERVA/LEVONORG."        
 [87] "MINI P"                    "MINI-P"                   
 [89] "Mini-pe"                   "mini-pl"                  
 [91] "MINIRA"                    "MINNS EJ"                 
 [93] "minulet"                   "Minulet"                  
 [95] "minulet p-piller"          "MIRANDA"                  
 [97] "Mircne"                    "mirena"                   
 [99] "Mirena"                    "MIRENA"                   
[101] "mirena levonorge"          "MIRENA LEVONORGESTREL"    
[103] "Modina p-piller"           "Mod turner: milv"         
[105] "NEOULETTA"                 "NEOVLETTA"                
[107] "NORLEVO"                   "NOV?"                     
[109] "Novaring"                  "novynette"                
[111] "Novynette"                 "nuva ring"                
[113] "Nuva ring"                 "NUVARING"                 
[115] "?stradiol dlf 2"           "?stradiolgel"             
[117] "P-plaster"                 "PROVERA"                  
[119] "RESTOVAR"                  "spiral"                   
[121] "Spiral"                    "Synfase"                  
[123] "T-GYN"                     "triminetta sando"         
[125] "TRIMORDIOL"                "TRINOVUM"                 
[127] "TRIONETTA 28"              "TRIREGOL"                 
[129] "T-spiral"                  "Vagifem"                  
[131] "VET EJ"                    "yas, bayer"               
[133] "yasmin"                    "Yasmin"                   
[135] "YASMINELL"                 "yasminelle"               
[137] "Yasminelle"                "YAZ"                      
[139] "ZYRONA"   

I would like to group all similar levels together. For example, in this case I want CERAZETTI, CERASETTE, CERACETT and the other spellings of the same name to end up in one group. How can I do that?

EDIT:

> dput(levels(q))
c("", " ", "?", ".", "Activelle", "CERACETT", "CERACETTE", "CERASETTE", 
"cerazette", "Cerazette", "CERAZETTE", "CERAZETTI", "CEVAZETTE", 
"cilest", "Cilest", "Cileste", "Conludag", "COPALETTA?", "DEPO...", 
"Depo-Provera", "Depo. Pro Vera", "DEPOPROVERA", "DEPO PROVERA", 
"depoprovin", "DEPROVERA", "DESOLETT", "desorelle", "Diane", 
"Diane mite", "Divana", "ENDEVINA", "Estradot", "ETHISYLESTRA,LEVONORGESTR", 
"Evra", "EXCLUTENA", "EXKLUTENA", "EXLUENTA 0,5MG", "EXLUTENA", 
"Femanest", "femenest", "gastonette", "Harmonet", "hormon", "Hormonspiral", 
"IMPLANON", "INPLANON", "KOMMER EJ IH\xc5G", "LEBONOVA", "LEMINOVA", 
"lemonora", "LENONOVA", "LENOR", "lenova", "Lenova", "LENOVA", 
"LENOVA?", "Leonova", "Levanova", "LEVENOVA", "LEVINA", "Levinova", 
"LEVINOVA", "LEVIONOVA", "Levnova", "levonova", "Levonova", "LEVONOVA", 
"Levonova hormonspiral", "Levonova lykkja", "Lindinette", "lindynette", 
"Lindynette", "loette", "lyndynette", "malonetta", "Marvelon", 
"Meniva", "Mercilon", "Mereilom", "merivan", "Microgyn", "microgynon", 
"Microgynon", "Mikrogyn", "Milvane", "MINERVA/LEVONORG.", "MINI P", 
"MINI-P", "Mini-pe", "mini-pl", "MINIRA", "MINNS EJ", "minulet", 
"Minulet", "minulet p-piller", "MIRANDA", "Mircne", "mirena", 
"Mirena", "MIRENA", "mirena levonorge", "MIRENA LEVONORGESTREL", 
"Modina p-piller", "Mod turner: milv", "NEOULETTA", "NEOVLETTA", 
"NORLEVO", "NOV?", "Novaring", "novynette", "Novynette", "nuva ring", 
"Nuva ring", "NUVARING", "?stradiol dlf 2", "?stradiolgel", 
"P-plaster", "PROVERA", "RESTOVAR", "spiral", "Spiral", "Synfase", 
"T-GYN", "triminetta sando", "TRIMORDIOL", "TRINOVUM", "TRIONETTA 28", 
"TRIREGOL", "T-spiral", "Vagifem", "VET EJ", "yas, bayer", "yasmin", 
"Yasmin", "YASMINELL", "yasminelle", "Yasminelle", "YAZ", "ZYRONA"
)
> 
1 Answer

You can use the function agrep, which searches for approximate matches. It uses the generalized Levenshtein edit distance, and you can set the maximum distance allowed for a match with the argument max.distance.
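To get a feel for how max.distance behaves, here is a minimal sketch on a handful of spellings taken from the question. The variable names (drugs, strict, loose) are only illustrative. A value below 1 is read as a fraction of the pattern length, while an integer is an absolute number of edits.

```r
# A few spellings from the question, plus one unrelated name
drugs <- c("CERACETTE", "cerazette", "CERACETT", "CEVAZETTE", "CERAZETTI", "Cilest")

# 0.1 is a fraction of the pattern length: 9 characters -> at most 1 edit
strict <- agrep("CERACETTE", drugs, ignore.case = TRUE,
                max.distance = 0.1, value = TRUE)
print(strict)

# An integer is an absolute edit budget: here up to 2 edits are tolerated
loose <- agrep("CERACETTE", drugs, ignore.case = TRUE,
               max.distance = 2L, value = TRUE)
print(loose)
```

With value = TRUE, agrep returns the matching strings instead of their indices, which makes it easier to inspect what a given threshold lets through before using it for grouping.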

Taking this vector (the levels that you posted, minus a few entries such as the empty string "" and "KOMMER EJ IH\xc5G"):

x <- c("Activelle", "CERACETTE", "cerazette", "CERAZETTE", "CEVAZETTE", 
"Cilest", "Conludag", "DEPO...", "Depo. Pro Vera", "DEPO PROVERA", 
"DEPROVERA", "desorelle", "Diane mite", "ENDEVINA", "ETHISYLESTRA,LEVONORGESTR", 
"EXCLUTENA", "EXLUENTA 0,5MG", "Femanest", "gastonette", "hormon", 
"IMPLANON", "LEMINOVA", "LENONOVA", "lenova", "LENOVA", "Leonova", 
"LEVENOVA", "Levinova", "LEVIONOVA", "levonova", "LEVONOVA", 
"Levonova lykkja", "lindynette", "loette", "malonetta", "Meniva", 
"Mereilom", "Microgyn", "Microgynon", "Milvane", "MINI P", "Mini-pe", 
"MINIRA", "minulet", "minulet p-piller", "Mircne", "Mirena", 
"mirena levonorge", "Modina p-piller", "NEOULETTA", "NORLEVO", 
"Novaring", "Novynette", "Nuva ring", "?stradiol dlf 2", "P-plaster", 
"RESTOVAR", "Spiral", "T-GYN", "TRIMORDIOL", "TRIONETTA 28", 
"T-spiral", "VET EJ", "yasmin", "YASMINELL", "Yasminelle", "ZYRONA", 
"CERACETT", "CERASETTE", "Cerazette", "CERAZETTI", "cilest", 
"Cileste", "COPALETTA?", "Depo-Provera", "DEPOPROVERA", "depoprovin", 
"DESOLETT", "Diane", "Divana", "Estradot", "EXKLUTENA", "EXLUTENA", 
"femenest", "Harmonet", "Hormonspiral", "INPLANON", "LEBONOVA", 
"lemonora", "LENOR", "Lenova", "LENOVA?", "Levanova", "LEVINA", 
"LEVINOVA", "Levnova", "Levonova", "Levonova hormonspiral", "Lindinette", 
"Lindynette", "lyndynette", "Marvelon", "Mercilon", "merivan", 
"microgynon", "Mikrogyn", "MINERVA/LEVONORG.", "MINI-P", "mini-pl", 
"MINNS EJ", "Minulet", "MIRANDA", "mirena", "MIRENA", "MIRENA LEVONORGESTREL", 
"Mod turner: milv", "NEOVLETTA", "novynette", "nuva ring", "NUVARING", 
"?stradiolgel", "PROVERA", "spiral", "Synfase", "triminetta sando", 
"TRINOVUM", "TRIREGOL", "Vagifem", "yas, bayer", "Yasmin", "yasminelle")

You can do:

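groups <- list()
i <- 1
while (length(x) > 0) {
  # Use the first remaining element as the pattern; agrep() returns the
  # indices of all elements within the allowed edit distance (case-insensitive).
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]  # store the matched spellings as one group
  x <- x[-id]           # drop them so they are not matched again
  i <- i + 1
}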

The first groups are defined as follows:

head(groups)
[[1]]
[1] "Activelle"

[[2]]
[1] "CERACETTE" "cerazette" "CERAZETTE" "CERACETT"  "CERASETTE" "Cerazette"

[[3]]
[1] "CEVAZETTE"

[[4]]
[1] "Cilest"  "cilest"  "Cileste"

[[5]]
[1] "Conludag"

[[6]]
[1] "DEPO..."
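The output above shows CEVAZETTE in a group of its own, even though it is clearly another spelling of the same name. A rough way to see why is to compute the edit distances from the seed of group 2 with adist. This is only a sketch: adist compares whole strings, whereas agrep matches the pattern against substrings, so the two do not always agree, but for these names the numbers line up with the grouping.

```r
seed <- "CERACETTE"
candidates <- c("cerazette", "CERACETT", "CERASETTE", "CEVAZETTE", "CERAZETTI")

# Case-insensitive edit distance from the seed to each candidate
d <- drop(adist(seed, candidates, ignore.case = TRUE))
names(d) <- candidates
print(d)

# Edit budget implied by max.distance = 0.1 for a 9-character pattern
budget <- ceiling(0.1 * nchar(seed))
print(budget)

# Candidates that exceed the budget are left for a later group
print(names(d)[d > budget])
```

CEVAZETTE and CERAZETTI are each two edits away from CERACETTE, which is more than the single edit allowed here. The result of the loop therefore depends on which spelling happens to come first, and raising max.distance trades missed variants for false merges.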

Be aware that the loop consumes x: each iteration removes the matched elements, so x is empty once the loop has finished. Work on a copy if you still need the original vector afterwards.
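If the original vector is still needed, or the goal is to recode the factor itself, the same greedy loop can be wrapped in a function that works on a copy. The sketch below is one possible way to do that, not the only one: group_similar is a hypothetical helper name, the toy factor q stands in for the real data, and using the first spelling of each group as the merged label is an arbitrary choice.

```r
# Hypothetical helper (not part of base R): same greedy loop, but on a copy.
group_similar <- function(x, max.distance = 0.1) {
  # agrep() needs a non-empty pattern, so empty strings are dropped here
  remaining <- unique(x[nzchar(x)])
  groups <- list()
  while (length(remaining) > 0) {
    # union() ensures the seed itself is always consumed, so the loop terminates
    id <- union(1L, agrep(remaining[1], remaining,
                          ignore.case = TRUE, max.distance = max.distance))
    groups[[length(groups) + 1]] <- remaining[id]
    remaining <- remaining[-id]
  }
  groups
}

# Toy factor built from a few of the levels in the question
vals <- c("CERACETTE", "cerazette", "Cilest", "cilest", "Cileste", "CERACETTE")
q <- factor(vals, levels = c("CERACETTE", "cerazette", "Cilest", "cilest", "Cileste"))

groups <- group_similar(levels(q))

# Lookup table: each original spelling -> first spelling of its group
lookup <- unlist(lapply(groups, function(g) setNames(rep(g[1], length(g)), g)))

# Assigning duplicated labels to levels() merges the corresponding levels
q_clean <- q
levels(q_clean) <- unname(lookup[levels(q_clean)])
print(table(q_clean))
```

Levels that the helper drops (such as the empty string) get no entry in the lookup table, so they would need to be handled separately before the levels are reassigned on real data.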

