I am fairly new to R, and I am trying to compare two character columns.
I have a file with two columns and 5000+ rows of species names. The columns are different lengths, and many species names are repeated.
Column 1 (Bee) is an old list of species. Column 2 (scientificName) is the updated list of species. I need to measure the size of each species, so I want the updated list to have every species from both columns, once.
I need more than just “equal” or “not equal”; I need to see what species are missing from each column compared to the other.
Essentially, I need each column to be a list of only unique species (not repeats) and to see if either column has species the other does not (I hope I am articulating that clearly).
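To make the goal concrete, here is a minimal sketch on a made-up data frame. The `toy` object and the species names in it are invented for illustration and are not my real data, and I am assuming the shorter column is padded with `NA`.

```r
# Toy stand-in for my data; the names below are made up for illustration.
toy <- data.frame(
  Bee            = c("Apis mellifera", "Bombus terrestris", "Apis mellifera", NA),
  scientificName = c("Apis mellifera", "Osmia bicornis", "Bombus pascuorum", "Osmia bicornis"),
  stringsAsFactors = FALSE
)

# Unique, non-missing names in each column
old_species <- unique(toy$Bee[!is.na(toy$Bee)])
new_species <- unique(toy$scientificName[!is.na(toy$scientificName)])

# Every species from either column, listed once
all_species <- union(old_species, new_species)
all_species
```

The last vector is the kind of "updated list" I am after: each species once, regardless of which column it came from.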
I initially used the unique() function to get a list of names without repetition. I was then going to compare the lists for each column, but the output is not in a format that is easily transferred manually to a .csv. (It lists 1 “species name” 2 “species name” etc., and I need each species in its own row without numbers or quotation marks.)
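For what it is worth, the numbers and quotation marks seem to come from how R prints a character vector to the console, not from the data itself. Below is a minimal sketch of the file format I am after, assuming base R's `write.csv()`; the species names and the output file name are made up for illustration.

```r
# Made-up names standing in for the result of unique() on one column
species <- unique(c("Apis mellifera", "Bombus terrestris", "Apis mellifera"))

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), "unique_species.csv")

# One species per row; row.names = FALSE drops the 1, 2, ... index,
# quote = FALSE drops the quotation marks
write.csv(data.frame(species = species), out_file,
          row.names = FALSE, quote = FALSE)

# Inspect the raw lines that were written
readLines(out_file)
```

One caveat with `quote = FALSE`: a name that itself contains a comma would be split across columns when the file is read back.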
Most of the code I’ve found is only applicable to numerical data.
I’ve since tried the following:
df %>%
  mutate(comparison = if_else(as.character(df$Bees) == as.character(df$scientificName), "equal", "different"))
I got an error: ‘Comparison’ must be size 5386 or 1, not 0.
---
df$Match <- as.character(df$Bees) == as.character(df$scientificName)
Here the error says:
Error in `$<-`:
! Assigned data `as.character(df$Bees) == as.character(df$scientificName)` must be compatible with existing data.
✖ Existing data has 5386 rows.
✖ Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 0 to size 5386.
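One possible cause of both "size 0" errors, assuming the column really is named `Bee` (as described above) and not `Bees`: `$` with a name that does not exist returns `NULL`, and a comparison involving `NULL` has length zero. A sketch on a made-up data frame (the `toy` object and its contents are invented for illustration):

```r
# Toy data frame with the column spelled "Bee", as in my description
toy <- data.frame(
  Bee            = c("Apis mellifera", "Bombus terrestris"),
  scientificName = c("Apis mellifera", "Osmia bicornis"),
  stringsAsFactors = FALSE
)

# No column is called "Bees" (note the trailing "s"), so this is NULL
missing_col <- toy$Bees
is.null(missing_col)

# Comparing a zero-length vector with a real column gives a zero-length result
cmp <- as.character(toy$Bees) == as.character(toy$scientificName)
length(cmp)

# Checking the actual spelling of the column names
names(toy)
```

If that is what is happening, `names(df)` on the real data should show which spelling is correct.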
---
library(vecsets)
eg_data <- data.frame(col1 = df$Bee, col2 = df$scientificName, stringsAsFactors = FALSE)
eg_data$name1_diff1_2 <- mapply(vsetdiff, strsplit(eg_data$col1, split = ""), strsplit(eg_data$col2, split = ""))
eg_data$name2_diff2_1 <- mapply(vsetdiff, strsplit(eg_data$col2, split = ""), strsplit(eg_data$col1, split = ""))
The output for this has each character as its own string.
---
setdiff(df$Bee, df$scientificName)
This output was the closest I got. It does show the species that differ, but it doesn’t tell me where the difference is (i.e. which column has a given species and which one doesn’t).
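As far as I can tell from the documentation, `setdiff(x, y)` returns the elements of `x` that are not in `y`, so the argument order decides the direction. A sketch with made-up names (the vectors `old_names` and `new_names` are invented for illustration), running it both ways:

```r
# Made-up stand-ins for the two columns
old_names <- c("Apis mellifera", "Bombus terrestris", "Apis mellifera")
new_names <- c("Apis mellifera", "Osmia bicornis")

# In the old list but missing from the new one
only_in_old <- setdiff(old_names, new_names)

# In the new list but missing from the old one
only_in_new <- setdiff(new_names, old_names)

only_in_old
only_in_new
```

If that reading is right, the call above on my real data lists species that are in Bee but not in scientificName, and swapping the arguments would give the other direction.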
---
anti_join(df$Bee, df$scientificName, by = "text")
The error message was
Error in UseMethod("anti_join") : no applicable method for 'anti_join' applied to an object of class "character"
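The message suggests `anti_join()` expects data frames, not bare character vectors. A minimal sketch under that assumption, with each list wrapped in a one-column data frame; the column name `species` and the names in it are made up for illustration:

```r
library(dplyr)

# Each list wrapped in a one-column data frame
old_tbl <- data.frame(species = c("Apis mellifera", "Bombus terrestris"),
                      stringsAsFactors = FALSE)
new_tbl <- data.frame(species = c("Apis mellifera", "Osmia bicornis"),
                      stringsAsFactors = FALSE)

# Rows of old_tbl that have no match in new_tbl
missing_from_new <- anti_join(old_tbl, new_tbl, by = "species")
missing_from_new
```

Swapping the first two arguments would give the rows of `new_tbl` that have no match in `old_tbl`.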