I downloaded this medical dictionary into R:
url <- "https://archive.org/stream/azfamilymedicalencyclopedia/A-Z%20Family%20Medical%20Encyclopedia_djvu.txt"
destfile <- "A-Z_Family_Medical_Encyclopedia.txt" # Save it locally with this name
download.file(url, destfile)
file_content <- readLines(destfile, encoding = "UTF-8")
Is it possible to only keep medical related terms and remove everything else?
I know how to remove stop words, e.g.
library(tm)
all_text <- paste(file_content, collapse = " ")
words <- unlist(strsplit(all_text, "\\W+"))
filtered_words <- words[!tolower(words) %in% stopwords("en")]
filtered_text <- paste(filtered_words, collapse = " ")
But is there a way to keep only words related to medicine/health, i.e. technical/scientific vocabulary?
You can obtain a list of medical terms (for example, the GitHub word list downloaded in the code below). Then compare your words against it with %in%,
just like you did before.
It will go something like this:
library(tm)
# set the current script's location as working directory (only works inside RStudio, needs the rstudioapi package)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
url <- "https://archive.org/stream/azfamilymedicalencyclopedia/A-Z%20Family%20Medical%20Encyclopedia_djvu.txt"
destfile <- "A-Z_Family_Medical_Encyclopedia.txt" # Save it locally with this name
download.file(url, destfile)
file_content <- readLines(destfile, encoding = "UTF-8")
all_text <- paste(file_content, collapse = " ")
words <- unlist(strsplit(all_text, "\\W+"))
filtered_words <- words[!tolower(words) %in% stopwords("en")]
filtered_text <- paste(filtered_words, collapse = " ")
# URL for the medical terms word list
medical_terms_url <- "https://raw.githubusercontent.com/glutanimate/wordlist-medicalterms-en/master/wordlist.txt"
medical_terms_file <- "medical_terms.txt"
# Download the medical terms list
download.file(medical_terms_url, medical_terms_file)
# Read the medical terms into a vector
medical_terms <- readLines(medical_terms_file, encoding = "UTF-8")
medical_terms <- tolower(medical_terms) # Ensure all terms are in lowercase for matching
# Ensure all words in the text are in lowercase
filtered_words_lower <- tolower(filtered_words)
# Keep only words that are in the medical terms list
medical_filtered_words <- filtered_words_lower[filtered_words_lower %in% medical_terms]
unique_medical_filtered_words <- sort(unique(medical_filtered_words))
# do whatever you want with that :)
This results in 282,890 medical-related words out of 428,875 filtered_words, 10,950 of which are unique.
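If you want to check the matching logic without downloading anything, here is a minimal sketch on toy data. The sentence and the three-term list are made up for illustration (they are not taken from the encyclopedia or the GitHub word list), and the frequency table at the end is just one possible next step, not something the approach requires.

```r
# Hypothetical stand-ins for the downloaded text and the medical word list
text <- "The patient had a fever, and the Fever led to severe dehydration."
medical_terms <- c("fever", "dehydration", "asthma")

# Split on runs of non-word characters, lowercase, and drop empty tokens
words <- tolower(unlist(strsplit(text, "\\W+")))
words <- words[nzchar(words)]

# Keep only the tokens that appear in the medical list
medical_words <- words[words %in% medical_terms]

# Count how often each retained term occurs, most frequent first
term_counts <- sort(table(medical_words), decreasing = TRUE)
print(term_counts)
```

Note that %in% only matches single tokens exactly, so multi-word terms and inflected forms will be missed unless the word list contains them in that form.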