I want to separate a column of my dataset into different columns. However, the problem is that the column of interest is dirty and does not split the I want. This is the column I have.
notified_amount | nomenclature_of_loans |
---|---|
7000 | 6.64% GS 2029 |
8000 | 7.46%GS 2017 (1) |
25000 | 12.25 % GS, 2010 (1) (12) |
I have tried the following code:
govt_1 |>
separate_wider_delim(nomenclature_of_loans, delim = " ", names = c("rate", "type", "year"),
too_few = "align_end", too_many = "merge")
I tried the separate_wider_delim
function to extract the year
from the nomenclature_of_loans
column. Though I am successful in doing so for quite some rows, dirty rows are preventing me from extracting year
for all the rows and I am stuck.
Now I don’t care about the rate
and type
columns. I want to successfully separate the year
from the nomenclature_of_loans
column. While the first column perfectly splits, I am having problems with the second and third type. Mind you, my dataset has 2,800 rows. And around 350 rows are throwing a similar problem. I am stuck and unsure on how to proceed.
Ideally, this is the output I want
notified_amount | rate | year |
---|---|---|
7000 | 6.64% | 2029 |
8000 | 7.46% | 2017 |
25000 | 12.25% | 2010 |
But, I am okay with extracting just the year
alone too.
If you have any information on how to pre-clean the data before splitting or how to further clean the after splitting the data, please let me know. Thanks in advance!
You probably need to use regular expressions to separate the rate and year. These can be tricky, and probably what I suggest here will get you started but not quite all of the way. If you get stuck with more confounding strings in your nomenclature column, either add them to your question or start a new question. But here’s a start.
If you start with the tibble
library(tidyverse)
govt_1 <- tibble(nomenclature_of_loans = c("6.64% GS 2029", "7.46%GS 2017 (1)", "12.25 % GS, 2010 (1) (12)"))
then to pull out the year, you could use
govt_1 |>
mutate(yr = str_extract(nomenclature_of_loans, "20[0-9]{2}"))
assuming that your years are in the current century and that the messy strings around them don’t look like dates too.
To get the percentage and the year, you could try
govt_1 |>
mutate(x = str_match(nomenclature_of_loans, "(\d+?\.\d+\%).+?(20[0-9]{2})"),
rate = x[,2],
year = x[,3])
The x
column is actually a column of tibbles. Those can be confusing to work with so I showed you how to extract the rate and column from them.
Creating regular expressions can feel like a superpower and a curse at the same time. The one above looks for a digit \d
that might be repeated +
but also might not be there at all ?
. To look for a decimal point, you must escape it \.
The percentage sign also should be escaped \%
. The bit for the percentage is wrapped in round parentheses because it will be the first capture group. The year is in the second capture group. Between them I use an unescaped period to represent any character that might appear there. I use the +?
after the period to catch as much or little garbage as there might be.
Using str_extract()
from the stringr
library, we can try:
govt_1$rate <- str_extract(govt_1$nomenclature_of_loans, "^\d+(?:\.\d+)?\s*%")
govt_1$year <- str_extract(govt_1$nomenclature_of_loans, "\b\d{4}\b")
Instead of using separate_wider_delim
you can use separate_wider_regex
to separate on patterns or regular expressions similar to the other answers using str_xxx
but requires to be more explicit about the parts of the string that you want to discard:
govt_1 <- data.frame(
notified_amount = c(7000L, 8000L, 25000L),
nomenclature_of_loans = c(
"6.64% GS 2029",
"7.46%GS 2017 (1)", "12.25 % GS, 2010 (1) (12)"
)
)
library(tidyr)
govt_1 |>
separate_wider_regex(
nomenclature_of_loans,
patterns = c(
rate = "^\d+\.\d+\s?%",
".*?",
year = "\d{4}",
".*$"
)
) |>
dplyr::mutate(rate = gsub("\s%", "%", rate))
#> # A tibble: 3 × 3
#> notified_amount rate year
#> <int> <chr> <chr>
#> 1 7000 6.64% 2029
#> 2 8000 7.46% 2017
#> 3 25000 12.25% 2010
1) tidyverse Separate the fields by percent (%) and then extract the first number from each using parse_number
. No regular expressions are used.
library(dplyr)
library(readr)
library(tidyr)
govt_1 %>%
separate_wider_delim (nomenclature_of_loans,
names = c("rate", "year"), delim = "%") %>%
mutate(across(rate:year, parse_number))
giving
# A tibble: 3 × 3
notified_amount rate year
<int> <dbl> <dbl>
1 7000 6.64 2029
2 8000 7.46 2017
3 25000 12.2 2010
2) strcapture Using only base R we can use strcapture
with the indicated regular expression. The matches to the capture groups in the regular expression will be assigned to the names and types in proto
.
pat <- "(\d+.\d+)\D+(\d{4})"
proto <- list(rate = numeric(0), year = numeric(0))
with(govt_1,
data.frame(notified_amount, strcapture(pat, nomenclature_of_loans, proto)))
giving
notified_amount rate year
1 7000 6.64 2029
2 8000 7.46 2017
3 25000 12.25 2010
3) read.table Again using only base R we can split the field into two using read.table
and then fix up year
. The only regular expression used is D
which matches any non-digit.
govt_1 |>
with(data.frame(
notified_amount,
read.table(text = nomenclature_of_loans, sep = "%", comment.char = "(",
col.names = c("rate", "year"))) |>
transform(year = as.numeric(trimws(year, whitespace = "\D"))))
giving
notified_amount rate year
1 7000 6.64 2029
2 8000 7.46 2017
3 25000 12.25 2010
Note
The input in reproducible form
govt_1 <- data.frame(
notified_amount = c(7000L, 8000L, 25000L),
nomenclature_of_loans = c("6.64% GS 2029", "7.46%GS 2017 (1)", "12.25 % GS, 2010 (1) (12)")
)