Lost on how to separate columns

I want to separate a column of my dataset into different columns. However, the problem is that the column of interest is dirty and does not split the I want. This is the column I have.

notified_amount	nomenclature_of_loans
7000	6.64% GS 2029
8000	7.46%GS 2017 (1)
25000	12.25 % GS, 2010 (1) (12)

I have tried the following code:

govt_1 |> 
    separate_wider_delim(nomenclature_of_loans, delim = " ", names = c("rate", "type", "year"),
                         too_few = "align_end", too_many = "merge")

I tried the separate_wider_delim function to extract the year from the nomenclature_of_loans column. Though I am successful in doing so for quite some rows, dirty rows are preventing me from extracting year for all the rows and I am stuck.

Now I don’t care about the rate and type columns. I want to successfully separate the year from the nomenclature_of_loans column. While the first column perfectly splits, I am having problems with the second and third type. Mind you, my dataset has 2,800 rows. And around 350 rows are throwing a similar problem. I am stuck and unsure on how to proceed.

Ideally, this is the output I want

notified_amount	rate	year
7000	6.64%	2029
8000	7.46%	2017
25000	12.25%	2010

But, I am okay with extracting just the year alone too.

If you have any information on how to pre-clean the data before splitting or how to further clean the after splitting the data, please let me know. Thanks in advance!

You probably need to use regular expressions to separate the rate and year. These can be tricky, and probably what I suggest here will get you started but not quite all of the way. If you get stuck with more confounding strings in your nomenclature column, either add them to your question or start a new question. But here’s a start.

If you start with the tibble

library(tidyverse)

govt_1 <- tibble(nomenclature_of_loans = c("6.64% GS 2029", "7.46%GS 2017 (1)", "12.25   % GS, 2010 (1) (12)"))

then to pull out the year, you could use

govt_1 |> 
    mutate(yr = str_extract(nomenclature_of_loans, "20[0-9]{2}"))

assuming that your years are in the current century and that the messy strings around them don’t look like dates too.

To get the percentage and the year, you could try

govt_1 |> 
    mutate(x = str_match(nomenclature_of_loans, "(\d+?\.\d+\%).+?(20[0-9]{2})"),
           rate = x[,2],
           year = x[,3])

The x column is actually a column of tibbles. Those can be confusing to work with so I showed you how to extract the rate and column from them.

Creating regular expressions can feel like a superpower and a curse at the same time. The one above looks for a digit \d that might be repeated + but also might not be there at all ?. To look for a decimal point, you must escape it \. The percentage sign also should be escaped \%. The bit for the percentage is wrapped in round parentheses because it will be the first capture group. The year is in the second capture group. Between them I use an unescaped period to represent any character that might appear there. I use the +? after the period to catch as much or little garbage as there might be.

Using str_extract() from the stringr library, we can try:

govt_1$rate <- str_extract(govt_1$nomenclature_of_loans, "^\d+(?:\.\d+)?\s*%")
govt_1$year <- str_extract(govt_1$nomenclature_of_loans, "\b\d{4}\b")

Instead of using separate_wider_delim you can use separate_wider_regex to separate on patterns or regular expressions similar to the other answers using str_xxx but requires to be more explicit about the parts of the string that you want to discard:

govt_1 <- data.frame(
  notified_amount = c(7000L, 8000L, 25000L),
  nomenclature_of_loans = c(
    "6.64% GS 2029",
    "7.46%GS 2017 (1)", "12.25 % GS, 2010 (1) (12)"
  )
)

library(tidyr)

govt_1 |>
  separate_wider_regex(
    nomenclature_of_loans,
    patterns = c(
      rate = "^\d+\.\d+\s?%",
      ".*?",
      year = "\d{4}",
      ".*$"
    )
  ) |>
  dplyr::mutate(rate = gsub("\s%", "%", rate))
#> # A tibble: 3 × 3
#>   notified_amount rate   year 
#>             <int> <chr>  <chr>
#> 1            7000 6.64%  2029 
#> 2            8000 7.46%  2017 
#> 3           25000 12.25% 2010

Recognized by R Language Collective

1) tidyverse Separate the fields by percent (%) and then extract the first number from each using parse_number. No regular expressions are used.

library(dplyr)
library(readr)
library(tidyr)

govt_1 %>%
  separate_wider_delim (nomenclature_of_loans, 
    names = c("rate", "year"), delim = "%") %>%
  mutate(across(rate:year, parse_number))

giving

# A tibble: 3 × 3
  notified_amount  rate  year
            <int> <dbl> <dbl>
1            7000  6.64  2029
2            8000  7.46  2017
3           25000 12.2   2010

2) strcapture Using only base R we can use strcapture with the indicated regular expression. The matches to the capture groups in the regular expression will be assigned to the names and types in proto.

pat <- "(\d+.\d+)\D+(\d{4})"
proto <- list(rate = numeric(0), year = numeric(0))
with(govt_1, 
  data.frame(notified_amount, strcapture(pat, nomenclature_of_loans, proto)))

giving

  notified_amount  rate year
1            7000  6.64 2029
2            8000  7.46 2017
3           25000 12.25 2010

3) read.table Again using only base R we can split the field into two using read.table and then fix up year. The only regular expression used is D which matches any non-digit.

govt_1 |>
  with(data.frame(
    notified_amount,
    read.table(text = nomenclature_of_loans, sep = "%", comment.char = "(",
      col.names = c("rate", "year"))) |>
  transform(year = as.numeric(trimws(year, whitespace = "\D"))))

giving

  notified_amount  rate year
1            7000  6.64 2029
2            8000  7.46 2017
3           25000 12.25 2010

Note

The input in reproducible form

govt_1  <- data.frame(
  notified_amount = c(7000L, 8000L, 25000L),
  nomenclature_of_loans = c("6.64% GS 2029", "7.46%GS 2017 (1)", "12.25 % GS, 2010 (1) (12)")
)

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: Kiến thức lập trình - @ 14:22

Thẻ: rstringr

Thiết kế website giá rẻ

Danh mục

Lost on how to separate columns

Note