I am reading a PDF in R using the pdftools package:
<code>library(tidyverse)
library(pdftools)
library(lubridate)
pdf_rowwise <- strsplit(pdf_text("V://path//sample.pdf"), split = "\n")
</code>
<code>class(pdf_rowwise[[1]][8:18])
</code>
output: [1] "character"
Here is a sample of lines from this PDF:
<code>pdf_rowwise[[1]][8:18]
</code>
<code> [1] "Test Name Result Biological Ref. Int. Unit"
[2] ""
[3] " 100 TEST AAROGYA 2.0"
[4] " THYROID PROFILE,Serum"
[5] "TOTAL TRI IODOTHYRONINE - T3 0.89 0.80-2.0 ng/ml"
[6] " (Method : CLIA)"
[7] ""
[8] "TOTAL THYROXINE - T4 8.64 6.09 - 12.23 ug/dL"
[9] " (Method : CLIA)"
[10] ""
[11] "THYROID STIMULATING HORMONE - TSH 5.660H 0.35 - 5.50 uIU/mL"
</code>
I have also saved the above output as a text file at https://raw.githubusercontent.com/johnsnow09/stackover_doubts/main/sample_pdf_text.txt
Either the text above or the text file can be used as the data source. From it, I am trying to extract the data on lines 5, 8, and 11 into a data frame with 3 or 4 columns.
Desired Output:
I have tried the few approaches below, but none of them works for me:
<code>strsplit(pdf_rowwise[[1]][8:18], split = "\t")
</code>
<code>pdf_rowwise[[1]][8:18] %>% as.tibble()
# this combines everything into 1 column dataframe
</code>
<code># the attempts below also don't work
strsplit(pdf_rowwise[[1]][8:18], split = "\t") %>% as.tibble()
strsplit(pdf_rowwise[[1]][8:18], split = "\t") %>% list2DF()
</code>
<code>str_split_fixed(pdf_rowwise[[1]][8:18]," ",2)
# not giving what I expected
</code>
I am new to this sort of parsing and extraction, so I am not sure which libraries and functions are best suited for this work.
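To make the goal concrete, here is a minimal sketch of one way the three result rows might be pulled out with a regular expression. It assumes every result row has the shape "name, decimal result, optional flag letter such as H, low-high reference range, unit", which is only inferred from the sample lines; the column names (`test_name`, `result`, `flag`, `ref_range`, `unit`) and the variable names are made up for illustration, and the pattern may need adjusting for other reports.

```r
library(stringr)

# Sample lines copied from the output shown in the question. Real pdf_text()
# output may pad the columns with several spaces; \\s+ tolerates that.
lines <- c(
  "Test Name Result Biological Ref. Int. Unit",
  "",
  " 100 TEST AAROGYA 2.0",
  " THYROID PROFILE,Serum",
  "TOTAL TRI IODOTHYRONINE - T3 0.89 0.80-2.0 ng/ml",
  " (Method : CLIA)",
  "",
  "TOTAL THYROXINE - T4 8.64 6.09 - 12.23 ug/dL",
  " (Method : CLIA)",
  "",
  "THYROID STIMULATING HORMONE - TSH 5.660H 0.35 - 5.50 uIU/mL"
)

# Assumed row shape: name, decimal result, optional flag letter,
# "low - high" reference range, unit.
pattern <- paste0(
  "^(.+?)\\s+",                           # test name (lazy)
  "(\\d+\\.\\d+)([A-Z]?)\\s+",            # result and optional flag
  "(\\d+\\.?\\d*\\s*-\\s*\\d+\\.?\\d*)",  # reference range
  "\\s+(\\S+)$"                           # unit
)

# str_match() returns one row per input line; non-matching lines are all NA.
m <- str_match(str_trim(lines), pattern)
m <- m[!is.na(m[, 1]), , drop = FALSE]

results <- data.frame(
  test_name = m[, 2],
  result    = as.numeric(m[, 3]),
  flag      = m[, 4],
  ref_range = m[, 5],
  unit      = m[, 6],
  stringsAsFactors = FALSE
)
print(results)
```

Header, blank, and "(Method : CLIA)" lines do not fit the assumed shape, so they are dropped rather than filtered by line number.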