I have a huge text file whose size is much larger than the total amount of RAM. For example,
```
...a01012... [a: 120345...6701019876001110222...
t0**00010101fghi bcdef]jj018024[a:__n*m*...
k0102010102eee/01eee/0101//] ghi [a:01]
```
I need to transform the input file to the output file as follows:

- In a block of text between `[a:` and `]`, replace any string that matches the regular expression `(01|02){1,}` with `/` (at any distance from `[a:` and `]`);
- In a block of text which is not between `[a:` and `]`, replace any string that matches the regular expression `(01|02){1,}` with the empty string, i.e. delete it (at any distance from `[a:` or `]`, and regardless of whether they exist).
So for the input file above the expected output is

```
...a2... [a: 120345...67/98760/11/22...
t0**00/fghi bcdef]jj84[a:__n*m*...
k/eee//eee////] ghi [a:/]
```
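To pin down the semantics, here is a small in-memory reference implementation of the two rules (Python and the name `transform` are my own illustration; it deliberately ignores the memory constraints described below and only serves to make the expected behavior precise):

```python
import re

def transform(text):
    """In-memory reference: alternate between text outside and inside
    [a: ... ] blocks, applying the corresponding replacement to each segment."""
    out, pos, inside = [], 0, False
    while True:
        delim = "]" if inside else "[a:"
        i = text.find(delim, pos)
        seg = text[pos:] if i == -1 else text[pos:i]
        # Inside a block a maximal (01|02)+ run becomes "/"; outside it is deleted.
        out.append(re.sub(r"(01|02)+", "/" if inside else "", seg))
        if i == -1:
            break
        out.append(delim)        # delimiters themselves are kept verbatim
        pos = i + len(delim)
        inside = not inside
    return "".join(out)
```

Running it on the three-line sample above reproduces the expected output exactly.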
There are four specific caveats to consider:

- The length of each block of text satisfying property #1 or property #2 may be arbitrarily large, possibly even larger than the total amount of available RAM;
- The length of a string that matches the regular expression `(01|02){1,}` may be larger than the total amount of available RAM. That is, there may be a string that matches, for example, `(01|02){5234567890}`, and it must be replaced in its entirety;
- I need reasonable performance (say, ~1–4 hours for 4 GB of text, ~2–8 hours for 8 GB, ~3–12 hours for 12 GB, etc., on a typical user PC with 8 GB of RAM);
- The total amount of free space on the data storage device is not larger than `4*S`, where `S` is the size of the input file. That is, `4*S` is the maximum amount of data that a program may need to store in non-volatile memory.
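To illustrate the kind of approach I have in mind: the delimiters and the `(01|02){1,}` runs can be recognized character by character, so a single-pass state machine with O(1) state should satisfy the memory constraints, since a run of any length is tracked with one flag rather than buffered. A sketch in Python (the names, the chunked reading, and the chunk size are my own assumptions, not a vetted solution):

```python
def stream_transform(src, dst, chunk_size=1 << 20):
    """Single-pass, constant-memory sketch: read `src` chunk by chunk and
    write the transformed text to `dst`."""
    inside = False   # between "[a:" and "]"?
    run = False      # at least one complete 01/02 pair consumed?
    pend = False     # lone '0' seen (possible start of a pair)?
    dstate = 0       # progress matching the "[a:" opener: 0, 1 or 2
    out = []

    def flush_run():
        nonlocal run
        if run:
            if inside:
                out.append("/")   # inside a block: the run becomes "/"
            run = False           # outside: the run is simply dropped

    def feed(c):
        nonlocal inside, pend, run, dstate
        if pend:
            pend = False
            if c in "12":         # '0' + c completes a pair, extending the run
                run = True
                return
            flush_run()
            out.append("0")       # the pending '0' was literal after all
        if c == "0":
            pend = True
            dstate = 0            # '0' cannot continue "[a:"
            return
        flush_run()               # any other character terminates a run
        if inside:
            if c == "]":
                inside = False
        elif dstate == 2 and c == ":":
            inside, dstate = True, 0
        elif c == "[":
            dstate = 1
        elif dstate == 1 and c == "a":
            dstate = 2
        else:
            dstate = 0
        out.append(c)

    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        for c in chunk:
            feed(c)
        dst.write("".join(out))
        out.clear()
    if pend:                      # file ended on a lone '0'
        flush_run()
        out.append("0")
    flush_run()
    dst.write("".join(out))
```

The same character-level logic should translate to Perl or to a Node.js `Transform` stream; classic `sed` seems problematic because it holds each line in its pattern space, and here a single line may exceed RAM.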
Is it possible to solve the problem with sed, Node.js or Perl? If not, what specific scripting language is practically suitable for solving a problem like this, and how?