One of our applications filters the files in a certain directory, extracts some data from each of them, and exports a document built from the extracted data. The algorithm for extracting the data depends on the file, and so far we use a regex to select the algorithm: for example, files matching .*.txt are processed by algorithm A, files matching foo[0-5].xml by algorithm B, etc.
However, we now need some files to be processed together. For example, in one case we need two files, foo.*.xml and bar.*.xml. Part of the information to be extracted exists in the foo file, and the other part in the bar file. Moreover, we need to make sure the wildcard parts are compatible, i.e. that they match the same text in both names. For example, if there are these 6 files:
foo1.xml
foo23.xml
bar1.xml
bar9.xml
bar23.xml
foo4.xml
I would expect foo1 and bar1 to be identified as one group, and foo23 and bar23 as another. bar9 and foo4 have no partner, so they will not be processed.
Now, since the filter is configured by the user, we need a pattern that can express the above requirement. I don't think this can be expressed in standard regex. (foo|bar).*.xml will match all 6 files above, and we can't tell which file is paired with which.
Is there any standard pattern syntax that can express this? Or any idea how to extend regex to support it in a way that is easy to implement?
I think what you have in mind could be solved by backreferences. See, for example, here:
http://msdn.microsoft.com/en-us/library/thwdfzxy.aspx
or here
http://www.regular-expressions.info/brackets.html
An expression like
(foo([0-9]+)\.xml) .* (bar\2\.xml)
applied to the space-separated list of file names will deliver pairs like foo1.xml, bar1.xml as matches. Here \2 refers back to the digits captured by the second group, so the bar name must carry the same number as the foo name. Of course, you may first have to solve the problem of bringing the file names into the correct order (or provide a regex that is independent of the order of the files).
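To make the backreference idea concrete, here is a minimal Python sketch. It uses a slightly stricter variant of the expression above: the whole pattern is wrapped in a lookahead so that overlapping pairs are found (the foo1/bar1 span contains foo23.xml), and adjacent names are allowed. The names PAIR and find_pairs are made up for illustration, and the listing is the one from the question.

```python
import re

# File names from the question, joined into one space-separated listing.
names = ["foo1.xml", "foo23.xml", "bar1.xml", "bar9.xml", "bar23.xml", "foo4.xml"]
listing = " ".join(names)

# \2 must repeat exactly what the second group (the digits) captured.
# (?<!\S) and (?!\S) anchor the match on whole names.
# The outer lookahead keeps the match zero-width, so pairs may overlap.
PAIR = re.compile(r"(?<!\S)(?=(foo([0-9]+)\.xml)(?: \S+)* (bar\2\.xml)(?!\S))")

def find_pairs(listing):
    """Return (foo_name, bar_name) tuples where foo precedes its bar."""
    return [(m.group(1), m.group(3)) for m in PAIR.finditer(listing)]

pairs = find_pairs(listing)
print(pairs)
```

This only finds pairs in which the foo file is listed before the bar file, so it is a sketch of one rule, not an order-independent solution.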
EDIT: concerning order of files: you could specify this with two different rules, since I guess you want your processing done in a specific order. So when the above expression delivers you a pair
(filename1,filename2)
you run the processing algo P with parameters
P(filename1,filename2)
and when the second rule
(bar([0-9]+)\.xml) .* (foo\2\.xml)
delivers you a pair
(filename1,filename2)
you call P with the order of the names switched:
P(filename2,filename1)
Of course, depending on your reg exp processor, you can also use newlines for separating the file names and use multiline matching. I used whitespace above just for easier demonstration purposes.
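The two-rule scheme with newline-separated names and multiline matching could be sketched in Python roughly as follows. Everything here is an assumption for illustration: rule, process (a stand-in for P) and run_rules are hypothetical names, and the listing is a reordered version of the question's files so that both rules fire.

```python
import re

def rule(first, second):
    """Build a rule: a `first` file followed, on a later line, by a
    `second` file carrying the same digits (backreference \\2)."""
    return re.compile(
        rf"^(?=({first}([0-9]+)\.xml)$(?:\n.*)*?\n({second}\2\.xml)$)",
        re.MULTILINE,
    )

FOO_THEN_BAR = rule("foo", "bar")
BAR_THEN_FOO = rule("bar", "foo")

def process(foo_file, bar_file):
    # Stand-in for the processing algorithm P; always receives (foo, bar).
    return (foo_file, bar_file)

def run_rules(listing):
    results = []
    for m in FOO_THEN_BAR.finditer(listing):
        # First rule: names already arrive as (foo, bar).
        results.append(process(m.group(1), m.group(3)))
    for m in BAR_THEN_FOO.finditer(listing):
        # Second rule: names arrive as (bar, foo), so switch them.
        results.append(process(m.group(3), m.group(1)))
    return results

# Hypothetical ordering in which bar23.xml comes before foo23.xml.
listing = "\n".join(
    ["bar23.xml", "foo1.xml", "foo23.xml", "bar1.xml", "bar9.xml", "foo4.xml"]
)
results = run_rules(listing)
print(results)
```

With re.MULTILINE, ^ and $ match at line boundaries, which plays the role the spaces played in the single-line version.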