I want to recursively extract the paths of all files with unique contents from a directory parent_dir. All of them have the same file name, summary.txt. Their paths look like this:
parent_dir/sub_dir_1/sum_1/A/../B/summary.txt
However, sum_1 is linked, so I can also access the same summary.txt as:
parent_dir/sub_dir_1/sum_1/summary.txt
Now I want to extract the paths of all such summary.txt files from parent_dir such that all of them have unique contents, and I'm writing Python code for it.
I tried this:
...
import glob
import filecmp

List = glob.glob('parent_dir/**/summary.txt', recursive=True)
Newer_list = [List[0]]
for file_1 in List:
    # keep file_1 only if its contents differ from every file kept so far
    if all(not filecmp.cmp(file_1, file_2, shallow=False) for file_2 in Newer_list):
        Newer_list.append(file_1)
...
As a result, glob.glob(...) returns paths in both formats, so I tried a workaround to 'cleanse' the List after glob.glob() has extracted all the file paths. However, this has a long run time, and since I will potentially have to work with a large number of files, I want to avoid that.
Almost all the files have very similar sizes, so I think comparing sizes isn’t the most effective solution either.
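One direction I have been sketching instead, in case it helps frame the question: hash each file's contents once and keep one path per digest, so every file is read a single time rather than compared pairwise. This is only a rough sketch of the idea; the choice of sha256 (and reading each whole file at once, since the summary.txt files are small) are my own assumptions:

import glob
import hashlib

seen_digests = set()   # content digests already kept
unique_paths = []      # one path per distinct content
for path in glob.glob('parent_dir/**/summary.txt', recursive=True):
    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest not in seen_digests:
        seen_digests.add(digest)
        unique_paths.append(path)

Is this the right general direction, or is there a better way?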
I considered using the set() method to remove the non-unique files from List, but since a single file can have two different paths, I'm not sure it will work.
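To be concrete about what I mean: a plain set() of the glob results would keep both spellings of the same path, so I was wondering whether resolving the links first would make it usable. A rough sketch of that idea (os.path.realpath is my assumption for the resolving step, and it only collapses duplicate paths to the same file, not different files that happen to have identical contents):

import glob
import os

paths = glob.glob('parent_dir/**/summary.txt', recursive=True)
# keep one representative path per resolved (link-free) location
by_real_path = {os.path.realpath(p): p for p in paths}
deduped_paths = list(by_real_path.values())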
fdupes seems like a good choice, but it returns the duplicate files, which is the opposite of what I want. I'm also not really sure how to fit it into my code.
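The only way I could think of to fold fdupes into the script is to call it with subprocess and then drop all but one path from each duplicate group it reports. This is just a sketch of that idea, and it assumes that fdupes -r prints groups of duplicate paths separated by blank lines, in the same path form that glob returns:

import glob
import subprocess

out = subprocess.run(['fdupes', '-r', 'parent_dir'],
                     capture_output=True, text=True, check=True).stdout
# each blank-line-separated block is one group of files with identical contents
groups = [g.splitlines() for g in out.strip().split('\n\n') if g.strip()]

unique_paths = set(glob.glob('parent_dir/**/summary.txt', recursive=True))
for group in groups:
    # keep the first path of each duplicate group, discard the rest
    for duplicate in group[1:]:
        unique_paths.discard(duplicate)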
Any help is appreciated, thanks.