I want to use the .NET TPL to asynchronously do a DIR /S
and search each subdirectory on a hard drive for a word in each file. What should my API look like?
In this scenario I know that each subdirectory will have 0..10000 files and 0..10000 directories. I know the tree is unbalanced and want to return data (along with its position in the hierarchy) as soon as it’s available. I am interested in getting data as quickly as possible, but also want to update that result if “better” data is found (“better” meaning closer to the root of C:).
I may also be interested in finding all matches in relation to their position in the hierarchy (akin to a report).
Question:
How should I return data to my caller?
My first guess is that I need a shared object that maintains the current “status” of the traversal (started | not started | complete), and I might base it on the System.Collections.Concurrent collections.

Another idea I’m considering is the producer/consumer pattern (which the concurrent collections can handle), but I’m not sure what the objects should “look” like.
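For what it’s worth, here is a minimal producer/consumer sketch along those lines. All names (`SearchHit`, `TreeSearch`) are hypothetical, the traversal is kept sequential for brevity, and `CompleteAdding` doubles as the “traversal complete” status you mention:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical result type: a match plus its depth in the hierarchy, so the
// consumer can decide whether a later hit is "better" (closer to the root).
public record SearchHit(string Path, int Depth);

public static class TreeSearch
{
    // Producer: a background task walks the tree and adds hits as it finds them.
    // Consumer: the caller drains the collection; CompleteAdding signals that
    // the traversal is finished.
    public static BlockingCollection<SearchHit> Search(string root, string word)
    {
        var hits = new BlockingCollection<SearchHit>();
        Task.Run(() =>
        {
            try { Walk(root, word, depth: 0, hits); }
            finally { hits.CompleteAdding(); }
        });
        return hits;
    }

    static void Walk(string dir, string word, int depth, BlockingCollection<SearchHit> hits)
    {
        foreach (var file in Directory.EnumerateFiles(dir))
            if (File.ReadAllText(file).Contains(word))
                hits.Add(new SearchHit(file, depth));

        foreach (var sub in Directory.EnumerateDirectories(dir))
            Walk(sub, word, depth + 1, hits);
    }
}
```

The caller consumes with `foreach (var hit in hits.GetConsumingEnumerable())`, which yields hits as they arrive and unblocks once `CompleteAdding` is called.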
Optional logical constraint: the API doesn’t have to address this, but in my “real world” design, if a directory has files, then only one file will ever contain the word I’m looking for. Someone literally doing a DIR /S as described above would need to account for more than one matching file per subdirectory.
More information:
I’m using Azure Tables to store a hierarchy of data using these TPL extension methods. A “node” is a table. Not only can each node in the hierarchy relate to any number of other nodes, but each node can also have a reciprocal link back to any other node. This could cause infinite recursion, but I’m addressing that with a shared object in my recursion loop.
Note that each “node” also has the ability to store local data unique to that node. It is this information that I’m searching for. In other words, I’m searching for a specific fixed RowKey in a hierarchy of nodes.
When I search for the fixed RowKey in the hierarchy I’m interested in getting the results FAST (first node found) but prefer data that is “closer” to the starting point of the hierarchy.
Since many nodes may have the particular RowKey I’m interested in, sometimes I may want to get a report of ALL the nodes that contain this RowKey.
So I would break this into two stages: the directory listing / traversal and the search function.
The listing / traversal is a classic tree-node problem and there’s a boodle of examples out there for solving it. What we’ve done to solve that problem is to identify the upper layers first, retrieve those, then work on retrievals within the lower layers. The approach lends itself well to recursion and you end up with some tight code.
I would use a skip / take model and pick a small page size for each request. That keeps the callback open and responsive while the traversal is still taking place.
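As a sketch of that skip / take idea (the `Paging.InPages` helper is made up for this example): each page is handed back as soon as it is full, so the caller can start showing the first hundred entries of a 10,000-file directory without waiting for the rest.

```csharp
using System.Collections.Generic;

// Hypothetical paging helper: batch a large (possibly lazy) listing into small
// pages so the consumer sees results incrementally instead of all at once.
public static class Paging
{
    public static IEnumerable<IReadOnlyList<T>> InPages<T>(IEnumerable<T> source, int pageSize)
    {
        var page = new List<T>(pageSize);
        foreach (var item in source)
        {
            page.Add(item);
            if (page.Count == pageSize)
            {
                yield return page;              // hand back a full page immediately
                page = new List<T>(pageSize);
            }
        }
        if (page.Count > 0) yield return page;  // final partial page
    }
}
```

Usage might look like `foreach (var page in Paging.InPages(Directory.EnumerateFileSystemEntries(dir), 100)) onPage(page);` — the lazy `Enumerate*` methods pair well with this, since nothing forces the full listing into memory first.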
For the API structure, I would expose a single function to retrieve the structure. That function provides the callback(s) to the calling function and essentially acts as a buffer over the actual traversal functions. Depending upon the performance of the listing, you can then easily move to a parallel traversal model and speed things up; .NET 4 and 4.5 have some pretty slick enhancements for parallelization.
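A rough sketch of that shape, assuming a callback-based entry point (the names `DirectoryApi` and `ListAsync` are invented here) and using `Parallel.ForEach` for the parallel traversal:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical API surface: one public entry point, results delivered through
// a callback as each layer is listed, subdirectories traversed in parallel.
public static class DirectoryApi
{
    public static Task ListAsync(string root, Action<string, string[]> onDirectory)
    {
        // Note: the callback may be invoked from multiple threads concurrently,
        // so the caller must make it thread-safe.
        return Task.Run(() => Traverse(root, onDirectory));
    }

    static void Traverse(string dir, Action<string, string[]> onDirectory)
    {
        onDirectory(dir, Directory.GetFiles(dir));       // report this layer first...
        Parallel.ForEach(Directory.GetDirectories(dir),  // ...then recurse in parallel
                         sub => Traverse(sub, onDirectory));
    }
}
```

This matches the "ID the upper layers first" ordering: each directory is reported before its children are descended into, so shallower results reach the caller sooner.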
How you want to pass the tree structure back is going to depend upon what you’re using for the UI. Pick a data structure that will play nicely with your UI components.
The search function would be a second call, but it should be able to rely upon or mirror some of the functions you created for the directory listing. Since you now have the liberty of choosing your search order (i.e., you control where it starts from, based upon the directory listing), you can preferentially search the closer-to-root locations first.
Here, I would probably create two calls into the API, or just overload a single one. The overload would switch between getting the first, closest-to-root instance and getting all of them. As an aside, consider how you want to handle multiple hits at the same depth from the root.
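One cheap way to get the closest-to-root guarantee is a breadth-first traversal: the first hit enumerated is necessarily at the minimum depth, and the "all" variant is just the same enumeration run to completion. A sketch (names hypothetical; note that ties at the same depth simply come out in enumeration order, which is one answer to the aside above):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical search pair: FindFirst returns the shallowest match,
// FindAll streams every match in breadth-first (closest-first) order.
public static class Finder
{
    public static string FindFirst(string root, string word) =>
        FindAll(root, word).FirstOrDefault();

    public static IEnumerable<string> FindAll(string root, string word)
    {
        var queue = new Queue<string>();
        queue.Enqueue(root);
        while (queue.Count > 0)
        {
            var dir = queue.Dequeue();
            foreach (var file in Directory.EnumerateFiles(dir))
                if (File.ReadAllText(file).Contains(word))
                    yield return file;   // shallower directories are always drained first
            foreach (var sub in Directory.EnumerateDirectories(dir))
                queue.Enqueue(sub);
        }
    }
}
```

Because `FindAll` is lazy, `FindFirst` stops traversing as soon as the first (shallowest) match is found, which fits the "results FAST, but prefer closer to the root" requirement.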
Finally, depending upon how performant the listing is, you may want to consider some form of caching for the directory structure. It shouldn’t be that expensive to keep in memory, and it should save you some time by avoiding redundant IO.
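A sketch of the caching idea, assuming a simple memoizing wrapper (no invalidation here — a real design would pair this with something like a `FileSystemWatcher` or a TTL):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

// Hypothetical listing cache: each directory's children are read from disk
// once and served from memory afterwards. ConcurrentDictionary makes this
// safe to call from a parallel traversal.
public static class ListingCache
{
    static readonly ConcurrentDictionary<string, string[]> Children =
        new ConcurrentDictionary<string, string[]>();

    public static string[] GetChildren(string dir) =>
        Children.GetOrAdd(dir, d => Directory.GetFileSystemEntries(d));
}
```

The traversal and search stages would both go through `GetChildren`, so a "find all" report run right after a "find first" query pays no second round of directory IO.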