By non-contractual parameters, I mean parameters that are not interfaces or service dependencies, something like class Person(string name)
.
I am writing a webpage scraping application, and so far I’ve written it in the wrong order (hence the creation of this question). I made a class to parse HTML documents (from string format) and give me all the urls and images in it.
That class has the following description:
public class PageParser
{
private readonly string _html;
public PageParser(string html) {
_html = html;
}
public IEnumerable<string> GetImages() {
/* not important */
}
public IEnumerable<string> GetLinks() {
/* not important */
}
}
This code works great, unit-tested, 100% coverage, etc… The problem is now how do I write unit-tests for code that uses this class without worry about the constructor of this class. I only need this new class to work with the behavior of the PageParser
.
The pseudo code for this new functionality is as follows:
public Report CreateReport(string url)
{
var html = _webClient.DownloadString(url);
var parser = new PageParser(html);
var images = parser.GetImages();
var links = parser.GetLinks();
var relevantLinks = links.Where(l => l.Contains('something'));
return _reportBuilder.Create(images, relevantLinks);
}
The problem that I have is I don’t want to have a bunch of unit-tests that have epically sized HTML documents for all my various test cases the CreateReport
method can have.
The options I have come up with are:
1) Leave everything as is
Leave the class structure as is, just live with having tests with large setup variables
2) Expose the page parser (+interface) from a Factory
Create an interface for PageParser
, and create a factory with corresponding interface.
interface IPageParserFactory
{
IPageParser Create(string html);
}
class PageParserFactory : IPageParserFactory
{
IPageParser Create(string html)
{
return new PageParser(html);
}
}
Then the resulting pseudo code would look something as follows.
// ... download string
var pageParser = _pageParserFactory.Create(html);
// ... create report
3) Move the parameter html to each method and create an interface (so now the class has no construction dependencies)
interface IPageParser
{
IEnumerable<string> GetImages(string html);
IEnumerable<string> GetLinks(string html);
}
4) Something else entirely
If there’s another pattern that solves this problem I would like to know it.
From a quick glance, it seems that your CreateReport
method is violating the single-responsibility principle, as it does a lot of things:
- it reads HTML from URL via
DownloadString
- it creates the
PageParser
and retrieves its parsing results - it filters the result based on some relevance criterion
- it does the actual creation of a report
Given the method name, one would expect that method to only have that last responsibility. What you have given here instead is something that is highly coupled together with the data retrieval, parsing and filtering logic.
The idea of creating a report is usually based on some sort of already prepared data. So one already starts wondering, why does your method take a URL as input?
From your proposed solutions 1) to 3), none addresses these issues really. When implementing either of these, you will still need to create local files with more-or-less complete HTML contents for your unit tests.
Note that some people even go as far as claiming that no unit test should access the filesystem, because that is generally too slow. Of course, one may disagree with this, but it should at least make you feel a little bit guilty, such that you only do that if you have a very good reason to do so.. and you don’t have one in this case.
So let’s go with option 4) something else entirely. Taking a different look at your problem, we see that it’s really all about dealing with some sort of document data. Your PageParser
is the first one to create such data, so what about changing it to not return individual IEnumerable
s, but instead a single PageData
object, which in turn is able to provide the IEnumerable
s.
Continuing with this line of though, a report generation really only needs a PageData
object. Your _reportBuilder.Create(images, relevantLinks)
becomes a call to _reportBuilder.Create(pageData)
and your CreateReport
takes a PageData
instead of an URL as input. Actually, CreateReport
may get shortened to that one Create
line. Everything else was not its responsibility anyways.
Outside of either of these classes (PageParser
, PageData
, report generator class) you can define an independent PageDataFilter
interface and implementation classes. One of them will be the filter you had in your CreateReport
. Note that the interface is a simple one-method filtering of the sort PageData ApplyFilter(PageData pageData)
. Dealing with filters like that gives you the advantages of easy testability of individual filters, easy composition of filters and it decouples the filters from everything else. The report generator need not care if its given PageData
was filtered in one way or the other. (Note that for the implementation you have to decide if PageData
is immutable, or can be modified by filters in-place. The first option is mathematically sound, easy to reason about, better suitable for parallelization, etc., whereas the latter is more efficient in terms of memory and cpu.)
At this point, you can easily test your PageParser
, your report generator, and your link filter. On top of that you can do all of that testing by simply preparing one straightforward instance of a required class in memory. No more URL loading, no more involved HTML stuff outside of PageParser
tests, no more file system accesses, and no more coupling that needs to be taken care of in your tests.
Addendum: What the above text is missing is how these components are wired together. Let’s say you now have a PageParser
that creates PageData
which you can filter via PageDataFilter
instances and create reports from via ReportCreator
. What you still need is the instantiation and instrumentation of these classes – somewhere further outside. It depends on your application’s design as to where a proper place would be for that. But for simplicity’s sake, you could assume an Application
class with a main
method. In that method you would do the following:
- Download HTML URL to a String
- Instantiate a
PageParser
with this string as constructor argument - Instantiate the desired
PageDataFilter
objects - Apply the
PageDataFilter
objects to thePageData
delivered by thePageParser
- Instantiate the
ReportCreator
and give it the filteredPageData
for creating a report.
Again, it depends on your overall design and architecture, if these things are manually instantiated, dependency-injected, or what-have-you. The major point though is: keep your classes small and limited to their one and only responsibility, then instrument them from the outside. If the instrumentation code gets larger and does too many things, then recursively apply the above ideas to that code as well.
1