I recently discovered (or rather realised how to use) Python’s multiple inheritance, and am afraid I’m now using it in cases where it’s not a good fit. I want to have some starting data source (NewsCacheDB, TwitterStream) that gets transformed in various ways (Vectorize, SelectKBest, SelectPercentile).
I found myself writing the following sort of code (Example 1) (the actual code is a bit more complex, but the idea is the same). The point is that for ExperimentA and ExperimentB I can define exactly what self.data is just by relying on class inheritance. Is this really a useful way of achieving the desired behaviour?
I could also use decorators (Example 2). Using the decorators would be less code.
Which approach is preferable? I’m not looking for arguments of the “I like writing decorators better” kind, but rather arguments about
- readability
- maintainability
- testability
- pythonicity (yes it’s a word).
EXAMPLE 1
class NewsCacheDB(object):
    """Play back cached news articles from a database"""
    def __init__(self):
        super(NewsCacheDB, self).__init__()

    @property
    def data(self):
        # set up access to database
        while db.isalive():
            yield db.next()  # slight simplification here

class TwitterCacheDB(object):
    """Play back cached tweets from a database"""
    def __init__(self):
        super(TwitterCacheDB, self).__init__()

    @property
    def data(self):
        # set up access to database
        while db.isalive():
            yield db.next()  # slight simplification here

class TwitterStream(object):
    def __init__(self):
        super(TwitterStream, self).__init__()

    @property
    def data(self):
        # set up access to live Twitter stream
        while stream.isalive():
            yield stream.next()

class Vectorize(object):
    """Turn raw data into numpy vectors"""
    def __init__(self):
        super(Vectorize, self).__init__()

    @property
    def data(self):
        for item in super(Vectorize, self).data:
            transformed = vectorize(item)  # slight simplification here
            yield transformed

class SelectKBest(object):
    """Select K best features based on some metric"""
    def __init__(self):
        super(SelectKBest, self).__init__()

    @property
    def data(self):
        for item in super(SelectKBest, self).data:
            transformed = select_kbest(item)  # slight simplification here
            yield transformed

class SelectPercentile(object):
    """Select the top X percentile features based on some metric"""
    def __init__(self):
        super(SelectPercentile, self).__init__()

    @property
    def data(self):
        for item in super(SelectPercentile, self).data:
            transformed = select_percentile(item)  # slight simplification here
            yield transformed

class ExperimentA(SelectKBest, Vectorize, TwitterCacheDB):
    pass  # lots of control code goes here

class ExperimentB(SelectKBest, Vectorize, NewsCacheDB):
    pass  # lots of control code goes here

class ExperimentC(SelectPercentile, Vectorize, NewsCacheDB):
    pass  # lots of control code goes here
EXAMPLE 2
def multiply(fn):
    def wrapped(self):
        return fn(self) * 2
    return wrapped

def twitter_cacheDB(fn):
    def wrapped(self):
        user, password = fn(self)
        # set up access to database
        while db.isalive():
            yield db.next()  # slight simplification here
    return wrapped

def twitter_live(fn):
    def wrapped(self):
        user, password = fn(self)
        # set up access to live stream
        while stream.isalive():
            yield stream.next()  # slight simplification here
    return wrapped

def news_cacheDB(fn):
    def wrapped(self):
        user, password = fn(self)
        # set up access to database
        while db.isalive():
            yield db.next()  # slight simplification here
    return wrapped

def vectorize(fn):
    def wrapped(self):
        for item in fn(self):
            transformed = do_vectorize(item)  # slight simplification here
            yield transformed
    return wrapped

def select_kbest(fn):
    def wrapped(self):
        for item in fn(self):
            transformed = do_selection(item)  # slight simplification here
            yield transformed
    return wrapped

class ExperimentA(object):
    @property
    @select_kbest
    @vectorize
    @twitter_cacheDB
    def a(self):
        return 'me', '123'  # return user and password to connect to DB

class ExperimentB(object):
    @property
    @select_kbest
    @vectorize
    @news_cacheDB
    def a(self):
        return 'me', '123'  # return user and password to connect to DB
ANSWER 1 (score 5)
Less code, as long as it’s readable, is better than more code
From a code size point of view I always go with the solution that requires the least amount of code that is still readable and maintainable. Less code means less chance for defects and less code to maintain.
Multiple Inheritance is not a good choice for Composition
From a design stand point I would not use multiple inheritance the way you describe for the following reasons:
- attribute/method overloading
You are changing the way data behaves in the different classes. While the initial implementation doesn’t directly violate the Open/Closed Principle of OO, any change in the future has a good chance of modifying the behaviour in one or more locations. You are also relying on behaviour pulled through super, which only works correctly if you have the base classes ordered correctly in the class definition.
- fragile tight (vertical) coupling
Relying on the class definition to specify the correct ordering of classes creates a fragile system. It’s fragile because you can’t just choose classes that have particular interfaces defined; you actually have to know the implemented logic so the super calls get executed in the correct order. As a result, it’s also an extremely tight coupling. Since it uses class inheritance we also get vertical coupling, which basically means there are implicit dependencies not just in individual methods, but potentially between the different layers (classes).
- multiple inheritance pitfalls
Multiple inheritance in any language often has many pitfalls. Python does some work to fix some issues with inheritance, however there are numerous ways of unintentionally confusing the method resolution order (mro) of classes. These pitfalls always exist, and they are also a prime reason to avoid using multiple inheritance.
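To make the ordering pitfall concrete, here is a minimal runnable sketch (with made-up numeric transforms standing in for Vectorize and friends): merely swapping the order of the base classes silently changes what data yields.

```python
class Source(object):
    @property
    def data(self):
        return [1, 2, 3]

class AddOne(object):
    @property
    def data(self):
        # Pulls data from whatever class comes next in the MRO.
        return [x + 1 for x in super(AddOne, self).data]

class TimesTen(object):
    @property
    def data(self):
        return [x * 10 for x in super(TimesTen, self).data]

class PipelineA(AddOne, TimesTen, Source):
    pass

class PipelineB(TimesTen, AddOne, Source):
    pass

print(PipelineA().data)  # [11, 21, 31] -- times ten first, then add one
print(PipelineB().data)  # [20, 30, 40] -- add one first, then times ten
```

Nothing in either class body hints at this difference; the behaviour lives entirely in the base-class ordering.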
Alternatives
Alternatively, I would leave the data-source-specific logic in the classes (i.e. the *_CacheDB classes), then use either decorator-based or functional composition to add the generalised logic that automatically applies the transformations.
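As a rough sketch of that alternative (the source and transform names here are stand-ins, not the poster’s real ones), the source stays a plain generator and an ordinary function chains the transformations, so the pipeline order is explicit at the call site instead of buried in an MRO:

```python
def compose_pipeline(source, *transforms):
    """Chain generator transforms over a data source, applied left to right."""
    def data():
        stream = source()
        for transform in transforms:
            stream = transform(stream)
        return stream
    return data

# Hypothetical stand-ins for the real data sources and transforms.
def news_source():
    for item in [1, 2, 3]:
        yield item

def vectorize(stream):
    for item in stream:
        yield item * 10

def select_kbest(stream):
    for item in stream:
        yield item + 1

experiment_a = compose_pipeline(news_source, vectorize, select_kbest)
print(list(experiment_a()))  # [11, 21, 31]
```

Swapping a source or a transform is now a one-argument change, and each piece can be unit-tested in isolation.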
ANSWER 2 (score 1)
In general, inheritance is overused. Remember: inheritance is all about the “is a” relationship; modeling hierarchical relationships. Therefore, ask yourself, “Is an ExperimentA a SelectKBest?”. That question is nonsensical. SelectKBest doesn’t even name a thing; it’s an imperative phrase (or rather, one with the words jammed together). But let’s just say you changed the name to something like TopSelector. Then, the question becomes “Is an ExperimentA a TopSelector?”. Again, that doesn’t make sense (to me). Without knowing more about your app, it very much seems to be a categorical error. These types have nothing to do with each other. Therefore, inheritance is the wrong thing to use.
This does not mean that decorators are right either though.
I’m not really sure what decorator best practices are, but I’d be wary of stacking lots of decorators. I suppose it’s OK if they do things that are totally independent. E.g.
__all__ = []

def export(f):
    __all__.append(f.func_name)
    return f

def author(name):
    def decorate(f):
        f.author = name
        return f
    return decorate

@author('allyourcode')
@export
def SaveTheWorld():
    pass  # Left as an exercise to the reader.
One thing that tells us export and author are independent is that you can apply them in either order and the result is the same: ‘SaveTheWorld’ is appended to __all__, and SaveTheWorld.author == ‘allyourcode’.
I seem to recall hearing that Guido likes this rule: if decorators stop working when you apply them in a different order, then it’s bad.
In EXAMPLE 2, order is very significant; any other ordering will, at best, give you different behavior. More likely, it breaks.
What you are trying to do is create a pipeline. Python has a very simple mechanism for doing that: call expressions. Here’s what that would look like:
def a(self):
    return select_kbest(vectorize(twitter_cache(('me', '123'))))
Or if that line grows too long, use dummy variables (still use meaningful names though!) to store intermediate results. Don’t be afraid to go with a simple solution!
“Simple is better than complex.”
–The Zen of Python
People say that the advantage of decorators is that they consolidate knowledge, reduce repetition, but regular functions do the same thing, and (when applicable) are often simpler. Also, since you can do foo(bar(…)) on a single line, it can (and usually does) result in fewer lines. The main difference is that with decorators, the additional code goes before the def keyword instead of after. Is that really an advantage? I tend to think not.
In the case of author and export, the same things cannot be accomplished using code in the def body, because such code doesn’t get executed until the function is called, whereas decorators get executed when the function is defined.
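A quick way to see this: a registering decorator (a made-up example, in the spirit of export above) runs as soon as the def statement executes, before the function is ever called:

```python
registry = []

def register(f):
    # This append runs at definition time, not at call time.
    registry.append(f.__name__)
    return f

@register
def greet():
    return 'hi'

print(registry)  # ['greet'] -- populated before greet() is ever called
```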
I think logging is closer to crossing the line into “inappropriate use of decorators” territory, but I think it’s still OK: logging decorators do change behavior, but the difference is minor (e.g. it wouldn’t reasonably break any existing tests), and you (generally) still get order independence.
Pre- and post-condition checkers (e.g. “arg 1 is of type Foo”) get even closer to the line (and perhaps cross it). If only good calls ever occur, they have no effect, and you generally get order independence. But the behavior change is more significant than just logging.
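For illustration, a minimal pre-condition checker of this kind might look like the following (require_positive and sqrt_ish are invented for the example):

```python
import functools

def require_positive(fn):
    """Reject calls whose argument is not positive; otherwise pass through."""
    @functools.wraps(fn)
    def checked(x):
        if x <= 0:
            raise ValueError('expected a positive argument')
        return fn(x)
    return checked

@require_positive
def sqrt_ish(x):
    return x ** 0.5

print(sqrt_ish(9))  # 3.0
# sqrt_ish(-1) would raise ValueError before the wrapped function runs
```

On a well-behaved call the decorator is invisible, which is why it stays (mostly) order-independent.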
Then, there are what I call “prepare” decorators, which take things one step further. E.g.
def require_login(handler):
    @functools.wraps(handler)
    def decorated(request):
        session = decode_session(request.cookies['session'])
        if not session.user_is_logged_in:
            raise HttpError(403)
        # Warning: side effect!
        request.session = session
        return handler(request)
    return decorated
require_login is sort of like a pre-condition checker, in that it raises an exception if the input fails to meet some condition. But it also does some work on behalf of handler: it sets the session attribute on request before forwarding request to handler. This makes require_login harder to understand. No longer does the original function take a regular Request: it takes a request with a session tacked on. Furthermore, the same thing can be accomplished without decorators:
def handle(request):
    session = require_login(request.cookies['session'])
    # If require_login did not raise an HttpError, then session must be that
    # of a logged-in user. Proceed as before.
As with decorators, this solution requires only one additional line, but only uses basic call technology.