Question:
Which design pattern do I use to help me refactor legacy SQL code into separate classes?
Goal
Bring in concepts of data separation
and database separation
and also latest modern techniques I may not yet be aware of (if any), to bring quality to spaghetti-coded test-less legacy PHP code base.
Details:
I am looking for a good model to help me refactor out DB code in a large legacy php project. I thought at first to use TableModule, but seeing how my SQL is complex, I think DomainModel may be better. But still I do not know how to implement it as description is too abstract. Maybe Service Layer is what I need… but to be honest, I am kind of lost.
Sample SQL that I deal with (one of many):
SELECT
db1.item.quantity,
db1.sheet.serial,
db2.work.id AS work,
db3.part.id AS partid,
db3.part.number,
db3.part.description,
db3.part.current_inv
FROM db1.item
LEFT JOIN db1.partial ON db1.partial.id = db1.item.db1_partial_id
LEFT JOIN db1.sheet ON db1.sheet.id = db1.partial.db1_sheet_id
LEFT JOIN db2.work ON db2.work.id = db1.sheet.db2_work_id
LEFT JOIN db3.part ON db3.part.id = db1.item.db3_part_id
WHERE
db1.sheet.status != 'VOID' AND
db1.item.id = "$itemid" AND
db1.item.record = 0
It ties up 3 databases and multiple tables. Considering that I want to decouple SQL from my app, and decouple app from data handling, and having complex SQL queries like above, what do you suggest I do?
My current sample plan is
- find next available stray SQL inside legacy code
- determine what it is trying to do – i.e. save/update what data on what object or model.
- create or add to the class that best encompasses SQL. For example, above SQL checks if inventory has been used up for an item. So I can place it into some class called
CheckInventoryData
, and be happy with that.. until maybe I learn something that makes me want to change the name of the class, for example. i.e.HandleItem
, and put that class in some directory structure for database handling, i.e./db_handling_code
- have a function in that class saying
isInventoryUsedUpForItem($item)
, or the like - repeat….
While I think this will work, I do not feel too confident about my plan. I am hoping to find solidification of it in a more modern/industry accepted practice. And finding that plan is my question — is there a better plan available. — is there a model I should aspire towards?
5
I’ll attempt if not answer your question, at least give some hints. As a side note, I’ve contributed to refactoring of several medium-sized projects (although not as big as yours), especially the ones which had data access directly mixed with business code, so the answer is partially based on my personal experience.
Identify the problem
The first thing I note in your question and comments is that it doesn’t seem clear whether you have an actual problem you want to solve:
Right now, I am looking to more so add modern concepts to my code. Things like looser coupling, data injection, database injection, and whatever else modern techniques relating to data codification are.
Adding modern concepts shouldn’t be your goal. Neither do lose coupling or data injection. The first thing to do is to identify the actual problem, i.e. something which blocks you right now and should be solved (eventually through lose coupling, data injection, or anything else).
That the actual problems might be:
-
Problem 1■ the difficulty to unit test the code. Whenever you want to test a simple function, you can’t do it in isolation, because it relies heavily on the database.
-
Problem 2■ the difficulty to find the relevant SQL query, because instead of following a clear architecture, queries are buried deep in business logic.
Thus, I’ll address those two problems below. Remember that it is imperative, before you search for the solution, to identify the problem. Otherwise, you may find yourself doing a lot of unneeded work or making things worse.
Study the environment
(Identify the scope of the problem)
Don’t overly focus on the location of SQL queries, but watch the overall architecture of the application. It might be that the problem is particularly visible when it comes to SQL queries, but applies to the overall code base.
Three years ago, I had to maintain a small application. It had SQL queries mixed with business logic, but the previous developer did a great work defining the architecture of the application. I don’t know why he refused to use n-tier, but the fact is that every time I had to find an SQL query, it was a matter of seconds, because the code was clear, classes were separated enough and naming conventions were excellent. While Cart
class was dealing at the same time with data access, business logic and presentation, it was easy to guess that the SQL query which changes the quantity of a product in a cart would be in a function similar to $cart->changeQuantity($customer, $product, $delta)
.
If you have a good grasp on the scope of the problem, you may target better the parts of the code base which should be changed. In your case, if you’re concerned particularly about the location of the SQL queries (Problem 2■), but have found that the scope is larger (i.e. the lack of architecture concerns the whole code base), you would be better refactoring everything, not just the database-related code.
Don’t overly rely on a tool
When searching for a solution, you’ll find promising tools, as well as people recommending a specific tool, pattern or technique which will magically solve your problems.
Just use ORM! It will make data access much simpler!
Just use <some pattern here>, and your code will become easy to maintain!
While those tools, patterns or techniques are helpful, they won’t magically solve anything. At best, they will push the code base in a good direction with little impact. In worst case, they will require weeks or months of hard work and will have practically no positive impact.
Take an ORM. Yes, this will make your code simpler. In many cases, at least. But this doesn’t make the code easier to test (Problem 1■) and may make it even harder to find the location where a specific SQL query is generated (Problem 2■), since you won’t be able to grep
the query itself any longer. It some cases, it might even have a negative impact of having two completely different ways of accessing data: through an ORM, for simple cases, and directly, for more difficult ones.
Solve the problem pragmatically
Once you identified the problem and its scope, the goal is now to solve the problem pragmatically, without overly relying on a single tool, pattern or technique (while using tools, patterns and techniques when needed).
The way I approached the problem in the past is the following in order to solve the two problems identified above.
-
Find next available stray SQL inside legacy code.
-
Determine what it is trying to do, i.e. save/update what data on what object or model.
Those two steps are exactly the same as the ones in your question. -
Create either functional or integration and system tests corresponding to this part of the code. Functional, integration and system tests usually run on a test database, specifically filled with test data.
This will serve as regression tests when you start making changes. Since those tests are at a high level (for example they may do an HTTP request to a web app instead of dealing with the underlying classes and interfaces), they should not be affected by the later changes in the code itself.
-
Check the coherence of the architecture/design of the class where the concerned SQL query was found. For example, if I find a query which changes a name of the product from
Cart
class, that’s not right. Refactoring, in this case, is mandatory. (Problem 2■) -
Ensure the query is doing what it is supposed to do. Untested legacy code contains lots of surprises, and it is not unusual to find functions with misleading names, selection queries with side effects which always return an empty result, or update queries which don’t change anything. (Problem 2■)
This step also involves dealing with legacy horror such as dynamically generated queries (should be replaced by parametrized queries).
-
Ensure the query is as simple as it can be. Make it simpler, if possible.
-
Ensure you use all capabilities of your database engine you need to make things as simple as possible. For example, as I previously commented on your question, the query from your question might be replaced by a view.
-
Study the queries refactored in the past in order to identify potential code duplication. This may be as obvious as finding the exact same query somewhere else, or may be more difficult if only a part of a query is duplicated (for example same query but with an additional
where
criteria). (Problem 2■)If duplication found, deduplicate.
-
Find a name for the query. (Problem 2■)
-
(Create a type which represents what the query returns or what it needs. The type should basically have one property per column in
select
. Since you’re using a dynamic language, you may not need to do this step.) -
Extract the query into a function and locate a class which should contain this function.
-
If the class doesn’t exist:
-
Find an explicit name. You may use a naming convention such as
...Repository
(ProductsRepository
,CustomersRepository
). -
Create the class.
-
Create the corresponding interface. (Problem 1■)
-
Create the default mock which implements the interface. This mock returns zero rows in data retrieval functions and does nothing in data altering functions.
-
-
Create one or several mocks and/or stubs relevant in the context of the business code which uses the SQL query. Create the corresponding unit tests. They should be red at this stage, since your business code doesn’t use the stubs/mocks yet. (Problem 1■)
-
Change the business code to use the newly created function instead of a direct query. Unit tests should be green now.
I wouldn’t care about design patterns here (well, that would be a repository pattern, but again, I don’t care). Once you get your 2 200 SQL queries out of your business code, you can start playing with ORMs, design patterns, etc. The main benefit is that you’ll have your business logic completely separated from data access, and also covered by unit tests. With data access layer (DAL) and business layer (BL) communicating through interfaces, you have more room for experimentation.
A practical example is that I can now create a different implementation of DAL which uses an ORM while my team continues working with the previous implementation. Then, moving from one to another should consist of a tiny change of configuration (if dependency injection is used).
More importantly, experimentation becomes easier at this step. You can start implementing an ORM, find it to be a mistake, and abandon this particular implementation without having to revert code base to previous revision.
1