I am new to databases and trying to understand the basic concepts. I have learned how to delete data in a database. But one of my friends told me that you should never delete data in a database. Rather, when its no longer needed, it’s better to simply mark it or flag it as ‘not in use’.
Is that true? If so, how would a big company like IBM handle their data for a hundred or more years?
11
As with all these things the answer is “it depends”.
If the user is ever likely to want the data back then your friends are right – you don’t really delete just mark the record as “deleted”. This way when the user changes their mind you can recover the data.
However, if the deleted data is more than a certain time period old (a year for example) you might decide to really delete it from the live tables but keep it in either an archive table or even just on back up should the user ever want it back. In this way you can keep the amount of data (live and recently deleted) to a minimum.
However, if the data is ephemeral or easily recreated you may well decide to actually delete the data.
There is one class of data that you have to delete – and that’s personal data that the user doesn’t want you to hold any more. There may be local laws (e.g. in the EU) that makes this a mandatory requirement (thanks Gavin)
Equally there may be rules that require you not to delete data, so before deciding anything check with any regulatory authorities on what you need to do to comply with the law.
3
This is actually a significant problem for a lot of companies. There’s no way to cleanly determine what data is actually in use, so it just sits in the database. Data deletion and archiving needs to be a part of every large system design, but it rarely is. Most companies just live with it, buying bigger disks and tweaking their queries and indexes to maintain performance, until they change systems and then they go through a significant amount of effort to identify current data and then only migrate those records to their new system.
Yes, you should delete data from your database, but it’s often not simple to tell what and when.
2
There have already been plenty of good answers to this that pretty much boil down to “Depends on the circumstance”, and I can’t add anything to those.
One thing that hasn’t been mentioned, however, that I think needs to be mentioned, is that you should never ever reuse primary keys that have been generated by a sequence or an AUTO_INCREMENT system.
When you delete an item that had been assigned a primary key by such a system there will be gaps in the primary key column, left by the deleted data. There is a great temptation to reassign those gaps to new items as they’re added, or even worse, to shuffle the existing data to give it a new ID to remove the gaps, but doing so will give rise to issues that you’d never have to deal with if you just left the keys alone.
Say you’re keeping a database of printers for managing reordering consumables. Printer 13, an old laser printer, breaks down beyond economic repair so you throw it out. Meanwhile, for an unrelated reason, someone orders a new thermal printer for doing barcode printing in the warehouse, and that printer happens to arrive before the replacement for printer 13. The administrator logs that new printer into the database and, because 13 is now free and you’re recycling IDs, the new thermal printer gets allocated 13 as its ID.
Now someone tells you that printer 13 is almost out of ink. You remember that printer 13 is a laser printer so you don’t bother looking it up in the database, and you place an order for a toner cartridge. Only you actually needed to order a thermal ink pack because printer 13 isn’t a laser printer anymore. When the toner cartridge arrives you can’t use it because it’s the wrong ink refill for the printer, you can’t print out any more bar codes and you can’t ship any orders waiting to be dispatched.
Even worse, what happens if you delete printer 13 and shuffle all the printers that come after it down to fill the gap? Printer 14 (some decrepit old dot matrix) becomes printer 13, printer 15 becomes printer 14 and so on.
All the printers have labels on them so they can be cross-referenced with the database, but now all the labels are out of date. You’ll have to go round, locate every printer in the business (which could run into hundreds!) and relabel them. That’s hardly an effective use of time. And it’s also an error-prone process, and what happens if it just never gets done? Someone calls in to say printer 14 has broken down and needs fixing urgently, so you look it up and find that printer 14 is an inkjet printer in Reception. Only because you’ve shuffled the IDs around, it’s actually the dot matrix printer that needs fixing urgently. The guy who called in the problem is left hanging, whilst the receptionist has a tech support guy she never called for turn up to fix a printer which wasn’t broken.
You should think of IDs assigned by an auto-increment system as permanent, they’re immutable and can’t be reused, even if the thing that the ID refers to ceases to exist. Some people claim that they don’t want to have to worry about IDs running out, but even with 32 bit systems and signed IDs, there’s still 2 billion or so IDs available. If you can make the ID column unsigned then this doubles to 4 billion, and on 64 bit systems the number of available IDs is literally greater than the number of stars in the sky. You’re not going to run out of IDs.
4
I generally don’t remove user data in my databases. I flag them to be hidden. All too often a user deletes something accidentally and needs it replaced easily. It also helps retain referential integrity for related data. This works for small to moderate size databases. In systems where performance is heavily impacted by this decision it’s handled in special ways, e.g. archive tables, automated backups, etc.
We do discard backend data as necessary, e.g. expired web site session data and old log infomation. There’s no point at all in keeping them forever.
As usual, though, the exact answer really depends on the specific situation.
Lots of good answers here already. I just want to add one situation that no one has mentioned yet:
Sensitive data. If the user deletes it, then you better actually delete it!
One very common situation that comes to mind is change/reset password. You wouldn’t want to store old passwords (even though they’re hashed, salted etc) in your database. Users might be using their old (and bad) passwords on other sites.
Also, when it comes to laws regarding how long you are allowed to store certain types of data then of course soft deletes wont do. You have to actually delete it.
So I would ask myself: will the user (or someone else, the government for example) be mad if I make them believe the data has been deleted, but in fact I’ve still got it and can restore it at any time?
5
In the majority of cases you should keep data just in case it is needed in the future. The business you work for may want to look at the historical data to base their decisions on which will steer the comapany in a certain direction.
You should add ‘Date_Time_Removed’ columns to each table and then instead of physically deleting the row(s) you set a date and time that the row has been virtually deleted. Then in your stored procedures or sql you would factor in the ‘Date_Time_Removed’ column e.g.
select blah
from table1
where date_time_removed is null
Of course rows that have been accidently added to a database should be removed permanently, especially test data.
By keeping all legit data you also have to option of using your database for warehousing in the future.
Another situation than others presented is when data is deleted, but logs of operations done in the database (deletion included) are stored in archives for a long period of time. The main scope of this is implementing a rollback system to past dates, but it can also be used to store in some way deleted data (which is deleted from database, but stored in archives).
Storing archives of deleted data wouldn’t be such a big deals. Big companies may also store versions of code and many more information (not to talk about non-technical related stuff) so in the end storing large data is something usual for them.
I’ve been working on an Foreign exchange application for a couple of years where this came up. The data that the application collected over the years had an impact on performance (say exponentional).
After we’ve done what we could in terms of code we proposed to the managegement to archive data older than a year. They verified the concept (legal issues) and luckily we were able to do it. So we deleted but we also archived the data so business could still run their reports etc.