I’m working on a data migration of several hundred nodes from a Drupal 6 site to a Drupal 7 site. I’ve got the data moved over to the new site and I want to check it. Harking back to my statistics classes, I recall that there is some way to work out how many randomly chosen nodes I need to check to have a given level of confidence that the whole process was correct. Can anyone enlighten me as to this practical application of statistics? For a given population size, how big must the sample be to reach a given confidence level and margin of error?
I found this sample size calculator. For my population of 215 items, if I want a 95% confidence level with a ±5% margin of error (the calculator calls this the "confidence interval"), I’ll need to randomly sample 138 items.
Edit: Here’s the actual formula that I was looking for.
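The linked formula is not reproduced here, so the following is only a sketch of one common approach, assuming Cochran's sample size formula with a finite population correction and the worst-case proportion of 0.5. Under those assumptions it reproduces the 138 quoted above. The function names `sample_size` and `pick_nodes` are made up for illustration.

```python
import random
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Estimate how many items to sample from a finite population.

    Assumes Cochran's formula with a finite population correction;
    proportion=0.5 is the most conservative (largest sample) choice.
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Shrink it for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    # Rounded to the nearest integer to match the quoted figure;
    # rounding up instead would be slightly more conservative.
    return round(n)


def pick_nodes(node_ids, k, seed=None):
    """Draw k distinct node IDs uniformly at random, without replacement."""
    rng = random.Random(seed)  # pass a seed to make the draw repeatable
    return sorted(rng.sample(list(node_ids), k))


n = sample_size(215)
to_check = pick_nodes(range(1, 216), n, seed=1)  # hypothetical node IDs 1..215
```

Note how little the correction saves at this scale: checking 138 of 215 nodes is already about two thirds of the work of checking all of them.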
Computers are really good at repetitive tasks and comparisons. It wouldn’t take long to write a small app that verifies all of the data, and unless the amount of data you transferred is huge, it could run overnight or over the weekend and give you 100% certainty. Statistics rely on drawing a truly random sample from the entire population; if the sample isn’t valid, the confidence figure is worthless, and both humans and computers are really bad at true randomness. If something does come up, it’s a lot better to be 100% certain than 95%.
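A minimal sketch of what such a check could look like, assuming you have already loaded the nodes from both sites into dictionaries keyed by node ID (for example from database queries or exports). The function name `diff_nodes`, the field names, and the data shape are all hypothetical, not part of any Drupal API.

```python
def diff_nodes(source, target, fields):
    """Compare every node in source against target, field by field.

    source and target map node ID -> dict of field values.
    Returns a list of (nid, what, old_value, new_value) tuples.
    """
    problems = []
    for nid, old in sorted(source.items()):
        new = target.get(nid)
        if new is None:
            # Node never arrived on the new site.
            problems.append((nid, "missing", None, None))
            continue
        for field in fields:
            if old.get(field) != new.get(field):
                problems.append((nid, field, old.get(field), new.get(field)))
    # Nodes present on the new site that have no counterpart on the old one.
    for nid in sorted(target.keys() - source.keys()):
        problems.append((nid, "unexpected", None, None))
    return problems


# Hypothetical data standing in for the Drupal 6 and Drupal 7 exports.
d6 = {
    1: {"title": "Home", "body": "Welcome"},
    2: {"title": "About", "body": "Who we are"},
    3: {"title": "Contact", "body": "Email us"},
}
d7 = {
    1: {"title": "Home", "body": "Welcome"},
    2: {"title": "About", "body": "Who we were"},
    4: {"title": "Extra", "body": ""},
}

for problem in diff_nodes(d6, d7, ["title", "body"]):
    print(problem)
```

An empty result means every compared field matched on every node, which is the 100% check described above, limited to the fields you chose to compare.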