My question is about “big data”. Basically, big data involves analysing large amounts of data to draw meaningful insights from it.
I would like to know:
Can large amounts of data be pre-processed? (For example, if you run a matching service for people, you take all the information you have on them and process it at a certain point for use later on.)
If pre-processing is possible, how would you normally go about doing this?
To help narrow the scope of my question, please look at this hypothetical scenario.
Say I have a customer database and my company is a global retailer that uses some type of points system to reward shoppers (for argument’s sake, the points are tallied up on an electronic card or a mobile app).

Based on my rewards system, I am now fully aware of exactly what a shopper is purchasing and when they normally make purchases of recurring items.

My database is growing all the time with this information, and I would now like to make recommendations (or send notifications) to shoppers about special offers on products they buy, or related products that may interest them, when they enter one of the stores.

Instead of processing all the accumulated data when a shopper enters the store, I would like to continually process the data stream as the data comes in (i.e. from previous shopping experiences), so that when it comes time to make a recommendation (the next time a shopper walks into the store), it is simply a matter of retrieving the recommendations and presenting the list to the shopper.

With this method in mind, I can easily space out my CPU-intensive tasks, instead of, say, processing all customer data on a busy day when foot traffic is at peak volumes.
By asking how I would do this, I mean the common methods available for achieving it. This could include any particular databases, programming techniques, or even specialized software that can carry out these timed calculations and “pre-process” the data at specific times, in order to balance out the CPU-intensive tasks.
You can consider the customer-recommendation scenario as the “situation”. It is the best example scenario I could think of that would explain why “pre-processing” (or calculating the recommendations at specific times) would make sense.
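To make the shape of what I’m after concrete, here is a minimal sketch in Python, with an in-memory SQLite database purely as a stand-in for the real customer database; the table names and the trivial “most-bought product” rule are just placeholders for whatever the real recommendation logic would be:

```python
import sqlite3
from collections import Counter, defaultdict

# In-memory SQLite purely as a stand-in for the real customer database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE purchases (customer_id INTEGER, product TEXT, bought_at TEXT);
    CREATE TABLE precomputed_recommendations (
        customer_id INTEGER PRIMARY KEY,
        recommended_product TEXT
    );
""")

def refresh_recommendations():
    """The CPU-intensive step, scheduled off-peak: recompute from the raw purchases."""
    counts = defaultdict(Counter)
    for customer_id, product in db.execute("SELECT customer_id, product FROM purchases"):
        counts[customer_id][product] += 1
    db.execute("DELETE FROM precomputed_recommendations")
    db.executemany(
        "INSERT INTO precomputed_recommendations VALUES (?, ?)",
        [(cid, c.most_common(1)[0][0]) for cid, c in counts.items()],
    )
    db.commit()

def recommendation_for(customer_id):
    """The cheap step when a shopper walks in: just a lookup, no computation."""
    row = db.execute(
        "SELECT recommended_product FROM precomputed_recommendations WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None

db.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (1, "coffee", "2024-01-01"), (1, "coffee", "2024-01-08"), (1, "milk", "2024-01-08"),
])
refresh_recommendations()      # scheduled / off-peak
print(recommendation_for(1))   # instant lookup -> "coffee"
```

The question is essentially what databases, tools, or techniques people use to do the `refresh_recommendations` part at scale.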
Typically I’ve heard of this being handled via the OLTP vs. OLAP model. Essentially the T in OLTP stands for “transactional”, so this is the typical database used for day-to-day operations. Then you write some kind of translational logic that transforms the OLTP database into an OLAP database (the A stands for analytical).
Basically you’re talking about the same data represented 2 different ways. The OLTP database focuses on normalization but the OLAP database is structured in more of a “star” pattern with a lot more data repetition. It’s read-only and optimized for querying.
Then the engineering is in figuring out how to do the translation from OLTP to OLAP, how often to do it, and if you can do it incrementally so the OLAP database isn’t too far behind “real-time”.
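As a rough illustration of that translation step, here is a hedged sketch in Python, with hypothetical table names and SQLite standing in for both databases; the key idea is a high-water-mark timestamp so each run only moves rows added since the previous run (the incremental part):

```python
import sqlite3

# SQLite stand-ins for the two databases; in reality these would be separate systems.
oltp = sqlite3.connect(":memory:")   # normalised, transactional, day-to-day writes
olap = sqlite3.connect(":memory:")   # denormalised "star"-style, read-mostly

oltp.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         product TEXT, amount REAL, created_at TEXT);
""")
olap.executescript("""
    -- One wide fact table: repetition is fine here, it is optimised for querying.
    CREATE TABLE fact_sales (order_id INTEGER PRIMARY KEY, region TEXT,
                             product TEXT, amount REAL, created_at TEXT);
    CREATE TABLE etl_state (last_loaded_at TEXT);
""")
olap.execute("INSERT INTO etl_state VALUES ('1970-01-01T00:00:00')")

def incremental_load():
    """Move only the rows created since the previous run (the high-water mark)."""
    (high_water,) = olap.execute("SELECT last_loaded_at FROM etl_state").fetchone()
    rows = oltp.execute("""
        SELECT o.id, c.region, o.product, o.amount, o.created_at
        FROM orders o JOIN customers c ON c.id = o.customer_id
        WHERE o.created_at > ?
        ORDER BY o.created_at
    """, (high_water,)).fetchall()
    if not rows:
        return
    olap.executemany("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?, ?)", rows)
    olap.execute("UPDATE etl_state SET last_loaded_at = ?", (rows[-1][-1],))
    olap.commit()

# Schedule incremental_load() every few minutes so the OLAP side stays close to real-time.
```

How often you run that load, and how much work each run does, is exactly the trade-off between CPU load and how far behind “real-time” the analytical side is allowed to fall.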
In a past job, I was a DBA for a global solutions company where databases with millions and billions of rows were the norm.
As datasets got larger, it became more and more problematic to turn around complex queries in a timely manner.
Among many strategies we adopted, 4 spring to mind:
- Result sets for common queries were stored in what we called “strips”. These were basically index-organised tables that stored keys, to avoid repeating the joins in subsequent queries.
- Denormalising tables yielded huge benefits by reducing the number of joins.
- Tables were partitioned in line with common queries, e.g. by postcode/zip code (see the sketch after this list).
- Whilst all data was available in the repository, only fully formed and cleansed data was allowed through to the mart for querying.
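To illustrate the partitioning point, here is a hedged sketch using Postgres-style declarative partitioning; the table, the postcode bands, and the connection details are made up purely for illustration:

```python
import psycopg2  # assumes a reachable Postgres instance

# Connection details are placeholders.
conn = psycopg2.connect("dbname=warehouse user=dba")
cur = conn.cursor()

# Postgres-style declarative partitioning: split the table by postcode so queries
# filtered on postcode only have to scan the relevant partition.
cur.execute("""
    CREATE TABLE customer_sales (
        sale_id   bigint,
        postcode  text,
        product   text,
        amount    numeric,
        sold_at   timestamp
    ) PARTITION BY RANGE (postcode);
""")
# The bands here are made up purely for illustration.
cur.execute("""
    CREATE TABLE customer_sales_a_m PARTITION OF customer_sales
        FOR VALUES FROM ('A') TO ('N');
""")
cur.execute("""
    CREATE TABLE customer_sales_n_z PARTITION OF customer_sales
        FOR VALUES FROM ('N') TO (MAXVALUE);
""")
conn.commit()

# A query like this can now be answered from a single partition:
cur.execute("""
    SELECT product, SUM(amount)
    FROM customer_sales
    WHERE postcode >= 'AB' AND postcode < 'AC'
    GROUP BY product;
""")
```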
On top of this you can overlay pre-calculated segments. For example, rather than trying to pull, say, all blue-collar workers in the country, you can use segmentation to drill down only into those areas which are predominantly blue-collar.
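A minimal sketch of that segmentation idea (all names are made up): precompute each area’s segment shares off-peak, then only drill into areas where the segment dominates.

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, postcode_area TEXT, segment TEXT);
    -- Pre-calculated off-peak: the share of each segment within each area.
    CREATE TABLE area_segments (postcode_area TEXT, segment TEXT, share REAL);
""")

def precompute_segments():
    """The expensive pass over all customers, run during quiet hours."""
    totals, seg_counts = Counter(), Counter()
    for area, segment in db.execute("SELECT postcode_area, segment FROM customers"):
        totals[area] += 1
        seg_counts[(area, segment)] += 1
    db.execute("DELETE FROM area_segments")
    db.executemany("INSERT INTO area_segments VALUES (?, ?, ?)",
                   [(a, s, n / totals[a]) for (a, s), n in seg_counts.items()])
    db.commit()

def areas_dominated_by(segment, threshold=0.6):
    """Cheap lookup: drill down only into areas where the segment is predominant."""
    return [r[0] for r in db.execute(
        "SELECT postcode_area FROM area_segments WHERE segment = ? AND share >= ?",
        (segment, threshold))]
```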
EDIT (following Joe’s update)
In that case you might want a reporting mart, in addition to the mart and repository I described above, that is lean and mean and optimised for fast queries and MI reports.
Sure, that’s what an incremental map-reduce is for. Essentially you perform an operation on the collection that processes the existing documents and puts the results into a new collection, and as you add new documents you merge those into the derived collection.
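As a hedged sketch with pymongo (collection and field names are assumptions, and since MongoDB’s classic mapReduce is deprecated this uses the aggregation pipeline’s $merge stage, which gives the same incremental merge-into-a-derived-collection behaviour):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["retail"]

def merge_new_purchases(since: datetime):
    """Fold purchases added since the last run into the derived per-customer counts."""
    db.purchases.aggregate([
        {"$match": {"created_at": {"$gt": since}}},     # only the newly added documents
        {"$group": {"_id": {"customer": "$customer_id", "product": "$product"},
                    "count": {"$sum": 1}}},
        {"$merge": {
            "into": "customer_product_counts",          # the derived collection
            "on": "_id",
            # add the new counts onto whatever is already stored for that key
            "whenMatched": [{"$set": {"count": {"$add": ["$count", "$$new.count"]}}}],
            "whenNotMatched": "insert",
        }},
    ])

# Run periodically (e.g. every few minutes) with the timestamp of the previous run.
merge_new_purchases(datetime(2024, 1, 1, tzinfo=timezone.utc))
```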