My question is about “big data”. Basically, big data involves analysing large amounts of data to draw meaningful insights from it.
I would like to know:
Can large amounts of data be pre-processed? (For example, if you run a matching service for people, you take all the information you have on them and process it at a certain point for use later on.)
If pre-processing is possible, how would you normally go about doing this?
To help narrow the scope of my question, please look at this hypothetical scenario.
Say I have a customer database and my company is a global retailer that uses some type of points system to reward shoppers (for argument’s sake, the points are tallied up on an electronic card or a mobile app).

Based on my rewards system, I am now fully aware of exactly what a shopper is purchasing and when they normally make purchases of recurring items.

My database is growing all the time with this information, and I would now like to make recommendations (or send notifications) to shoppers about special offers on products they buy, or related products that may interest them, when they enter one of the stores.

Instead of processing all the accumulated data when a shopper enters the store, I would like to continually process the data stream as the data comes in (i.e. from previous shopping experiences), so that when it comes time to make a recommendation (the next time a shopper walks into the store), it is simply a matter of retrieving the recommendations and presenting the list to the shopper.

With this method in mind, I can easily space out my CPU-intensive tasks, instead of, say, processing all customer data on a busy day when foot traffic is at peak volumes.
By asking how I would do this, I mean the common methods available for achieving it. This could include any particular databases, programming techniques, or even specialized software that can carry out these timed calculations and “pre-process” the data at specific times, in order to balance out the CPU-intensive tasks.
You can consider the customer-recommendation scenario as the “situation”. It is the best example scenario I could think of that would explain why “pre-processing” (or calculating the recommendations at specific times) would make sense.
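To make the shape of what I’m after concrete, here is a minimal sketch in Python, with an in-memory SQLite database purely as a stand-in for the real customer database; the table names and the trivial “most-bought product” rule are just placeholders for whatever the real recommendation logic would be:

```python
import sqlite3
from collections import Counter, defaultdict

# In-memory SQLite purely as a stand-in for the real customer database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE purchases (customer_id INTEGER, product TEXT, bought_at TEXT);
    CREATE TABLE precomputed_recommendations (
        customer_id INTEGER PRIMARY KEY,
        recommended_product TEXT
    );
""")

def refresh_recommendations():
    """The CPU-intensive step, scheduled off-peak: recompute from the raw purchases."""
    counts = defaultdict(Counter)
    for customer_id, product in db.execute("SELECT customer_id, product FROM purchases"):
        counts[customer_id][product] += 1
    db.execute("DELETE FROM precomputed_recommendations")
    db.executemany(
        "INSERT INTO precomputed_recommendations VALUES (?, ?)",
        [(cid, c.most_common(1)[0][0]) for cid, c in counts.items()],
    )
    db.commit()

def recommendation_for(customer_id):
    """The cheap step when a shopper walks in: just a lookup, no computation."""
    row = db.execute(
        "SELECT recommended_product FROM precomputed_recommendations WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None

db.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (1, "coffee", "2024-01-01"), (1, "coffee", "2024-01-08"), (1, "milk", "2024-01-08"),
])
refresh_recommendations()      # scheduled / off-peak
print(recommendation_for(1))   # instant lookup -> "coffee"
```

The question is essentially what databases, tools, or techniques people use to do the `refresh_recommendations` part at scale.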
Typically I’ve heard of this being handled via the OLTP vs. OLAP model. Essentially the T in OLTP stands for “transactional”, so this is the typical database used for day-to-day operations. Then you write some kind of translational logic that transforms the OLTP database into an OLAP database (the A stands for analytical).
Basically you’re talking about the same data represented 2 different ways. The OLTP database focuses on normalization but the OLAP database is structured in more of a “star” pattern with a lot more data repetition. It’s read-only and optimized for querying.
Then the engineering is in figuring out how to do the translation from OLTP to OLAP, how often to do it, and if you can do it incrementally so the OLAP database isn’t too far behind “real-time”.
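As a rough illustration of that translation step, here is a hedged sketch in Python, with hypothetical table names and SQLite standing in for both databases; the key idea is a high-water-mark timestamp so each run only moves rows added since the previous run (the incremental part):

```python
import sqlite3

# SQLite stand-ins for the two databases; in reality these would be separate systems.
oltp = sqlite3.connect(":memory:")   # normalised, transactional, day-to-day writes
olap = sqlite3.connect(":memory:")   # denormalised "star"-style, read-mostly

oltp.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         product TEXT, amount REAL, created_at TEXT);
""")
olap.executescript("""
    -- One wide fact table: repetition is fine here, it is optimised for querying.
    CREATE TABLE fact_sales (order_id INTEGER PRIMARY KEY, region TEXT,
                             product TEXT, amount REAL, created_at TEXT);
    CREATE TABLE etl_state (last_loaded_at TEXT);
""")
olap.execute("INSERT INTO etl_state VALUES ('1970-01-01T00:00:00')")

def incremental_load():
    """Move only the rows created since the previous run (the high-water mark)."""
    (high_water,) = olap.execute("SELECT last_loaded_at FROM etl_state").fetchone()
    rows = oltp.execute("""
        SELECT o.id, c.region, o.product, o.amount, o.created_at
        FROM orders o JOIN customers c ON c.id = o.customer_id
        WHERE o.created_at > ?
        ORDER BY o.created_at
    """, (high_water,)).fetchall()
    if not rows:
        return
    olap.executemany("INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?, ?)", rows)
    olap.execute("UPDATE etl_state SET last_loaded_at = ?", (rows[-1][-1],))
    olap.commit()

# Schedule incremental_load() every few minutes so the OLAP side stays close to real-time.
```

How often you run that load, and how much work each run does, is exactly the trade-off between CPU load and how far behind “real-time” the analytical side is allowed to fall.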
In a past job, I was a DBA for a global solutions company where databases with millions and billions of rows were the norm.
As datasets got larger, it became more and more problematic to turn around complex queries in a timely manner.
Among many strategies we adopted, 4 spring to mind:
- Result sets for common queries were stored in what we called “strips”. These were basically index-organised tables that stored keys, to avoid repeating the joins in subsequent queries.
- Denormalising tables yielded huge benefits by reducing the number of joins.
- Tables were partitioned in line with common queries, e.g. by postcode/zip code (see the sketch after this list).
- Whilst all data was available in the repository, only fully formed and cleansed data was allowed through to the mart for querying.
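To illustrate the partitioning point, here is a hedged sketch using Postgres-style declarative partitioning; the table, the postcode bands, and the connection details are made up purely for illustration:

```python
import psycopg2  # assumes a reachable Postgres instance

# Connection details are placeholders.
conn = psycopg2.connect("dbname=warehouse user=dba")
cur = conn.cursor()

# Postgres-style declarative partitioning: split the table by postcode so queries
# filtered on postcode only have to scan the relevant partition.
cur.execute("""
    CREATE TABLE customer_sales (
        sale_id   bigint,
        postcode  text,
        product   text,
        amount    numeric,
        sold_at   timestamp
    ) PARTITION BY RANGE (postcode);
""")
# The bands here are made up purely for illustration.
cur.execute("""
    CREATE TABLE customer_sales_a_m PARTITION OF customer_sales
        FOR VALUES FROM ('A') TO ('N');
""")
cur.execute("""
    CREATE TABLE customer_sales_n_z PARTITION OF customer_sales
        FOR VALUES FROM ('N') TO (MAXVALUE);
""")
conn.commit()

# A query like this can now be answered from a single partition:
cur.execute("""
    SELECT product, SUM(amount)
    FROM customer_sales
    WHERE postcode >= 'AB' AND postcode < 'AC'
    GROUP BY product;
""")
```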
On top of this you can overlay pre-calculated segments. For example, rather than trying to pull, say, all blue-collar workers in the country, you can use segmentation to drill down only into those areas which are predominantly blue-collar.
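A minimal sketch of that segmentation idea (all names are made up): precompute each area’s segment shares off-peak, then only drill into areas where the segment dominates.

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, postcode_area TEXT, segment TEXT);
    -- Pre-calculated off-peak: the share of each segment within each area.
    CREATE TABLE area_segments (postcode_area TEXT, segment TEXT, share REAL);
""")

def precompute_segments():
    """The expensive pass over all customers, run during quiet hours."""
    totals, seg_counts = Counter(), Counter()
    for area, segment in db.execute("SELECT postcode_area, segment FROM customers"):
        totals[area] += 1
        seg_counts[(area, segment)] += 1
    db.execute("DELETE FROM area_segments")
    db.executemany("INSERT INTO area_segments VALUES (?, ?, ?)",
                   [(a, s, n / totals[a]) for (a, s), n in seg_counts.items()])
    db.commit()

def areas_dominated_by(segment, threshold=0.6):
    """Cheap lookup: drill down only into areas where the segment is predominant."""
    return [r[0] for r in db.execute(
        "SELECT postcode_area FROM area_segments WHERE segment = ? AND share >= ?",
        (segment, threshold))]
```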
EDIT (following Joe’s update)
In that case you might want a reporting mart, in addition to the mart and repository I described above, that is lean and mean and optimised for fast queries and MI reports.
Sure, that’s what an incremental map-reduce is for. Essentially you perform an operation on the collection that processes the existing documents and puts the results into a new collection, and as you add new documents you merge those into the derived collection.
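As a hedged sketch with pymongo (collection and field names are assumptions, and since MongoDB’s classic mapReduce is deprecated this uses the aggregation pipeline’s $merge stage, which gives the same incremental merge-into-a-derived-collection behaviour):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["retail"]

def merge_new_purchases(since: datetime):
    """Fold purchases added since the last run into the derived per-customer counts."""
    db.purchases.aggregate([
        {"$match": {"created_at": {"$gt": since}}},     # only the newly added documents
        {"$group": {"_id": {"customer": "$customer_id", "product": "$product"},
                    "count": {"$sum": 1}}},
        {"$merge": {
            "into": "customer_product_counts",          # the derived collection
            "on": "_id",
            # add the new counts onto whatever is already stored for that key
            "whenMatched": [{"$set": {"count": {"$add": ["$count", "$$new.count"]}}}],
            "whenNotMatched": "insert",
        }},
    ])

# Run periodically (e.g. every few minutes) with the timestamp of the previous run.
merge_new_purchases(datetime(2024, 1, 1, tzinfo=timezone.utc))
```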