Best practice or design patterns for retrieval of data for reporting and dashboards in a domain-rich application

First, I want to say this seems to be a neglected question/area, so if this question needs improvement, help me make this a great question that can benefit others! I’m looking for advice and help from people who have implemented solutions that solve this problem, not just ideas to try.

In my experience, there are two sides of an application – the “task” side, which is largely domain driven and is where the users interact richly with the domain model (the “engine” of the application) and the reporting side, where users get data based on what happens on the task side.

On the task side, it’s clear that an application with a rich domain model should have business logic in the domain model and the database should be used mostly for persistence. Separation of concerns: every book covers it, we know what to do, awesome.

What about the reporting side? Are data warehouses acceptable, or are they bad design because they incorporate business logic in the database and the very data itself? In order to aggregate the data from the database into data warehouse data, you must have applied business logic and rules to the data, and that logic and those rules didn’t come from your domain model; they came from your data-aggregation processes. Is that wrong?

I work on large financial and project management applications where the business logic is extensive. When reporting on this data, I often have a LOT of aggregations to do to pull the information required for the report/dashboard, and the aggregations have a lot of business logic in them. For performance’s sake, I have been doing it with highly aggregated tables and stored procedures.

As an example, let’s say a report/dashboard is needed to show a list of active projects (imagine 10,000 projects). Each project will need a set of metrics shown with it, for example:

  1. total budget
  2. effort to date
  3. burn rate
  4. budget exhaustion date at current burn rate
  5. etc.

Each of these involves a lot of business logic. And I’m not just talking about multiplying numbers or some simple logic. I’m talking about logic like this: in order to get the budget, you have to apply a rate sheet with 500 different rates, one for each employee’s time (on some projects, others have a multiplier), apply expenses and any appropriate markup, and so on. The logic is extensive. It took a lot of aggregating and query tuning to get this data in a reasonable amount of time for the client.

Should this be run through the domain first? What about performance? Even with straight SQL queries, I’m barely getting this data to the client fast enough to display in a reasonable amount of time. I can’t imagine getting it there fast enough if I were rehydrating all these domain objects and mixing, matching, and aggregating their data in the application layer.

It seems in these cases that SQL is good at crunching data, so why not use it? But then you have business logic outside your domain model, and any change to the business logic will have to be made in both your domain model and your reporting aggregation schemes.

I’m really at a loss for how to design the reporting/dashboard part of any application with respect to domain driven design and good practices.

I added the MVC tag because MVC is the design flavor du jour and I am using it in my current design, but can’t figure out how the reporting data fits into this type of application.

I’m looking for any help in this area – books, design patterns, key words to google, articles, anything. I can’t find any information on this topic.

EDIT AND ANOTHER EXAMPLE

Another perfect example I ran across today: the customer wants a report for their sales team. They want what seems like a simple metric:

For each Sales person, what is their annual sales to date?

But that is complicated. Each salesperson participated in multiple sales opportunities. Some they won, some they didn’t. In each sales opportunity, there are multiple salespeople who are each allocated a percentage of credit for the sale per their role and participation. So now imagine going through the domain for this… the amount of object rehydration you would have to do to pull this data from the database for every salesperson:

Get all the SalesPeople->
For each get their SalesOpportunities->
For each get their percentage of the sale and calculate their Sales amount
then Add up all their SalesOpportunity Sales amount.

And that’s ONE metric. Or you can write a SQL query that does it quickly and efficiently, and tune it to be fast.
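
To make that concrete, here is a hedged sketch of what such a query could look like (the SalesPerson, SalesCredit and SalesOpportunity tables and their columns are invented for illustration, not taken from any real schema):

-- Hypothetical schema: SalesCredit has one row per salesperson per opportunity,
-- holding that person's percentage of credit for the sale.
SELECT sp.SalesPersonId,
       sp.Name,
       SUM(so.SaleAmount * sc.CreditPercent / 100.0) AS AnnualSalesToDate
FROM SalesPerson sp
JOIN SalesCredit sc      ON sc.SalesPersonId = sp.SalesPersonId
JOIN SalesOpportunity so ON so.OpportunityId = sc.OpportunityId
WHERE so.IsWon = 1
  AND so.CloseDate >= DATEFROMPARTS(YEAR(GETDATE()), 1, 1)  -- start of the current year (SQL Server)
GROUP BY sp.SalesPersonId, sp.Name;

The whole object traversal collapses into one set-based pass, which can then be tuned with indexes.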

EDIT 2 – CQRS Pattern

I’ve read about the CQRS pattern and, while it’s intriguing, even Martin Fowler says it’s not tested. So how has this problem been solved in the past? It has to have been faced by everyone at some point or another. What is an established or well-worn approach with a track record of success?

Edit 3 – Reporting Systems/Tools

Another thing to consider in this context is reporting tools. Reporting Services, Crystal Reports, Analysis Services, Cognos, etc. all expect data from SQL/a database. I doubt your data will come through your business layer for these. And yet they and others like them are a vital part of the reporting in a lot of large systems.
How is the data for these properly handled when there is business logic in the data source for these systems, and possibly in the reports themselves?


This is a very glib answer, but getting right to the heart of the matter:

In terms of DDD, maybe think of reporting as a bounded context? Rather than thinking in terms of “THE” domain model, be willing to accept that it’s okay to have more than one model. So yes, it’s okay if the reporting domain has reporting business logic in it, just as it’s okay for the transactional domain to have transactional business logic in it.

As to the question of, say, SQL stored procedures vs. a domain model in application code, the same pros and cons apply to the reporting system as to the transactional system.

Since I see that you added a bounty to the question, I read it again and noticed that you are asking for specific resources on this, so I’ll start by suggesting that you look at other Stack Overflow questions on the matter; I found this one: https://stackoverflow.com/questions/11554231/how-does-domain-driven-design-handle-reporting

The general gist of that one is to use CQRS as a pattern for your system, which is consistent with DDD, and rely on the query side responsibilities as a way to get reporting done, but I’m not sure that is a helpful answer in your case.

I also found this: http://www.martinfowler.com/bliki/ReportingDatabase.html, which is linked from here: http://groups.yahoo.com/neo/groups/domaindrivendesign/conversations/topics/2261

Here’s an interesting article from ACM on the matter: http://dl.acm.org/citation.cfm?id=2064685 but it’s behind a paywall so I can’t actually read it (not an ACM member 🙁 ).

There’s also this answer to a similar question: https://stackoverflow.com/questions/3380431/cqrs-ddd-synching-reporting-database

and this one: http://snape.me/2013/05/03/applying-domain-driven-design-to-data-warehouses/

Hope this helps!


My understanding from your question is this:
The application for daily tasks has

View >> Controller >> Model (BL) >> Database (Data)

The application for reporting purposes has

View >> Controller >> Model >> Database (Data + BL)

So a change in BL for the ‘task application’ will lead to a change in the ‘reporting’ BL too. That is your real problem, right? Well, it’s OK to make the change twice; that pain you have to take anyway. The reason is that the two BLs are separated by their respective concerns: one is for fetching data and one is for aggregating data. Also, your original BL and your aggregating BL will be written in different technologies or languages (C#/Java and SQL procs). There is no escape from it.

Let’s take another example not related to reporting specifically. Suppose a company XXX tracks the mail of all users for interpretation and sells that information to marketing companies. It will have one BL for interpretation and one BL for aggregating data for the marketing companies. The concerns are different for the two BLs. If tomorrow their BL changes such that mail coming from Cuba should be ignored, then the business logic will have to be changed on both sides.


Reporting is a bounded context, or a subdomain, to speak loosely. It solves a business need of gathering/aggregating data and processing it to gain business intelligence.

How you implement this subdomain will likely be a balance between the (most) architecturally correct way you can do this, and what your infrastructure will allow. I like to start on the former side, and move towards the latter only as necessary.

You can probably split this up into two primary problems that you are solving:

  1. Aggregating, or warehousing data. This should process some data source, and combine the information in such a way that it is stored in another data source.

  2. Querying the aggregated datasource to provide business intelligence.

Neither of those problems reference any specific database or storage engine. Your domain layer should just be dealing with interfaces, implemented in your infrastructure layer by various storage adapters.

You may have various workers or some scheduled job running, split into a few moving parts:

  • Something to query
  • Something to aggregate
  • Something to store

Hopefully you can see some of the CQRS shine through there.

On the reporting side, you should only need to do querying, but never directly against the database. Go through your interfaces and through your domain layer here. This is not the same problem domain as your primary tasks, but there should still be some logic here that you want to adhere to.

As soon as you dive directly into the database, you depend on it more, and it may eventually interfere with your original application’s data needs.

Also, at least for me, I definitely prefer writing tests and developing in code rather than queries or stored procedures. I also like not locking myself into specific tools until absolutely necessary.

It’s 4 years later and I just found this question again, and I have what is, to me, the answer.

Depending on your application and its specific needs, your domain/transaction database and your reporting can be separate “systems” or “engines”, or they can be serviced by one system. They should, however, be logically separate, which means they use different means of retrieving and providing data to the UI.

I prefer them to be physically separate (in addition to being logically separate), but a lot of times you start them out together (physically) and then, as the application matures, you separate them.

Either way, again, they should be logically different. It’s OK to duplicate business logic in the reporting system. The important thing is that the reporting system gets to the same answer as the domain system, though chances are it will get there via different means. For example, your domain system will likely have a ton of very strict business rules implemented in procedural code. The reporting system could implement those same rules when it reads the data, but would do so via set-based code (e.g. SQL).
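
As a hedged illustration of what “the same rule, set-based” might look like (all table names are invented): suppose the rule is “each hour is billed at the employee’s rate from the rate sheet, times a project-specific multiplier where one exists”. The domain code applies it entry by entry; the reporting side applies it to the whole set at once:

-- Sketch only: TimeEntry, RateSheet and ProjectMultiplier are hypothetical tables.
SELECT te.ProjectId,
       SUM(te.Hours * rs.HourlyRate * COALESCE(pm.Multiplier, 1.0)) AS EffortToDate
FROM TimeEntry te
JOIN RateSheet rs              ON rs.EmployeeId = te.EmployeeId
LEFT JOIN ProjectMultiplier pm ON pm.ProjectId  = te.ProjectId
GROUP BY te.ProjectId;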

Here’s what an evolution of your application’s architecture might realistically look like as it evolves:

Level 1 – Logically separated domain and reporting systems, but still in the same codebase and database


Level 2 – Logically separated domain and reporting systems, but separate databases now, with syncing.


Level 3 – Logically and physically separated domain and reporting systems, and separate databases with syncing.


The main idea is that reporting and the domain have radically different needs: different data profiles (read-heavy vs. write/update-heavy), different performance requirements, etc. So they need to be implemented differently, and that necessitates some duplication of business logic.

It’s up to your business to devise a way to keep the business logic of the domain and reporting systems up to date with each other.

First, some terminology: what you call the task side is known as transactional, and the reporting side is analytics.

You’ve already mentioned CQRS, which is a great approach, but there is little documented practical application of it.

What has been heavily tested is supplementing your transactional processing with an analytical processing engine. This is sometimes referred to as data warehousing or data cubes. The biggest gotcha with regard to analytics is that attempting to run queries against your transactional data in real time is inefficient at best, because it’s really only possible to optimize a database for reading or for writing. For transactions, you want high write speeds to avoid delays in processing. For reporting, you want high read speeds so decisions can be made.

How do you account for these issues? The simplest approach to understand is to use a flattened schema for your reporting and ETL (extract, transform, load) to shuttle data from the normalized transactional schema to the denormalized analytical schema. The ETL runs via an agent on a regular basis and preloads the analytics tables so they are ready for quick reads by your reporting engine.
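
A minimal sketch of that ETL step, assuming a hypothetical normalized source (Project, Budget, TimeEntry, RateSheet) and a denormalized dw.ProjectFact target that the agent reloads on a schedule:

-- Run by the ETL agent (e.g. a nightly SQL Agent job); all names are illustrative.
TRUNCATE TABLE dw.ProjectFact;

INSERT INTO dw.ProjectFact (ProjectId, ProjectName, TotalBudget, EffortToDate, LoadedAt)
SELECT p.ProjectId,
       p.Name,
       b.TotalBudget,
       SUM(te.Hours * rs.HourlyRate),
       GETDATE()
FROM Project p
JOIN Budget b     ON b.ProjectId  = p.ProjectId
JOIN TimeEntry te ON te.ProjectId = p.ProjectId
JOIN RateSheet rs ON rs.EmployeeId = te.EmployeeId
GROUP BY p.ProjectId, p.Name, b.TotalBudget;

The reporting engine then reads dw.ProjectFact directly, with no joins or business rules evaluated at query time.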

A great book to get up to speed on data warehousing is The Data Warehouse Toolkit by Ralph Kimball. For a more hands-on approach, download the trial of SQL Server and pick up The Microsoft Data Warehouse Toolkit, which takes the general discussion of the first book and shows how to apply the concepts using SQL Server.

There are several linked books from those pages that go into more details about ETL, Star Schema Design, BI, Dashboards and other topics to help you go along.

The quickest way to get from where you are to where you want to be is to hire a BI expert and shadow him while he implements what you need.


It’s typical to separate operational/transactional data stores from reporting. The latter may have requirements to keep data for legal reasons (e.g. seven years of financial data for financial auditing), and you don’t want all that in your transactional data store.

So you’ll partition your transactional data by some time measure (weekly, monthly, quarterly, yearly) and move older partitions out into your reporting/history data store via ETL. It may or may not be a data warehouse with a star schema and dimensions. You’d use data warehousing reporting tools to do ad hoc queries and roll ups and batch jobs to generate periodic reports.
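
A hedged sketch of that movement for a simple date cut-off (the Live and History databases and the Transactions table are invented; with genuinely partitioned tables you would switch partitions out rather than delete row by row):

-- Copy rows older than the retention window into the history/reporting store,
-- then remove them from the transactional store.
INSERT INTO History.dbo.Transactions
SELECT *
FROM Live.dbo.Transactions
WHERE TransactionDate < DATEADD(MONTH, -3, GETDATE());

DELETE FROM Live.dbo.Transactions
WHERE TransactionDate < DATEADD(MONTH, -3, GETDATE());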

I would not recommend reporting against your transactional data store.

If you prefer to press on, here are more thoughts:

  1. “Best” is subjective; go with what works.
  2. I’d buy a reporting product rather than write them myself.
  3. If you’re using a relational database, then SQL is the only game in town.
  4. Stored procedures depend on whether you have the skills to write them.

Project management software that you use in-house? I’d buy before I’d build; something like Rally or Microsoft Project.


What about the reporting side? Are data warehouses acceptable, or are they bad design because they incorporate business logic in the database and the very data itself?

I don’t think you are talking about business logic here; this is more like reporting logic. What do users do with the information on this screen? Is it simply for status updates? Your domain model is used to model transactional operations; reporting is a different concern. Pulling the data from SQL Server or putting it into a data warehouse is fine for reporting scenarios.

Your domain model should enforce the invariants of your domain, such as: a project member can’t book to the same project at the same time, or can only book x number of hours a week, or can’t book to this project because it is complete. The state of your domain model (the data) can be copied for reporting separately.

To improve query performance you can use a materialised view. When an operation is committed against your model (e.g. “book 4 hours of this person’s time to project X”) and it succeeds, it can raise an event that you then store in a reporting database, making the necessary calculations for your report as you go. Querying that data will then be very fast.
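
On SQL Server, a materialised view is an indexed view; a minimal sketch, assuming a hypothetical dbo.TimeBooking table (with a non-nullable Hours column) written by the domain model:

CREATE VIEW dbo.ProjectHoursSummary
WITH SCHEMABINDING            -- required for an indexed view
AS
SELECT ProjectId,
       SUM(Hours)   AS TotalHours,
       COUNT_BIG(*) AS BookingCount   -- COUNT_BIG(*) is required when grouping
FROM dbo.TimeBooking
GROUP BY ProjectId;
GO

-- The unique clustered index is what materialises the view; SQL Server then
-- maintains the aggregate automatically as bookings are committed.
CREATE UNIQUE CLUSTERED INDEX IX_ProjectHoursSummary
    ON dbo.ProjectHoursSummary (ProjectId);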

Keep your transactional and reporting contexts separate; a relational database was built for reporting, a domain model was not.

EDIT

Useful blog post on the subject http://se-thinking.blogspot.se/2012/08/how-to-handle-reporting-with-domain.html

report/dashboard is needed to show a list of active projects

The state of each project should be stored as static, pre-calculated, and well-formatted information in the database, and any simulations should be handled on the client in the web app.

budget exhaustion date at current burn rate

This type of projection should not be run on demand. Producing this information upon request (performing calculations on resources, rates, tasks, milestones, etc.) will result in extensive use of the calculation layer without any reuse of those results for future calls.

In a distributed environment (private or public cloud), you will incur enormous costs in the computation layer, low utilization of the database, and a total lack of caching.

Should this be run through the domain first? What about performance?

The design of your software should include the ability to perform the calculations necessary to obtain the required results during data input, not during reading. This approach greatly reduces the use of computing resources and, above all, creates tables that can be treated as read-only by the client. This is the first step toward a solid and simple caching mechanism.
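
At the storage level, that might look like a read-only status table that is upserted whenever a project’s inputs change; a sketch with hypothetical table and parameter names, where @TotalBudget, @EffortToDate and @BurnRate are values already calculated at data-input time:

MERGE dbo.ProjectStatus AS target
USING (SELECT @ProjectId AS ProjectId) AS source
    ON target.ProjectId = source.ProjectId
WHEN MATCHED THEN
    UPDATE SET TotalBudget  = @TotalBudget,
               EffortToDate = @EffortToDate,
               BurnRate     = @BurnRate,
               UpdatedAt    = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (ProjectId, TotalBudget, EffortToDate, BurnRate, UpdatedAt)
    VALUES (@ProjectId, @TotalBudget, @EffortToDate, @BurnRate, GETDATE());

Dashboard reads then become simple SELECTs against dbo.ProjectStatus, which is also easy to put behind a cache.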

So one thing to research, before completing the software architecture, could be a distributed cache system.

(request:aggregation) != 1:1

My advice, therefore (for both the first and second example), is to try to understand when it is proper to pre-calculate the data, with the objective of reducing the aggregations performed per client request. The ratio cannot be 1:1 (request:aggregation) if one goal is a sustainable system.

Distribute the calculation to the client

Another question to ask before finishing the design of the software could be: how much of the calculation do we want to delegate to the client’s browser?

This is the MV* family. True, it is fashionable today, but beyond that, one of its purposes is to create web apps (single-page apps), which can be considered the present of many complex applications (and, fortunately for the bills we pay to the cloud provider, these execute on the client).

My conclusion is therefore to:

  1. Understand how many operations are really necessary to present the data;

  2. Analyze how many of them can be done in the background (and then distributed through a cache system, after they are pre-calculated);

  3. Understand how many operations can be executed on the client: get the configuration of the projects, run it in the views of the web app, and thus reduce the calculation performed in the back end.


Use a cache for queries; use the domain to build the cache.

There is a feature called “top users” on Stack Overflow. You may find a line at the bottom of the top users page that says “Only non community-wiki questions and answers are included in these totals (updated daily)”. This indicates the data is cached.

But why?

Maybe for performance. Maybe they also have the same concern about leaking domain logic (“only non community-wiki questions and answers are included in these totals”, in this case).

How?

I don’t really know how they did this, so here is just a guess 🙂

First, we need to find the target questions/answers. A scheduled task could work; just fetch all potential targets.

Second, let’s look at a single question/answer. Is it a non community-wiki one? Is it within 30 days? These are quite easy to answer with domain models. Count the votes and cache the result if the criteria are satisfied.

Now we have the cache, and its entries are the output of domain derivations. The query is fast and easy because only simple criteria need to be applied.

What if the results need to be more “real time”?

Events may help. Instead of triggering the caching with a scheduled task, we can split the process into many sub-processes. For example, when someone votes on hippoom’s answer, we publish an event that triggers an update of hippoom’s entry in the top users cache. In this case, we see frequent, quick, small tasks.

Is CQRS necessary?

Not with the scheduled task approach, nor with the events approach. But CQRS does have an advantage: the cache is usually highly display-oriented, so if some items are not required at first, we may not calculate and cache them at all. CQRS with event sourcing also helps reconstitute the cache for historical data by replaying events.

Some related questions:
1. https://stackoverflow.com/questions/21152958/how-to-handle-summary-report-in-cqrs
2. https://stackoverflow.com/questions/19414951/how-to-use-rich-domain-with-massive-operations/19416703#19416703

Hope it helps 🙂

Retrieving large amounts of information over wide area networks, including the Internet, is problematic due to issues arising from latency of response, lack of direct memory access to data serving resources, and fault tolerance.

This answer describes a design pattern for solving the issues of handling results from queries that return large amounts of data. Typically these queries would be made by a client process across a wide area network (or the Internet), with one or more middle tiers, to a relational database residing on a remote server.

The solution involves implementing a combination of data retrieval strategies, including the use of iterators for traversing data sets and providing an appropriate level of abstraction to the client, double-buffering of data subsets, multi-threaded data retrieval, and query slicing.
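
Of those strategies, query slicing is the simplest to show at the SQL level; a hedged sketch using keyset paging over an invented dw.ProjectFact table, where the client-side iterator remembers the last key it saw and asks for the next slice (successive slices can be fetched on a background thread to double-buffer ahead of the consumer):

-- @LastProjectId is the highest key returned by the previous slice (0 to start).
SELECT TOP (500) ProjectId, ProjectName, TotalBudget, EffortToDate
FROM dw.ProjectFact
WHERE ProjectId > @LastProjectId
ORDER BY ProjectId;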


Disclaimer:
I’m quite inexperienced in applications with domain models.
I understand all the concepts, and I’ve already been thinking for a long time about how to apply these concepts to the applications I’m working on (which ARE domain-rich, but lack OO, actual domain models etc.).
This question is one of the key problems that I faced as well. I have an idea of how to solve this, but as I just said… it’s just an idea that I came up with.
I haven’t implemented it in an actual project yet, but I don’t see a reason why it shouldn’t work.


Now that I’ve made that clear, here’s what I came up with – I’ll use your first example (the project metrics) to explain:

When someone edits a project, you’re loading and saving it via your domain model anyway.
At that moment, you have all the information loaded that you need to calculate all your metrics (total budget, effort to date, etc.) for this project.

You can calculate this in the domain model, and save it to the database with the rest of the domain model.
So the Project class in your domain model will have some properties like TotalBudget, EffortToDate, etc., and there will also be columns with those names in the database tables where your domain model is stored (in the same tables, or in a separate table… it doesn’t matter).

Of course, you need to do a one-time run to calculate the values for all existing projects when you start doing this. But after that, the data is automatically updated with the current calculated values each time a project is edited via the domain model.

So whenever you need any kind report, all the required data is already there (pre-calculated) and you can just do something like this:

select ProjectName, TotalBudget, EffortToDate from Projects where TotalBudget > X

It doesn’t matter whether you get the data directly from the tables where the domain model is stored, or if you somehow extract the data to a second database, to a data warehouse or whatever:

  1. If your reporting store is different from your actual data store, you can just copy the data from the “domain model tables”
  2. If you directly query your actual data store, the data is already there and you don’t need to calculate anything

In either case, the business logic for the calculations is in exactly one place: the domain model.
You don’t need it anywhere else, so it’s not necessary to duplicate it.

