I have roughly 4 million records per day and have to keep 7 years’ worth online, so we are looking at about 10.2 billion records that I need to be able to search through. The users expect the search to be quick enough for a UI: results in 3-5 s.
Due to politics out of my control, I can’t use an off-the-shelf database solution, because it means I would have to give the database to another team to manage (don’t ask), which means I lose the ability to optimize hardware and software: they have a one-size-fits-all service for databases and charge (internally) by the GB. I’m sure I am going to get comments suggesting I make that point; I already have, and management understands that what they are asking me to do is ridiculous.
I have been looking at using Lucene as the core of my solution: storing the actual data partitioned by type and by day in flat files, then using a Lucene document to index some of the fields that are searched by, with the only “stored” field being the id of the record (so that I can read it from the flat file).
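For what it’s worth, here is a minimal sketch of what I have in mind for the indexing side. The field names and values are made up for illustration, and I’m assuming a reasonably recent Lucene version; the point is just that the searched-by fields are indexed but not stored, and the only stored field is the offset of the record in the flat file:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class DailyIndexer {
    public static void main(String[] args) throws Exception {
        // One index directory per day, matching the flat-file partitioning.
        try (FSDirectory dir = FSDirectory.open(Paths.get("indexes/2015-08-14"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            // Searched-by fields: indexed, but NOT stored.
            doc.add(new StringField("eventId", "EVT-123456", Field.Store.NO));
            doc.add(new StringField("actionType", "DISPATCHED", Field.Store.NO));
            doc.add(new LongPoint("timestamp", 1439539200000L));
            // Doc value so results can be sorted by time without loading the record.
            doc.add(new NumericDocValuesField("timestamp", 1439539200000L));
            // The only stored field: where the full record lives in the day's flat file.
            doc.add(new StoredField("recordOffset", 987654321L));

            writer.addDocument(doc);
        }
    }
}
```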
I am not exactly clued up on Lucene or hard drives, but as I understand it, there will be initial IO/seek time to search the index; then, once I have all the Lucene document IDs, reading those documents will incur further IO/seek time; and then I read the actual record from the flat file. Given the size of the dataset, I can’t imagine this will be very quick, which worries me slightly.
Lucene has a maximum of about 2.1 billion documents per index, so I will require multiple indices here.
Does this approach, on the face of it, look like it could work?
The data I am storing is event-action data. Most of the queries will group by event id and get the last event-action details for a particular event. Some of the queries will analyse large sets of events and their individual event-actions.
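To make the lookup path concrete, here is a rough sketch, under the same assumptions as above (illustrative names, recent Lucene API), of how the “last event-action for a particular event” query might run across a few of the per-day indices via a MultiReader, with only the stored flat-file offset read back:

```java
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LatestActionLookup {
    public static void main(String[] args) throws Exception {
        // Open some of the per-day indices and search them as one logical index.
        DirectoryReader day1 = DirectoryReader.open(FSDirectory.open(Paths.get("indexes/2015-08-13")));
        DirectoryReader day2 = DirectoryReader.open(FSDirectory.open(Paths.get("indexes/2015-08-14")));
        try (MultiReader all = new MultiReader(day1, day2)) {
            IndexSearcher searcher = new IndexSearcher(all);

            // "Last event-action for a particular event": newest timestamp wins.
            Query byEvent = new TermQuery(new Term("eventId", "EVT-123456"));
            Sort newestFirst = new Sort(new SortField("timestamp", SortField.Type.LONG, true));
            TopDocs top = searcher.search(byEvent, 1, newestFirst);

            if (top.scoreDocs.length > 0) {
                Document hit = searcher.doc(top.scoreDocs[0].doc);
                long offset = hit.getField("recordOffset").numericValue().longValue();
                // Seek to 'offset' in that day's flat file to read the full record.
                System.out.println("Read full record at flat-file offset " + offset);
            }
        }
    }
}
```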
You haven’t said how big the data is, how big the individual fields are, or what budget you’ve got.
Regardless of what indexing system you choose, consider throwing hardware at the problem. You shouldn’t need to search the disks for anything. Index all the data, using a scheme that’s very fast to traverse (perhaps a sorted list or tree). Store the index on disk, but then cache the whole index in RAM. You may need tens, or even hundreds of gigabytes of RAM to do that.
If the individual fields are big, or variable size, consider indexing hashes of them.
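Back-of-the-envelope: 10.2 billion entries at even 16 bytes apiece is on the order of 160 GB, which is where the “tens or hundreds of gigabytes of RAM” comes from. As a rough sketch of the idea (the names and sizes are illustrative, not a finished design), an in-RAM map from a hashed field value to the flat-file offsets of the matching records might look like this:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Minimal sketch of the "keep the whole index in RAM" idea: map a (hashed)
// field value to the flat-file offsets of the matching records.
public class InMemoryIndex {
    // 64-bit hash of the field value -> offsets of matching records in the flat files.
    private final Map<Long, long[]> postings = new HashMap<>();

    public void add(String fieldValue, long recordOffset) {
        long key = hash(fieldValue);
        long[] existing = postings.getOrDefault(key, new long[0]);
        long[] grown = Arrays.copyOf(existing, existing.length + 1);
        grown[existing.length] = recordOffset;
        postings.put(key, grown);
    }

    public long[] lookup(String fieldValue) {
        return postings.getOrDefault(hash(fieldValue), new long[0]);
    }

    // Hash big or variable-sized fields so each in-RAM entry stays a fixed size.
    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFFL);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```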
The price for the server to do that could be scary.
Ignoring all the technical details, this is an organizational/management problem and needs to be solved by the management of your organization.
Your manager has to be willing to kick the problem upstairs and/or get his users to raise the problem at a high level.
At your level, put together or request an estimate for doing this with Oracle and Oracle hardware. Then put together a realistic estimate for a Hadoop cluster.
Despite the hype, these clusters do not come cheap (you probably need something like 18 eight-processor nodes with 64 GB of memory and 4 × 2 TB disks, spread over three racks, then another 4 nodes for the catalog, etc.). DO NOT underestimate this; if you win, you will have to implement it.
So, first let’s clearly re-state the problem in terms of its requirements:
1. The system shall store a minimum of 4M records per day.
2. The system shall provide a search interface to the user.
   2.1 The search capability shall return results in a maximum of 3s.
3. The system shall be capable of searching a minimum of 10.2 billion records.
4. The system shall use a custom-designed database.
   4.1 The system shall have hardware and software optimized for the database to be developed.
There are probably additional non-functional requirements, as well as details on how big the individual records are, which are probably relevant to your situation.
The short answer is that you have a requirements problem. If you look at these requirements, three of them (the first three) apply correctly to the system to define its function and behavior. The last requirement is not a valid requirement from a purist perspective, but I have seen these types of requirements put into statements of work.
So, the way this problem is solved is to estimate the cost of the 4th requirement, given the other three. Once you do that, present that as your solution cost. Management will panic and immediately ask you why the problem can’t be solved for a reasonable price. That is the entry point for your discussion on what needs to happen. Have an affordable alternative available and ready to present.
This is in contrast to what you are doing right now, which is assuming the other three can’t be met given the last one. Management doesn’t get it, because all they see are dollar signs.
If I were in your shoes, I would begin with a very reasonable, by-the-book implementation, using nothing but a regular RDBMS embedded within the application, so that they do not feel as if they have to support something. SQLite, H2, or any alternative embedded database should do: no special flat files, no exotic indexes, no nothing; just a straightforward application of standard practices to solve the problem at hand, for the most part disregarding the immensity of the data. (I would, of course, choose a large enough integer as a key, and that’s pretty much all.)
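As an illustration of how plain such a baseline can be (the table, columns, and H2 URL below are made up for the sake of the example, and assume the H2 driver is on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// A minimal sketch of the "by-the-book embedded RDBMS" baseline using H2.
public class EmbeddedBaseline {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./event_store");
             Statement st = conn.createStatement()) {

            // A big-enough integer key and a plain index on the searched-by columns.
            st.execute("CREATE TABLE IF NOT EXISTS event_action (" +
                    "  id BIGINT PRIMARY KEY, " +
                    "  event_id BIGINT NOT NULL, " +
                    "  action_type VARCHAR(32) NOT NULL, " +
                    "  occurred_at TIMESTAMP NOT NULL, " +
                    "  details VARCHAR(4000))");
            st.execute("CREATE INDEX IF NOT EXISTS idx_event_time " +
                    "ON event_action (event_id, occurred_at)");

            // "Last event-action for a particular event" stays a one-liner.
            try (ResultSet rs = st.executeQuery(
                    "SELECT * FROM event_action WHERE event_id = 42 " +
                    "ORDER BY occurred_at DESC LIMIT 1")) {
                if (rs.next()) {
                    System.out.println("Latest action: " + rs.getString("action_type"));
                }
            }
        }
    }
}
```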
While working on it, a couple more ideas would probably occur to me as to how to make it work faster without resorting to anything exotic.
Then I would test this to see how it performs, and I would demonstrate the results, together with the working solution, to the “powers that be” in your organization.
- There is a chance that your straightforward implementation will perform within the required constraints, so you will be fine right there, with no need to do anything else and zero resources wasted.
- If the performance of the straightforward implementation is outside, but not too far out of, the required constraints, the “powers that be” might say “well, this is close enough, we don’t want to do anything else about it, so that’s what it will be.” Again, zero resources wasted.
- If the performance of the straightforward implementation is outside the required constraints but within the same order of magnitude, I would tell them to just buy better, bigger, faster hardware. Chances are they will do that, and the case is closed.
- If they don’t want to buy better, bigger, faster hardware, then I would recommend that they rethink their requirement to refrain from using a large, scalable RDBMS. If they are reasonable, and you have shown that you are reasonable too, chances are they will rethink it.
- If the powers that be don’t want to follow any of the reasonable avenues, and instead want you to play the role of a magician, then and only then would I start worrying about exotic solutions. Chances are, things will not reach that point. But even if they do, the amount of work that you will have done in vain up until that point will be relatively small, and well worth the gamble that it might just suffice.
Thinking from the front end …
If you separate your lookup types in the UI, you might be able to have more reasonable constraints.
It sounds like one lookup type is recent event-action data on an event, which allows you to isolate by time in your data search. This probably gives a much smaller set of data, and the user is likely to expect it to come back fairly quickly.
Other kinds of lookup, where searches over large data sets or old time frames need to be accomplished, can be given a different UI (or several UIs), with a nice spinner to indicate that the system is thinking. As the user can understand this to be a more laborious request, patience might reasonably be expected, and, realistically, it will be necessary.
I don’t know whether you have any ability to influence the front-end design, but if you can convey the constraints you are working with, hopefully those who handle the user interaction will respond with understanding (at least some).