We are planning to build a travel website that integrates multiple hotel APIs (e.g. DOTW, GTA, Expedia). I initially tried MySQL, but the hotel data is large and contains numerous “one to many” relationships with Images, Amenities and Rooms, so search becomes very slow once we have data for around 200,000 hotels. Even fetching all details for just one hotel may result in a JOIN query across at least four tables and a scan over all hotel records. So we are planning to migrate our product schema to a NoSQL database to make search as fast as possible.
We also sometimes need to run scheduled jobs against the database to eliminate duplicates and to load newly added hotels sent by our providers.
Our tech stack is Java and J2EE with Spring and Hibernate.
I have read about MongoDB, Cassandra, Redis and Elasticsearch, but I am unsure whether simply adopting one of these tools will improve the website’s search performance. If so, which features differ between them that would help me choose?
0
I think your search performance can improve greatly through a number of database design techniques that work in a typical RDBMS. I suggest prototyping the following improvements and performance testing them before you commit to an entirely new database technology that takes a great deal of learning and experience to master.
Essentially, you want to avoid the “magic bullet” mindset. There is a misconception that NoSQL can somehow solve all of our problems and performance issues with the RDBMS. That might be true sometimes, but you really should try improving your database design first.
Identify your Non-Functional Requirements
Specifically identify your non-functional requirements for performance. Determine what the maximum acceptable average query wait time is and use that as your goal. If you can tweak your database design to achieve this, then you do not need to rearchitect your software around a NoSQL solution.
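To make that goal concrete, a small timing harness can compare measured query times against the target. The sketch below is only an illustration: it uses Python's built-in SQLite in place of MySQL, and the `hotel` table, its columns, the 200 ms target, and the helper names are all assumptions of mine, not anything from the question.

```python
import sqlite3
import statistics
import time


def measure_latency_ms(conn, sql, params=(), runs=50):
    """Run the query `runs` times and return each wall-clock time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings, target_ms):
    """Compare the average against the target; also report the 95th percentile."""
    mean = statistics.mean(timings)
    p95 = statistics.quantiles(timings, n=20)[-1]  # last of 19 cut points
    return {"mean_ms": mean, "p95_ms": p95, "meets_target": mean <= target_ms}


# Hypothetical sample data standing in for the real hotel table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel (id INTEGER PRIMARY KEY, city TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO hotel (city, name) VALUES (?, ?)",
    [(f"city{i % 100}", f"hotel{i}") for i in range(20000)],
)

timings = measure_latency_ms(
    conn, "SELECT id, name FROM hotel WHERE city = ?", ("city7",)
)
report = summarize(timings, target_ms=200.0)  # 200 ms is an assumed target
print(report)
```

Running the same harness before and after each schema change gives you a number to compare against the requirement instead of a general feeling that search is slow.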
Avoid binary columns
It sounds like the Image table stores binary media directly in your database tables. While this is highly dependent on how your chosen database implements binary columns, it is generally accepted that binary columns can hurt the performance of your queries. Binary columns typically invalidate the benefits that an index on a table column can provide. If you don’t believe me, join to the Image table, run an explain plan, and notice how the index is probably not being used.
Use a Content Delivery Network
Instead of storing images and media in database records, store a URL that an application can use to retrieve that image, maybe in a browser. That URL can point to a unique image that is being stored and managed in a Content Delivery Network. There are a number of cloud services that can provide this or you can build your own with a number of tools. This should make all aspects of your application much more efficient.
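As a rough sketch of what that looks like in the schema, the image table keeps only a path and the application builds the URL. Everything here is hypothetical: the `hotel_image` table, the `cdn.example.com` host, and the `image_urls` helper are placeholders, and SQLite stands in for whatever RDBMS you use.

```python
import sqlite3

CDN_BASE = "https://cdn.example.com"  # hypothetical CDN host

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hotel_image (
        id       INTEGER PRIMARY KEY,
        hotel_id INTEGER NOT NULL,
        path     TEXT NOT NULL   -- object key on the CDN, not the image bytes
    )
""")
conn.execute("CREATE INDEX idx_hotel_image_hotel_id ON hotel_image(hotel_id)")
conn.executemany(
    "INSERT INTO hotel_image (hotel_id, path) VALUES (?, ?)",
    [
        (1, "hotels/1/front.jpg"),
        (1, "hotels/1/pool.jpg"),
        (2, "hotels/2/lobby.jpg"),
    ],
)


def image_urls(conn, hotel_id):
    """Return full CDN URLs for one hotel; the rows carry only short text."""
    rows = conn.execute(
        "SELECT path FROM hotel_image WHERE hotel_id = ? ORDER BY id",
        (hotel_id,),
    )
    return [f"{CDN_BASE}/{path}" for (path,) in rows]


urls = image_urls(conn, 1)
print(urls)
```

The rows stay small and the browser fetches the image bytes from the CDN, so the database never has to move them.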
Evaluate your table indexes
Make sure you have indexes on the columns that you typically filter on or join against; if you do not, build them. To be honest, four tables is not a great many joins for a query, so if you follow these guidelines you should see at least a modest improvement in performance.
If you follow these guidelines and still cannot achieve your performance requirements, then you can evaluate various NoSQL solutions and look for features that might help you.
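One way to check whether an index is actually used is to compare the query plan before and after creating it. This sketch uses SQLite's `EXPLAIN QUERY PLAN` from Python purely as a stand-in; MySQL has its own `EXPLAIN` with different output, and the table, column, and index names below are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hotel (id INTEGER PRIMARY KEY, city TEXT, name TEXT);
CREATE TABLE room  (id INTEGER PRIMARY KEY, hotel_id INTEGER, room_type TEXT);
""")


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# A filter on hotel.city plus a join on room.hotel_id (both hypothetical).
query = """
SELECT h.name, r.room_type
FROM hotel h JOIN room r ON r.hotel_id = h.id
WHERE h.city = 'Paris'
"""

before = plan(query)  # no secondary indexes exist yet

conn.executescript("""
CREATE INDEX idx_hotel_city    ON hotel(city);
CREATE INDEX idx_room_hotel_id ON room(hotel_id);
""")

after = plan(query)   # the planner can now pick the new indexes

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, so read the printed output rather than relying on a fixed string. The same before/after comparison is worth doing on your real MySQL schema.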
2
“Even fetching all details for just one hotel may result in a JOIN query from at least four tables, and scanning over all hotels records.”
A four-join query is absolutely trivial if you have the appropriate indexes for all joins.
The second part of this question is far more troubling. Why the scan over all records? Is it because of missing indexes? Or do you need some aggregate data, maybe to compare against an average, give a ranking index, something like that? If so, switching to NoSQL won’t help; what you need is to precalculate those aggregates so you can derive the data for each hotel quickly.
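To illustrate the precalculation idea, here is a minimal sketch in which a summary table is refreshed periodically (for example by a scheduled job) and each hotel lookup joins against it instead of scanning every hotel. The `price` column, the `city_stats` table, and both function names are my own assumptions, and SQLite is only a stand-in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hotel (id INTEGER PRIMARY KEY, city TEXT, name TEXT, price REAL);
CREATE TABLE city_stats (
    city TEXT PRIMARY KEY, avg_price REAL, hotel_count INTEGER
);
""")
conn.executemany(
    "INSERT INTO hotel (city, name, price) VALUES (?, ?, ?)",
    [("Paris", "A", 100.0), ("Paris", "B", 200.0), ("Rome", "C", 90.0)],
)


def refresh_city_stats(conn):
    """Rebuild the aggregates in one transaction; this is the only full scan."""
    with conn:
        conn.execute("DELETE FROM city_stats")
        conn.execute("""
            INSERT INTO city_stats (city, avg_price, hotel_count)
            SELECT city, AVG(price), COUNT(*) FROM hotel GROUP BY city
        """)


def hotel_vs_city_average(conn, hotel_id):
    """Primary-key lookups on both tables; no scan over all hotels."""
    return conn.execute("""
        SELECT h.name, h.price, s.avg_price, h.price - s.avg_price
        FROM hotel h JOIN city_stats s ON s.city = h.city
        WHERE h.id = ?
    """, (hotel_id,)).fetchone()


refresh_city_stats(conn)
row = hotel_vs_city_average(conn, 1)
print(row)
```

The trade-off is that the aggregates are only as fresh as the last refresh, which is usually acceptable for things like averages and rankings.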
1
NoSQL is generally not very good with relational data. NoSQL is often great for non-relational but structured data like documents or time series.
Your “one to many” relationships may look quite like a document: e.g. a “hotel” document may carry all its images, room info, etc., stored together and fetched with one operation.
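A hedged sketch of what such a document could look like follows. The field names and values are invented for illustration, and a plain dictionary of JSON strings stands in for a real document store, so nothing here is the API of MongoDB or any other product.

```python
import json

# One self-contained hotel document; all field names are hypothetical.
hotel_doc = {
    "_id": "hotel-1001",
    "name": "Example Hotel",
    "provider": "DOTW",
    "images": [{"url": "https://cdn.example.com/hotels/1001/front.jpg"}],
    "amenities": ["wifi", "pool"],
    "rooms": [
        {"type": "double", "max_guests": 2},
        {"type": "suite", "max_guests": 4},
    ],
}

store = {}  # stand-in for a document collection keyed by id


def save(doc):
    """Serialize the whole hotel, with its nested lists, under one key."""
    store[doc["_id"]] = json.dumps(doc)


def fetch(hotel_id):
    """One lookup returns images, amenities and rooms together; no joins."""
    return json.loads(store[hotel_id])


save(hotel_doc)
loaded = fetch("hotel-1001")
print(len(loaded["rooms"]), "rooms,", len(loaded["images"]), "image(s)")
```

This shape suits "show me everything about one hotel", but queries that cut across hotels, such as all suites in a city, need their own indexes or a search engine.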
On the other hand, if you see a need for an SQL join, do not hesitate to use an SQL database. These are damn efficient with joins, and are quite good at pulling large amounts of data in one query.
WRT “searches become very slow”, it’s hard to tell what happened without seeing your DB structure first. Usually adding a relevant (or dropping an irrelevant) index can speed things up. Reworking the schema to make the few most important queries efficient is also known to work.
I don’t think NoSQL will help much with search, at least not until I can see the types of searches you’re going to run.
2
Why either/or?
I’ve worked very successfully with a hybrid approach, using a relational DB (SQL Server, but pick your favourite) to hold data that needs a relational structure – most of this is IDs linking all the various domain objects, very little textual data and certainly no blobs – and a NoSQL DB (Dynamo) to hold large, relatively unstructured data, typically JSON documents picked up from third-party sources. Obviously the coding is more complex, but it enables you to get the best of both worlds.
Of course it may be that a pure approach is ultimately best for you, but the hybrid may also help out in step-by-step refactoring.
1