Suggested Web Application Framework and Database for Enterprise, “Big-Data” App?

I have a web application that I have been developing for a small group within my company over the past few years, using Pipeline Pilot (plus jQuery and Python scripting) for web development and back-end computation, and Oracle 10g for my RDBMS. Users upload experimental genomic data, which is parsed into a database, and made available for querying, transformation, and reporting.

Experimental data sets are large and have many layers of metadata. A given experimental data record might have a foreign key relationship with a table that describes this data point’s assay. Assays can cover multiple genes, which can have multiple transcripts, which can have multiple mutations, which can affect multiple signaling pathways, etc. Users need to approach this data from any point in those layers of metadata. Since all data sets for a given data type can run over a billion rows, this results in some large, dynamic queries that are hard to predict.

New data sets are added on a weekly basis (~1GB per set). Experimental data is never updated, but the associated metadata can be updated weekly for a few records and yearly for most others. For every data set insert the system sees, there will be between 10 and 100 selects run against it and associated data. It is okay for updates and inserts to run slowly, so long as queries run quickly and are as up-to-date as possible.

The application continues to grow in size and scope and is already starting to run slower than I like. I am worried that we have about outgrown Pipeline Pilot, and perhaps Oracle (as the sole database). Would a NoSQL database or an OLAP system be appropriate here? What web application frameworks work well with systems like this? I’d like the solution to be something scalable, portable and supportable X-years down the road.

Here is the current state of the application:

  • Web Server/Data Processing: Pipeline Pilot on Windows Server + IIS
  • Database: Oracle 10g, ~1TB of data, ~180 tables with several billion-plus row tables
  • Network Storage: Isilon, ~50TB of low-priority raw data

1

Since the data is mostly immutable, have you looked into possible denormalizations? The goal would be to find values that could be essentially duplicated but reduce query complexity.

If queries regularly chain joins to connect two pieces of data, you can create a duplicate foreign key relationship directly between the two tables.
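A minimal sketch of that shortcut, using Python's built-in `sqlite3` as a stand-in for Oracle. The table and column names (`gene`, `assay`, `measurement`, `gene_id`) are hypothetical, and the schema is simplified to one gene per assay, which the real data does not guarantee. The measurement row carries a copy of the assay's `gene_id`, so a query by gene can skip the `assay` join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene  (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE assay (assay_id INTEGER PRIMARY KEY,
                        gene_id  INTEGER REFERENCES gene(gene_id));
    -- measurement.gene_id duplicates assay.gene_id: the denormalized shortcut
    CREATE TABLE measurement (measurement_id INTEGER PRIMARY KEY,
                              assay_id INTEGER REFERENCES assay(assay_id),
                              gene_id  INTEGER REFERENCES gene(gene_id),
                              value    REAL);
    CREATE INDEX idx_measurement_gene ON measurement(gene_id);
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "TP53"), (2, "BRCA1")])
conn.executemany("INSERT INTO assay VALUES (?, ?)", [(10, 1), (11, 2)])


def insert_measurement(conn, measurement_id, assay_id, value):
    """Insert a row, copying the assay's gene_id onto it at load time."""
    conn.execute(
        "INSERT INTO measurement (measurement_id, assay_id, gene_id, value) "
        "SELECT ?, a.assay_id, a.gene_id, ? FROM assay a WHERE a.assay_id = ?",
        (measurement_id, value, assay_id),
    )


for mid, aid, val in [(100, 10, 0.5), (101, 10, 0.7), (102, 11, 1.2)]:
    insert_measurement(conn, mid, aid, val)

# Normalized path: measurement -> assay -> gene
chained = conn.execute("""
    SELECT m.value FROM measurement m
    JOIN assay a ON a.assay_id = m.assay_id
    JOIN gene g  ON g.gene_id = a.gene_id
    WHERE g.symbol = 'TP53' ORDER BY m.measurement_id
""").fetchall()

# Shortcut path: measurement -> gene, one join fewer
direct = conn.execute("""
    SELECT m.value FROM measurement m
    JOIN gene g ON g.gene_id = m.gene_id
    WHERE g.symbol = 'TP53' ORDER BY m.measurement_id
""").fetchall()
```

Because the experimental rows are never updated, the copied key only has to be correct at insert time. If an assay's gene assignment can change in a metadata update, the copy would need to be refreshed as part of that update.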

If there is a calculation performed by several queries, perform it once and save the result in the appropriate table. For example, some property of the assays that is calculated when needed can instead be calculated at insert time and added to the assay table.
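As a rough illustration, again with `sqlite3` standing in for the real database: the derived property here is the GC fraction of a probe sequence, which is purely an assumed example (the `probe_sequence` and `gc_content` columns are hypothetical, not from the original schema). The value is computed once on insert and stored, so later queries filter on a plain column:

```python
import sqlite3


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assay (assay_id INTEGER PRIMARY KEY, "
    "probe_sequence TEXT, gc_content REAL)"
)


def insert_assay(conn, assay_id, probe_sequence):
    """Compute the derived value once, at insert time, and store it."""
    conn.execute(
        "INSERT INTO assay VALUES (?, ?, ?)",
        (assay_id, probe_sequence, gc_content(probe_sequence)),
    )


insert_assay(conn, 1, "GGCC")
insert_assay(conn, 2, "ATGC")
insert_assay(conn, 3, "ATAT")

# Queries read the stored value instead of recomputing it per row.
high_gc = [
    row[0]
    for row in conn.execute(
        "SELECT assay_id FROM assay WHERE gc_content >= 0.5 ORDER BY assay_id"
    )
]
```

The stored column can also be indexed, which a value computed inside each query generally cannot be without extra database features.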

This is ultimately what a Data Warehouse type solution does, but on a much smaller scale.

Not sure if you have settled on a solution, but here are my two cents for those who stumble on this question.
There are two parts: (1) the database and (2) the web app framework.

On the database side, did you explore Hadoop? The following specs of your environment make Hadoop an attractive platform for data processing:

  1. ~1TB of data, ~50TB of low-priority raw data
  2. several billion-plus row tables
  3. New data sets are added on a weekly basis (~1GB per set)
  4. Experimental data is never updated, but the associated metadata can be updated weekly/yearly
  5. Between 10 and 100 selects are run for every insert
  6. all data sets for a given data type can run over a billion rows

The following specs are a concern, though:

  1. large, dynamic queries that are hard to predict.
  2. (okay for updates and inserts to run slow,) so long as queries run quick

Hadoop is extremely scalable, but it performs best with batch processing. For online queries, YMMV. Until you try it out, it will be hard to predict whether you will be better off or worse off. You will have to experiment with Hive, Cloudera Impala, etc. This article gives an introductory overview of Impala and also mentions some other options.

If Hive/Impala do not give you the performance you need, there are variations you can explore, depending on your environment:

  1. Since disk space is comparatively cheap, generate many more “summarized intermediate” tables that could speed up queries.
  2. Pre-join metadata, if that reduces the number of joins in the queries.
  3. Use some hybrid approach of Oracle + Hadoop (but with increased overall complexity).
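The first variation can be sketched as follows. This uses Python's `sqlite3` purely as a small, runnable stand-in; it is not Hive or Impala syntax, and the `measurement` and `gene_summary` tables are hypothetical. The idea is that each weekly load also rebuilds a small aggregate table, so common queries read a few summary rows rather than scanning the large table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement (dataset_id INTEGER, gene_id INTEGER, value REAL);
    -- "summarized intermediate" table: one row per gene
    CREATE TABLE gene_summary (gene_id INTEGER PRIMARY KEY,
                               n_rows INTEGER, mean_value REAL);
""")


def load_dataset(conn, dataset_id, rows):
    """Append one data set, then rebuild the summary in the same transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO measurement VALUES (?, ?, ?)",
            [(dataset_id, gene_id, value) for gene_id, value in rows],
        )
        conn.execute("DELETE FROM gene_summary")
        conn.execute(
            "INSERT INTO gene_summary "
            "SELECT gene_id, COUNT(*), AVG(value) "
            "FROM measurement GROUP BY gene_id"
        )


load_dataset(conn, 1, [(1, 2.0), (2, 10.0)])
load_dataset(conn, 2, [(1, 4.0)])

# Readers hit the small summary table instead of the raw rows.
summary = conn.execute(
    "SELECT gene_id, n_rows, mean_value FROM gene_summary ORDER BY gene_id"
).fetchall()
```

A full rebuild on every load is the simplest option and fits the stated trade-off that inserts may be slow as long as selects are fast. At billions of rows, an incremental update of only the affected groups would likely be needed instead.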

You probably do not need to concern yourself with a particular web framework. Just ensure that your apps are readily distributable/clusterable (avoid specific file system dependencies, ensure they can run on any system with a single build, …) and that your systems are set up to handle multiple nodes.

As for the backend, it really depends on your functionality; you want to choose a datastore (or datastores) that makes sense for your app. Look at the various categories of “NoSQL” data stores (document, graph, key-value, …) to see whether one or more of them fit your needs. You will also want to confirm that Oracle or another RDBMS actually has limitations your business cannot live with (e.g., have you looked at RAC?).

In short, there is no good answer to this question. Understand your needs, isolate your bottlenecks, know your options, and don’t be afraid to mix solutions where appropriate.

3

There are two ways to scale – up (bigger/better/faster boxes) and out (more boxes). Up is good to a point, but no one is making off the shelf petaflop boxes yet. Out takes more work, but if you have removed the bottlenecks there’s no significant limit on how many boxes can be run in parallel.

Based on your comments, it sounds like you have some bottlenecks to remove before the application can be clustered.

As for whether or not you’ve outgrown Pipeline Pilot, that’s a question that the vendor can best answer.

2

The requirement of searching for arbitrary data and dynamic queries is a good use case for an in memory database.

Disclaimer for the rest of the answer: I work for SAP, so I am most familiar with SAP’s products.

SAP has already solved a similar use case (genome analysis) with its in-memory database HANA, so it works. Read more about it here:
http://www.saphana.com/docs/DOC-1799

Programming in HANA is mostly done using stored procedures. HANA has a built-in JavaScript runtime for implementing the application layer. Alternatively, you can expose the stored procedures as services and use any other app server on top (e.g. Java).

You can find tutorials and trial development environments here:
http://scn.sap.com/community/developer-center/cross-technology
More information:
http://scn.sap.com/community/hana-in-memory
