Why don’t RDBMSes return joined tables in a nested format?

For example, say I want to fetch a User and all of his phone numbers and email addresses. The phone numbers and emails are stored in separate tables, One user to many phones/emails. I can do this quite easily:

SELECT * FROM users user 
    LEFT JOIN emails email ON email.user_id=user.id
    LEFT JOIN phones phone ON phone.user_id=user.id

The problem* with this is that it’s returning the user’s name, DOB, favorite color, and all the other information stored in the user table over-and-over again for each record (users×emails×phones records), presumably eating up bandwidth and slowing down the results.

Wouldn’t it be nicer if it returned a single row for each user, and within that record there was a list of emails and a list of phones? It would make the data much easier to work with too.

I know you can get results like this using LINQ or perhaps other frameworks, but it seems to be a weakness in the underlying design of relational databases.

We could get around this by using NoSQL, but shouldn’t there be some middle ground?

Am I missing something? Why doesn’t this exist?

* Yes, it’s designed this way. I get it. I’m wondering why there isn’t an alternative that is easier to work with. SQL could keep doing what it’s doing but then they could add a keyword or two to do a little bit of post-processing that returns the data in a nested format instead of a cartesian product.

I know this can be done in a scripting language of your choice, but it requires that the SQL server either sends redundant data (example below) or that you to issue multiple queries like SELECT email FROM emails WHERE user_id IN (/* result of first query */).

Instead of having MySQL return something akin to this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "[email protected]",
    },
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "[email protected]",
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "email": "[email protected]",
    }
]

And then having to group on some unique identifier (which means I need to fetch that too!) client-side to reformat the result set how you want it, just return this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "emails": ["[email protected]", "[email protected]"]
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "emails": ["[email protected]"],
    }
]

Alternatively, I can issue 3 queries: 1 for the users, 1 for the emails, and 1 for the phone numbers, but then the email and phone number result sets need to contain the user_id so that I can match them back up with the users I previously fetched. Again, redundant data and needless post-processing.

Deep down, in the guts of a relational database, its all rows and columns. That is the structure that a relational database is optimized to work with. Cursors work on individual rows at a time. Some operations create temporary tables (again, it needs to be rows and columns).

By working with only rows and returning only rows, the system is able to better deal with memory and network traffic.

As mentioned, this allows for certain optimizations to be done (indexes, joins, unions, etc…)

If one was to want a nested tree structure, this requires that one pulls all the data at once. Gone are the optimizations for the cursors on the database side. Likewise, the traffic over the network becomes one big burst that can take much longer than the slow trickle of row by row (this is something that is occasionally lost in today’s web world).

Every language has arrays within it. These are easy things to work with and interface with. By using a very primitive structure, the driver between the database and program – no matter what language – can work in a common way. Once one starts adding trees, the structures in the language become more complex and more difficult to traverse.

It isn’t that hard for a programing language to convert the rows returned into some other structure. Make it into a tree or a hash set or leave it as a list of rows that you can iterate over.

There is also history at work here. Transferring structured data was something ugly in the days of old. Look at the EDI format to get an idea of what you might be asking for. Trees also imply recursion – which some languages didn’t support (the two most important languages of the old days didn’t support recursion – recursion didn’t enter Fortran until F90 and of the era COBOL didn’t either).

And while the languages of today have support for recursion and more advanced data types, there isn’t really a good reason to change things. They work, and they work well. The ones that are changing things are the nosql databases. You can store trees in documents in a document based one. LDAP (its actually oldish) is also a tree based system (though its probably not what you’re after). Who knows, maybe the next thing in nosql databases will be one that returns back the query as a json object.

However, the ‘old’ relational databases… they’re working with rows because thats what they’re good at and everything can talk to them without trouble or translation.

In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.

From RFC 1925 – The Twelve Networking Truths

It’s returning exactly what you asked for: a single record set containing the Cartesian product defined by the joins. There are plenty of valid scenarios where that’s exactly what you would want, so saying that SQL is giving a bad result (and thus implying that it would be better if you changed it) would actually screw a lot of queries up.

What you’re experiencing is known as “Object/Relational Impedance Mismatch,” the technical difficulties that arise from the fact that the object-oriented data model and the relational data model are fundamentally different in several ways. LINQ and other frameworks (known as ORMs, Object/Relational Mappers, not coincidentally,) don’t magically “get around this;” they just issue different queries. It can be done in SQL too. Here’s how I’d do it:

SELECT * FROM users user where [criteria here]

Iterate the list of users and make a list of IDs.

SELECT * from EMAILS where user_id in (list of IDs here)
SELECT * from PHONES where user_id in (list of IDs here)

And then you do the joining client-side. This is how LINQ and other frameworks do it. There’s no real magic involved; just a layer of abstraction.

You could use a built in function to concatenate the records together. In MySQL you can use the GROUP_CONCAT() function and in Oracle you can use the LISTAGG() function.

Here is a sample of what a query might look like in MySQL:

SELECT user.*, 
    (SELECT GROUP_CONCAT(DISTINCT emailAddy) FROM emails email WHERE email.user_id = user.id
    ) AS EmailAddresses,
    (SELECT GROUP_CONCAT(DISTINCT phoneNumber) FROM phones phone WHERE phone.user_id = user.id
    ) AS PhoneNumbers
FROM users user

This would return something like

username    department       EmailAddresses                        PhoneNumbers
Tim_Burton  Human Resources  [email protected], [email protected], [email protected]   231-123-1234, 231-123-1235

The problem with this is that it’s returning the user’s name, DOB, favorite color, and all the other information stored

The problem is that you are not being selective enough. You asked for everything when you said

Select * from...

…and you got it (including DOB and favourite colours).

You probably should been a little more (ahem) …selective, and said something like:

select users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...

It’s also possible you might see records that look like duplicates because a user might join to multiple email records, but the field that distinguishes these two is not in your
Select statement, so you might want to say something like

select distinct users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...

…over-and-over again for each record…

Also, I notice you’re doing a LEFT JOIN. This will join all records on the left of the join (i.e. users) to all records on the right, or in other words:

A left outer join returns all the values from an inner join plus all values in the left table that do not match to the right table.

(http://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join)

So another question is do you actually need a left join, or would an INNER JOIN have been sufficient? They are very different types of joins.

Wouldn’t be nicer if it returned a single row for each user, and within that record there was a list of emails

If you actually want a single column within the result set to contain a list that is generated on-the-fly, that can be done but it varies depending on which database you’re using. Oracle has the listagg function.

Ultimately, I think that your problem might be solved if you rewrite your query close to something like this:

select distinct users.name, users.id, emails.email_address, phones.phone_number
from users
  inner join emails on users.user_id = emails.user_id
  inner join phones on users.user_id = phones.user_id

Queries always produce a rectangular (un-jagged) tabular set of data. There are no nested sub-sets within a set. In the world of sets everything is a pure un-nested rectangle.

You can think of a join as putting 2 sets side-by-side. The “on” condition is how the records in each set are matched up. If a user has 3 phone numbers, then you’ll see a 3-time duplication in the user info. A rectangular un-jagged set must be produced by the query. It’s simply the nature of joining sets with a 1-to-many relationship.

To get what you want, you must use a separate query like Mason Wheeler described.

select * from Phones where user_id=344;

The result of this query is still a rectanglar un-jagged set. As is everything in the world of sets.

The concept of relational closure basically means that the result of any query is a relation which can be used in other queries as if it was a base table. This is an powerful concept because it makes queries composable.

If SQL allowed you to write queries which output nested data structures, you would break this principle. A nested data structure is not a relation, so you would need a new query language, or complex extensions to SQL, in order to query it further or to join it which other relations.

Basically you would build a hierarchical DBMS on top of a relational DBMS. It will be much more complex for a dubious benefit, and you lose the advantages of a consistently relational system.

I understand why it would sometimes be convenient to be able output hierarchically structured data from SQL, but the cost in the added complexity throughout the DBMS to support this is definitely not worth it.

You have to decide where the bottlenecks exist. The bandwidth between your database and application is usually pretty fast. There’s no reason most databases couldn’t return 3 separate datasets within one call and no joins. Then you get to join it all together in your app if you want.

Otherwise, you want the database to put this dataset together and then remove all the repeated values in each row that are the result of the joins and not necessarily the rows themselves having duplicate data like two people with the same name or phone number. Seems like a lot of over-head to save on bandwidth. You would be better off focusing on returning less data with better filtering and removing the columns you don’t need. Because Select * is never used in production-well that depends.

Very simply, don’t join your data if you want distinct results for a user query and a phone number query, otherwise as others have pointed out the “Set” or data will contain extra fields for every row.

Issue 2 distinct queries instead of one with a join.

In the stored procedure or inline parameterized sql craft 2 queries and return the results of both back. Most database and languages support multiple result sets.

For example, SQL Server and C# accomplish functionality this by using IDataReader.NextResult().

You are missing something. If you want to denormalize your data, you have to do it yourself.

;with toList as (
    select  *, Stuff(( select ',' + (phone.phoneType + ':' + phone.PhoneNumber) 
                    from phones phone
                    where phone.user_id = user.user_id
                    for xml path('')
                  ), 1,1,'') as phoneNumbers
from users user
)
select *
from toList

Pls refer to the usage of STUFF function which groups multiple rows (phone numbers) of a column (contact) that can be extracted as a single cell of delimited values of a row (user).

Today we extensively using this but facing some high CPU and performance issues. XML data type is another option but is a design change not a query level one.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 10:40

Thẻ: design, rdbms, sql

Why don’t RDBMSes return joined tables in a nested format?

SELECT * FROM users user 
    LEFT JOIN emails email ON email.user_id=user.id
    LEFT JOIN phones phone ON phone.user_id=user.id

Wouldn’t it be nicer if it returned a single row for each user, and within that record there was a list of emails and a list of phones? It would make the data much easier to work with too.

I know you can get results like this using LINQ or perhaps other frameworks, but it seems to be a weakness in the underlying design of relational databases.

We could get around this by using NoSQL, but shouldn’t there be some middle ground?

Am I missing something? Why doesn’t this exist?

Instead of having MySQL return something akin to this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "[email protected]",
    },
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "email": "[email protected]",
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "email": "[email protected]",
    }
]

And then having to group on some unique identifier (which means I need to fetch that too!) client-side to reformat the result set how you want it, just return this:

[
    {
        "name": "John Smith",
        "dob": "1945-05-13",
        "fav_color": "red",
        "emails": ["[email protected]", "[email protected]"]
    },
    {
        "name": "Jane Doe",
        "dob": "1953-02-19",
        "fav_color": "green",
        "emails": ["[email protected]"],
    }
]

By working with only rows and returning only rows, the system is able to better deal with memory and network traffic.

As mentioned, this allows for certain optimizations to be done (indexes, joins, unions, etc…)

It isn’t that hard for a programing language to convert the rows returned into some other structure. Make it into a tree or a hash set or leave it as a list of rows that you can iterate over.

However, the ‘old’ relational databases… they’re working with rows because thats what they’re good at and everything can talk to them without trouble or translation.

In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.

From RFC 1925 – The Twelve Networking Truths

SELECT * FROM users user where [criteria here]

Iterate the list of users and make a list of IDs.

SELECT * from EMAILS where user_id in (list of IDs here)
SELECT * from PHONES where user_id in (list of IDs here)

And then you do the joining client-side. This is how LINQ and other frameworks do it. There’s no real magic involved; just a layer of abstraction.

You could use a built in function to concatenate the records together. In MySQL you can use the GROUP_CONCAT() function and in Oracle you can use the LISTAGG() function.

Here is a sample of what a query might look like in MySQL:

SELECT user.*, 
    (SELECT GROUP_CONCAT(DISTINCT emailAddy) FROM emails email WHERE email.user_id = user.id
    ) AS EmailAddresses,
    (SELECT GROUP_CONCAT(DISTINCT phoneNumber) FROM phones phone WHERE phone.user_id = user.id
    ) AS PhoneNumbers
FROM users user

This would return something like

username    department       EmailAddresses                        PhoneNumbers
Tim_Burton  Human Resources  [email protected], [email protected], [email protected]   231-123-1234, 231-123-1235

The problem with this is that it’s returning the user’s name, DOB, favorite color, and all the other information stored

The problem is that you are not being selective enough. You asked for everything when you said

Select * from...

…and you got it (including DOB and favourite colours).

You probably should been a little more (ahem) …selective, and said something like:

select users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...

select distinct users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...

…over-and-over again for each record…

Also, I notice you’re doing a LEFT JOIN. This will join all records on the left of the join (i.e. users) to all records on the right, or in other words:

A left outer join returns all the values from an inner join plus all values in the left table that do not match to the right table.

(http://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join)

So another question is do you actually need a left join, or would an INNER JOIN have been sufficient? They are very different types of joins.

Wouldn’t be nicer if it returned a single row for each user, and within that record there was a list of emails

Ultimately, I think that your problem might be solved if you rewrite your query close to something like this:

select distinct users.name, users.id, emails.email_address, phones.phone_number
from users
  inner join emails on users.user_id = emails.user_id
  inner join phones on users.user_id = phones.user_id

Queries always produce a rectangular (un-jagged) tabular set of data. There are no nested sub-sets within a set. In the world of sets everything is a pure un-nested rectangle.

To get what you want, you must use a separate query like Mason Wheeler described.

select * from Phones where user_id=344;

The result of this query is still a rectanglar un-jagged set. As is everything in the world of sets.

Basically you would build a hierarchical DBMS on top of a relational DBMS. It will be much more complex for a dubious benefit, and you lose the advantages of a consistently relational system.

Issue 2 distinct queries instead of one with a join.

In the stored procedure or inline parameterized sql craft 2 queries and return the results of both back. Most database and languages support multiple result sets.

For example, SQL Server and C# accomplish functionality this by using IDataReader.NextResult().

You are missing something. If you want to denormalize your data, you have to do it yourself.

;with toList as (
    select  *, Stuff(( select ',' + (phone.phoneType + ':' + phone.PhoneNumber) 
                    from phones phone
                    where phone.user_id = user.user_id
                    for xml path('')
                  ), 1,1,'') as phoneNumbers
from users user
)
select *
from toList

Pls refer to the usage of STUFF function which groups multiple rows (phone numbers) of a column (contact) that can be extracted as a single cell of delimited values of a row (user).

Today we extensively using this but facing some high CPU and performance issues. XML data type is another option but is a design change not a query level one.

Filed under: softwareengineering - @ 10:40

Thẻ: design, rdbms, sql

Thiết kế website giá rẻ

Danh mục

Why don’t RDBMSes return joined tables in a nested format?

Why don’t RDBMSes return joined tables in a nested format?