For example, say I want to fetch a User and all of his phone numbers and email addresses. The phone numbers and emails are stored in separate tables, One user to many phones/emails. I can do this quite easily:
SELECT * FROM users user
LEFT JOIN emails email ON email.user_id=user.id
LEFT JOIN phones phone ON phone.user_id=user.id
The problem* with this is that it’s returning the user’s name, DOB, favorite color, and all the other information stored in the user table over-and-over again for each record (users×emails×phones records), presumably eating up bandwidth and slowing down the results.
Wouldn’t it be nicer if it returned a single row for each user, and within that record there was a list of emails and a list of phones? It would make the data much easier to work with too.
I know you can get results like this using LINQ or perhaps other frameworks, but it seems to be a weakness in the underlying design of relational databases.
We could get around this by using NoSQL, but shouldn’t there be some middle ground?
Am I missing something? Why doesn’t this exist?
* Yes, it’s designed this way. I get it. I’m wondering why there isn’t an alternative that is easier to work with. SQL could keep doing what it’s doing but then they could add a keyword or two to do a little bit of post-processing that returns the data in a nested format instead of a cartesian product.
I know this can be done in a scripting language of your choice, but it requires that the SQL server either sends redundant data (example below) or that you to issue multiple queries like SELECT email FROM emails WHERE user_id IN (/* result of first query */)
.
Instead of having MySQL return something akin to this:
[
{
"name": "John Smith",
"dob": "1945-05-13",
"fav_color": "red",
"email": "[email protected]",
},
{
"name": "John Smith",
"dob": "1945-05-13",
"fav_color": "red",
"email": "[email protected]",
},
{
"name": "Jane Doe",
"dob": "1953-02-19",
"fav_color": "green",
"email": "[email protected]",
}
]
And then having to group on some unique identifier (which means I need to fetch that too!) client-side to reformat the result set how you want it, just return this:
[
{
"name": "John Smith",
"dob": "1945-05-13",
"fav_color": "red",
"emails": ["[email protected]", "[email protected]"]
},
{
"name": "Jane Doe",
"dob": "1953-02-19",
"fav_color": "green",
"emails": ["[email protected]"],
}
]
Alternatively, I can issue 3 queries: 1 for the users, 1 for the emails, and 1 for the phone numbers, but then the email and phone number result sets need to contain the user_id so that I can match them back up with the users I previously fetched. Again, redundant data and needless post-processing.
23
Deep down, in the guts of a relational database, its all rows and columns. That is the structure that a relational database is optimized to work with. Cursors work on individual rows at a time. Some operations create temporary tables (again, it needs to be rows and columns).
By working with only rows and returning only rows, the system is able to better deal with memory and network traffic.
As mentioned, this allows for certain optimizations to be done (indexes, joins, unions, etc…)
If one was to want a nested tree structure, this requires that one pulls all the data at once. Gone are the optimizations for the cursors on the database side. Likewise, the traffic over the network becomes one big burst that can take much longer than the slow trickle of row by row (this is something that is occasionally lost in today’s web world).
Every language has arrays within it. These are easy things to work with and interface with. By using a very primitive structure, the driver between the database and program – no matter what language – can work in a common way. Once one starts adding trees, the structures in the language become more complex and more difficult to traverse.
It isn’t that hard for a programing language to convert the rows returned into some other structure. Make it into a tree or a hash set or leave it as a list of rows that you can iterate over.
There is also history at work here. Transferring structured data was something ugly in the days of old. Look at the EDI format to get an idea of what you might be asking for. Trees also imply recursion – which some languages didn’t support (the two most important languages of the old days didn’t support recursion – recursion didn’t enter Fortran until F90 and of the era COBOL didn’t either).
And while the languages of today have support for recursion and more advanced data types, there isn’t really a good reason to change things. They work, and they work well. The ones that are changing things are the nosql databases. You can store trees in documents in a document based one. LDAP (its actually oldish) is also a tree based system (though its probably not what you’re after). Who knows, maybe the next thing in nosql databases will be one that returns back the query as a json object.
However, the ‘old’ relational databases… they’re working with rows because thats what they’re good at and everything can talk to them without trouble or translation.
- In protocol design, perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.
From RFC 1925 – The Twelve Networking Truths
3
It’s returning exactly what you asked for: a single record set containing the Cartesian product defined by the joins. There are plenty of valid scenarios where that’s exactly what you would want, so saying that SQL is giving a bad result (and thus implying that it would be better if you changed it) would actually screw a lot of queries up.
What you’re experiencing is known as “Object/Relational Impedance Mismatch,” the technical difficulties that arise from the fact that the object-oriented data model and the relational data model are fundamentally different in several ways. LINQ and other frameworks (known as ORMs, Object/Relational Mappers, not coincidentally,) don’t magically “get around this;” they just issue different queries. It can be done in SQL too. Here’s how I’d do it:
SELECT * FROM users user where [criteria here]
Iterate the list of users and make a list of IDs.
SELECT * from EMAILS where user_id in (list of IDs here)
SELECT * from PHONES where user_id in (list of IDs here)
And then you do the joining client-side. This is how LINQ and other frameworks do it. There’s no real magic involved; just a layer of abstraction.
5
You could use a built in function to concatenate the records together. In MySQL you can use the GROUP_CONCAT()
function and in Oracle you can use the LISTAGG()
function.
Here is a sample of what a query might look like in MySQL:
SELECT user.*,
(SELECT GROUP_CONCAT(DISTINCT emailAddy) FROM emails email WHERE email.user_id = user.id
) AS EmailAddresses,
(SELECT GROUP_CONCAT(DISTINCT phoneNumber) FROM phones phone WHERE phone.user_id = user.id
) AS PhoneNumbers
FROM users user
This would return something like
username department EmailAddresses PhoneNumbers
Tim_Burton Human Resources [email protected], [email protected], [email protected] 231-123-1234, 231-123-1235
2
The problem with this is that it’s returning the user’s name, DOB, favorite color, and all the other information stored
The problem is that you are not being selective enough. You asked for everything when you said
Select * from...
…and you got it (including DOB and favourite colours).
You probably should been a little more (ahem) …selective, and said something like:
select users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...
It’s also possible you might see records that look like duplicates because a user
might join to multiple email
records, but the field that distinguishes these two is not in your
Select
statement, so you might want to say something like
select distinct users.name, emails.email_address, phones.home_phone, phones.bus_phone
from...
…over-and-over again for each record…
Also, I notice you’re doing a LEFT JOIN
. This will join all records on the left of the join (i.e. users
) to all records on the right, or in other words:
A left outer join returns all the values from an inner join plus all values in the left table that do not match to the right table.
(http://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join)
So another question is do you actually need a left join, or would an INNER JOIN
have been sufficient? They are very different types of joins.
Wouldn’t be nicer if it returned a single row for each user, and within that record there was a list of emails
If you actually want a single column within the result set to contain a list that is generated on-the-fly, that can be done but it varies depending on which database you’re using. Oracle has the listagg
function.
Ultimately, I think that your problem might be solved if you rewrite your query close to something like this:
select distinct users.name, users.id, emails.email_address, phones.phone_number
from users
inner join emails on users.user_id = emails.user_id
inner join phones on users.user_id = phones.user_id
4
Queries always produce a rectangular (un-jagged) tabular set of data. There are no nested sub-sets within a set. In the world of sets everything is a pure un-nested rectangle.
You can think of a join as putting 2 sets side-by-side. The “on” condition is how the records in each set are matched up. If a user has 3 phone numbers, then you’ll see a 3-time duplication in the user info. A rectangular un-jagged set must be produced by the query. It’s simply the nature of joining sets with a 1-to-many relationship.
To get what you want, you must use a separate query like Mason Wheeler described.
select * from Phones where user_id=344;
The result of this query is still a rectanglar un-jagged set. As is everything in the world of sets.
The concept of relational closure basically means that the result of any query is a relation which can be used in other queries as if it was a base table. This is an powerful concept because it makes queries composable.
If SQL allowed you to write queries which output nested data structures, you would break this principle. A nested data structure is not a relation, so you would need a new query language, or complex extensions to SQL, in order to query it further or to join it which other relations.
Basically you would build a hierarchical DBMS on top of a relational DBMS. It will be much more complex for a dubious benefit, and you lose the advantages of a consistently relational system.
I understand why it would sometimes be convenient to be able output hierarchically structured data from SQL, but the cost in the added complexity throughout the DBMS to support this is definitely not worth it.
You have to decide where the bottlenecks exist. The bandwidth between your database and application is usually pretty fast. There’s no reason most databases couldn’t return 3 separate datasets within one call and no joins. Then you get to join it all together in your app if you want.
Otherwise, you want the database to put this dataset together and then remove all the repeated values in each row that are the result of the joins and not necessarily the rows themselves having duplicate data like two people with the same name or phone number. Seems like a lot of over-head to save on bandwidth. You would be better off focusing on returning less data with better filtering and removing the columns you don’t need. Because Select * is never used in production-well that depends.
3
Very simply, don’t join your data if you want distinct results for a user query and a phone number query, otherwise as others have pointed out the “Set” or data will contain extra fields for every row.
Issue 2 distinct queries instead of one with a join.
In the stored procedure or inline parameterized sql craft 2 queries and return the results of both back. Most database and languages support multiple result sets.
For example, SQL Server and C# accomplish functionality this by using IDataReader.NextResult()
.
You are missing something. If you want to denormalize your data, you have to do it yourself.
;with toList as (
select *, Stuff(( select ',' + (phone.phoneType + ':' + phone.PhoneNumber)
from phones phone
where phone.user_id = user.user_id
for xml path('')
), 1,1,'') as phoneNumbers
from users user
)
select *
from toList
0
Pls refer to the usage of STUFF function which groups multiple rows (phone numbers) of a column (contact) that can be extracted as a single cell of delimited values of a row (user).
Today we extensively using this but facing some high CPU and performance issues. XML data type is another option but is a design change not a query level one.
2