I’m implementing tags into my question and answer application. Users will be limited to the tags available by the app, so they can’t choose their own. So maybe only 30-50 tags will be available. This could change in the future though but for now that’s the deal.
I’m considering two options:
Option 1:
questions table: id, title, etc
tags table: id, name, question_id
Here the questions-tags relationship is questions has many tags, and tags belongs to questions. This way, only one join is required. However, this tags table may have many duplicates if more than one user chooses e.g. “hotels” tag. In which case, when fetching a list of tags I’ll do a GROUP BY in the SELECT statement. This way, only one table join is required.
Options 2:
questions table: id, title, etc
questions_tags table: id, question_id, tag_id
tags table: id, name
This way will eliminate duplication I guess, so if two users select “hotels” tables then “hotels” isn’t stored twice.. but, it requires two joins: questions (join) questions_tags (join) tags.
From a performance point of view (and without having to spend time to create tests using thousands of questions/ tags for both cases), is one preferred over the other? Is GROUP BY with only one table join generally going to be faster than two table joins where there may be thousands of questions, each with a few tags of their own?
I would go with Option #2 to eliminate the duplication, as you mentioned. I am not sure of the performance implications of joins and grouping here but I do believe it’s a cleaner representation of the data.
That said, I would consider using a graph database like Neo4J or Titan. It simplifies querying these types of relationships.