TLDR
The following is a scenario in which I’m taking a set of raw documents that represent a graph data structure and creating transformed documents, with related pieces of data interleaved into each document.
I apologize for the length of this post. The questions are at the very bottom if you’d rather skim the post and jump straight to them.
I imagine something like a graph database, or some kind of NoSQL database that makes it easy to pivot the data on certain fields, would be useful here. However, I’m looking for a low-cost solution: I don’t want to permanently store the data in tables somewhere if that means incurring cost while the data mostly sits idle. The only things I need to be permanent are the transformed documents, which currently amount to roughly 5–10 GB of data. I am open to various tools/databases as long as they’re low-cost and well-suited to this scenario.
I want to be able to create these transformed documents on a daily basis, and I’d like to cache (memoize) the results unless a raw document has changed. (Caching/memoizing isn’t really part of the question; it just describes the scenario.)
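Since the memoization only needs to detect whether a raw document has changed, one simple scheme (a sketch, not part of the question itself) is to key the cache on a content hash of each raw document. The function name and the assumption that raw documents are JSON-like dicts are mine:

```python
import hashlib
import json

def content_hash(raw_doc: dict) -> str:
    """Stable hash of a raw document, usable as a memoization key:
    re-transform only when the hash differs from the stored one.
    sort_keys makes the hash independent of key order."""
    canonical = json.dumps(raw_doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The stored hash per raw document could live alongside the transformed output; a mismatch triggers re-transformation.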
Details
I am creating a set of documents, all of whose data can be represented as a graph. Each document needs to contain all of the metadata for one parent data type, plus a subset of the metadata from the other parent types associated with that parent node.
There are three parent types of data:
- Person
- Post
- Vote
I need to create documents for Person items and Post items.
One person can cast many votes and can create many posts.
One post can belong to more than one person, and can have many votes associated with it.
One vote is associated with one person and one post.
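The relationships above can be sketched as a minimal data model; the class and field names are assumptions drawn from the example documents below, not part of any existing schema:

```python
from dataclasses import dataclass

# Hypothetical minimal model of the three parent types.
@dataclass
class Person:
    id: str
    name: str
    kind: str
    location: str

@dataclass
class Post:
    id: str
    title: str
    body: str
    person_ids: list  # a post can belong to more than one person

@dataclass
class Vote:
    id: str
    person_id: str  # one vote is cast by exactly one person
    post_id: str    # one vote is attached to exactly one post
    vote: str       # e.g. "yes" / "no"
```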
Each Person document needs to contain all of the votes cast by that person, and each vote needs to include a subset of metadata about the post it relates to.
Here is an example of a transformed Person document.
{
  "id": "1d38bccd",
  "name": "batman",
  "kind": "hero",
  "location": "gotham",
  "posts": [
    {
      "id": "745a1cb5",
      "title": "Where is my cape?",
      "body": "Have you seen my cape? Where did it go?"
    },
    {
      "id": "3ebcc03a",
      "title": "Seriously guys where is my cape?",
      "body": "If you find it please return it to the lost and found"
    }
  ],
  "votes": [
    {
      "id": "a6203080",
      "postId": "9d8122b8",
      "vote": "yes"
    }
  ]
}
Here is an example of a transformed Post document.
{
  "id": "745a1cb5",
  "title": "Where is my cape?",
  "personId": "1d38bccd",
  "personName": "batman",
  "body": "Have you seen my cape? Where did it go?",
  "votes": [
    {
      "id": "daea1243",
      "personId": "cac06433",
      "vote": "no"
    },
    {
      "id": "3c1e61e4",
      "personId": "3a88ad35",
      "vote": "yes"
    }
  ]
}
I am starting off with some raw documents:
- A single document that contains all of the metadata about all Person items except the Votes that they have cast and the Posts that they have made.
- A document for every single Post that was made. Each of these documents has the ID(s) of the Person items who created it.
- A document for each Person that contains all Votes cast by that Person item.
Proposed Solutions
I can think of a couple of approaches to creating the transformed documents.
- Scan through all raw documents and create/enrich the transformed documents as we go.
Ex: Go through the raw Person document and generate the individual transformed Person documents. Then go through the raw Votes-Per-Person documents and enrich the transformed Person documents with those votes. Then go through the raw Post documents and create the transformed Post documents; at the same time, re-scan the Person documents to enrich them with metadata about the posts they created, and to enrich each of their votes with metadata about the related post. Finally, re-scan the Votes-Per-Person documents to enrich the now-created Post documents with their associated votes.
- Scan through all of the raw documents and build a map from each item’s ID to the IDs of the items associated with it.
[
  {
    "type": "vote",
    "id": "348c9892",
    "associated": ["d0bc4af7", "6b319386", "a480f98c", "73a7537c"]
  }
]
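The first (multi-pass) approach can be sketched roughly as follows. This is only a sketch: the raw-document field names (`personIds`, `postId`, etc.) are assumptions based on the examples above, and plain in-memory dicts stand in for whatever store actually holds the documents.

```python
def transform(raw_people, raw_posts, votes_per_person):
    """Multi-pass scan-and-enrich sketch. raw_people is a list of person
    records, raw_posts a list of post records, and votes_per_person a
    dict mapping person ID -> list of that person's votes (assumed shapes)."""
    # Pass 1: one transformed Person document per raw person record.
    people = {p["id"]: {**p, "posts": [], "votes": []} for p in raw_people}

    # Pass 2: attach each person's votes.
    for person_id, votes in votes_per_person.items():
        people[person_id]["votes"].extend(votes)

    # Pass 3: create transformed Post documents, and enrich each creator's
    # Person document with a subset of the post metadata.
    posts = {}
    for post in raw_posts:
        posts[post["id"]] = {**post, "votes": []}
        for pid in post["personIds"]:
            people[pid]["posts"].append(
                {"id": post["id"], "title": post["title"], "body": post["body"]}
            )

    # Pass 4: re-scan the votes to enrich posts with their associated votes.
    for person_id, votes in votes_per_person.items():
        for vote in votes:
            posts[vote["postId"]]["votes"].append(
                {"id": vote["id"], "personId": person_id, "vote": vote["vote"]}
            )
    return people, posts
```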
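The second approach’s association map might be built in a single pass over the raw documents, along these lines; again the field names are assumed from the examples, and sets stand in for the `associated` lists:

```python
from collections import defaultdict

def build_association_map(raw_posts, votes_per_person):
    """One pass over the raw documents, producing a map of
    item ID -> set of IDs of associated items (assumed input shapes)."""
    assoc = defaultdict(set)
    # Posts <-> the people who created them.
    for post in raw_posts:
        for person_id in post["personIds"]:
            assoc[post["id"]].add(person_id)
            assoc[person_id].add(post["id"])
    # Votes <-> the one person who cast them and the one post they target.
    for person_id, votes in votes_per_person.items():
        for vote in votes:
            assoc[vote["id"]].update({person_id, vote["postId"]})
            assoc[person_id].add(vote["id"])
            assoc[vote["postId"]].add(vote["id"])
    return assoc
```

A second pass could then walk this map to assemble each transformed document from its associated items.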
Questions
How would a graph database handle a request for a parent datum and enriching it with some metadata from associated nodes?
Are there other more elegant ways to transform these documents?
Are there any low-cost tools that could make doing something like this easy?
Are there any books/videos/resources that go over this type of problem and how best to approach it?
Is JSON-LD useful for something like this? It seems to be geared more toward web use cases, but would it have any utility here?