Storing in-text metadata in a discrete data structure

I am developing an application which will need to store inline, in-text metadata. What I mean by that is the following: let’s say we have a long text, and we want to store some metadata connected with a specific word or sentence of the text.

What would be the best way to store this information?

My first thought was to include in the text some kind of Markdown syntax that would then be parsed on retrieval. Something looking like this:

<code>Lorem ipsum dolor sit amet, consectetuer adipiscing elit,

sed diam __nonummy nibh__[@note this sounds really funny latin]

euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

</code>

This would introduce two problems I can think of:

A relatively small one is that if the syntax happens to occur by chance in the text itself, it can mess with the parsing.
The more important one is that this doesn’t keep the metadata separate from the text itself.

I would like to have a discrete data structure to hold this data, such as a separate DB table in which the metadata is stored, so that I could use it in discrete ways: querying, statistics, sorting, and so on.
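As a rough illustration of getting from the inline syntax to a discrete structure, here is a minimal Python sketch. The regular expression and the row layout are assumptions made for this example (the question does not fix an exact grammar); offsets are 0-based and end-exclusive into the stripped text.

```python
import re

# Hypothetical pattern for the syntax above: __marked text__[@type content]
ANNOTATION = re.compile(r"__(?P<span>.+?)__\[@(?P<type>\w+) (?P<content>[^\]]*)\]")

def split_annotations(marked_up):
    """Return (plain_text, rows) with the markup removed from the text."""
    plain_parts = []
    rows = []
    cursor = 0      # position in the marked-up input
    plain_len = 0   # length of the plain text emitted so far
    for m in ANNOTATION.finditer(marked_up):
        before = marked_up[cursor:m.start()]
        plain_parts.append(before)
        plain_len += len(before)
        span = m.group("span")
        rows.append({
            "type": m.group("type"),
            "start": plain_len,
            "end": plain_len + len(span),
            "content": m.group("content"),
        })
        plain_parts.append(span)
        plain_len += len(span)
        cursor = m.end()
    plain_parts.append(marked_up[cursor:])
    return "".join(plain_parts), rows

plain, rows = split_annotations(
    "sed diam __nonummy nibh__[@note this sounds really funny latin] euismod"
)
```

The rows could then be inserted into a table, but the first problem remains: text that happens to contain the same syntax would be parsed as an annotation.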

EDIT: Since the answerer deleted his answer, I think it might be good to add his suggestion here, since it was a workable one that expanded on this first concept. The poster suggested using a similar syntax, but linking the marked text to the PRIMARY KEY of a row in the metadata table.

Something that would look like this:

<code>Lorem ipsum dolor sit amet, consectetuer adipiscing elit,

sed diam __nonummy nibh__[15432]

euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

</code>

Where 15432 would be the ID of a table row containing the necessary, queryable information, as in the example below.

My second thought was to store information of this kind in a DB Table looking like this:

<code>TABLE: metadata

ID    TEXT_ID    TYPE    OFFSET_START    OFFSET_END    CONTENT
1     lipsum     note    68              79            this sounds really funny latin
</code>

In this way the metadata would have a unique id and a text_id as a foreign key to the table storing the texts, and it would be connected to the text itself by a simple character offset range (in the example, counted from 1 with both ends inclusive).
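A minimal sketch of that table using Python's built-in sqlite3 module. The `texts` table and the helper name `annotated_spans` are assumptions for illustration; the row is the one from the example, and SQLite's `substr` is 1-based, which matches the 68..79 range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (id TEXT PRIMARY KEY, body TEXT NOT NULL);
CREATE TABLE metadata (
    id           INTEGER PRIMARY KEY,
    text_id      TEXT NOT NULL REFERENCES texts(id),
    type         TEXT NOT NULL,
    offset_start INTEGER NOT NULL,  -- 1-based, inclusive
    offset_end   INTEGER NOT NULL,  -- 1-based, inclusive
    content      TEXT NOT NULL
);
""")

body = ("Lorem ipsum dolor sit amet, consectetuer adipiscing elit, "
        "sed diam nonummy nibh euismod tincidunt ut laoreet dolore "
        "magna aliquam erat volutpat.")
conn.execute("INSERT INTO texts VALUES (?, ?)", ("lipsum", body))
conn.execute(
    "INSERT INTO metadata VALUES (?, ?, ?, ?, ?, ?)",
    (1, "lipsum", "note", 68, 79, "this sounds really funny latin"),
)
conn.commit()

def annotated_spans(conn, text_id):
    """Return (type, covered text, content) for every annotation of a text."""
    return conn.execute(
        """
        SELECT m.type,
               substr(t.body, m.offset_start,
                      m.offset_end - m.offset_start + 1),
               m.content
        FROM metadata AS m
        JOIN texts AS t ON t.id = m.text_id
        WHERE m.text_id = ?
        ORDER BY m.offset_start
        """,
        (text_id,),
    ).fetchall()
```

With this layout the metadata can be queried, counted and sorted without touching the text at all.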

This would do the trick of keeping the text separate from the metadata, but a problem I can immediately see with this approach is that the text would be fundamentally not editable. Or, if I wanted to allow editing of the text after metadata has been assigned, I would basically have to calculate which characters were added or removed compared to the previous version, and check whether each of these modifications falls before or after each of the associated metadata ranges.
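The bookkeeping described here can be sketched as follows. This is only an illustration under assumptions: spans are 0-based and end-exclusive, an edit is described as "remove `removed` characters at `edit_pos`, then insert `inserted` characters there", and the function name is made up.

```python
def shift_span(start, end, edit_pos, removed, inserted):
    """Adjust an annotation span for one edit.

    Returns the new (start, end), or None when the edit touches the
    annotated text itself and a person has to decide what to do.
    """
    edit_end = edit_pos + removed
    delta = inserted - removed
    if edit_end <= start:
        # The edit lies entirely before the span: shift both ends.
        return start + delta, end + delta
    if edit_pos >= end:
        # The edit lies entirely after the span: nothing to do.
        return start, end
    # The edit overlaps the annotated range.
    return None
```

Every stored range of the edited text has to go through this for every edit, which is the cost the paragraph above is worried about.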

Which, to me, sounds like a really inelegant approach.

Do you have any pointers or suggestions for how I could approach the problem?

Edit 2: some XML problems

Adding another case which makes this separation of data and metadata quite necessary.

Let’s say I want to make it possible for different users to have different metadata sets for the same text, with or without the possibility of each user displaying the other users’ metadata.

Any solution of the Markdown kind (or HTML, or XML) would be difficult to implement at this point. The only solution I can think of in this case would be yet another DB table containing each user’s version of the original text, connected to the original text table by a FOREIGN KEY.

Not sure if this is very elegant either.

XML has a hierarchical data model: any element that lies within the borders of another element is considered its child, which is most often not the case in the data model I’m looking for. In XML, any child element must be closed before its parent can be closed, so elements cannot overlap.

Example:

<note content="the beginning of the famous placeholder"> Lorem ipsum
dolor sit <comment content="I like the sound of amet/elit"> amet </note>,
consectetuer adipiscing elit </comment> , <note content="adversative?"> sed
diam <note content="funny latin"> nonummy </note> nibh </note> euismod
tincidunt ut laoreet dolore magna aliquam erat volutpat.

Here we have two different problems:

Different elements overlapping: The first comment starts within the first note, but ends after the end of the first note, i.e. it’s not its child.
Same elements overlapping: The last two notes (“adversative?” and “funny latin”) overlap; however, since they are the same kind of element, the parser would pair the first closing tag with the most recently opened element and the last closing tag with the first opened element, which, in this circumstance, is not what is intended.
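The first problem is easy to confirm with a standard parser. A small sketch using Python's xml.etree.ElementTree; the wrapping `<text>` element and the shortened attribute values are assumptions added so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# A <comment> that starts inside a <note> but ends after it.
overlapping = (
    '<text><note content="a">Lorem ipsum dolor sit '
    '<comment content="b">amet</note>, consectetuer</comment></text>'
)
nested = '<text><note content="a">Lorem <comment content="b">ipsum</comment></note></text>'

def is_well_formed(xml_string):
    """True if the string parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False
```

The overlapping fragment is rejected outright, while the second problem (same-named elements) is worse in a way: it parses without complaint, just not with the intended ranges.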

I’d go for a mix of your solutions, but I’d use a standard instead: XML. You’d have a syntax like this one:

<code>Lorem ipsum dolor sit amet, consectetuer adipiscing elit,

sed diam <note content="It sound really funny in latin">nonummy nibh</note>

euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

</code>

Why XML

If you think about it, it’s exactly how the whole web is structured: content (actual text) that carries semantics, what you’re calling metadata, through HTML tags.

This way a really cool world opens up:

Free parser
Battle tested way to add metadata to content
Ease of use (depending on which users you are targeting)
You can easily extract the raw text, without the metadata, as it’s a standard feature of XML parsers. That is very useful for having an indexable version of your content, so that Lorem <note>ipsum</note> is found when you search for lorem ips*, for example.
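For instance, with Python's standard xml.etree.ElementTree the tag-stripped text and the notes can be pulled out in a few lines. This is a sketch only: the helper names are made up, and the fragment is wrapped in a `<root>` element because a parser needs a single root.

```python
import xml.etree.ElementTree as ET

def strip_tags(xml_fragment):
    """Return the raw text of a fragment with all markup removed."""
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    return "".join(root.itertext())

def list_notes(xml_fragment):
    """Return (annotated text, note content) for each <note> element."""
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    return [("".join(n.itertext()), n.get("content")) for n in root.iter("note")]

fragment = ('sed diam <note content="It sound really funny in latin">'
            'nonummy nibh</note> euismod')
```

The stripped string is what you would hand to a full-text index.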

Why XML over Markdown

A website like Stack Exchange uses Markdown because the semantics its content conveys are rather basic: emphasis, links/URLs, images, headers, etc. It seems the semantics you’re adding to your content are

More complex
Subject to change or must be extensible

Thus I sense Markdown wouldn’t be a really good idea. Also, Markdown isn’t really standardized, and parsing/dumping it might be a pain in the ass, even more so for a Markdown-ish syntax of your own; see Jeff Atwood’s post about the WTFs he met while parsing Markdown.

On separation between data and metadata

Per se, such separation isn’t mandatory. I assume you are looking for the advantage it brings:

Possibility to have the raw content without the metadata
Separation of concerns: I don’t want side effects or complexity overhead when manipulating the metadata because of the data, and vice versa.

All these concerns are addressed by the use of XML. From the XML, you can easily dump the tag-stripped content, and data and metadata are separated, just as attributes and actual text are separated in XML.

Also, I don’t think you can really have your metadata totally unbound from your data. From what you describe, your metadata are in a composition relationship with your data, i.e. deleting the data leads to deletion of the metadata. This is where your metadata diverge from the usual HTML/CSS. CSS doesn’t disappear when an HTML element is removed, because it can be applied to other elements. I don’t feel this is the case with your metadata.

Having metadata close to the data, as in XML or Markdown, allows easy understanding (and maybe debugging) of the data. Also, the example you give in your second thought adds some complexity, because for each piece of data I’m reading, I need to query the metadata table to get its metadata. If the relation between your data and your metadata is 1:1 or 1:N, then it’s IMO clearly useless and only brings complexity (a good case of YAGNI).

The Solution Use Case

I disagree with some of the other answers, simply because, while they are great solutions, they are probably not your solution. Yes, XML has the word markup in its acronym, but it is probably not ideal for your situation. It is way too complex, and it offers little assistance in keeping the metadata separate from the original text. Essentially it will turn everything into a form of metadata, creating one overweight data set.

Since there is likely no absolutely correct solution or approach, the best solution answers the question:

How will the data be used by the system?

Also, if you ask how a solution design could inherently add to the value of the system in the way that it will be used, then you are closer to finding your elegant answer.

Understanding the problem

OK, enough commentary, let’s dig into the problem.
This is the problem as I understand it (obviously, adding to this will be beneficial):

There is an original text
- Assumptions about this original text:
- This text may or may not be made up of several independent documents
- This text may or may not be edited by one or more users
- This text contains related information. By that I am assuming (correct me if I am wrong) that the metadata is related and not descriptive: it stores information related to the original text, not information that describes the text. So it will store notes about the original text and not, for example, describe that the text is a heading that is bold and is a link to a website, etc.
- The text should be easy to filter apart from the metadata
- The text should be protected from being corrupted by, and from corrupting, the metadata
There should be a means of storing information related to the original text (metadata)
- This metadata also needs its own (meta)metadata, which would hold information such as which users (or groups?) the metadata is relevant for, or a description of the metadata, say whether it is a note, a comment, a description, etc.
- This metadata (and its (meta)metadata) needs to withstand alterations to the original text, to the metadata and to the (meta)metadata
- The metadata (+ meta-metadata) needs to be well structured and easily queried, indexed, or even joined in a relational way to other datasets. The relational nature of the metadata should not be limited to queries, but should also facilitate updates, write-back and alteration of the metadata as a result of relational data activities.
- The value of the metadata (+ meta-metadata) is in its very related nature. It becomes counterproductive the moment it loses its relation to the original text. Thus the integrity of its relation to the original text is a mandatory design imperative.
Other assumptions about the nature of the problem and how it will be used are:
- Concurrent heterogeneous system access. That is to say that users may wish to view the text and edit metadata at the same time as the administrator (or another process) is performing relational data queries on the structured metadata.
- The system will have several users
- The system is modern. That is to say that it is not constrained by storage space, processing speed, or real-time imperatives. Integrity and purpose-focused functionality are a higher priority than physical computing resource limitations.
- There is an (albeit low) chance that the uses and functionality of the system may evolve or change somewhat as the system is used.

Building the solution design

Understanding the problem as I have outlined it above, I will now suggest possible solutions and approaches that aim to solve it.

Components

So I would see a need for a custom-built user access system. It would filter relevant and irrelevant metadata from the original text. It would facilitate the editing and viewing of metadata within the text. It would ensure the integrity of the relationship between the metadata and its original text. It would structure the metadata and offer a data source to a relational data system. It would most likely provide a host of other purpose-driven functions.

Structure

Since it is important to keep the metadata’s integrity with respect to the original text, the best way of ensuring this is to keep the metadata inline with the original text. This offers the benefit that the original data can be confidently edited without breaking this integrity.

The concerns with this approach are: corruption of the metadata by the original data and vice versa; adequate indexing and structuring of the metadata and its (meta)metadata in a way that allows for queries, updates and efficient access; and easy filtering of the metadata from the original text.

With this in mind, I would suggest that a portion of the solution be based on the approach of using ESCAPE CHARACTERS within the original text. This is not the same as designing your own Mark-up Language or using an existing Markup Language such as XML or HTML. It is easy to design an ESCAPE CHARACTER that has a zero, or near zero chance of existing in the original text.

My advice to you in this regard would be to carefully consider the original data, try to determine the nature of the code page it is stored in, and then look for an ideal CHARACTER or CHARACTER SEQUENCE that is unlikely or impossible to occur. For example, ASCII has built-in control characters with byte values that practically never appear in ordinary text. The same can be said for a font-based or relational-data-based information system. Just be careful with binary data codecs. Depending on the nature of the original data, it may be valuable to build a parser that confirms the discovery of a control sequence, perhaps by looking at the escaped data and verifying its integrity, either with a simple inspection of the structure of the escaped data, or even by including a control (check) character that is calculated for each escaped data sequence.

Example Data With Escape Sequences

This is a story of a man. >>>>(#)Why is this story about a man not a woman?(#)()userid::77367()Manager’s Comment()DataID::234234234>>>> A man who went to mow a meadow, went to mow a meadow. The man went with his dog >>>>(#)Ask the client if the story would be better with a cat instead(#)>>>> to mow the meadow. So now this is the story of a man and his dog who went to mow a meadow.

One man and his dog, went to mow a meadow, went to mow a meadow, a meadow reached over the mountain. >>>>(#)This sounds alot better with a forest(**)Suggestion Note(#)>>>>

The man and his dog and his mission, to mow a meadow, a meadow reached over the mountain is only reached when crossing the river.

Example Data Without Escape Sequences

This is a story of a man. A man who went to mow a meadow, went to mow a meadow. The man went with his dog to mow the meadow. So now this is the story of a man and his dog who went to mow a meadow.

One man and his dog, went to mow a meadow, went to mow a meadow, a meadow reached over the mountain.

The man and his dog and his mission, to mow a meadow, a meadow reached over the mountain is only reached when crossing the river.

Obviously this is easily parsed, not as complex as an entire markup language, and easily adaptable to your purpose.
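To give an idea of how little parsing this needs, here is a sketch in Python. The grammar is an assumption read off the sample above (a `>>>>` delimiter on both sides, the note text between two `(#)` markers, optional extra fields after it); whitespace in front of a sequence is swallowed so that the stripped text reads cleanly.

```python
import re

# Assumed shape: >>>>(#)note text(#)optional fields>>>>
ESCAPED = re.compile(
    r"\s*>>>>\(#\)(?P<note>.*?)\(#\)(?P<fields>.*?)>>>>", re.DOTALL
)

def strip_escaped(text):
    """Return the original text with every escaped sequence removed."""
    return ESCAPED.sub("", text)

def extract_notes(text):
    """Return the note text of every escaped sequence, in document order."""
    return [m.group("note") for m in ESCAPED.finditer(text)]

sample = ("The man went with his dog >>>>(#)Ask the client if the story "
          "would be better with a cat instead(#)>>>> to mow the meadow.")
```

A real implementation would pick a delimiter that cannot occur in the text, as suggested above, rather than a printable sequence like `>>>>`.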

Solved Yet?
Well, I would say no. Our solution still has some holes. The indexing and structured access of this data is poor. Also, it would not be reasonable to query this file (or several files) at the same time as editing it.

How could we solve that problem?

I would suggest a DATA ALLOCATION TABLE as a document header. I would also suggest implementing a TRANSACTIONAL TABLE UPDATE QUEUE. Let me explain. The designers of file systems, particularly rotational-disk file systems, faced design challenges similar to the ones you have described. They needed to embed information about the files on the disk along with the data. A great solution for the relationship integrity of this data was to DUPLICATE it in a File Allocation Table (FAT).

This means that for each individual Metadata Item, there is a corresponding entry in the Data Allocation Table. So it is fast, structured and relational, and independent of the original data. If queries or joins or updates need to be performed on the metadata, then it is easily done by simply accessing the Data Allocation Table.

Obviously care must be taken to ensure that the in-line metadata is a true reflection of the Data Allocation Table data. That is where a Transactional Table Update Queue comes in. Every change, addition or removal of metadata is made not on the data itself, but on the queue. The queue will then ensure that either all the changes are made to both the in-line and table data, or no change is made at all. It also allows asynchronous updates to be performed; for example, all the metadata of a certain user can be deleted by running a delete command on the queue. If the in-line metadata is locked and in use, the queue will not perform any changes until it can apply them to both the table data and the in-line data.
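The all-or-nothing part can be sketched with a single database transaction standing in for the queue. This is a simplification under assumptions: the text and the allocation table both live in one SQLite database, the table and function names are invented for the example, and the marker format is borrowed from the sample data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (id TEXT PRIMARY KEY, body TEXT NOT NULL);
CREATE TABLE allocation (
    note_id INTEGER PRIMARY KEY,
    text_id TEXT NOT NULL REFERENCES texts(id),
    content TEXT NOT NULL
);
""")
conn.execute("INSERT INTO texts VALUES (?, ?)",
             ("story", "This is a story of a man."))
conn.commit()

def add_note(conn, text_id, note_id, position, content):
    """Write the in-line marker and the table row together, or neither."""
    marker = f">>>>(#){content}(#)()DataID::{note_id}>>>>"
    with conn:  # commits on success, rolls back if any statement raises
        (body,) = conn.execute(
            "SELECT body FROM texts WHERE id = ?", (text_id,)
        ).fetchone()
        new_body = body[:position] + marker + body[position:]
        conn.execute("UPDATE texts SET body = ? WHERE id = ?",
                     (new_body, text_id))
        conn.execute("INSERT INTO allocation VALUES (?, ?, ?)",
                     (note_id, text_id, content))

def current_body(conn, text_id):
    return conn.execute(
        "SELECT body FROM texts WHERE id = ?", (text_id,)
    ).fetchone()[0]
```

If the table insert fails (for example because the note ID already exists), the change to the text is rolled back as well, so the two copies cannot drift apart. A real queue would add ordering, retries and deferred execution on top of this.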

This is a typical kind of engineering question in that all your options have different tradeoffs, and which is best depends on what is important to you. Unfortunately, you haven’t given enough information to make the determination.

You also don’t appear to have considered an important semantic problem. Let’s say the original text is

My friend Bob lent me five dollars

Someone adds a comment around “Bob” saying

Bob is a complete idiot

Then the original text is edited to

Jane lent Bob five dollars which he later lent to me

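You might make some sense of this particular case using a text-matching algorithm such as the one used to produce a diff, but raw character offsets simply keep pointing at the same positions, whatever text now occupies them (in this edit they still hit “Bob” only by coincidence, since both versions have exactly ten characters in front of it).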

Worse is if the text is edited to

My friend Steve lent me five dollars

You could manage to figure out how to attach the metadata to “Steve”, but how do you know if it applies?
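A diff-based re-anchoring step can be sketched with Python's difflib. The function name and the policy (keep the annotation only if its text survived unchanged, otherwise report it as lost) are assumptions for illustration; spans are 0-based and end-exclusive.

```python
from difflib import SequenceMatcher

def remap_span(old, new, start, end):
    """Map old[start:end] into new, or return None if that text was edited."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        # Only an unchanged block that fully contains the span is trusted.
        if tag == "equal" and a0 <= start and end <= a1:
            return b0 + (start - a0), b0 + (end - a0)
    return None

old = "My friend Bob lent me five dollars"
moved = remap_span(old, "My old friend Bob lent me five dollars", 10, 13)
lost = remap_span(old, "My friend Steve lent me five dollars", 10, 13)
```

Here `moved` follows “Bob” to its new position, while `lost` is `None` because “Bob” was replaced; applying the raw offsets 10..13 to the “Steve” sentence would select “Ste”. Whether a comment about Bob should carry over to Steve is a question no algorithm can answer, so a lost anchor is probably best shown to a person.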

Also, have you decided if the metadata itself can have metadata? That might change your implementation.

Beyond semantic issues, it isn’t very clear what you are doing with the data. I thought perhaps it was very inconvenient to have the original text “polluted” with any markup, but then you were sort of OK with having ID values in it, which doesn’t make a lot of sense if the metadata applies to a section of text instead of being inserted at a point in the text.

My guess is that for most purposes storing marked-up text is easier, or, as a second choice, going all SQL and having the text and markup represented by a node hierarchy, basically a DOM in table form. If your data is hierarchical, then it may be easier to use XML and get existing parsers for free, versus writing your own.

It is quite possible that there is some fairly simple solution that is good enough for your exact situation, but I can’t tell you what that is because it really depends on just what you are trying to do, in detail.

I would strongly suggest you encapsulate whatever strategy you choose as much as you can, though this is fairly hard to do if much of your implementation needs to be visible to many SQL queries.

Sorry that the reply is so scattered and so full of “it depends”, but real world design questions are like that.

I think the suggestion from the previous answerer (the one you mention in your question) is a very good one.

It would behave the same way we post links on the Stack Exchange sites, but the note data would be in another table. The benefit is that you have the data separated, and therefore queryable and indexable. When the text is edited, you can check for deleted metadata IDs and clean up the metadata table.
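The clean-up step on edit comes down to a set difference. A minimal sketch, assuming the `__text__[ID]` syntax from the question and a hypothetical helper name:

```python
import re

# Assumed reference syntax from the question: __marked text__[15432]
ID_REF = re.compile(r"__.+?__\[(\d+)\]")

def orphaned_ids(edited_text, stored_ids):
    """Return the metadata IDs that no longer appear in the edited text."""
    referenced = {int(n) for n in ID_REF.findall(edited_text)}
    return set(stored_ids) - referenced

edited = "sed diam __nonummy nibh__[15432] euismod tincidunt"
```

The returned IDs are the rows that can be deleted (or archived) from the metadata table after the save.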

The only small problem, like you said, is the parsing, but you can deal with it pretty easily.

Let’s say I have a text:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

I add the note like this:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam [@123,#456,2w]nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

[@123,#456,2w] means: user_id=123, note_id=456, and the text marked by this note spans the next 2 words (the unit could be chars (c), sentences (s), paragraphs (p) or whatever). The exact syntax may be different, of course.

In plain text editors, the notes’ text can easily be stored at the end of the document, just like Markdown footnotes.

In rich text editors this kind of note can be displayed in the text as an icon, and the marked text can be highlighted in some way. The user can then delete such notes just like normal characters with Del or Backspace, and edit them in some kind of special editing mode. I imagine resizing noted areas with the mouse and editing note text in a popup window.
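A sketch of how such markers could be resolved in Python. Only the character (c) and word (w) units are handled, the exact grammar is an assumption based on the example, and spans are 0-based and end-exclusive into the marker-free text.

```python
import re

# Assumed marker shape: [@user_id,#note_id,<count><unit>]
MARKER = re.compile(
    r"\[@(?P<user>\d+),#(?P<note>\d+),(?P<count>\d+)(?P<unit>[cw])\]"
)

def resolve_markers(text):
    """Return (plain_text, notes); each note records the span it covers."""
    plain_parts, pending = [], []
    cursor = plain_len = 0
    for m in MARKER.finditer(text):
        chunk = text[cursor:m.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        pending.append((plain_len, m))  # marker position in the plain text
        cursor = m.end()
    plain_parts.append(text[cursor:])
    plain = "".join(plain_parts)

    notes = []
    for start, m in pending:
        count = int(m.group("count"))
        if m.group("unit") == "c":
            end = min(start + count, len(plain))
        else:  # "w": the next `count` whitespace-separated words
            words = list(re.finditer(r"\S+", plain[start:]))[:count]
            end = start + words[-1].end() if words else start
        notes.append({
            "user_id": int(m.group("user")),
            "note_id": int(m.group("note")),
            "start": start,
            "end": end,
            "text": plain[start:end],
        })
    return plain, notes

plain, notes = resolve_markers("sed diam [@123,#456,2w]nonummy nibh euismod")
```

Because spans are computed on the marker-free text, two notes whose ranges intersect do not interfere with each other.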

Pros:

Goes nicely with “intersections”, since you mark an offset (implicitly, by the note’s position in the text) and a length for each note.
Supports a multiuser environment. (In fact, this needs some deeper research and you’d probably have to deal with something like Google Wave’s operational transformations, which my brain cannot handle.)
Can be edited with both rich and plain text editors.
You can easily handle revisions, since all markers are in place: when you edit the text before a marker, the marker just shifts along with the rest of the text.
Easy to parse.
No need for external DB, but you can still use one if you want.
Can be mixed with Markdown or XML if you choose some unobtrusive syntax.

Cons for plain text editing:

You can’t see the areas of text marked with notes (unless you highlight the plain text, which is an option too), only the places where notes start. This is compensated for by the ability to choose arbitrary length units: chars, words, sentences, paragraphs.
You can edit the text under a note without noticing, especially if a note spans a long range (e.g. 2+ paragraphs). This can be compensated for by a revision control mechanism which compares the text under each note with its previous version and notifies the user if it has changed.

General cons:

Trouble with multiple users editing the same text, but I think that is unavoidable anyway. I’m not an expert in this field.

Filed under: softwareengineering - @ 17:40

Tags: concepts, data-structures, markup, separation-of-concerns
