It all depends on your data volumes and on your content-to-tag distribution and density ratios.
If you have a low tag distribution and density ratio (typical of human-generated data), you can simply generate a unique ID or hash for each distinct collection of tags in use by the data, then associate that 'tag collection' ID with each data instance that has those tags.
This can work surprisingly well for many forms of human-generated data.
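As a rough sketch of the 'one ID per collection of tags' idea: the helper name `tag_collection_id`, the SHA-1 choice, and the lower-casing of tag names are all assumptions for illustration, not part of the scheme itself. The only real requirement is that the same set of tags always maps to the same ID, whatever order the tags arrive in.

```python
import hashlib

def tag_collection_id(tags):
    """Return a stable ID for a set of tags, independent of order and case."""
    # Normalise, de-duplicate, and sort so the same set always hashes identically.
    # Lower-casing is an assumption; drop it if your tags are case-sensitive.
    canonical = "\x1f".join(sorted({t.strip().lower() for t in tags}))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Hypothetical data instances and their tags.
questions = {
    1: ["python", "sqlite"],
    2: ["sqlite", "Python"],   # same set as question 1, different order and case
    3: ["java"],
}
collection_of = {qid: tag_collection_id(tags) for qid, tags in questions.items()}
```

Questions 1 and 2 end up sharing one 'tag collection' ID, so only two collections exist for three questions.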
e.g. Stack Overflow has ~500,000 questions and ~20,000 tags (too many dupe-ish tags!). Most questions have fewer than five tags. In the worst case you will have 500,000 'tag collection' IDs to associate, but more realistically you will have several thousand.
You will also need either instance tracking or garbage collection on the 'tag collection' collection, as specific combinations of tags fall out of use.
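Instance tracking could look something like the following in-memory sketch. The class `TagCollections` and its `acquire`/`release` methods are hypothetical names, and a `frozenset` stands in for the 'tag collection' ID to keep the example short.

```python
class TagCollections:
    """In-memory stand-in for a 'tag collection' store with an instance count."""

    def __init__(self):
        # frozenset of tags -> number of data instances currently using it
        self.instance_count = {}

    def acquire(self, tags):
        """Register one more data instance using this combination of tags."""
        key = frozenset(tags)
        self.instance_count[key] = self.instance_count.get(key, 0) + 1
        return key

    def release(self, key):
        """Drop one usage; remove the collection once nothing refers to it."""
        self.instance_count[key] -= 1
        if self.instance_count[key] == 0:
            del self.instance_count[key]

store = TagCollections()
a = store.acquire(["python", "sqlite"])
b = store.acquire(["sqlite", "python"])   # same combination, counted twice
c = store.acquire(["java"])
store.release(c)                          # last user gone, collection is removed
```

The alternative, garbage collection, would skip the counter and periodically delete collections that no data instance points at.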
e.g. a schema along these lines:
- Tag: id, tagName
- TagCollection: id, instanceCount
- TagCollectionTag: tagCollectionId, tagId
- Data: id, title, content, tagCollectionId
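One possible way to write those four collections down as relational tables, here using SQLite through Python's `sqlite3` module. The table and column names come from the list above; the column types, the text-hash primary key on TagCollection, and the two extra indexes are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Tag (
    id      INTEGER PRIMARY KEY,
    tagName TEXT NOT NULL UNIQUE
);
CREATE TABLE TagCollection (
    id            TEXT PRIMARY KEY,            -- assumed: hash of the sorted tag names
    instanceCount INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE TagCollectionTag (
    tagCollectionId TEXT    NOT NULL REFERENCES TagCollection(id),
    tagId           INTEGER NOT NULL REFERENCES Tag(id),
    PRIMARY KEY (tagCollectionId, tagId)
);
-- Assumed index: lets a search start from a tag and find its collections.
CREATE INDEX idx_tct_tag ON TagCollectionTag(tagId);
CREATE TABLE Data (
    id              INTEGER PRIMARY KEY,
    title           TEXT,
    content         TEXT,
    tagCollectionId TEXT REFERENCES TagCollection(id)
);
-- Assumed index: lets a search go from collections to data rows.
CREATE INDEX idx_data_collection ON Data(tagCollectionId);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```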
Inserting tags is fast if a hash is used (hash on all tags of the collection). Otherwise you have to search the TagCollection and TagCollectionTag collections, but these should not be too large anyway.
Searching is fast: search TagCollectionTag for the tag collections containing the specific set of tags, then find the data rows with any of those tagCollectionIds.
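The insert and search paths can be sketched together in plain Python. The dictionaries below are in-memory stand-ins for TagCollectionTag and Data, and the function names `collection_id`, `insert`, and `search` are hypothetical. Step 1 of the search is a linear scan here; in a real database it would be an indexed query on TagCollectionTag.

```python
import hashlib
from collections import defaultdict

tag_collection_tags = {}                 # tagCollectionId -> frozenset of tag names
data_by_collection = defaultdict(list)   # tagCollectionId -> list of data ids

def collection_id(tags):
    """Hash on all tags of the collection, independent of tag order."""
    canonical = "\x1f".join(sorted(set(tags)))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def insert(data_id, tags):
    """The hash gives the collection ID directly, so no lookup scan is needed."""
    cid = collection_id(tags)
    tag_collection_tags.setdefault(cid, frozenset(tags))
    data_by_collection[cid].append(data_id)
    return cid

def search(required_tags):
    """Return ids of data instances whose tags include all the required tags."""
    required = set(required_tags)
    # Step 1: find every tag collection containing the whole required set.
    matching = [cid for cid, tags in tag_collection_tags.items() if required <= tags]
    # Step 2: gather the data instances pointing at any of those collections.
    return sorted(d for cid in matching for d in data_by_collection[cid])

insert(1, ["python", "sqlite"])
insert(2, ["python", "sqlite", "orm"])
insert(3, ["java"])
insert(4, ["sqlite", "python"])          # reuses the collection created for id 1
result = search(["python", "sqlite"])
```

Four data instances produce only three tag collections, and the search touches the small collection table first before going to the data.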
Hope that wasn't too confusing :-)