I’m struggling to put into words what I’m trying to do, so I’m hoping the examples suffice.
Foreword:
None of this is mission critical in any way; I’m just curious whether there’s yet another way to accomplish a task. I’m on a “take everything you know how to do and convert it to set-based queries” kick; it is SQL, after all. One example is improving the basic functions you’d create for cleansing non-printable characters from input strings. We’ve all been there: a function with a loop that replaces each character below CHAR(32) will suffice. Last year I learned to use Tally tables to split the strings into individual characters, select the characters I want to keep, and STRING_AGG them back together. It sounds wildly unintuitive, but splitting 500k rows of strings into individual characters, filtering them, and aggregating them back together was much faster than the loop, which shocked me. I’m sure the result also depends on the compute and memory available.
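For readers who haven’t seen the Tally-table approach, here is a minimal sketch of the idea rather than the exact function I use. The table variable, column names, and sample values are made up for illustration, and it assumes SQL Server 2017 or later for STRING_AGG.

```sql
-- Hypothetical sample input; table and column names are assumptions.
DECLARE @Input TABLE (ID int PRIMARY KEY, RawString varchar(100));
INSERT INTO @Input (ID, RawString)
VALUES (1, 'Hello' + CHAR(9) + 'World' + CHAR(10)),
       (2, CHAR(13) + 'Clean me');

WITH Tally AS (
    -- Inline tally: the numbers 1..100, enough to cover varchar(100).
    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM sys.all_objects
)
SELECT i.ID,
       -- Reassemble the surviving characters in their original order.
       STRING_AGG(SUBSTRING(i.RawString, t.N, 1), '')
           WITHIN GROUP (ORDER BY t.N) AS CleanString
FROM @Input AS i
JOIN Tally AS t
  -- DATALENGTH (not LEN) so trailing spaces are kept; this assumes a
  -- single-byte varchar collation, where one byte equals one character.
  ON t.N <= DATALENGTH(i.RawString)
WHERE ASCII(SUBSTRING(i.RawString, t.N, 1)) >= 32  -- drop control characters
GROUP BY i.ID;
```

With the sample rows above this should return 'HelloWorld' for ID 1 and 'Clean me' for ID 2. One caveat of the sketch: a string made up entirely of control characters drops out of the result, because no rows survive the WHERE clause for that ID.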
Anyhow, let’s pretend I have a few rows like:
[ID], [string], [exprNum], [search_expression]
1, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, 0, ‘Brown’
1, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, 1, ‘Jump’
1, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, 2, ‘Over’
2, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, 0, ‘Quick’
2, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, 1, ‘F’
2, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, 2, ‘Lazy’
I knew it wasn’t going to work because REPLACE isn’t an aggregate function, but I tried it anyway just for kicks:
SELECT [ID]
,[string]
,[NewString] = REPLACE([string], [search_expression], '')
FROM wherever
GROUP BY [ID]
,[string]
The expected result would be (note: Stack Overflow is eating the leftover double spaces we’d expect, so I replaced them with double underscores):
1, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, ‘The Quick__Fox s__the Lazy Dog’
2, ‘The Quick Brown Fox Jumps Over the Lazy Dog’, ‘The__Brown ox Jumps Over the__Dog’
But it got the gears grinding, and I’m wondering whether there is a trick to doing something like this, outside of the usual suspects:
- Cursor
- While
- Dynamic SQL to pivot the values into [0],[1],[2] columns and nest 3:N replaces
- Recursion
- Undo what I did to get the [search_expression] that way in the first place, etc.
- CLR function
- This isn’t what SQL is for
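For reference, the recursion option from the list above might look roughly like the sketch below. It is only an illustration of the baseline I’m trying to get away from; the table variable @Src and the varchar lengths are assumptions, and it relies on [exprNum] starting at 0 and having no gaps within each [ID].

```sql
-- Hypothetical copy of the sample rows; names and lengths are assumptions.
DECLARE @Src TABLE (
    ID int,
    string varchar(100),
    exprNum int,
    search_expression varchar(20)
);
INSERT INTO @Src (ID, string, exprNum, search_expression)
VALUES (1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Brown'),
       (1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'Jump'),
       (1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Over'),
       (2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Quick'),
       (2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'F'),
       (2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Lazy');

WITH Steps AS (
    -- Anchor: apply expression 0 to the original string.
    SELECT ID, string, exprNum,
           CAST(REPLACE(string, search_expression, '') AS varchar(100)) AS NewString
    FROM @Src
    WHERE exprNum = 0
    UNION ALL
    -- Recursive step: apply the next expression to the previous result.
    SELECT s.ID, s.string, s.exprNum,
           CAST(REPLACE(p.NewString, s.search_expression, '') AS varchar(100))
    FROM Steps AS p
    JOIN @Src AS s
      ON s.ID = p.ID
     AND s.exprNum = p.exprNum + 1
)
SELECT ID, string, NewString
FROM (
    -- Keep only the last step per ID, i.e. the fully replaced string.
    SELECT ID, string, NewString,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY exprNum DESC) AS rn
    FROM Steps
) AS x
WHERE rn = 1;
```

Because the replacements are applied one after another, their order can matter in general: removing one expression may create or destroy a match for a later one. The default recursion limit of 100 also caps the number of expressions per [ID] unless OPTION (MAXRECURSION n) is added.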
It’s somewhat of an “invert selection”: select whatever is left over in the [string] once the words in the [search_expression] rows have been removed.
Again, I’m well aware that I should be using another tool, this isn’t what SQL is for, and if someone has a trick that works it likely isn’t as performant, etc., I just had the idea and am wondering if it’s possible.
The real goal is somewhat of a string similarity function. My brain got stuck on the REPLACE, and I started wondering whether someone more creative than I am had a quick STUFF + FOR XML PATH type of idea. If I take the matching words out of the two strings, strip all non-alphanumeric characters, dedupe the whitespace, etc., what percentage of the characters is left?
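To make the similarity idea concrete, here is one possible reading of it as a rough sketch, not a settled design. It splits on single spaces only, skips the non-alphanumeric cleanup, and measures just one direction (what is left of @a after removing words found in @b). The variables and the second sentence are made up, and STRING_SPLIT needs SQL Server 2016 or later at compatibility level 130+.

```sql
-- Hypothetical inputs for illustration.
DECLARE @a varchar(100) = 'The Quick Brown Fox Jumps Over the Lazy Dog';
DECLARE @b varchar(100) = 'The Quick Red Fox Walks Over the Lazy Cat';

WITH WordsA AS (
    SELECT wa.value AS Word,
           -- 1 when the word from @a has no match in @b, else 0.
           CASE WHEN wb.Word IS NULL THEN 1 ELSE 0 END AS IsLeftOver
    FROM STRING_SPLIT(@a, ' ') AS wa
    LEFT JOIN (
        -- DISTINCT so a repeated word in @b cannot duplicate rows from @a.
        SELECT DISTINCT value AS Word
        FROM STRING_SPLIT(@b, ' ')
    ) AS wb
      ON wb.Word = wa.value
    WHERE wa.value <> ''  -- ignore empty tokens caused by double spaces
)
SELECT CAST(100.0 * SUM(IsLeftOver * LEN(Word)) / SUM(LEN(Word))
            AS decimal(5, 2)) AS PctLeftOver
FROM WordsA;
```

For these two sentences the leftover words are Brown, Jumps, and Dog, which is 13 of 35 non-space characters, so the query should return about 37.14. An empty @a would cause a divide-by-zero, which a real version would need to guard against.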