I have a column named MAT_DESC in a table that contains material descriptions in a free-text format. Here are some sample values from the MAT_DESC column:
RTYU 31655, 240+, 6 in, 50 Discs/Roll, 6 Rolls/Case
QWERTYUI PN-DR, Coarse, TR, 1-1/2 in, 50/Carton, 200 ea/Case, Dispenser Pack
2841 PC GREY AS/AF (20/CASE)
CI-1A, up to 35 kV, Compact/Solid, Stranded, 10/Case
MT53H7A4410WS5 WS WEREDSS PMR45678 ERTYUI HEERTYUIND 10/case
TYPE.2 86421-K40-F000, 1 Set/Pack, 100 Packs/Case
Clear, 1 in x 36 yd, 4.8 mil, 24 rolls per case
I’m trying to extract the quantity and unit substrings from the MAT_DESC column, for example “50 Discs/Roll”, “50/Carton”, “200 ea/Case”, and “10/Case”.
I’m currently using the following SQL query to attempt this:
SELECT MAT_DESC,
       CASE
           WHEN PATINDEX('%[A-Za-z]/[A-Za-z]%', MAT_DESC) > 0
               THEN CAST(PATINDEX('%[A-Za-z]/[A-Za-z]%', MAT_DESC) AS VARCHAR)
           ELSE 'No X'
       END AS Unit_Index
FROM TEMP_TABLE;
This query uses PATINDEX to find the position of the first letter/letter pair around a slash, as in “Discs/Roll” or “ea/Case” (the returned index is that of the letter just before the slash). From there, I planned to locate the nearest comma before and after that position and extract the substring between them.
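However, this works for some rows and fails for others. In the samples above, “10/Case” and “(20/CASE)” have a digit before the slash and are not matched, “AS/AF” and “Compact/Solid” are matched even though they are not quantities, “24 rolls per case” contains no slash at all, and some descriptions have no commas to use as delimiters.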
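For reference, here is a minimal, untested sketch of the comma-based extraction I had in mind, assuming SQL Server. The inline VALUES rows stand in for TEMP_TABLE, and the aliases (s, p, c, b) and column names (pos, back_off, next_comma, start_pos, end_pos, Unit_Text) are made up for illustration.

```sql
-- Sketch only: three of the sample rows inlined in place of TEMP_TABLE.
SELECT s.MAT_DESC,
       CASE
           WHEN p.pos > 0
               -- text between the surrounding commas, with spaces trimmed
               THEN LTRIM(RTRIM(SUBSTRING(s.MAT_DESC, b.start_pos, b.end_pos - b.start_pos)))
           ELSE 'No X'
       END AS Unit_Text
FROM (VALUES
        ('RTYU 31655, 240+, 6 in, 50 Discs/Roll, 6 Rolls/Case'),
        ('CI-1A, up to 35 kV, Compact/Solid, Stranded, 10/Case'),
        ('Clear, 1 in x 36 yd, 4.8 mil, 24 rolls per case')
     ) AS s(MAT_DESC)
-- position of the letter just before the first letter/letter slash (0 if none)
CROSS APPLY (VALUES (PATINDEX('%[A-Za-z]/[A-Za-z]%', s.MAT_DESC))) AS p(pos)
CROSS APPLY (VALUES (
        -- distance back from the match to the previous comma (0 if none);
        -- the CASE keeps LEFT from receiving a negative length when pos = 0
        CHARINDEX(',', REVERSE(LEFT(s.MAT_DESC,
                                    CASE WHEN p.pos > 0 THEN p.pos - 1 ELSE 0 END))),
        -- position of the next comma at or after the match (0 if none)
        CHARINDEX(',', s.MAT_DESC, p.pos)
     )) AS c(back_off, next_comma)
CROSS APPLY (VALUES (
        -- start one character past the previous comma, or at the beginning
        CASE WHEN c.back_off = 0 THEN 1 ELSE p.pos - c.back_off + 1 END,
        -- stop at the next comma, or at the end of the string
        CASE WHEN c.next_comma = 0 THEN LEN(s.MAT_DESC) + 1 ELSE c.next_comma END
     )) AS b(start_pos, end_pos);
```

If I have the arithmetic right, the first row should come back as “50 Discs/Roll”, the second as “Compact/Solid” (a false positive, and “10/Case” is never reached), and the third as 'No X', which is the kind of inconsistency I am running into.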
Is there a more robust way to extract specific patterns of substrings (like quantity and unit information) from a free-text material description column, perhaps using regular expressions or other string manipulation techniques? I’m open to any solution in SQL.