I have a list named MAT_DESC that contains material descriptions in a free-text format. Here are some sample values from the MAT_DESC column:
QWERTYUI PN-DR, Coarse, TR, 1-1/2 in, 50/Carton, 200 ea/Case, Dispenser Pack
2841 PC GREY AS/AF (20/CASE)
CI-1A, up to 35 kV, Compact/Solid, Stranded, 10/Case
MT53H7A4410WS5 WS WEREDSS PMR45678 ERTYUI HEERTYUIND 10/case
TYPE.2 86421-K40-F000, 1 Set/Pack, 100 Packs/Case
Clear, 1 in x 36 yd, 4.8 mil, 24 rolls per case
3M™ Victory Series™ Bracket MBT™ 017-873, .022, UL3, 0T/8A, Hk, 5/Pack
3M™ BX™ Dual Reader Protective Eyewear 11458-00000-20, Clear Anti-Fog Lens, Silver/Black Frame, +2.0 Top/Bottom Diopter, 20 ea/Case
4220VDS-QCSHC/900-000/A CABINET EMPTY
3M™ Bumpon™ Protective Product SJ5476 Fluorescent Yellow, 3.000/Case
3M™ Bumpon™ Protective Products SJ61A2 Black, 10,000/Case
The expected extraction for each description is:
Material Desc | String to be Extracted |
---|---|
QWERTYUI PN-DR, Coarse, TR, 1-1/2 in, 50/Carton, 200 ea/Case, Dispenser Pack | 50/Carton, 200 ea/Case |
2841 PC GREY AS/AF (20/CASE) | 20/CASE |
TYPE.2 86421-K40-F000, 1 Set/Pack, 100 Packs/Case | 1 Set/Pack, 100 Packs/Case |
RTYU 31655, 240+, 6 in, 50 Discs/Roll, 6 Rolls/Case | 50 Discs/Roll, 6 Rolls/Case |
Clear, 1 in x 36 yd, 4.8 mil, 24 rolls per case | 24 rolls per case |
3M™ Victory Series™ Bracket MBT™ 017-873, .022, UL3, 0T/8A, Hk, 5/Pack | 5/Pack |
3M™ BX™ Dual Reader Protective Eyewear 11458-00000-20, Clear Anti-Fog Lens, Silver/Black Frame, +2.0 Top/Bottom Diopter, 20 ea/Case | 20 ea/Case |
4220VDS-QCSHC/900-000/A CABINET EMPTY | No units |
3M™ Bumpon™ Protective Product SJ5476 Fluorescent Yellow, 3.000/Case | 3.000/Case |
3M™ Bumpon™ Protective Products SJ61A2 Black, 10,000/Case | 10,000/Case |
I’m trying to extract the quantity and unit information from each value in the MAT_DESC column (e.g., “50 Discs/Roll”, “200 ea/Case”, “10/Case”, or “50/Carton, 200 ea/Case” when a description contains more than one).
I’m currently using the following Python code to attempt this:
import re

# material_descriptions: the list of MAT_DESC values
pattern = r"(\d+)\s*(\w+)/(\w+)"
results = []
for desc in material_descriptions:
    matches = re.findall(pattern, desc)
    unit_strings = []
    if matches:
        for match in matches:
            quantity, unit1, unit2 = match
            unit_string = f"{quantity} {unit1}/{unit2}"
            unit_strings.append(unit_string)
    if unit_strings:
        unit_info = ", ".join(unit_strings)
        results.append((desc, unit_info))
for material_desc, unit_info in results:
    print(f"Material Description: {material_desc}")
    print(f"Unit Information: {unit_info}")
    print()
The script fails in the following scenarios:
Material Desc | String to be Extracted |
---|---|
3M™ Victory Series™ Bracket MBT™ 017-873, .022, UL3, 0T/8A, Hk, 5/Pack | 5/Pack |
3M™ BX™ Dual Reader Protective Eyewear 11458-00000-20, Clear Anti-Fog Lens, Silver/Black Frame, +2.0 Top/Bottom Diopter, 20 ea/Case | 20 ea/Case |
4220VDS-QCSHC/900-000/A CABINET EMPTY | No units |
3M™ Bumpon™ Protective Product SJ5476 Fluorescent Yellow, 3.000/Case | 3.000/Case |
3M™ Bumpon™ Protective Products SJ61A2 Black, 10,000/Case | 10,000/Case |
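To make the failures easier to reproduce, here is a small self-contained check that runs the same pattern against just the rows above. The helper name `current_output` and the list name `failing_descriptions` are only for illustration and are not part of my actual script.

```python
import re

# Same pattern as in the script, with the regex escapes written out.
pattern = r"(\d+)\s*(\w+)/(\w+)"

# The five rows from the table of failing scenarios.
failing_descriptions = [
    "3M™ Victory Series™ Bracket MBT™ 017-873, .022, UL3, 0T/8A, Hk, 5/Pack",
    "3M™ BX™ Dual Reader Protective Eyewear 11458-00000-20, Clear Anti-Fog Lens, "
    "Silver/Black Frame, +2.0 Top/Bottom Diopter, 20 ea/Case",
    "4220VDS-QCSHC/900-000/A CABINET EMPTY",
    "3M™ Bumpon™ Protective Product SJ5476 Fluorescent Yellow, 3.000/Case",
    "3M™ Bumpon™ Protective Products SJ61A2 Black, 10,000/Case",
]


def current_output(desc):
    """Return the unit strings the current pattern produces for one description."""
    return [f"{q} {u1}/{u2}" for q, u1, u2 in re.findall(pattern, desc)]


for desc in failing_descriptions:
    print(desc)
    print("  ->", current_output(desc))
```

As far as I can tell, the pattern needs at least one word character between the leading digits and the slash, so "5/Pack" is never matched, while pieces such as "0T/8A" and "0 Top/Bottom" are picked up as if they were units. For "3.000/Case" and "10,000/Case" only the trailing digits after the separator seem to be considered, so the quantity comes out split.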
Is there a way to achieve this?
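One possible direction, as a minimal sketch rather than a definitive solution: restrict the word after "/" (or "per") to a short whitelist of packaging words, and allow "." and "," inside the quantity. The whitelist in `CONTAINERS` is an assumption based only on the samples shown here, and the names `UNIT_PATTERN` and `extract_units` are hypothetical; real data may need more unit words.

```python
import re

# ASSUMPTION: the unit after "/" or "per" is always one of these packaging
# words (optionally plural). This list is guessed from the sample rows only.
CONTAINERS = r"(?:case|carton|pack|roll|box|bag)s?"

UNIT_PATTERN = re.compile(
    r"(?<![\w.,])"             # quantity must not be glued to a preceding letter, digit, "." or ","
    r"\d+(?:[.,]\d+)*"         # quantity, allowing forms like "3.000" and "10,000"
    r"(?:\s*[A-Za-z]+)?"       # optional inner unit such as "ea", "Set", "rolls"
    r"(?:\s*/\s*|\s+per\s+)"   # separator: a slash or the word "per"
    + CONTAINERS + r"\b",      # packaging word from the assumed whitelist
    re.IGNORECASE,
)


def extract_units(desc):
    """Return the matched unit strings joined by ', ', or 'No units' if none match."""
    matches = [m.group(0) for m in UNIT_PATTERN.finditer(desc)]
    return ", ".join(matches) if matches else "No units"


if __name__ == "__main__":
    samples = [
        "QWERTYUI PN-DR, Coarse, TR, 1-1/2 in, 50/Carton, 200 ea/Case, Dispenser Pack",
        "Clear, 1 in x 36 yd, 4.8 mil, 24 rolls per case",
        "4220VDS-QCSHC/900-000/A CABINET EMPTY",
        "3M™ Bumpon™ Protective Products SJ61A2 Black, 10,000/Case",
    ]
    for desc in samples:
        print(f"{desc} | {extract_units(desc)}")
```

Because the sketch returns the matched text as it appears in the description, the original spacing and capitalisation (for example "20/CASE") are preserved. A description whose packaging word is not in the whitelist would come back as "No units", so the list would have to be extended for other data.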