How can I convert a specific docx table to Markdown using Python?
Hi. I have a docx file with this table:
I want to turn it into Markdown using Python. I tried this:
from docx import Document
from tabulate import tabulate

def extract_table_from_docx(docx_path):
    doc = Document(docx_path)
    table_data = []
    # Assuming the first table in the document
    table = doc.tables[0]
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            # Replace newline characters with a space to keep the cell content on one line
            cell_text = ' '.join(cell.text.strip().split('\n'))
            row_data.append(cell_text)
        table_data.append(row_data)
    return table_data

def convert_table_to_markdown(table_data):
    # Use the first row as headers
    headers = table_data[0]
    rows = table_data[1:]
    return tabulate(rows, headers=headers, tablefmt="github")

# Extract table data from DOCX file
path = "file.docx"
table_data = extract_table_from_docx(path)

# Convert table data to Markdown
markdown_table = convert_table_to_markdown(table_data)

# Print the Markdown table
print(markdown_table)
But I get this:
| p/p | Category/Technical | Subcategory | Equipment, materials | Type of equipment, destination | Priority manufacturer | |
|-------|------------------------|-------------------|-----------------------------------|------------------------------------------------|---------------------------------------------|----|
| p/p | policy | Subcategory | Equipment, materials | application | Priority manufacturer | |
| | policy | | | application | | |
| 1 | 2 | 3 | 4 | 5 | 6 | |
| | | | Transformers (low voltage) | OSM type transformers | Tula transformer plant | |
| | | | | Off.Av., Triggers, UZO, UZIP | Сhint, Hyundai, КЭАЗ | |
| | | | | Contacts | Electrical technician, KEAZ | |
| | | | | Circuit breakers for direct current (including | KEAZ, Hint, Akel | |
| | | | | to the current station for RZ) | KEAZ, Hint, Akel | |
| | | | | to the current station for RZ) | | |
| | | Low voltage (not | | Low voltage switches of type | DKC, КЭАЗ, IEK | |
| | | Low voltage (not | Other equipment (not available | ПКУ, УП | DKC, КЭАЗ, IEK | |
| | | Low voltage (not | Other equipment (not available | ПКУ, УП | | |
| | | explosion proof) | Other equipment (not available | ПКУ, УП | | |
| | | explosion proof) | Other equipment (not available | Drawers type YA5000 NKU drawers | | |
| | | explosion proof) | explosion proof) | Drawers type YA5000 NKU drawers | Promeltech, IEK | |
| | | | explosion proof) | Drawers type YA5000 NKU drawers | Promeltech, IEK | |
| | | | explosion proof) | electric control | Promeltech, IEK | |
| | | | | electric control | | |
| | | | | Control cabinets and alarm cabinets, | | |
| | | | | control stations, boxes | Center "Promservice" on the construct DKC | |
| | | | | connecting and splitting (Not | Center "Promservice" on the construct DKC | |
| | | | | connecting and splitting (Not | | |
| | | | | explosion-proof) | | |
......
As you can see, the column name “Category/Technical policy” is split across several rows, which is wrong. How can I solve this issue?
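One direction that might work, as a minimal sketch only: if the first few rows returned by `extract_table_from_docx` are all fragments of the header (three rows in the output above, which is an assumption about this particular file), the fragments could be joined per column before the data is passed to `tabulate`. The helper name `merge_header_rows` and its `n_header_rows` parameter are hypothetical, not part of python-docx or tabulate, and the sample data is a shortened copy of the output above.

```python
def merge_header_rows(table_data, n_header_rows):
    """Collapse the first n_header_rows rows into one header row.

    Hypothetical helper: n_header_rows must be chosen by inspecting the table.
    """
    if n_header_rows <= 0:
        return table_data
    header_rows = table_data[:n_header_rows]
    n_cols = max(len(row) for row in header_rows)
    merged = []
    for col in range(n_cols):
        fragments = []
        for row in header_rows:
            text = row[col].strip() if col < len(row) else ""
            # Skip empty cells and fragments already seen in this column
            # (repeats appear when a cell spans several rows)
            if text and text not in fragments:
                fragments.append(text)
        merged.append(" ".join(fragments))
    # The single merged header row is followed by the untouched body rows
    return [merged] + table_data[n_header_rows:]


# Shortened sample taken from the output shown above
sample = [
    ["p/p", "Category/Technical", "Subcategory"],
    ["p/p", "policy", "Subcategory"],
    ["", "policy", ""],
    ["1", "2", "3"],
]
result = merge_header_rows(sample, 3)
print(result[0])
```

The result could then be passed to `convert_table_to_markdown` in place of the raw `table_data`. This only addresses the header; the same fragmentation in the body rows would need a separate rule for deciding which rows belong together.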