Reading Word documents with python-docx

2021-06-09 / python word

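Requirement

Generate SQL from the tables in a Word document.

Content

My file was a .doc, so python-docx raised an exception when opening it.

The fix is to open the file in Word and use Save As to convert it to .docx.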
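A .docx file is a ZIP-based package, while the legacy .doc format is a different binary layout, so a quick check before opening can catch the problem early. The sketch below uses only the standard library; `looks_like_docx` is a hypothetical helper name, not part of python-docx, and it only confirms the file is a ZIP archive with a .docx extension, not that it is a valid Word document.

```python
import zipfile
from pathlib import Path


def looks_like_docx(path):
    """Return True if the file has a .docx suffix and is a ZIP archive.

    Hypothetical helper: a cheap pre-check, not a full validation.
    """
    path = Path(path)
    # is_zipfile returns False for missing files and for non-ZIP content,
    # which covers a .doc file that was merely renamed to .docx.
    return path.suffix.lower() == '.docx' and zipfile.is_zipfile(path)
```

Calling `looks_like_docx('./test.docx')` before `Document('./test.docx')` lets the script print a clear "convert this file with Save As first" message instead of failing inside the library.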

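Table template

序号 (No.)	编码 (Code)	名称 (Name)	英文描述 (English description)	长度 (Length)	说明 (Notes)
xx	xx	xx	xx	xx	xx

The English description column holds the database column name, and every column is typed as String. The Name and Notes columns both describe the field.

So when generating the DDL statements, Name and Notes are merged into a single column comment.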
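The row-to-column mapping described above can be sketched as a small pure function. This is a minimal illustration, not the script's exact code: `column_line` and the sample values are hypothetical, and the argument order (column name, Name, Notes) is chosen only for the example.

```python
import re


def column_line(field, name, note):
    """Build one column definition; every column is typed String."""
    def clean(text):
        # Drop carriage returns, newlines and tabs left over from the Word cell
        return re.sub(r'[\r\n\t]+', '', text).strip()

    # Name and Notes are merged into a single COMMENT
    comment = f'{clean(name)}, {clean(note)}'
    return f"`{field.strip()}` String COMMENT '{comment}'"


print(column_line('user_name', 'User name', 'Unique per account\n'))
# prints: `user_name` String COMMENT 'User name, Unique per account'
```

A single quote inside Name or Notes would still break the generated statement; escaping is left out of this sketch.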

Install the third-party module:

pip install python-docx

Python code

from docx import Document
import re


def get_table_name(title):
    # The table name is the text inside full-width parentheses （…） in the heading
    regex = re.compile(r'（(.*?)）')
    return re.findall(regex, title)[0]


def generate_sql_file(content):
    with open('./generate_ddl.sql', 'w', encoding='UTF-8') as f:
        f.write(content)


if __name__ == '__main__':
    doc = Document('./test.docx')
    tables = []
    titles = []
    # Collect table names from paragraphs that use the '二级条标题' heading style
    for p in doc.paragraphs:
        if p.style.name == '二级条标题':
            titles.append(p.text)
            tables.append(get_table_name(p.text))
    # Placeholder for the table name, substituted once headings and tables are paired
    special_str = '{xxx_#%……&*table_（*&……%xxx}'
    ddl_str = []
    # Build one CREATE TABLE statement for every table object in the document
    for table in doc.tables:
        ck = []
        ck.append("""
        CREATE TABLE """ + special_str + """ ( 
        """)
        for row in table.rows:
            # Skip the header row
            if row.cells[0].text == '序号':
                continue
            cells = row.cells
            # Strip carriage returns, newlines and tabs from Name and Notes
            name = re.sub(r'[\r\n\t]+', '', cells[2].text)
            note = re.sub(r'[\r\n\t]+', '', cells[5].text)
            ck.append(f' `{cells[3].text}` String COMMENT \'{name}, {note}\',\n')
        # Drop the trailing comma after the last column
        ck[-1] = ck[-1].replace(',\n', '\n')
        ck.append(""")
        ENGINE = MergeTree()
        ORDER BY COLLECT_TARGET_ID \t\t\t ; \n
        """)
        ddl_str.append(''.join(ck))
    content = []
    # Substitute the table names. In this document the second level-2 heading
    # has two level-3 headings, which throws the pairing off, so the leading
    # entries are skipped to realign the three lists.
    tables = tables[2:]
    ddl_str = ddl_str[4:]
    titles = titles[2:]
    # The three counts should match before pairing
    print(len(tables))
    print(len(ddl_str))
    print(len(titles))
    for table, ddl, title in zip(tables, ddl_str, titles):
        content.append(f'\n -- {title} \n')
        content.append(ddl.replace(special_str, table))
    generate_sql_file(''.join(content))

    # Once the file is generated, paste it into a SQL tool to format it
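The script pairs headings with tables by position and relies on printed lengths to spot a mismatch. A hedged alternative, sketched below with the standard library only, is to fail loudly when the counts differ or a heading carries no name in full-width parentheses. The function names, the sample heading, and the `{T}` placeholder are hypothetical and chosen for illustration.

```python
import re


def table_name_from_heading(heading):
    """Return the text inside the first pair of full-width parentheses, or None."""
    match = re.search(r'（(.*?)）', heading)
    return match.group(1) if match else None


def pair_headings_with_ddl(headings, ddl_statements, placeholder):
    """Substitute each heading's table name into the DDL at the same position."""
    if len(headings) != len(ddl_statements):
        # zip() would silently truncate, so check the counts first
        raise ValueError(
            f'{len(headings)} headings but {len(ddl_statements)} DDL statements')
    parts = []
    for heading, ddl in zip(headings, ddl_statements):
        name = table_name_from_heading(heading)
        if name is None:
            raise ValueError(f'no table name found in heading: {heading!r}')
        parts.append(f'\n -- {heading} \n')
        parts.append(ddl.replace(placeholder, name))
    return ''.join(parts)


print(table_name_from_heading('用户信息表（USER_INFO）'))
# prints: USER_INFO
```

With a check like this, the manual slicing step would surface as an explicit error instead of a silently shifted table name.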

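Afterword

What would have been several days of work was finished in one.

Python rocks.

Chinese documentation: https://www.geek-share.com/detail/2769406894.html

English documentation: https://python-docx.readthedocs.io/en/latest/index.html#

Title: Reading Word documents with python-docx
Author: gitsilence
URL: https://blog.lacknb.cn/articles/2021/06/09/1623228009266.html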