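I recently had a feature request: parse PDF files and convert them into HTML pages.
Solutions:
Method 1: I first tried the PHP library pdfparser. See the GitHub repository: https://github.com/smalot/pdfparser and the official documentation: https://www.pdfparser.org/documentation
An example excerpted from the documentation: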
- <?php
- // Include Composer autoloader if not already done.
- include 'vendor/autoload.php';
- // Parse pdf file and build necessary objects.
- $parser = new \Smalot\PdfParser\Parser();
- $pdf = $parser->parseFile('document.pdf');
- $text = $pdf->getText();
- echo $text;
- ?>
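Ordinary PDFs are parsed fine, but when a PDF contains data tables, only the text comes out and the table layout is lost. Recommendation level: moderate.
Method 2: the Linux command pdf2htmlEX
1. Install pdf2htmlEX. Start with the build dependencies: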
- sudo yum install -y cmake gcc gnu-getopt libpng-devel fontforge-devel cairo-devel poppler-devel libspiro-devel freetype-devel poppler-data libjpeg-turbo-devel git make gcc-c++ pango-devel
- sudo yum install -y libjpeg-turbo.x86_64 libjpeg-turbo-devel libjpeg-turbo-devel.x86_64 libtiff.x86_64 libtiff-devel openjpeg-devel.x86_64 openjpeg giflib giflib-devel libxml2.x86_64 libxml2-devel libspiro.x86_64 libspiro-devel libuninameslist-devel.x86_64 libtool-ltdl-devel
2. Download the sources from GitHub:
https://github.com/coolwanglu/pdf2htmlEX
https://github.com/coolwanglu/fontforge/tree/pdf2htmlEX
3. Extract the archive
- cd into the directory that contains the archive
- tar zxvf pdf2htmlEX-0.14.6.tar.gz
- cd pdf2htmlEX-0.14.6
4. Edit the environment variable file
- 1. Append the following two lines to the end of /etc/profile
- export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
- export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
- 2. Apply the changes
- source /etc/profile
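If step 4 is run more than once, the same export lines pile up in /etc/profile. The helper below is a minimal sketch of an idempotent append: it only adds a line when that exact line is not already in the file. The function name append_once is hypothetical, and writing to /etc/profile normally requires root.

```shell
# append_once FILE LINE (hypothetical helper, not part of pdf2htmlEX)
# Appends LINE to FILE only if FILE does not already contain that exact line.
append_once() {
    file=$1
    line=$2
    # grep -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example usage (run as root; single quotes keep $LD_LIBRARY_PATH literal):
#   append_once /etc/profile 'export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig'
#   append_once /etc/profile 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH'
#   . /etc/profile
```

The single quotes matter: with double quotes the shell would expand $LD_LIBRARY_PATH before the line is written to the file.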
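5. In the fontforge directory, run
- Command: sh autogen.sh
- Command: sh configure
- Command: make
- Command: make install
6. In the pdf2htmlEX directory, run the following. If you hit a directory permission error, you can chmod 777 that directory.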
cmake . && make && sudo make install
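7. Installation is complete. Run pdf2htmlEX document.pdf to generate the corresponding HTML file.
Recommended. Setting up the environment takes some effort, but both the tables and the images in the PDF are converted.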
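Once pdf2htmlEX is installed, converting a whole directory is a simple loop. The sketch below assumes that pdf2htmlEX is on the PATH and that your build supports the --dest-dir option. The function name convert_all and the DRY_RUN switch are hypothetical, added here so the commands can be previewed before anything runs.

```shell
# convert_all SRC_DIR OUT_DIR (hypothetical helper)
# Runs pdf2htmlEX on every *.pdf in SRC_DIR and writes the HTML into OUT_DIR.
# With DRY_RUN=1 it only prints the commands it would run.
convert_all() {
    src_dir=$1
    out_dir=$2
    for pdf in "$src_dir"/*.pdf; do
        # When no PDF matches, the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "pdf2htmlEX --dest-dir $out_dir $pdf"
        else
            mkdir -p "$out_dir"
            pdf2htmlEX --dest-dir "$out_dir" "$pdf"
        fi
    done
}

# Example usage:
#   DRY_RUN=1; convert_all ./pdfs ./html    # preview only
#   DRY_RUN=0; convert_all ./pdfs ./html    # actually convert
```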
Method 3: the Linux command pdftohtml
Installation, using CentOS as an example: sudo yum install poppler-utils.x86_64
After installation, run pdftohtml -c -s document.pdf to convert a file.
The full list of options:
- [root@localhost ~]# pdftohtml --help
- pdftohtml version 0.26.5
- Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
- Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
- Copyright 1996-2011 Glyph & Cog, LLC
- Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
- -f <int> : first page to convert
- -l <int> : last page to convert
- -q : don't print any messages or errors
- -h : print usage information
- -? : print usage information
- -help : print usage information
- --help : print usage information
- -p : exchange .pdf links by .html
- -c : generate complex document
- -s : generate single document that includes all pages
- -i : ignore images
- -noframes : generate no frames
- -stdout : use standard output
- -zoom <fp> : zoom the pdf document (default 1.5)
- -xml : output for XML post-processing
- -hidden : output hidden text
- -nomerge : do not merge paragraphs
- -enc <string> : output text encoding name
- -fmt <string> : image file format for Splash output (png or jpg)
- -v : print copyright and version info
- -opw <string> : owner password (for encrypted files)
- -upw <string> : user password (for encrypted files)
- -nodrm : override document DRM settings
- -wbt <fp> : word break threshold (default 10 percent)
- -fontfullname : outputs font full name
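The options in the help text can be combined. As a minimal sketch, the function below converts only a page range, using just the -c, -s, -f and -l flags listed in the help text. The function name pdftohtml_range and the DRY_RUN switch are hypothetical.

```shell
# pdftohtml_range PDF FIRST LAST (hypothetical helper)
# Converts pages FIRST..LAST of PDF into a single complex HTML document.
# With DRY_RUN=1 it only prints the command it would run.
pdftohtml_range() {
    pdf=$1
    first=$2
    last=$3
    # Rebuild the positional parameters as the final command line.
    set -- pdftohtml -c -s -f "$first" -l "$last" "$pdf"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example usage:
#   pdftohtml_range document.pdf 2 5
```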
Source: https://blog.csdn.net/zhangchb/article/details/107179896