Comparing PaddleOCR, ABBYY, and other OCR engines: how to turn an image into a PDF with selectable text

PaddleOCR is an optical character recognition (OCR) project developed and maintained by Baidu. Compared with other OCR tools:

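1. Its Chinese character recognition is stronger than ABBYY's.
2. It can recognize text in special layouts, such as seals and stamps.
3. Its handwriting recognition is far better than that of other OCR vendors.
4. Its table recognition is mediocre, on par with Camelot. On borderless tables such as bank statements it is weaker than some OCR vendors, for example pdfflux.
5. Paddle has a layout recovery feature that exports to Word, but the results are mediocre: many regions end up embedded as images.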

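One very handy ABBYY feature is converting an image-only PDF into a text PDF. Paddle, by contrast, usually just outputs JSON.
A similar result can be achieved by adding a layer of transparent text on top of the image, which gives the effect of a text PDF: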

For example, here is one PaddleOCR result:

json_data = {
    "type": "figure",
    "bbox": [286, 360, 978, 727],
    "res": [
        {"text": "客户名称", "text_region": [[573.0, 402.0], [717.0, 402.0], [717.0, 441.0], [573.0, 441.0]]},
        ....
    ],
    "img_idx": 0
}

The coordinates of the text region are: [[573.0, 402.0], [717.0, 402.0], [717.0, 441.0], [573.0, 441.0]]
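The four points can be reduced to a simple left/top/width/height box. The following is a minimal sketch; `quad_to_box` is a hypothetical helper name, not part of PaddleOCR. It takes the min and max of the corner coordinates, so it does not depend on the order of the points and also covers slightly rotated regions.

```python
def quad_to_box(text_region):
    """Return (left, top, width, height) of the axis-aligned box around a 4-point region."""
    xs = [point[0] for point in text_region]
    ys = [point[1] for point in text_region]
    left, top = min(xs), min(ys)
    return left, top, max(xs) - left, max(ys) - top


region = [[573.0, 402.0], [717.0, 402.0], [717.0, 441.0], [573.0, 441.0]]
print(quad_to_box(region))  # (573.0, 402.0, 144.0, 39.0)
```

For the sample region this gives a box 144 pixels wide and 39 pixels high.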

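Note that these coordinates use the top-left corner of the image as the origin, whereas most PDF generation libraries place the origin at the bottom-left.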
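The conversion between the two coordinate systems is a single subtraction. Here is a small sketch, assuming the PDF page has the same size as the image (one PDF unit per pixel); `image_to_pdf_y` is a hypothetical helper name, and the page height of 1000 is made up for the example.

```python
def image_to_pdf_y(top, box_height, page_height):
    """Map a box's top edge (origin top-left, y grows downward) to the
    y of its bottom edge in PDF space (origin bottom-left, y grows upward)."""
    return page_height - (top + box_height)


# The sample region starts at y=402 and is 39 pixels high.
print(image_to_pdf_y(402.0, 39.0, 1000))  # 559.0
```

The bottom edge is used because PDF text is positioned from the bottom of the line rather than the top.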

The text in the JSON can be overlaid on the image and written out as a PDF with reportlab or a Node.js library.

from PIL import Image
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

# Register a TrueType font that covers Chinese characters so it can be embedded.
font_path = './fonts/SIMHEI.TTF'
font_name = 'SimHei'
pdfmetrics.registerFont(TTFont(font_name, font_path))


def generate_pdf(json_data, image_path, output_path):
    # Use the image size in pixels as the page size, so 1 pixel maps to 1 PDF unit.
    with Image.open(image_path) as img:
        width, height = img.size

    pdf_canvas = canvas.Canvas(output_path, pagesize=(width, height))

    # Draw the image as the page background.
    pdf_canvas.drawImage(image_path, 0, 0, width=width, height=height)

    # Add one transparent string per recognized text region.
    for item in json_data['res']:
        text = item['text']
        region = item['text_region']

        # region[0] is the top-left corner, region[2] the bottom-right corner.
        x = region[0][0]
        top = region[0][1]
        line_height = region[2][1] - top

        # Flip the y axis: image origin is top-left, PDF origin is bottom-left.
        y = height - top - line_height

        # Font size follows the region height (at least 1 to stay valid).
        pdf_canvas.setFont(font_name, max(1, int(line_height)))
        # Alpha 0 makes the fill fully transparent; the text stays in the text layer.
        pdf_canvas.setFillColorRGB(0, 0, 0, 0)
        pdf_canvas.drawString(x, y, text)

    pdf_canvas.save()


image_path = "./output/output_0.jpg"  # Path to your image
output_path = "output.pdf"

generate_pdf(json_data, image_path, output_path)

The code embeds a font and sets the font size from the height of each text region.
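Sizing by height alone can make the invisible text wider than the region it covers. One possible refinement, sketched below, also caps the font size by the region width. It rests on a rough assumption that is not from the original code: wide (CJK) characters are taken to be 1 em wide and all other characters 0.5 em. Both function names are hypothetical. Exact widths could instead be measured with reportlab's `pdfmetrics.stringWidth`.

```python
import unicodedata


def estimate_width_em(text):
    """Estimate text width in em units: wide/fullwidth chars count 1.0, others 0.5 (assumption)."""
    return sum(
        1.0 if unicodedata.east_asian_width(ch) in ("W", "F") else 0.5
        for ch in text
    )


def fit_font_size(text, box_width, box_height):
    """Pick a font size no larger than the box height whose estimated width fits the box."""
    width_em = estimate_width_em(text)
    if width_em == 0:
        return box_height
    return min(box_height, box_width / width_em)


# Sample region from the JSON above: 144 wide, 39 high, four CJK characters.
print(fit_font_size("客户名称", 144.0, 39.0))  # 36.0
```

With the sample region the width is the tighter limit, so the estimate drops from 39 to 36.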
