1. gfx
All functionality of pdf2swf, swftools' PDF to SWF converter, is also exposed by the Python module "gfx". gfx contains a PDF parser (based on xpdf) and a number of rendering backends. In particular, it can extract text from PDF pages, create bitmaps from them, or convert PDF files to SWF. The latter is similar to what swftools' (http://www.swftools.org) pdf2swf utility offers, but more powerful: you can also create individual SWF files from single pages of a PDF, or mix pages from different PDFs.
1.1 Compiling and installing gfx
To install gfx, you first need to download and uncompress one of the archives at http://www.swftools.org/download.html. You then basically have two options:
    python setup.py build
    python setup.py install

    ./configure
    make
    # substitute the following path with your correct python installation:
    cp lib/python/*.so /usr/lib/python2.4/site-packages/

    python -c 'import gfx'
1.2 Reading a PDF file
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "document.pdf")
    print "Author:", doc.getInfo("author")
    print "Subject:", doc.getInfo("subject")
    print "PDF Version:", doc.getInfo("version")

Using getInfo, you can query the following fields:
title, subject, keywords, author, creator, producer, creationdate, moddate, linearized, tagged, encrypted, oktoprint, oktocopy, oktochange, oktoaddnotes, version
Depending on the PDF file, not all these fields may contain useful information.
Some PDF files may be protected, or even password encrypted. You can recognize protected files by the fact that doc.getInfo("encrypted") returns "yes". If, additionally, doc.getInfo("oktocopy") returns "no", the file has copy protection enabled. In that case the gfx module won't allow you to extract information from it: extraction of pages (see below) will raise an exception.
If the PDF file is password encrypted, you need the password to display the file. You can pass the password to the open function by appending it to the filename, using '|' as separator:
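The protection check described above can be wrapped in a small helper. This is only a sketch: is_copy_protected is a hypothetical name, not part of gfx, and it assumes getInfo returns the strings "yes" and "no" exactly as described in the text. FakeDoc is a stand-in so the snippet runs without a PDF file; with gfx installed you would pass the object returned by gfx.open instead.

```python
def is_copy_protected(doc):
    """Return True if the document is encrypted and copying is disallowed.

    Assumption: doc.getInfo() returns "yes"/"no" strings for these fields.
    """
    return (doc.getInfo("encrypted") == "yes"
            and doc.getInfo("oktocopy") == "no")


class FakeDoc:
    """Hypothetical stand-in for a gfx document, for illustration only."""

    def __init__(self, info):
        self._info = info

    def getInfo(self, key):
        # Unknown fields yield an empty string in this stand-in.
        return self._info.get(key, "")


protected = FakeDoc({"encrypted": "yes", "oktocopy": "no"})
readable = FakeDoc({"encrypted": "yes", "oktocopy": "yes"})
```

Checking this before calling getPage lets you skip protected files instead of handling the exception afterwards.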
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "protecteddocument.pdf|mysecretpassword")
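If the password is optional in your program, building the filename string in one place keeps the '|' convention out of the rest of the code. The helper below is a sketch; pdf_source is a hypothetical name and not part of gfx. It does not handle filenames that themselves contain a '|' character.

```python
def pdf_source(filename, password=None):
    """Build the string to pass as the filename argument of gfx.open.

    Follows the "filename|password" convention described above.
    """
    if password is None:
        return filename
    return filename + "|" + password


plain = pdf_source("document.pdf")
secured = pdf_source("protecteddocument.pdf", "mysecretpassword")
```

With gfx installed, the result would be used as gfx.open("pdf", secured).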
1.3 Reading an Image or SWF file
Reading image files or SWF files is done analogously. You only need to pass a different filetype specifier to the open() function:
    #!/usr/bin/python
    import gfx

    doc1 = gfx.open("image", "myimage.png")
    doc2 = gfx.open("swf", "flashfile.swf")
1.4 Extracting pages from a (PDF/SWF/Image) file
Once the document has been properly opened, you can start working with the content, i.e., the individual pages. You can extract a page from a file using the getPage() function. The resulting Page object gives you additional information about the file. getPage() expects the page number, which starts at 1 for the first page.
The following code lists all pages in a file, along with their size:
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "document.pdf")
    for pagenr in range(1,doc.pages+1):
        page = doc.getPage(pagenr)
        print "Page", pagenr, "has dimensions", page.width, "x", page.height

Note: The size of pages can vary in PDF documents. Don't make the common mistake of querying only the first page for its dimensions and using that for all other pages.
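If you need a single size that fits every page (for example, for a fixed-size output canvas), you have to look at all pages. The following sketch does that using only doc.pages, getPage, width and height as introduced above. largest_page_size is a hypothetical helper, and FakeDoc and FakePage are stand-ins so the snippet runs without gfx.

```python
def largest_page_size(doc):
    """Return (max_width, max_height) over all pages of the document."""
    max_width = 0
    max_height = 0
    for pagenr in range(1, doc.pages + 1):  # page numbers start at 1
        page = doc.getPage(pagenr)
        max_width = max(max_width, page.width)
        max_height = max(max_height, page.height)
    return max_width, max_height


class FakePage:
    """Hypothetical stand-in for a gfx Page object."""

    def __init__(self, width, height):
        self.width = width
        self.height = height


class FakeDoc:
    """Hypothetical stand-in for a gfx document."""

    def __init__(self, sizes):
        self._pages = [FakePage(w, h) for (w, h) in sizes]
        self.pages = len(self._pages)

    def getPage(self, pagenr):
        return self._pages[pagenr - 1]


doc = FakeDoc([(612, 792), (792, 612), (595, 842)])
size = largest_page_size(doc)
```

Note that the widest and the tallest page need not be the same page, as in the stand-in data above.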
1.5 Rendering pages to bitmaps
The gfx module contains a number of rendering backends. The most interesting is probably the ImageList renderer, which creates images from pages. The following code extracts the first page of a PDF document as an image:
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "document.pdf")
    img = gfx.ImageList()
    img.setparameter("antialise", "1") # turn on antialiasing
    page1 = doc.getPage(1)
    img.startpage(page1.width,page1.height)
    page1.render(img)
    img.endpage()
    img.save("thumbnail80x80.png")
1.6 Extracting text from PDF files
Using the PlainText device, you can extract the full text from PDF files. The following code snippet demonstrates this:
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "document.pdf")
    text = gfx.PlainText()
    for pagenr in range(1,doc.pages+1):
        page = doc.getPage(pagenr)
        text.startpage(page.width, page.height)
        page.render(text)
        text.endpage()
    text.save("document_fulltext.txt")

    ...
    gfx.setparameter("zoom", "400")
    text = gfx.OCR()
    ...
1.7 Rendering pages to SWF files
As the gfx module derives from pdf2swf, of course it can also convert PDF files to SWF files. The code needed for this is similar to the previous examples:
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "document.pdf")
    swf = gfx.SWF()
    for pagenr in range(1,doc.pages+1):
        page = doc.getPage(pagenr)
        swf.startpage(page.width, page.height)
        page.render(swf)
        swf.endpage()
    swf.save("document.swf")

    #!/usr/bin/python
    import gfx

    gfx.setparameter("bitmap", "1") # or "poly2bitmap"
    doc = gfx.open("pdf", "document.pdf")
    ...
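The text and SWF examples share the same per-page loop: startpage, render, endpage. That loop can be factored into one function that works with any output device. This is a sketch, not part of gfx: render_all is a hypothetical name, and RecordingDevice, FakePage and FakeDoc are stand-ins that only record the call order so the snippet runs without gfx.

```python
def render_all(doc, device):
    """Render every page of doc to device, one output page per input page."""
    for pagenr in range(1, doc.pages + 1):  # page numbers start at 1
        page = doc.getPage(pagenr)
        device.startpage(page.width, page.height)
        page.render(device)
        device.endpage()
    return device


class RecordingDevice:
    """Hypothetical stand-in for an output device; records calls."""

    def __init__(self):
        self.calls = []

    def startpage(self, width, height):
        self.calls.append(("startpage", width, height))

    def endpage(self):
        self.calls.append(("endpage",))


class FakePage:
    """Hypothetical stand-in for a gfx Page object."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def render(self, device):
        device.calls.append(("render",))


class FakeDoc:
    """Hypothetical stand-in for a gfx document."""

    def __init__(self, sizes):
        self._pages = [FakePage(w, h) for (w, h) in sizes]
        self.pages = len(self._pages)

    def getPage(self, pagenr):
        return self._pages[pagenr - 1]


device = render_all(FakeDoc([(100, 200), (300, 400)]), RecordingDevice())
```

With gfx installed, the same function should work with the real objects, for example render_all(gfx.open("pdf", "document.pdf"), gfx.SWF()).save("document.swf").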
1.8 Putting more than one input page on one SWF page
You don't need to start a new output page for every input page. For example, you can place pairs of pages side by side:
    #!/usr/bin/python
    import gfx

    doc = gfx.open("pdf", "document.pdf")
    swf = gfx.SWF()

    for pagenr in range(doc.pages // 2):
        page1 = doc.getPage(pagenr*2+1)
        page2 = doc.getPage(pagenr*2+2)
        swf.startpage(page1.width+page2.width, max(page1.height,page2.height))
        page1.render(swf, move=(0,0))
        page2.render(swf, move=(page1.width,0),
                     clip=(page1.width,0,page1.width+page2.width,page2.height))
        swf.endpage()

    if doc.pages % 2:
        # for an odd number of pages, render final page
        page = doc.getPage(doc.pages)
        swf.startpage(page.width,page.height)
        page.render(swf)
        swf.endpage()

    swf.save("document.swf")
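The arithmetic in the example above (sheet size, horizontal offset of the second page, handling of a trailing odd page) can be checked separately from the rendering. The sketch below computes the same layout from a plain list of page sizes. two_up_layout and its return format are hypothetical, chosen only for illustration.

```python
def two_up_layout(sizes):
    """Plan side-by-side sheets for a list of (width, height) page sizes.

    Returns a list of (sheet_width, sheet_height, placements), where
    placements is a list of (pagenr, x_offset) with 1-based page numbers.
    """
    sheets = []
    # Walk over complete pairs: indices (0, 1), (2, 3), ...
    for i in range(0, len(sizes) - 1, 2):
        w1, h1 = sizes[i]
        w2, h2 = sizes[i + 1]
        sheets.append((w1 + w2, max(h1, h2), [(i + 1, 0), (i + 2, w1)]))
    # For an odd number of pages, the final page gets a sheet of its own.
    if len(sizes) % 2:
        w, h = sizes[-1]
        sheets.append((w, h, [(len(sizes), 0)]))
    return sheets


plan = two_up_layout([(100, 200), (150, 180), (120, 210)])
```

Each x_offset corresponds to the first component of the move argument passed to render in the example above.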
1.9 Parsing (PDF/Image/SWF) content yourself
If none of the supplied output devices (PlainText, ImageList, SWF) is doing what you need, you can also process the PDF content yourself. The gfx module gives you an easy way to do it, by translating the usually very complex PDF file contents into a number of very simple drawing operations. In order to pass those operations to Python, you need the PassThrough output device, together with a custom class:
    import gfx

    class MyOutput:
        # Methods are called on an instance, so each one takes self first.
        def setparameter(self, key, value):
            print "setparameter",key,value
        def startclip(self, outline):
            print "startclip",outline
        def endclip(self):
            print "endclip"
        def stroke(self, outline, width, color, capstyle, jointstyle, miterLimit):
            print "stroke",outline
        def fill(self, outline, color):
            print "fill",outline
        def fillbitmap(self, outline, image, matrix, colortransform):
            print "fillbitmap",outline
        def fillgradient(self, outline, gradient, gradienttype, matrix):
            print "fillgradient",outline
        def addfont(self, font):
            print "addfont"
        def drawchar(self, font, glyph, color, matrix):
            print "drawchar"
        def drawlink(self, outline, url):
            print "drawlink", outline, url

    doc = gfx.open("pdf", "document.pdf")
    output = gfx.PassThrough(MyOutput())
    doc.getPage(1).render(output)
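A custom class does not have to print anything; it can also collect data. The following sketch counts how often each drawing operation occurs. CountingOutput is a hypothetical class that reuses the method names and arguments listed above. It assumes PassThrough calls these methods on the instance with exactly those arguments, which has not been verified against gfx here. The calls at the end only simulate what a renderer might do, with None placeholders instead of real outlines, fonts and colors.

```python
class CountingOutput:
    """Counts how often each drawing operation is called (illustrative)."""

    def __init__(self):
        self.counts = {}

    def _count(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1

    def setparameter(self, key, value):
        self._count("setparameter")

    def startclip(self, outline):
        self._count("startclip")

    def endclip(self):
        self._count("endclip")

    def stroke(self, outline, width, color, capstyle, jointstyle, miterLimit):
        self._count("stroke")

    def fill(self, outline, color):
        self._count("fill")

    def fillbitmap(self, outline, image, matrix, colortransform):
        self._count("fillbitmap")

    def fillgradient(self, outline, gradient, gradienttype, matrix):
        self._count("fillgradient")

    def addfont(self, font):
        self._count("addfont")

    def drawchar(self, font, glyph, color, matrix):
        self._count("drawchar")

    def drawlink(self, outline, url):
        self._count("drawlink")


# Simulated calls; the argument values are placeholders, not gfx objects.
out = CountingOutput()
out.fill(None, None)
out.fill(None, None)
out.drawchar(None, 0, None, None)
```

With gfx installed, you would wrap an instance in gfx.PassThrough and render a page to it, as in the example above, then inspect out.counts.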