Data Science: from PDF to Searchable DB

Data Science: from PDF to Searchable DB

Have you ever found yourself wanting to be able to process text data from a pdf file and then make that information searchable across files, not just within one document? Common problem, right? Well maybe not too common for the individual but one common example may be to create a searchable archive of reports that have been saved as pdf documents so that you can find reports that match certain interests with greater ease. Or perhaps you want to make emails from legacy systems searchable.

In this post we explore an application of computer vision to extract text from a pdf and then pump the text results into a searchable database so that the information is quickly accessible to an end user application. Think like Google but for documents not available online. The benefits of doing this are that you can make a large number of documents searchable very quickly (e.g. good for inpatient app users) and enable more complex search functionality over the text (e.g. good for picky app users).

In order to solve this problem there are a number of steps we must take, so bear with me across each section. I have tried to split the sections up into stand-alone sections that explain how to achieve part to this larger problem. We will use Python and ImageMagick to pre-process the pdf or image for text extraction, Tesseract to perform the computer vision piece of extracting text from an image, and sqlite as our database solution for creating a searchable repository of the extracted text.

Let’s name our environment before we begin:

  • Windows 10 OS
  • Tesseract 4.0
  • ImageMagick 6.9.10-Q8
  • Sqlite 3.14.2
  • Python 3.6 libraries:
    • Pytesseract 0.2.5
    • PIL 4.2.1
    • Wand 0.4.4
    • Sqlite3 2.6.0 (comes with Python 3.X as standard library)

Some notes on installing Tesseract and ImageMagick

Both Tesseract and ImageMagick are separate software tools that need to be installed on your environment in order to enable pdf-to-image processing, pre-OCR processing, and OCR (text extraction).

The focus here is on a Windows environment rather than Linux but much of the code we produce below will be the same. Open source tools like Tesseract and ImageMagick tend to be easier to load into Linux environments but since I have had to work in both I wanted to perform this in a Windows environment to show that it is possible. A few notes regarding installation in a Windows environment:

ImageMagick

You will want to use ImageMagick 6.9.10-Q8 instead of their latest version. This is because version 6 is their most stable with regard to the Python functions that we will use to leverage this piece of software. You can find the proper dll file here:
http://ftp.icm.edu.pl/packages/ImageMagick/binaries/?C=N;O=D
There you will find both 32 and 64 bit executable files. Because I am running on a 64 bit Windows 10 machine I downloaded the following file:

ImageMagick-6.9.10-11-Q8-x64-dll.exe

Once downloaded you can double click and follow the installer instructions.
Once ImageMagick is installed on your computer you will want to add a new environment variable called MAGICK_HOME with the location path to where your instance installed on your machine. For example, mine installed on the C drive in Program Files so my path variable looks like this:

C:\Program Files\ImageMagick-6.9.10-Q8

Tesseract

To get the installers for your specific Windows environment visit here and download the appropriate executable:
https://github.com/UB-Mannheim/tesseract/wiki
Once downloaded and installed on your machine you will want to add Tesseract to your PATH variables in your environment variables. When you access your environment variables, open your PATH variable and add the location of Tesseract to the list. My path looks like this:

C:\Program Files (x86)\Tesseract-OCR

Now that we have all of the non-python software installed we are ready to get started!

Sqlite

Sqlite is a very lightweight database software that has text indexing capabilities. Because it is the most commonly used database software in the world and it comes with Python we will use it here.

Let’s Get Coding!

First step is to make sure you have all necessary Python libraries installed and brought into memory. If you are using an interactive python console like IPython you can do this using the ‘shebang’ before pip like this !:

!pip install pytesseract==0.2.5
!pip install pillow==4.2.1
!pip install wand==0.4.4

Step 1: Convert a PDF to an Image

Once installed, we will start with the conversion of pdf to image since Tesseract cannot consume pdf’s. In order to create a bunch of pdf files quickly, I used an extension for Chrome called “Save-emails-to-pdf”. It is fast and allows you to save a lot of emails in Gmail to pdf files by simply checking the emails in the checkbox and clicking on the download button:

We will use these pdf files to convert to images, and then perform OCR. If you are thinking ‘hey, why not just use the pdf library in Python to extract the text directly,’ you would be correct in that creating pdf files like this does make the text extractable directly from the pdf code. The problem is that many pdf files do not have text embedded but rather represent images of text. In these instances, the only way to extract the text is to perform OCR from an image file. But I digress, onward…

from wand.image import Image
import wand.image
import wand.api
from os import listdir
path = ‘C:/betacosine/OCR/’
pdf_list = [x for x in listdir(path) if x.endswith('.pdf')]

For this example, I have placed the example pdf emails in the path described in the code above. We then use the listdir method to get a list of all the pdf files in the directory that was specified.

import ctypes
MagickEvaluateImage = wand.api.library.MagickEvaluateImage
MagickEvaluateImage.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_double]

def evaluate(self, operation, argument):
    MagickEvaluateImage(
      self.wand,
      wand.image.EVALUATE_OPS.index(operation),
      self.quantum_range * float(argument))

Before we process the pdf files to images we need to set up our ImageMagick methods and functions that will be used to convert pdf files to images for OCR. I have found that Tesseract performs best on digital text extraction when we convert the pdfs to grayscale and use some level of thresholding to pre-process the image before passing through the OCR engine.

Thresholding is a simple and efficient way of separating the foreground from the background in an image. To complete this using ImageMagick we need to specify some additional pieces of information and so we develop a function that will perform thresholding for us.

There is another software tool called OpenCV that has an easier-to-use Python interface for thresholding and other image pre-processing but to keep things a little simpler here, I just focus on ImageMagick. See this great tutorial on using Tesseract with OpenCV.

 text_list = []

for i in pdf_list: 
    text_list2 = []
    with Image(filename=path+i) as img1:
        num_pgs = len(img1.sequence)
        for p in range(num_pgs):
            with Image(filename=path+i+"["+str(p)+"]",resolution=200)as img:
                img.type = 'grayscale'
                evaluate(img,'threshold',.60)
                img_buffer = np.asarray(bytearray(img.make_blob(format='png')), dtype='uint8')
                bytesio = io.BytesIO(img_buffer)
                text = pytesseract.image_to_string(PIL.Image.open(bytesio))
                text_list2.append(text)
    text_list.append(text_list2)

So, the code above is not the prettiest, I definitely get it. And this is where I show my cards as more of a data scientist than a programmer but the code above does work and it is efficient. Because it is complex, let me walk you through what is happening here. In the first line we are creating an open list container called text_list, which will be where we put our OCR’d text results. At the beginning of the “for loop,” we start by iterating over each of the pdf files in our directory. Because most of the pdf files are multiple pages and we want to OCR each page, we need to iterate our tasks over each page. Thus, we use the first “with” statement to get the total page numbers in each pdf file.

The second “for loop” iterates over each page number and performs the functions contained in the second “with” statement on each page. In that second “with” statement we start by converting the image to grayscale, then perform the thresholding. In the next line that starts with “img_buffer” we are creating a Numpy array out of the binary that we get when we use the “make_blob” method from ImageMagick. We then convert it to a bytes object so that we can open it using the PIL library. All of this is done so that we do not need to spend precious compute resources writing the image to disc and then reading it back into memory to perform OCR. This way we can just pass the object directly along to Tesseract for OCR.

Finally, we append the resulting text to our text_list2, which then gets appended to our text_list. What we are left with is a list of lists:

You will notice that I am only processing 6 emails here.

flat_list = ['\n'.join(map(str, x)) for x in text_list]

In the next line of code above, we simply flatten the resulting list into a single list by joining the multiple pages into a single list rather than a sublist.

At this point you could add some text processing steps that further improve the readability and accuracy of the text that has been extracted. See some example cleaning functions in one of our previous blog posts here.

Step 2: Creating a Searchable Database

Now that we have our text data, we want to enter it into a database that indexes the text field we want to be able to search on.

import sqlite3
sqlite_db = 'email_db.db'
zip_list = list(zip(pdf_list,flat_list))
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

In the next 2 lines of code we are importing the sqlite3 library and providing a name for our database (email_db, original, I know). We then join our list of pdf file names to our list of text results for each email in a tuple, which makes for fast insertion into the database.

In the last two lines of code we create a connection to the database. If the database does not exist, this will create it. Then we activate the cursor to be able to interact with the database.

c.execute('CREATE VIRTUAL TABLE email_table USING fts4(email_title TEXT, email_text TEXT)')
c.executemany('INSERT INTO email_table(email_title,email_text) VALUES(?,?)', zip_list)
conn.commit()

Next, we use Sqlite’s text indexing functionality by creating a table using the FTS4 extension. In Sqlite FTS3 – 6 provide text indexing capabilities that significantly reduce the time it takes to get results back from a query and add additional text searching capabilities. Read more about it here.

Finally, we insert our data from zip_list into the new table and commit it.

Done! You have now created a searchable database of text data that will respond in milliseconds to even the most complex text searches, over millions of rows. I have done this for over 12 million rows of data and get search results in .1 to .2 seconds.

In addition, you can now leverage more complex text querying features available in the FTS extensions for SQLite. For example, you can now search for words that are within a certain number of other words and return results. You can even return results that include additional characters highlighting where in the text your search term appears. I have included some example code below.

import pandas as pd
df = pd.read_sql('''select * from email_table where email_text MATCH 'trump NEAR/5 tweet*';''',conn)
conn.close()

In the above code, I use pandas to pull out a dataframe of results where I search for any rows that say trump and tweet within 5 words of each other. Pretty cool huh!?! Now slap a fancy UI on top and, bingo, a searchable interface with an indexed DB on the backend.

As always, we look forward to any comments, ideas, or feedback this may have inspired in you. Stay tuned for more ideas that enable Ai!

Leave a Reply

Your email address will not be published. Required fields are marked *