High Precision Natural Language Processing for PDF/TIFF/Image Documents
Users Guide, Gap v0.9.2

1 Introduction

The target audience for this users guide is software developers who will be integrating the core inner block into a product and/or service. It is not meant to be a complete reference guide or comprehensive tutorial, but a brief getting-started guide.

To utilize this module, the Gap framework will automatically install:

1.  This Python module.
2.  Python 3.6 or later
3.  Ghostscript ©(open source from Artifex).    [will auto-install with pip install].
4.  Tesseract ©(open source from Google).       [will auto-install with pip install].
5.  Magick ©(open source from ImageMagick).     [will auto-install with pip install].
6.  NLTK Toolkit (open source)                  [will auto-install with pip install].
7.  Unidecode (open source)                     [will auto-install with pip install].
8.  HDF5 (open source)                          [will auto-install with pip install].
9.  Numpy (open source)                         [will auto-install with pip install].
10. OpenCV (open source)                        [will auto-install with pip install]. 
11. Imutils (open source)                       [will auto-install with pip install].

2 SPLITTER Module

2.1 Document Loading

To load a PDF document, TIFF facsimile, or image-captured document, create a Document (class) object, passing as parameters the path to the PDF/TIFF/image document and a path for storing the split pages/text. Below is a code example.

from gapml.splitter import Document, Page
document = Document("yourdocument.pdf", "storage_path")

2.2 Page Splitting

Upon instantiating a document object, the corresponding PDF document or TIFF facsimile is automatically split into the corresponding PDF or TIFF pages, utilizing Ghostscript (PDF) and Magick (TIFF). Each PDF/TIFF page will be stored separately in the storage path with the following naming convention:

<document basename><pageno>.<suffix> , where <suffix> is either pdf or tif

The module automatically detects if a PDF document is a digital (text) or scanned PDF (image). For digital documents, the text is extracted directly from the PDF page using Ghostscript and stored separately in the storage path with the following naming convention:

<document basename><pageno>.txt

2.3 OCR

If the document is a scanned PDF, each page image will be extracted using Ghostscript and then processed with OCR using Tesseract to extract the text content from the page image. The page image and corresponding page text are stored separately in the storage path with the following naming convention:

<document basename><pageno>.png
<document basename><pageno>.txt

If the document is a TIFF facsimile, each page image will be extracted using Magick and then processed with OCR using Tesseract to extract the text content from the page image. The page image and corresponding page text are stored separately in the storage path with the following naming convention:

<document basename><pageno>.tif
<document basename><pageno>.txt

If the document is an image capture (e.g., JPG), the image is processed with OCR using Tesseract to extract the text content from the page image. The page image and corresponding page text are stored separately in the storage path with the following naming convention:

<document basename><pageno>.<suffix> , where <suffix> is png or jpg
<document basename><pageno>.txt

2.4 Image Resolution for OCR

The resolution of the image rendered by Ghostscript from a scanned PDF page will affect the OCR quality and processing time. By default the resolution is set to 300 dpi. The resolution can be changed for all subsequently created documents with the static member RESOLUTION of the Document class. This property only affects the rendering of scanned PDF; it does not affect TIFF facsimile or image capture.

# Set the Resolution of Image Extraction of all scanned PDF pages
Document.RESOLUTION = 150

# Image Extraction and OCR will be done at 150 dpi for all subsequent documents
document = Document("scanneddocument.pdf", "storage_path")

2.5 Page Access

Each page is represented by a Page (class) object. Page objects are accessed through the pages property of the Document object. The number of pages in the document is returned by the built-in len() function applied to the Document object.

document = Document("yourdocument.pdf", "storage_path")

# Get the number of pages in the PDF document
npages = len(document)

# Get the page table
pages = document.pages

# Get the first page
page1 = pages[0]

# or alternately
page1 = document[0]

# full path location of the PDF/TIFF or image capture page in storage
page1_path = page1.path

2.6 Adding Pages

Additional pages can be added to the end of an existing Document object using the += (overridden) operator, where the new page will be fully processed.

document = Document("1page.pdf")

# This will print 1 for 1 page
print(len(document))

# Create a Page object for an existing PDF page
new_page = Page("page_to_add.pdf")

# Add the page to the end of the document.
document += new_page

# This will print 2 showing now that it is a 2 page document.
print(len(document))
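
The len(), indexing, and += behavior described in sections 2.5 and 2.6 corresponds to standard Python special methods. The following is a minimal sketch of that protocol only; MiniDocument and MiniPage are hypothetical stand-ins and are not the Gap classes, which additionally process each page.

```python
class MiniPage:
    """Hypothetical stand-in for a page; it only remembers its path."""
    def __init__(self, path):
        self.path = path


class MiniDocument:
    """Hypothetical container illustrating len(), indexing, and +=."""
    def __init__(self, pages=None):
        self._pages = list(pages or [])

    @property
    def pages(self):
        return self._pages

    def __len__(self):
        # len(document) returns the number of pages
        return len(self._pages)

    def __getitem__(self, index):
        # document[0] is shorthand for document.pages[0]
        return self._pages[index]

    def __iadd__(self, page):
        # document += page appends the page and returns the same object
        self._pages.append(page)
        return self


doc = MiniDocument([MiniPage("1page1.pdf")])
print(len(doc))
doc += MiniPage("page_to_add.pdf")
print(len(doc))
print(doc[1].path)
```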

2.7 Text Extraction

The raw text for the page is obtained from the text property of the Page class. The byte size of the raw text is obtained from the size() method of the Page class.

# Get the page table
pages = document.pages

# Get the first page
page1 = pages[0]

# Get the total byte size of the raw text
# (named nbytes to avoid shadowing the Python builtin bytes)
nbytes = page1.size()

# Get the raw text for the page
text = page1.text

The scanned property is True if the text was extracted using OCR; otherwise it is False (i.e., the origin was digital text). The property additionally returns a second value, which is the estimated quality of the scan as a fraction between 0 and 1.

# Determine if text extraction was obtained by OCR
scanned, quality = document.scanned

2.8 Asynchronous Processing

To enhance concurrent execution between a main thread and worker activities, the Document class supports asynchronous processing of the document (i.e., Page Splitting, OCR and Text Extraction). Asynchronous processing will occur if the optional parameter ehandler is set when instantiating the Document object. Upon completion of the processing, the ehandler is called, where the Document object is passed as a parameter.

def done(d):
    """ Event Handler for when processing of document is completed """
    print("DONE", d.document)

# Process the document asynchronously
document = Document("yourdocument.pdf", "storage_path", ehandler=done)
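
The callback pattern above can be illustrated with the standard threading module. This is a sketch under stated assumptions, not Gap's implementation: process_async is a hypothetical helper, and the dictionary passed to the handler is a placeholder for the Document object.

```python
import threading


def process_async(name, ehandler):
    """Run simulated processing on a worker thread, then call ehandler."""
    def worker():
        # Placeholder for page splitting, OCR, and text extraction
        result = {"document": name}
        ehandler(result)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


completed = []


def done(result):
    """Event handler called when processing is completed."""
    completed.append(result["document"])


thread = process_async("yourdocument.pdf", done)
# The main thread is free to do other work here.
thread.join()  # wait only so that this example finishes deterministically
print(completed)
```

Note that in this sketch the handler runs on the worker thread, so any state it shares with the main thread needs to be safe for concurrent access.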

2.9 NLP Preprocessing of the Text

NLP preprocessing of the text requires the SYNTAX module. The processing of the raw text into NLP sequenced tokens (syntax) is deferred and is executed on a JIT (Just in Time) basis. If the module is installed, the NLP sequenced tokens are accessed through the words property of the Page class. The first time the property is accessed for a page, the raw text is preprocessed, and the result is then retained in memory for subsequent access.

# Get the page table
pages = document.pages

# Get the first page
page1 = pages[0]

# Get the NLP preprocessed text
words = page1.words

The NLP preprocessed text is stored separately in the storage path with the following naming convention:

<document basename><pageno>.json
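
The deferred (JIT) behavior of the words property can be sketched with a cached Python property. LazyPage is a hypothetical class, and the lowercase-and-split step is only a placeholder for the real NLP preprocessing.

```python
class LazyPage:
    """Hypothetical page that preprocesses its text on first access only."""
    def __init__(self, text):
        self.text = text
        self._words = None
        self.tokenize_calls = 0  # counts how often preprocessing actually runs

    @property
    def words(self):
        if self._words is None:
            # First access: do the work and keep the result in memory
            self.tokenize_calls += 1
            self._words = self.text.lower().split()
        return self._words


page = LazyPage("The quick brown fox")
first = page.words   # triggers preprocessing
second = page.words  # served from memory
print(page.tokenize_calls)
```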

2.10 NLP Preprocessing Settings (Config)

NLP Preprocessing of the text may be configured for several settings when instantiating a Document object with the optional config parameter, which consists of a list of one or more predefined options.

document = Document("yourdocument.pdf", "storage_path", config=[options])
# options:
bare                     # do bare tokenization
stem = internal     |    # use builtin stemmer
       porter       |    # use NLTK Porter stemmer
       snowball     |    # use NLTK Snowball stemmer
       lancaster    |    # use NLTK Lancaster stemmer
       lemma        |    # use NLTK WordNet lemmatizer
       nostem            # no stemming
pos                      # Tag each word with NLTK parts of speech
roman                    # Romanize latin-1 character encodings into ASCII

2.11 Document Reloading

Once a Document object has been stored, it can later be retrieved from storage, reconstructing the Page and corresponding Words objects. A document object is first instantiated, and then the load() method is called specifying the document name and corresponding storage path. The document name and storage path are used to identify and locate the corresponding stored pages.

# Instantiate a Document object
document = Document()

# Reload the document's pages from storage
document.load( "mydoc.pdf", "mystorage" )

This will reload pages whose filenames in the storage match the sequence:

mystorage/mydoc1.json  
mystorage/mydoc2.json  
...
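
As an illustration of the naming convention, the expected page filenames can be derived from the document name and storage path. The helper stored_page_paths is hypothetical and is not part of Gap; it assumes page numbers start at 1, as in the sequence above.

```python
import os


def stored_page_paths(document_name, storage_path, npages, suffix="json"):
    """Build <storage>/<document basename><pageno>.<suffix> for each page."""
    basename = os.path.splitext(os.path.basename(document_name))[0]
    return [
        f"{storage_path}/{basename}{pageno}.{suffix}"
        for pageno in range(1, npages + 1)
    ]


paths = stored_page_paths("mydoc.pdf", "mystorage", 2)
print(paths)
```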

2.12 Word Frequency Distributions

The distribution of word occurrences and percentage in a document and individual pages are obtained using the properties: bagOfWords, freqDist, and termFreq.

The bagOfWords property returns an unordered dictionary of each unique word in the document (or page) as a key, and the number of occurrences as the value.

# Get the bag of words for the document
bow = document.bagOfWords
print(bow)

will output:

{ '<word>': <no. of occurrences>, '<word>':  <no. of occurrences>, … }
e.g., { 'plan': 20, 'medical': 31, 'doctor': 2, … }

# Get the bag of words for each page in the document
for page in document.pages:
    bow = page.bagOfWords

The freqDist property returns a sorted list of each unique word in the document (or page), as a tuple of the word and number of occurrences, sorted by the number of occurrences in descending order.

# Get the word frequency (count) distribution for the document
count = document.freqDist
print(count)

will output:

[ ('<word>', <no. of occurrences>), ('<word>', <no. of occurrences>), … ]
e.g., [ ('medical', 31), ('plan', 20), …, ('doctor', 2), … ]

# Get the word frequency distribution for each page in the document
for page in document.pages:
    count = page.freqDist

The termFreq property returns a sorted list of each unique word in the document (or page), as a tuple of the word and the percentage it occurs in the document, sorted by the percentage in descending order.

# Get the term frequency (TF) distribution for the document
tf = document.termFreq
print(tf)

will output:

[ ('<word>', <percent>), ('<word>', <percent>), … ]
e.g., [ ('medical', 0.02), ('plan', 0.015), … ]
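
The relationship between the three distributions can be sketched with collections.Counter. The functions below are hypothetical illustrations, not Gap's code, and they assume term frequency means the word count divided by the total number of tokens.

```python
from collections import Counter


def bag_of_words(tokens):
    """Unordered mapping of each unique word to its number of occurrences."""
    return dict(Counter(tokens))


def freq_dist(tokens):
    """List of (word, count) tuples, sorted by count in descending order."""
    return sorted(Counter(tokens).items(), key=lambda item: item[1], reverse=True)


def term_freq(tokens):
    """List of (word, fraction of all tokens), in descending order."""
    total = len(tokens)
    if total == 0:
        return []
    return [(word, count / total) for word, count in freq_dist(tokens)]


tokens = ["medical", "plan", "medical", "doctor"]
print(bag_of_words(tokens))
print(freq_dist(tokens))
print(term_freq(tokens))
```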

2.13 Document and Page Classification

Semantic classification (e.g., category) of the document and individual pages requires the CLASSIFICATION module. The classification is deferred and is executed on a JIT (Just in Time) basis. If the module is installed, the classification is accessed through the label property of the Document and Page classes, respectively. The first time the property is accessed for a document or page, the NLP sequenced tokens for each page are processed to classify the content of the individual pages, and the first page is further processed to classify the content of the entire document.

# Get the classification for the document
document_classification = document.label

# Get the classification for each page
for page in document.pages:
    classification = page.label

3 SYNTAX Module

3.1 NLP Processing

The Words (class) object does the NLP preprocessing of the extracted (raw) text. If the extracted text is from a Page object (see SPLITTER), the NLP preprocessing occurs the first time the words property of the Page object is accessed.

from gapml.syntax import Words, Vocabulary

# Get the first page in the document
page = document.pages[0]

# Get the raw text from the page as a string
text = page.text

# Get the NLP processed words (Words class) object from the page as a list.
words = page.words

# Print the object type of words => <class 'Document.Words'>
print(type(words))

3.2 Words Properties

The Words (class) object has four public properties: text, words, bagOfWords, and freqDist. The text property is used to access the raw text and the words property is used to access the NLP processed tokens from the raw text.

# Get the NLP processed words (Words class) object from the page as a list.
words = page.words

# Get the original (raw) text as a string
text = words.text

The words property returns the NLP preprocessed words as a Python list.

# Get the NLP processed words from the original text as a Python list.
words = words.words

# Print the object type of words => <class 'list'>
print(type(words))

The bagOfWords and freqDist properties are explained later in the guide.

3.3 Vocabulary Dictionary

The words property returns a sequenced Python list in which each word is represented as a dictionary, tagged with values from the Vocabulary class. Each word in the list has the dictionary format:

{ 'word'  : word, # The stemmed version of the word
  'lemma' : word, # The lemma version of the word
  'tag'   : tag   # The word classification
}

3.4 Traversing the NLP Processed Words

The NLP processed words returned from the words property are sequenced in the same order as the original text. All punctuation is removed and, except for detected acronyms, all remaining words are lowercased. The sequenced list of words may be a subset of the original words, depending on the stopwords properties, and the words may be stemmed, lemmatized, or replaced.

# Get the NLP processed words from the original text as a Python list.
words = words.words

# Traverse the sequenced list of NLP processed words
for word in words:
    # each word is a dictionary (see 3.3 Vocabulary Dictionary)
    text   = word['word']   # original or replaced version of the word
    tag    = word['tag']    # syntactical classification of the word
    lemma  = word['lemma']  # The lemma version of the word

3.5 Stopwords

The properties that determine which words are removed, stemmed, lemmatized, or replaced are set as keyword parameters in the constructor for the Words class. If no keyword parameters are specified, then all stopwords are removed after being stemmed/lemmatized. The list of stopwords is a superset of the Porter list and additionally covers syntactical constructs such as numbers, dates, etc. For a complete list, see the reference manual.

If the keyword parameter stopwords is set to False, then all word removal is disabled, while stemming/lemmatization/reducing are still enabled, along with the removal of punctuation. Note in the example below that, while stopword removal is disabled, the word jumped is still replaced with its stem jump.

# No stopword removal
words = Words("The lazy brown fox jumped over the fence.", stopwords=False)
# words => "the", "lazy", "brown", "fox", "jump", "over", "the", "fence"

# All stopword removal
words = Words("The lazy brown fox jumped over the fence.", stopwords=True)
# words => "lazy", "brown", "fox", "jump", "fence"

3.6 Bare

When the keyword parameter bare is True, all stopword removal, stemming/lemmatization/reducing and punctuation removal are disabled.

# Bare Mode
words = Words("The lazy brown fox jumped over the fence.", bare=True)
# words => "the", "lazy", "brown", "fox", "jumped", "over", "the", "fence", "."

3.7 Numbers

When the keyword parameter number is True, text and numeric version of numbers are preserved; otherwise they are removed. Numbers which are text based (e.g., one) are converted to their numeric representation (e.g., one => 1). The tag value for numbers is set to Vocabulary.NUMBER.

# keep/replace numbers
words = Words("one twenty-one 33.7 1/4", number=True)
print(words.words)

will output:

[
{ 'word': '1',  tag: Vocabulary.NUMBER },
{ 'word': '21', tag: Vocabulary.NUMBER },
{ 'word': '33.7', tag: Vocabulary.NUMBER },
{ 'word': '0.25', tag: Vocabulary.NUMBER },
]

If a number is followed by a text representation of a multiplier unit (e.g., million), the number and multiplier unit are replaced by the multiplied value.

words = Words("two million", number=True)
print(words.words)

will output:

[
{ 'word': '2000000',  tag: Vocabulary.NUMBER}, 
]
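
The number handling described in this section can be approximated as follows. This is a simplified sketch, not Gap's implementation: normalize_numbers and token_value are hypothetical names, the vocabulary is limited to the words listed, and non-numeric tokens are simply dropped.

```python
from fractions import Fraction

UNITS = {name: value for value, name in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {name: value * 10 for value, name in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
MULTIPLIERS = {"hundred": 100, "thousand": 1000,
               "million": 10 ** 6, "billion": 10 ** 9}


def token_value(token):
    """Return the numeric value of one token, or None if it is not a number."""
    if token in UNITS:
        return UNITS[token]
    if token in TENS:
        return TENS[token]
    if "-" in token:  # hyphenated text number, e.g. twenty-one
        tens, _, unit = token.partition("-")
        if tens in TENS and 0 < UNITS.get(unit, 0) < 10:
            return TENS[tens] + UNITS[unit]
        return None
    try:
        if "/" in token:  # simple fraction, e.g. 1/4
            return float(Fraction(token))
        return float(token) if "." in token else int(token)
    except (ValueError, ZeroDivisionError):
        return None


def normalize_numbers(text):
    """Keep only numbers, folding a following multiplier into the value."""
    values = []
    previous_was_number = False
    for token in text.lower().split():
        if token in MULTIPLIERS and previous_was_number:
            values[-1] *= MULTIPLIERS[token]
            continue
        value = token_value(token)
        previous_was_number = value is not None
        if previous_was_number:
            values.append(value)
    return [str(value) for value in values]


print(normalize_numbers("one twenty-one 33.7 1/4"))
print(normalize_numbers("two million"))
```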

3.8 Unit of Measurement

When the keyword parameter unit is True, US Standard and Metric units of measurement are preserved; otherwise they are removed. Both US and EU spelling of metric units are recognized (e.g., meter/metre, liter/litre). The tag value for units of measurement is set to Vocabulary.UNIT.

# keep/replace unit
words = Words("10 liters", number=True, unit=True) 
print(words.words)

will output:

[
{ 'word': '10',  tag: Vocabulary.NUMBER }, 
{ 'word': 'liter',  tag: Vocabulary.UNIT },
]

3.9 Standard vs. Metric

When the keyword parameter standard is True, Metric units of measurement are converted to US Standard. When the keyword parameter metric is True, US Standard units of measurement are converted to Metric.

# convert Metric units to US Standard
words = Words("10 liters", number=True, unit=True, standard=True)
print(words.words)

will output:

[
{ 'word': '2.64172',  tag: Vocabulary.NUMBER }, 
{ 'word': 'gallon',  tag: Vocabulary.UNIT },
]
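
The conversion in the example above is plain arithmetic: one US liquid gallon is defined as 3.785411784 liters. A minimal sketch follows; to_standard and the TO_STANDARD table are hypothetical, and which unit pairs Gap actually converts between is an assumption apart from liter to gallon.

```python
# (target unit, multiplication factor) keyed by the metric unit
TO_STANDARD = {
    "liter": ("gallon", 1 / 3.785411784),  # US liquid gallon
    "meter": ("foot", 1 / 0.3048),         # assumed pairing, for illustration
}


def to_standard(value, unit):
    """Convert a metric quantity to its US Standard counterpart."""
    target, factor = TO_STANDARD[unit]
    return round(value * factor, 5), target


print(to_standard(10, "liter"))
```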

3.10 Date

When the keyword parameter date is True, USA and ISO standard date representation and text representation of dates are preserved; otherwise they are removed. Dates are converted to the ISO standard and the tag value is set to Vocabulary.DATE.

# keep/replace dates
words = Words("Jan 2, 2017 and 01/02/2017", date=True)
print(words.words)

will output:

[
{ 'word': '2017-01-02',  tag: Vocabulary.DATE }, 
{ 'word': '2017-01-02',  tag: Vocabulary.DATE },
]
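
Normalizing a recognized date string to ISO 8601 can be sketched with the standard datetime module. The function to_iso and its list of accepted formats are assumptions for illustration; locating dates inside running text, which Gap does, is not covered here.

```python
from datetime import datetime

# Candidate layouts, tried in order (USA text, USA numeric, ISO)
FORMATS = ("%b %d, %Y", "%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def to_iso(date_text):
    """Return the date as YYYY-MM-DD, or None if no layout matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso("Jan 2, 2017"))
print(to_iso("01/02/2017"))
```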

3.11 Date of Birth

When the keyword parameter dob is True, dates of birth are preserved; otherwise they are removed. Dates of birth are converted to the ISO standard and the tag value is set to Vocabulary.DOB.

# keep/replace dates of birth
words = Words("Date of Birth:  Jan. 2 2017   DOB:  01-02-2017", dob=True)
print(words.words)

will output:

[
{ 'word': '2017-01-02',  tag: Vocabulary.DOB }, 
{ 'word': '2017-01-02',  tag: Vocabulary.DOB },
]

If date is set to True without dob (date of birth) set to True, dates of birth will be removed while other dates will be preserved.

3.12 Social Security Number

When the keyword parameter ssn is True, USA Social Security numbers are preserved; otherwise they are removed. Social Security numbers are detected from the presence of a preceding text sequence indicating that a Social Security number follows, such as SSN, Soc. Sec., Social Security, etc. Social Security numbers are converted to their single 9 digit value and the tag value is set to Vocabulary.SSN.

# keep/replace Social Security numbers
words = Words("SSN:  12-123-1234 Social Security 12 123 1234", ssn=True)
print(words.words)

will output:

[
{ 'word': '121231234',  tag: Vocabulary.SSN }, 
{ 'word': '121231234',  tag: Vocabulary.SSN },
]
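
Prefix-based detection of this kind can be sketched with a regular expression. The pattern and the find_ssns helper below are hypothetical and cover only the three prefixes named above; Gap's actual detection rules may differ.

```python
import re

# A known prefix, an optional colon, then digits separated by spaces or hyphens
SSN_PATTERN = re.compile(
    r"(?:SSN|Soc\.\s*Sec\.|Social\s+Security)\s*:?\s*(\d[\d\s-]*\d)",
    re.IGNORECASE,
)


def find_ssns(text):
    """Return each detected number reduced to its 9 digit value."""
    found = []
    for match in SSN_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group(1))
        if len(digits) == 9:
            found.append(digits)
    return found


print(find_ssns("SSN:  12-123-1234 Social Security 12 123 1234"))
```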

3.13 Telephone Number

When the keyword parameter telephone is True, USA/CA telephone numbers are preserved; otherwise they are removed. Telephone numbers are detected from the presence of a preceding text sequence indicating that a telephone number follows, such as Phone:, Mobile Number, etc. Telephone numbers are converted to their single 10 digit value, inclusive of area code, and the tag value is set to one of:

Vocabulary.TELEPHONE
Vocabulary.TELEPHONE_HOME
Vocabulary.TELEPHONE_WORK
Vocabulary.TELEPHONE_OFFICE
Vocabulary.TELEPHONE_FAX

# keep/replace telephone numbers
words = Words("Phone: (360) 123-1234, Office Number: 360-123-1234", telephone=True)
print(words.words)

will output:

[
{ 'word': '3601231234',  tag: Vocabulary.TELEPHONE }, 
{ 'word': '3601231234',  tag: Vocabulary.TELEPHONE_WORK},
]

3.14 Address

When the keyword parameter address is True, USA/CA street and postal addresses are preserved; otherwise they are removed. Each component of the address is tagged according to its street/postal address component type, as follows:

Postal Box (Vocabulary.POB)
Street Number (Vocabulary.STREET_NUM)
Street Direction (Vocabulary.STREET_DIR)
Street Name (Vocabulary.STREET_NAME)
Street Type (Vocabulary.STREET_TYPE)
Secondary Address (Vocabulary.STREET_ADDR2)
City (Vocabulary.CITY)
State (Vocabulary.STATE)
Postal (Vocabulary.POSTAL)

# keep/replace street addresses
words = Words("12 S.E. Main Ave, Seattle, WA", address=True)
print(words.words)

will output:

[
{ 'word': '12',  tag: Vocabulary.STREET_NUM }, 
{ 'word': 'southeast',  tag: Vocabulary.STREET_DIR }, 
{ 'word': 'main',  tag: Vocabulary.STREET_NAME }, 
{ 'word': 'avenue',  tag: Vocabulary.STREET_TYPE }, 
{ 'word': 'seattle',  tag: Vocabulary.CITY }, 
{ 'word': 'ISO3166-2:US-WA',  tag: Vocabulary.STATE },
]

3.15 Gender

When the keyword parameter gender is True, words indicating gender are preserved; otherwise they are removed. Transgender is included in the recognition. The tag value is set to one of Vocabulary.MALE, Vocabulary.FEMALE or Vocabulary.TRANSGENDER.

# keep/replace gender indicating words
words = Words("man uncle mother women tg", gender=True)
print(words.words)

will output:

[
{ 'word': 'man',  tag: Vocabulary.MALE }, 
{ 'word': 'uncle',  tag: Vocabulary.MALE }, 
{ 'word': 'mother',  tag: Vocabulary.FEMALE }, 
{ 'word': 'women',  tag: Vocabulary.FEMALE }, 
{ 'word': 'transgender',  tag: Vocabulary.TRANSGENDER },
]

3.16 Sentiment

When the keyword parameter sentiment is True, words and phrases indicating sentiment are preserved; otherwise they are removed. Sentiment phrases are reduced to the single primary word indicating the sentiment and the tag value is set to either Vocabulary.POSITIVE or Vocabulary.NEGATIVE.

# keep/replace sentiment indicating phrases
words = Words("the food was not good", sentiment=True)
print(words.words)

will output:

[
{ 'word': 'food', tag: Vocabulary.UNTAG },
{ 'word': 'not', tag: Vocabulary.NEGATIVE },
]

3.17 Spell Checking

When the keyword parameter spell is set to one of 'en', 'es', 'fr', 'de', or 'it', each tokenized word is looked up in the builtin Norvig speller for the corresponding language (e.g., en = English). If the word is not found (presumed misspelled) and the Norvig speller recommends a replacement, the word is replaced with that recommendation. The spell check/replacement occurs prior to stemming, lemmatizing, and stopword removal.

# spell check and replace misspelled words
words = Words("mispelled", spell='en') 
print(words.words)

will output:

[
{ 'word': 'misspell',  'tag': Vocabulary.UNTAG},
]

3.18 Parts of Speech

When the keyword parameter pos is True, each tokenized word is further annotated with its corresponding NLTK parts of speech tag.

# add parts of speech tagging
words = Words("the food was not good", sentiment=True, pos=True)
print(words.words)

will output:

[
{ 'word': 'food',  'tag': Vocabulary.UNTAG, 'pos': NN },
{ 'word': 'not',  'tag': Vocabulary.NEGATIVE, 'pos': NN },
]

3.19 Romanization

When the keyword parameter roman is True, the latin-1 character encoding of each tokenized word is converted to ASCII.

# Romanization of latin-1 character encodings
words = Words("Québec", roman=True) 
print(words.words)

will output:

[
{ 'word': 'quebec',  'tag': Vocabulary.UNTAG },
]
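
The effect of romanization can be approximated with the standard library alone. Note the swap: Gap lists Unidecode as a dependency, whereas this sketch uses unicodedata decomposition instead, so characters without a decomposition (for example ß or ø) are dropped rather than transliterated. The function name romanize is hypothetical.

```python
import unicodedata


def romanize(text):
    """Strip accents by decomposing characters and discarding non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(romanize("Québec").lower())
```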

3.20 Bag of Words and Word Frequency Distribution

The property bagOfWords returns an unordered dictionary of each unique word in the tokenized sequence, where the word is the dictionary key, and the number of occurrences is the corresponding value.

# Get the Bag of Words representation
words = Words("Jack and Jill went up the hill to fetch a pail of water. Jack fell down and Jill came tumbling after.", stopwords=False)
print(words.bagOfWords)

will output:

{ 'pail': 1, 'the': 1, 'a': 1, 'water': 1, 'fetch': 1, 'went': 1, 'and': 2, 'jack': 2, 'jill': 2,
'down': 1, 'come': 1, 'fell': 1, 'up': 1, 'of': 1, 'tumble': 1, 'to': 1, 'hill': 1, 'after': 1 }

The property freqDist returns a sorted list of tuples, in descending order, of word frequencies (i.e., the number of occurrences of the word in the tokenized sequence).

# Get the Word Frequency Distribution
words = Words("Jack and Jill went up the hill to fetch a pail of water. Jack fell down and Jill came tumbling after.", stopwords=False)
print(words.freqDist)

will output:

[ ('jack', 2), ('jill', 2), ('and', 2), ('water', 1), ('the', 1), … ]

4 SEGMENTATION Module

The segmentation module is newly introduced in Gap v0.9 prelaunch. It is at an early stage and should be considered experimental; it is not yet ready for use in commercial products. The segmentation module analyzes the whitespace layout of the text to identify the 'human' perceived grouping/purpose of text, such as paragraphs, headings, columns, page numbering, letterhead, etc., and the associated context.

In this mode, the text is separated into segments, corresponding to the identified layout, where each segment is then NLP preprocessed. The resulting NLP output is then hierarchical, where the top level is the segment identification, and its child is the NLP preprocessed text.

4.1 Text Segmentation

When the config option 'segment' is specified on a Document object, the corresponding text per page is segmented.

# import the segmentation module
from gapml.segment import Segment
segment = Segment("para 1\n\npara 2")
print(segment.segments)

will output:

[
{ 'tag': 1002, words: [ { 'word': 'para', 'tag': 0}, {'word': 1, 'tag': 1}]},
{ 'tag': 1002, words: [ { 'word': 'para', 'tag': 0}, {'word': 2, 'tag': 1}]}
]
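
The simplest form of whitespace-based segmentation, splitting paragraphs on blank lines, can be sketched as follows. The helper split_paragraphs is hypothetical and covers only this one layout cue; it does not reproduce the Segment class, its tags, or the detection of headings and columns.

```python
import re


def split_paragraphs(text):
    """Split text into paragraph blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block for block in blocks if block.strip()]


print(split_paragraphs("para 1\n\npara 2"))
```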
