• We install the necessary libraries.
  • We use a library called cdQA, which stands for Closed Domain Question Answering.

Below is the high-level architecture of the cdQA pipeline, which uses BERT as its Reader.

  • There is a Retriever model.

  • There is a Reader model.

  • The question is sent to both models: the Retriever selects the documents most likely to contain the answer, and the Reader extracts the answer from that retrieved list.

image.png
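
The retrieve-then-read flow described above can be sketched with toy stand-ins. This is only an illustration under simple assumptions: `toy_retrieve` and `toy_read` are hypothetical helpers that score by word overlap, whereas the real pipeline uses a proper retriever and a BERT reader.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def toy_retrieve(question, paragraphs, top_n=2):
    """Toy retriever: rank paragraphs by words shared with the question."""
    q = tokenize(question)
    ranked = sorted(paragraphs, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:top_n]

def toy_read(question, paragraphs):
    """Toy reader: return the sentence sharing the most words with the question."""
    q = tokenize(question)
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    return max(sentences, key=lambda s: len(q & tokenize(s)))

# Made-up paragraphs, used only to exercise the two steps.
paragraphs = [
    "A data warehouse stores historical data. It is refreshed daily.",
    "KPIs measure progress. A KPI combines a measure and a target.",
    "Users sign in with their account.",
]
question = "What does a KPI combine?"
candidates = toy_retrieve(question, paragraphs)  # step 1: narrow the search
answer = toy_read(question, candidates)          # step 2: extract the answer
print(answer)
```

The point of the split is cost: the cheap retriever narrows the search so that the expensive reader only has to look at a few candidates.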

!pip install cdqa
from tika import parser
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model
import pandas as pd
import os, sys, re

Download PDFs

In the cell below, we provide a list of URLs pointing to the PDFs we want to query.

This list can be changed to download any other PDFs as required.

url_list = [
    'https://docs.oracle.com/en/cloud/saas/analytics/fawag/administering-oracle-fusion-analytics-warehouse.pdf',
    'https://docs.oracle.com/en/cloud/saas/analytics/faiae/implementing-oracle-fusion-erp-analytics.pdf',
    'https://docs.oracle.com/en/cloud/saas/analytics/fawug/using-oracle-fusion-analytics-warehouse.pdf'
]
def download_pdf(url_list):
    """Download every PDF in url_list into ./content/pdf/."""
    import wget

    directory = './content/pdf/'
    # Create the target folder on first use.
    os.makedirs(directory, exist_ok=True)
    for url in url_list:
        wget.download(url=url, out=directory)

    print('\nDownloaded PDF files...')

download_pdf(url_list)
Downloaded PDF files...
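
If the URL list is edited often, it can help to see where each file would land before fetching anything. The sketch below is an optional add-on, not part of the original notebook: `plan_downloads` is a hypothetical helper and the example.com URLs are placeholders.

```python
import os
from urllib.parse import urlparse

def plan_downloads(urls, directory):
    """Map each URL to its local target path and flag files already present."""
    plan = []
    for url in urls:
        # Take the last path segment of the URL as the local file name.
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(directory, filename)
        plan.append((url, target, os.path.exists(target)))
    return plan

# Placeholder URLs, used only to show the mapping.
example_urls = [
    "https://example.com/docs/guide-one.pdf",
    "https://example.com/docs/guide-two.pdf",
]
for url, target, present in plan_downloads(example_urls, "./content/pdf/"):
    print(target, "(already downloaded)" if present else "(to download)")
```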

Convert PDFs to a DataFrame

The cdQA classes expect a dataframe with one row per document and two columns:

title (the document title), paragraphs (the list of paragraphs in that document)
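
To make that layout concrete, here is a hand-built dataframe of the same shape. The titles and paragraphs are invented for illustration, and the column names `title` and `paragraphs` are taken from the converter output shown further down.

```python
import pandas as pd

# One record per document: a title plus a list of paragraph strings.
records = [
    {
        "title": "example-guide-one",
        "paragraphs": ["First paragraph of guide one.", "Second paragraph of guide one."],
    },
    {
        "title": "example-guide-two",
        "paragraphs": ["Only paragraph of guide two."],
    },
]
example_df = pd.DataFrame(records, columns=["title", "paragraphs"])

print(example_df.shape)
print(example_df.loc[0, "paragraphs"])
```

Each cell in the paragraphs column holds a whole Python list, which is why the notebook output truncates it with an ellipsis.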

documents_as_dataframe = pdf_converter("./content/pdf", include_line_breaks=False, min_length=200)
2020-12-16 09:42:25,559 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /tmp/tika-server.jar.
2020-12-16 09:42:26,523 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /tmp/tika-server.jar.md5.
2020-12-16 09:42:26,979 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
def delete_pdf():
    # The PDF text now lives in documents_as_dataframe, so the files can be removed.
    import shutil
    directory = './content/pdf/'
    shutil.rmtree(directory)

delete_pdf()
len(documents_as_dataframe)
3
documents_as_dataframe.head()
title paragraphs
0 implementing-oracle-fusion-erp-analytics [Oracle Cloud, Implementing Fusion ERPAnalytic...
1 administering-oracle-fusion-analytics-warehouse [Administering Oracle Fusion Analytics Warehou...
2 using-oracle-fusion-analytics-warehouse [Using Oracle Fusion Analytics Warehouse , Ora...

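Below, we download the pretrained BERT model.

This model is the Reader: a BERT model fine-tuned on the Stanford Question Answering Dataset (SQuAD), version 1.1. The Retriever is not part of this download; it is fitted on our own documents later.

SQuAD is a large reading-comprehension dataset of question-answer pairs built from Wikipedia articles.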

download_model(model='bert-squad_1.1', dir='./models')
Downloading trained model...

Below, we create an instance of the QAPipeline class, which does the following:

  • Accepts the question posed by the user.

  • Retrieves, from the full list of documents, the ones most likely to contain the answer. The Retriever passes these to the Reader.

  • The Reader then finds the relevant passage in that shortlist and returns it as the prediction.

doc_qna_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)
100%|██████████| 231508/231508 [00:00<00:00, 4364386.66B/s]

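Here we fit the Retriever on our PDF documents. The Retriever is a BM25 model (see the output below), so this step computes term statistics from our text rather than fine-tuning a neural network.

We are not fine-tuning the Reader, but we could if required.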

doc_qna_pipeline.fit_retriever(df=documents_as_dataframe)
QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   token_pattern='(?u)\\b\\w\\w+\\b',
                                   tokenizer=None, top_n=20, verbose=False,
                                   vocabulary=None))
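
The output above reports a BM25Retriever with k1=2.0 and b=0.75. To show what fitting such a retriever involves, here is a minimal sketch of textbook Okapi BM25. It is not cdQA's implementation: `ToyBM25` is a hypothetical class, the IDF variant is one common choice, and the paragraphs are made up.

```python
import math
import re
from collections import Counter

class ToyBM25:
    """Textbook Okapi BM25 over a small list of paragraphs (illustrative only)."""

    def __init__(self, k1=2.0, b=0.75):
        self.k1, self.b = k1, b

    @staticmethod
    def _tokens(text):
        return re.findall(r"\w+", text.lower())

    def fit(self, paragraphs):
        # Fitting only gathers corpus statistics; no weights are trained.
        self.docs = [Counter(self._tokens(p)) for p in paragraphs]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        # Document frequency: in how many paragraphs each term appears.
        df = Counter(term for d in self.docs for term in d)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}
        return self

    def score(self, query, index):
        doc, length = self.docs[index], self.lengths[index]
        # Length normalisation: longer paragraphs need more hits to score well.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in self._tokens(query):
            tf = doc.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def rank(self, query):
        """Return paragraph indices, best match first."""
        return sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)

paragraphs = [
    "KPIs are comprised of a measure and a target.",
    "The warehouse is refreshed on a daily schedule.",
    "Administrators create users and assign roles.",
]
bm25 = ToyBM25().fit(paragraphs)
print(bm25.rank("What are KPIs?"))
```

Because fitting is just counting, it is cheap to repeat whenever the document set changes, unlike fine-tuning the Reader.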

An example of the model's prediction

query = 'What are KPIs?'
prediction = doc_qna_pipeline.predict(query, n_predictions=1)
prediction[0]
('ameasure and a target',
 'using-oracle-fusion-analytics-warehouse',
 'Organizations seeking to manage and improve performance often define keyperformance indicators (KPIs) to measure their progress. KPIs are comprised of ameasure and a target and usually analyzed by dimensions such as organization,customer, product, and geography. KPIs can help organizations quickly focus onactivities that have the greatest impact on business performance.',
 9.276533050152022)

Predicted Output

print('The content is available in the doc ->  {}'.format((prediction[0])[1]))

print('\n')

print('The content is -> {}'.format((prediction[0])[2]))
The content is available in the doc ->  using-oracle-fusion-analytics-warehouse


The content is -> Organizations seeking to manage and improve performance often define keyperformance indicators (KPIs) to measure their progress. KPIs are comprised of ameasure and a target and usually analyzed by dimensions such as organization,customer, product, and geography. KPIs can help organizations quickly focus onactivities that have the greatest impact on business performance.
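
Indexing the prediction by position works, but named fields can be easier to read. The sketch below wraps a 4-tuple shaped like the output above. The field names `answer`, `title`, `paragraph` and `score` are assumptions inferred from that output, not names defined by cdQA, and `describe` is a hypothetical helper.

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    # Field names are inferred from the printed tuple, not taken from cdQA.
    answer: str
    title: str
    paragraph: str
    score: float

def describe(raw):
    """Wrap a 4-tuple prediction and format it for display."""
    p = Prediction(*raw)
    return "Answer: {}\nDocument: {}\nScore: {:.2f}".format(p.answer, p.title, p.score)

# Literal copied from the output above; the paragraph is shortened here for brevity.
raw = (
    'ameasure and a target',
    'using-oracle-fusion-analytics-warehouse',
    'Organizations seeking to manage ...',
    9.276533050152022,
)
print(describe(raw))
```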