QnA Model using BERT for PDF Documents
- We install the necessary libraries.
- We use a library called cdQA, which stands for Closed Domain Question Answering.
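Below is the high-level architecture of the cdQA pipeline, which has two components:
- A Retriever model
- A Reader model (BERT in this notebook)
The question is sent to both models. The Retriever selects the documents most likely to contain the answer, and the Reader extracts the answer from that retrieved list of documents.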
!pip install cdqa
from tika import parser
from ast import literal_eval
from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model
import pandas as pd
import os, sys, re
In the cell below, we provide a list of URLs pointing to the PDFs to download.
Change this list as needed to download other PDFs.
url_list = [
    'https://docs.oracle.com/en/cloud/saas/analytics/fawag/administering-oracle-fusion-analytics-warehouse.pdf',
    'https://docs.oracle.com/en/cloud/saas/analytics/faiae/implementing-oracle-fusion-erp-analytics.pdf',
    'https://docs.oracle.com/en/cloud/saas/analytics/fawug/using-oracle-fusion-analytics-warehouse.pdf'
]
def download_pdf(url_list):
    import os
    import wget

    directory = './content/pdf/'
    # Create the target directory on first use
    if not os.path.exists(directory):
        os.makedirs(directory)
    # Download each PDF into the target directory
    for url in url_list:
        wget.download(url=url, out=directory)
    print('\nDownloaded PDF files...')
download_pdf(url_list)
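Re-running the cell above downloads every file again. As a minimal sketch of one way to avoid that, the variant below skips any URL whose file is already present. It uses the standard library's `urllib` instead of `wget`, and the helper names `local_path_for` and `download_missing` are hypothetical, not part of cdQA.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_path_for(url, directory="./content/pdf/"):
    """Build the local file path from the last segment of the URL path."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(directory, filename)


def download_missing(urls, directory="./content/pdf/"):
    """Download each URL unless a file with the same name already exists."""
    os.makedirs(directory, exist_ok=True)
    downloaded = []
    for url in urls:
        target = local_path_for(url, directory)
        if not os.path.exists(target):
            urlretrieve(url, target)  # network call, only for missing files
            downloaded.append(target)
    return downloaded


# Show where one of the PDFs from url_list would be stored (no download happens here)
print(local_path_for(
    "https://docs.oracle.com/en/cloud/saas/analytics/fawug/using-oracle-fusion-analytics-warehouse.pdf"
))
```

Calling `download_missing(url_list)` would then fetch only the PDFs that are not yet in `./content/pdf/` and return the paths it wrote.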
documents_as_dataframe = pdf_converter("./content/pdf", include_line_breaks=False, min_length=200)
def delete_pdf():
    import shutil

    directory = './content/pdf/'
    # Remove the downloaded PDFs now that their text is in the dataframe
    shutil.rmtree(directory)
delete_pdf()
len(documents_as_dataframe)
documents_as_dataframe.head()
download_model(model='bert-squad_1.1', dir='./models')
Below, we create an instance of the QAPipeline class, which does the following:
- Accepts the question posed by the user.
- Retrieves the documents most likely to contain the answer to the question. The Retriever passes these documents to the Reader.
- The Reader then finds the relevant paragraph among the retrieved documents and returns it as the prediction.
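To make the retrieve-then-read flow in the steps above concrete, here is a toy sketch in plain Python. It is not how cdQA works internally: the word-overlap scoring stands in for the real Retriever and for the BERT Reader, and the documents, titles, and function names are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, text):
    """Count how many question tokens (with multiplicity) also occur in the text."""
    q = Counter(tokenize(question))
    t = Counter(tokenize(text))
    return sum(min(count, t[token]) for token, count in q.items())


def toy_retrieve(question, documents, top_n=2):
    """Retriever step: keep the top_n documents with the highest overlap score."""
    ranked = sorted(documents, key=lambda d: overlap_score(question, d["text"]), reverse=True)
    return ranked[:top_n]


def toy_read(question, retrieved):
    """Reader step: return (sentence, title) for the best-matching sentence."""
    best, best_score = None, -1
    for doc in retrieved:
        for sentence in re.split(r"(?<=[.!?])\s+", doc["text"]):
            score = overlap_score(question, sentence)
            if score > best_score:
                best, best_score = (sentence, doc["title"]), score
    return best


# Invented toy corpus; the real notebook uses paragraphs extracted from the PDFs
docs = [
    {"title": "admin", "text": "Administrators manage users. They also schedule data loads."},
    {"title": "usage", "text": "Dashboards show charts. KPIs are measures that track business performance."},
    {"title": "setup", "text": "Install the client first. Then configure the connection."},
]
question = "What are KPIs?"
sentence, title = toy_read(question, toy_retrieve(question, docs))
print(title, "->", sentence)
```

The point of the two-stage design is that the cheap retrieval step narrows the search, so the more expensive reader only has to examine a few candidate documents.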
doc_qna_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)
doc_qna_pipeline.fit_retriever(df=documents_as_dataframe)
query = 'What are KPIs?'
prediction = doc_qna_pipeline.predict(query, n_predictions=1)
prediction[0]
print('The content is available in the doc -> {}'.format((prediction[0])[1]))
print('\n')
print('The content is -> {}'.format((prediction[0])[2]))
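When asking for more than one prediction, a small formatting helper keeps the output readable. The sketch below assumes each prediction is a tuple that starts with the answer, followed by the document title and the paragraph, which matches the indexes 1 and 2 used above. The helper name `describe_prediction` and the example tuple are hypothetical.

```python
def describe_prediction(prediction, max_chars=300):
    """Format one prediction tuple as readable text.

    Assumes the tuple starts with (answer, title, paragraph); extra items are ignored.
    """
    answer, title, paragraph = prediction[:3]
    # Shorten long paragraphs so several predictions fit on screen
    snippet = paragraph if len(paragraph) <= max_chars else paragraph[:max_chars] + "..."
    return "Answer: {}\nDocument: {}\nParagraph: {}".format(answer, title, snippet)


# Hypothetical stand-in for one item of the list returned by predict(...)
example = ("measures of performance", "using-guide", "KPIs are measures of performance.", 0.9)
print(describe_prediction(example))
```

With the pipeline above, this could be applied as `for p in doc_qna_pipeline.predict(query, n_predictions=3): print(describe_prediction(p))`.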