Extractive QA to build structured data

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

Traditional ETL/data parsing systems establish rules to extract information of interest. Regular expressions, string parsing and similar methods define fixed rules. This works in many cases, but unstructured data often contains numerous variations of the same information. Covering each variation requires another rule, and the rule set becomes cumbersome and hard to maintain over time.
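
To make the limitation concrete, here is a minimal sketch of the rule-based approach. The patterns and the `extract` helper are hypothetical and written only for illustration; they are not part of txtai.

```python
import re

# Hypothetical hand-written rules, one per field of interest
RULES = {
    "url": re.compile(r"https?://\S+"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract(text):
    """Apply each rule and return the first match per field, or None."""
    row = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        row[field] = match.group(0) if match else None
    return row

rows = [
    extract("Released on 6/03/2021"),
    extract("Release delayed until the 11th of August"),
    extract("The stock price fell to three dollars"),
]
```

The numeric date is found, but "the 11th of August" and "three dollars" match none of these patterns. Each such variation would need another rule.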

This notebook uses machine learning and extractive question-answering (QA) to draw on the knowledge built into large language models. These models have been trained on extremely large datasets and have learned many of the variations found in natural language.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Train a QA model with few-shot learning

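The code below trains a new QA model using a few examples. These examples give the model hints on the type of questions that will be asked and the type of answers to look for. Each question is paired with one context that contains the answer and one that does not (`"answers": None`), so the model also learns when to return nothing. Only six examples are used here.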

import pandas as pd
from txtai.pipeline import HFTrainer, Questions, Labels

# Training data for few-shot learning
data = [
    {"question": "What is the url?",
     "context": "Faiss (https://github.com/facebookresearch/faiss) is a library for efficient similarity search.",
     "answers": "https://github.com/facebookresearch/faiss"},
    {"question": "What is the url?", "context": "The last release was Wed Sept 25 2021", "answers": None},
    {"question": "What is the date?", "context": "The last release was Wed Sept 25 2021", "answers": "Wed Sept 25 2021"},
    {"question": "What is the date?", "context": "The order total comes to $44.33", "answers": None},
    {"question": "What is the amount?", "context": "The order total comes to $44.33", "answers": "$44.33"},
    {"question": "What is the amount?", "context": "The last release was Wed Sept 25 2021", "answers": None},
]

# Fine-tune QA model
trainer = HFTrainer()
model, tokenizer = trainer("distilbert-base-cased-distilled-squad", data, task="question-answering")
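
Extractive QA returns a span of the context, so every non-empty answer in the training data has to appear verbatim in its context. The sketch below is a sanity check that can be run before training. `check_examples` is a hypothetical helper, not a txtai function, and the -1 marker for no-answer rows is a convention chosen only for this example.

```python
def check_examples(rows):
    """Return (question, answer_start) per row; -1 marks a no-answer row."""
    checked = []
    for row in rows:
        answer = row["answers"]
        if answer is None:
            # Negative example: the context holds no answer to this question
            checked.append((row["question"], -1))
            continue
        start = row["context"].find(answer)
        if start < 0:
            raise ValueError(f"Answer not found in context: {answer!r}")
        checked.append((row["question"], start))
    return checked

# Two rows in the same shape as the training data above
sample = [
    {"question": "What is the amount?", "context": "The order total comes to $44.33", "answers": "$44.33"},
    {"question": "What is the date?", "context": "The order total comes to $44.33", "answers": None},
]
offsets = check_examples(sample)
```

A misspelled or paraphrased answer raises an error here instead of silently producing a bad training example.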

Parse data into a structured table

The next section takes a series of rows of text and runs a set of questions against each row. The answers are then used to build a pandas DataFrame.

# Input data
context = ["Released on 6/03/2021",
           "Release delayed until the 11th of August",
           "Documentation can be found here: neuml.github.io/txtai",
           "The stock price fell to three dollars",
           "Great day: closing price for March 23rd is $33.11, for details - https://finance.google.com"]

# Define column queries
queries = ["What is the url?", "What is the date?", "What is the amount?"]

# Extract fields
questions = Questions(path=(model, tokenizer), gpu=True)
results = [questions([question] * len(context), context) for question in queries]
results.append(context)

# Load into DataFrame
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Text"])

	URL	Date	Amount	Text
0	None	6/03/2021	None	Released on 6/03/2021
1	None	11th of August	None	Release delayed until the 11th of August
2	neuml.github.io/txtai	None	None	Documentation can be found here: neuml.github....
3	None	None	three dollars	The stock price fell to three dollars
4	https://finance.google.com	March 23rd	$33.11	Great day: closing price for March 23rd is $33...
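
The extraction step produces one list of answers per question, which is column-oriented, while the DataFrame is built from rows. The sketch below isolates that transpose step. `fake_extract` is a hypothetical stand-in for the QA pipeline so the example runs without a model.

```python
def fake_extract(question, texts):
    """Hypothetical stand-in for the QA pipeline: one answer (or None) per text."""
    keyword = {"What is the date?": "6/03/2021", "What is the amount?": "$33.11"}[question]
    return [keyword if keyword in text else None for text in texts]

texts = ["Released on 6/03/2021", "Closing price is $33.11"]
queries = ["What is the date?", "What is the amount?"]

# One list per question: each list is a column
columns = [fake_extract(query, texts) for query in queries]
columns.append(texts)

# zip(*columns) transposes the columns into one tuple per input text
rows = list(zip(*columns))
```

Each tuple in `rows` holds the answers for a single input text followed by the text itself, which is the shape `pd.DataFrame` expects when given a list of rows.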

Add additional columns

This method can be combined with other models to categorize, group or otherwise derive additional columns. The code below derives an additional sentiment column.

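# Add sentiment
labels = Labels(path="distilbert-base-uncased-finetuned-sst-2-english", dynamic=False)

# Each result is a list of (label id, score) pairs, best match first; id 1 is treated as positive
sentiment = ["POSITIVE" if x[0][0] == 1 else "NEGATIVE" for x in labels(context)]

# Insert the new column just before the trailing Text column
results.insert(len(results) - 1, sentiment)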

# Load into DataFrame
pd.DataFrame(list(zip(*results)), columns=["URL", "Date", "Amount", "Sentiment", "Text"])

	URL	Date	Amount	Sentiment	Text
0	None	6/03/2021	None	POSITIVE	Released on 6/03/2021
1	None	11th of August	None	NEGATIVE	Release delayed until the 11th of August
2	neuml.github.io/txtai	None	None	NEGATIVE	Documentation can be found here: neuml.github....
3	None	None	three dollars	NEGATIVE	The stock price fell to three dollars
4	https://finance.google.com	March 23rd	$33.11	POSITIVE	Great day: closing price for March 23rd is $33...
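
The same pattern applies to any derived column: compute one value per input text, then insert the list just before the trailing text column. The sketch below shows the mechanics with made-up data. The `scores` values are hypothetical and only mimic the shape the code above indexes with `x[0][0]`.

```python
# Hypothetical classifier output: a list of (label id, score) pairs per text, best first
scores = [
    [(1, 0.98), (0, 0.02)],
    [(0, 0.91), (1, 0.09)],
]

texts = ["text one", "text two"]

# Existing columns, with the original text kept as the last column
columns = [["a-url", None], texts]

# Map the top label id of each result to a readable name
sentiment = ["POSITIVE" if result[0][0] == 1 else "NEGATIVE" for result in scores]

# Insert before the last column so the text stays at the end
columns.insert(len(columns) - 1, sentiment)
```

Inserting at `len(columns) - 1` rather than appending keeps the column order aligned with a column name list that ends in "Text".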