27
loading...
This website collects cookies to deliver better user experience
txtai
and all dependencies. We will install the api, pipeline and workflow optional extras packages.pip install txtai[api,pipeline,similarity]
# Get CORD-19 metadata file
!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-11-01/metadata.csv
!head -1 metadata.csv > input.csv
!tail -10000 metadata.csv >> input.csv
url
column as the id and the title
column as the text column. The textcolumns parameter takes a list of columns to support indexing text content from multiple columns. from txtai.pipeline import Tabular
from txtai.workflow import Task, Workflow
# Create tabular instance mapping input.csv fields
tabular = Tabular("url", ["title"])
# Create workflow
workflow = Workflow([Task(tabular)])
# Print 5 rows of input.csv via workflow
list(workflow(["input.csv"]))[:5]
[('https://doi.org/10.1016/j.cmpb.2021.106469; https://www.ncbi.nlm.nih.gov/pubmed/34715516/',
'Computer simulation of the dynamics of a spatial susceptible-infected-recovered epidemic model with time delays in transmission and treatment.',
None),
('https://www.ncbi.nlm.nih.gov/pubmed/34232002/; https://doi.org/10.36849/jdd.5544',
'Understanding the Potential Role of Abrocitinib in the Time of SARS-CoV-2',
None),
('https://doi.org/10.1186/1471-2458-8-42; https://www.ncbi.nlm.nih.gov/pubmed/18234083/',
"Can the concept of Health Promoting Schools help to improve students' health knowledge and practices to combat the challenge of communicable diseases: Case study in Hong Kong?",
None),
('https://www.ncbi.nlm.nih.gov/pubmed/32983582/; https://www.sciencedirect.com/science/article/pii/S2095809920302514?v=s5; https://api.elsevier.com/content/article/pii/S2095809920302514; https://doi.org/10.1016/j.eng.2020.07.018',
'Buying time for an effective epidemic response: The impact of a public holiday for outbreak control on COVID-19 epidemic spread',
None),
('https://doi.org/10.1093/pcmedi/pbab016',
'The SARS-CoV-2 spike L452R-E484Q variant in the Indian B.1.617 strain showed significant reduction in the neutralization activity of immune sera',
None)]
from txtai.embeddings import Embeddings
# Embeddings with sentence-transformers backend
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/paraphrase-mpnet-base-v2"})
# Index subset of CORD-19 data
data = list(workflow(["input.csv"]))
embeddings.index(data)
for uid, _ in embeddings.search("insulin"):
title = [text for url, text, _ in data if url == uid][0]
print(title, uid)
Importance of diabetes management during the COVID-19 pandemic. https://doi.org/10.1080/00325481.2021.1978704; https://www.ncbi.nlm.nih.gov/pubmed/34602003/
Position Statement on How to Manage Patients with Diabetes and COVID-19 https://www.ncbi.nlm.nih.gov/pubmed/33442169/; https://doi.org/10.15605/jafes.035.01.03
Successful blood glucose management of a severe COVID-19 patient with diabetes: A case report https://www.ncbi.nlm.nih.gov/pubmed/32590779/; https://doi.org/10.1097/md.0000000000020844
insulin
. The top results mention diabetes and blood glucose which are a closely associated terms for diabetes.from txtai.workflow import ServiceTask
service = ServiceTask(url="https://hn.algolia.com/api/v1/search", method="get", params={"tags": None}, batch=False, extract="hits")
workflow = Workflow([service])
list(workflow(["front_page"]))[4]
{'_highlightResult': {'author': {'matchLevel': 'none',
'matchedWords': [],
'value': 'makerdiety'},
'title': {'matchLevel': 'none',
'matchedWords': [],
'value': 'Tips For Making a Popular Open Source Project in 2021'},
'url': {'matchLevel': 'none',
'matchedWords': [],
'value': 'https://skerritt.blog/make-popular-open-source-projects/'}},
'_tags': ['story', 'author_makerdiety', 'story_29197806', 'front_page'],
'author': 'makerdiety',
'comment_text': None,
'created_at': '2021-11-12T10:06:45.000Z',
'created_at_i': 1636711605,
'num_comments': 53,
'objectID': '29197806',
'parent_id': None,
'points': 138,
'story_id': None,
'story_text': None,
'story_title': None,
'story_url': None,
'title': 'Tips For Making a Popular Open Source Project in 2021',
'url': 'https://skerritt.blog/make-popular-open-source-projects/'}
url
will be used as the id column and title
as the text to index.from txtai.workflow import Task
# Recreate service applying the tabular pipeline to each result
service = ServiceTask(action=tabular, url="https://hn.algolia.com/api/v1/search", method="get", params={"tags": None}, batch=False, extract="hits")
workflow = Workflow([service])
list(workflow(["front_page"]))[4]
('https://skerritt.blog/make-popular-open-source-projects/',
'Tips For Making a Popular Open Source Project in 2021',
None)
# Embeddings with sentence-transformers backend
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/paraphrase-mpnet-base-v2"})
# Index Hacker News front page
data = list(workflow(["front_page"]))
embeddings.index(data)
for uid, _ in embeddings.search("programming"):
title = [text for url, text, _ in data if url == uid][0]
print(title, uid)
A guide to organizing settings in Django https://apibakery.com/blog/django-settings-howto/
Useful sed scripts and patterns https://github.com/adrianscheff/useful-sed
Tips For Making a Popular Open Source Project in 2021 https://skerritt.blog/make-popular-open-source-projects/
service = ServiceTask(url="http://export.arxiv.org/api/query", method="get", params={"search_query": None, "max_results": 25}, batch=False, extract=["feed", "entry"])
workflow = Workflow([service])
list(workflow(["all:aliens"]))[:1]
[OrderedDict([('id', 'http://arxiv.org/abs/2102.01522v3'),
('updated', '2021-09-06T14:18:23Z'),
('published', '2021-02-01T18:27:12Z'),
('title',
'If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare'),
('summary',
"If life on Earth had to achieve n 'hard steps' to reach humanity's level,\nthen the chance of this event rose as time to the n-th power. Integrating this\nover habitable star formation and planet lifetime distributions predicts >99%\nof advanced life appears after today, unless n<3 and max planet duration\n<50Gyr. That is, we seem early. We offer this explanation: a deadline is set by\n'loud' aliens who are born according to a hard steps power law, expand at a\ncommon rate, change their volumes' appearances, and prevent advanced life like\nus from appearing in their volumes. 'Quiet' aliens, in contrast, are much\nharder to see. We fit this three-parameter model of loud aliens to data: 1)\nbirth power from the number of hard steps seen in Earth history, 2) birth\nconstant by assuming a inform distribution over our rank among loud alien birth\ndates, and 3) expansion speed from our not seeing alien volumes in our sky. We\nestimate that loud alien civilizations now control 40-50% of universe volume,\neach will later control ~10^5 - 3x10^7 galaxies, and we could meet them in\n~200Myr - 2Gyr. If loud aliens arise from quiet ones, a depressingly low\ntransition chance (~10^-4) is required to expect that even one other quiet\nalien civilization has ever been active in our galaxy. Which seems bad news for\nSETI. But perhaps alien volume appearances are subtle, and their expansion\nspeed lower, in which case we predict many long circular arcs to find in our\nsky."),
('author',
[OrderedDict([('name', 'Robin Hanson')]),
OrderedDict([('name', 'Daniel Martin')]),
OrderedDict([('name', 'Calvin McCarter')]),
OrderedDict([('name', 'Jonathan Paulson')])]),
('arxiv:comment',
OrderedDict([('@xmlns:arxiv', 'http://arxiv.org/schemas/atom'),
('#text', 'To appear in Astrophysical Journal')])),
('link',
[OrderedDict([('@href', 'http://arxiv.org/abs/2102.01522v3'),
('@rel', 'alternate'),
('@type', 'text/html')]),
OrderedDict([('@title', 'pdf'),
('@href', 'http://arxiv.org/pdf/2102.01522v3'),
('@rel', 'related'),
('@type', 'application/pdf')])]),
('arxiv:primary_category',
OrderedDict([('@xmlns:arxiv', 'http://arxiv.org/schemas/atom'),
('@term', 'q-bio.OT'),
('@scheme', 'http://arxiv.org/schemas/atom')])),
('category',
[OrderedDict([('@term', 'q-bio.OT'),
('@scheme', 'http://arxiv.org/schemas/atom')]),
OrderedDict([('@term', 'physics.pop-ph'),
('@scheme',
'http://arxiv.org/schemas/atom')])])])]
id
will be used as the id column and title
as the text to index.from txtai.workflow import Task
# Create tablular pipeline with new mapping
tabular = Tabular("id", ["title"])
# Recreate service applying the tabular pipeline to each result
service = ServiceTask(action=tabular, url="http://export.arxiv.org/api/query", method="get", params={"search_query": None, "max_results": 25}, batch=False, extract=["feed", "entry"])
workflow = Workflow([service])
list(workflow(["all:aliens"]))[:1]
[('http://arxiv.org/abs/2102.01522v3',
'If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare',
None)]
# Embeddings with sentence-transformers backend
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/paraphrase-mpnet-base-v2"})
# Index Hacker News front page
data = list(workflow(["all:aliens"]))
embeddings.index(data)
for uid, _ in embeddings.search("alien radio signals"):
title = [text for url, text, _ in data if url == uid][0]
print(title, uid)
Calculating the probability of detecting radio signals from alien
civilizations http://arxiv.org/abs/0707.0011v2
Field Trial of Alien Wavelengths on GARR Optical Network http://arxiv.org/abs/1805.04278v1
Do alien particles exist, and can they be detected? http://arxiv.org/abs/1606.07403v1
# Index settings
writable: true
embeddings:
path: sentence-transformers/nli-mpnet-base-v2
# Tabular pipeline
tabular:
idcolumn: id
textcolumns:
- title
# Workflow definitions
workflow:
index:
tasks:
- task: service
action: tabular
url: http://export.arxiv.org/api/query?max_results=25
method: get
params:
search_query: null
batch: false
extract: [feed, entry]
- action: upsert
!killall -9 uvicorn
!CONFIG=workflow.yml nohup uvicorn "txtai.api:app" &> api.log &
!sleep 30
!cat api.log
INFO: Started server process [921]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
# Execute workflow via API call
!curl -X POST "http://localhost:8000/workflow" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"name\":\"index\",\"elements\":[\"all:aliens\"]}"
[["http://arxiv.org/abs/2102.01522v3","If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare",null],["http://arxiv.org/abs/cs/0306071v1","AliEnFS - a Linux File System for the AliEn Grid Services",null],["http://arxiv.org/abs/physics/0306103v1","AliEn - EDG Interoperability in ALICE",null],["http://arxiv.org/abs/2103.05559v1","Oumuamua Is Not a Probe Sent to our Solar System by an Alien\n Civilization",null],["http://arxiv.org/abs/1403.3979v1","Robust transitivity and density of periodic points of partially\n hyperbolic diffeomorphisms",null],["http://arxiv.org/abs/1712.09210v1","Sampling alien species inside and outside protected areas: does it\n matter?",null],["http://arxiv.org/abs/cs/0306067v1","The AliEn system, status and perspectives",null],["http://arxiv.org/abs/0707.0011v2","Calculating the probability of detecting radio signals from alien\n civilizations",null],["http://arxiv.org/abs/1805.04278v1","Field Trial of Alien Wavelengths on GARR Optical Network",null],["http://arxiv.org/abs/1808.00529v1","Open Category Detection with PAC Guarantees",null],["http://arxiv.org/abs/1206.3640v1","The Study of Climate on Alien Worlds",null],["http://arxiv.org/abs/1203.6805v2","Aliens on Earth. Are reports of close encounters correct?",null],["http://arxiv.org/abs/1604.05078v1","The Imprecise Search for Habitability",null],["http://arxiv.org/abs/1006.2613v1","Resurgence, Stokes phenomenon and alien derivatives for level-one linear\n differential systems",null],["http://arxiv.org/abs/1705.03394v1","That is not dead which can eternal lie: the aestivation hypothesis for\n resolving Fermi's paradox",null],["http://arxiv.org/abs/1701.02294v1","Alien Calculus and non perturbative effects in Quantum Field Theory",null],["http://arxiv.org/abs/1307.0653v1","General and alien solutions of a functional equation and of a functional\n inequality",null],["http://arxiv.org/abs/1801.06180v1","Are Alien Civilizations Technologically Advanced?",null],["http://arxiv.org/abs/1902.05387v1","Simultaneous x, y Pixel Estimation and Feature Extraction for Multiple\n Small Objects in a Scene: A Description of the ALIEN Network",null],["http://arxiv.org/abs/0711.4034v1","The q-analogue of the wild fundamental group (II)",null],["http://arxiv.org/abs/astro-ph/0501119v1","Expanding advanced civilizations in the universe",null],["http://arxiv.org/abs/cs/0306068v1","AliEn Resource Brokers",null],["http://arxiv.org/abs/hep-ph/9403231v2","The Renormalization of Composite Operators in Yang-Mills Theories Using\n General Covariant Gauge",null],["http://arxiv.org/abs/1410.0501v1","Alienation in Italian cities. Social network fragmentation from\n collective data",null],["http://arxiv.org/abs/1606.07403v1","Do alien particles exist, and can they be detected?",null]]
# Run a search
!curl -X GET "http://localhost:8000/search?query=radio&limit=3" -H "accept: application/json"
[{"id":"http://arxiv.org/abs/0707.0011v2","score":0.40350067615509033},{"id":"http://arxiv.org/abs/1805.04278v1","score":0.34062114357948303},{"id":"http://arxiv.org/abs/1902.05387v1","score":0.22262515127658844}]
# Index settings
writable: true
embeddings:
path: sentence-transformers/nli-mpnet-base-v2
# Tabular pipeline
tabular:
idcolumn: id
textcolumns:
- title
# Translation pipeline
translation:
# Workflow definitions
workflow:
index:
tasks:
- task: service
action: tabular
url: http://export.arxiv.org/api/query?max_results=25
method: get
params:
search_query: null
batch: false
extract: [feed, entry]
- action: translation
args: [fr]
- action: upsert
!killall -9 uvicorn
!CONFIG=workflow.yml nohup uvicorn "txtai.api:app" &> api.log &
!sleep 30
!cat api.log
INFO: Started server process [945]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
# Execute workflow via API call
!curl -s -X POST "http://localhost:8000/workflow" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"name\":\"index\",\"elements\":[\"all:aliens\"]}" > /dev/null
# Run a search
!curl -X GET "http://localhost:8000/search?query=radio&limit=3" -H "accept: application/json"
[{"id":"http://arxiv.org/abs/0707.0011v2","score":0.5328004956245422},{"id":"http://arxiv.org/abs/0711.4034v1","score":0.2441330999135971},{"id":"http://arxiv.org/abs/2102.01522v3","score":0.22881504893302917}]
import yaml
from txtai.api import API
with open("workflow.yml") as config:
workflow = yaml.safe_load(config)
app = API(workflow)
# Run the workflow
data = list(app.workflow("index", ["all:aliens"]))
# Run a search
for result in app.search("radio", None):
text = [row[1] for row in data if row[0] == result["id"]][0]
print(result["id"], result["score"], text)
http://arxiv.org/abs/0707.0011v2 0.5328004956245422 Calcul de la probabilité de détection des signaux radio de l'étrangercivilisations
http://arxiv.org/abs/0711.4034v1 0.2441330999135971 Le q-analogue du groupe fondamental sauvage (II)
http://arxiv.org/abs/2102.01522v3 0.22881504893302917 Si les étrangers louds expliquent le début de l'humanité, les étrangers tranquilles sont aussi rares
http://arxiv.org/abs/physics/0306103v1 0.21307508647441864 Alien - EDG Interopérabilité en ALICE
http://arxiv.org/abs/1006.2613v1 0.19786792993545532 Résurgence, phénomène Stokes et dérivés extraterrestres pour le niveau 1 linéairesystèmes différentiels
http://arxiv.org/abs/1403.3979v1 0.1915999799966812 Transitivité robuste et densité des points périodiques en partiedifféomorphismes hyperboliques
http://arxiv.org/abs/1805.04278v1 0.19029255211353302 Essai sur le terrain des longueurs d'onde aliens sur le réseau optique GARR
http://arxiv.org/abs/1606.07403v1 0.17492449283599854 Les particules exotiques existent-elles et peuvent-elles être détectées?
http://arxiv.org/abs/1902.05387v1 0.17426751554012299 Simultanée x, y Pixel Estimation et Extraction de Caractéristiques pour MultiplePetits objets dans une scène : une description du réseau ALIEN
http://arxiv.org/abs/cs/0306071v1 0.17286795377731323 AliEnFS - un système de fichiers Linux pour les services AliEn Grid