20
loading...
This website collects cookies to deliver better user experience
postgres=# create table demo
(id bigint primary key, description text)
USING HEAP;
CREATE TABLE
postgres=# create index demo_index on demo
( length(description) );
CREATE INDEX
postgres=# create index demo_gin on demo
USING GIN
( (to_tsvector('simple',description)) );
CREATE INDEX
postgres=# select relname,reltype,amname
from pg_class left outer join pg_am
on pg_class.relam=pg_am.oid
where relname like 'demo%';
relname | reltype | amname
------------+---------+--------
demo | 16918 | heap
demo_gin | 0 | gin
demo_index | 0 | btree
demo_pkey | 0 | btree
(4 rows)
USING
is visible as Access Method, which is the the name of the extensibility layer for different storage. The default for PostgreSQL is HEAP tables, BTREE indexes, and GIN can be used for text search.simple
text search configuration. You need to specify id to get a deterministic result. If not, you will get: ERROR: functions in index expression must be marked IMMUTABLE
.yugabyte=# create table demo
(id bigint primary key, description text)
;
CREATE TABLE
yugabyte=# create index demo_index on demo
( length(description) );
CREATE INDEX
yugabyte=# create index demo_gin on demo
USING GIN
( (to_tsvector('simple',description)) );
NOTICE: replacing access method "gin" with "ybgin"
CREATE INDEX
yugabyte=# select relname,reltype,amname
from pg_class left outer join pg_am
on pg_class.relam=pg_am.oid
where relname like 'demo%';
relname | reltype | amname
------------+---------+--------
demo | 16805 |
demo_gin | 0 | ybgin
demo_index | 0 | lsm
demo_pkey | 0 | lsm
(4 rows)
USING GIN
clause is transformed to USING YBGIN
, which is the YugabyteDB implementation of it.https://github.com/credativ/omdb-postgresql
project has a procedure to load into PostgreSQL. I can use the same to load into YugabyteDB, but this is not optimal. It is better to define the PRIMARY in the CREATE TABLE statement rather than ALTER TABLE later. What I did was load into PostgreSQL in order to export with pg_dump:git clone https://github.com/credativ/omdb-postgresql.git
cd omdb-postgresql
./download
./import
pg_dump -f omdb.sql omdb
awk '
/^ALTER TABLE ONLY/{last_alter_table=$NF}
/^ *ADD CONSTRAINT .* PRIMARY KEY /{sub(/ADD /,"");sub(/;$/,"");pk[last_alter_table]=$0",";$0=$0"\\r"}
NR > FNR && /^CREATE TABLE/{ print $0,pk[$3] > "omdb-pk.sql" ; next} NR > FNR { print > "omdb-pk.sql" }
' omdb.sql omdb.sql
movies
with movies_abstract_en
and adds the text search vector of words from the abstract (to_tsvector('simple',abstract)
):create or replace view movies_search as
select *
from (
select id as movie_id, name as movie_name from movies
) movies
natural join
(
select movie_id, abstract as movie_abstract,
to_tsvector('simple',abstract) as movie_abstract_ts
from movie_abstracts_en
) movie_abstracts;
websearch_to_tsquery
in order to query it with a google-like syntax:create or replace function find_movie(text)
returns setof movies_search as $sql$
select * from movies_search
where websearch_to_tsquery('simple',$1)
@@ movie_abstract_ts;
$sql$ language sql;
yugabyte=# select * from find_movie(
'Luke and Leia and "George Lucas"
');
movie_id|movie_name |movie_abstract |movie_abstract_ts |
-------------+----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1891|Star Wars: Episode V - The Empire Strikes Back|The Empire Strikes Back is considered the most morally and emotionally complex of the original Star Wars trilogy, continuing creator George Lucas's epic saga where Star Wars: Episode IV - A New Hope left off. Masterful storytelling weaves together multipl|'a':31 'against':46 'and':10,48,59 'archetypal':41 'as':50 'attempts':53 'back':4 'capture':55 'complex':12 'considered':6 'continuing':19 'creator':20 'desperately':52 'emotionally':11 'empire':2 'epic':24 'episode':29 'for':57 'george':21 'han':47 'he':|
10|Star Wars |A series of six films from the Director, Screenwriter and Producer George Lucas. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are now house hold names from one of the most successful film projects of all ti|'a':1 'all':43 'and':10,22 'are':29 'c3po':20 'characters':25 'darth':18 'director':8 'film':28,40 'films':5 'from':6,26,34 'george':12 'hold':32 'house':31 'leia':17 'lucas':13 'luke':14 'many':23 'most':38 'names':33 'now':30 'of':3,36,42 'one':35 'othe|
11|Star Wars: Episode IV – A New Hope |A New Hope was the first Star Wars film from the director, screenwriter, and producer George Lucas, although it is the fourth episode in the series of six. Luke Skywalker, Princes Leia, Darth Vader, C3PO, R2D2 and many other characters from the film are n|'a':1 'all':59 'although':18 'and':14,37 'are':44 'c3po':35 'characters':40 'darth':33 'director':12 'episode':23 'film':9,43,56 'first':6 'fourth':22 'from':10,41,50 'george':16 'hold':48 'hope':3 'house':47 'house-hold-names':46 'in':24 'is':20 'it':19 |
yugabyte=# select websearch_to_tsquery('simple',
'Luke and Leia and "George Lucas"'
);
websearch_to_tsquery
-------------------------------------------------------------
'luke' & 'and' & 'leia' & 'and' & 'george' <-> 'lucas'
(1 row)
<->
is for consecutive words and you can use <3>
to accept them spaced by 2 words). The goal of this post is not to go into all details as it is documented in many places. For common searches, the websearch_to_tsquery
syntax is probably easier.yugabyte=# explain (analyze, verbose)
select * from movies_search where
websearch_to_tsquery('simple',
'Luke and Leia and "George Lucas"'
) @@ movie_abstract_ts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..716.39 rows=1000 width=104) (actual time=20.359..79.051 rows=3 loops=1)
Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
Inner Unique: true
-> Seq Scan on public.movie_abstracts_en (cost=0.00..352.50 rows=1000 width=40) (actual time=19.203..76.525 rows=3 loops=1)
Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
Filter: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
Rows Removed by Filter: 2683
-> Index Scan using movies_pkey on public.movies (cost=0.00..0.11 rows=1 width=40) (actual time=0.766..0.766 rows=1 loops=3)
Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
Index Cond: (movies.id = movie_abstracts_en.movie_id)
Planning Time: 2.722 ms
Execution Time: 79.153 ms
(12 rows)
::tsquery @@ to_tsvector
condition is a Filter after Seq Scan
. A regular index would not help. This is where GIN indexes come into play as they can index the array of words. yugabyte=# create index movie_abstracts_en_ts_vector
on movie_abstracts_en
using ybgin (
( to_tsvector('pg_catalog.simple',abstract) )
);
CREATE INDEX
Index Cond
:QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=24.00..395.90 rows=1000 width=104) (actual time=3.729..5.030 rows=3 loops=1)
Output: movies.id, movies.name, movie_abstracts_en.abstract, to_tsvector('simple'::regconfig, movie_abstracts_en.abstract)
Inner Unique: true
-> Index Scan using movie_abstracts_en_ts_vector on public.movie_abstracts_en (cost=24.00..32.01 rows=1000 width=40) (actual time=3.215..3.486 rows=3 loops=1)
Output: movie_abstracts_en.movie_id, movie_abstracts_en.abstract
Index Cond: ('''luke'' & ''and'' & ''leia'' & ''and'' & ''george'' <-> ''lucas'''::tsquery @@ to_tsvector('simple'::regconfig, movie_abstracts_en.abstract))
Rows Removed by Index Recheck: 7
-> Index Scan using movies_pkey on public.movies (cost=0.00..0.11 rows=1 width=40) (actual time=0.453..0.453 rows=1 loops=3)
Output: movies.id, movies.name, movies.parent_id, movies.date, movies.series_id, movies.kind, movies.runtime, movies.budget, movies.revenue, movies.homepage, movies.vote_average, movies.votes_count
Index Cond: (movies.id = movie_abstracts_en.movie_id)
Planning Time: 4.134 ms
Execution Time: 5.290 ms
(12 rows)
movie_abstracts_en
10 have been selected by the index and filtered further (Rows Removed by Index Recheck: 7
) down to the resulting 3 (rows=3
)create index demo_trgm on movies using gin (name gin_trgm_ops);
select name from movies where name like '%hiker%';