PostgreSQL 全文检索加速快到没有朋友 - RUM索引接口(潘多拉魔盒)-白红宇

PostgreSQL 全文检索加速快到没有朋友 - RUM索引接口(潘多拉魔盒)

作者

digoal

日期

2016-10-19

背景

全文检索，模糊查询在现实的应用中用得非常多，特别是搜索引擎。

通常我们会想到使用搜索引擎来解决，但是需要考虑数据同步到搜索引擎，以及同步延迟，更新，一致性的问题。

并且使用搜索引擎我们还得多维护一个组件。

那么有没有更好的办法呢？

答案是有的，在PostgreSQL中，有内置的全文检索数据类型，以及全模糊查询的索引支持。

效率当然也是杠杠的，比如10亿的TOKEN检索，可以在毫秒级返回。

PostgreSQL 9.6在全文检索这块还做了更多的增强，比如RUM插件，被Oleg称为打开了潘多拉魔盒，在检索效率方面比GIN有极大的提升。

场景描述

我碰到过很多用户这样使用，用逗号将需要检索的元素分割开，当成字符串存储在数据库中，然后使用模糊查询的方法对数据进行检索。

create table test(c1 text);insert into test values ('1,100,2331,344,502,.........');insert .................

比如1000万条这样的记录，然后要根据元素组合进行查询。

select * from test where c1 like '%1%' or c1 like '%502%' and c1 like '%2331%';

这种查询效率非常低下，如果要做到毫秒级的返回，几乎不可想象。

PostgreSQL 数组类型

其实以上场景，在PostgreSQL中，可以使用数组类型来满足。

create table arr_test(c1 int[]);create index idx_arr_test on arr_test using gin(c1);insert into arr_test values(array[1,100,2331,344,502,......]);......

PostgreSQL 数组支持GIN索引，可以实现快速的检索。

例如在1000万记录中检索包含1或2的记录。

postgres=# explain analyze select * from arr_test where c1 && array[1,2] order by c1 offset 19000 limit 100;                                                                QUERY PLAN                                                                 ------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=112837.69..112837.94 rows=100 width=424) (actual time=91.440..91.475 rows=100 loops=1)   ->  Sort  (cost=112790.19..113039.57 rows=99750 width=424) (actual time=82.915..90.477 rows=19100 loops=1)         Sort Key: c1         Sort Method: external merge  Disk: 8440kB         ->  Bitmap Heap Scan on arr_test  (cost=816.06..93595.94 rows=99750 width=424) (actual time=9.180..37.380 rows=19925 loops=1)               Recheck Cond: (c1 && '{1,2}'::integer[])               Heap Blocks: exact=19605               ->  Bitmap Index Scan on idx_arr_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.196..5.196 rows=19925 loops=1)                     Index Cond: (c1 && '{1,2}'::integer[]) Planning time: 0.131 ms Execution time: 93.929 ms(11 rows)

PostgreSQL 全文检索类型

除了使用数组，PostgreSQL还支持全文检索类型，你可以存储为tsvector，使用tsquery进行查询。

postgres=# create table gin_test(c1 tsvector);CREATE TABLEpostgres=# create index idx_gin_test on gin_test using gin (c1) ;CREATE INDEX

全文检索类型同样支持索引，可以加速查询。

例如在1000万记录中检索包含1或2的记录。

潘多拉魔盒RUM

我们看到使用GIN索引时，扫描方式为BITMAP，所以有一个SORT的动作，这个在很大的LIST中是比较耗时的。

9.6的一个插件RUM索引接口，对全文检索的支持更加强大，不需要SORT，直接走INDEX SCAN的接口，也就是说RUM同时还实现了<=>即文本相似度的属性检索。

Oleg说RUM打开了潘多拉魔盒，除此之外9.6在全文检索方面还有极大的提升，9.6的release notes里也有重点说明，这使得PostgreSQL在文本检索能力方面又更加强大了。

忘掉搜索引擎吧，使用PostgreSQL。

测试RUM

postgres=# create table rum_test(c1 tsvector);CREATE TABLEpostgres=# CREATE INDEX rumidx ON rum_test USING rum (c1 rum_tsvector_ops);CREATE INDEX

性能指标 : 数组对比全文检索类型(GIN对比RUM索引)

下面对比一下数组GIN索引，全文检索类型GIN索引，全文检索类型RUM索引

表结构

postgres=# create table rum_test(c1 tsvector);CREATE TABLEpostgres=# create table gin_test(c1 tsvector);CREATE TABLEpostgres=# create table arr_test(c1 int[]);CREATE TABLE

插入1000万记录，每个字段100个随机值，相当于在10亿随机值中匹配。

$ vi test.sqlinsert into rum_test select to_tsvector(string_agg(c1::text,',')) from  (select (100000*random())::int from generate_series(1,100)) t(c1);$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 200000$ vi test.sqlinsert into gin_test select to_tsvector(string_agg(c1::text,',')) from  (select (100000*random())::int from generate_series(1,100)) t(c1);$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 200000$ vi test.sqlinsert into arr_test select array_agg(c1) from  (select (100000*random())::int from generate_series(1,100)) t(c1);$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 200000

创建索引

postgres=# set maintenance_work_mem ='64GB';SETpostgres=# CREATE INDEX rumidx ON rum_test USING rum (c1 rum_tsvector_ops);CREATE INDEXpostgres=# create index idx_gin_test on gin_test using gin (c1) ;CREATE INDEXpostgres=# create index idx_arr_test on arr_test using gin (c1) ;CREATE INDEX

查询效率对比

1. 查询包含1或2的记录

全文检索类型, rum索引postgres=# explain analyze select * from rum_test where c1 @@ to_tsquery('english','1 | 2');                                                            QUERY PLAN                                                            ---------------------------------------------------------------------------------------------------------------------------------- Index Scan using rumidx on rum_test  (cost=16.00..99121.61 rows=99749 width=1387) (actual time=6.403..24.981 rows=19840 loops=1)   Index Cond: (c1 @@ '''1'' | ''2'''::tsquery) Planning time: 0.075 ms Execution time: 26.086 ms(4 rows)全文检索类型, GIN索引postgres=# explain analyze select * from gin_test where c1 @@ to_tsquery('english','1 | 2');                                                          QUERY PLAN                                                           ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on gin_test  (cost=816.06..99386.94 rows=99750 width=1387) (actual time=9.551..34.121 rows=19847 loops=1)   Recheck Cond: (c1 @@ '''1'' | ''2'''::tsquery)   Heap Blocks: exact=19764   ->  Bitmap Index Scan on idx_gin_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.554..5.554 rows=19847 loops=1)         Index Cond: (c1 @@ '''1'' | ''2'''::tsquery) Planning time: 0.113 ms Execution time: 35.279 ms(7 rows)数组类型, GIN索引postgres=# explain analyze select * from arr_test where c1 && array[1,2];                                                          QUERY PLAN                                                           ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on arr_test  (cost=816.06..93595.94 rows=99750 width=424) (actual time=9.148..31.648 rows=19925 loops=1)   Recheck Cond: (c1 && '{1,2}'::integer[])   Heap Blocks: exact=19605   ->  Bitmap Index Scan on idx_arr_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.214..5.214 rows=19925 loops=1)         Index Cond: (c1 && '{1,2}'::integer[]) Planning time: 0.095 ms Execution time: 32.810 ms(7 rows)

2. 排序输出

全文检索类型, rum索引postgres=# explain analyze select * from rum_test where c1 @@ to_tsquery('english','1 | 2') order by c1 <=> to_tsquery('english','1 | 2') offset 19000 limit 100;                                                               QUERY PLAN                                                                ----------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=18988.45..19088.30 rows=100 width=1391) (actual time=58.912..59.165 rows=100 loops=1)   ->  Index Scan using rumidx on rum_test  (cost=16.00..99620.35 rows=99749 width=1391) (actual time=16.426..57.892 rows=19100 loops=1)         Index Cond: (c1 @@ '''1'' | ''2'''::tsquery)         Order By: (c1 <=> '''1'' | ''2'''::tsquery) Planning time: 0.133 ms Execution time: 59.220 ms(6 rows)全文检索类型, GIN索引postgres=# explain analyze select * from gin_test where c1 @@ to_tsquery('english','1 | 2') order by c1 offset 19000 limit 100;                                                                QUERY PLAN                                                                 ------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=176684.69..176684.94 rows=100 width=1387) (actual time=117.809..117.865 rows=100 loops=1)   ->  Sort  (cost=176637.19..176886.57 rows=99750 width=1387) (actual time=94.889..116.929 rows=19100 loops=1)         Sort Key: c1         Sort Method: external merge  Disk: 26968kB         ->  Bitmap Heap Scan on gin_test  (cost=816.06..99386.94 rows=99750 width=1387) (actual time=9.625..38.336 rows=19847 loops=1)               Recheck Cond: (c1 @@ '''1'' | ''2'''::tsquery)               Heap Blocks: exact=19764               ->  Bitmap Index Scan on idx_gin_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.610..5.610 rows=19847 loops=1)                     Index Cond: (c1 @@ '''1'' | ''2'''::tsquery) Planning time: 0.134 ms Execution time: 126.122 ms(11 rows)数组类型, GIN索引postgres=# explain analyze select * from arr_test where c1 && array[1,2] order by c1 offset 19000 limit 100;                                                                QUERY PLAN                                                                 ------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=112837.69..112837.94 rows=100 width=424) (actual time=90.619..90.656 rows=100 loops=1)   ->  Sort  (cost=112790.19..113039.57 rows=99750 width=424) (actual time=82.067..89.622 rows=19100 loops=1)         Sort Key: c1         Sort Method: external merge  Disk: 8440kB         ->  Bitmap Heap Scan on arr_test  (cost=816.06..93595.94 rows=99750 width=424) (actual time=9.087..36.870 rows=19925 loops=1)               Recheck Cond: (c1 && '{1,2}'::integer[])               Heap Blocks: exact=19605               ->  Bitmap Index Scan on idx_arr_test  (cost=0.00..791.12 rows=99750 width=0) (actual time=5.138..5.138 rows=19925 loops=1)                     Index Cond: (c1 && '{1,2}'::integer[]) Planning time: 0.122 ms Execution time: 93.057 ms(11 rows)

RUM 附加能力

rum检索支持近似度排行，这个在搜索应用中太有用了。

通过相似度分值表示文本和检索条件的相似度。

// 分词举例postgres=#  select * from to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造');                                   to_tsvector                                    ---------------------------------------------------------------------------------- '中国科学院':5 '小明':1 '日本京都大学':10 '毕业':3 '深造':11 '硕士':2 '计算所':6(1 row)// 有相似度postgres=#  select * from rum_ts_distance(to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造') , to_tsquery('计算所')); rum_ts_distance -----------------         16.4493(1 row)// 没有相似度postgres=#  select * from rum_ts_distance(to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造') , to_tsquery('计算')); rum_ts_distance -----------------        Infinity(1 row)// 或相似度postgres=# select * from rum_ts_distance(to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造') , to_tsquery('计算所 | 硕士')); rum_ts_distance -----------------         8.22467(1 row)// 与相似度postgres=# select * from rum_ts_distance(to_tsvector('jiebacfg', '小明硕士毕业于中国科学院计算所，后在日本京都大学深造') , to_tsquery('计算所 & 硕士')); rum_ts_distance -----------------         32.8987(1 row)// 排序postgres=# create table test15(c1 tsvector);CREATE TABLEpostgres=# insert into test15 values (to_tsvector('jiebacfg', 'hello china, i''m digoal')), (to_tsvector('jiebacfg', 'hello world, i''m postgresql')), (to_tsvector('jiebacfg', 'how are you, i''m digoal'));INSERT 0 3postgres=# select * from test15;                         c1                          ----------------------------------------------------- ' ':2,5,9 'china':3 'digoal':10 'hello':1 'm':8 ' ':2,5,9 'hello':1 'm':8 'postgresql':10 'world':3 ' ':2,4,7,11 'digoal':12 'm':10(3 rows)postgres=# create index idx_test15 on test15 using rum(c1 rum_tsvector_ops);CREATE INDEXpostgres=# select *,c1 <=> to_tsquery('hello') from test15;                         c1                          | ?column? -----------------------------------------------------+---------- ' ':2,5,9 'china':3 'digoal':10 'hello':1 'm':8     |  16.4493 ' ':2,5,9 'hello':1 'm':8 'postgresql':10 'world':3 |  16.4493 ' ':2,4,7,11 'digoal':12 'm':10                     | Infinity(3 rows)postgres=# explain select *,c1 <=> to_tsquery('postgresql') from test15 order by c1 <=> to_tsquery('postgresql');                                   QUERY PLAN                                   -------------------------------------------------------------------------------- Index Scan using idx_test15 on test15  (cost=3600.25..3609.06 rows=3 width=36)   Order By: (c1 <=> to_tsquery('postgresql'::text))(2 rows)

小结

正如Oleg说的，RUM非常强大，支持相似度检索，支持非BITMAP scan，从查询效率来看，已经比GIN以及单纯的数组查询效率高出1倍。

忘掉搜索引擎，使用PostgreSQL全文检索吧。

分词方面，PG支持的中文分词插件也很多，例如结巴分词，ZHPARSER。