Fuzzy text search with trigrams in postgresql

Using postgres for fuzzy searches.

We’ll try to match arbitrary beer names to ones in our DB. Let’s create table first:

postgres=# \l
List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 db        | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres         +
           |          |          |             |             | postgres=CTc/postgres+
           |          |          |             |             | pg=CTc/postgres

postgres=# \connect db

db=# create table test (brewery varchar(255), name varchar(255), cleaned varchar(255));

Is the test really there?

db=# \d test
Table "public.test"
 Column  |          Type          | Modifiers
---------+------------------------+-----------
 brewery | character varying(255) |
 name    | character varying(255) |
 cleaned | character varying(255) |

Let’s add some sample data:

db=# insert into test values ('Matuska', 'Raptor', 'matuska raptor'), ('Matuska', 'Zlatá Raketa', 'matuska zlata raketa'), ('Matuska', 'Černá Raketa', 'matuska cerna raketa'), ('Matuska', 'California', 'matuska california');
INSERT 0 4

db=# select * from test;
 brewery |     name     |       cleaned
---------+--------------+----------------------
 Matuska | Raptor       | matuska raptor
 Matuska | Zlatá Raketa | matuska zlata raketa
 Matuska | Černá Raketa | matuska cerna raketa
 Matuska | California   | matuska california
(4 rows)

Search time!

db=# CREATE EXTENSION pg_trgm;

db=# select *, similarity(cleaned, 'zlata raketa matuska') as sim from test where cleaned % 'zlata raketa matuska' order by sim desc;
 brewery |     name     |       cleaned        |   sim
---------+--------------+----------------------+----------
 Matuska | Zlatá Raketa | matuska zlata raketa |        1
 Matuska | Černá Raketa | matuska cerna raketa | 0.576923
 Matuska | Raptor       | matuska raptor       |      0.4
(3 rows)

db=# select *, similarity(cleaned, 'raketa matuska') as sim from test where cleaned % 'raketa matuska' order by sim desc;
 brewery |     name     |       cleaned        |   sim
---------+--------------+----------------------+----------
 Matuska | Zlatá Raketa | matuska zlata raketa |     0.75
 Matuska | Černá Raketa | matuska cerna raketa | 0.714286
 Matuska | Raptor       | matuska raptor       |      0.5
 Matuska | California   | matuska california   | 0.307692
(4 rows)

db=# select *, similarity(cleaned, 'raketa') as sim from test where cleaned % 'raketa' order by sim desc;
 brewery |     name     |       cleaned        |   sim
---------+--------------+----------------------+----------
 Matuska | Zlatá Raketa | matuska zlata raketa |     0.35
 Matuska | Černá Raketa | matuska cerna raketa | 0.333333
(2 rows)

As you can see, it works pretty good. If you change word order, it finds correct match. Once not clear, it offers closest solution.

How about typos?

db=# select *, similarity(cleaned, 'zlta rketa') as sim from test order by sim desc;
 brewery |     name     |       cleaned        |    sim
---------+--------------+----------------------+-----------
 Matuska | Zlatá Raketa | matuska zlata raketa |      0.25
 Matuska | Černá Raketa | matuska cerna raketa |  0.148148
 Matuska | Raptor       | matuska raptor       | 0.0416667
 Matuska | California   | matuska california   |         0
(4 rows)
comments powered by Disqus