trigram snippets

How to generate n-grams with Python and NLTK

Tagged nltk, ngram, bigram, trigram, word gram  Languages python
import nltk
from nltk.util import ngrams

def word_grams(words, min=1, max=4):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

print word_grams('one two three four'.split(' '))

Postgres substring search with trigrams

Tagged trigram, postgres, search, substring, unaccent, pgtrgm, pg_trgm  Languages sql

What we have is: * a table "companies" and a column "name" * Postgres and the pgtrgm extension * a company named "Berkshéiße"

Let's enable the trigram extension:

-- $ psql app_schema -U superuser
CREATE EXTENSION pg_trgm;

Now we can search for trigrams, including shèiß

-- $ psql app_schema -U user
SELECT * FROM companies WHERE name ~* 'shèiß';

The query returns nothing, so let's do this instead: * install the unaccent extension * create an "immutable unaccent" function * apply "unaccent" and "lower" to the query * apply "unaccent" and "lower" to the index

-- $ psql app_schema -U superuser
CREATE EXTENSION unaccent;
-- $ psql app_schema -U user
--DROP INDEX companies_name_search_idx;
CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text AS
$func$
SELECT unaccent('unaccent', $1)
$func$  LANGUAGE sql IMMUTABLE SET search_path = public, pg_temp;
CREATE INDEX companies_name_search_idx ON companies USING gin(f_unaccent(name) gin_trgm_ops);

Finally, the query returns what we're looking for:

-- Plain SQL
SELECT * FROM companies WHERE lower(f_unaccent(name)) LIKE ('%' || lower(f_unaccent('shèiß')) || '%');
-- With pg_trgrm syntax
SELECT * FROM companies WHERE lower(f_unaccent(name) ~* lower(f_unaccent('shéiße'));
-- Look, even this works
SELECT * FROM companies WHERE lower(f_unaccent(name) ~* lower(f_unaccent('shEiSe'));

If Postgres still doesn't use the index we created, it's probably because it's faster to scan the table than using the index.

Notes

* The WHERE-condition must match the index definition:

-- yes
lower(unaccent_text(name) ~* lower(unaccent_text('shéiße'))
-- no
name ~* lower(unaccent_text('shéiße'))
-- no
name ~* unaccent_text('shéiße')

References

* Erwin Brandstetter: Does PostgreSQL support “accent insensitive” collations? * Erwin Brandstetter: PostgreSQL accent + case insensitive search