Similarity search with Jaccard, Minhash and LSH

Tagged lsh, jaccard, minhash, similarity, randomness, probabilistic  Languages python

To find the similarity betwen two sets – for example, a document can be seen as a set of words — you can use these algorithms:

  • Jaccard similarity

Similarity is calculated as the size of the union of the two sets divided by the size of the intersection of the sets resulting in a number between zero and one. Zero means the sets are completely different and a result of one means they are completely similar.

This algorithm is slow because pairwise comparison is needed. For example, a naive approach could store the Jaccard similarity of 1 000 000 documents by calculating the similarity of each pair and storing the similarity in a database. This would require 1 000 000 * 1 000 000 database rows.

Running time: O(n log(n)) Space complexity: O(n log(n))

  • MinHash (with k hash functions)

MinHash leverages hashing and randomness, in a similar way as Bloom filters, to quickly and probabilistically estimate the Jaccard similarity. It does this by generating k signatures for each document.

Pseudo-code example:

a = minhash("some text...")  # signature = [1, 1, 1, 1]
b = minhash(“some text...”)  # signature = [1, 1, 1, 1]
c = minhash(“other text...”) # signature = [2, 1, 1, 1]

This algorithm is still slow because pairwise comparison is needed. We only managed to minimize the size of the two sets by using MinHash.

Because this is a probabilistic algorithm we need to account for errors. To achieve 90% accuracy with a 10% error rate, you would need 100 hash functions (1/sqrt(k)).

Hash collisions also need to be minimized.

  • Locality-sensitive hashing (LSH)

LSH is a way of optimizing MinHash by binning the many signatures generated with MinHash into buckets.

For example, LSH generates 20 signatures based on 200 hashes by breaking down the MinHash generated signatures into 20 bands – or buckets – each containing 10 MinHash signatures.

There will be false and true negatives.

LSH gives us sub-linear query time.

  • MinHash LSH Ensemble

Jaccard similarity works best for small datasets of similar size because the denominator is the union of the two sets.

LSH ensemble is one solution to this issue. For details, see:


Column finding algorithm

Tagged python, detection, column  Languages python

Given text containing left and right aligned data separated by one or more spaces this algorithm will try to detect the columns indices:

def find_columns(text):
    left_aligned_cols = {}
    right_aligned_cols = {}
    lines = text.split("\n")
    line_count = len(lines)

    for line in lines:
        prev_char = None
        # start-of-line is a column delimiter
        if '0' not in left_aligned_cols:
            left_aligned_cols['0'] = 0
        left_aligned_cols['0'] += 1
        for pos, char in enumerate(line):
            # left-aligned column
            if prev_char and prev_char.isspace() and not char.isspace():
                key = str(pos)
                if key not in left_aligned_cols:
                    left_aligned_cols[key] = 0
                left_aligned_cols[key] += 1
            # right-aligned column
            if prev_char and not prev_char.isspace() and char.isspace():
                key = str(pos + 1)
                if key not in right_aligned_cols:
                    right_aligned_cols[key] = 0
                right_aligned_cols[key] += 1
            prev_char = char
        # end-of-line is a column delimiter
        key = str(len(line))
        if key not in right_aligned_cols:
            right_aligned_cols[key] = 0
        right_aligned_cols[key] += 1

    left_aligned_cols = [ int(key) for (key, count) in left_aligned_cols.items() if count == line_count ]
    right_aligned_cols = [ int(key) for (key, count) in right_aligned_cols.items() if count == line_count ]
    column_indices = sorted(set(left_aligned_cols + right_aligned_cols))
    columns = list(map(list, zip(column_indices, column_indices[1:])))
    return columns

Fix "WARNING: Spring is running in production"

Tagged rails, ruby, spring, docker  Languages ruby

To fix “WARNING: Spring is running in production”, run the following in the project directory:

bundle config set with production
bundle config set without test development

For example, in a Dockerfile:

RUN bundle config set with production
RUN bundle config set without test development

Ruby's database_cleaner for Python

Tagged database_cleaner, database, python, test  Languages python

A simple version of Ruby s database_cleaner gem for Python:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    connect_args={"options": f"-c timezone=utc -csearch_path=public"}
Session = sessionmaker(bind=engine, autoflush=True)


import pytest
from db import Session, engine

# You can also use "autouse"
# @pytest.fixture(autouse=True)
def session():
    connection = engine.connect()
    session = Session(bind=connection)
    transaction = connection.begin()
        yield session


def test_something_with_session(session):
    # Session is automatically inserted by the fixture

Also see:

mDNS and DNS-SD example

Tagged mdns, dns, service discovery, dns-sd  Languages bash

To advertise the existence of an HTTP service with mDNS, use:

dns-sd -R "Service X" _http._tcp . 80 path=/

To find the HTTP services on the local network, run:

dns-sd -B _http._tcp

Browsing for _http._tcp
DATE: ---Wed 03 Mar 2021---
17:48:45.207  ...STARTING...
Timestamp     A/R    Flags  if Domain               Service Type         Instance Name
17:48:45.207  Add        2   9 local.               _http._tcp.          Service X

See /etc/services for a list of valid services.

Namespaced singular routes for uncountables in Rails

Tagged namespace, singular_route_key  Languages ruby

Correct(ing) singular route for an uncountable namespaced model in Rails:


$ rails console
> Namespace::SMS.model_name.singular_route_key => namespace_sm
> app.polymorphic_path(Namespace::SMS.first) => NoMethodError namespace_sm_path


ActiveSupport::Inflector.inflections(:en) do |inflect|
  inflect.uncountable %w(namespace_sms)


$ rails console
> Namespace::SMS.model_name.singular_route_key => namespace_sms

Table-driven Programming

Tagged table-driven, programming  Languages python

You can use if statements to run a function when the state of your application matches specific criteria.

However, if statements don’t scale as they are hard to understand and maintain:

if state_x and state_y and state_z:

if state_x and state_z:

if state_x and state_y and not state_z:

Table-driven programming is an alternative that sometimes is easier to maintain:

rules = (
    # x, y, z, function
    (True, True, True, do_xyz),
    (True, True, False, do_xy_not_z),
    (True, False, True, do_xz),
    (True, True, True, do_xz),
for rule in rules:
    rule_x = rule[0]
    rule_y = rule[1]
    rule_z = rule[2]
    doit = rule[3]
    if rule_x == state_x and rule_y == state_y and rule_z == state_z:

Or, more succinctly:

def matching_rules(rules, params):
    for criterion, func in rules:
        if all(params[ix] == criteria for ix, criteria in enumerate(criterion)):
            yield func

# The table of rules
rules = (
    # x, y, z, function
    ((True, True, True), do_xyz),
    ((True, True, False), do_xy_not_z),
    ((True, False, True), do_xz),
    (True, True, True, do_xz),
params = (state_x, state_y, state_z)
for func in matching_rules(rules, params):

In summary, a function is run only when the criteria match.

Pattern matching is also an alternative:

However, the first-to-match rule requires the order to be correct and prevents multiple function calls.

State machines and Prolog are also options…

SQLAlchemy's yield_for with raw SQL and models

Tagged sqlalchemy, yield_for, raw sql  Languages python
from sqlalchemy import text

session = ...
q = session.query(Report).from_statement(text("""
    id = '51812'
    id DESC;

Python's etree and namespaces

Tagged namespaces, etree, python  Languages python

How to parse XML with Python’s etree when XML has namespaces, even a default namespace:

from lxml import etree
doc = etree.XML(bytes(bytearray(xml, encoding='utf-8')))
ns = {}
for k in doc.nsmap.keys():
   ns[k] = doc.nsmap[k]
doc.find('.//tag', ns)
doc.findtext('./periodOfReport', namespaces=ns)

LOL, someone needs to clean up the API.