Mixing arguments and keywords in Ruby

Tagged arguments, options, ruby, splat, keywords  Languages ruby

Mixing arguments and keywords in Ruby:

def hello(*args, **keywords)
  { args: args, keywords: keywords }
end

Splat to the rescue:

* turns all arguments into an array.

** turns all keyword arguments into a hash.

This allows you to do the following:

hello(:one, :two, { three: :four })
# or, simply
hello(:one, :two, three: :four)

=> {:args=>[:one, :two], :keyword_args=>{:three=>:four}}

Readability is improved by using proper names:

def hello(name, **options)
  { name: name, options: options }
end

How to get cron to log to STDOUT under Docker and Kubernetes

Tagged cron, stdout, dockerfile, pid, stderr  Languages bash

Dockerfile

FROM python:3.9-slim-buster
...
COMMAND ["cron", "-f"]

In cron scripts, redirect the scripts’ output to the file descriptor of PID 1, which is cron (Dockerfile’s COMMAND):

# Redirects both stderr and stdout to stdout of PID 1:
run.sh &>> /proc/1/fd/1
# Redirects stderr and stdout to stdout and stderr of PID 1:
run.sh 1>> /proc/1/fd/1 2>> /proc/1/fd/2

Each PID (process) has it’s own file descriptors:

/proc/{PID}/fd/0 # STDIN
/proc/{PID}/fd/1 # STDOUT
/proc/{PID}/fd/2 # STDERR

Similarity search with Jaccard, Minhash and LSH

Tagged jaccard, lsh, minhash, similarity, randomness, probabilistic  Languages python

To find the similarity betwen two sets – for example, a document can be seen as a set of words — you can use these algorithms:

  • Jaccard similarity

Similarity is calculated as the size of the union of the two sets divided by the size of the intersection of the sets resulting in a number between zero and one. Zero means the sets are completely different and a result of one means they are completely similar.

This algorithm is slow because pairwise comparison is needed. For example, a naive approach could store the Jaccard similarity of 1 000 000 documents by calculating the similarity of each pair and storing the similarity in a database. This would require 1 000 000 * 1 000 000 database rows.

Running time: O(n log(n)) Space complexity: O(n log(n))

  • MinHash (with k hash functions)

MinHash leverages hashing and randomness, in a similar way as Bloom filters, to quickly and probabilistically estimate the Jaccard similarity. It does this by generating k signatures for each document.

Pseudo-code example:

a = minhash("some text...")  # signature = [1, 1, 1, 1]
b = minhash(“some text...”)  # signature = [1, 1, 1, 1]
c = minhash(“other text...”) # signature = [2, 1, 1, 1]

This algorithm is still slow because pairwise comparison is needed. We only managed to minimize the size of the two sets by using MinHash.

Because this is a probabilistic algorithm we need to account for errors. To achieve 90% accuracy with a 10% error rate, you would need 100 hash functions (1/sqrt(k)).

Hash collisions also need to be minimized.

  • Locality-sensitive hashing (LSH)

LSH is a way of optimizing MinHash by binning the many signatures generated with MinHash into buckets.

For example, LSH generates 20 signatures based on 200 hashes by breaking down the MinHash generated signatures into 20 bands – or buckets – each containing 10 MinHash signatures.

There will be false and true negatives.

LSH gives us sub-linear query time.

  • MinHash LSH Ensemble

Jaccard similarity works best for small datasets of similar size because the denominator is the union of the two sets.

LSH ensemble is one solution to this issue. For details, see: http://ekzhu.com/datasketch/lshensemble.html

References:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

http://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/notes/Min-Hash/index.html

https://cran.r-project.org/web/packages/textreuse/vignettes/textreuse-minhash.html

https://rdrr.io/cran/textreuse/man/lsh.html

https://maciejkula.github.io/2015/06/01/simple-minhash-implementation-in-python/

http://ekzhu.com/datasketch/lsh.html

https://medium.com/@bassimfaizal/finding-duplicate-questions-using-datasketch-2ae1f3d8bc5c

Column finding algorithm

Tagged detection, python, column  Languages python

Given text containing left and right aligned data separated by one or more spaces this algorithm will try to detect the columns indices:

def find_columns(text):
    left_aligned_cols = {}
    right_aligned_cols = {}
    lines = text.split("\n")
    line_count = len(lines)

    for line in lines:
        prev_char = None
        # start-of-line is a column delimiter
        if '0' not in left_aligned_cols:
            left_aligned_cols['0'] = 0
        left_aligned_cols['0'] += 1
        for pos, char in enumerate(line):
            # left-aligned column
            if prev_char and prev_char.isspace() and not char.isspace():
                key = str(pos)
                if key not in left_aligned_cols:
                    left_aligned_cols[key] = 0
                left_aligned_cols[key] += 1
            # right-aligned column
            if prev_char and not prev_char.isspace() and char.isspace():
                key = str(pos + 1)
                if key not in right_aligned_cols:
                    right_aligned_cols[key] = 0
                right_aligned_cols[key] += 1
            prev_char = char
        # end-of-line is a column delimiter
        key = str(len(line))
        if key not in right_aligned_cols:
            right_aligned_cols[key] = 0
        right_aligned_cols[key] += 1

    left_aligned_cols = [ int(key) for (key, count) in left_aligned_cols.items() if count == line_count ]
    right_aligned_cols = [ int(key) for (key, count) in right_aligned_cols.items() if count == line_count ]
    column_indices = sorted(set(left_aligned_cols + right_aligned_cols))
    columns = list(map(list, zip(column_indices, column_indices[1:])))
    return columns

Fix "WARNING: Spring is running in production"

Tagged rails, spring, docker, ruby  Languages ruby

To fix “WARNING: Spring is running in production”, run the following in the project directory:

bundle config set with production
bundle config set without test development

For example, in a Dockerfile:

RUN bundle config set with production
RUN bundle config set without test development

Ruby's database_cleaner for Python

Tagged database, test, python, database_cleaner  Languages python

A simple version of Ruby s database_cleaner gem for Python:

db.py:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    'postgresql+psycopg2://',
    connect_args={"options": f"-c timezone=utc -csearch_path=public"}
)
Session = sessionmaker(bind=engine, autoflush=True)

tests/conftest.py:

import pytest
from db import Session, engine

# You can also use "autouse"
# @pytest.fixture(autouse=True)
@pytest.fixture
def session():
    connection = engine.connect()
    session = Session(bind=connection)
    transaction = connection.begin()
    try:
        yield session
    finally:
        session.close()
        transaction.rollback()
        connection.close()

tests/test_something.py:

def test_something_with_session(session):
    # Session is automatically inserted by the fixture
end

Also see: https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#tracking-transaction-state-with-events

mDNS and DNS-SD example

Tagged service discovery, dns, dns-sd, mdns  Languages bash

To advertise the existence of an HTTP service with mDNS, use:

dns-sd -R "Service X" _http._tcp . 80 path=/

To find the HTTP services on the local network, run:

dns-sd -B _http._tcp

Browsing for _http._tcp
DATE: ---Wed 03 Mar 2021---
17:48:45.207  ...STARTING...
Timestamp     A/R    Flags  if Domain               Service Type         Instance Name
17:48:45.207  Add        2   9 local.               _http._tcp.          Service X

See /etc/services for a list of valid services.

Namespaced singular routes for uncountables in Rails

Tagged namespace, singular_route_key  Languages ruby

Correct(ing) singular route for an uncountable namespaced model in Rails:

Problem:

$ rails console
> Namespace::SMS.model_name.singular_route_key => namespace_sm
> app.polymorphic_path(Namespace::SMS.first) => NoMethodError namespace_sm_path

Solution:

ActiveSupport::Inflector.inflections(:en) do |inflect|
  inflect.uncountable %w(namespace_sms)
end

Voila:

$ rails console
> Namespace::SMS.model_name.singular_route_key => namespace_sms

Table-driven Programming

Tagged programming, table-driven  Languages python

You can use if statements to run a function when the state of your application matches specific criteria.

However, if statements don’t scale as they are hard to understand and maintain:

if state_x and state_y and state_z:
    do_xyz()

if state_x and state_z:
    do_xz()

if state_x and state_y and not state_z:
    do_xy_not_z()

Table-driven programming is an alternative that sometimes is easier to maintain:

rules = (
    # x, y, z, function
    (True, True, True, do_xyz),
    (True, True, False, do_xy_not_z),
    (True, False, True, do_xz),
    (True, True, True, do_xz),
)
for rule in rules:
    rule_x = rule[0]
    rule_y = rule[1]
    rule_z = rule[2]
    doit = rule[3]
    if rule_x == state_x and rule_y == state_y and rule_z == state_z:
        doit()
    else:
        print("Skipping")

Or, more succinctly:

def matching_rules(rules, params):
    for criterion, func in rules:
        if all(params[ix] == criteria for ix, criteria in enumerate(criterion)):
            yield func

# The table of rules
rules = (
    # x, y, z, function
    ((True, True, True), do_xyz),
    ((True, True, False), do_xy_not_z),
    ((True, False, True), do_xz),
    (True, True, True, do_xz),
)
params = (state_x, state_y, state_z)
for func in matching_rules(rules, params):
    func()

In summary, a function is run only when the criteria match.

Pattern matching is also an alternative: https://www.python.org/dev/peps/pep-0635/

However, the first-to-match rule requires the order to be correct and prevents multiple function calls.

State machines and Prolog are also options…