Exactly-once delivery with RabbitMQ

Tagged messaging, rabbitmq, exactly-once, amqp  Languages 

Use late acknowledgment and idempotency to achieve fake exactly-once delivery with RabbitMQ.

Late acknowledgment = acknowledge the message after the database transaction has been committed.

Idempotency = don’t process the same message twice, or ensure the effect is the same whether the message is processed once or multiple times.
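The pattern can be sketched without a real broker. In the sketch below the consumer callback and the ack callable are simulated stand-ins for what an AMQP client library would provide; in a real consumer the acknowledgment would be the client's ack call, issued only after the database transaction has committed.

```python
# Sketch of the late-acknowledgment + idempotency pattern, broker simulated.
# In a real consumer, ack would be the AMQP client's acknowledgment call,
# invoked only after the database transaction has committed.

processed_ids = set()  # in practice: a unique constraint in the same database


def handle_delivery(message_id, payload, ack):
    if message_id in processed_ids:
        ack()  # duplicate redelivery: work is already done, just acknowledge
        return "duplicate"
    # ... perform the work and record message_id in ONE transaction ...
    processed_ids.add(message_id)
    ack()  # late acknowledgment: only after the commit succeeded
    return "processed"
```

If the process crashes before the ack, the broker redelivers the message, and the idempotency check makes the retry harmless.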

Could not find 'bundler' (2.2.16) required by your Gemfile.lock

Tagged gemfile, docker, bundler  Languages bash

You need to install the version listed in the Gemfile.lock file to fix this error:

gem install --force "bundler:$(grep -A 1 "BUNDLED WITH" Gemfile.lock | tail -n 1)"

In a Dockerfile you could run it like this:

RUN gem install --force "bundler:$(grep -A 1 "BUNDLED WITH" Gemfile.lock | tail -n 1)" rake && \
  bundle config set without development test && \
  bundle config set --local deployment 'true' && \
  bundle install --jobs 5 --retry 5 && \
  bundle clean --force

Troubleshooting Python's Celery

Tagged python, celery  Languages python

Task celery.chord_unlock[38d5105a-12f2-4119-80e5-184167998f4b] retry: Retry in 1.0s

The documentation (https://docs.celeryproject.org/en/latest/userguide/canvas.html#chords) notes:

If you’re using chords with the Redis result backend and also overriding the Task.after_return() method, you need to make sure to call the super method or else the chord callback won’t be applied.

TypeError: task() argument after ** must be a mapping, not list

missing 3 required positional arguments

If you get this error:

TypeError: after_return() missing 3 required positional arguments: 'args', 'kwargs', and 'einfo'

You might have specified the arguments incorrectly, for example:

job = job_task.subtask(1, 2, 3)

Pass the positional arguments as a single tuple to fix the error:

job = job_task.subtask((1, 2, 3))
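The first TypeError above is plain Python behavior, independent of Celery: `**` only accepts a mapping. A minimal reproduction with a hypothetical task function:

```python
def task(*args, **kwargs):
    return args, kwargs


options = {"retry": True}
task(**options)      # fine: ** expects a mapping (e.g. a dict)

try:
    task(**[1, 2, 3])  # a list is not a mapping
except TypeError as e:
    print(e)           # task() argument after ** must be a mapping, not list
```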

Fastest way of importing data into PostgreSQL

Tagged import, psql, stdin, copy  Languages bash

The fastest way of importing data into PostgreSQL is to avoid any additional processing, i.e., use PostgreSQL tools instead of writing scripts in Python or other languages.

This imports the data directly from a compressed file into PostgreSQL:

gunzip -c data.csv.gz | PGOPTIONS="--client-min-messages=warning" psql --no-psqlrc --set ON_ERROR_STOP=on <db name> --command="COPY table FROM STDIN"

You can also add preprocessing easily, such as removal of data with AWK, by piping commands together into a workflow.

How to unnest an array of arrays in PostgreSQL

Tagged jsonb_array_elements_text, array_agg, unnest, jsonb_agg  Languages sql

If you try to use unnest with array_agg you will get the following error:

SELECT array_agg(unnest(ids)) FROM (
  SELECT
    month, array_agg(id) as ids
  FROM x
  GROUP BY month
) AS sub;
ERROR:  0A000: aggregate function calls cannot contain set-returning function calls
LINE 14:   array_agg(unnest(ids)) AS ids,
                     ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.

You can use jsonb_agg and jsonb_array_elements_text to flatten or unnest an array of arrays in PostgreSQL:

SELECT
  jsonb_array_elements_text(jsonb_agg(array_of_arrays))
FROM
  x;

But multiple rows will be returned.

For other solutions, see:

https://stackoverflow.com/a/8142998

https://wiki.postgresql.org/wiki/Unnest_multidimensional_array

Answering what has happened before and after with WINDOW functions and PostgreSQL

Tagged postgresql, function, before, window, after  Languages sql

This query shows what has happened, and whether a specific event has happened, before and after the current row within the group defined by the window function:

WITH log AS (
  SELECT
    array_agg(event_type) OVER (w ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW) AS what_happened_before,
    array_agg(event_type) OVER (w ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING EXCLUDE CURRENT ROW) AS what_happened_after,
    bool_or(event_type='error') OVER (w ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW EXCLUDE CURRENT ROW) AS has_happened_before,
    bool_or(event_type='error') OVER (w ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING EXCLUDE CURRENT ROW) AS has_happened_after,
    *
  FROM events
  WINDOW w AS (PARTITION BY month ORDER BY month, id DESC)
  ORDER BY month, id DESC
)
SELECT * FROM log;

This query answers the following questions:

  • What has happened in this group before/after the current row?
  • Has a specific event happened in this group before/after the current row?

See documentation for details: https://www.postgresql.org/docs/12/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

Mixing arguments and keywords in Ruby

Tagged keywords, arguments, options, ruby, splat  Languages ruby

Mixing arguments and keywords in Ruby:

def hello(*args, **keywords)
  { args: args, keywords: keywords }
end

Splat to the rescue:

* turns all arguments into an array.

** turns all keyword arguments into a hash.

This allows you to do the following:

hello(:one, :two, three: :four)

=> {:args=>[:one, :two], :keywords=>{:three=>:four}}

Note that in Ruby 3 and later, an explicit trailing hash, as in hello(:one, :two, { three: :four }), is treated as a positional argument and ends up in args instead of keywords.

Readability is improved by using proper names:

def hello(name, **options)
  { name: name, options: options }
end

How to get cron to log to STDOUT under Docker and Kubernetes

Tagged cron, pid, stderr, stdout, dockerfile  Languages bash

Dockerfile

FROM python:3.9-slim-buster
...
CMD ["cron", "-f"]

In cron scripts, redirect the scripts’ output to the file descriptors of PID 1, which is cron (the Dockerfile’s CMD):

# Redirects both stderr and stdout to stdout of PID 1:
run.sh &>> /proc/1/fd/1
# Redirects stderr and stdout to stdout and stderr of PID 1:
run.sh 1>> /proc/1/fd/1 2>> /proc/1/fd/2

Each PID (process) has its own file descriptors:

/proc/{PID}/fd/0 # STDIN
/proc/{PID}/fd/1 # STDOUT
/proc/{PID}/fd/2 # STDERR

Similarity search with Jaccard, Minhash and LSH

Tagged jaccard, probabilistic, randomness, similarity, minhash, lsh  Languages python

To find the similarity between two sets – for example, a document can be seen as a set of words – you can use these algorithms:

  • Jaccard similarity

Similarity is calculated as the size of the intersection of the two sets divided by the size of their union, resulting in a number between zero and one. Zero means the sets are completely different, and one means they are identical.

This algorithm is slow because pairwise comparison is needed. For example, a naive approach that stores the Jaccard similarity of every pair of 1 000 000 documents would need on the order of 1 000 000 × 1 000 000 / 2 (roughly 500 billion) database rows.

Running time: O(n²) comparisons for n documents. Space complexity: O(n²) if all pairwise similarities are stored.
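As a minimal illustration, Jaccard similarity is a one-liner over Python sets:

```python
def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B|: 0.0 means disjoint, 1.0 means identical
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# "Documents" as sets of words:
print(jaccard_similarity("a b c".split(), "b c d".split()))  # 0.5
```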

  • MinHash (with k hash functions)

MinHash leverages hashing and randomness, in a similar way as Bloom filters, to quickly and probabilistically estimate the Jaccard similarity. It does this by generating k signatures for each document.

Pseudo-code example:

a = minhash("some text...")  # signature = [1, 1, 1, 1]
b = minhash("some text...")  # signature = [1, 1, 1, 1]
c = minhash("other text...") # signature = [2, 1, 1, 1]

This algorithm is still slow because pairwise comparison is needed; MinHash only reduces the size of what is compared, since each document shrinks to a fixed-length signature.

Because this is a probabilistic algorithm, we need to account for errors. The expected error is about 1/sqrt(k), so k = 100 hash functions gives roughly 10% error (90% accuracy).

Hash collisions also need to be minimized.
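A minimal MinHash sketch, with the k hash functions built as salted hashes on hashlib for determinism; the fraction of positions where two signatures agree estimates the Jaccard similarity:

```python
import hashlib


def minhash_signature(tokens, k=100):
    # One "hash function" per salt i; the signature keeps the minimum
    # hash value seen over all tokens for each of the k functions.
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{i}:{t}".encode(), digest_size=8).digest(), "big"
            )
            for t in tokens
        )
        for i in range(k)
    ]


def estimated_jaccard(sig_a, sig_b):
    # The probability that two signatures agree at a given position
    # equals the Jaccard similarity of the underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical documents produce identical signatures (estimate 1.0); unrelated documents agree at almost no positions.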

  • Locality-sensitive hashing (LSH)

LSH is a way of optimizing MinHash by binning the many signatures generated with MinHash into buckets.

For example, LSH generates 20 signatures based on 200 hashes by breaking down the MinHash generated signatures into 20 bands – or buckets – each containing 10 MinHash signatures.

There will be false positives and false negatives.

LSH gives us sub-linear query time.
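The banding step can be sketched as follows (the 20 bands × 10 rows split over a 200-value signature mirrors the example above; parameter names are illustrative): documents whose signatures agree on any whole band land in the same bucket and become candidate pairs.

```python
from collections import defaultdict


def lsh_buckets(signatures, bands=20, rows=10):
    # signatures: {doc_id: minhash signature of length bands * rows}.
    # Two documents become candidates if any band of their signatures matches.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    # Only documents sharing a bucket are compared; everything else is skipped.
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs
```

Only candidate pairs are then checked exactly, which is what makes the query time sub-linear in practice.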

  • MinHash LSH Ensemble

Jaccard similarity works best for sets of similar size, because the denominator is the union of the two sets: a small set compared against a much larger one can never score high, even if it is fully contained in the larger set.

LSH ensemble is one solution to this issue. For details, see: http://ekzhu.com/datasketch/lshensemble.html

References:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

http://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/notes/Min-Hash/index.html

https://cran.r-project.org/web/packages/textreuse/vignettes/textreuse-minhash.html

https://rdrr.io/cran/textreuse/man/lsh.html

https://maciejkula.github.io/2015/06/01/simple-minhash-implementation-in-python/

http://ekzhu.com/datasketch/lsh.html

https://medium.com/@bassimfaizal/finding-duplicate-questions-using-datasketch-2ae1f3d8bc5c