Alan Grow's Blog

Fast Filewise Git Blame

2024-05-18T00:00:00

When was each file in a git repository last changed, and who changed it? Here's a one liner shell script that produces a fast filewise git blame report:

#!/bin/sh
TZ=UTC git log --name-status --date=iso-strict-local --pretty="%ad%x09%ae" "$@" |
perl -F'/\t/' -lane '
  if (/^[ACDMRTUXB]/) {
    $path = @F>2 ? $F[2] : $F[1];
    print "$date\t$email\t$path" if -e "$path";
  } elsif (@F) {
    ($date, $email) = @F;
  }
' |
sort -k3,3 -k1,1r |
uniq -f2

The report looks roughly like this:

~/src/coreutils $ git-filewise-blame src | head
2024-01-01T13:22:42    a@b.com    basename.c
2024-03-19T15:55:18    a@b.com    basenc.c
2023-10-27T15:56:39    x@y.com    blake2/b2sum.c
2021-11-01T05:30:38    x@y.com    blake2/b2sum.h
2021-12-18T17:34:31    x@y.com    blake2/blake2b-ref.c
2021-12-18T17:34:31    x@y.com    blake2/blake2.h
2022-09-15T05:30:31    x@y.com    blake2/blake2-impl.h
2016-10-31T13:29:34    a@b.com    blake2/.gitignore
2024-04-06T22:13:23    x@y.com    cat.c
2024-01-01T13:22:42    a@b.com    chcon.c
...

The keyword here is fast. All other approaches I've found execute git log for every file in your checkout. Here's one example of the slow approach:

git ls-files | while read file; do
  git log -n 1 --pretty="Filename: $file, commit: %h, date: %ad" -- "$file"
done

If your repository has many files and deep history, an exec-for-every-file approach will be horrifically slow – on the order of minutes or even hours. By contrast, the git-filewise-blame approach consumes the output of a single git log command. On my laptop, it takes 1 minute to filewise blame the entire webkit git repository, which has 405k files and 275k commits (!).
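For readers who'd rather not squint at Perl, here's a rough Python sketch of the same single-pass idea. It is not the script above: the function name `latest_changes` is made up for illustration, it takes the text output of that same `git log` command as a string, and it skips the "does the file still exist" check. It assumes the dates are ISO strings in a single timezone, so plain string comparison orders them correctly.

```python
def latest_changes(log_text):
    """Map each path to the (date, email) of its most recent change.

    Expects the text produced by:
      git log --name-status --date=iso-strict-local --pretty="%ad%x09%ae"
    """
    latest = {}
    date = email = None
    for line in log_text.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        if line[0] in "ACDMRTUXB":
            # Status line. Renames and copies carry old and new paths; keep the new one.
            path = fields[2] if len(fields) > 2 else fields[1]
            if path not in latest or date > latest[path][0]:
                latest[path] = (date, email)
        else:
            # Commit header line: author date, tab, author email.
            date, email = fields[0], fields[1]
    return latest
```

The point is the same as in the shell version: one pass over one `git log` stream, with a dictionary doing the job of `sort | uniq`.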

Qmail with a Let's Encrypt SSL Cert

2024-05-04T00:00:00

After a long hiatus, I am once again running a mailserver.

Are You Crazy? It's 2024

A lot of us nerds got lazy and switched to hosted gmail about a decade ago. This consolidation has gradually corrupted some of our email ecosystem norms, and I think it's important to push back on that.

But it's also important to have an email that's fully under your control, even if you only use it as the recovery email for your primary gmail account. Watching a friend lose access to his primary gmail recently – with no working recovery email, and no recourse – scared me into action.

Finally, I also wanted to generate a new gpg key now that ed25519 is widely supported. A gpg key with a gmail identity feels silly and unserious. My new key's identity is i@ this domain. That's cool factor! In the apocalypse, you can send me email at this new address, and sleep well knowing that it's both encrypted-in-transit and encrypted-at-rest.

That's three reasons why. Onward with the writeup.

Let's Encrypt

The world of mail evolved over the past decade, and TLS is now a must. Luckily, the world also evolved a free and easy way to get an SSL cert: Let's Encrypt. It's great. This site uses it.

But that's https. We'd like to reuse the Let's Encrypt cert for smtp.

Patching Qmail

Out of the box, qmail doesn't support SSL for inbound or outbound mail. To get both, you can apply the qmail-tls patch from this qmail patch directory. You may also want the force-tls patch.

I've also found the any-to-cname and remove-cname-check patches necessary to avoid DNSSEC-related nonsense and deferred / delayed emails. Like I said, things have evolved, and not always in a good way.

Networking

Here's another revolting development: gmail will not attempt mail delivery to ports 465 or 587. It sure would be nice if it did, because then we could use always-on SSL. That would let us run stock qmail-smtpd under sslserver.

Alas, you need to run an smtp server with STARTTLS support on port 25. So make sure that most-intimidating-of-ports is open. The silver lining is that you'll be able to test your setup with telnet. So will the spammers, but SSL doesn't seem to slow them down these days anyway.

Configuring Qmail for SSL

To do one-time generation of DH conversation keys:

/var/lib/qmail/bin/update_tmprsadh

Our patched qmail expects a combined public cert + chain + private key at /var/lib/qmail/control/servercert.pem. Since Let's Encrypt rotates your cert every 60-90 days, we can't just cat some files into place and forget about it. We need to regenerate qmail-smtpd's cert and bounce the service whenever rotation happens.

Fortunately Let's Encrypt has renewal hooks. Put the following in a +x file at the location on line 2:

#!/bin/sh
# /etc/letsencrypt/renewal-hooks/deploy/deploy-qmail-servercert 
set -e
cd /etc/qmail
touch servercert.pem.tmp
chmod go-rwx servercert.pem.tmp
chown qmaild servercert.pem.tmp
cat "$RENEWED_LINEAGE"/fullchain.pem "$RENEWED_LINEAGE"/privkey.pem > servercert.pem.tmp
mv servercert.pem.tmp servercert.pem
cd qmail-smtpd
svc -h .

Note the careful permissioning of servercert.pem – it contains private key material.
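The pattern in the hook is worth spelling out: lock down the temp file before any key material lands in it, then rename it into place. Here's a minimal Python sketch of that pattern. `write_private_file` is a hypothetical helper, not part of qmail or certbot, and it leaves out the `chown qmaild` step.

```python
import os

def write_private_file(path, data):
    """Write bytes to path without ever exposing them to group or other."""
    tmp = path + ".tmp"
    # Create (or reuse) the temp file with owner-only permissions.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        # Enforce the mode even if a stale temp file already existed.
        os.fchmod(f.fileno(), 0o600)
        f.write(data)
    # Rename into place, like the mv in the hook; atomic on POSIX filesystems.
    os.replace(tmp, path)
```

Because the rename is atomic, a reader sees either the old servercert.pem or the new one, never a half-written file.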

To generate the cert the first time, and test that the hook will work for future cert rotations, force a renewal as root:

certbot renew --force-renewal

Be careful during testing that you don't force too many renewals. Let's Encrypt will throttle you after a few.

If you have multiple domains under Let's Encrypt, the hook will run multiple times. I'm not sure how to avoid that. Maybe it could skip out unless the primary domain is renewing.
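One untested way to do that skip, sketched in Python: compare the last path component of `RENEWED_LINEAGE` against the domain you care about. This assumes the lineage directory is named after the cert's primary domain, which is certbot's default but not guaranteed if you chose a custom cert name. `PRIMARY_DOMAIN` and `should_deploy` are hypothetical names.

```python
import os

PRIMARY_DOMAIN = "example.com"  # hypothetical: the domain your mailserver answers for

def should_deploy(renewed_lineage, primary_domain=PRIMARY_DOMAIN):
    """Return True only when the renewed lineage belongs to the primary domain."""
    # RENEWED_LINEAGE typically looks like /etc/letsencrypt/live/example.com
    lineage_name = os.path.basename(renewed_lineage.rstrip("/"))
    return lineage_name == primary_domain
```

In the shell hook itself, the equivalent would be a basename comparison on "$RENEWED_LINEAGE" that exits early when it doesn't match.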

Testing It

First try telnet example.com 25. After the smtp server's 220, respond with EHLO. If TLS is available you should see 250-STARTTLS. Example test session:

$ telnet example.com 25
Trying 1.2.3.4...
Connected to example.com.
Escape character is '^]'.
220 example.com ESMTP
EHLO
250-example.com
250-STARTTLS
250-PIPELINING
250 8BITMIME

If you don't see 250-STARTTLS, try these debug tips.
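If you'd rather script this check than eyeball it, here's a small Python sketch. `advertises_starttls` is a hypothetical helper that just parses the text of an EHLO reply like the one in the session above. For a live check, the standard library's smtplib can do the talking, as shown in the trailing comment.

```python
def advertises_starttls(ehlo_reply):
    """Return True if a raw EHLO reply lists the STARTTLS extension."""
    for line in ehlo_reply.splitlines():
        # Reply lines look like "250-STARTTLS", or "250 8BITMIME" for the last one.
        if line[:3] != "250":
            continue
        keyword = line[4:].strip().split(" ")[0].upper()
        if keyword == "STARTTLS":
            return True
    return False

# Live variant using the standard library (needs network access):
#   import smtplib
#   with smtplib.SMTP("example.com", 25) as smtp:
#       smtp.ehlo()
#       print(smtp.has_extn("starttls"))
```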

To test that the server's cert chain is trusted and you can send mail over SSL, use this:

openssl s_client -starttls smtp -ign_eof -crlf -connect example.com:25

As an smtp protocol refresher, you can type in mail to deliver by hand, like so:

MAIL FROM:<you@home.example.com>
250 ok
RCPT TO:<you@example.com>
250 ok
DATA
354 go ahead
Subject: test subject

test body
.
250 ok 1714841964 qp 843479
QUIT
221 example.com

Once this works, you can try sending mails to your server from gmail. If you're a gsuite admin, their Email Log Search will tell you about delivery errors, and whether successful deliveries were actually encrypted in transit.

Outbound mail from your server should also be encrypted thanks to the qmail-tls patch. You'll see a big fat warning in gmail if not. For better deliverability, you'll also want to configure SPF, DKIM, and DMARC, which is way beyond the scope of this humble post.

What Letter-Pair Tileset Forms the Most Words?

2024-03-23T00:00:00

While building a word game, Daniel Feldman ran into a problem that nerdsniped me instantly: what choice of twenty letter-pair tiles generates the most words?

A number of approaches were proposed in the ensuing thread, with some folks even wondering if the problem might be NP-complete. In this post I'll present a greedy algorithm that's linear in the dictionary size and quadratic in the squared alphabet size. I believe this finds an optimal solution, but haven't proven that formally.

Let's first define the problem.

The Problem Definition

There are 26 alphabet letters, and each tile has two letters on it, so that works out to a total of 26 * 26 = 676 possible tiles. We only get to choose a meager 20 of these 676 to form our tileset. Like in Scrabble, you can then rearrange subsets of the tileset to form dictionary words. The problem: find the tileset of size 20 that lets you form the most dictionary words.

A Much Smaller Example Tileset

Here's all possible tiles, with a specific size-3 tileset ed,pi,ti highlighted:

The ed,pi,ti tileset generates these 6 words:

pi
pied
pitied
ti
tied
tipi
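To make "formable" concrete, here's a minimal Python sketch of the rule as I've stated it: split the word into pairs, reject odd lengths and repeated pairs, and require every pair to be in the tileset. The names `split_pairs` and `can_form` are just for illustration, and the candidate list is hand-picked rather than read from the dictionary.

```python
def split_pairs(word):
    """Split an even-length word into consecutive two-letter pairs."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def can_form(word, tileset):
    """True if word can be spelled using distinct tiles from tileset."""
    if len(word) % 2:
        return False  # odd-length words can't be built from letter pairs
    pairs = split_pairs(word)
    if len(set(pairs)) != len(pairs):
        return False  # each tile exists only once in the tileset
    return set(pairs) <= set(tileset)

tileset = {"ed", "pi", "ti"}
candidates = ["pi", "pied", "pitied", "ti", "tied", "tipi", "pipi", "tide", "pit"]
formable = [w for w in candidates if can_form(w, tileset)]
print(formable)
```

The last three candidates get rejected for different reasons: "pipi" repeats a tile, "tide" needs a "de" tile we don't have, and "pit" has odd length.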

Note that throughout this post I'll be using american-english under /usr/share/dict/ as the dictionary.

Initial Approach: Maximum Letter Pair Frequency

Before we get into the final greedy approach, let's try something more straightforward.

When Alfred Butts designed the Scrabble tileset, he looked at the front page of the New York Times and hand-tabulated letter frequencies. He then added more copies of frequently occurring letters.

While our problem is different in a couple ways – unlike Scrabble, duplicate tiles aren't allowed, and also unlike Scrabble, we can only hope to generate a small fraction of all dictionary words — this approach feels intuitively promising.

We'll iterate through the dictionary, split each word into letter-pairs, and count pair occurrences. But note:

Words of odd length get thrown out, because they can't be formed by a sequence of pairs.
Words that use the same letter pair twice also get thrown out, since, per the problem definition, our tileset doesn't contain repeats.

Among the words that remain, we'll pick the most frequently occurring pairs as our tileset.

Here's the Python code:

from collections import defaultdict
import sys
import re

args = sys.argv[1:]
NTILES = int(args.pop(0))
RE_VALID_WORD = re.compile(r'^([a-z][a-z]){1,%s}$' % NTILES)
RE_REPEATED_PAIR = re.compile(r'''
  ^(..)*
  (?P<letter1>.)(?P<letter2>.)
  (?P=letter1)(?P=letter2)
''', re.X)

freqs = defaultdict(lambda: 0)
words = []

for word in open(args.pop(0)) if args else sys.stdin:
  word = word.rstrip('\n')

  # discard words with odd length, capitals, apostrophes.
  if not RE_VALID_WORD.match(word):
    continue

  # split the word into letter pairs.
  pairs = re.findall(r'.{2}', word)

  # discard words with repeated pairs.
  if RE_REPEATED_PAIR.match(''.join(sorted(pairs))):
    continue

  # add to valid word list. update letter pair statistics.
  words.append(word)
  for pair in pairs:
    freqs[pair] += 1

# our tileset is the top N most frequently occurring pairs.
metric = lambda pair: freqs[pair]
tileset = set(sorted(freqs.keys(), key=metric)[-NTILES:])

# report the tileset.
print(','.join(sorted(tileset)))

# report all formable dictionary words.
for word in words:
  if not set(re.findall(r'.{2}', word)).difference(tileset):
    print(word)

You'll notice the program takes the tileset size, NTILES, as a command line argument. We can use this to run a sanity check on much smaller tilesets. When we do, we immediately see a problem:

./lettergen2 3 /usr/share/dict/american-english
ed,er,es
es

It's no surprise that ed,er,es occur most frequently, since they're common English word endings. However, word endings by themselves don't play nice together. They rely on word beginnings and middles to form complete words. And we already know from the small example above that ed,pi,ti generates 6 words. Generating 1 word, es, is suboptimal.

This maximum letter pair frequency approach was a greedy algorithm, and its failure makes one despair of finding any optimal greedy (read: simple) algorithm. To construct the optimal tileset from scratch, perhaps you need to perform search on a graph of successively longer words, so you're passing from word beginnings to word middles to word endings? I pursued this approach myself before giving up. Ruminate long enough, and your thoughts may even turn to dark subjects like the set cover problem, which is NP-Complete.

That level of despair didn't sit right with me though. Our problem isn't the same as the set cover problem, which seeks a set-of-sets that union together to form a bigger set. Let's accept a set-theoretic framing for a moment to see why.

Thinking in Terms of Sets

Suppose we're working with 676 sets, one for each letter pair. Each set contains all the words a particular letter pair occurs in. I'll denote these word sets W_aa, W_ab, … W_zz.

The tileset we seek is a subset of these 676 word sets. But it isn't a set-of-sets we get to union together like in the set cover problem. Consider: just because our tileset includes W_ed doesn't mean we can form the word need – our tileset must also contain W_ne for that. The "and" logic here feels more like set intersection – W_ne ∩ W_ed – than set union.

But set intersection isn't the right operation either. For example, just because our tileset contains W_ne and W_ed, and both contain the word needle, doesn't mean we can actually form the word needle. We'd also need W_le for that.

So it looks like constructing the list of formable words isn't a basic set operation over the tileset.

However, hidden in the negative space here, there is a basic set operation at work. We've seen that if our tileset doesn't include W_ne and doesn't contain W_ed, we have no hope of forming need, needle, nerd, edit or any of the words in either word set. When we omit W_ne and W_ed from the tileset, we lose all the words in the union W_ne ∪ W_ed.

Another name for "the set of all word sets omitted from our tileset" is the tileset's complement. The unformable words are the union of those omitted word sets. So what we're really seeking is a tileset whose complement has minimal union size. Now we're dealing with basic set operations!
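Here's that complement idea on a toy dictionary of four words, as a small Python sketch. The dictionary and the variable names are made up for illustration.

```python
from collections import defaultdict

toy_words = ["need", "needle", "nerd", "edit"]

# Build the word sets: W[pair] holds every word that uses that pair.
W = defaultdict(set)
for word in toy_words:
    for i in range(0, len(word), 2):
        W[word[i:i + 2]].add(word)

tileset = {"ne", "ed"}
omitted = set(W) - tileset  # the tileset's complement: le, rd, it
unformable = set().union(*(W[pair] for pair in omitted))
formable = set(toy_words) - unformable

print(sorted(unformable))  # words lost because some needed pair was omitted
print(sorted(formable))    # what the tileset can still spell
```

Only "need" survives: every other word touches at least one omitted pair, so it falls into the union of the omitted word sets.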

At this point it occurred to me to try a new greedy approach, but working backwards this time. Instead of constructing the tileset from scratch by greedily picking the next best tile to add, what if we started with the full size 676 tileset, and greedily picked the least-bad tile to remove, until only 20 tiles remained?

Final Approach: Subtractive Minimum Damage

Without further ado, here's Python code for this new approach:

from collections import defaultdict
import sys
import re

args = sys.argv[1:]
NTILES = int(args.pop(0))
RE_VALID_WORD = re.compile(r'^([a-z][a-z]){1,%s}$' % NTILES)
RE_REPEATED_PAIR = re.compile(r'''
  ^(..)*
  (?P<letter1>.)(?P<letter2>.)
  (?P=letter1)(?P=letter2)
''', re.X)

words = defaultdict(set)

for word in open(args.pop(0)) if args else sys.stdin:
  word = word.rstrip('\n')

  # discard words with odd length, capitals, apostrophes.
  if not RE_VALID_WORD.match(word):
    continue

  # split the word into letter pairs.
  pairs = re.findall(r'.{2}', word)

  # discard words with repeated pairs.
  if RE_REPEATED_PAIR.match(''.join(sorted(pairs))):
    continue

  # add the word to the wordsets for its letter pairs.
  for pair in pairs:
    words[pair].add(word)

# work backwards from 676 to NTILES.
# remove the least-damaging letter pair at each step.
damage = lambda pair: len(words[pair])
while len(words) > NTILES:
  least_damaging_pair = min(words.keys(), key=damage)
  lost_words = words.pop(least_damaging_pair)
  for wordset in words.values():
    wordset.difference_update(lost_words)

# report the tileset.
print(','.join(sorted(words.keys())))

# report all formable dictionary words.
for word in sorted(set().union(*words.values())):
  print(word)

As you can see, the setup is substantially the same. This time, instead of counting appearances of each letter pair, we're populating the W_aa … W_zz word sets for the 676 letter pairs. If a word uses a letter pair, it appears in that pair's word set.

This gives us a direct and O(1) way of measuring the damage incurred by removing a letter pair: it is simply the size of its word set.

After the dictionary has been scanned, all letter pairs are in play, and we can start greedy removal. At each step we pick the letter pair that inflicts the least amount of damage in terms of formable words. Note that any word longer than 2 letters will appear in multiple word sets, and we have to remember to remove these extra copies of every "lost" word. Python's set data type is doing a fair bit of work here.

Results

You can use acg/lettergen to compare the results of the two approaches, which I'm calling maxfreq and subtractive. Simply type make and wait a few seconds:

make
running maxfreq/results1 ...
running maxfreq/results2 ...
running subtractive/results1 ...
running subtractive/results2 ...

maxfreq/results1: 12392 words
a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,y

maxfreq/results2: 172 words
al,at,co,de,ed,en,er,es,in,le,li,ly,ng,on,re,ri,rs,st,te,ti

subtractive/results1: 12392 words
a,b,c,d,e,f,g,h,i,k,l,m,n,o,p,r,s,t,u,y

subtractive/results2: 292 words
ar,ca,co,de,di,ed,er,es,in,li,ng,nt,ra,re,ri,si,st,te,ti,ve

For completeness, I wrote Python scripts that handle the 1-letter tile case (lettergen1). There are only 26 tiles to pick the 20 from, and you can see that both approaches arrive at the same result of 12,392 formable words.

The 2-letter tile case (lettergen2) is another story. Maximum Letter Pair Frequency comes up with a tileset that generates 172 words, but Subtractive Minimum Damage does substantially better by finding a 292-word-generating tileset – a 1.7x improvement. You'll find the full lists of formable words at */results2.

So according to Subtractive Minimum Damage, the optimal tileset is the following:

ar,ca,co,de,di,ed,er,es,in,li,ng,nt,ra,re,ri,si,st,te,ti,ve

It's also interesting to experiment with different tileset sizes. For instance, try make -B NTILES=100. You'll notice as NTILES gets larger, the maxfreq approach converges on the subtractive approach. This makes sense: they should agree for NTILES=676 because there are no letter pair decisions to make. And in fact they should agree even earlier than that, since English doesn't use all possible letter pairs.

Open Questions & Further Research

Yes, but is Subtractive Minimum Damage optimal? The answer is I don't know! I vaguely remember proving greedy optimality once in undergrad computer science, but that was two decades ago. Pointers welcome.

What if there's a tie for least damaging letter pair? If there's no path dependence here, you should be able to pick either one at random, and the greedy subtractive approach should still arrive at an optimal solution. To explore this idea, I decided to pick from the top 2 least damaging tiles at random. Then I ran the script thousands of times. To my surprise, it did find a couple tilesets that produced 293 and 294 words – slightly better than the thought-to-be-optimal tileset! A revolting development. But this gap (1-2 words) is suspiciously small, and I'm just gonna go to press with what I've got.
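For anyone who wants to repeat the experiment, here's a sketch of what a randomized variant could look like. This is a reconstruction for illustration, not the exact code I ran: the function name `subtractive` and the parameters `k` and `rng` are assumptions, and candidates are ranked by (damage, pair name) so that results are reproducible for a given seed.

```python
import random

def subtractive(wordsets, ntiles, k=1, rng=random):
    """Greedily remove letter pairs until only ntiles remain.

    At each step, choose at random among the k least-damaging pairs.
    With k=1 this is the plain greedy choice, ties broken alphabetically.
    """
    # Copy the word sets so the caller's data isn't mutated.
    words = {pair: set(ws) for pair, ws in wordsets.items()}
    while len(words) > ntiles:
        ranked = sorted(words, key=lambda pair: (len(words[pair]), pair))
        victim = rng.choice(ranked[:k])
        lost = words.pop(victim)
        # Words that used the removed pair are no longer formable anywhere.
        for ws in words.values():
            ws.difference_update(lost)
    return words
```

Call it in a loop with `k=2` and a fresh `random.Random(seed)` each time, then keep whichever run leaves the most formable words.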

What happens if the tileset can have repeats? I haven't thought about this too deeply, but it seems like it would spell trouble for a greedy approach, which can no longer make stepwise progress towards optimal subproblems.

What about words of odd length? Yeah it's awkward we have to exclude those, and it makes even less sense when you look at what motivated this problem (Daniel's Ambigame). One approach would be to pad all odd-lettered dictionary words with a trailing period, and then add a., b., c., and so on to the possible letter tiles. A similar trick with leading padding might let you split words after the 1st, 3rd, 5th etc character instead of always splitting at even indexes.
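The trailing-padding trick is easy to sketch. This is only an illustration of the idea, not something wired into lettergen; `split_padded` and `N_POSSIBLE_TILES` are hypothetical names.

```python
def split_padded(word, pad="."):
    """Split a word into letter pairs, padding odd-length words with a trailing pad."""
    if len(word) % 2:
        word += pad
    return [word[i:i + 2] for i in range(0, len(word), 2)]

# The pool of possible tiles grows from the 26 * 26 letter pairs
# to also include the 26 padded tiles "a.", "b.", ..., "z.".
N_POSSIBLE_TILES = 26 * 26 + 26

print(split_padded("cat"))   # odd length, so the last tile is padded
print(split_padded("need"))  # even length, unchanged
```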

Is this a known problem? I mean surely Knuth solved this like 50 years ago? I found many related problems, but not this specific one. I worry this means it's considered too easy / too obvious, and I should feel embarrassed for writing a whole blog post about it. Anyway, please reach out if you know.

Is Scrabble's tileset optimal? I dabble in Scrabble myself, and I'd always heard it wasn't. In researching this, I learned that Peter Norvig has calculated more accurate English letter frequencies than Alfred Butts's. Norvig has a couple proposals for a better Scrabble tileset at the link. TL;DR no.

The Formable Word List

I buried the lede to avoid a wall of text. Here's the complete list of 292 formable words found by Subtractive Minimum Damage:

arcade, ardent, ares, arranged, arranger, arranges, arrant, arrest, arrested, arrive, arteries, artier, artist, artistes, calico, calicoes, cant, canted, canter, cantered, care, career, careered, caries, caring, casing, cast, caster, castes, castling, castrate, castrating, catering, cave, code, coding, coed, coin, coined, congestive, contesting, contraries, contrast, contrasted, contrite, contrive, core, coring, cosier, cosies, cost, costar, costarring, costed, costlier, cote, coteries, cove, covering, coveting, dear, dearer, decant, decanted, decanter, decoding, decorate, decorating, decorative, dedicate, dedicating, deed, deer, deli, delicate, deliveries, delivering, dent, dented, dentin, deranged, deranges, deriding, derisive, derive, desire, desiring, desist, desisted, destined, destines, detentes, detest, detested, died, dies, ding, dinged, dint, dire, direst, disinter, disinterring, dive, divest, divested, eddies, errant, erring, es, in, indeed, indelicate, indent, indented, indicate, indicating, indicative, inside, insist, insisted, intent, interest, interested, invent, invented, invest, invested, liar, lied, lies, linger, lingered, lint, lira, lire, list, listed, lite, literati, live, liveried, liveries, livest, rain, rained, rang, ranged, ranger, ranges, rant, ranted, ranter, rare, rarest, raring, rarities, raster, rate, rating, rave, raveling, re, rear, reared, rearranged, rearranges, recant, recanted, recast, recoveries, recovering, redecorate, redecorating, rededicate, rededicating, reed, rein, reindeer, reined, reinvent, reinvented, reinvest, reinvested, relied, relies, relive, rent, rented, renter, reside, resident, residing, resist, resisted, resister, rest, restarting, rested, restrain, restrained, retiring, reveling, revenged, revenges, reveries, revering, ride, riding, riling, ring, ringed, ringer, ringside, rising, rite, riveting, side, siding, sierra, silica, silicate, sing, singed, singer, singes, sire, siring, sister, site, siting, star, 
stared, stares, starling, starrier, starring, starting, starve, sterling, stinting, strain, strained, strainer, stranger, stride, strident, striding, string, stringed, stringer, strive, tear, teared, teed, tees, tent, tented, test, tested, tester, testes, ti, tide, tidied, tidier, tidies, tiding, tied, tier, ties, tiling, ting, tinged, tinges, tint, tinted, tirade, tire, tiredest, tiring, veer, veered, vein, veined, vent, vented, verier, verities, vest, vested, vestries

Music To Program To

2021-02-27T20:36:24

Programming is deep work. Tuning out distractions is key, and music is one of the most effective tools at your disposal.

But not all music helps you program. Music with lyrics can interfere with your ability to read and write code. Music with too many surprises can add rather than remove distraction. After some experimentation, many programmers arrive at the same conclusion: repetitive electronic music helps them program.

After a couple decades of programming, including a decade of remote work with the talented musician-programmers at blend.io and ROLI, here's some of the music I turn to when I need to Get Shit Done.

All music has a Spotify embed and a quick review. Know the mood you're after? Start with this index of mental states. YMMV. Enjoy! ✌️

By Desired Mental State

Focus, Intensity, Urgency 🎯

Calmness, Contemplation, Perfection 🧘

Creative, Energetic, Mischievous 👿

Wistful, Reflection 🍂

Playlists

If you're not sure where to start, pick one of these 2+ hour playlists and dig in. These artists have deeper catalogs you can branch out into.

Deep Dark Minimal

Repetitive, trance-inducing electronic music for intense focus and deep work. No vocals or lame chord progressions. Mostly German.

Modern Acid

The tasty sounds of the 303 / 808 / 909 used in new ways. All tracks post 2000. Higher energy, faster tempos, and busier arrangements. 🧠

Albums

Some full-length albums that won't disappoint. Each is good for about an hour of listening.

Microlith: Dance With Me (2016)

An album of sublime electro from Maltese producer Rhys Celeste. Everything Rhys made until his tragic death at age 24 is worth a listen. See also the Float House microgenre.

Beatwife: Cornbrail Acid 2 (2014)

A Scottish acid madman with an artist name you can't mention in polite company. Fast, frenetic music with a quirky sense of humor. See also the Braindance microgenre.

Tin Man: Dripping Acid (2017)

How much acid is too much acid? This monster neo acid album may provide the answer. Haunting, hypnotic tunes with slow builds.

Mikron: Severance (2019)

Peaceful, aquatic, ambient techno landscapes from an Irish duo. Track 4, "Ghost Node", highsteps out of a thick fog.

Anthony Naples: Fog FM (2019)

Another case of driving beats shrouded in fog, this time from an NYC-based producer.

Boards of Canada: Music Has The Right To Children (1998)

By law I am required to include this album, and I'm happy to comply. A landmark in electronic music from the Scottish duo. This album hasn't aged a day.

Maurizio: M-Series (1997)

Minimal dub techno from the master of the genre, Basic Channel co-founder Moritz von Oswald. If you're new to dub techno you may be forgiven for thinking "nothing ever happens." That's kind of the point, but it's also not quite true: there's lots of subtle variation if you start looking for it, yet never enough to distract if you aren't.

Substance: Session Elements (1998)

Lush but restrained minimal German techno variations.

RX-101: Dopamine (2019)

Bask in the warm analog glow cast by these 13 tracks from Dutch producer Erik Jong.

John Tejada: Parabolas (2011)

Dark, intelligent tech house from an Austrian-Californian producer. Tejada teaches at CalArts and consistently puts out great tech house albums, among them Signs Under Test (2015), Live Rytm Trax (2018), and Year Of The Living Dead (2021).

Superski: Mondo Moderno (2023)

I'll be honest: trance is not my jam. Yet somehow these trancey, cinematic, Italo-disco-influenced techno tracks from Litrowski & Voiski won me over. Look, you can wear your sunglasses at night. They can be fine Italian sunglasses. You can even be the protagonist in a Fellini film. But if you raise them and wink like Ferris Bueller, we're not going to take you entirely seriously -- and I think that's the idea.

CN: The Expedition Beyond (2011)

The year is 3984, and this is the soundtrack to our mysterious space explorations. CN is one of several projects from the outrageously talented and prolific Norwegian producer Stian Gjevik. There's a second album that picks up where this one left off.

Martin Schulte: Slow Beauty (2012)

Ambient that gets its inspiration from nature. While most music is busy painting portraits, these tracks are content to paint landscapes. If you like this stuff, Schulte has a whole series of albums exploring different seasons and places.

Four Tet: New Energy (2017)

Natural inspiration in this one too, which comes to you from a cabin in upstate New York.

Steve Reich: Music for 18 Musicians (1976)

A minimalist classical masterpiece from 1976 that anticipated electronic music as we know it: layering, envelopes, precise rhythms, repetitiveness, gradual rather than sudden harmonic changes...it's all in there. I find it incredible that 18 skilled humans can approximate dense electronic music like this. "18 Musicians" is structurally interesting too, as the interior sections are organized around a cycle of eleven chords articulated in the opening and closing "Pulses" movements.

Terry Riley: In C (1964)

The granddaddy of all minimalist classical masterpieces. For about an hour we never leave the key of C. Unlike Spinal Tap, Riley pulls it off. A fascinating and elevating listen.

EPs

These are half-albums that nonetheless stand out as excellent music to work to. They vary in length, but are ½ an hour on average.

EOD: Utrecht (2010)

Lush synth landscapes collide with hard-edge acid techno, leaving you stranded in the best of both worlds. EOD is Norwegian producer Stian Gjevik's main shingle. His melodic gift and arranging skills are on full display here.

EOD: Questionmarks (2012)

On this EP Gjevik strips away the lushness and lets the hard-edge techno rip. Sweet, intricately arranged melodies take a back seat to an urgency and raw speed that's borderline frightening. Fear not: Gjevik is a professional driver on a closed course.

Automatic Tasty: Fieldwork EP (2012)

Morning, afternoon, evening, night: you must admit this is a nice four-part cyclic structure for an EP. Although the instrumentation uses the innocent but dated sounds of early techno, Dillon also weaves in real field recordings from different times of day. The result is charming and feel-good.

The Field: Sound of Light - Nordic Light Hotel (2007)

Another four-part day cycle EP from Sweden. True to form, these tracks are driving, repetitive, and awash in sound -- the kind of thing that makes you hitch up the sled dogs and log a couple hundred miles of frozen tundra.

Seb Wildblood: The One with the Emoticon (2017)

Before emoji conquered the world, we typed things like :~^, which is the actual name of this album, and possibly a self-portrait? Lush, organic deep house from the UK.

DMX Krew: Broken SD140 Part II (2013)

What is an SD140, and are we sure it's safe to use a broken one? Harsh electro rhythm sounds topped with sweet melodies. "Apple Grid" is a standout track.

Khotin: Baikal Acid (2016)

Dancy, imaginative acid house from up north. Khotin saves the best for last: side B has not one, but two lovely, warm tunes.

Jonas Kopp: Desire EP (2013)

German minimal techno by way of Argentina. It's dark, but the opener is funkier than your typical Tresor track, and the closer feels like some kind of ceremonial ascension.

Etapp Kyle: Klockworks 10 (2015)

Dark, driving, haunted minimal techno of the German variety. All of the albums on Ben Klock's Klockworks series are worth a listen, but Klockworks 10 and 16 from this Ukrainian producer are standouts.

Luke Hess: Facette (2017)

Modern Detroit minimal techno. A propulsion system made from deep, dark textures and thumping beats.

Artist Samplers

Some artists don't fit well into the album box. And some artists make albums of such breadth that they no longer fit into the "music to work to" box. Here's a few sampler playlists from artists not featured above, but no less deserving.

Basic Channel - Sampler

As Basic Channel, the duo of Moritz von Oswald and Mark Ernestus pioneered minimal dub techno in the early 90s. Except for BCD and BCD-2, their output consists of a series of cryptically labeled singles. Here's a curated selection.

Trickfinger - Sampler

Did you know John Frusciante -- yes, that John Frusciante -- has a side gig making acid techno? Insane. There are a couple tracks here where it's hard to believe he didn't pick the melody out first on a guitar.

Ceephax - Sampler

No list would be complete without Andy Jenkinson, AKA Ceephax. Personally I like his stuff more than his brother's. It's funny, nostalgic, slightly unhinged, and brimming with bona fide musical genius.

Slippery Device Names and Portable AMIs

2020-12-10T21:12:32

Pain, thy name is hotpluggable device name assignment.

In the course of migrating some EC2 servers from C3 to C5, I learned why this feature in newer linux kernels is controversial.

To be clear, most people couldn't care less whether their primary network interface is called eth0 or enx0150b6e42dfe, or whether a drive appears as /dev/xvda or /dev/nvme9n5, as long as they can continue to do their Computer Stuff. For ops folks trying to make a portable system image, though, this can be a real problem.

My goal was to create an AMI that can be booted on a variety of EC2 instance types. Here's how I got there.

Network Interfaces

Hotpluggable network interface names make sense for multi-homed systems, and systems that might change network configuration later.

They also make sense for consumer devices that don't need to be portably imaged. When was the last time you pulled the hard drive out of your laptop, put it in a different brand of laptop, and had everything just work? Would you even expect this to work? No, this is crazy talk.

However, those of us who operate and upgrade servers have different (higher?) expectations.

The essential problem is described in gory detail on debian's NetworkInterfaceNames wiki page. "We're not in the 90s anymore, network devices come and go, deal with it. That said, here are a dozen different ways to avoid this new nonsense..."

Adding net.ifnames=0 boot parameters to /etc/default/grub worked for me:

GRUB_CMDLINE_LINUX_DEFAULT="... net.ifnames=0"
GRUB_CMDLINE_LINUX="net.ifnames=0"

Follow this up with a perfunctory update-grub.

Importantly for portability, this means any config files under /etc/ can refer to eth0 directly, and that will continue to work even if you make an AMI and boot it on another instance type — so long as it has just one network interface.

NVMe Disks

Next we have the disk problem. C5 instances use NVMe throughout, even for EBS storage. The AWS docs warn us that:

The block device driver can assign NVMe device names in a different order than you specified for the volumes in the block device mapping.

And oh boy, do they ever like to assign in different order.

If you only have one disk, you might never have this problem. I use multiple disks because:

Different subsystems have different access patterns and performance requirements. Using separate disks for, say, /var/lib/postgresql/ and /var/log/ lets you provision and tune them separately.
The disk boundary is a convenient blast radius. Have you ever had a server fill up with log files and grind to a halt? A separate /var/log/ disk will contain the problem and let other subsystems continue to run normally.

So I've got 5 different NVMe disks attaching to an EC2 instance in random order. Once in a while it works, but usually home directories have become the database, logfiles are now volatile runtime state, and so on. A real Mister Potato Head mess.

Russell Ballestrini ran into this same issue and found a script, ebsnvme-id, that ships with Amazon Linux. This script interrogates an EBS NVMe device (eg /dev/nvme1n1) and outputs the original name specified in the block mapping (eg /dev/xvdb).

But we're not quite there yet. Armed with ebsnvme-id, you can create symlinks like /dev/xvdb -> /dev/nvme1n1, but how and when should you do this?

The /dev directory gets populated anew via udev during boot. So there's a right time to do this, and there are many wrong times to do this. My first attempt via /etc/rc.local failed horribly — it ran too late.

Eventually I came around to the idea of using udev, and I learned from this nice udev primer that udev rules can be flexible in the extreme. You can pattern match on device names. You can also run an external program that figures out how to rename a device. This culminated in the following magical one liner:

KERNEL=="nvme[0-9]*n1", PROGRAM="ebsnvme-namer %k", SYMLINK+="%c"

Save this to 70-persistent-ebsnvme.rules under /etc/udev/rules.d/. You'll notice it doesn't hardcode any device names, so it's safe to include in a portable machine image. It creates /dev symlinks that look like:

lrwxrwxrwx 1 root root 7 Dec  9 14:34 xvda1 -> nvme0n1
lrwxrwxrwx 1 root root 7 Dec  9 14:34 xvdd -> nvme1n1
lrwxrwxrwx 1 root root 7 Dec  9 14:34 xvde -> nvme3n1
lrwxrwxrwx 1 root root 7 Dec  9 14:34 xvdf -> nvme4n1
lrwxrwxrwx 1 root root 7 Dec  9 14:34 xvdg -> nvme2n1

These match the old disk device names on my C3 instances. All my scripts and config files that reference specific xvd* names? In the end, not a single one needed changing for the C3 -> C5 upgrade!

Finally, here's the ebsnvme-namer script:

#!/bin/sh -e
ebsdev=`ebsnvme-id --block-dev /dev/"$1"`
echo xvd${ebsdev##sd}
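The prefix handling is the only subtle part of that script. Here's a rough sketch of the same mapping in Python. The function name is made up, and I'm assuming the name recorded in the block device mapping might come back as sdb, as xvdb, or with a leading /dev/, depending on how the mapping was specified.

```python
def legacy_name(block_dev):
    """Map the name recorded in the block device mapping to an xvd* name."""
    name = block_dev.strip()
    # Assumption: some mappings are recorded with a /dev/ prefix.
    if name.startswith("/dev/"):
        name = name[len("/dev/"):]
    # Names already in xvd* form pass through untouched.
    if name.startswith("xvd"):
        return name
    # sdb -> xvdb, sda1 -> xvda1: the same effect as ${ebsdev##sd} in the script.
    if name.startswith("sd"):
        return "xvd" + name[len("sd"):]
    raise ValueError(f"unrecognized block device name: {block_dev!r}")
```

The shell one-liner only strips a leading sd, so a mapping recorded as xvdb would come out as xvdxvdb. If your block device mappings mix the two styles, the extra branch is worth having.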

Matrix Chat in the Terminal with weechat-matrix

2020-08-30T19:27:00

I've been using the Element iOS App to chat with a few security-conscious friends. It works fine, but at some point you outgrow chatting with your thumbs, and long for the full ten-fingered chat experience. (A 5x improvement!!)

Fortunately the Matrix protocol is an open standard with plenty of clients. I like terminal programs, and this Slack plugin for weechat has been a pleasant surprise of late. It turns out there's a Matrix plugin for weechat too. A few months back I tried and failed to set up weechat-matrix, but today things went a little better.

So here's what worked for me. But first...

Are You Sure You Want to Do This

The Matrix ecosystem is new, peopled with technical users, and hardcore about security. This doesn't make for a pleasant experience for most human beings. If you don't like messing around with tech for its own sake, and you just want 1:1 secure chats, might I recommend Signal?

Still here? Onward...

Forget Python2

Although weechat still supports Python2, and weechat-matrix claims to support it, its matrix-nio dependency doesn't. Don't waste your time. Start with Python3.

Isolate the Install

To avoid hosing my existing weechat setup or my base system, I started from a clean Debian 10 debootstrap. If you're hipper than me maybe you prefer docker. Either way, it pays to isolate!

Use the Weechat Development Packages

Once your environment is up and running, you'll want to grab the weechat-devel packages for maximum Python3 compat.

Follow the README

Follow the weechat-matrix README. If all goes well you'll be connected, logged in, and automatically joined to your channels. But there's a problem: your new "device" isn't verified.

Verify Devices

Immediately after weechat-matrix successfully connected, Element popped up a modal that it was verifying the new device. This was the interactive verification flow, and it didn't work for me. Close out of it.

What worked instead was going to Element -> Settings -> Security, finding my new "Weechat Matrix" session, and manually verifying it. To make sure things match on the other end, switch to any Matrix channel in weechat, type /olm info, then switch back to the first weechat buffer. You should see your new weechat device's identity keys. If they match, you can verify them in Element.

You can also verify devices from weechat's perspective via /olm verify <username>. There's even some pattern syntax that lets you verify multiple devices at once — but be careful with this.

At this point, you should be able to chat safely with other devices you've done the verification dance with, but there's still a problem: you can't read channel history. You'll see usernames and timestamps in weechat, but each message will start with <Unable to decrypt>.

Decrypt Old Messages

To decrypt channel history, you'll need to export keys from Element and import them into weechat-matrix. Element makes this pretty easy: go to Settings -> Security -> "Export keys manually." Create a passphrase for the key file, and email the file to yourself.

Back in weechat, use /olm import ~/path/to/riot-keys.txt <passphrase>. This may take a bit, and weechat will likely hit 100% cpu during the process.

On success, you still can't read channel history...that is, until you restart weechat!

Feedback

Did it work? Did I miss something? Let me know.

More Info

Here's another useful guide to weechat + matrix.

I'd be remiss if I didn't mention that weechat-matrix is in maintenance mode, and a new Rust port is underway. At the time of this writing it isn't very far along.

On Remote Work: An Interview

2020-03-26T00:00:00

This interview originally appeared on remote.community in March 2020.

Hello! who are you and where do you work?

Hello! I'm Alan Grow, co-founder at Endcrawl. We're a SaaS that makes credits for film & TV. We've been used on thousands of productions including Oscar Winners "Moonlight" and "Nomadland."

How did you get started working remotely?

I first tried remote work back in 2006. It was a failure. Part of that failure was lack of consulting experience, but part of it was also my lack of experience with remote work. After two decades of classrooms and offices, I didn't know how to create productive environments and habits for myself.

The next time I tried full-time remote work was in 2014. I interviewed over Skype and was hired the next day, by someone I'd never met in real life. Six months would pass before we finally met face to face. But that was fine, because this time, things were actually working great! I'd finally figured out what keeps me focused and motivated within a remote team.

What I realize now is that remote work starts and ends with a team of one: yourself. If you can’t keep yourself focused and motivated, someone will have to do that for you remotely, and that’s a tall order. You have to figure those out for yourself the hard way. But the good news is, once you have them, they’re a superpower.

Describe your typical work day or week.

I begin and end my work days at my home office, but in the middle I almost always leave the house to work from a coffee shop. This is a crucial thing that I learned the hard way: even as my home office improves, I still need to get out of the house to clear my head and avoid cabin fever.

Coffee shops and coworking spaces are less predictable in a couple ways that affect work. Sometimes there are network issues, and sometimes there's ~~an annoying conversation~~ noise. Both make it harder to have meetings. So I adapt by taking meetings at home and using these spaces instead for deep work.

That brings some challenges of its own. How do you focus deeply in a noisy public space, where people you know or strangers may interrupt you? For me: repetitive electronic music! A decent pair of headphones with focus-inducing music works wonders, both for your own mental state, and to signal to others that you're uninterruptible.

Just like a change of venue in the afternoon seems to keep me focused throughout the day, changing activities during nights and weekends helps me reset and avoid burnout. Whether it's playing music, lifting weights, or exploring the wilderness here in Utah, getting into a very different mental state keeps my "work mind" fresh.

When I was younger I wanted to program 24/7, and every other activity seemed like a waste of time. Now I know that the opposite is true: the right mix of non-programming activities makes me a much more productive programmer.

What tools do you use when working remotely?

I spend a fair bit of time in Github and Slack. Video conferencing is usually via Google Meet. Beyond that, I try to stay minimal and inhabit a simpler world of text.

I keep long-running tmux sessions both locally and on remote machines. For email I use mutt + offlineimap, and try to only poll for new emails at the beginning and end of the day.

I like spending time in my editor (vim) and at a shell prompt. Learning Unix and the Unix toolset has been a 1000x investment over the course of my career.

Describe how working remotely has affected your life.

Not gonna lie — remote work was a difficult adjustment at first. I'm a social person, and most of my socialization when I lived in NYC in the 2000s was (surprise) via the workplace. I didn't know any different.

But learning how to make and maintain friendships turned out to be an orthogonal concern. It doesn't have to happen through the workplace. That sort of bundling is just an easy default. Friendship can be unbundled from work.

When I left NYC, I had to make friends all over again, as well as maintain more remote friendships. Both of those turned out to be good exercises.

Remote work has let me live away from a major city, but still work with people in major cities on exciting things. I don't have to deal with the stress of traffic, rush hour, crowds, or the concrete jungle. In 15 minutes I can be enjoying a quiet hike in the wilderness. It's the best of both worlds.

Remote work has also let me work with people around the world. I was fortunate to work for London-based music hardware & software maker ROLI, with teams from all over the US and Europe. That kind of geographic diversity comes with all other wonderful kinds of diversity, and it's a difficult thing to replicate in non-remote companies.

Finally, remote work has made me much more results-oriented. Physical offices inevitably become a stage for productivity theater. Try as they might, people are pulled towards the things that look busy, and away from the things that actually produce value. Remote work keeps you honest – you're constantly proving yourself, and you're measured by your output. Those are good things!

What advice would you give to people working remotely?

Adopt a schedule and commit to it. Once it’s “burned in” you can start to experiment, but try to build rhythm first.
Measure your daily output. This could be git commits, tasks checked off, new revenue booked, etc. It doesn’t have to be how others measure you. But it should loosely correlate with that, and it should be something trivially easy for you to measure.
Course correct often. When something impacts your daily output, act on it. Does your output drop noticeably on less than 7 hours of sleep? Get more sleep!
Focus is sacred. Your focus is your temple — don't let anything or anyone defile it. Especially yourself. Find ways to subdue the part of your brain that craves distraction.
Embrace the written word. Always be reading, always be writing, and always be improving your reading and writing.
Review, and seek review. Critical — but constructive — written dialog with your co-workers will level you both up.

Would you like to add anything else?

If you're bootstrapping a remote-first company, consider applying to Earnest Capital. They're an incredible resource to companies like ours, and their community of founders and mentors is very much aligned with the future of work — which is remote!

Shell Quirk: Assignment From a Heredoc

2017-06-10T20:30:00

I have a ~~fetish for~~ fascination with POSIX shell corner cases. It all started a decade ago with a segfault: a certain while read loop ran fine on every Unix except AIX. We were stumped, and I was hooked.

Here's a new find. What will the following POSIX shell program print?

#!/bin/sh
paths=`tr '\n' ':' | sed -e 's/:$//'`<<EOPATHS
/foo
/bar
/baz
EOPATHS
echo "$paths"

If you said /foo:/bar:/baz, you're right...that is, if you're on Linux and /bin/sh is provided by dash.

If you're on MacOS [1] or FreeBSD instead, this same script will wait for input and print nothing. This is probably the behavior on all BSD derivatives, and it's likely the correct behavior too, since the BSDs are usually right about these things.

Correct or not, the dash behavior is a bit more useful. It also points to a fundamental difference in the way here-documents work: dash interprets the heredoc before anything else on the line. When the assignment is interpreted next, stdin already has the contents of the heredoc. I'm not even sure what the other POSIX shells do. Is the heredoc interpreted after the assignment? Where does it even go?

Fortunately there's an easy portable alternative: move the heredoc inside the backquotes, attached to the command that actually reads it (tr).

#!/bin/sh
paths=`tr '\n' ':' <<EOPATHS | sed -e 's/:$//'
/foo
/bar
/baz
EOPATHS`
echo "$paths"

[1] Note that on recent MacOS versions, /bin/sh is actually bash in POSIX mode. Don't believe me? Run /bin/sh --help and /bin/sh -c 'echo $POSIXLY_CORRECT'.

Blog Refresh: Now With Less

2017-05-01T02:13:42

To readers who enjoyed the 3-column layout, the Edgar Allan Poe quote, and the engraving of the fragile rowboat disappearing into the mighty maelstrom: I'm sorry. It's all gone. To me, minimalism is less an aesthetic than it is the search for time invariants, and well...here we are some years later.

It's actually a bit more practical than all that. After porting this blog from jekyll to tinysite, I discovered that the very problem I set out to solve -- fast incremental site rebuilds -- was still a problem. No comment on why this seems to be a common failure mode for shiny two-point-oh-y things.

The culprit? That index of posts in the right column. The simple act of fixing a typo on a single page would cause posts.json to rebuild, and then every post would be rebuilt in a cascade, since the right column of every post depended on posts.json. Other static site generators probably learned to avoid this years ago. I finally came around to it this weekend.

In the interim, editing posts has been pretty unpleasant. Doubly so because I had no one to blame but myself. Now incremental site rebuilds are quick and can be accelerated with make -j as before.

With that out of the way, I decided to take advantage of the "let's optimize the shit out of everything" mental state I was in and see what could be done to speed up the publishing side of things. I really like the Heroku / Github Pages approach of "just git push and we'll do the rest," and have spent the last few years building systems to make everything at Endcrawl work like that. Maybe those years would have been better spent learning docker or kube. Maybe the people who regard deploying-via-git as an antipattern are right. But I can't shake the idea that we're overengineering the hell out of this problem right now. As one HN commenter put it:

In the long term I predict that base OS everywhere will improve support for deployment, workload scheduling, resource allocation, endpoint discovery, and dependency management. These will match and eventually surpass the additional capabilities that containers offer, and then we can all go back to putting files on a server and restarting a process, which is all that 99% of us actually need.

There's a bit more to the story than the part I emphasized, but that's one for another day. Suffice to say there's tooling now that fully realizes the "dream deploys" idea, this site uses it, and who knows, maybe it'll get opensourced one day.

I also took a stab at the horribly clunky {% highlight lang %} template syntax this blog used for code highlighting. When I started there was no good standard for this kind of thing, but now it seems fenced code blocks have won. Good for them, they're awesome. Switching tinysite to fenced code turned out to be trivial (diff), mainly because the original approach was a small regex hack rather than a more evolved approach. That Yagni guy they're always invoking knows what's up!

Oh yeah. The Disqus comments section is gone. Good riddance. It's been broken for years, ever since I migrated from acg.github.io to this domain. I probably made a mistake somewhere in the Disqus migration tool but never could figure it out. If you feel a burning desire to rebut or high-five something, hit me up on twitter and I may link to it. Better yet, open a github issue against this blog.

Dream Deploys: Atomic, Zero-Downtime Deployments

2015-06-05T21:11:00

(Update: here's a real implementation.)

Are you afraid to deploy? Do deployments always mean either downtime, leaving your site in an inconsistent state for a while, or both? It doesn't have to be this way!

Let's conquer our fear. Let's deploy whenever we damn well feel like it.

You Don't Need Much

This is a tiny demo to convince you that Dream Deploys are not only possible, they're easy.

To live the dream, you don't need much:

You don't need a fancy load balancer.
You don't need magic "clustering" infrastructure.
You don't need a specific language or framework.
You don't need a queue system.
You don't need a message bus or fancy IPC.
You don't even need multiple instances of your server running.

All you need is a couple old-school Unix tricks.

A Quick Demo

Don't take my word for it. Grab the code here with:

git clone git@github.com:acg/dream-deploys.git
cd dream-deploys

In a terminal, run this and visit the link:

./serve

In a second terminal, deploy whenever you want:

./deploy

Refresh the page to see it change.

Edit code, static files, or both under ./root.unused. Then leave ./root.unused and run ./deploy to see your changes appear atomically and with zero downtime.

Questions & Answers

What do you mean by a "zero downtime" deployment?

At no point is the site unavailable. Requests will continue to be served before, during, and after the deployment. In other words, this is about availability.

What do you mean by an "atomic" deployment?

For a given connection, either you will talk to the new code working against the new files, or you will talk to the old code working against the old files. You will never see a mix of old and new. In other words, this is about consistency.

How does the zero downtime part work?

This brings us to Unix trick #1. If you keep the same listen socket open throughout the deployment, clients won't get ECONNREFUSED under normal circumstances. The kernel places them in a listen backlog until our server gets around to calling accept(2).

This means, however, that our server process can't be the thing to call listen(2) if we want to stop and start it, or we'll incur visible downtime. Something else – some long running process – must call listen(2) and keep the listen socket open across deployments.

The trick in a nutshell, then, is this:

A tiny, dedicated program calls listen(2) and then passes the listen socket to child processes as descriptor 0 (stdin). This process replaces itself by executing a subordinate program.
The subordinate program is just a loop that repeatedly executes our server program. Because this loop program never exits, the listen socket on descriptor 0 stays open.
Our server program, instead of calling bind(2) and listen(2) like everyone loves to do, humbly calls accept(2) on stdin in a loop and handles one client connection at a time.
When it's time to restart the server process, we tell the server to exit after handling the current connection, if any. That way deployment doesn't disrupt any pending requests. We tell the server process to gracefully exit by sending it a SIGHUP signal.
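Here's a minimal sketch of the server side of that arrangement in Python. The function names are made up for illustration, and it assumes whatever launched the process left an already-listening socket on descriptor 0.

```python
import signal
import socket

def adopt_listener(fd=0):
    """Wrap an inherited, already-listening descriptor; never bind() or listen()."""
    # Python 3.7+ detects the socket family and type from the descriptor.
    return socket.socket(fileno=fd)

def serve_one(listener, reply=b"hello\n"):
    """Accept a single connection, answer it, and close it."""
    conn, _peer = listener.accept()
    with conn:
        conn.sendall(reply)

def serve_until_hup(listener, reply=b"hello\n"):
    """Handle one connection at a time; after SIGHUP, finish up and return."""
    stop = []
    signal.signal(signal.SIGHUP, lambda signum, frame: stop.append(signum))
    listener.settimeout(0.5)  # wake up periodically to notice a pending SIGHUP
    while not stop:
        try:
            conn, _peer = listener.accept()
        except socket.timeout:
            continue
        with conn:
            conn.settimeout(None)  # the accepted connection itself blocks normally
            conn.sendall(reply)
```

A server script built on this would end with serve_until_hup(adopt_listener(0)). Because the signal handler only sets a flag, a SIGHUP that arrives mid-request lets the current connection finish before the loop exits, and clients that connect in the meantime simply wait in the backlog for the next server process.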

Note: a shocking, saddening number of web frameworks force you to call listen(2) in your Big Ball Of App Code That Needs To Be Restarted. The connect HTTP server framework used by express, the most popular web app framework for Node.js, is one of them.

"I'll just use the new SO_REUSEPORT socket option in Linux!" you say.

Fine, but take care that at least one server process is always running at any given time. This means some handoff coordination between the old and new server processes. Alternately, you could run an unrelated process on the port that just listens.
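For comparison, here's roughly what the SO_REUSEPORT handoff looks like in Python. This is a sketch under assumptions: the helper names are hypothetical, it binds to 127.0.0.1, and it needs a platform that defines SO_REUSEPORT.

```python
import socket

def reuseport_listener(port, host="127.0.0.1", backlog=128):
    """Open a listening socket that other SO_REUSEPORT sockets may share a port with."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(), and on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock

def hand_off(old_listener):
    """Bring up the replacement listener before retiring the old one."""
    port = old_listener.getsockname()[1]
    new_listener = reuseport_listener(port)  # the port is never left unattended
    old_listener.close()
    return new_listener
```

In real life the old and new listeners live in different processes, so the "new one is up, old one may close" step becomes a coordination problem. Connections already queued on the old socket but not yet accepted can also be dropped when it closes, which is part of why this handoff needs care.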

At any rate, an accept(2)-based server is simpler. It also has some nice added benefits unrelated to deployments:

An accept(2)-based server is network-agnostic. For instance, you can run it behind a Unix domain socket without modifying a single line of code.
An accept(2)-based server is a more secure factoring of concerns. If your server listens directly on a privileged port (80 or 443), you'll need root privileges or a fancy capabilities setup. After binding, a listen server should also drop root privileges (horrifyingly, some don't). The accept(2) factoring means a tiny, well-audited program can bind to the privileged port, drop privileges to a minimally empowered user account, and run a known program. This is a huge security win.

How does the atomic part work?

A connection will either be served by the old server process or the new server process. The question is whether the old process might possibly see new files, or the new process might see old files. If we update files in-place then one of these inconsistencies can happen. This forces us to keep two complete copies of the files, an old copy and a new copy.

While we're updating the new files, no server process should use them. If the old server process is restarted during this phase, intentionally or accidentally, it should continue to work off the old files. When the new copy is finally ready, we want to "throw the switch": deactivate the old files and simultaneously activate the new files for future server processes. The trick is to make throwing the switch an atomic operation.

There are a number of things Unix can do atomically. Among them: use rename(2) to replace a symlink with another symlink. If the "switch" is simply a symlink pointing at one directory or the other, then deployments are atomic. This is Unix trick #2.
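Throwing the switch can be sketched in a few lines of Python. The function name and the temporary-name scheme are my own choices here, not something lifted from the demo's ./deploy script.

```python
import os

def switch_symlink(link_path, new_target):
    """Atomically repoint the symlink at link_path to new_target."""
    tmp_path = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(new_target, tmp_path)       # build the new symlink off to the side
    try:
        os.replace(tmp_path, link_path)    # rename(2): swaps it in atomically
    except OSError:
        os.unlink(tmp_path)                # don't leave the temporary link behind
        raise
```

The point of the two-step dance is that there's no moment when link_path is missing: a process resolving the symlink sees either the old target or the new one. Deleting the old link and then creating a new one would leave a gap.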

What about serving inconsistent assets? Browsers open multiple connections.

This is a problem, but there's also a straightforward solution.

Let's clarify the problem first: during a deployment, a client may request a page from the old server, then open more connections that request assets from the new server. (Remember, consistency is only guaranteed within the same connection.) So you can get old page content mixed with new css, js, images, etc.

The solution in prevailing practice is to build a new tagged set of static assets for every deployment, then have the page refer to all assets via this tag. You can do this by modifying the ./deploy script like so:

Update the new files.
Generate a unique tag $TAG. Epoch timestamps are usually good enough.
Record $TAG in a file inside the new file directory.
Copy all the static assets into a new directory assets.$TAG outside of both file copies.
Continue with the deployment.

When the server starts up, it should read $TAG from the file, and make sure all asset URLs it generates contain $TAG.

That's pretty much it. If you keep the old assets.$TAG directories around for a while, even sessions that haven't reloaded the page will continue to get consistent results across deployments. Eventually you'll want to delete the old directories.
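The tagging steps above might look something like this in Python. The TAG file name, the static subdirectory, and both function names are hypothetical; the demo's ./deploy script doesn't do any of this.

```python
import shutil
import time
from pathlib import Path

def publish_assets(new_root, assets_parent, tag=None):
    """Tag a deployment and copy its static assets to assets.<tag>."""
    new_root = Path(new_root)
    tag = tag or str(int(time.time()))           # an epoch timestamp as the unique tag
    (new_root / "TAG").write_text(tag + "\n")    # record the tag inside the new copy
    # Copy the assets outside both file copies so they outlive the switch.
    shutil.copytree(new_root / "static", Path(assets_parent) / f"assets.{tag}")
    return tag

def asset_url(root, name):
    """Build an asset URL containing the tag recorded in the active copy."""
    tag = (Path(root) / "TAG").read_text().strip()
    return f"/assets.{tag}/{name}"
```

A page rendered by the old server keeps pointing at the old assets.$TAG directory, and a page rendered by the new server points at the new one, so neither ever sees a mix.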

The long term solution to this problem is HTTP/2 multiplexing, which makes multiple browser connections unnecessary.

What about serving inconsistent ajax requests?

Let's clarify this problem: during a deployment, a client may request a page from the old server, then open more connections that make ajax requests of the new server using old client code.

There's a less technical solution to this one: simply make your API backwards compatible. This is a good idea regardless.

What about concurrency? Your example only serves one connection at a time.

You can run as many accept(2)-calling server processes as you want on the same listen socket. The kernel will efficiently multiplex connections to them.

In production, I use a small program I wrote called forkpool that keeps N concurrent child processes running. It doesn't do anything beyond this, which means it doesn't have any bugs at this point and never needs restarting. Remember, children are a precious resource, but without a parent to keep that listen socket open they're orphans.

What about deployment collisions?

Yes, you really should prevent concurrent deployments via a lock. That's not demonstrated here, but it's extremely easy and reliable to do with the setlock(8) program from daemontools.
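If you'd rather take the lock from inside a deploy script, the same idea is a few lines of Python. This is only a sketch of the concept, not a description of how setlock itself works; the function name and lock file path are assumptions.

```python
import fcntl
import os

def try_deploy_lock(path):
    """Return an fd holding an exclusive lock on path, or None if it's already held."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast instead of waiting
    except BlockingIOError:
        os.close(fd)
        return None
    return fd  # the lock is released when this fd is closed or the process exits
```

A deploy script would call try_deploy_lock at the top and bail out with an error if it gets None back. Since the kernel drops the lock when the process exits, a crashed deployment can't leave a stale lock behind.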

What about deploying database schema changes?

This topic has been covered well elsewhere.

Turn Vim Into Excel: Tips for Editing Tabular Data

2013-03-29T00:00:00

Vim editing some 2010 US census data

Vim can edit just about anything, including tabular data. This post has a few tips for making stock Vim more spreadsheet-like.

We'll assume you're editing files in tab-separated value format (TSV). CSV is a notoriously thorny file format with plenty of edge cases and surprises, so if you have CSV files, it's simpler to sidestep all that and roundtrip CSV to TSV for editing.

A Note on the TSV Format

To do TSV right, you should escape newline and tab characters in data. Here are two scripts, csv2tsv and tsv2csv, that will handle escaping during CSV <-> TSV conversions.

Converting CSV to TSV, with C-style escaping:

csv2tsv -e < file.csv > file.tsv

Converting TSV back to CSV, with C-style un-escaping:

tsv2csv -e < file.tsv > file.csv
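If you're curious what C-style escaping amounts to, here's a small Python sketch. I'm assuming the usual set of escapes (backslash, tab, newline, carriage return); the actual scripts may handle more or fewer cases.

```python
import re

_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}

def escape_cell(text):
    """Replace backslash, tab, newline and carriage return with C-style escapes."""
    return re.sub(r"[\\\t\n\r]", lambda m: _ESCAPES[m.group(0)], text)

def unescape_cell(text):
    """Reverse escape_cell; unrecognized escapes are left untouched."""
    return re.sub(r"\\(.)", lambda m: _UNESCAPES.get(m.group(1), m.group(0)), text)
```

Escaping the backslash itself is what makes the round trip lossless: a cell containing a literal backslash followed by t survives, instead of turning into a tab on the way back.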

Setting up Tabular Editing in Vim

Open the file:

:e file.tsv

Excel numbers the rows, why can't we?

:set number

Adjust your tab settings so you're editing with hard tabs:

:setlocal noexpandtab

Now, widen the columns enough so they're aligned:

:setlocal shiftwidth=20
:setlocal softtabstop=20
:setlocal tabstop=20

Fiddle with that number 20 as needed. As far as I can tell, Vim doesn't support variable tab stops. It would be real nifty if I was wrong about this. It would be even niftier if column width detection / tabstop setting could be automated.
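Width detection is easy enough to script outside of Vim. Here's a rough Python sketch that measures the columns and prints the matching settings. The function names are made up, and it assumes every character takes one display column. As far as I know, Vim 8.1 and later do have a vartabstop option when built with +vartabs (check :version), so the sketch emits that form too.

```python
def column_widths(lines, padding=2):
    """Return the display width needed for each tab-separated column."""
    widths = []
    for line in lines:
        for i, cell in enumerate(line.rstrip("\n").split("\t")):
            if i == len(widths):
                widths.append(0)
            # Leave at least one spare column so a full cell doesn't spill over.
            widths[i] = max(widths[i], len(cell) + padding)
    return widths

def tabstop_commands(lines, padding=2):
    """Build two Vim commands: a uniform tabstop and a per-column vartabstop."""
    widths = column_widths(lines, padding)
    uniform = f":setlocal tabstop={max(widths, default=8)}"
    variable = ":setlocal vartabstop=" + ",".join(str(w) for w in widths)
    return uniform, variable
```

Feed it the lines of your tsv file and paste whichever command your Vim supports. The uniform version is just the widest column applied everywhere, which is the number you'd otherwise arrive at by fiddling.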

Tall Spreadsheets: Always-Visible Column Names Above

Typically, the first line of the tsv file is a header containing the column names. We want those column names to always be visible, no matter how far down in the file we scroll. The way we'll do this is by splitting the current window in two. The top window will only be 1 line high and will show the headers. The bottom window will be for data editing.

:sp
:0
1 CTRL-W _
CTRL-W j

At this point you should have two windows, one above the other, with the top one showing the first row of column headers. If you don't have very many columns, then you're done.

Wide Spreadsheets: Horizontal Scrolling

If you do have lots of columns, or very wide columns, you're probably noticing how confusing it looks when lines wrap. Your columns don't line up so well anymore. So turn off wrapping for both windows:

:set nowrap
CTRL-W k
:set nowrap
CTRL-W j

One problem remains: when you scroll right to edit columns in the data pane, the header pane doesn't scroll to the right with it. Once again, your columns aren't aligned.

Fortunately Vim has a solution: you can "bind" horizontal scrolling of the two windows. This forces them to scroll left and right in tandem.

:set scrollopt=hor
:set scrollbind
CTRL-W k
:set scrollbind
CTRL-W j

Wide spreadsheets also make it harder to eyeball other cells in the current row. You can enable a row highlight with:

:set cursorline

But What About Formulas and Calculations?!

It's true, Excel does way more than just edit tabular data. Vim is "just" an editor.

If you're up for some programming, this approach might work for you:

1. Start with your data tsv.
2. Mirror it with a second "formula tsv" that contains interpreted cells.
3. Write a program that will apply (2) to (1), "rendering" a tsv with calculated data.
4. View (3) in a read-only buffer. Separately edit the data and formula tsvs.
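To make the rendering step concrete, here's a toy version in Python. The formula syntax is invented for this sketch: a formula cell looks like "=sum q1 q2", naming an operation and the columns of the same row it applies to. Anything that doesn't start with "=" is passed through from the data tsv.

```python
OPS = {"sum": sum, "min": min, "max": max}

def parse_tsv(text):
    """Split tsv text into a list of rows, each a list of cells."""
    return [line.split("\t") for line in text.splitlines()]

def render(data_rows, formula_rows):
    """Apply the formula tsv to the data tsv, returning the rendered rows."""
    header = data_rows[0]
    rendered = [list(header)]
    for data, formulas in zip(data_rows[1:], formula_rows[1:]):
        named = dict(zip(header, data))        # column name -> cell in this row
        row = []
        for value, formula in zip(data, formulas):
            if formula.startswith("="):
                op, *columns = formula[1:].split()
                result = OPS[op]([float(named[c]) for c in columns])
                # Print whole numbers without a trailing ".0".
                value = str(int(result)) if result.is_integer() else str(result)
            row.append(value)
        rendered.append(row)
    return rendered
```

Both tsvs need the same shape, header row included. Writing the result back out is just a matter of joining each row with tabs.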

If you're not up for that, I hear good things about VisiData.

How to printf a length-delimited string

2012-11-15T00:00:00

Sometimes you're dealing with a string that isn't null-delimited but rather length-delimited, and you wind up doing somersaults just to print it out:

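#include <stdio.h>
#include <string.h>

void logit(const char *string, size_t length) {
  char buf[255];
  /* Copy at most length bytes: the input has no terminating NUL to stop at. */
  size_t n = length < sizeof(buf) - 1 ? length : sizeof(buf) - 1;
  memcpy(buf, string, n);
  buf[n] = '\0';
  fprintf(stderr, "debug: %s\n", buf);
}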

The extra copying isn't necessary, and you don't have to live with the potential length-truncation either. Did you know printf(3) can format length-delimited strings directly? Buried in the man page is this little gem:

The precision

An optional precision, in the form of a period ('.') followed by an optional decimal digit string. Instead of a decimal digit string one may write "*" or "*m$" (for some decimal integer m) to specify that the precision is given in the next argument, or in the m-th argument, respectively, which must be of type int. This gives ... the maximum number of characters to be printed from a string for s and S conversions.

With that in mind, we can just write:

void logit(const char *string, size_t length) {
  fprintf(stderr, "debug: %.*s\n", (int)length, string);
}

Recovering a Dying iPod Disk

2012-04-03T00:00:00

An 80GB iPod Classic filled with 4 years of music started to die on us. The symptom: the menu screen suddenly showed "No Music," but disk usage was still nearly 100%. I figured this meant the internal 1.8" hard disk had started to go south and had taken some critical sectors with it.

That turned out to be the case. But here's how we recovered nearly 10,000 files from the iPod anyway...

The Winning Ticket

Before things got any worse, I decided to grab an image of the entire disk:

sudo dd if=/dev/sdc bs=1M conv=noerror,sync | pv > ipod.img

The "conv=noerror" directive tells dd to keep on going if there are disk read errors instead of erroring out. (There were about a dozen. Sectors had probably been going bad for some time, and finally a critical one bit the dust.)

The "conv=sync" directive tells dd to write out an appropriately sized block of zeroes whenever there's an error reading a block. This is necessary, or file offsets will be wrong from the point of the error onward.

The pv command just shows some nice info about how much data is flowing through and how long it's taken. It's not essential here.
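To make the two conv directives concrete, here's a rough Python sketch of the same idea. It is not how dd is implemented; image_device and the read_block callback are hypothetical names I've made up so the error path can be simulated without a dying disk.

```python
import io


def image_device(read_block, total_size, out, block_size=1024 * 1024):
    """Copy total_size bytes in fixed-size blocks, zero-filling bad ones."""
    bad_blocks = []
    for offset in range(0, total_size, block_size):
        want = min(block_size, total_size - offset)
        try:
            data = read_block(offset, want)
        except OSError:
            # like conv=noerror: note the failure and keep going
            bad_blocks.append(offset // block_size)
            data = b""
        # like conv=sync: pad short or failed blocks with zeroes so that
        # everything after them stays at the right offset in the image
        out.write(data.ljust(block_size, b"\0"))
    return bad_blocks


# Demo with a fake 14-byte "disk" whose second 4-byte block is unreadable.
disk = b"AAAABBBBCCCCDD"


def flaky_read(offset, size):
    if offset == 4:
        raise OSError("simulated medium error")
    return disk[offset:offset + size]


image = io.BytesIO()
print(image_device(flaky_read, len(disk), image, block_size=4))  # prints [1]
print(image.getvalue())
```

Note that the final short block gets padded too, so the image can come out slightly longer than the source.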

As described below, I tried to fsck.vfat the first partition of the disk image, but this reported that an unusually high number of free cluster chains would be reclaimed. This indicated that FAT32 metadata had been damaged and that walking the complete filesystem directory structure wouldn't be possible anymore.

The new approach was to say, to hell with directory structure, let's just linearly scan the disk image for files and extract them. This needles-in-the-haystack approach isn't for everybody: you will lose filenames, permissions, directory locality etc. But most mp3s have self-identifying id3 tag metadata so we didn't care too much.
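For a feel of what "linearly scan for needles" means, here's a toy Python sketch that finds every offset where an ID3 tag header might start. This is only an illustration under my own assumptions: find_signatures is a made-up name, and real carving tools do a great deal more (validating headers, working out where each file ends, and so on).

```python
import io


def find_signatures(stream, magic=b"ID3", chunk_size=1 << 20):
    """Return every byte offset in stream where magic occurs."""
    offsets = []
    tail = b""   # last few bytes of the previous window
    base = 0     # file offset of the first byte of the current window
    keep = len(magic) - 1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(magic)
        while pos != -1:
            offsets.append(base + pos)
            pos = window.find(magic, pos + 1)
        # carry over just enough bytes to catch a match split across chunks
        tail = window[-keep:] if keep else b""
        base += len(window) - len(tail)
    return offsets


sample = io.BytesIO(b"abID3cdefgID3hID")
print(find_signatures(sample, chunk_size=4))  # prints [2, 10]
```

The tiny chunk_size in the demo is deliberate: it forces both matches to straddle a chunk boundary.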

There are a couple programs that can find file needles in a disk image haystack. The one that worked was PhotoRec, which can actually find much more than just photo files. For an opensource unix program it has a rather strange set of options and user interface. Anyway, I ran it with:

photorec /log /debug /d rescue ipod.img

All in all photorec recovered over 8,000 mp3s and some other files to boot.

Pass 1 - Reading sector  135045680/155907592, 9944 files found
Elapsed time 1h14m22s - Estimated time for achievement 0h11m29
mp3: 8339 recovered
mov: 1264 recovered
txt: 129 recovered
apple: 96 recovered
tx?: 63 recovered
jpg: 21 recovered
aif: 13 recovered
riff: 12 recovered
mpg: 3 recovered
gpg: 1 recovered
others: 3 recovered

Afterwards, the files were scattered randomly in flat directories named rescue.1, rescue.2, rescue.3 etc:

ls rescue.1 | grep mp3 | head

f0234384.mp3
f0241008.mp3
f0247536.mp3
f0254352.mp3
f0257680.mp3
f0263664.mp3
f0271120.mp3
f0277872.mp3
f0284784.mp3
f0292176.mp3

If desired, they can be renamed into Artist + Album + Track + Title directories via a program like supertag (disclaimer: I'm the author). But I'm not sure iTunes even cares about filenames.

Addendum: as time has gone on, we've noticed that a fair percentage of the songs were truncated by photorec, something like 1 in 5. One of these rainy weekends I'm going to see if I can patch photorec's mp3 recognition.

Dead-Ends and Other Things We Tried

The filesystem was W95 FAT32 but couldn't be mounted due to the bad sectors. Doing an fsck on the block device was also not possible because of read errors. The errors manifested themselves like this in dmesg:

[64658.941382] sd 6:0:0:0: [sdc] Unhandled sense code
[64658.941395] sd 6:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[64658.941407] sd 6:0:0:0: [sdc] Sense Key : Medium Error [current]
[64658.941422] Info fld=0x0
[64658.941428] sd 6:0:0:0: [sdc] Add. Sense: Unrecovered read error
[64658.941442] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 40 00 00 01 00
[64658.941470] end_request: I/O error, dev sdc, sector 512
[64658.941484] Buffer I/O error on device sdc, logical block 64

After capturing the disk image, it was possible to run fsck.vfat directly on the partition file; it doesn't actually require a block device, which is cool.

To run fsck on the disk image file, we needed to extract the lone FAT32 partition into a file by itself. The trick here was figuring out where the partition started. Doing an fdisk on the actual block device for the iPod (/dev/sdc) to figure out the disk geometry helped. Using that geometry, this command let us figure out the sector offset of the first partition:

fdisk -u -C 14991 -b 4096 -l ipod.img

Device Boot         Start         End      Blocks   Id  System
ipod.img1              63    19488469    77953628    b  W95 FAT32

A trick to extract the partition image:

{ dd bs=4096 skip=63 count=0 ; pv ; } < ipod.img > ipod.img.part1

This took a while. Disks are slow.
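The arithmetic behind that dd invocation is just start sector times sector size. Here's the same extraction as a small Python sketch; extract_partition is a made-up helper name, and the two constants are simply the values from the fdisk run above.

```python
import io
import shutil

SECTOR_SIZE = 4096   # the -b value passed to fdisk
START_SECTOR = 63    # "Start" column for ipod.img1


def extract_partition(image, out, start_sector=START_SECTOR,
                      sector_size=SECTOR_SIZE):
    """Copy everything from the partition's first byte to end of image."""
    offset = start_sector * sector_size
    image.seek(offset)
    shutil.copyfileobj(image, out)
    return offset


print(START_SECTOR * SECTOR_SIZE)  # prints 258048, the byte offset

# Tiny in-memory demo: 2 "sectors" of 4 bytes precede the payload.
part = io.BytesIO()
extract_partition(io.BytesIO(b"hdr0hdr1payload"), part,
                  start_sector=2, sector_size=4)
print(part.getvalue())  # prints b'payload'
```

With real files you'd pass open("ipod.img", "rb") and open("ipod.img.part1", "wb") instead of the BytesIO objects.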

Then I ran fsck.vfat on the partition image:

fsck.vfat -v -n ipod.img.part1

...
Checking for unused clusters.
Reclaimed 3561014 unused clusters (58343653376 bytes).
...

As you can see, it thought most of the disk consisted of free clusters -- this is bad. If I had tried to repair the disk via fsck, only a small fraction of the files would have been recovered.

You can see which file paths were traversed with the -l switch:

fsck.vfat -v -n -l ipod.img.part1

In our case this helped me verify that only a small number of files were actually going to be recovered by the fsck.

Once I gave up on fsck and embarked on needle-in-haystack file extraction, I tried magicrescue. It found mp3s but kept saying "invalid mp3 file" and extracted almost none of them. It was also really slow -- it shells out to perl scripts and mpg123 to test mp3 validity. Yuck.

How Many Consonant Pairs Do We Actually Use?

2012-02-26T00:00:00

Of all possible pairs of consonants you could start a word with, how many are actually valid in the English language?

The question came up at a party during a disappointing Ouija board session where the spirits conjured gibberish like "QHPEV." Someone wondered aloud how difficult it was to pick a valid pair of consonants at random. Instinctively, we felt that most of them were invalid.

This is a nice little problem for the unix text processing toolset. I used the 2006 Scrabble Tournament Word List because /usr/share/dict/words contains many proper names and non-words. To get the count:

tr '[A-Z]' '[a-z]' < TWL06.txt |
sed -nEe 's/^([a-z]{2}).*$/\1/p' | 
grep -v '[aeiouy]' |
sort -u | 
wc -l
82

There are 20 consonants in the language after removing "aeiouy", so that makes 400 possible pairs of consonants.

So only 20.5% of all consonant pairs are valid beginnings for an English word.
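If you'd rather check the logic without the shell, here's roughly the same pipeline as a Python sketch. The sample word list is made up for the demo (it is not TWL06), and leading_consonant_pairs is just an illustrative name.

```python
import re

VOWELS = set("aeiouy")


def leading_consonant_pairs(words):
    """Return the sorted set of two-consonant prefixes found in words."""
    pairs = set()
    for word in words:
        prefix = word.strip().lower()[:2]
        # same test as the sed + grep -v steps: two letters, no vowels
        if re.fullmatch(r"[a-z]{2}", prefix) and not (set(prefix) & VOWELS):
            pairs.add(prefix)
    return sorted(pairs)


sample = ["BWANA", "cwm", "llama", "apple", "tree", "TRY", "qi"]
print(leading_consonant_pairs(sample))  # prints ['bw', 'cw', 'll', 'tr']

possible = (26 - len(VOWELS)) ** 2
print(possible)                         # prints 400
print(round(82 / possible * 100, 1))    # prints 20.5
```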

To see the 82 valid pairs:

tr '[A-Z]' '[a-z]' < TWL06.txt |
sed -nEe 's/^([a-z]{2}).*$/\1/p' | 
grep -v '[aeiouy]' |
sort -u |
tr '\n' ' '

bd bh bl br bw ch cl cn cr ct cw cz
dh dj dr dw fj fl fr gh gj gl gn gr
gw hm hr hw jn kb kh kl kn kr kv kw
ll lw mb mh mm mn mr ng nt pf ph pl
pn pr ps pt qw rh sc sf sg sh sj sk
sl sm sn sp sq sr st sv sw tc th tm
tr ts tw tz vr wh wr zl zw zz

To see an example word for each valid pair (remember, this is the Scrabble dictionary, so there's some pretty weird stuff in there):

tr '[A-Z]' '[a-z]' < TWL06.txt |
tr -d '\r' |
sed -nEe 's/^([a-z]{2})(.*)$/\1\2 \1/p' |
grep ' [^aeiouy][^aeiouy]' |
sort |
uniq -f1 |
awk '{ print $2, $1 }'

bd bdellium
bh bhakta
bl blabbed
br brabble
bw bwana
ch chabazite
cl clabber
cn cnida
cr craal
ct ctenidia
cw cwm
cz czar
dh dhak
dj djebel
dr drabbed
dw dwarf
fj fjeld
fl flabbergasted
fr frabjous
gh gharial
gj gjetost
gl glabellae
gn gnar
gr graal
gw gweduc
hm hm
hr hryvna
hw hwan
jn jnana
kb kbar
kh khaddar
kl klatches
kn knacked
kr kraaled
kv kvases
kw kwacha
ll llama
lw lwei
mb mbaqanga
mh mho
mm mm
mn mnemonically
mr mridangam
ng ngultrum
nt nth
pf pfennige
ph phaeton
pl placabilities
pn pneuma
pr praam
ps psalmbook
pt ptarmigan
qw qwerty
rh rhabdocoele
sc scabbarded
sf sferics
sg sgraffiti
sh shabbatot
sj sjamboked
sk skag
sl slabbed
sm smacked
sn snacked
sp spaceband
sq squabbier
sr sraddha
st stabbed
sv svarajes
sw swabbed
tc tchotchkes
th thacked
tm tmeses
tr trabeated
ts tsaddikim
tw twaddled
tz tzaddikim
vr vroomed
wh whacked
wr wracked
zl zlote
zw zwiebacks
zz zzz

Aside: finding good and freely available (ie opensource or creative commons) word lists is surprisingly annoying.

Mutt Tip: Attach Multiple Files

2011-11-25T00:00:00

You can attach multiple files in mutt's file browser, if they're in the same directory: just use 't' to tag them, then ';'-Enter. (Quickly, one after the other.) You can also view files from the file browser before attaching them, just hit Space. Ten years of mutt and I'm still discovering this stuff...

Inconsistent split Behavior in Python

2011-11-05T00:00:00

Here's a futile but cathartic bug report I filed against Python recently.

In Python, string.split and re.split both take an optional argument that limits the number of splits that are done. This is unlike Perl's split builtin, which limits the number of pieces. But it makes sense I guess, and consistency between the two languages is not something I'd necessarily expect.

However, consistency within a language...a reasonable expectation, no?

The inconsistency lies in how string.split and re.split handle the edge cases of "do an unlimited number of splits" and "don't do any splits." The two agree that "unlimited splits" is the default. They don't agree on how to interpret the value of an explicit maxsplit parameter.

               maxsplit=0         maxsplit=-1
string.split   no splits          unlimited splits
re.split       unlimited splits   no splits

I think string.split is doing the sensible thing here.
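Here's a short snippet to see the table for yourself. I've written it against Python 3, where string.split is spelled as the str.split method and maxsplit is best passed to re.split by keyword. The results in the comments are the ones from the table above, so treat them as what I'd expect rather than a guarantee for every interpreter version.

```python
import re

text = "a,b,c"

# str.split: 0 means "no splits", -1 means "unlimited"
print(text.split(",", 0))    # ['a,b,c']
print(text.split(",", -1))   # ['a', 'b', 'c']

# re.split: 0 means "unlimited", -1 means "no splits"
print(re.split(",", text, maxsplit=0))    # ['a', 'b', 'c']
print(re.split(",", text, maxsplit=-1))   # ['a,b,c']
```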

Of course, the "bug" has zero chance of being fixed at this point. I pretty much just filed it to create a search result for others similarly bitten, annoyed, or both.

PostgreSQL Tip: Bulk Copying Data Between Tables

2011-06-17T00:00:00

Suppose you have two different PostgreSQL databases, db1 and db2. You want to populate db2.table2 with data from db1.table1. How?

Try this:

psql -c 'COPY table1 TO STDOUT' db1 | \
psql -c 'COPY table2 FROM STDIN' db2

Is there a more efficient way to do this if the two databases are hosted by the same server instance? Probably.

Then again, if the databases are on different servers, this works:

psql -c 'COPY table1 TO STDOUT' db1 | gzip -c | \
ssh host2 "gunzip -c | psql -c 'COPY table2 FROM STDIN' db2"

Bonus: with pv(1), you can see how quickly the data is flowing:

psql -c 'COPY table1 TO STDOUT' db1 | pv | \
psql -c 'COPY table2 FROM STDIN' db2

Measuring the Measurers

2011-06-10T00:00:00

"Projects A and B are your top priority now. Oh, and Project C can't be impacted."

Sound familiar?

It's a common complaint of the project-managed: everything can't be top priority. Something has to give. Resources allocated to Project A must be deallocated from elsewhere, either Project C, or some other project. Declaring everything "top priority" is not helpful.

If project management accomplishes one thing, it should help each of us answer the question, "What should I work on next?"

A friend of mine relates a story about a meeting between tech and client services. The tech team came prepared with a list of development tasks in loose priority order. As the meeting progressed, the client services team found more and more reasons to disagree with the priorities.

Eventually, in frustration, the tech lead said, "Here's the list. You order it."

The client services lead was taken aback and refused: "It all has to be done. As soon as possible."

This is not helpful.

While I do think there are better ways of scheduling work than imposing a single ordering -- which breaks down when multiple workers are able to proceed in parallel -- I also think the ability to see and maintain consistent priorities is an important thing to look for in a project manager. Or any manager, really.

Which is why I propose the following fun experiment. Present a manager with two randomly sampled work items from their team, side by side, and ask which is higher priority. Repeat until you've got a decent number of comparisons. Remember xkcd's project to find the funniest image in the world? Yeah. It's kinda like that.

Now that we've turned a human being into a comparison operator ;) we can ask how good that operator is. Does it define an ordering? For any reasonable sample size, probably not.

Forget about stable sort. Viewed as a directed graph, there will probably be cycles, like A > B > C > A. In general, you can induce an acyclic digraph from a cyclic digraph by identifying the strongly connected components. So one metric would be to compare the size of the induced acyclic graph to the original graph (1/∥𝐕∥ is the worst, ∥𝐕∥/∥𝐕∥=1 is the best). Another metric would be the height of the induced acyclic graph over the number of nodes (work items). A perfect comparison operator would produce a line of nodes in a well-defined order, and would score 1.0.
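Both metrics are easy to prototype. Below is a minimal Python sketch under my own assumptions: consistency_scores is a made-up name, the comparisons arrive as (higher, lower) pairs, and components are found by brute-force reachability, which is fine for the handful of items you'd realistically sample.

```python
from functools import lru_cache


def consistency_scores(items, prefers):
    """items: list of work items; prefers: set of (higher, lower) pairs."""
    succ = {item: set() for item in items}
    for higher, lower in prefers:
        succ[higher].add(lower)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {item: reachable(item) for item in items}
    # Two items share a strongly connected component iff each reaches the other.
    component = {
        item: frozenset(o for o in reach[item] if item in reach[o])
        for item in items
    }
    components = set(component.values())

    # Edges of the induced acyclic graph (one node per component).
    dag = {c: set() for c in components}
    for higher, lower in prefers:
        if component[higher] != component[lower]:
            dag[component[higher]].add(component[lower])

    @lru_cache(maxsize=None)
    def height(c):
        # number of components on the longest chain starting at c
        return 1 + max((height(d) for d in dag[c]), default=0)

    size_score = len(components) / len(items)
    height_score = max(height(c) for c in components) / len(items)
    return size_score, height_score


# A > B > C > A collapses into a single component: both scores are 1/3.
print(consistency_scores(["A", "B", "C"], {("A", "B"), ("B", "C"), ("C", "A")}))
# A consistent manager yields a line of three nodes: (1.0, 1.0).
print(consistency_scores(["A", "B", "C"], {("A", "B"), ("B", "C"), ("A", "C")}))
```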

Another thing to measure would be the consistency of the ordering over time. Yes, priorities change, but resource re-allocation also has a cost.

Measuring the measurers seems like a good thing for a number of reasons. Among them, that it exposes the often subtle problems of conflicting directives and the even subtler problems of competing directives. Too often, only the people carrying out the directives are aware of them.

Put Everything in vi Mode

2011-05-17T00:00:00

If you're a vi user like me, try adding these two lines to your ~/.inputrc file:

set keymap vi
set editing-mode vi

Now, every program that uses the readline library for tty input (perl -d, the python REPL, psql, gdb, anything you run under rlwrap, etc.) has vi key bindings instead of the default emacs bindings.

In short, this means things like:

0 and $ for beginning and end of line
k and j for navigating history backwards and forwards
b and e for skipping words
u for undo

See this readline vi mode cheatsheet for a longer list.

I've been using this for years with bash, where one can do set -o vi. Apparently vi mode has been present since GNU readline 2.0, released in 1994, so I really have no excuse for this one!

How I Lost $100 and Blamed It On cal(1)

2011-03-22T00:00:00

True story. Back in September 2008, I decided that this year, I would not wait until the last minute to book my Thanksgiving flight home.

What's the rule for Thanksgiving again? Oh right, fourth Thursday in November. So I busted out cal(1):

$ cal
   September 2008
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30

Whoops, it only shows the current month. So I passed it the year:

$ cal 08
                               8

      January               February               March
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7            1  2  3  4               1  2  3
 8  9 10 11 12 13 14   5  6  7  8  9 10 11   4  5  6  7  8  9 10
15 16 17 18 19 20 21  12 13 14 15 16 17 18  11 12 13 14 15 16 17
22 23 24 25 26 27 28  19 20 21 22 23 24 25  18 19 20 21 22 23 24
29 30 31              26 27 28 29           25 26 27 28 29 30 31

       April                  May                   June
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7         1  2  3  4  5                  1  2
 8  9 10 11 12 13 14   6  7  8  9 10 11 12   3  4  5  6  7  8  9
15 16 17 18 19 20 21  13 14 15 16 17 18 19  10 11 12 13 14 15 16
22 23 24 25 26 27 28  20 21 22 23 24 25 26  17 18 19 20 21 22 23
29 30                 27 28 29 30 31        24 25 26 27 28 29 30

        July                 August              September
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7            1  2  3  4                     1
 8  9 10 11 12 13 14   5  6  7  8  9 10 11   2  3  4  5  6  7  8
15 16 17 18 19 20 21  12 13 14 15 16 17 18   9 10 11 12 13 14 15
22 23 24 25 26 27 28  19 20 21 22 23 24 25  16 17 18 19 20 21 22
29 30 31              26 27 28 29 30 31     23 24 25 26 27 28 29
                                            30
      October               November              December
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6               1  2  3                     1
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   2  3  4  5  6  7  8
14 15 16 17 18 19 20  11 12 13 14 15 16 17   9 10 11 12 13 14 15
21 22 23 24 25 26 27  18 19 20 21 22 23 24  16 17 18 19 20 21 22
28 29 30 31           25 26 27 28 29 30     23 24 25 26 27 28 29
                                            30 31

I booked my flight for Tuesday, November 20th, and forgot about it.

The day approached. I called home just to make sure someone could pick me up from the airport. That's when I discovered that Thanksgiving was actually the following week. I had booked my flight based on the calendar for the year 8 A.D.

What I should have done was this:

$ cal 2008
                             2008

      January               February               March
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
       1  2  3  4  5                  1  2                     1
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   2  3  4  5  6  7  8
13 14 15 16 17 18 19  10 11 12 13 14 15 16   9 10 11 12 13 14 15
20 21 22 23 24 25 26  17 18 19 20 21 22 23  16 17 18 19 20 21 22
27 28 29 30 31        24 25 26 27 28 29     23 24 25 26 27 28 29
                                            30 31
       April                  May                   June
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
       1  2  3  4  5               1  2  3   1  2  3  4  5  6  7
 6  7  8  9 10 11 12   4  5  6  7  8  9 10   8  9 10 11 12 13 14
13 14 15 16 17 18 19  11 12 13 14 15 16 17  15 16 17 18 19 20 21
20 21 22 23 24 25 26  18 19 20 21 22 23 24  22 23 24 25 26 27 28
27 28 29 30           25 26 27 28 29 30 31  29 30

        July                 August              September
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
       1  2  3  4  5                  1  2      1  2  3  4  5  6
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   7  8  9 10 11 12 13
13 14 15 16 17 18 19  10 11 12 13 14 15 16  14 15 16 17 18 19 20
20 21 22 23 24 25 26  17 18 19 20 21 22 23  21 22 23 24 25 26 27
27 28 29 30 31        24 25 26 27 28 29 30  28 29 30
                      31
      October               November              December
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
          1  2  3  4                     1      1  2  3  4  5  6
 5  6  7  8  9 10 11   2  3  4  5  6  7  8   7  8  9 10 11 12 13
12 13 14 15 16 17 18   9 10 11 12 13 14 15  14 15 16 17 18 19 20
19 20 21 22 23 24 25  16 17 18 19 20 21 22  21 22 23 24 25 26 27
26 27 28 29 30 31     23 24 25 26 27 28 29  28 29 30 31
                      30

When all was said and done -- with the change fee and the fare difference -- the mistake cost me $100. But it "inspired" me to actually learn a thing or two about cal(1).

TL;DR: RTFM, or you will pay.

CAL(1)
...
A single parameter specifies the year (1 - 5875706) to be displayed; note the year must be fully specified: “cal 89” will not display a calendar for 1989.
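For what it's worth, the fourth-Thursday rule is also easy to check in code, which takes eyeballing a calendar out of the loop entirely. A small Python sketch (the thanksgiving function name is my own):

```python
import calendar
import datetime


def thanksgiving(year):
    """Fourth Thursday of November in the given (fully specified!) year."""
    weeks = calendar.monthcalendar(year, 11)
    # days outside the month show up as 0, so filter those out
    thursdays = [week[calendar.THURSDAY] for week in weeks
                 if week[calendar.THURSDAY]]
    return datetime.date(year, 11, thursdays[3])


print(thanksgiving(2008))  # prints 2008-11-27
```

Of course, this has the same failure mode as cal(1) if you pass it 8 instead of 2008.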

Teasing Out a New Git Repository

2011-03-02T00:00:00

The Ideal Git Law states that the documentation surrounding git(1) will expand to fill all available volume.

I'm building a suite of record processing tools. Up to now, the development has taken place inside the lwpb git repository. But it doesn't really belong there, since other record formats besides protobuf are supported: the classic unix tab-separated text format, and soon json.

So how does one extract part of a git repository into a new repository, preserving history where possible?

All of the files I want to extract from the main repository live under the same subdirectory, which should become the top-level directory of the new repository. So a good place to start is this stack overflow thread which explains git filter-branch --subdirectory-filter subdir. It goes something like this:

mkdir newrepo
cd newrepo
git clone --no-hardlinks /oldrepo ./
git filter-branch --subdirectory-filter subdir HEAD
git reset --hard
git gc --aggressive
git prune

As a comment on the stackoverflow thread mentions, it's also a good idea to remove the old repo as a remote of the new repo, so you don't accidentally push changes back to it:

git remote rm origin

So far so good. But I only want some of the files under this subdirectory in the new repo. The rest shouldn't be there. Can I rewrite the commit history again, this time file-wise?

Yes. For this I used the git filter-branch --tree-filter command. This works by checking out each commit, running $SHELL -c "$command", looking at what changes were made to the checkout, and then formulating a new commit. If the command removes a file in the checkout, it will be removed from the commit. If the command creates a file, it will be added to the commit.

In my case, the filter only needs to remove files: everything except the handful I want to keep. So the filter command is a shell script that looks like this:

#!/bin/sh
find . -type f -not -path "*/.git/*" |
sed -e 's#^\./##' |
grep -v -E '^(pb.*\.py|flat\.py|percent.*)$' |
tr '\n' '\0' |
xargs -0 rm -v

The rm -v lets me see all the deletions this script makes for each commit. I saved this as my-git-filter and ran

git filter-branch -f --prune-empty --tree-filter my-git-filter HEAD

The -f option forces the operation even if there's already a backup of the original repo from a previous git filter-branch run.
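Since rewriting history with a wrong pattern is no fun, it can help to dry-run the keep-list first. Here's a small Python sketch that applies the same regular expression as the filter script to a list of paths and reports what would be deleted. The function name and the sample paths are made up for illustration.

```python
import re

# Same pattern as the grep -v -E in the filter script; fullmatch stands in
# for the ^...$ anchors.
KEEP = re.compile(r"(pb.*\.py|flat\.py|percent.*)")


def files_to_remove(paths):
    """Return the paths that the tree filter would delete."""
    doomed = []
    for path in paths:
        relative = path[2:] if path.startswith("./") else path
        if not KEEP.fullmatch(relative):
            doomed.append(relative)
    return doomed


sample = ["./pbio.py", "./flat.py", "./percentile", "./README",
          "./src/lwpb.c", "./pb/util.py"]
print(files_to_remove(sample))  # prints ['README', 'src/lwpb.c']
```

Note that pb/util.py survives: the .* in the pattern happily matches across slashes.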

Follow this up with the same cleanup procedure from the --subdirectory-filter example:

git reset --hard
git gc --aggressive
git prune

Profiling every command in a Makefile

2011-02-25T00:00:00

Here's the scenario. I've got a batch data processing pipeline implemented as a Makefile. (Hey! It's only a prototype! Trust me, I'm a make hater just like you!) There's already a lot of data, so an end-to-end full run can take about a day, with some of the individual stages taking hours.

Now I'm thinking, wouldn't it be nice to know how long each rule took? Even better, wouldn't it be nice to get a report of how much cpu it consumed, how much memory it used, how much I/O it performed, etc.? Armed with this information, I could start optimizing poorly performing stages.

So, let's suppose we cook up some wrapper program that runs a subordinate program, collects rusage when it exits, and prints out the interesting info. Fortunately, such a wrapper program basically already exists.

I'd rather not go rewrite every rule in the Makefile, prefixing it with this wrapper program. That wouldn't even work if the rule was a pipeline: since make(1) executes rules by wrapping them with $(SHELL) -c, only the first command in the pipeline would actually run under the wrapper.

The solution is to set the shell in your Makefile to:

SHELL = rusage sh

Where rusage is a wrapper shell script that looks like this:

#!/bin/sh
exec time -f 'rc=%x elapsed=%e user=%U system=%S maxrss=%M avgrss=%t ins=%I outs=%O minflt=%R majflt=%F swaps=%W avgmem=%K avgdata=%D argv="%C"' "$@"

Note that this uses /usr/bin/time, not to be confused with the bash builtin time, which is what you're using probably 90% of the time at the command line.

Note also, this unfortunately only works with GNU time(1). The BSD (and probably Darwin, haven't actually checked) versions of time(1) don't support the -f argument to specify a format string. But on BSD derivatives, you should be able to at least get a human readable dump of the rusage structure by using /usr/bin/time -l. Which looks equivalent to the /usr/bin/time -v output from GNU time. (It's just not as convenient if you plan to analyze the logs later.)
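If GNU time isn't available, a wrapper along these lines isn't hard to cook up yourself. Here's a rough Python sketch, Unix-only since it relies on the resource module. It is not a drop-in replacement for the rusage script above: run_with_rusage is a made-up name and it reports only a subset of the fields.

```python
import resource
import subprocess
import sys
import time


def run_with_rusage(argv):
    """Run argv, then report exit status, wall time and some rusage fields."""
    start = time.monotonic()
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    rc = subprocess.call(argv)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "rc": rc,
        "elapsed": time.monotonic() - start,
        "user": after.ru_utime - before.ru_utime,
        "system": after.ru_stime - before.ru_stime,
        # high-water mark across all waited-for children, not just this one
        "maxrss": after.ru_maxrss,
        "majflt": after.ru_majflt - before.ru_majflt,
    }


report = run_with_rusage([sys.executable, "-c", "sum(range(10**6))"])
print(" ".join("%s=%s" % kv for kv in sorted(report.items())))
```

The key=value output is the same shape as the time -f format string, which keeps the logs easy to grep later.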

Bouncing, Hopping and Tunneling with tcpforward

2011-02-07T00:00:00

This weekend I dusted off a little network utility of mine called tcpforward. It proved its worth once again, so instead of throwing it back into the rusty toolbox like I always do, here's why you might want to throw it into your very own rusty toolbox. ;)

Scenario: Remote Assistance, AKA "Bouncing Your Signal Off The Moon"

Suppose you need to SSH to a friend's machine, but you're both behind NATs.

If your friend is savvy enough to compile it, and you've got time for that, you could use pwnat. You could also have your friend configure port forwarding on his router -- again, only if your friend is savvy enough, and doesn't mind punching a hole in his firewall. Yet another option: give your friend an SSH account on a public machine, and go look up the SSH arguments for reverse port forwarding for the bazillionth time.

The lowest-hassle option I can think of is to use tcpforward. Suppose you and your friend can both reach a 3rd machine, a public server you own called moon.

Run the following on moon:

tcpforward -v -N 1 -l moon:9922 -l moon:9921

Arrange for your friend to run the following on his local machine:

./tcpforward -v -N 1 -c moon:9922 -c localhost:22

Now, on your machine, run:

ssh -p 9921 moon

And voila, your SSH connection is forwarded past your friend's NAT, to his machine. The -N 1 option makes this a one-shot connection. The -v option gives him something to watch while you go to work -- some realtime transfer statistics.

(This example assumes ports 9921 and 9922 are open on moon, and that your friend is running sshd.)

Scenario: Hopping Over the Middleman

Ever wanted to copy files to a machine you could only reach from an intermediate machine? For no particular reason, let's call these machines production and gateway. I bet you usually end up scp'ing or rsync'ing files to gateway, ssh'ing to gateway, then running scp or rsync again, then cleaning up the files, etc.

"There must be a better way!" I hear you scream.

Yes. First, ssh to gateway and run:

tcpforward -v -k -l 0.0.0.0:9922 -c production:22

In another tty on your local machine, you can now run:

scp -o Port=9922 somefile gateway:somefile

Or, rsync:

rsync -e "ssh -p 9922" -avzp somedir/ gateway:somedir/

Remember to kill the tcpforward session on gateway, or your sysadmin may get angry, annoyed, frightened, or all of the above.

(Once again, assumes port 9922 is open on gateway.)

Scenario: Tunneling Through Corporate Firewalls

Let's continue with the slightly subversive examples. Suppose you're behind a corporate firewall that doesn't allow SSH connections out, only web traffic. You've got a public server out there called freedom, and you want to log in once in a while.

You could run hts from httptunnel on freedom. That's a fair bit of C code to expose to the world though. ;)

Alternately, let's say you're not running anything on freedom:443. Most corporate firewalls will allow https out, and most of them don't do deep packet inspection to verify that the initial handshake actually conforms to the TLS protocol.

Before going off to work, run the following on freedom:

tcpforward -v -k -l 0.0.0.0:443 -c localhost:22

From work:

ssh -p 443 freedom  # scream FREEEEEEDOOOOMMM!!! as you're doing this

How it Works

The time has come to pull back the curtain, revealing the wizened figure of a 160 line Perl script.

How does it work?

Well, you always run tcpforward with two arguments that specify a pair of TCP sockets to set up, then copy bytes between. Each socket argument is either a listen / accept socket -- if you specify the -l flag -- or a connect socket, if you specify the -c flag. Once both sockets of a pair are accepted or connected, a little async I/O copy loop runs until both sockets close for reading. If you pass the -k flag, the I/O copy loop runs in a forked process and another socket pair is immediately ready for setup.
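To give a flavor of that copy loop, here's a minimal sketch of the idea in Python. To be clear, this is not the tcpforward source (which is Perl): relay is a made-up name, and a production version would want non-blocking writes and error handling.

```python
import selectors
import socket


def relay(a, b, bufsize=65536):
    """Copy bytes both ways between two connected sockets until both
    sides have closed for reading."""
    sel = selectors.DefaultSelector()
    sel.register(a, selectors.EVENT_READ, b)  # data attribute = the peer
    sel.register(b, selectors.EVENT_READ, a)
    open_readers = 2
    while open_readers:
        for key, _ in sel.select():
            src, dst = key.fileobj, key.data
            data = src.recv(bufsize)
            if data:
                dst.sendall(data)
            else:
                # EOF from src: stop watching it, pass the EOF along
                sel.unregister(src)
                open_readers -= 1
                try:
                    dst.shutdown(socket.SHUT_WR)
                except OSError:
                    pass  # the other side is already gone
    sel.close()
```

Setting up a and b is the part the -l and -c flags control: each is either the result of an accept() on a listening socket or of a connect().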

There's more documentation in the POD.

Happy connection hacking!

A Python Gotcha: References as Default Parameters

2011-02-05T00:00:00

Suppose you're writing a Python function like this one that unpacks data into a dictionary; optionally, an existing dictionary instead of an empty one.

Surprise!

$ python
Python 2.6.4
[GCC 4.4.1] on linux2
>>> def hashcopy(src, dst={}):
...   for k, v in src.items():
...     dst[k] = v
...   return dst
...
>>> hashcopy({1:2,3:4})
{1: 2, 3: 4}
>>> hashcopy({5:6,7:8})
{1: 2, 3: 4, 5: 6, 7: 8}

I haven't looked deeply into this, but it seems default parameter values are evaluated just once, when the def statement runs, rather than on every call. So every call that omits dst shares the same dictionary.

In Perl 5 you typically only set default parameters at runtime, so the empty hashref you get is always the freshest in the land:

sub hashcopy
{
  my $src = shift;
  my $dst = shift || {};
  %$dst = (%$dst, %$src);
  return $dst;
}

All other things equal, this is undoubtedly slower, but considerably less wtf-subtle.
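For completeness, the usual Python workaround mirrors the Perl version: default to None and build the fresh dictionary at call time. A minimal sketch:

```python
def hashcopy(src, dst=None):
    # None is a safe, immutable default; the dict is created per call
    if dst is None:
        dst = {}
    for k, v in src.items():
        dst[k] = v
    return dst


print(hashcopy({1: 2, 3: 4}))  # prints {1: 2, 3: 4}
print(hashcopy({5: 6, 7: 8}))  # prints {5: 6, 7: 8}
```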

Thinkpad T43 Key Removal, Assembly

2007-02-18T00:00:00

Within a few days of the destruction of my T40, I got a T43 from a guy on craigslist. The left control key promptly broke so I swapped it for the right one. There's relatively little info out there about how to assemble and disassemble keys, so here's some info on the process. Before we begin, get out your jeweler's eyepiece...

You can pry off the key face gently as described here, just push away from you and up with a flat object. The face snaps into a cage mechanism consisting of three parts: a top plate and two wickets which anchor it from the north and south respectively. Each wicket has a bar that wraps over the top plate, and two legs with pegs that secure it to the keyboard bevel. Viewed from the east or west sides, the wickets cross over each other, making an X. There is enough play in the cage's anchoring that you can squish the whole thing down flat. The only thing that impedes you is a little rubber spring glued to the keyboard bevel. This spring is primarily responsible for that distinctive Thinkpad key feel.

By squishing the cage flat, you can hook or unhook the wickets. To reassemble and replace a key, I found it easiest to build the cage first. Start by crossing the wickets--they are fitted to each other. While pressing the X sides of the cage in, you can slip in the top plate. Don't put on the key face yet. Attach the cage to the keyboard bevel by putting it in place and hooking in the south wicket's legs first. Getting the north wicket in is a bit of a stretch. Flatten the cage by pressing down on it until the north legs slip in. Now you can attach the key face by setting it on top of the cage and applying gentle downward force. You should hear it snap.

TAI64 For All Time

2006-09-14T00:00:00

From Bernstein's tai64 page:

"Integers 2^63 and larger are reserved for future extensions. Under many cosmological theories, the integers under 2^63 are adequate to cover the entire expected lifetime of the universe; in this case no extensions will be necessary."

Phew!

Dealing with multilog's TAI64 timestamps is always a bit annoying, but I suppose old djb may very well be laughing his head off in 2038. Still, the idea of writing software "for all time" has enough allure to the developer mind that it feels like a trap.
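To put rough numbers on the 2038 joke, here's a quick back-of-the-envelope in Python. One assumption to flag: as I recall from the TAI64 page, labels place the start of 1970 at 2^62, which leaves 2^62 seconds of headroom before the reserved range starts at 2^63.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# A signed 32-bit seconds counter runs out 2**31 seconds after the epoch.
rollover_32 = EPOCH + timedelta(seconds=2**31)
print(rollover_32.isoformat())  # prints 2038-01-19T03:14:08+00:00

# Headroom left in TAI64 after 1970, in (roughly Julian) years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
tai64_years = 2**62 / SECONDS_PER_YEAR
print("%.3g years" % tai64_years)
```

The second figure comes out on the order of 10^11 years, so "for all time" isn't much of an exaggeration.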

Colorful Bash Prompt Generator

2004-12-30T00:00:00

(A very old post, but I've used this prompt ever since.)

Customizing a shell prompt often culminates in an impressive plumage display like

export PS1='\[\e]0;\w\a\]\n\[\e[32m\]\u@\h \[\e[33m\]\w\n\[\e[0m\]$ '

the idea being that lots of escape sequences = eliteness. Though, I'd guess most people just copy someone else's bash prompt and foist it off as their own, rather than learn ansi / xterm / bash escape sequences. Like me initially. :)

However, you can easily make your prompt setup readable by breaking it down.

# ansi color escape sequences
prompt_black='\[\e[30m\]'
prompt_red='\[\e[31m\]'
prompt_green='\[\e[32m\]'
prompt_yellow='\[\e[33m\]'
prompt_blue='\[\e[34m\]'
prompt_magenta='\[\e[35m\]'
prompt_cyan='\[\e[36m\]'
prompt_white='\[\e[37m\]'
prompt_default_color='\[\e[0m\]'

My motivation initially was to avoid beeping console prompts. The xterm escape sequence to set the window title contains a bell character, which was of course interpreted by xterm and friends, but not when I'd sit down at system consoles (where usually TERM=cons25). I needed to set $PS1 according to $TERM.

In the course of things, I discovered the \t bash escape sequence, which gives you the current time in hh:mm:ss form. Nice. By incorporating this into the prompt you can now tell by inspection how long you've been sitting with your jaw open trying to remember what you were about to do. Or, how severe one's random spastic ls-ing has gotten.

For emergencies, there's also the no-color prompt.

prompt_nocolor='\n\u@\h \w\n$ '

For nostalgia (or out of masochism) there's the old dos prompt.

prompt_dos='\n\w>'