All right, enough stalling; here's what I've come up with so far
(sorry, long post ahead. Be brave, friend, the journey will be worth it)
The idea is to combine methods 3 and 4 from the original post into a kind of 'fuzzy' or dynamic whitelist, and then - and here's the trick - not block non-whitelisted IPs, but throttle them to hell and back.
Note that this measure is only meant to thwart this very specific type of attack. In practice, of course, it would work in combination with other best-practice approaches to auth: fixed-username throttling, per-IP throttling, code-enforced strong password policy, unthrottled cookie login, hashing all password equivalents before saving them, never using security questions, etc.
Assumptions about the attack scenario
If an attacker is targeting variable usernames, our username throttling doesn't fire. If the attacker is using a botnet or has access to a large IP range, our IP throttling is powerless. If the attacker has pre-scraped our userlist (usually possible on open-registration web services), we can't detect an ongoing attack based on number of 'user not found' errors. And if we enforce a restrictive system-wide (all usernames, all IPs) throttling, any such attack will DoS our entire site for the duration of the attack plus the throttling period.
So we need to do something else.
The first part of the countermeasure: Whitelisting
What we can be fairly sure of is that the attacker cannot detect and dynamically spoof the IP addresses of several thousand of our users(+). That makes whitelisting feasible. In other words: for each user, we store a list of the (hashed) IPs from which the user has previously (recently) logged in.
Thus, our whitelisting scheme will function as a locked 'front door', where a user must be connected from one of his recognized 'good' IPs in order to log in at all. A brute-force attack on this 'front door' would be practically impossible(+).
(+) unless the attacker 'owns' either the server, all our users' boxes, or the connection itself -- and in those cases, we no longer have an 'authentication' issue, we have a genuine franchise-sized pull-the-plug FUBAR situation
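To make the per-user whitelist concrete, here is a minimal in-memory sketch. Everything in it is my own assumption rather than a finished design: the names (IpWhitelist, hash_ip, SECRET_KEY), the use of a keyed HMAC-SHA256 as the "hash", and the 30-day cutoff standing in for "recently". A real version would keep this in the database next to the user record.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret; keying the hash makes stored values
# harder to reverse by simply enumerating the IPv4 address space.
SECRET_KEY = b"replace-with-a-server-side-secret"
# Assumption: "recently" means within the last 30 days.
WHITELIST_TTL = 30 * 24 * 3600


def hash_ip(user_id: str, ip: str) -> str:
    """Return a keyed hash of (user, IP) so no plain IP is stored."""
    msg = f"{user_id}|{ip}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()


class IpWhitelist:
    """Per-user record of hashed IPs that have logged in successfully."""

    def __init__(self):
        # user_id -> {ip_hash: timestamp of last successful login}
        self._seen = {}

    def remember(self, user_id, ip, now=None):
        """Call after a successful login from this IP."""
        now = time.time() if now is None else now
        self._seen.setdefault(user_id, {})[hash_ip(user_id, ip)] = now

    def is_recognized(self, user_id, ip, now=None):
        """True if this user logged in from this IP within the TTL."""
        now = time.time() if now is None else now
        last = self._seen.get(user_id, {}).get(hash_ip(user_id, ip))
        return last is not None and now - last <= WHITELIST_TTL
```

Note that the whitelist is per user: an IP that is 'good' for one account says nothing about any other account.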
The second part of the countermeasure: System-wide throttling of unrecognized IPs
In order to make a whitelist work for an open-registration web service, where users switch computers frequently and/or connect from dynamic IP addresses, we need to keep a 'cat door' open for users connecting from unrecognized IPs. The trick is to design that door so botnets get stuck, and so legitimate users get bothered as little as possible.
In my scheme, this is achieved by setting a very restrictive maximum number of failed login attempts by unapproved IPs over, say, a 3-hour period (it may be wiser to use a shorter or longer period depending on the type of service), and making that restriction global, i.e. for all user accounts.
Even a slow (1-2 minutes between attempts) brute force would be detected and thwarted quickly and effectively using this method. Of course, a really slow brute force could still go unnoticed, but an attack that slow defeats the very purpose of brute forcing.
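A rough sketch of that site-wide counter, as a sliding window over failed attempts from unrecognized IPs. The class name GlobalThrottle and the default limit of 50 are placeholders I made up; only the 3-hour window comes from the description above, and the right limit depends entirely on the site's normal traffic.

```python
import time
from collections import deque


class GlobalThrottle:
    """Counts failed logins from unrecognized IPs across ALL accounts."""

    def __init__(self, max_failures=50, window_seconds=3 * 3600):
        self.max_failures = max_failures      # placeholder value
        self.window_seconds = window_seconds  # the "say, 3-hour" period
        self._failures = deque()              # timestamps, oldest first

    def _expire(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.window_seconds:
            self._failures.popleft()

    def record_failure(self, now=None):
        """Call for every failed login from an IP not on the user's whitelist."""
        now = time.time() if now is None else now
        self._failures.append(now)

    def is_active(self, now=None):
        """True while the 'cat door' should stay shut."""
        now = time.time() if now is None else now
        self._expire(now)
        return len(self._failures) >= self.max_failures
```

With the placeholder limit of 50, an attacker making one attempt every 2 minutes would trip it after about 100 minutes, and the throttle only lifts once enough old failures have aged out of the window.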
What I am hoping to accomplish with this throttling mechanism is that if the maximum limit is reached, our 'cat door' slams closed for a while, but our front door remains open to legitimate users connecting by usual means:
- Either by connecting from one of their recognized IPs
- Or by using a persistent login cookie (from anywhere)
The only legitimate users who would be affected during an attack - i.e. while the throttling was activated - would be users without persistent login cookies who were logging in from an unknown location or with a dynamic IP. Those users would be unable to log in until the throttling wore off (which could potentially take a while, if the attacker kept his botnet running despite the throttling).
To allow this small subset of users to squeeze through the otherwise-sealed cat door, even while bots were still hammering away at it, I would employ a 'backup' login form with a CAPTCHA. When you display the "Sorry, but you can't log in from this IP address at the moment" message, include a link that says "secure backup login - HUMANS ONLY (bots: no lying)". Joke aside, when they click that link, give them a reCAPTCHA-authenticated login form that bypasses the site-wide throttling. That way, IF they are human AND know the correct login+password (and are able to read CAPTCHAs), they will never be denied service, even if they are connecting from an unknown host and not using the autologin cookie.
Oh, and just to clarify: Since I do consider CAPTCHAs to be generally evil, the 'backup' login option would only appear while throttling was active.
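Putting the front door, the cat door and the backup form together, the gatekeeping decision might look something like the sketch below. The function and enum names are invented for illustration, and ALLOW only means "go ahead and check the credentials as usual", not "logged in".

```python
from enum import Enum


class Gate(Enum):
    ALLOW = "allow"                # proceed to the normal credential check
    OFFER_BACKUP = "offer_backup"  # show the message plus the CAPTCHA login link


def login_gate(recognized_ip, has_login_cookie, throttle_active,
               captcha_passed=False):
    """Decide whether a login attempt may proceed.

    All four inputs are assumed to be computed elsewhere (whitelist lookup,
    cookie validation, site-wide throttle state, CAPTCHA verification).
    """
    # Front door: recognized IPs and persistent cookies are never throttled.
    if recognized_ip or has_login_cookie:
        return Gate.ALLOW
    # Cat door: unrecognized IPs get through while no attack is detected.
    if not throttle_active:
        return Gate.ALLOW
    # Lockdown: only the CAPTCHA-protected backup form bypasses the throttle.
    if captcha_passed:
        return Gate.ALLOW
    return Gate.OFFER_BACKUP
```

The backup option only ever shows up in the last branch, so nobody sees a CAPTCHA unless throttling is active.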
There is no denying that a sustained attack like that would still constitute a form of DoS attack, but with the described system in place, it would only affect what I suspect to be a tiny subset of users, namely people who don't use the "remember me" cookie AND happen to be logging in while an attack is happening AND aren't logging in from any of their usual IPs AND who can't read CAPTCHAs. Only those who meet ALL of those criteria - specifically bots and really unlucky disabled people - will be turned away during a bot attack.
EDIT: Actually, I thought of a way to let even CAPTCHA-challenged users pass through during a 'lockdown': instead of, or as a supplement to, the backup CAPTCHA login, provide the user with an option to have a single-use, user-specific lockdown code sent to his email, which he can then use to bypass the throttling. This definitely crosses over my 'annoyance' threshold, but since it's only used as a last resort for a tiny subset of users, and since it still beats being locked out of your account, it would be acceptable.
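For what it's worth, here is how the single-use lockdown code could be sketched. The 15-minute lifetime, the LockdownCodes name, and storing only a SHA-256 digest of the code are my assumptions; actually sending the email is left out.

```python
import hashlib
import hmac
import secrets
import time

CODE_TTL = 15 * 60  # assumption: codes stay valid for 15 minutes


class LockdownCodes:
    """Issues and redeems single-use, user-specific bypass codes."""

    def __init__(self):
        # user_id -> (sha256 hex digest of the code, expiry timestamp)
        self._pending = {}

    def issue(self, user_id, now=None):
        """Create a fresh code for the user; the caller emails it to them."""
        now = time.time() if now is None else now
        code = secrets.token_urlsafe(16)  # 128 bits of randomness
        digest = hashlib.sha256(code.encode()).hexdigest()
        self._pending[user_id] = (digest, now + CODE_TTL)
        return code

    def redeem(self, user_id, code, now=None):
        """True exactly once for a valid, unexpired code for this user."""
        now = time.time() if now is None else now
        entry = self._pending.get(user_id)
        if entry is None:
            return False
        digest, expires_at = entry
        if now > expires_at:
            del self._pending[user_id]
            return False
        candidate = hashlib.sha256(code.encode()).hexdigest()
        if not hmac.compare_digest(digest, candidate):
            return False  # a wrong guess does not burn the real code
        del self._pending[user_id]  # single use
        return True
```

A redeemed code should only let the user past the throttle; the password still has to be correct.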
(Also, note that none of this happens if the attack is any less sophisticated than the nasty distributed version I've described here. If the attack is coming from just a few IPs or only hitting a few usernames, it will be thwarted much earlier, and with no site-wide consequences.)
So, that is the countermeasure I will be implementing in my auth library, once I'm convinced that it's sound and that there isn't a much simpler solution that I've missed. The fact is, there are so many subtle ways to do things wrong in security, and I'm not above making false assumptions or using hopelessly flawed logic. So please, any and all feedback, criticism and improvements, subtleties etc. are highly appreciated.