Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - How to verify form input using HTML5 input verification

I have tried finding a full list of patterns to use for verifying input via HTML5 form verification for various types, specifically url, email, tel and such, but I couldn't find any. Currently, the built-in versions of these input verifications are far from perfect (and tel doesn't even check if the thing you're entering is a phone number). So I was wondering, which patterns could I use for verifying the user is entering the right format in the inputs?

Here are a few examples of cases where the default verification allows input that is not supposed to be allowed:

type="email"

This field allows emails that have incorrect domains after the @, and it allows addresses to start or end with a dash or period, which isn't allowed either. So, .example-@x is allowed.

type="url"

This input basically allows any input that starts with http:// (Chrome) and is followed by anything other than a few special characters such as those that have a function in URLs (, @, #, ~, etc). In FF, all that's checked is if it starts with http:, followed by anything other than : (even just http: is allowed in FF). IE does the same as FF, except that it doesn't disallow http::.

For example: http://. is allowed in all three. And so is http://,.

type="tel"

There is currently no built-in verification for phone numbers in any of the major browsers (it functions 100% the same as type="text", other than telling mobile browsers which kind of keyboard to display).


So, since the browsers don't show a consistent behaviour in each of these cases, and since the behaviour they do show is very basic with many false positives, what can I do to verify my HTML forms (still using HTML5 input verification)?


PS: I'm posting this because I would find it useful to have a complete list of form verification patterns myself, so I figured it might be useful for others too (and of course others can post their solutions too).


Whoever fights dragons for too long becomes a dragon himself; gaze into the abyss for too long, and the abyss gazes back…

1 Answer


These patterns aren't necessarily simple, but here's what I think works best in every situation. Keep in mind that Internationalized Domain Names (IDNs) are now available too. With those, the set of characters allowed in URLs is too large to test exhaustively: lots of characters are still disallowed in domain names, but the list of allowed characters is so big, and changes so often between Top-Level Domains, that it's not practical to keep up with it. If you want to support internationalized domain names, use the second URL pattern; otherwise, use the first.

##TL;DR:

Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.

URLs

https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?
https?://(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,./:-@\[-`{-~]{1,63}\.)+([^ !-/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?

Emails

(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}

Phone numbers

((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}

Western-style names

([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}

##URLs, without IDN support

https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?


Explanation:

  • DNSes
    • URLs should always start with http:// or https://, since we don't want links to other protocols.
    • Domain names should not start or end with -.
    • Each label of a domain name (the part between two dots) can be at most 63 characters long, and the total length (including dots) cannot exceed 253 characters (some sources say 255; 253 is the safe choice) [1].
    • Non-IDNs can only contain the letters of the Latin alphabet, the digits 0 through 9, and a dash.
    • Top-level domains of non-IDNs contain only letters of the Latin alphabet, and at least 2 of them [2].
      • I've set an arbitrary limit of 15 letters, since at the time of writing no TLD exceeds 13 characters (".international"), which most likely won't change any time soon.
  • IPs
    • Special cases such as 0.0.0.0, 127.0.0.1, etc. are not checked for.
    • IPs with zero-padded numbers are not allowed (for example 01.1.1.1) [4].
    • Each number in an IP can only range from 0 through 255; 256 is not allowed.

Note that the default http:.* check built into modern browsers is always enforced for type="url", so even if you remove the https?:// at the start of this pattern, the scheme is still required. Use type="text" to avoid that.
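To make the rules above concrete, here is a minimal sketch that runs the non-IDN URL pattern against a few sample values. It assumes the dots that separate labels and IP numbers are escaped, and it anchors the pattern roughly the way the pattern attribute does (the whole value has to match). The helper name isValidUrl and the sample URLs are made up for illustration.

```javascript
// Non-IDN URL pattern from this section; the dots that separate labels and
// IP numbers are escaped so they only match a literal ".".
const urlPattern = String.raw`https?://(?![^/]{253}[^/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(/.*)?`;

// A pattern attribute must match the entire value, so anchor it explicitly.
const urlRegex = new RegExp("^(?:" + urlPattern + ")$");

// Hypothetical helper, not part of any browser API.
function isValidUrl(value) {
  return urlRegex.test(value);
}

const urlSamples = [
  "http://example.com",          // accepted
  "https://my-site.example.org/path?q=1", // accepted
  "http://192.168.0.1",          // accepted (IP address)
  "http://.",                    // rejected (empty label)
  "http://-bad.example.com",     // rejected (leading dash)
  "http://256.1.1.1",            // rejected (number above 255)
  "http://01.1.1.1",             // rejected (zero-padded number)
];

for (const sample of urlSamples) {
  console.log(sample, isValidUrl(sample));
}
```

In a form, the same pattern string would go into the pattern attribute of the input; the script version is only meant for trying values quickly.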

##URLs, with IDN support

https?://(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,./:-@\[-`{-~]{1,63}\.)+([^ !-/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(/.*)?


Explanation:

Since a huge number of characters are allowed in IDNs, it's not practical to list every allowed character in an HTML attribute: you'd get a huge pattern, and in that case it's much better to test the input by some method other than regex [5]. This pattern therefore excludes the disallowed characters instead.

  • Disallowed characters in domain names are: !"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~ with the exception of a period as domain separator.
  • These are matched in the ranges [!-,] [./] [:-@] [[-`] [{-~].
  • All other characters are allowed in this input field
  • TLDs are allowed to have the same letters in them, up to an arbitrary limit of 15 characters (like with the non-IDN URLs).
  • Alternatively, TLDs can be of the format xn--* with * being an encoded version of the actual TLD. This encoding uses 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
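As a rough check of the character ranges described above, the sketch below runs the IDN-capable pattern against a few made-up hostnames. It is written as a regex literal, so the slashes are escaped, and the dots and the opening bracket inside the character classes are escaped as well (an assumption about how the pattern was originally written). The name isValidIdnUrl is hypothetical.

```javascript
// IDN-capable URL pattern from this section, anchored so that the whole
// value has to match, like a pattern attribute would require.
const idnUrlRegex = /^(?:https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?)$/;

// Hypothetical helper for trying values out.
function isValidIdnUrl(value) {
  return idnUrlRegex.test(value);
}

const idnSamples = [
  "http://münchen.de",        // accepted: non-ASCII letters in a label
  "http://example.xn--p1ai",  // accepted: encoded TLD in xn-- form
  "http://exa_mple.com",      // rejected: underscore is in the [-` range
  "http://exa mple.com",      // rejected: space
  "http://,",                 // rejected: comma is in the !-, range
];

for (const sample of idnSamples) {
  console.log(sample, isValidIdnUrl(sample));
}
```

Because the pattern works by exclusion, anything outside the listed ASCII ranges is accepted in a label, so it is deliberately permissive rather than a strict IDN validator.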

##Email addresses

(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}


Explanation:

A pattern that is 100% foolproof for email addresses requires a whole lot more than this one, but this pattern covers nearly all of them. A fully complete pattern does exist, but it contains PCRE (PHP)-only regex lookaheads, so it won't work in HTML forms.

  • Email addresses can only contain letters of the Latin alphabet, the numbers 0-9, and the characters in !#$%&'*+/=?^_`{|}~.- [6].
  • Accents are not universally supported [7], but if needed, post a comment, and I could perhaps write a version that meets the RFC 6530 standard.
  • The local part (before the @) can be at most 64 characters long, and the total address at most 254 characters long [8].
  • Addresses may not start or end with a - or ., and no two dots may appear consecutively [8].
  • The domain may not be an IP address [9].
  • Other than that, I only included the non-IDN part of the pattern. IDNs are allowed too though, so those will result in false negatives.
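The rules in this list can be tried out with a small sketch like the one below. It uses the email pattern from this section as a regex literal, with the slash and the literal dots escaped (an assumption about the original escaping), anchored so the whole value must match. The helper name isValidEmail and the sample addresses are invented for illustration.

```javascript
// Email pattern from this section, anchored like a pattern attribute.
const emailRegex = /^(?:(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15})$/;

// Hypothetical helper for trying values out.
function isValidEmail(value) {
  return emailRegex.test(value);
}

const emailSamples = [
  "john.doe@example.com",   // accepted
  "user+tag@sub.example.co", // accepted
  ".example-@x",            // rejected: starts with a dot, no TLD
  "john..doe@example.com",  // rejected: two consecutive dots
  "john-@example.com",      // rejected: local part ends with a dash
  "john@192.168.0.1",       // rejected: IP address as domain
];

for (const sample of emailSamples) {
  console.log(sample, isValidEmail(sample));
}
```

Note that this sketch only checks the total length through the 254-character lookahead; the pattern itself places no separate limit on the local part.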

##Phone numbers

((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}


Explanation:

  • Phone numbers must start with one of the following, where [CTRY] stands for the country code and X stands for the first non-zero digit (such as the 6 in mobile numbers):
    • 00[CTRY]X
    • +[CTRY]X
    • 0X
    • [CTRY]X (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
  • Spaces are allowed between the digits (see the second pattern for the space-less version), except before the non-zero X as defined above.
  • Phone numbers must be exactly 9 digits long, not counting the part before the first non-zero X as defined above.

This regex is just for 10-digit phone numbers. Since phone number lengths may vary between countries, it's best to use a less strict version of this pattern, or modify it to work for the desired countries. So, this pattern should generally be used as a kind of template pattern.
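A quick sketch of how the two phone patterns behave, assuming the plus sign is escaped as \+ and both patterns are anchored to the whole value. The sample numbers and the names phoneWithSpaces and phoneNoSpaces are made up for illustration; the patterns assume a two-digit country code, as described above.

```javascript
// Both phone patterns from this section, anchored like a pattern attribute.
const phoneWithSpaces = /^(?:((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8})$/;
const phoneNoSpaces = /^(?:((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8})$/;

const phoneSamples = [
  "0612345678",      // 0X form
  "+31612345678",    // +[CTRY]X form
  "0031612345678",   // 00[CTRY]X form
  "31612345678",     // [CTRY]X form
  "06 12 34 56 78",  // spaces between digits: only the first pattern accepts it
  "0 612345678",     // space before X: rejected by both
  "061234567",       // one digit short: rejected by both
];

for (const sample of phoneSamples) {
  console.log(sample, phoneWithSpaces.test(sample), phoneNoSpaces.test(sample));
}
```

To adapt the template to another country, the digit count in {8} and the country-code part [1-9]{2} are the two places that would most likely need changing.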

##Extra: Western-style names

([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}


Yes, I know, this is very western-centric, but it may be useful as well, since a pattern like this can be difficult to write. If you're making a site for a western audience, this will always work (Asian names have a representation in exactly this format too).

  • All names must start with an uppercase letter
  • Uppercase letters may occur in the middle of names (such as John McDoe)
  • Names must be at least 2 letters long
  • I've set an arbitrary maximum of 10 names (these people probably won't mind), each of which can be at most 20 letters long (the length of "Werbenjagermanjensen", who happens to be #1).
  • Latin and Greek letters are allowed, including all accented Latin and Greek letters (list) and Icelandic letters (ÐÞ ðþ):
    • A-Z matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
    • Ά-Ϋ matches all uppercase Greek letters, including the accented ones: ΆΈΉΊΌΎΏ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ ΪΫ.
    • À-ÖØ-Þ matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
    • a-z matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
    • ά-ώ matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
    • ß-öø-ÿ matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
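The ranges in this list can be spelled out with explicit code points, which avoids any encoding trouble when copying the pattern around. The sketch below is an assumption-based reconstruction: it takes the uppercase ranges to be U+0386 to U+03AB and U+00C0 to U+00DE without × (U+00D7), and the lowercase ranges to be U+03AC to U+03CE and U+00DF to U+00FF without ÷ (U+00F7). The names upperRanges, lowerRanges and isValidName are hypothetical.

```javascript
// Character ranges written as \u escapes inside strings that are later
// compiled into a regex.
const upperRanges = "A-Z\\u0386-\\u03AB\\u00C0-\\u00D6\\u00D8-\\u00DE";
const lowerRanges = "a-z\\u03AC-\\u03CE\\u00DF-\\u00F6\\u00F8-\\u00FF";

// Each name: one uppercase letter, then 1 to 19 more letters, then an
// optional space; 1 to 10 names in total. Anchored like a pattern attribute.
const nameRegex = new RegExp(
  "^(?:([" + upperRanges + "][" + upperRanges + lowerRanges + "]{1,19} ?){1,10})$"
);

// Hypothetical helper for trying values out.
function isValidName(value) {
  return nameRegex.test(value);
}

const nameSamples = [
  "John McDoe",            // accepted: uppercase letter inside a name
  "Élodie Ørsted",         // accepted: accented Latin letters
  "Werbenjagermanjensen",  // accepted: exactly 20 letters
  "john",                  // rejected: lowercase first letter
  "J",                     // rejected: fewer than 2 letters
  "Jean-Luc",              // rejected: hyphen is not in any range
];

for (const sample of nameSamples) {
  console.log(sample, isValidName(sample));
}
```

The rejection of hyphenated names such as "Jean-Luc" follows directly from the ranges listed above, so sites that expect such names would need to extend the pattern.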

##References

  1. https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax / https://www.rfc-editor.org/rfc/rfc1034#section-3.1
  2. https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains / https://www.icann.org/resources/pages/tlds-2012-02-25-en
  3. https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process / What are the allowed characters in a subdomain?
  4. Based on the fact neither browsers nor the Windows cmd line allow the padded format.
  5. https://stackoverflow.com/q/7111881/1256925#2231
