Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

iphone - NSRegularExpression to validate URL

I found this regular expression on a website. It is said to be the best URL validation expression out there and I agree. Diego Perini created it.

The problem I am facing is when trying to use it with Objective-C to detect URLs in strings. I have tried options such as NSRegularExpressionAnchorsMatchLines, NSRegularExpressionIgnoreMetacharacters and others, but still no luck.

Is the expression not well formatted for Objective-C? Am I missing something? Any ideas?

I have tried John Gruber's regex, also, but it fails with some invalid URLs.

        Regular Expression                                  Explanation of expression                      

 ^                                                  match at the beginning
//Protocol identifier
(?:
    (?:https?|ftp                                   http, https or ftp
    ):\/\/                                        ://
)?                                                  optional
// User:Pass authentication
(?:
    \S+                                            non-whitespace characters, 1 or more times
    (?:
        :\S*                                       : followed by non-whitespace characters, 0 or more times
    )?@                                             @
)?                                                  optional
//Private IP Addresses                              ?! Means DO NOT MATCH ahead. So do not match any of the following
(?:
    (?!10                                           10                                                          10.0.0.0 - 10.999.999.999
        (?:
            \.\d{1,3}                             . 1 to 3 digits, three times
        ){3}
    )
    (?!127                                          127                                                         127.0.0.0 - 127.999.999.999
        (?:
            \.\d{1,3}                             . 1 to 3 digits, three times
        ){3}
    )
    (?!169\.254                                    169.254                                                     169.254.0.0 - 169.254.999.999
        (?:
            \.\d{1,3}                             . 1 to 3 digits, two times
        ){2}
    )
    (?!192\.168                                    192.168                                                     192.168.0.0 - 192.168.999.999
        (?:
            \.\d{1,3}                             . 1 to 3 digits, two times
        ){2}
    )
    (?!172\.                                       172.                                                        172.16.0.0 - 172.31.999.999
        (?:                                                                                                             
            1[6-9]                                  1 followed by any number between 6 and 9
            |                                       or
            2\d                                    2 and any digit
            |                                       or
            3[0-1]                                  3 followed by a 0 or 1
        )
        (?:
            \.\d{1,3}                             . 1 to 3 digits, two times
        ){2}
    )
    //First Octet IPv4                              // match these. Any non network or broadcast IPv4 address
    (?:
        [1-9]\d?                                   any number from 1 to 9 followed by an optional digit        1 - 99
        |                                           or
        1\d\d                                     1 followed by any two digits                                100 - 199
        |                                           or
        2[01]\d                                    2 followed by any 0 or 1, followed by a digit               200 - 219
        |                                           or
        22[0-3]                                     22 followed by any number between 0 and 3                   220 - 223
    )
    //Second and Third Octet IPv4
    (?:
        \.                                         .
        (?:
            1?\d{1,2}                              optional 1 followed by any 1 or two digits                  0 - 199
            |                                       or
            2[0-4]\d                               2 followed by any number between 0 and 4, and any digit     200 - 249
            |                                       or
            25[0-5]                                 25 followed by any numbers between 0 and 5                  250 - 255
        )
    ){2}                                            two times
    //Fourth Octet IPv4
    (?:
        \.                                         .
        (?:
            [1-9]\d?                               any number between 1 and 9 followed by an optional digit    1 - 99
            |                                       or
            1\d\d                                 1 followed by any two digits                                100 - 199
            |                                       or
            2[0-4]\d                               2 followed by any number between 0 and 4, and any digit     200 - 249
            |                                       or
            25[0-4]                                 25 followed by any number between 0 and 4                   250 - 254
        )
    )
    //Host name
    |                                               or                  
    (?:
        (?:
            [a-z\u00a1-\uffff0-9]+-?              any letter, digit or character from U+00A1 to U+FFFF, one or more times, with optional -
        )*                                          zero or more times
        [a-z\u00a1-\uffff0-9]+                    any letter, digit or character from U+00A1 to U+FFFF, one or more times
    )
    //Domain name
    (?:
        \.                                         .
        (?:
            [a-z\u00a1-\uffff0-9]+-?              any letter, digit or character from U+00A1 to U+FFFF, one or more times, with optional -
        )*                                          zero or more times
        [a-z\u00a1-\uffff0-9]+                    any letter, digit or character from U+00A1 to U+FFFF, one or more times
    )*                                              zero or more times
    //TLD identifier
    (?:
        \.                                         .
        (?:
            [a-z\u00a1-\uffff]{2,}                any letter or character from U+00A1 to U+FFFF (no digits), two or more times
        )
    )
)
//Port number
(?:
    :\d{2,5}                                       : followed by any digit, two to five times, optionally
)?              
//Resource path
(?:
    \/[^\s]*                                      / followed by any non-whitespace characters, zero or more times
)?                                                  optional
$                                                  match at the end
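The octet ranges listed in the table can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper named MatchesWhole (not part of the original post) that anchors a sub-pattern and runs it through NSRegularExpression. The three sub-patterns are copied from the table, with each backslash doubled for the Objective-C string literals.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): returns YES only
// when `pattern` matches the whole of `candidate`.
static BOOL MatchesWhole(NSString *pattern, NSString *candidate) {
    // Wrap the sub-pattern so the alternation is anchored as a unit.
    NSString *anchored = [NSString stringWithFormat:@"^(?:%@)$", pattern];
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:anchored
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Pattern did not compile: %@", error);
        return NO;
    }
    NSRange whole = NSMakeRange(0, candidate.length);
    return [regex firstMatchInString:candidate options:0 range:whole] != nil;
}

// Sub-patterns copied from the table above.
static NSString *const kFirstOctet  = @"[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3]";
static NSString *const kMiddleOctet = @"1?\\d{1,2}|2[0-4]\\d|25[0-5]";
static NSString *const kFourthOctet = @"[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]";
```

With these definitions, MatchesWhole(kFirstOctet, @"223") should be YES and MatchesWhole(kFirstOctet, @"224") should be NO, which corresponds to the 1 - 223 range of the first octet. The fourth octet accepts 254 but rejects 0 and 255.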

EDIT: I forgot to mention that I am using the expression in the following (partial) code:

NSError *error = nil;
// [self theRegularExpression] supplies the pattern as an NSString
NSRegularExpression *detector = [NSRegularExpression regularExpressionWithPattern:[self theRegularExpression] options:0 error:&error];
// Collect every match found in theText
NSArray *links = [detector matchesInString:theText options:0 range:NSMakeRange(0, theText.length)];
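The options mentioned in the question change how the pattern is interpreted, so it may help to see their effect on a small case. The sketch below assumes a hypothetical helper, CountMatches, and a short stand-in pattern rather than the full expression. NSRegularExpressionIgnoreMetacharacters makes the whole pattern a literal string, so a real regular expression stops matching anything useful under it, while NSRegularExpressionAnchorsMatchLines lets ^ and $ match at line boundaries.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): counts the matches
// of `pattern` in `text` when the pattern is compiled with `options`.
static NSUInteger CountMatches(NSString *pattern,
                               NSRegularExpressionOptions options,
                               NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:options
                                                    error:&error];
    if (regex == nil) {
        // A nil result, not the error pointer, signals a compile failure.
        NSLog(@"Pattern did not compile: %@", error);
        return 0;
    }
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
}

// Stand-in pattern, much simpler than the full URL expression.
static NSString *const kSimplePattern = @"^https?://\\S+$";
```

Under these assumptions, CountMatches(kSimplePattern, 0, @"http://example.com") finds one match, while the same call with NSRegularExpressionIgnoreMetacharacters finds none. For a multi-line string such as @"intro\nhttp://example.com\noutro", the anchored pattern only matches when NSRegularExpressionAnchorsMatchLines is set.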


1 Answer

^(?i)(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$

This is the best URL validation regular expression I have found, and it is explained in my question. The pattern itself is in a form that Objective-C regex engines accept; when it is pasted into an @"..." string literal, each backslash has to be doubled. However, using it with NSRegularExpression gave me all sorts of problems, including my app crashing. RegexKitLite had no problems handling it. I do not know whether it is a size limitation or some flag not being set. My final code looked like this:

// First, split the text into words, then match each word against the regular expression
NSArray *splitIntoWordsArray = [textToMatch componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSMutableString *htmlString = [NSMutableString stringWithString:textToMatch];
for (NSString *theText in splitIntoWordsArray) {
    // matchEnumeratorWithRegex: is provided by RegexKitLite, not by Foundation
    NSEnumerator *matchEnumerator = [theText matchEnumeratorWithRegex:theRegularExpressionString];
    for (NSString *temp in matchEnumerator) {
        // Wrap every occurrence of the matched URL in an anchor tag
        [htmlString replaceOccurrencesOfString:temp withString:[NSString stringWithFormat:@"<a href=\"%@\">%@</a>", temp, temp] options:NSLiteralSearch range:NSMakeRange(0, [htmlString length])];
    }
}
// Convert line breaks to HTML line breaks
[htmlString replaceOccurrencesOfString:@"\n" withString:@"<br />" options:NSLiteralSearch range:NSMakeRange(0, htmlString.length)];
// Embed the text in a web view as HTML
[webView loadHTMLString:[NSString stringWithFormat:embedHTML, [mainFont fontName], [mainFont pointSize], htmlString] baseURL:nil];

The result: a UIWebView with some embedded HTML, in which URLs and emails are clickable. Do not forget to set the web view's dataDetectorTypes to UIDataDetectorTypeNone.
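The same link-wrapping step can also be done with Foundation alone. This is a sketch of a different technique from the RegexKitLite loop above: it uses NSRegularExpression's stringByReplacingMatchesInString:options:range:withTemplate:, where $0 in the template stands for the whole match. The helper name LinkifiedHTML and the short stand-in pattern in the usage note are assumptions for illustration, and the matched text is not HTML-escaped here.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): wraps each match of
// `pattern` in an anchor tag and converts newlines to <br />.
// Returns nil when the pattern does not compile.
static NSString *LinkifiedHTML(NSString *text, NSString *pattern) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Pattern did not compile: %@", error);
        return nil;
    }
    // Each match is replaced exactly once, at its own position, so a URL
    // that occurs twice in the text is not wrapped twice.
    NSString *linked =
        [regex stringByReplacingMatchesInString:text
                                        options:0
                                          range:NSMakeRange(0, text.length)
                                   withTemplate:@"<a href=\"$0\">$0</a>"];
    return [linked stringByReplacingOccurrencesOfString:@"\n"
                                             withString:@"<br />"];
}
```

A call such as LinkifiedHTML(someText, @"https?://\\S+") would wrap every run of non-whitespace characters that starts with http:// or https://. Replacing matches by range avoids the repeated whole-string search that the word-by-word loop performs.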

You can also try

NSError *error = nil;
// Inside an Objective-C string literal every regex backslash must be doubled
NSRegularExpression *expression = [NSRegularExpression regularExpressionWithPattern:@"(?i)(?:(?:https?):\\/\\/)?(?:\\S+(?::\\S*)?@)?(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:\\/[^\\s]*)?" options:NSRegularExpressionCaseInsensitive error:&error];
// A nil return value, not the error pointer, indicates that the pattern failed to compile
if (expression == nil)
    NSLog(@"error: %@", error);
NSString *someString = @"This is a sample of a sentence with a URL http://. http://.. http://../ http://? http://?? http://??/ http://# http://-error-.invalid/ http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com within it.";
// options takes NSMatchingOptions; 0 means default matching
NSRange range = [expression rangeOfFirstMatchInString:someString options:0 range:NSMakeRange(0, [someString length])];
if (range.location != NSNotFound) {
    NSString *match = [someString substringWithRange:range];
    NSLog(@"%@", match);
}
else {
    NSLog(@"no match");
}

Hope it helps somebody in the future.

The regular expression will sometimes cause the application to hang, probably because its nested quantifiers can backtrack heavily on some inputs. So I decided to use Gruber's regular expression, modified to recognize URLs without the protocol or the www part:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))*(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])*)
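To use the pattern above with NSRegularExpression, it has to be written as an Objective-C string literal, which means doubling each backslash and escaping the double quote. The sketch below shows one way to do that and to collect all matches. The constant kGruberPattern and the helper FindURLs are hypothetical names introduced here, and the three characters after the ? in the last bracket are assumed to be ?«» as in Gruber's published pattern.

```objc
#import <Foundation/Foundation.h>

// The modified Gruber pattern as an Objective-C literal, split across
// adjacent literals for readability (the compiler concatenates them).
static NSString *const kGruberPattern =
    @"(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/?)"
    @"(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))*"
    @"(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])*)";

// Hypothetical helper (an assumption for illustration): returns the text of
// every match of kGruberPattern in `text`, or an empty array on failure.
static NSArray<NSString *> *FindURLs(NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:kGruberPattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Pattern did not compile: %@", error);
        return @[];
    }
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}
```

For a sentence such as @"See www.example.com/page for details", FindURLs should return the single string www.example.com/page. Because the trailing groups in the modified pattern are optional, punctuation that directly follows a URL may end up inside the match.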
