Linkify Plus Plus

Onegin

Posted: 28/11/2016

Edited: 06/12/2016

[RESOLVED] False positives with underscore

Please, have a look at the following fragment:

cat D:\backup_friday.sa D:\backup_weekday.sa | strarc -xlo:a -d:C:\

friday.sa and weekday.sa must NOT be linked here, since they are parts of longer contiguous non-domain sequences. Another false positive that involves an underscore is _jabber._tcp.gmail.com. It makes no sense to linkify tcp.gmail.com when the whole sequence is not meant to be used as a website address (it’s a DNS setting). The last example of a similar kind is a passage from another manual: See tar?07.zip below. Linking 07.zip is wrong again, since the whole sequence (had it been a domain) cannot contain a question mark.

eight (Author)

Posted: 28/11/2016

How do you define a non-domain sequence? I don't think it can be determined by checking the character before the link.

Onegin

Posted: 28/11/2016

Edited: 28/11/2016

Visible content consists of signs and spaces. We are interested in the signs, since valid web addresses cannot contain spaces. Passing by a cute cat, we meet the contiguous sequence of signs D:\backup_friday.sa. Since 1) a domain cannot contain underscores (subdomains can) and 2) D:\ obviously cannot be part of a hostname, this sequence should not be linkified; skip it.

The same goes for tar?07.zip: the sequence contains a question mark between signs, which is permitted only within the path after the domain name, not within or before the domain name, so skip the sequence.

Speaking of _jabber._tcp.gmail.com: either linkify it entirely or, better, skip sequences starting with an underscore, as they are most likely for internal use (e.g. a DNS setting).
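The whole-sequence rule described above could be sketched roughly as follows, in Python for illustration only. This is not Linkify Plus Plus code: `LABEL`, `TOKEN`, `should_linkify` and `linkable_tokens` are hypothetical names, the pattern does not validate TLDs, and trailing punctuation such as a sentence-ending period is deliberately not handled.

```python
import re

# Hypothetical host label: letters, digits, inner hyphens; no underscore.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# Hypothetical token shape: optional scheme, dotted host, optional port,
# optional path that must begin with a slash.
TOKEN = re.compile(rf"(?:https?://)?{LABEL}(?:\.{LABEL})+(?::\d+)?(?:/\S*)?")


def should_linkify(token):
    """Accept a whitespace-delimited token only if ALL of it looks like an address."""
    return TOKEN.fullmatch(token) is not None


def linkable_tokens(text):
    """Split on whitespace and keep only the tokens that pass the whole-token check."""
    return [t for t in text.split() if should_linkify(t)]


# The sequences from this thread are rejected as a whole instead of being
# split into a "valid-looking" tail such as friday.sa or 07.zip.
print(linkable_tokens("cat D:\\backup_friday.sa D:\\backup_weekday.sa"))
print(linkable_tokens("See tar?07.zip below"))
print(linkable_tokens("Visit www.google.com/ELkU1CY5Jp?y5p3Hc5Tnm now"))
```

Because the check uses `fullmatch` on each token, a stray `\`, `_` or `?` anywhere before the path disqualifies the entire sequence, which is the behaviour asked for here.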

eight (Author)

Posted: 28/11/2016

For backup_friday.sa and _jabber._tcp.gmail.com, we can add the underscore as a valid character and drop the link if the domain contains _, so the script won't split them into backup_ + friday.sa and _jabber._ + tcp.gmail.com.
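A rough Python sketch of that idea, for illustration only (the userscript itself is not written this way; `CANDIDATE` and `find_hosts` are made-up names): the underscore is accepted while matching so the candidate is not cut short, and any candidate containing it is then discarded.

```python
import re

# Underscore is deliberately part of the host character class, so
# "backup_friday.sa" is captured as ONE candidate instead of "friday.sa".
CANDIDATE = re.compile(r"[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)+")


def find_hosts(text):
    """Return host-like matches, discarding any candidate that contains '_'."""
    return [m.group(0) for m in CANDIDATE.finditer(text) if "_" not in m.group(0)]


print(find_hosts("cat D:\\backup_friday.sa"))   # []
print(find_hosts("_jabber._tcp.gmail.com"))     # []
print(find_hosts("see www.google.com here"))    # ['www.google.com']
```

Note that this only covers the underscore cases: a character that is not in the class, such as `?`, still ends the candidate, so `find_hosts("tar?07.zip")` returns `['07.zip']`.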

But for tar?07.zip, even though a link shouldn't start with ?, the following part is still a valid domain. For example:

... blabla, bla?www.google.com whatever whatever...

Onegin

Posted: 28/11/2016

Edited: 28/11/2016

blabla, bla?www.google.com whatever contains a mistake: the space between the sentence and the implied link is mandatory, but it was omitted. Do you believe Linkify++ should encourage users to type with mistakes or accept the illiteracy of others? We respect various RFCs when choosing what to linkify, e.g. the restriction that forbids underscores in domain names. Others should respect basic language rules as well. See also the proverb “haste makes waste”. It’s not a spell-checker after all, but a great tool to insert <a href></a> behind the scenes for correctly typed web addresses.

eight (Author)

Posted: 28/11/2016

We can add an option to make the script convert only links that are preceded by a space or a bracket. But I doubt it is a common use case. With such behavior, these links won't be linkified:

!www.google.com
@www.google.com
#www.google.com
...
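To make the effect of such an option concrete, here is a small Python sketch. It assumes a simplified host pattern, and `HOST`, `STRICT`, `LOOSE` and `find_links` are hypothetical names rather than anything from the script. The strict variant uses a lookbehind so that a match may only begin at the start of the text or right after whitespace or an opening bracket.

```python
import re

# Simplified, hypothetical host pattern: optional scheme, dotted labels,
# alphabetic suffix of two or more letters.
HOST = r"(?:https?://)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"

# Strict mode: the character before the match must not be anything other
# than whitespace or an opening bracket (start of text also passes).
STRICT = re.compile(r"(?<![^\s(\[{])" + HOST)
LOOSE = re.compile(HOST)


def find_links(text, strict=False):
    """Return host-like matches, optionally requiring a space/bracket before them."""
    pattern = STRICT if strict else LOOSE
    return [m.group(0) for m in pattern.finditer(text)]


for sample in ("!www.google.com", "@www.google.com", "(www.google.com)"):
    print(sample, find_links(sample), find_links(sample, strict=True))
```

With `strict=True` the `!` and `@` samples yield no link at all, while the bracketed one still yields `www.google.com`.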

Onegin

Posted: 28/11/2016

Edited: 28/11/2016

An extra option, I believe, would be overhead, since the idea is not to look for a needle in a haystack, i.e. not to catch every single entry that looks like a link, but to highlight correctly typed web addresses with a missing href. Let me illustrate.

The following passage is pure garbage, but with www.google.com injected. The current almighty implementation linkifies half of the second line because it looks like a link (hello, www-pattern!), but does it make sense in the end? No.

u5+MuFA0/VstFMPMHJmrnVdVr32UyxccI84J7BcjIsIfQV3HO4a0UK9A5kQy13/f
1OjveBcxU1WBCm4DrwVKwjwCHs?www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
Rbf3rdUpYNuXfbqN2QCXDpgmwpAkIC0fUItZ8u4pf+C05JZdmM84XMIxYYHhOcM0

Let’s delete the question mark before www.google.com and see what happens.

u5+MuFA0/VstFMPMHJmrnVdVr32UyxccI84J7BcjIsIfQV3HO4a0UK9A5kQy13/f
1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
Rbf3rdUpYNuXfbqN2QCXDpgmwpAkIC0fUItZ8u4pf+C05JZdmM84XMIxYYHhOcM0

All of a sudden the whole second line of garbage becomes a link. OMG, WTF, right?

eight (Author)

Posted: 28/11/2016

You think this shouldn't be linkified:

1OjveBcxU1WBCm4DrwVKwjwCHs?www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

How about this?

1OjveBcxU1WBCm4DrwVKwjwCHs www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

or

1OjveBcxU1WBCm4DrwVKwjwCHs(www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

This one will still be linkified even after applying a rule that a link must be surrounded by spaces or brackets:

1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

Onegin

Posted: 28/11/2016

Edited: 29/11/2016

A common web address consists of protocol + subdomains (prefix) + domain + suffix + port + path, right? The essential parts are the domain and the suffix; the other parts are optional, which is why google.com works fine. Moreover, a meaningful entity (a word, a number, or a link) should stand out from the rest, which usually means having spaces around it (including punctuation, if any), e.g. “cat is cute” vs. “catiscute”. The idea is to cover the basics and grow with exceptions along the way.
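That breakdown can be written down as a pattern with one named group per part. The following Python sketch is only an illustration under simplifying assumptions (ASCII labels, alphabetic suffix, no user info or fragments); `URL_PARTS` and `split_address` are hypothetical names.

```python
import re

URL_PARTS = re.compile(
    r"(?:(?P<protocol>[a-z][a-z0-9+.-]*)://)?"   # optional, e.g. "https"
    r"(?P<subdomains>(?:[A-Za-z0-9-]+\.)*?)"     # optional prefix, e.g. "www."
    r"(?P<domain>[A-Za-z0-9-]+)\."               # required, e.g. "google"
    r"(?P<suffix>[A-Za-z]{2,})"                  # required, e.g. "com"
    r"(?::(?P<port>\d+))?"                       # optional, e.g. ":8080"
    r"(?P<path>/\S*)?"                           # optional, must start with "/"
)


def split_address(token):
    """Return the named parts of a standalone token, or None if it does not fit."""
    m = URL_PARTS.fullmatch(token)
    return m.groupdict() if m else None


print(split_address("google.com"))
print(split_address("https://www.google.com:8080/search?q=cat"))
print(split_address("www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm"))  # None
```

Only `domain` and `suffix` are mandatory, so `google.com` fits, while a token where `?` directly follows the suffix does not, because the optional path group has to begin with a slash.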

Let's apply this.

1OjveBcxU1WBCm4DrwVKwjwCHs www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

These should not be linkified as a whole: the suffix com?2 is invalid because a question mark is permitted only in the path, after a slash. We could linkify these parts: www.google.com and 1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com.

1OjveBcxU1WBCm4DrwVKwjwCHs(www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm

This should not be linkified as a whole either:
- the prefix (with the left round bracket) is not separated from the preceding context by a space,
- the suffix is invalid because of the question mark.

We could linkify this part: ```www.google.com```.

Other valid examples are:

```
1OjveBcxU1WBCm4DrwVKwjwCHs www.google.com/ELkU1CY5Jp+y5p3Hc5Tnm
1OjveBcxU1WBCm4DrwVKwjwCHs (www.google.com/ELkU1CY5Jp?y5p3Hc5Tnm) C05JZdmM84XM
```

eight (Author)

Posted: 28/11/2016

Query part starts with a question mark.

google.com?q=test

If we don't allow brackets before/after the links, none of these URLs will be linkified.

(http://www.example.com/) 
(Some text... http://www.example.com/) 
http://www.example.com/(Some text...) 
http://www.example.com/(Some) 
http://en.wikipedia.org/wiki/Darwin_(operating_system) 
(http://www.foobar.com/test) 
http://www.foobar.com/test). 
http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))

These links are copied from my test file.
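One common way to cope with brackets around and inside URLs is to take a greedy match first and then trim trailing punctuation and any closing bracket that has no partner inside the URL. The Python sketch below shows that technique only; it is not taken from the script, `trim_unbalanced` is a made-up name, and which link is expected for each test line is my assumption.

```python
def trim_unbalanced(url):
    """Strip trailing punctuation and any ')' without a matching '(' in the URL."""
    while url:
        last = url[-1]
        if last in ".,;:!":
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            url = url[:-1]
        else:
            break
    return url


samples = [
    "http://www.example.com/)",
    "http://www.foobar.com/test).",
    "http://en.wikipedia.org/wiki/Darwin_(operating_system)",
    "http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
]
for s in samples:
    print(trim_unbalanced(s))
```

Balanced brackets that belong to the path, as in the Wikipedia and asianewsphoto lines, are left alone; only the surplus closing bracket and the sentence-ending period are removed.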

IMHO, trying to list every kind of non-domain sequence isn't worth it. The more rules there are, the greater the chance that valid URLs are filtered out. There are also some URLs that don't look like URLs, such as "3.141592653589793238462643383279502884197169399375105820974944592.com" or "yesno.wtf".

Onegin

Posted: 28/11/2016

Edited: 28/11/2016

It is not correct, and against the RFC, to skip the slash after the suffix, i.e. it should be google.com/?q=test. Look closely at the address bar: modern browsers (created with heedless fellows in mind) silently redirect to the version with the slash.

Brackets are fine. I never said brackets are wrong; I even listed a valid example with brackets: 1OjveBcxU1WBCm4DrwVKwjwCHs (www.google.com/ELkU1CY5Jp?y5p3Hc5Tnm) C05JZdmM84XM

Speaking of your examples, they should be linkified as long as they stand out.

eight (Author)

Posted: 29/11/2016

It is not correct, but it exists around the internet. I prefer to leave it as is, so that URLs like http://example.com?querystring can be linkified.
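For reference, general-purpose URL parsers also accept this form. A quick check with Python's standard `urllib.parse` (used here purely as an illustration, it has nothing to do with the userscript) shows the query being recognised with an empty path:

```python
from urllib.parse import urlsplit

for url in ("http://example.com?querystring", "http://example.com/?querystring"):
    parts = urlsplit(url)
    # Print host, path and query for the slash-less and the slashed variant.
    print(repr(parts.netloc), repr(parts.path), repr(parts.query))
```

Both variants give the same host and query; the only difference is whether the path is `''` or `'/'`.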

As for bla?example.com and the brackets thing, they are too complex. Unless more people ask for this to be implemented, I will stay with the current behavior, which feels more natural to me.

Onegin

Posted: 29/11/2016

Edited: 29/11/2016

Well, that sounds disappointing. Common sense implies that standards should be supported above all.

To sum up the invalid or half-linkified findings from this topic:

cat D:\backup_friday.sa
_jabber._tcp.gmail.com
See tar?07.zip below
1OjveBcxU1WBCm4DrwVKwjwCHs?www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
google.com?q=test

P.S. Visit the [Linkify-It][1] demo, scroll to the “Not links” section, and watch for many more false positives that Linkify++ produces.

P.P.S. See also the [Linkification][2] userscript, which partially agrees with my remarks.

[1]: http://markdown-it.github.io/linkify-it/
[2]: https://greasyfork.org/en/scripts/10400-linkification

eight (Author)

Posted: 29/11/2016

I noticed that there is a leading period bug.

I have opened an issue for this: https://github.com/eight04/linkify-plus-plus/issues/6

eight (Author)

Posted: 06/12/2016

Fixed in 7.4.0.

Onegin

Posted: 06/12/2016

Edited: 06/12/2016

Half-fixed, as it has not been thoroughly tested. Some linkified entities refer to themselves when the option “surrounded by whitespace” is enabled.

Onegin

Posted: 06/12/2016

Proof.

eight (Author)

Posted: 06/12/2016

Fixed in 7.4.1.

Rating: Good; the script works as promised