Based on Linkify Plus. Turn plain text URLs into links.
How do you define a non-domain sequence? I don't think it can be determined by checking the character before the link.
Visible content consists of signs and spaces. We are interested in signs, since valid web addresses cannot contain spaces. Passing by a cute cat, we meet the contiguous sequence of signs D:\backup_friday.sa. Since 1) a domain cannot contain underscores (subdomains can) and 2) D:\ obviously cannot be part of a hostname, this sequence should not be linkified; skip it.
The same goes for tar?07.zip: the sequence contains a question mark between signs, which is permitted only within the path after the domain name, not within or before it, so skip the sequence.
Speaking of _jabber._tcp.gmail.com: either linkify it entirely or, better, skip sequences starting with an underscore, as they are most likely for internal use (e.g. a DNS setting).
For backup_friday.sa and _jabber._tcp.gmail.com, we can add underscore as a valid character and drop the link if the domain contains _, so the script won't split them up into backup_ + friday.sa and _jabber._ + tcp.gmail.com.
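A minimal Python sketch of that idea, for illustration only. The script itself is a JavaScript userscript, and the pattern and the name `linkable_hosts` here are assumptions, not its actual code. The candidate pattern admits underscores so the whole sequence is captured as one token, and the token is then dropped if it contains one:

```python
import re

# Hypothetical candidate pattern: a dotted, hostname-like run that also admits
# underscores, so "backup_friday.sa" is captured whole instead of as "friday.sa".
CANDIDATE = re.compile(r"[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)+")


def linkable_hosts(text):
    """Return candidates that survive the 'drop it if it contains _' rule."""
    return [m.group(0) for m in CANDIDATE.finditer(text) if "_" not in m.group(0)]
```

With this rule, neither `cat D:\backup_friday.sa` nor `_jabber._tcp.gmail.com` produces a link, while a plain `www.google.com` still does.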
But for tar?07.zip, even though a link shouldn't start with ?, the following part is still a valid domain. For example:
... blabla, bla?www.google.com whatever whatever...
blabla, bla?www.google.com whatever contains a mistake: the space between the sentence and the implied link is mandatory, but omitted. Do you believe Linkify++ should encourage users to type with mistakes or accept the illiteracy of others? We respect various RFCs when choosing what to linkify, e.g. the restriction that forbids underscores in domain names. Others should respect basic language rules as well. See also: the proverb “haste makes waste”. It’s not a spell-checker after all, but a great tool to insert <a href></a> behind the scenes for correctly typed web addresses.
We can add an option to make the script convert only links preceded by a space or bracket, but I doubt it is a common use case. With that behavior, these links won't be linkified:
!www.google.com
@www.google.com
#www.google.com
...
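A rough Python sketch of how such an option could behave, assuming a simplified host-only pattern (the pattern and the name `bounded_hosts` are hypothetical, not taken from the script). A candidate is accepted only at the start of the text or right after whitespace or an opening bracket:

```python
import re

# The negative lookbehind rejects any candidate preceded by a character that is
# neither whitespace nor an opening bracket; start of text is accepted too.
BOUNDED = re.compile(r"(?<![^\s(\[{<])[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def bounded_hosts(text):
    """Return hostname-like runs that begin at an allowed boundary."""
    return BOUNDED.findall(text)
```

Under this assumption `!www.google.com`, `@www.google.com` and `#www.google.com` all yield nothing, and so does `bla?www.google.com`, while `(www.google.com` and ` www.google.com` are picked up.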
An extra option, I believe, would be overhead, since the idea is not to look for a needle in a haystack, i.e. not to catch every single entry that looks like a link, but to highlight correctly typed web addresses with a missing href. Let me illustrate.
The following passage is pure garbage, but with www.google.com injected. The current almighty implementation linkifies half of the second line because it looks like a link (hello, www-pattern!), but does it make sense in the end? No.
u5+MuFA0/VstFMPMHJmrnVdVr32UyxccI84J7BcjIsIfQV3HO4a0UK9A5kQy13/f
1OjveBcxU1WBCm4DrwVKwjwCHs?www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
Rbf3rdUpYNuXfbqN2QCXDpgmwpAkIC0fUItZ8u4pf+C05JZdmM84XMIxYYHhOcM0
Let’s delete the question mark before www.google.com and see what happens.
u5+MuFA0/VstFMPMHJmrnVdVr32UyxccI84J7BcjIsIfQV3HO4a0UK9A5kQy13/f
1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
Rbf3rdUpYNuXfbqN2QCXDpgmwpAkIC0fUItZ8u4pf+C05JZdmM84XMIxYYHhOcM0
All of a sudden the whole second line of garbage becomes a link. OMG, WTF, right?
You think this shouldn't be linkified:
1OjveBcxU1WBCm4DrwVKwjwCHs?www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
How about this?
1OjveBcxU1WBCm4DrwVKwjwCHs www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
or
1OjveBcxU1WBCm4DrwVKwjwCHs(www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
This one will still be linkified even after applying the rule that links must be surrounded by spaces or brackets:
1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
A common web address consists of protocol + subdomains (prefix) + domain + suffix + port + path, right? The essential parts are the domain and the suffix; the other parts are optional, which is why google.com works fine. Moreover, a meaningful entity (a word, number, or link) should stand out from the rest, which usually means having spaces around it (including punctuation, if any), e.g. “cat is cute” vs “catiscute”. The idea is to cover the basics and grow with exceptions along the way.
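To make the parts concrete, here is a small Python sketch built on the standard `urllib.parse.urlsplit`. The helper name `split_address` and the naive "last label is the suffix" split are assumptions made for illustration; multi-label suffixes such as co.uk would need a proper suffix list:

```python
from urllib.parse import urlsplit


def split_address(address):
    """Break an address into the parts named above (naive suffix handling)."""
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so bare addresses get a "//" prefix (an assumption of this sketch).
    parts = urlsplit(address if "://" in address else "//" + address)
    labels = (parts.hostname or "").split(".")
    has_domain = len(labels) >= 2
    return {
        "protocol": parts.scheme,
        "prefix": ".".join(labels[:-2]),
        "domain": labels[-2] if has_domain else "",
        "suffix": labels[-1] if has_domain else "",
        "port": parts.port,
        "path": parts.path,
    }
```

For `google.com` only the domain and suffix come back non-empty, which matches the point that the other parts are optional.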
Let's apply this.
1OjveBcxU1WBCm4DrwVKwjwCHs www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
These should not be linkified: the suffix com?2 is invalid because a question mark is permitted only in the path, after the slash. We could linkify these parts: www.google.com and 1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com.
The example with the left round bracket should not be linkified either:
- the prefix (with the left round bracket) is not separated from the preceding context by a space,
- the suffix is invalid because of the question mark.
We could linkify this part: www.google.com.
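One way to read this proposal is as a whole-token check. The Python sketch below takes the stricter reading, rejecting the entire token instead of linkifying just the host part; the pattern and the name `strict_links` are hypothetical. Each whitespace-delimited token, optionally wrapped in one pair of round brackets, must be an address from start to end, and a question mark is accepted only after a slash:

```python
import re

# Optional scheme, dotted host, optional port, optional "/path".
# "?" can only appear inside the path part, i.e. after a slash.
STRICT = re.compile(
    r"(?:https?://)?"
    r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
    r"(?::\d+)?"
    r"(?:/\S*)?"
)


def strict_links(text):
    """Return tokens that are web addresses from their first to last character."""
    links = []
    for token in text.split():
        # Unwrap a single surrounding pair of round brackets, if present.
        if token.startswith("(") and token.endswith(")"):
            token = token[1:-1]
        if STRICT.fullmatch(token):
            links.append(token)
    return links
```

With this check, `www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm` is rejected whether or not a space precedes it, while `www.google.com/ELkU1CY5Jp+y5p3Hc5Tnm` and `(www.google.com/ELkU1CY5Jp?y5p3Hc5Tnm)` pass.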
Other valid examples are:
1OjveBcxU1WBCm4DrwVKwjwCHs www.google.com/ELkU1CY5Jp+y5p3Hc5Tnm
1OjveBcxU1WBCm4DrwVKwjwCHs (www.google.com/ELkU1CY5Jp?y5p3Hc5Tnm) C05JZdmM84XM
Query part starts with a question mark.
google.com?q=test
If we don't allow brackets before/after the links, none of these URLs will be linkified.
(http://www.example.com/)
(Some text... http://www.example.com/)
http://www.example.com/(Some text...)
http://www.example.com/(Some)
http://en.wikipedia.org/wiki/Darwin_(operating_system)
(http://www.foobar.com/test)
http://www.foobar.com/test).
http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))
These links are copied from my test file.
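One common heuristic for cases like these, sketched in Python. This is an assumption for illustration, not necessarily what the script does, and `trim_unbalanced` is a made-up name. The match is allowed to run through brackets, and closing brackets that have no partner inside the URL are then trimmed from the end:

```python
def trim_unbalanced(url):
    """Drop trailing ')' characters that have no matching '(' inside the URL."""
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url
```

This keeps `http://en.wikipedia.org/wiki/Darwin_(operating_system)` and the asianewsphoto URL intact, while `http://www.foobar.com/test)` loses its stray bracket. Trailing punctuation such as the period in `test).` would need a separate step.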
IMHO, trying to list every kind of non-domain sequence isn't worth it. The more rules there are, the greater the chance that valid URLs are filtered out. There are also some URLs that don't look like a URL, such as "3.141592653589793238462643383279502884197169399375105820974944592.com" or "yesno.wtf".
It is incorrect and against RFC to omit the slash after the suffix, i.e. it should be google.com/?q=test. Look closely at the address bar: modern browsers (created with heedless fellows in mind) silently redirect to the version with the slash.
Brackets are fine. I never said brackets are wrong; I even listed a valid example with brackets:
1OjveBcxU1WBCm4DrwVKwjwCHs (www.google.com/ELkU1CY5Jp?y5p3Hc5Tnm) C05JZdmM84XM
As for your examples, they should be linkified as long as they stand out.
It is not correct, but it exists around the internet. I prefer to leave it as is, so that URLs like http://example.com?querystring can still be linkified.
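For comparison, a quick Python check of how one standard-library parser treats the two spellings. This only shows what `urllib.parse.urlsplit` does; it says nothing about how Linkify++ itself parses, and `host_path_query` is a throwaway helper name:

```python
from urllib.parse import urlsplit


def host_path_query(url):
    """Return the (host, path, query) triple that urlsplit extracts."""
    parts = urlsplit(url)
    return parts.netloc, parts.path, parts.query
```

Both `http://example.com?querystring` and `http://example.com/?querystring` come back with the same host and query; only the path differs (empty versus `/`).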
As for bla?example.com and the brackets issue, they are too complex. Unless more people ask for it, I'll stay with the current behavior, which feels more natural to me.
Well, it sounds disappointing. Common sense implies standards should be supported above all.
To sum up invalid or half-linkified findings from this topic:
cat D:\backup_friday.sa
_jabber._tcp.gmail.com
See tar?07.zip below
1OjveBcxU1WBCm4DrwVKwjwCHs?www.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
1OjveBcxU1WBCm4DrwVKwjwCHswww.google.com?2l/ELkU1CY5Jp+y5p3Hc5Tnm
google.com?q=test
P.S. Visit the [Linkify-It][1] demo, scroll to the “Not links” section and watch for many more false positives that Linkify++ produces.
P.P.S. See also the [Linkification][2] userscript, which partially agrees with my remarks.
[1]: http://markdown-it.github.io/linkify-it/
[2]: https://greasyfork.org/en/scripts/10400-linkification
I noticed that there is a leading period bug.
I have opened an issue for this: https://github.com/eight04/linkify-plus-plus/issues/6
Fixed in 7.4.0.
Only half-fixed, as it was not thoroughly tested. Some linkified entities refer to themselves when the option “surrounded by whitespace” is enabled.
Proof.
Fixed in 7.4.1.
[RESOLVED] False positives with underscore
Please, have a look at the following fragment:
cat D:\backup_friday.sa D:\backup_weekday.sa | strarc -xlo:a -d:C:\
friday.sa and weekday.sa must NOT be linked here, since they are parts of longer contiguous non-domain sequences. Another false positive that involves an underscore is _jabber._tcp.gmail.com. There is no sense in linkifying tcp.gmail.com when the whole sequence is not meant to be used as a website address (it’s a DNS setting). The last example of a similar kind is a passage from another manual: See tar?07.zip below, which is wrong again, since the whole sequence (had it been a domain) cannot contain a question mark.