Extract links to tweets and media on Twitter as you scroll down.

§
Posted: 2020-05-15
Edited: 2020-05-17

I've looked for a browser extension that can extract tweet links and media URLs from the modern Twitter layout as I scroll down, and I haven't found one. Until June 1 you can extract tweet links by using GoodTwitter (which forces Twitter back to the old layout), loading tweets by scrolling down continuously, and then running Link Gopher to extract the links; for media, after scrolling down you right-click, open “View Page Info” in Firefox, go to the Media tab, and copy the list. This method will stop working on and after June 1, because Twitter is ditching the legacy layout.

The new Twitter layout unloads tweets once they leave the screen, so link-extraction tools only pick up the tweets currently loaded in the HTML. That makes collecting a large number of links extremely tedious, since you can only grab about a screenful of tweets at a time.

Can someone make a script that scans the page continuously, in real time? I'd turn it on, and at every interval (user-customizable, in milliseconds) it would check the HTML for changes; if there are none, it waits for the next interval, and if there are, it looks for tweet links.
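
Something along these lines is what I have in mind (a rough illustration only; I don't know enough JavaScript to write the real thing):

let lastLength = 0;
setInterval(() => {
    // Crude change check, just for illustration: compare the HTML length to last time.
    const html = document.body.innerHTML;
    if (html.length === lastLength) return; // no change, wait for the next interval
    lastLength = html.length;
    // ...scan the page for tweet links here...
}, 500); // the 500 ms interval would be user-configurable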

I basically want a URL-sniffing script.

I'm doing this to make it easy to save Twitter links to the Internet Archive (you should check it out; you can save multiple links at once).

§
Posted: 2020-05-15

Also, just to warn everybody: Twitter accounts aren't permanent. The inactive-account policy doesn't mention what happens to the tweets such an account posted, so old tweets may be wiped when a user inevitably goes inactive in the future.

§
Posted: 2020-05-17

I have no idea what you ultimately want to do, but just extracting links as you scroll down is not that difficult:

// ==UserScript==
// @name         Extract Twitter links
// @namespace    twitter.com
// @version      0.1
// @description  try to take over the world!
// @include      https://twitter.com/*
// @grant        none
// ==/UserScript==

(function() {
    'use strict';
    // Collect unique links; also exposed globally as window.allLink.
    const all = window.allLink = new Set();
    function getLink() {
        // Find every anchor that points at a tweet permalink.
        Array.from(document.querySelectorAll('a[href*="/status/"]')).forEach(link=>{
            // Only record links we haven't seen that end in a numeric status ID.
            if(!all.has(link.href)&&link.href.match(/twitter\.com\/.+?\/status\/\d+$/)) {
                all.add(link.href);
                console.log(link.href);
            }
        });
    }
    getLink();
    // Re-scan whenever the user scrolls, picking up newly loaded tweets.
    window.addEventListener('scroll',getLink);
})();

It will output to the browser console; you may want to disable console warning messages to get a clean view of the output.

§
Posted: 2020-05-17
Edited: 2020-05-18

Awesome. For anyone reading this: also disable errors and you get a clean list.

I've also found I can extract media links with Firefox's Network Monitor; however, to copy all the URLs I had to use “Copy All As HAR”, which, pasted into Notepad++, spews out a HUGE number of lines, very bloated compared to just listing the links. That means if you wanted to save thousands of tweets with lots of media, it might well exceed NP++'s size limit.
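
For anyone else going the HAR route: a HAR file is just JSON, so a few lines of Node.js could pull out only the request URLs instead of the whole dump (untested sketch; the twitter.har filename is just an example):

// har-urls.js - print only the request URLs from a HAR capture
const fs = require('fs');
const har = JSON.parse(fs.readFileSync('twitter.har', 'utf8'));
for (const entry of har.log.entries) {
    console.log(entry.request.url);
}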

Twitter's media URL formats:

https://pbs.twimg.com/tweet_video_thumb/<Base64String>.<FileExtension>
https://pbs.twimg.com/tweet_video/<Base64String>.mp4 - probably for GIFs
https://video.twimg.com/tweet_video/<Base64String>.mp4
https://pbs.twimg.com/media/<Base64String>?format=<FileExtension>&name=<ResolutionName>

Can you add sniffing for those too? Thanks.

Essentially this would be the same as Firefox's “View Page Info”, but it self-updates and entries don't disappear when tweets unload.

§
Posted: 2020-05-18

Extracting image links is not difficult either; it's similar to the tweet links:

// ==UserScript==
// @name         Extract Twitter links
// @namespace    twitter.com
// @version      0.2
// @description  try to take over the world!
// @include      https://twitter.com/*
// @grant        none
// ==/UserScript==
(function() {
    'use strict';
    const all = window.allLink = new Set();
    function getLink() {
        Array.from(document.querySelectorAll('a[href*="/status/"]')).forEach(link=>{
            if(!all.has(link.href)&&link.href.match(/twitter\.com\/.+?\/status\/\d+$/)) {
                all.add(link.href);
                console.log(link.href);
            }
        });
        // Also grab any media served from pbs.twimg.com (images and GIF thumbnails).
        Array.from(document.querySelectorAll('[src*="pbs.twimg.com"]')).forEach(link=>{
            if(!all.has(link.src)) {
                all.add(link.src);
                console.log(link.src);
            }
        });
    }
    getLink();
    window.addEventListener('scroll',getLink);
})();

I don't filter the image URL formats, since they all seem to match; maybe you should filter them yourself.
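
If you do want to filter inside the script, something like this (untested, based on the formats you listed) could be added to the image section's check:

// Untested: only keep URLs that look like the media formats listed above.
const mediaPattern = /^https:\/\/(pbs|video)\.twimg\.com\/(media|tweet_video|tweet_video_thumb)\//;
if (!all.has(link.src) && mediaPattern.test(link.src)) {
    all.add(link.src);
    console.log(link.src);
}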

But extracting the video links is not easy: they're not in the page, and the video URLs may not be permanent either.

If you want to crawl Twitter, you should probably search GitHub instead; there are many Twitter spiders and crawlers that don't need a browser, although they may require some programming knowledge to use.

§
Posted: 2020-05-19

@indefined said: I don't filter the image URL formats … maybe you should filter them yourself. … But extracting the video links is not easy …

I can filter them easily with NP++'s features (sort lines lexicographically ascending), which sorts the URLs (assuming one URL per line), and then remove consecutive duplicate lines. As for the videos: externally hosted videos show up as [blob:] in the Media tab of View Page Info, and their URLs say "exttwvideo"; I don't really need to save those, just the ones uploaded directly to Twitter (whose URLs say "tweet_video").

I should go to w3schools and learn what each "word" in that code means. Thanks for helping. I'll let you know if something's up.

§
Posted: 2020-05-19

// ==UserScript==
// @name         Extract Twitter links
// @namespace    twitter.com
// @version      0.3
// @description  try to take over the world!
// @include      https://twitter.com/*
// @grant        none
// ==/UserScript==
(function() {
    'use strict';
    const all = window.allLink = new Set();
    function getLink() {
        Array.from(document.querySelectorAll('a[href*="/status/"]')).forEach(link=>{
            if(!all.has(link.href)&&link.href.match(/twitter\.com\/.+?\/status\/\d+$/)) {
                all.add(link.href);
                console.log(link.href);
            }
        });
        Array.from(document.querySelectorAll('[src*="pbs.twimg.com"]')).forEach(link=>{
            if(!all.has(link.src)) {
                all.add(link.src);
                console.log(link.src);
            }
        });
        // Also grab direct video URLs (mainly GIFs converted to MP4 on video.twimg.com).
        Array.from(document.querySelectorAll('[src*="video.twimg.com"]')).forEach(link=>{
            if(!all.has(link.src)) {
                all.add(link.src);
                console.log(link.src);
            }
        });
    }
    getLink();
    window.addEventListener('scroll',getLink);
})();

Well, if you only need the videos that have direct URLs, that's simple too: just add another section to getLink, similar to the images. But it seems only GIF videos have a direct URL to grab.

Actually, the !all.has check already filters duplicate links in the code, so as long as you haven't reloaded the page there should be no duplicate lines; I just haven't checked whether the pbs.twimg.com and video.twimg.com links match the formats you listed. The global variable window.allLink (the same object as the variable all) also holds every unique extracted link until you reload the page, so you can ignore the line-by-line console log and instead type

Array.from(window.allLink).join('\n')

in the console and press Enter to get all the unique lines whenever you want.
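
If your browser console supports the copy() helper (the Firefox and Chrome devtools consoles do), you can also send the whole list straight to the clipboard:

// Copies every collected link, one per line, to the clipboard.
copy(Array.from(window.allLink).join('\n'));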

§
Posted: 2020-05-19

I found another problem: the new layout also doesn't load images hidden behind “The following media includes potentially sensitive content”, and there isn't an extension that reveals them automatically without logging in. Can you code an autoclick to reveal those? Thanks.

§
Posted: 2020-05-20

// ==UserScript==
// @name         Extract Twitter links
// @namespace    twitter.com
// @version      0.4
// @description  try to take over the world!
// @include      https://twitter.com/*
// @grant        none
// ==/UserScript==
(function() {
    'use strict';
    const all = window.allLink = new Set();
    function getLink() {
        Array.from(document.querySelectorAll('a[href*="/status/"]')).forEach(link=>{
            if(!all.has(link.href)&&link.href.match(/twitter\.com\/.+?\/status\/\d+$/)) {
                all.add(link.href);
                console.log(link.href);
            }
        });
        Array.from(document.querySelectorAll('[src*="pbs.twimg.com"]')).forEach(link=>{
            if(!all.has(link.src)) {
                all.add(link.src);
                console.log(link.src);
            }
        });
        Array.from(document.querySelectorAll('[src*="video.twimg.com"]')).forEach(link=>{
            if(!all.has(link.src)) {
                all.add(link.src);
                console.log(link.src);
            }
        });

        // Auto-click the overlay that hides "potentially sensitive content".
        // The selector uses Twitter's current generated class names, so it is
        // likely to break whenever the markup changes.
        Array.from(document.querySelectorAll('.r-1u4rsef.r-1tlfku8.r-1phboty.r-rs99b7.r-t23y2h.r-1w50u8q>.r-1kihuf0.r-tm2x56>div'))
            .forEach(item=>item.click());
    }
    getLink();
    window.addEventListener('scroll',getLink);
})();

It's easy to click, but I don't know how long it will keep working. I still don't think it's a good idea to crawl Twitter in a browser, so I may not fix this later, as the script is of limited value.
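
If it does break, a variant that matches on the button role and label instead of the generated class names might last longer (untested, and it assumes the overlay button is labelled "View"):

// Untested alternative: click by role and visible label rather than class names.
Array.from(document.querySelectorAll('div[role="button"]'))
    .filter(el => el.textContent.trim() === 'View')
    .forEach(el => el.click());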

§
Posted: 2020-05-21

Thanks. The autoclick may break, since those class strings may even be randomized. Also, the number of tweets loaded while scrolling down is artificially limited when you're not logged in; I think it stops loading additional tweets after about 85. The same goes when you use the search bar.

§
Posted: 2020-06-10

To listen for elements that appear after the script has run, you could use the MutationObserver API or the arrive.js library.
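
For example, a minimal sketch (untested) that re-runs getLink whenever new elements are inserted, instead of listening for scroll events:

// Re-run the extraction whenever Twitter adds nodes to the page.
const observer = new MutationObserver(() => getLink());
observer.observe(document.body, { childList: true, subtree: true });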
