Intercept and change HTML before load

adambot

Posted: 18.05.2015

I visit a website that loads hundreds of 404s because the old BBCode broke when they upgraded the forum software. What I am hoping to do is convert my script from running after the document loads to using "// @run-at document-start".

What I can't figure out is how to get the source code of the page before the browser starts loading it, so I can do some quick regex replaces first. The 404s start right after the first DOMNodeInserted event.

Any help is appreciated!

woxxom (Mod)

Posted: 18.05.2015

Edited: 20.05.2015

The DOMNodeInserted event is deprecated because it slows down the page load process; use MutationObserver instead. However, to support Chrome's page prerender feature (used on history back/forward and by the omnibox's typing predictor), also process the nodes directly. See the code:

// ==UserScript==
// @name         123
// @match        http://puzzleanddragonsforum.com/threads/*
// @run-at       document-start
// @grant        none
// ==/UserScript==

var findBad = /(<a .*?href=\S+) .*?(><img.*?[0-9]+).*?=([0-9]+).*?(title=.*?) width.*?\)(.*?<\/a>)/ig; 
var fixBad = '$1"$2.png" width="$3px" $4) $5';

var selector = 'a[href*=" "]'; // catch malformed A tags with space in href
setMutationHandler(document, selector, processNodes); // process a page while it's loading
/*
// process a prerendered page in case it's loaded right now
processNodes(null, document.querySelectorAll(selector));
// process a prerendered page in case it's not loaded COMPLETELY at this moment
window.addEventListener('DOMContentLoaded', function(e) { 
  processNodes(null, document.querySelectorAll(selector));
});
*/

//var processed = [];
function processNodes(observer, nodes) {
  for (var i=0, len=nodes.length, n, img; i<len && (n=nodes[i]); i++) {
    //if (processed.indexOf(n) >= 0)
    //  continue;
    //processed.push(n);
    n.outerHTML = n.outerHTML.replace(findBad, fixBad);
  }
  //if (condition && observer) {
  //  observer.disconnect(); // stop the observer
  //  return false;
  //}
  return true; // continue the enumeration in setMutationHandler
}

function setMutationHandler(baseNode, selector, cb) {
  var ob = new MutationObserver(function(mutations){
    for (var i=0, ml=mutations.length, m; (i<ml) && (m=mutations[i]); i++)
      for (var j=0, nodes=m.addedNodes, nl=nodes.length, n; (j<nl) && (n=nodes[j]); j++)
        if (n.nodeType == 1) 
          if ((n = n.matches(selector) ? [n] : n.querySelectorAll(selector)) && n.length)
            if (!cb(ob, n))
              return;
  });
  ob.observe(baseNode, {subtree:true, childList:true}); 
}

I'm using plain for-loops because on complex pages with thousands of mutation events they're considerably faster. Use Chrome's Profile tab in the Dev Tools panel to see how much CPU your code uses.
DOM changes you perform inside a mutation handler also generate mutation events, which may hang the script in endless recursion, so use some method of avoiding it. The commented-out processed array in the code above is one method; another is to use a selector that won't match the nodes once they have been changed.
In your case the code may be much simpler.
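
The "remember what was already handled" idea can be sketched with a WeakSet instead of an array, which avoids the linear indexOf scan. This is only an illustration under assumptions: makeOnceFilter is a hypothetical helper name that does not appear in the script above, and plain objects stand in for DOM nodes so the snippet runs outside a browser.

```javascript
// Hypothetical helper: returns a function that keeps only items it has not seen before.
// A WeakSet remembers the seen items without preventing their garbage collection.
function makeOnceFilter() {
  var seen = new WeakSet();
  return function (items) {
    var fresh = [];
    for (var i = 0, len = items.length; i < len; i++) {
      var item = items[i];
      if (seen.has(item))
        continue; // already handled, skip it to avoid reprocessing
      seen.add(item);
      fresh.push(item);
    }
    return fresh;
  };
}

// Plain objects stand in for DOM nodes here.
var onlyNew = makeOnceFilter();
var a = {id: 'a'}, b = {id: 'b'};
var firstPass = onlyNew([a, b]);   // both are new
var secondPass = onlyNew([a, b]);  // both were already seen
```

Note that assigning to outerHTML replaces the element with a newly parsed one, so a guard like this only helps for nodes that are modified in place. For replaced nodes, the selector approach (the fixed node no longer matches) is what stops the loop.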

adambot

Posted: 18.05.2015

Thanks!! I'll work on this today, and if I get hung up (since I just learned JavaScript this last weekend, for this project) I'll post back with details on what I'm running into (I'm trying to do most of this myself to learn more).

adambot

Posted: 19.05.2015

OK, so as expected this script is WAY beyond my current skillset... Here is a sample of the HTML I'm trying to intercept:

Broken:

<a href="http://www.example.com/en/foo.asp?n=1234 width=60"><img src="http://www.example.com/en/img/thumbnail/1234 width=60.png" width="{2}px" title="(1234 width=60)Some Text Here"></a>

Corrected:

<a href="http://www.example.com/en/foo.asp?n=1234"><img src="http://www.example.com/en/img/thumbnail/1234.png" width="60px" title="(1234) Some Text Here"></a>

Here is the regex I've developed to give me my output. I'm just not sure how to put it in the code you gave me (I think it is the selector variable that I'm tripping over):

var findBad = /(<a href=.*?[0-9+]) .*?(><img.*?[0-9]+).*?=([0-9]+).*?(title=.*?) width.*?\)(.*?<\/a>)/ig; 
var fixBad = '$1"$2.png" width="$3px" $4) $5';

There are also some tables that are similarly broken (old BBCode, e.g. [table...), but I'll hit those after I understand this part.

woxxom (Mod)

Posted: 19.05.2015

I've simplified the code above accordingly.

adambot

Posted: 20.05.2015

Edited: 20.05.2015

Interesting, for some reason it isn't working... Here's a sample page: http://puzzleanddragonsforum.com/threads/very-wip-team-building-deathly-hell-deity-jackal-anubis.61757/ If you look at post #5 you will see a bunch of hrefs with a space. If you look at post #4 you will see a bunch of hrefs that don't show images like they are supposed to, as well as a bunch of image tags that are showing as pure text.

Here is the code that I use after the page loads to fix all 3 of the issues pointed out above:

//fix thumbnail links with width= at the end
var thumb = document.getElementsByTagName("img"); //array
var thumbregex = /^(?:.*)(http:\/\/www\.puzzledragonx\.com\/.*[0-9]+)(?:.*?width=)([0-9]+)(?:\.png.*)$/i;
for (var i=0,imax=thumb.length; i<imax; i++) {
    var thumbmatches = thumb[i].outerHTML.match(thumbregex);
    if (thumbmatches) {
        thumb[i].setAttribute("src", thumbmatches[1] + ".png");
        thumb[i].setAttribute("height", thumbmatches[2]);
        thumb[i].setAttribute("width", thumbmatches[2]);
    }
}


//Fix links with width= at the end
var links = document.getElementsByTagName("a"); //array
var linksregex = /^(http:\/\/www\.puzzledragonx\.com\/.*?)(%20width=)([0-9]+)$/i;
for (var i=0,imax=links.length; i<imax; i++) {
    links[i].href = links[i].href.replace(linksregex, "$1");

}


//fix image tags that are being shown as text.
var fixTxt = document.evaluate("//text()[contains(.,'[img=')]" , document, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
var imgRegex = /(\[img=)([0-9]+)(.)([0-9]+)(.)(http.*?)(\[\/img\])/ig;
for(var i=fixTxt.snapshotLength - 1; i>=0 ; i--) {
    var txtSnap = fixTxt.snapshotItem(i);

    //check if the text is a broken image
    if(imgRegex.test(txtSnap.nodeValue)) {
        var imgSpan = document.createElement("span");
        txtSnap.parentNode.replaceChild(imgSpan, txtSnap);
        var text = txtSnap.nodeValue;
        imgRegex.lastIndex = 0;
        for(var match = null, lastLastIndex = 0; (match = imgRegex.exec(text)); ) {
            imgSpan.appendChild(document.createTextNode(text.substring(lastLastIndex, match.index)));
            var image = document.createElement("img");

            //anything less than 20px is too small to see
            var height = Math.max(20, Number(match[2]));
            var width = Math.max(20, Number(match[4]));

            image.setAttribute("src", match[6]);
            image.setAttribute("height", height);
            image.setAttribute("width", width);
            imgSpan.appendChild(image);
            lastLastIndex = imgRegex.lastIndex;
        }
        imgSpan.appendChild(document.createTextNode(text.substring(lastLastIndex)));
        imgSpan.normalize();   
    }
}
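
The text-splitting part of the last fix can also be pulled out into a pure function, which makes the regex easy to try without a page. A rough sketch, with assumptions stated: splitImgTags is a hypothetical name, the sample input and its URL are made up, and the first number is treated as height and the second as width only because the code above does so.

```javascript
// Hypothetical helper: split text into plain-text parts and [img=AxB]url[/img] parts.
// As in the code above, the first number is used as height and the second as width.
function splitImgTags(text) {
  var imgRegex = /\[img=([0-9]+).([0-9]+).(http.*?)\[\/img\]/ig;
  var parts = [];
  var last = 0;
  var match;
  while ((match = imgRegex.exec(text))) {
    if (match.index > last)
      parts.push({type: 'text', value: text.substring(last, match.index)});
    parts.push({
      type: 'img',
      src: match[3],
      // anything less than 20px is too small to see
      height: Math.max(20, Number(match[1])),
      width: Math.max(20, Number(match[2]))
    });
    last = imgRegex.lastIndex;
  }
  if (last < text.length)
    parts.push({type: 'text', value: text.substring(last)});
  return parts;
}

// Made-up sample input; the URL is a placeholder.
var parts = splitImgTags('before [img=10x40]http://www.example.com/a.png[/img] after');
```

Each 'text' part would become a text node and each 'img' part an img element, the same way the loop above builds them.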

woxxom (Mod)

Posted: 20.05.2015

Updated the code:

- fixed your findBad
- added @run-at document-start, which is required for the script to actually work
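
A quick way to check findBad outside the browser is to run it on the "Broken" sample from earlier in the thread and compare the result with the "Corrected" one. The sketch below copies findBad and fixBad from the script; the variable names broken, corrected and fixed are just for illustration.

```javascript
// findBad and fixBad are copied from the script earlier in the thread.
var findBad = /(<a .*?href=\S+) .*?(><img.*?[0-9]+).*?=([0-9]+).*?(title=.*?) width.*?\)(.*?<\/a>)/ig;
var fixBad = '$1"$2.png" width="$3px" $4) $5';

// The "Broken" and "Corrected" samples posted earlier in the thread.
var broken = '<a href="http://www.example.com/en/foo.asp?n=1234 width=60"><img src="http://www.example.com/en/img/thumbnail/1234 width=60.png" width="{2}px" title="(1234 width=60)Some Text Here"></a>';
var corrected = '<a href="http://www.example.com/en/foo.asp?n=1234"><img src="http://www.example.com/en/img/thumbnail/1234.png" width="60px" title="(1234) Some Text Here"></a>';

// Apply the replacement and compare with the expected markup.
var fixed = broken.replace(findBad, fixBad);
console.log(fixed === corrected);
```

Real pages may contain variations the sample doesn't cover, so it is worth pasting a few actual outerHTML strings from the forum into the same check.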

adambot

Posted: 20.05.2015

So just to make sure I'm understanding everything, if I also need to fix

woxxom (Mod)

Posted: 20.05.2015

Edited: 20.05.2015

If that img is inside an A that is caught by the selector, then you can alter it in one go via outerHTML, as my code does right now.
Multiple observers will eat CPU on complex pages, so it's better to do everything in one. Here's a simplified (but potentially slower) example:

// catch A with space in href
// catch TABLE with something bad
// catch IMG with space in src but not inside A with space
var selector = 'a[href*=" "], table[somebadthing], *:not(a[href*=" "]) > img[src*=" "]';
var observer = new MutationObserver(function(mutations){
    mutations.forEach(function(mutation) {
        [].forEach.call(mutation.addedNodes, function(node) {
            if (node.nodeType != 1) return; // skip text and comment nodes, which have no matches()
            var found = node.matches(selector) ? [node] : node.querySelectorAll(selector);
            [].forEach.call(found, function(baddie) {
                switch (baddie.localName) {
                    case 'a':
                        baddie.outerHTML = baddie.outerHTML.replace(findBad, fixBad);
                        break;
                    case 'img':
                        // fix img
                        break;
                    case 'table':
                        // fix table
                        break;
                }
            });
        });
    });
});
observer.observe(document, {subtree:true, childList:true});

Currently the [].forEach.call(nodelist, function) trick should be used because Chrome hasn't yet added a forEach iterator to NodeList. Array.prototype.forEach.call is slightly faster, but both are slower than a plain for loop.
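
To show that the two styles visit the same items, here is a small side-by-side sketch. It assumes nothing about the page: a plain array-like object stands in for a NodeList so it runs outside a browser, and the names nodelist, viaForEach and viaLoop are only for illustration.

```javascript
// A plain array-like object stands in for a NodeList so this runs outside a browser.
var nodelist = {0: 'a', 1: 'img', 2: 'table', length: 3};

// The borrowed-method trick: run Array's forEach on something that is not an Array.
var viaForEach = [];
Array.prototype.forEach.call(nodelist, function (item) {
  viaForEach.push(item);
});

// The plain for loop equivalent.
var viaLoop = [];
for (var i = 0, len = nodelist.length; i < len; i++) {
  viaLoop.push(nodelist[i]);
}
```

Which one is faster on a given page is best measured with the profiler rather than assumed.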

adambot

Posted: 20.05.2015

Ahhh, I completely understand now!! Thank you so much for all your help (basically teaching me and doing most of the work).

Greasy Fork
