Eza's Tumblr Scrape

switchaud

Posted: 22-07-2017

Photosets are not always scraped properly

Sometimes photosets don't show up in the scrape results. I suspect it's due to differences between themes and how Tumblr uses iframes.

I'm working on a script to scrape them too, but I'm having trouble getting it right.

Ezalias (Author)

Posted: 26-07-2017

Do you have any specific examples? That'd help narrow down where it is and isn't working. (Message me directly if you'd prefer.)

switchaud

Posted: 27-07-2017

I could neither reproduce the problem nor find the original post that didn't show up (I was scripting and debugging at the same time, so it was fairly nebulous).

Perhaps the post did show up, just not on the page it was supposed to be on (posts sometimes seem to shift a bit from page to page). So it could be a false alarm.

I'll keep an eye out, but perhaps I was just wrong and there's no problem.
Sorry about that. :neutral:

Ezalias (Author)

Posted: 27-07-2017

I found some missing examples and fixed it anyway. links_from_page() grabbed anything starting with "http", but not relative links like "/post"-etc. More photosets now work in the whole-site scraper and original image browser. (The fetch-every-post image browser worked fine already. Somehow.)
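To illustrate the kind of fix described above, here is a minimal sketch of resolving both absolute and relative links against the page URL. It is not the script's actual links_from_page(); the helper name resolveLinks and the example URLs are hypothetical.

```javascript
// Hypothetical helper, not the script's real links_from_page().
// Takes raw href strings plus the URL of the page they came from
// and returns absolute http(s) URLs.
function resolveLinks(hrefs, pageUrl) {
  const resolved = [];
  for (const href of hrefs) {
    try {
      // new URL(href, base) resolves "/post/123/" against the base
      // and leaves already-absolute URLs as they are.
      const url = new URL(href, pageUrl);
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        resolved.push(url.href);
      }
    } catch (e) {
      // Skip strings that cannot be parsed as a URL at all.
    }
  }
  return resolved;
}
```

A filter that only keeps strings starting with "http" would drop "/post/123/" here, whereas resolving against the page URL keeps it.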

switchaud

Posted: 27-07-2017

Wow, thank you! :)

switchaud

Posted: 27-07-2017

Edited: 27-07-2017

Well, after trying the new version, it's not scraping anything anymore.

I get a bunch of errors like the following in the console:

Fetch API cannot load https://www.tumblr.com/safe-mode?url=http://venomous-sausage.tumblr.com/page/2/mobile. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://venomous-sausage.tumblr.com' is therefore not allowed access. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
Fetch API cannot load https://www.tumblr.com/safe-mode?url=http://venomous-sausage.tumblr.com/page/10/mobile. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://venomous-sausage.tumblr.com' is therefore not allowed access. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
Fetch API cannot load https://www.tumblr.com/safe-mode?url=http://venomous-sausage.tumblr.com/page/100/mobile. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://venomous-sausage.tumblr.com' is therefore not allowed access. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
Fetch API cannot load https://www.tumblr.com/safe-mode?url=http://venomous-sausage.tumblr.com/page/1000/mobile. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://venomous-sausage.tumblr.com' is therefore not allowed access. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
Fetch API cannot load https://www.tumblr.com/safe-mode?url=http://venomous-sausage.tumblr.com/page/10000/mobile. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://venomous-sausage.tumblr.com' is therefore not allowed access. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.
Fetch API cannot load https://www.tumblr.com/safe-mode?url=http://venomous-sausage.tumblr.com/page/100000/mobile. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://venomous-sausage.tumblr.com' is therefore not allowed access. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

Googling around, it seems it could be fixed? https://github.com/github/fetch/issues/301 https://stackoverflow.com/questions/36878255/allow-access-control-allow-origin-header-using-html5-fetch-api

Adding {mode: 'no-cors'} to the fetch request, the errors don't occur but the response can't be read (it yields "NaN").
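For context on why 'no-cors' silences the error without helping: a cross-origin fetch in that mode returns an "opaque" response, whose status is 0 and whose body is not exposed to scripts, so anything parsed from it comes out empty (presumably the source of the "NaN" here). The sketch below shows one way to detect that case before trying to read the body; the helper name canReadBody is hypothetical.

```javascript
// Hypothetical helper: reports whether a fetch Response is worth reading.
// With { mode: 'no-cors' }, a cross-origin response has type 'opaque',
// status 0 and ok === false, and scripts cannot see its body.
function canReadBody(response) {
  return response.type !== 'opaque' && response.ok === true;
}

// Possible use inside a scraper (url is whatever page is being fetched):
//   fetch(url, { mode: 'no-cors' }).then(r => {
//     if (!canReadBody(r)) { console.warn('Opaque or failed response for', url); return; }
//     return r.text();
//   });
```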

I've also managed to find the post that wasn't showing up in the results last time: on page 4 (currently) http://shittyhorsey.tumblr.com/page/4, the post http://shittyhorsey.tumblr.com/post/161149073066/ didn't show up in the scraped results.

[WARNING: highly Not Safe For Work !!! ]

Ezalias (Author)

Posted: 28-07-2017

The script works as usual on most pages - those two tumblrs are the first I've seen fail this way. (E.g. abbydraws, lizardlicks, deshmond, and notjess all show up fine.)

The fact it can't even count pages is mindboggling. Page counts use /mobile pages. Those are Tumblr-standard.

Any attempt to fetch from those domains on those domains fails, because it magically becomes a safe-mode URL. Even fetch( window.location.href ) fails. Because of the same-origin policy, I don't think there's anything I can do. Current browsers literally cannot access those pages. Any blog where Tumblr decides to play these stupid games with HTTP is unreadable. Bastards!

switchaud

Posted: 29-07-2017

Even weirder, these two tumblrs were working fine 5 days ago and now suddenly they don't.
I hope their sysadmins didn't flip a switch just because I scraped them a few times to test things back then... Either way, it doesn't seem to be related to your script update: I tested the 2 previous versions and I still get the errors. Very weird indeed.

Ezalias (Author)

Posted: 29-07-2017

Raiseshipseerve just stopped working. I had that tab open from several weeks ago and was a hundred pages in. This is definitely something new that Tumblr can break on a whim.

Welp. The original model for this script was a www.tumblr.com page where you entered a particular blog name, and apparently that's back on the table.

Petr Savelev

Posted: 03-08-2017

Edited: 03-08-2017

This is a Tumblr Safe Mode issue.
By default, the HTML5 fetch method doesn't send browser cookies and enforces CORS. So for non-HTTPS blogs the scraper fails: Tumblr redirects the request to an HTTPS page to set a cookie, and that redirect is blocked by CORS.
This can be fixed by changing the request to: fetch(url, { credentials: 'include' })
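A minimal sketch of how that change might be wrapped in the scraper, assuming (as described above) that sending cookies avoids the safe-mode redirect. The function name fetchPageText and the injectable fetchImpl parameter are hypothetical and exist only to make the sketch easy to try outside a browser.

```javascript
// Hypothetical wrapper around fetch that always sends the browser's cookies.
// fetchImpl defaults to the global fetch; passing a stand-in is only for testing.
async function fetchPageText(url, fetchImpl = fetch) {
  // credentials: 'include' asks the browser to attach cookies to the request.
  const response = await fetchImpl(url, { credentials: 'include' });
  if (!response.ok) {
    throw new Error('Request for ' + url + ' failed with status ' + response.status);
  }
  return response.text();
}
```

In a userscript this would be called as fetchPageText(pageUrl).then(html => ...), where pageUrl is whatever page the scraper is about to read.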

Ezalias (Author)

Posted: 07-08-2017

Oh, thank god. I'll update that ASAP. Thank you for recognizing the root problem.

Greasy Fork
