
HIT Scraper WITH EXPORT

Snag HITs.

As of 22-07-2015. See the latest version.

Author
Tjololo
Ratings
0 0 0
Version
2.4.5
Created
02-06-2014
Updated
22-07-2015
License
No license
Applies to

Hit_scraper with the hit export script added, CUZ IT'S MORE CONVENIENT! Here are a few guides, one on mturkforum.com and one on mturkgrind.com. There is also a screencast video located here where I ramble on about hit scraper for a time, giving a good overview of all the functionality it offers.

Additionally, there is a script by clickhappier located here that uses scraper's blocklist to block hits on the regular mturk search results interface as well.

v2.0 Major Update

I've been doing a lot of work (with copious help from clickhappier and others), and I figure enough's been done to go for a major version release. You'll find the changelog down below, but a rundown of all the major features follows.

What is Hit Scraper WITH EXPORT and why should I download it?
Hit Scraper WITH EXPORT (hereafter referred to as HS) at its core is really just a different way of looking at mturk pages. Its purpose was to take the place of several other scripts people were using every day, and to make a unified, easy-to-understand interface that everyone can use with minimal training. That being said, HS still has a ton of features to enhance your turking and make your life a lot easier.

How do I use HS?
To use HS, you need to visit This URL. Bookmark it so you don't forget. If HS doesn't load right away, try refreshing a few times. If it still doesn't load, there might be an issue and I'll try to see if I can figure it out.

When you get to that page, you'll see the main interface. This should be pre-populated with some default data. You can start going right away by clicking "Start", or you can customize it as shown below.


| Option | Default | Description |
| --- | --- | --- |
| Auto-refresh delay | 0 | How many seconds will elapse before the page starts scraping again. 0 is manual scrape only. E.g. 10 = scrape 10 seconds after the last scrape finished |
| Pages to scrape | 3 | How many pages you want HS to look at |
| Correct for skips | No | If you have a lot of hits on your blocklist, you might end up blocking a lot of hits. "Correct for skips" will search additional pages to "fill up" your results. If correct for skips is off, it will ONLY search the number of pages you select in "pages to scrape" |
| Minimum batch size | 100 (not specified) | For searching for batches. This does not matter unless you sort by most available |
| Minimum reward | None | Minimum dollar reward you want HS to show. E.g. 1 = don't show hits under $1; .2 = don't show hits under $0.20 |
| Qualified | Yes if logged in, No if logged out | If yes, only show hits you're qualified for. If no, show all hits regardless of whether you qualify |
| Masters Require | No | If yes, only show masters hits. If no, show all hits |
| Masters Show | Show | If set to "Show", it will show both masters and non-masters hits (not applicable if you don't have masters and have "qualified" checked). If set to "Hide", it will remove masters hits from the results |
| Sort types | Latest | Latest sorts by time created, newest first. Most available is by number of hits available, most first. Reward is by monetary reward, highest first. Title is alphabetical by title, A first |
| Invert | No | Reverses the order of the sort type. Latest = oldest hits first; Most available = fewest hits available first; Reward = lowest reward first; Title = Z first |
| New HIT Highlighting | 300 | Hits that are new to the scrape show up in bold. This number determines how long they will remain that way, in seconds |
| Sound on new hit | No | Play a sound when a new hit is discovered. The sound is only played once for each "screen" of new hits. For example, if two new hits are found in one scrape, the sound will play once. If one of the hits goes away, but the other remains, and it's still new based on the New HIT Highlighting number, the sound will not play because it already has |
| Ding | Ding | Which sound you want to hear: the old-style "Ding", or the new-style "Squee" (best pony approved) |
| Sort by TO | No | Sorts hits by TO with lowest numbers on top, highest numbers on the bottom, and "No TO" requesters at the very bottom. I've tried altering the order of this, but I can't; it doesn't seem to work properly. If I get it working, I'll add it in another version |
| Min pay TO | None | Allows you to set a minimum "Pay" TO threshold (0-5). Any hits with a "Pay" TO below that threshold will be hidden. You can click on the "Show hits below TO threshold" button to see them. This button only appears if you're using this option. See important note below |
| Hide no TO | No | Hides requesters who do not have a TO (not recommended). See important note below |
| Disable TO | No | Turns off TO checking altogether; the TO Pay column will report "TO Disabled". Used for when TO is blocked; should speed things up a bit by not querying the TO server. This will invalidate any other TO configurations |
| Search Terms | None | Allows you to search mturk for given terms. This is the same as searching the mturk interface. All results will contain one or more of your terms |
| Use includelist | No | Allows you to only show requesters on your "include list". You must have an include list set before using this option or you will get no results. It will do normal searches, but any requester NOT on your include list will be ignored |
| Use blocklist | Yes | Enables/disables the blocklist. If you are not using the blocklist, any hits that WOULD have been blocked are outlined in red |
| Highlight Includelisted | No | Adds a highlight to any requester on your include list even when "use includelist" is not checked |
| Start | Button | Starts scraping |
| Hide Settings | Button | Hides everything above the buttons to give you more room. It's a toggle, so clicking it once will hide, once more will show |
| Edit Blocklist | Button | Opens the blocklist for manual editing if you need to remove a name or something. Blocklist and include list items are delimited by the ^ symbol |
| Edit Includelist | Button | Opens the include list for manual editing to add or remove requesters. Blocklist and include list items are delimited by the ^ symbol |
| Show TO-hidden hits | Hidden Button | See Min pay TO |
| Stopped | Status message | Shows you the status of hit scraper: whether it's stopped, scraping, running, waiting, etc. |
| Status messages | Status message | Very "dumb" status message indicator attempting to shed some light on why some things work and others don't. Also why hit scraper's doing something it "shouldn't be" |




Some of the elements in the settings list have informative mouseover text as well.
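To make the ^-delimited blocklist format a bit more concrete, here's a rough sketch of how such a list could be parsed and applied. This is not HS's actual code: the function names are made up, and it assumes matching is exact and case insensitive against either the requester name or the title. "Acme, Inc." is an invented requester, included only to show that a ^ delimiter copes with commas in names.

```javascript
// Hypothetical helpers; names are illustrative, not taken from the HS source.
const DELIMITER = "^";

// Turn the stored string into a list of lowercase, trimmed entries.
function parseList(stored) {
  return stored
    .split(DELIMITER)
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);
}

// Turn a list of entries back into the stored string.
function serializeList(entries) {
  return entries.join(DELIMITER);
}

// Assumption: a HIT is blocked when its requester OR its title is on the list.
function isBlocked(hit, blocklist) {
  return (
    blocklist.includes(hit.requester.toLowerCase()) ||
    blocklist.includes(hit.title.toLowerCase())
  );
}

const blocklist = parseList("oscar smith^Crowdsource^Acme, Inc.");
const hits = [
  { requester: "CrowdSource", title: "Categorize a query" },
  { requester: "Acme, Inc.", title: "Short survey" },
  { requester: "Some Lab", title: "Transcribe a receipt" },
];

// Only the HIT from "Some Lab" survives the filter.
const visible = hits.filter((hit) => !isBlocked(hit, blocklist));
console.log(visible.map((hit) => hit.requester));
```

An include list could reuse parseList the same way, keeping only hits whose requester IS on the list.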

The hit table comes under the status information. It's laid out like so:


| Column | Links to | Description | Mouseover |
| --- | --- | --- | --- |
| Requester | Requester Page | Shows the requester name and links to their page. R and T buttons allow for blocking Requester and Title respectively | None |
| Title | Hit preview page OR requester page | Shows the hit preview page if one can be created/viewed, OR the requester page if one cannot. Will note if the requester link is substituted. VB and IRC buttons open the hit exporter for forums and IRC respectively | Description of hit |
| Reward | None | Shows how much the hit pays | None |
| HITs Available | None | Shows how many hits are available at the time the page was scraped | None |
| TO pay | Requester TO page | Shows the TO value for "pay" for that requester | Shows all TO ratings, number of reviews, and number of TOS flags for that requester |
| Accept HIT | Requester Preview and Accept (PANDA) page OR requester page | Similarly to the "title", it shows the panda link OR the requester page. See "title" to know if the requester link is substituted | None |
| M? | None | N means a non-masters hit, Y means a masters hit | Shows all qualifications for the hit |
| R | HitDB search for requester OR nothing | If green, you've done a hit that matches that requester name, click it to view. If red, you haven't, and clicking does nothing | None |
| T | HitDB search for title OR nothing | If green, you've done a hit that matches that title, click it to view. If red, you haven't, and clicking does nothing | None |
| Not Qualified | None | Shows hits you are not qualified for. Only shows up if there are non-qual'd hits | None |
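For anyone curious how the sort types and "Invert" relate to the columns above, here's a small in-memory sketch. It's an illustration only: the field names (created, available, reward, title) are assumptions, "Latest" is taken to mean newest first (consistent with Invert giving oldest first), and nothing here says whether HS sorts rows itself or asks mturk for the order.

```javascript
// Hypothetical comparator table; field names are assumptions for illustration.
const comparators = {
  latest: (a, b) => b.created - a.created,            // newest first
  mostAvailable: (a, b) => b.available - a.available, // most HITs first
  reward: (a, b) => b.reward - a.reward,              // highest reward first
  title: (a, b) => a.title.localeCompare(b.title),    // A first
};

// Returns a sorted copy; "invert" simply reverses the chosen order.
function sortHits(hits, sortType, invert) {
  const sorted = [...hits].sort(comparators[sortType]);
  return invert ? sorted.reverse() : sorted;
}

const sample = [
  { title: "B task", reward: 0.5, available: 10, created: 100 },
  { title: "A task", reward: 1.0, available: 500, created: 300 },
  { title: "C task", reward: 0.1, available: 50, created: 200 },
];

const titles = (list) => list.map((hit) => hit.title).join(", ");
console.log(titles(sortHits(sample, "latest", false))); // A task, C task, B task
console.log(titles(sortHits(sample, "reward", true)));  // C task, B task, A task
console.log(titles(sortHits(sample, "title", true)));   // C task, B task, A task
```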





IMPORTANT NOTE

When you're using the advanced TO functions (threshold and hiding no TO), you may still hear "dings" for new hits that are hidden by TO. This is normal: the hits are still there, and you can still view them by clicking the show TO-hidden items button. Unfortunately, the way the script is written, it's extremely difficult to stop items that are hidden by TO from "dinging" without a substantial rewrite, and I'm not about to do that cuz it'd just break everything. So if you hear a ding and there's nothing there, try showing the TO items. If there's still nothing there, enjoy the ding noise. Maybe make a song out of it, do a short little beatbox or something.
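Here's a loose sketch of the "one ding per screen of new hits" behavior, which also shows one plausible reason TO-hidden hits can still ding: newness is decided before any TO filtering happens. The class, its fields, and the idea of keying hits by an id are all assumptions for illustration, not how HS is actually written.

```javascript
// Hypothetical tracker; names are illustrative, not from the HS source.
class NewHitTracker {
  constructor(highlightSeconds) {
    this.highlightMs = highlightSeconds * 1000;
    this.firstSeen = new Map(); // HIT id -> time it was first scraped (ms)
  }

  // Process one scrape. hitIds holds every scraped HIT, including ones that
  // a TO threshold would later hide, so hidden hits can still cause a ding.
  processScrape(hitIds, now) {
    let ding = false;
    for (const id of hitIds) {
      if (!this.firstSeen.has(id)) {
        this.firstSeen.set(id, now);
        ding = true; // at most one ding per scrape, however many new HITs
      }
    }
    // A HIT stays bold until the highlight window has elapsed.
    const bold = hitIds.filter(
      (id) => now - this.firstSeen.get(id) < this.highlightMs
    );
    return { bold, ding };
  }
}

const tracker = new NewHitTracker(300);
const first = tracker.processScrape(["a", "b"], 0);  // two new hits, one ding
const second = tracker.processScrape(["b"], 10000);  // "b" still bold, no ding
const third = tracker.processScrape(["b"], 400000);  // window elapsed, not bold
console.log(first.ding, second.ding, third.bold.length); // true false 0
```

In this sketch a hit that disappears and comes back later would not ding again; whether HS behaves that way isn't stated above.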

v2.1: Fixed a bug with the IRC export
v2.2: Added failover to TO with IRC export
v2.3: Extended failover to full TO compatibility, added failover for shorteners in IRC exporter
v2.3.1: Fixed a derp
v2.3.2: Fixed another derp
v2.3.3: Switched main TO server from Miku's to main TO because Miku's site is down
v2.3.4: Fixed bug where TO incorrectly reported down during export for requesters with no TO
v2.3.5: Bunch of bug fixes, some optimizations, added ability to ignore TO altogether. Also matched changes to clickhappier's irc script
v2.3.6: Clickhappier update! He fixed a bug with the most available thing. Added "Highlight Includelisted", which adds a highlight to any requester on your includelist when regularly scraping. Some other minor verbiage changes as well. Thanks clickhappier!
v2.3.7: Clickhappier comes through again. To keep things short: Several bug fixes, some backend changes, added some mouseover text/descriptive text. Made the blocklist/includelist work/save better. Some graphical changes (table always fits the window and isn't squished, etc)
v2.3.7.1: Minor bug fix
v2.3.7.2: Fixed some CSS issues...mostly anyway
v2.3.7.3: Fixed a derp
v2.3.7.4: Fixed to work with mturk changing stuff behind the scenes, fix credit to clickhappier

That should give you the info you need to get started. Below is the changelog from v1.6 to v2.0:

Changelog:

  • Added in "use blocklist" feature to use/ignore blocklist
  • Fixed a MAJOR bug that's been around forever where using HS when logged out resulted in unpredictable hit linking. Hopefully that's fixed for good
  • Separated out the new "Squee" and the old "Ding" so that people can pick-and-choose
  • Added in "save state", where HS will remember the values you had entered last time you hit "start", and bring them up next time you load. Note: You may need to set up your defaults on your first run
  • Not using blocklist results in hits that would have been blocked being highlighted in red
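The "save state" idea above can be sketched roughly like this. It assumes a storage object shaped like the browser's localStorage (getItem/setItem); the storage key and setting names are invented, and HS may well store things differently. The default values mirror the ones listed in the settings table.

```javascript
// Hypothetical key and setting names; HS's real ones may differ.
const STORAGE_KEY = "hs_settings";
const DEFAULTS = { refreshDelay: 0, pagesToScrape: 3, useBlocklist: true, sound: "Ding" };

// Called when the user hits "Start": remember the current values.
function saveSettings(storage, settings) {
  storage.setItem(STORAGE_KEY, JSON.stringify(settings));
}

// Called on page load: restore saved values, falling back to the defaults.
function loadSettings(storage) {
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return { ...DEFAULTS };
  try {
    return { ...DEFAULTS, ...JSON.parse(raw) };
  } catch (err) {
    return { ...DEFAULTS }; // corrupted data: start over with defaults
  }
}

// In-memory stand-in with the same getItem/setItem shape as localStorage,
// so the sketch runs outside a browser.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
  };
}

const storage = createMemoryStorage();
saveSettings(storage, { ...DEFAULTS, pagesToScrape: 5, sound: "Squee" });
console.log(loadSettings(storage).pagesToScrape); // 5
```

In a browser you would pass window.localStorage instead of the in-memory stand-in.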

Major thanks goes out to clickhappier, my main bug tester/fixer/motivator/pain-in-my-side-trying-to-get-me-to-fix-stuff, for keeping me working on this even when I was ready to quit turking altogether, and for learning along with me about all of this stuff :). Also Kerek, who contributed the code to hide the settings, as well as a lot of the back-end stuff.

Older updates and such are below.

v1.6:

  • Ponified the "ding" (most important)
  • Changed "Sort Types" to a dropdown instead of radio buttons
  • Split "masters" into "require" and "show". Require will require that all hits will be Masters (same as checking the box on mturk). "Show" will elect to show masters hits or hide them if they come up in the search (for when logged out, thanks to Kerek for the suggestion)
  • Added "hide" button to hide the interface (thanks to Kerek for the code and suggestion)
  • Added a very preliminary sort by TO, due to extremely popular demand. See below for notes
  • Added a very preliminary minimum TO, due to extremely popular demand. See below for notes.

Notes:

  1. For the "sort by TO", again it's very preliminary. It will sort the table as the TO results come in, which can result in the table changing after it's been populated if your system is slow. Keep that in mind, there's no way around that if you want to sort by TO.
    It also places the requesters with no TO data on the bottom. I couldn't put them on the top for some reason, so they're down there.
  2. For the "Minimum TO": It operates as you'd expect. Put in a number between 0 and 5, and it will remove all items which are below that number, except for "No data" or "TO down", which are not removed. If you want to see the items again, click the "Show hits below TO threshold" button. That will bring them back, but won't hide them again. I didn't think that would be necessary.
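The two notes above can be sketched as follows. This is an illustration under assumptions, not the script's code: each hit is assumed to carry a pay value that is a number from 0 to 5, or null when there is no TO data or TO is down. Hits below the threshold are split off rather than deleted so a "show" button can bring them back, and the sort is ascending with no-data requesters last, as the settings table describes for "Sort by TO".

```javascript
// Assumption: hit.pay is a number (0-5), or null for "No data" / "TO down".

// Split hits into those shown and those hidden by the minimum pay TO.
// Hits without TO data are never hidden by the threshold.
function splitByMinPay(hits, minPay) {
  const shown = [];
  const hidden = [];
  for (const hit of hits) {
    if (hit.pay !== null && hit.pay < minPay) hidden.push(hit);
    else shown.push(hit);
  }
  return { shown, hidden };
}

// Sort a copy by pay TO, lowest first, with no-data requesters at the bottom.
function sortByPay(hits) {
  return [...hits].sort((a, b) => {
    if (a.pay === null && b.pay === null) return 0;
    if (a.pay === null) return 1;
    if (b.pay === null) return -1;
    return a.pay - b.pay;
  });
}

const scraped = [
  { requester: "A", pay: 4.5 },
  { requester: "B", pay: 1.2 },
  { requester: "C", pay: null },
  { requester: "D", pay: 3.0 },
];

const { shown, hidden } = splitByMinPay(scraped, 3);
console.log(shown.map((h) => h.requester).join(""));              // ACD
console.log(hidden.map((h) => h.requester).join(""));             // B
console.log(sortByPay(scraped).map((h) => h.requester).join("")); // BDAC
```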

There are still apparently some issues with items duplicating and such. I can't seem to reproduce these issues, so I can't test for them. They're fringe cases regardless, as far as I can tell. They shouldn't really matter to the majority of people, so I'll solve them as they come up, but I'm not gonna spend a huge amount of time troubleshooting.


v1.5:

  • Added a new column, "M?", which shows if hits are Masters or not (more useful for those of us with Masters, but useful for both)
  • Added qual listing, mouseover the "M?" column to see the quals for that hit.
  • Added TO listing, mouseover TO link to see all the ratings
  • Minor tweaking, some verbiage fixes, stuff you probably won't see/notice

v1.5.1: Fixed "notqualified" link processing
v1.5.2: Hopefully fixed the firefox issue, changed the way values are stored/recalled. This will have the unfortunate effect of clearing everyone's blocklist, but hopefully this will not change in the future.
v1.5.3: Fixed storage to properly handle requesters with commas in their names, I didn't realize.
v1.5.4: Updated to fix it not scraping when you're logged out, because apparently that's a thing people do.
v1.5.5: Updated to fix 1.5.4 again, because Amazon changed the way links work. Now, if you're not qualified (IE logged out), clicking the title (and/or exporting) should give you a link to the requester page, instead of a non-functioning link.
v1.5.6: Fixed 1.5.5 again, hopefully now it'll work when you're logged out again. Made it so "qualified" is not default when not logged in, hopefully fixed "duplicate" issue (where a hit will sometimes show up twice)
v1.5.7: https://www.youtube.com/watch?v=ipADNlW7yBM
v1.5.9: added in IRC exporter based on clickhappier's and cristo's work, changed colors of M? columns to make a little more sense.

v1.4:

  • Ability to block by title and requester (so you can block individual hits you've done)
  • Ability to view only certain requesters with Include list (Must add requesters to list individually for the moment, if there's a desire I'll add in a button like the blocklist)
  • Ability to make scraper make a "ding" noise when it finds new work.
  • Tied in with HitDB so clicking the R/T at the end will show you the work you've done for that requester (only for green items, might not work on firefox)
  • Added A-Z sort
  • Added inverse sort
  • Added checkbox for "Correct For Skips" (mouseover the checkbox to see what it does, or try it out! On by default, will change to off by default if necessary).
  • Re-organized a bit of the header section with some | characters to separate things
  • Added some helpful "status" messages to explain some things a bit (IE why it's scraping more than the pages you told it to)
  • Moved the status messages to below the header
  • Made it pull the blocklist every time you run it so you can have multiple instances and they'll work together properly.

v1.4.1: Initial theming support. Put all the color values up at the top of the code, with descriptors, so they can be changed easily

v1.4.2: Nothing really. Just a bit for some of our...friends...You shouldn't see anything different really.

v1.4.3: Reverted v1.4.2

v1.4.4: Added another descriptive status message.

Older update logs:

Updated to fix an issue with the export not getting the proper quals for the proper hit.

Updated so it wouldn't clobber the normal hit export script

Updated to fix a bug, and now the requester list is case insensitive.

Added description as mouseover text for title link. Hold the mouse over the title to see it.

1.3.0.10: Added ability to block requesters dynamically, and revert to the blocklist set in the code. Default blocklist contains:
"oscar smith", "Diamond Tip Research LLC", "jonathon weber", "jerry torres", "Crowdsource", "we-pay-you-fast", "turk experiment", "jon brelig"

To clear any of those from the default, just remove them from the code (line 18, remove the " marks and comma as well). To add a requester to the block list, click the "BLOCK" button next to their name. To reset to default, click "Reset blocklist" at the top.

1.3.0.11: Added a line (line 24) to change the hit export to text symbol to whatever you'd like.

1.3.0.12: Changed such that the "reset blocklist" is now a confirm dialog in case you misclick.

1.3.0.13: Updated an error with no TO hits.

1.3.0.14: Initial method of editing the existing blocklist to add/remove requesters manually. I'd like a better way of doing it, expect that to be coming.

1.3.0.15: Added "hits available" to default template per request.

1.3.1.0: Major release because of all the changes so far. This one has logical updating of the block list. What's that mean? It means when you click "Edit Blocklist" you'll get a textarea you type in. Remove requesters, add requesters, whatever you'd like. Then just click save and it saves.

1.3.1.1: Updated with Miku's new API link.

1.3.1.2: Fixed correct for skips to accurately reflect the pages you select.
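For reference, "Correct for skips" as described in the settings table could work something like the sketch below. Everything here is an assumption made for illustration: fetchPage is a hypothetical function returning the hits on one results page, the target of pagesToScrape times hitsPerPage is a guess at what "fill up" means, maxPages is an invented safety cap, and the real script fetches pages asynchronously rather than in a plain loop.

```javascript
// Hypothetical scrape loop; parameter names and the "fill up" target are guesses.
function scrape({ fetchPage, pagesToScrape, correctForSkips, isBlocked,
                  hitsPerPage = 10, maxPages = 20 }) {
  const target = pagesToScrape * hitsPerPage;
  const results = [];
  let page = 1;
  while (page <= maxPages) {
    const pastRequested = page > pagesToScrape;
    // Go beyond the requested pages only when correcting for skips
    // and the results are not yet "filled up".
    if (pastRequested && (!correctForSkips || results.length >= target)) break;
    const hits = fetchPage(page);
    if (hits.length === 0) break; // ran out of search results
    for (const hit of hits) {
      if (!isBlocked(hit)) results.push(hit);
    }
    page += 1;
  }
  return { results, pagesScraped: page - 1 };
}

// Fake search results: 5 pages of 2 hits each, one blocked hit per page.
const fakeFetchPage = (page) =>
  page <= 5
    ? [{ id: `${page}a`, blocked: true }, { id: `${page}b`, blocked: false }]
    : [];

const common = {
  fetchPage: fakeFetchPage,
  pagesToScrape: 2,
  isBlocked: (hit) => hit.blocked,
  hitsPerPage: 2,
};
const plain = scrape({ ...common, correctForSkips: false });
const corrected = scrape({ ...common, correctForSkips: true });
console.log(plain.results.length, plain.pagesScraped);         // 2 2
console.log(corrected.results.length, corrected.pagesScraped); // 4 4
```

With the option off, only the two requested pages are read; with it on, the loop keeps going until the blocked hits have been made up for.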