Comingsoon.net Data Scraping: May 2013

Thursday, 30 May 2013

Adult movie scraper (details+covers) -- v2.0.6

The CDUniverse scraper-script will import your adult collection with as much detail as possible, and get high resolution front and back cover artwork.

To install the scraper (thanks Mew):

    Download the .xml file (attachment at bottom of this post) to a location you remember.
    Open "MediaPortal - Configuration", go to the "plugins" and select "Moving Pictures" and "Config".
    Select the "Importer Settings" tab.
    In the "Data Sources:" section select the "Manually manage movie data sources" radio button.
    Click the "Movie Details Data Sources" button.
    In the popup click the arrow just to the right of the "+" button and pick "Add a New Data Source".
    Browse to the new .xml scraper file you downloaded in the first step and click OK.
    It should appear as 'CDUniverse.com' in the "Source" column. You may need to enable it by pressing the "+" button if it is greyed out, and move it to the position in the list that you prefer priority-wise.

Note: It is not uncommon for the adult movie industry to use titles that match an actual IMDb movie title very closely. If you do not put this scraper at the highest priority, you might never see its results. You can also fix this another way: go to the "About" tab -> "Advanced Settings" -> "Matching and Importing" -> "Minimum Possible Match Threshold" and set it to '0'. This causes every internet scraper to run, so disable all the ones you do not use. Placing this adult scraper at the very top is still needed for auto-approval to function, as only the first scraper is used for that. So you might want to run it like that for the initial import of your adult collection.

NEW: You can now prefix your filenames with '$$$$$XXX-'. This will freak out almost all of the other scraper-scripts, so that they will not give any results at all. That in turn will allow the CDUniverse scraper-script in a lower priority position to be the only one to come up with a match. You will see 'xXx ' prefixed during the search-node stage when this happens (green circle appears), but during the details-node stage this will then be corrected again (green circle gets the white checkmark).
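Adding the prefix by hand gets tedious for a large collection. The following is a minimal sketch of a batch rename, not part of the scraper itself: the function names, the list of video extensions, and the dry-run default are assumptions made for illustration. Only the prefix string comes from the post.

```python
from pathlib import Path

PREFIX = "$$$$$XXX-"  # the filename prefix described in the post


def prefixed_name(filename: str, prefix: str = PREFIX) -> str:
    """Return the filename with the prefix added, unless it is already present."""
    return filename if filename.startswith(prefix) else prefix + filename


def add_prefix(folder: Path, extensions=(".avi", ".mkv", ".mp4"), dry_run=True):
    """Plan (and optionally perform) renames for video files directly in folder.

    The extension list is an assumption; adjust it to your collection.
    Returns a list of (old_path, new_path) pairs.
    """
    plan = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            target = path.with_name(prefixed_name(path.name))
            if target != path:
                plan.append((path, target))
                if not dry_run:
                    path.rename(target)
    return plan
```

Calling `add_prefix(Path("your/folder"))` only reports the planned renames; pass `dry_run=False` once the plan looks right.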

Technical details on scraper:

    CDUniverse.com is used as the source because it yielded the best results.
    Both the front and back covers are obtained (you can switch between them from within the GUI), with the front cover set up as the default. (This is no longer possible :( and only a small front cover is obtained.)
    There is no support for Release Year to assist in finding a match, so to make the title results more relevant the recent ones are listed first, which hopefully helps the auto-approval rate of your collection. -- Adjusted in v1.0.2
    Blu-Ray movies are now supported (mainly for cover images) and are shown with (Blu-Ray) in the title. This will interfere with a title match, so DVD titles are still used by default. Either add the "(Blu-Ray)" string to your filenames, or simply overrule the drop down box selection to pick the BR version (after the importer is done, or it can lead to a crash).
    Folder names are now used to try to find a positive match if the filename is one of those cryptic acronyms from a scene group.

Known issue:

    CDUniverse adds 'DVD' to the end of every title. Don't know if it adds any other strings as my collection didn't span any of those. I have no clue how to fix the added postfix, so any help in this matter would be appreciated. -- Fixed in v1.0.1
    PID support will not be added, because then backdrop support would be gone. Instead rename your file(s) to match the title that a manual search on CDUniverse gives you, and if that still fails, then please provide me with the filename.

Changelog (July 13th 2012):

    v2.0.0 - Fixed search node to find results again after CDUniverse changed their HTML code once more. Adjusted site_id to be the 7-digit product number only to make it easier on Follw.it developers, but added code to retain backwards compatibility with the v1.x method. Also improved the cover node to obtain both front and back covers faster.
    v2.0.1 - Search node fixed again to compensate for changes made by CDUniverse.
    v2.0.2 - Internal tests on changes CDUniverse was working on, but kept undoing.
    v2.0.3 - Internal tests on changes CDUniverse was working on, but kept undoing.
    v2.0.4 - Seems CDUniverse finally decided to roll out the new HTML code, so this one fixes artwork, summary and auto-matches more titles.
    v2.0.5 - Added support for custom filename prefix '$$$$$XXX-', so that CDUniverse can be placed in 2nd or lower scraper-script position and still be the only one to find a match.
    v2.0.6 - Compensated for new HTML code and lost ability to get large front+back covers due to a referral block on their servers. Will have to settle for front cover thumbnail now.

Enjoy.

Source: http://forum.team-mediaportal.com/threads/adult-movie-scraper-details-covers-v2-0-6.74856/

Monday, 27 May 2013

Word of the Day: Scraper Website

A scraper website is a web site that copies content from another website. Occasionally, just a page is ‘scraped’ from another website and used illegally in another website.

Scraper sites often add their own ads to the copied web pages after deleting the ad code from the copied web pages. Often the scraper websites will hit on popular news stories and try to get placed on top ranked search results pages. Sometimes the pages are copied carelessly and contain broken links or incorrect directory paths to the photos and other graphics that are located on the original website’s server. When this happens, the photos or graphics are missing in the scraper site.
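One common way such missing images can arise is a relative path in the copied HTML: the browser resolves it against the scraper site's address instead of the original server's. The sketch below illustrates this with Python's standard `urljoin`; the two page URLs and the image path are hypothetical and not taken from any real site.

```python
from urllib.parse import urljoin

# Hypothetical URLs, for illustration only.
original_page = "http://original.example/news/2009/article.html"
scraper_page = "http://scraper.example/copied/article.html"

# Relative image path exactly as it appears in the copied HTML.
img_src = "../images/photo.jpg"

# On the original site the path points at a real file...
on_original = urljoin(original_page, img_src)

# ...but on the scraper site the same path resolves to the scraper's own
# server, where the photo was never uploaded.
on_scraper = urljoin(scraper_page, img_src)

print(on_original)
print(on_scraper)
```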

Occasionally the scraper website producer will change the page slightly to conform to other original parts of the scraper website. Following are two screen shots that illustrate scraping. The first website screen shot is an original article, "Elliptical vs. treadmill: Which machine really delivers?", published by the Daily Herald March 9, 2009. The second screen shot shows a scraper page from Middletown Gold’s Gym in Middletown, New York, "How does the elliptical really compare to the treadmill", published May 20, 2009. The text is almost identical. If you read the text, you can see that one of the names in the articles is attributed to different geographic locations. For example ‘Arlington Heights Personal Trainer Mark Bostrom’ is changed to ‘Gold’s Gym Personal Trainer Mark Bostrom.’ Some other names in the pages posted on the web are also changed similarly.

A scraper site’s use of information from other sites without permission is in violation of copyright law, unless the websites are public domain websites.

Source: http://www.arlingtoncardinal.com/2010/11/word-of-the-day-scraper-website/

Friday, 24 May 2013

Congressional Data Mining: Coming Soon?

By slipping a simple, three-sentence provision into the gargantuan spending bill passed by the House of Representatives last week, a congressman from Silicon Valley is trying to nudge Congress into the 21st Century. Rep. Mike Honda (D-Calif.) placed a measure in the bill directing Congress and its affiliated organs—including the Library of Congress and the Government Printing Office—to make its data available to the public in raw form. This will enable members of the public and watchdog groups to craft websites and databases showcasing government data that are more user-friendly than the government's own.

If the Senate passes the bill with the provision intact, citizens seeking information about Congress' activities—such as bill names and numbers, amendments, votes, and committee reports—won't have to rely on government websites, which often filter information, are incomplete, or are difficult to use. Instead, the underlying data will be available to anyone who wants to build a superior site or tool to sift through it. "The language is groundbreaking in that it supports providing unfiltered legislative information to the public," says Honda's online communications director, Rob Pierson. "Instead of silo-ing the information, and only allowing access through a limited web form, access to the raw data will make it easier for people to learn what their government is doing."

Successful, privately-created websites that provide the public with information about Congress' actions already exist. OpenCongress.org, GovTrack.us, Legistorm.com, and MAPLight.org all make legislative data available to the public in ways that are easier to navigate than Congress' primary web portal, a system called Thomas. Those sites currently get their data through techies who "scrape" Thomas and other government websites, which means they use bots to process the HTML and gather what is valuable. The process is labor-intensive and imprecise. "It's difficult to keep the data up to date, in some cases impossible, and occasionally there are errors in the data," says Josh Tauberer, the 26-year-old who runs GovTrack.us and does lots of the "scraping" that others use. "This could all be fixed by a bulk data download."

Tauberer expects that the availability of additional and easier-to-use congressional data will spur innovation. "You can expect to see other sites spring up doing new and interesting things with the information." He anticipates charts, graphs, and maps that represent congressional goings-on visually—"ways of visualizing the congressional process that we couldn't yet imagine." Honda, with his Silicon Valley roots, expects that developers and coders will quickly outpace the government's efforts to date. "We hope that we can learn from the wisdom of crowds," says Pierson.

There are government agencies that already provide massive amounts of data via databases. The Census Bureau provides huge amounts of information in raw form, allowing academics, statisticians, and think tank scholars to comb through it in any way they please. The Federal Election Commission publishes unedited data on campaign contributions, giving rise to sites like OpenSecrets.org, which allows the public to see who is donating to whom, and allows journalists and watchdogs to investigate the influence of money in politics.

"In our Web 2.0 world, we can empower the public by providing them with raw data that they can remix and reuse in new and innovative ways," Honda told Mother Jones in a statement. (Disclosure: In the summer of 2002, I briefly worked as an intern in Honda's district office.) Honda's provision, however, pertains only to legislative data. Federal departments like the Environmental Protection Agency, the Food and Drug Administration, and the Department of Energy have reams of data that political scientists, economists, and researchers of all stripes would love to get their hands on. Many who work at the intersection of technology, politics, and transparency believe that the key player in broadening Honda's effort to include the executive branch will be Vivek Kundra, the former Chief Technology Officer of the District of Columbia who was named Obama's Chief Information Officer on Thursday. According to the National Journal's Tech Daily Dose, Kundra "told reporters Thursday he will launch data.gov, a Web site intended to 'democratize data' by giving the public raw feeds of information from a range of agencies."

John Wonderlich, the policy director at the Sunlight Foundation, which has created or funded several tools that make government data easier to analyze, is holding out hope that the president's Open Government Directive, which is due at the end of May, will further address the issue of data availability. He applauds Honda for putting Congress, at least, on the right track. "Without Honda's attention to this issue, congressional level attention to bulk data access would be unlikely," he says. "We're happy to see this first step."

Source: http://www.motherjones.com/politics/2009/03/congressional-data-mining-coming-soon

Friday, 17 May 2013

Coming Soon On Saturday B Movie Reel – Black Scorpion

Coming up on the podcast in the next couple of weeks, we’ll be covering a title from our vintage series when we discuss the Roger Corman-produced film Black Scorpion (1995).

Here’s the description…

As a street-pounding undercover cop, Darcy Walker (Joan Severance) must confront slime balls every day… on both sides of the law. When her father is murdered, Darcy gives up on the establishment and transforms herself into the vigilante heroine Black Scorpion. Blasting through the city in a souped up Porsche Scorpionmobile, she searches for the key to her father’s murder. Along the way, she discovers a conspiracy of the darkest kind. Can she stop her nemesis, the Breath-taker, before he destroys the city?

You can purchase Black Scorpion on DVD here.

We’ll be discussing this movie on an upcoming episode of our Saturday B Movie Reel podcast.

We have a Fans of Syfy Original Movies Facebook group if you want to discuss the Syfy movies and other fun scifi/fantasy B movies with fellow fans. We also have a Saturday B Movie Reel Facebook fan page if you want to keep up on all our activities.

You can also follow us on Twitter as @SatBMovieReel.

Source: http://tuningintoscifitv.com/2013/05/08/coming-soon-on-saturday-b-movie-reel-black-scorpion/

Monday, 6 May 2013

Microfinance Data Scraping from MIX

When DataKind held our first DataDive last October in NYC, we had the pleasure of working with MIX as one of our non-profit partners. Volunteer data scientists and MIX staff worked together to dig into financial service data in Africa using web scrapers and found a wealth of helpful information. We’re thrilled to share a guest post from our partners at MIX about the entire process. Enjoy!

The poor have complex and sophisticated financial lives, but we lack data on many of the services that they use to meet basic needs. Even very basic data can yield insights on how to expand services or whether providers are targeting the right needs.

A wealth of data on financial institutions is publicly available, but is ‘hidden’ in different places and ‘locked up’ in hard-to-use formats. If we want to build a realistic picture of the financial options available to the poor, we have to look for ways to find and unlock this data.

There are two basic ways that have been used to measure the landscape of financial services in the past:

Send out a questionnaire: this is easy for the surveyor, but a burden on the respondents. It’s hard to get much breadth or depth without strong incentives (or a mandate) to report.

Hire surveyors on the ground: this may be easier for the respondents, but is costly and time consuming and also hard to repeat.

We took a third path: web scraping. A web scraper is a computer program that visits a website, finds the particular data of interest, and saves the data in a more structured format. Web scrapers can extract data from existing websites without placing any burden on respondents or surveyors. Additionally, the scraper scripts can be shared and repeated and improved by others.
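As a minimal sketch of that definition, the example below parses a branch listing and saves it as CSV using only the Python standard library. The HTML markup, the field names, and the sample branches are hypothetical and made up for illustration; they are not taken from the scrapers built for MIX, and a real scraper would first download the page over HTTP.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical branch-listing markup (assumption, not a real site).
SAMPLE_HTML = """
<table id="branches">
  <tr><th>Branch</th><th>Town</th></tr>
  <tr><td>Main Street</td><td>Nairobi</td></tr>
  <tr><td>Market Square</td><td>Kigali</td></tr>
</table>
"""


class BranchTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows contain no <td> cells and are skipped
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def scrape_branches(html):
    """Find the data of interest: one record per two-column table row."""
    parser = BranchTableParser()
    parser.feed(html)
    parser.close()
    return [{"branch": r[0], "town": r[1]} for r in parser.rows if len(r) == 2]


def to_csv(records):
    """Save the records in a more structured format (CSV text)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["branch", "town"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


print(to_csv(scrape_branches(SAMPLE_HTML)))
```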

Financial institutions need to make good information available for their customers, such as through online branch listings, but to date no one has used these readily available, public sources of data for mapping. Could web scraping help unlock some of this data?

MIX participated in the first DataDive organized by DataKind in order to find out. A fantastic team of data scientists set up several web scrapers over the course of a weekend, work that would have taken the MIX team many more hours to do manually.

To take this further, MIX and Thomas Levine then collaborated to map the complete financial sector in three countries in Africa, with maps and visualization by Development Seed. The end result is data on over 60,000 points of service in South Africa, Kenya and Rwanda (both coming soon), geo-coded to individual towns and mapped and consolidated for easy access. You can read more about the first round of data on South Africa here or here.

We started with a list of public data sets for each country, generally either lists of branches or mobile banking agents or databases from regulators or networks. Tom then created scrapers for each and housed them publicly on ScraperWiki, with instructions for use and maintenance.

Each scraper had a standard output so that data from different scrapers could be combined; these data are accessible for each country as a Fusion Table or CSV from the website. Once the data were consolidated, the results were geo-coded (with help from others) so they could be plotted on a map for easy navigation and access to data.
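The post does not describe what the standard output looked like, so the following is only a sketch of the idea under assumed field names (`institution`, `name`, `town`, `country`): when every scraper emits the same fields, combining their results reduces to normalizing and concatenating the records.

```python
# Assumed common schema; the real field names used by the scrapers are not
# given in the post.
FIELDS = ("institution", "name", "town", "country")


def normalize(record, source):
    """Map one scraper's record onto the shared schema, filling gaps with ''."""
    row = {field: str(record.get(field, "")).strip() for field in FIELDS}
    row["source"] = source  # remember which scraper produced the record
    return row


def combine(outputs):
    """Merge {source_name: [records]} into one list, dropping exact duplicates.

    Two records count as duplicates when all schema fields match,
    ignoring letter case and surrounding whitespace.
    """
    seen, merged = set(), []
    for source, records in outputs.items():
        for record in records:
            row = normalize(record, source)
            key = tuple(row[field].lower() for field in FIELDS)
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged
```

Keeping the `source` column makes it possible to trace any combined record back to the scraper that found it.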

Scrapers have some good properties relative to surveys and questionnaires. The data on the websites are subject to review by customers of the various financial institutions, who need to know where their local branch or office is. The scrapers themselves can be subject to peer review; by using and publishing the computer scripts, we show our work. The scrapers can also be run multiple times (if the site doesn’t change) meaning that we can re-generate the data on demand and that we can track changes in the data over time.
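Tracking changes over time can be as simple as comparing the output of two runs. The sketch below assumes each record is a dictionary and that the pair (`institution`, `name`) identifies a point of service; both the function and that choice of key are illustrative assumptions, not the method used by MIX.

```python
def diff_snapshots(old, new, key=("institution", "name")):
    """Compare two scraper runs and return (added, removed) records.

    Records are matched on the fields listed in `key` (an assumed identifier).
    """
    def index(records):
        return {tuple(record[k] for k in key): record for record in records}

    old_index, new_index = index(old), index(new)
    added = [new_index[k] for k in new_index if k not in old_index]
    removed = [old_index[k] for k in old_index if k not in new_index]
    return added, removed
```

For example, running the same scraper in two different months and passing both result lists to `diff_snapshots` would list the branches that opened and closed in between.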

Scrapers are not a panacea for all hard-to-reach data. We ran into challenges when institutions had poor or messy data on locations on their websites. Some more ‘grassroots’ providers don’t have a strong online presence and MIX had to find data offline. However, we always had recourse to more traditional data collection methods when we encountered such issues. Using the existing public data now can also reinforce the need for improvements in the long run.

Source: http://datakind.org/2012/05/642/

Thursday, 2 May 2013

Better Data Extraction Services In Your Business

Automatic data extraction software collects data from web pages. Data extraction software can cost a great deal of money, and there are two types of programs: tailored and typical.

For example, a custom web data extraction program written for site A will not work for site B, because the two sites have different structures. Customized solutions cost more than typical ones, but they are more complex and are designed for unique situations.

Data extraction software automates repetitive operations.

Data extraction software is based on a constant. By a constant, I mean a program that does not change anything about the facts, no matter what they mean. Such software is rarely smart, but for the moment it is the only way. Other approaches would need artificial intelligence and software programs that can make decisions the way people do.

The bottom line is that software can automate data mining, an otherwise expensive operation when every cycle has to be controlled by a person.

In addition, the Internet is a very dynamic knowledge resource and is growing at a rapid pace. Sports, news, finance, and corporate sites update their pages on an hourly or daily basis. Today the web reaches millions of users with different profiles, interests, and objectives.

It is important to note that only a small part of the web contains really useful information. Users can access information stored on websites through three general approaches:

Random browsing of web pages that contain many hyperlinks.
Query-based search: using engines such as Google or Yahoo to search for documents relating to specific keywords of interest.
Deep searches: on-demand data retrieved by searching a database, such as eBay.com product searches or the Business.com search engine and directory services.

Searching, collecting, filtering, and analyzing data is defined as data mining. Data relationships, patterns, or significant statistical correlations can be obtained from this type of information.

Governments, private companies, and the research and business development arms of large organizations all gather large amounts of information. All data collected by them can be stored for future use. This kind of information is most important when it is needed.

Data mining requires software tools that integrate mathematical algorithms and statistical techniques. The end product is a simple software package that non-mathematicians can use to analyze data effectively. Data mining is used in many applications, including market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, fraud detection, web site personalization, e-commerce, healthcare, customer relationship management, financial services, and telecommunications.

In business intelligence, data mining supports market research, industry research, and competitive analysis. Key areas of use include direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific testing, genetics, telecommunications, financial services, and utilities. Information systems and geographic information systems also use these techniques.

Business intelligence uses data mining as a decision-making tool across a wide range of areas. In fact, data mining is what makes the data in a BI application relevant. There are different types of data mining, including text mining, web mining, social network data mining, relational database mining, and graphics, audio, and video data mining, all of which are used in business intelligence applications.

Source: http://under25dollar.com/better-data-extraction-services-in-your-business-dating-how-to/

Note:

Delta Ray is an experienced web scraping consultant and writes articles on web data scraping, website data scraping, data scraping services, web scraping services, website scraping, eBay product scraping, Forms Data Entry, etc.