Comingsoon.net Data Scraping: May 2015

Thursday, 28 May 2015

Data Scraping Services - Login to Website Programmatically using C# for Web Scraping

In many scenario the data is available after login that you want to scrape. So to reach at the page where data is located you need to implement code in web scraper that automatically takes usename/email and password to login into website, once login is done you can do crawling and parsing as required.

Many third party web scraping application provides functionality where you can locate login url and set login parameters and that login task will be called when scraper start and do web scraping.

Below is C# example of programmatically login to demo login page

http://demo.webdata-scraping.com/login.php

Below is HTML code of Login form:

<form class="form-signin" id="login" method="post" role="form"> <h3 class="form-signin-heading">Please sign in</h3> <a href="#" id="flipToRecover" class="flipLink"> <div id="triangle-topright"></div> </a> <input type="email" class="form-control" name="loginEmail" id="loginEmail" placeholder="Email address" required autofocus> <input type="password" class="form-control" name="loginPass" id="loginPass" placeholder="Password" required> <button class="btn btn-lg btn-primary btn-block" name="login_submit" id="login_submit" type="submit">Sign in</button> </form>

<form class="form-signin" id="login" method="post" role="form">

            <h3 class="form-signin-heading">Please sign in</h3>

            <a href="#" id="flipToRecover" class="flipLink">

              <div id="triangle-topright"></div>

            </a>

            <input type="email" class="form-control" name="loginEmail" id="loginEmail" placeholder="Email address" required autofocus>

            <input type="password" class="form-control" name="loginPass" id="loginPass" placeholder="Password" required>

            <button class="btn btn-lg btn-primary btn-block" name="login_submit" id="login_submit" type="submit">Sign in</button>

</form>

In this code you can notice there is ID for email input box that is id=”loginEmail” and password input box that is id=”loginPass”

so by taking this ID we will use below two method of webBrowser control and fill the value of each input box using following code

webBrowser1.Document.GetElementById("loginEmail").InnerText =textBox1.Text.ToString(); webBrowser1.Document.GetElementById("loginPass").InnerText = textBox2.Text.ToString();

webBrowser1.Document.GetElementById("loginEmail").InnerText =textBox1.Text.ToString();

webBrowser1.Document.GetElementById("loginPass").InnerText = textBox2.Text.ToString();

After the value filled to Email and Password input box we will just call click event of submit button which is named as Sign In

webBrowser1.Document.GetElementById("login_submit").InvokeMember("click");

webBrowser1.Document.GetElementById("login_submit").InvokeMember("click");

So this is very basic example how you can login to website programatically when you need to access data that is available after login to website. This is very simple way in which you can work with Web Browser control but there are some other way as well using which you can do same thing.

Source: http://webdata-scraping.com/login-website-programmatically-using-c-web-scraping/

Monday, 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years Ive built 20 or so c# console applications that go out, download the data and re-format it into a database. But Im curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs or are people designing 20+ programs to scrape and parse this data. Everything is hard-coded into each console and run through Windows Task Manager.

Added a couple additional thoughts/details:

    Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

At current I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within that folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

Im wondering if it is worth re-building all this into a larger complex program that is more capable of dynamically adding content+scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an export component (get stuff) a transform component (map to known format) and a load (take known format and put stuff somewhere). If you are comfortable being tied to a RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that managed common aspects of the overall process (errors, and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low ceremony way to get this would be to host the powershell runtime and create each seesion with common context objects that the scripts will use to communicate. You get a built in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely for my team with a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weight the benefits of it versus the ability to separate errors in one module from the rest.)

1

It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this most falls in with ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are as well as other things like duplicate removal, normalization of fields; tens/hundreds of drag and drop ETL components, you can also write custom code as Talend is a code generator as well, either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production, in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

    Your tasks are represented by classes and not processes

    You can set and forget tasks and scale across multiple servers

    You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Source: http://programmers.stackexchange.com/questions/118077/data-scraping-one-application-or-multiple/118098#118098

Friday, 22 May 2015

How to prevent getting blacklisted while scraping

Crawlers can retrieve data much quicker and in greater depth than human searchers, so bad scraping practices can have some impact on the performance of the site.

Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a under powered server would have a hard time keeping up with requests from multiple crawlers.

Since spiders don’t bring direct organic traffic and seemingly affect the performance of the site, most site admins hate spiders and do their best to prevent them.

Lets go through how websites detect and block spiders and also know the techniques to overcome those barriers.

Most websites don’t have anti scraping mechanisms since it would affect the user experience, but some sites do not believe in open data access.

Before going through this article always keep in mind that

    A GOOD SPIDER MUST OBEY A WEBSITE’S CRAWLING POLICIES.

HOW DOES DETECTING ‘SPIDER ACTIVITY’ WORK?

A web server can use different mechanisms to detect a spider from a normal user. Here are some methods used by a site to detect a spider:

•    Unusual traffic/high download rate especially from a single client/or IP address within a short time span raises a bot alert.

•    Repetitive tasks done on website based on an assumption that a human user won’t perform the same repetitive tasks all the time.

•    The site has honeypot traps inside their pages, these honeypots are usually links which aren’t visible to a normal user but only to a spider . When a scraper/spider tries to access the link, the alarms are tripped.

Spend some time and investigate the anti-scraping mechanisms used by a site and build the spider accordingly, it will provide a better outcome in the long run and increase the longevity and robustness of your work.

EASIEST WAY TO FIND IF A SITE HATES BOTS

Check the robots.txt file if it contains line like these, It means the site doesn’t like bots. However, since most sites want to be on Google (arguably the largest scraper of websites globally ;-)) they do allow access to bots and spiders.

User-agent: *
Disallow: /

This line is for preventing well-behaved bots or the bots which respect robots.txt.

Another way is CAPTCHAs irritating presence in the sites other than in authentication page.

WHAT HAPPENS WHEN YOU GET BANNED

There are two ways to ban a webspider, either by banning all accesses from a particular IP or by banning all accesses that use a specific id to access the server (most browsers and web spiders identify themselves whenever they request a page by user agents. Chrome browser for example uses Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36

The banning can be temporary or permanent. Temporary blocks can last minutes or hours.

HOW DO WE KNOW A SITE HAS BLOCKED US?

If any of the following symptoms appear on the site that you are crawling, it is a sign of being blocked or banned.

•    Showing CAPTCHA pages
•    Unusual content delivery delay
•    Frequent response with 404,301,500 errors,

also frequent appearance of these status codes are also indication of blocking.

•    401 Unauthorized
•    403 Forbidden
•    404 Not Found
•    408 Request Timeout
•    429 Too Many Requests

WEB CRAWLING BEST PRACTICES

These are the best practices we can follow to overcome the detection.

1. MAKE CRAWLING SLOWER, DO NOT DDoS THE SERVER, TREAT THEM NICELY

Use auto throttling mechanisms, which will automatically throttle crawling speed based on the load on both spider and the website, you are crawling and also adjust the spider to optimum crawling speed. The faster you crawl, the worse it is for everyone.

Put some random sleeps in between requests, Add some delays after crawled number of pages. Choose the lowest number of concurrent requests possible. These techniques make the spider looks like a human being.

2. DISGUISE YOUR REQUESTS BY ROTATING IP/PROXY

A server can easily detects a bot by checking the requests from a single IP address, So we use different IPs for making request to a server and detection rate become lesser. Make a pool of IPs that you can use and use random ones for each request.

There are several methods can be used to change the IP. Services like VPN ,shared proxies, TOR can help and some third parties are also provides services for IP rotation.

3. USER-AGENT SPOOFING

Since every request made from a client end contains a user-agent header ,Using the same useragent multiple times leads to the detection of a bot. User agent spoofing is the best solution for this. Spoof the User agent by making a list of user agents and pick a random one for each request.

Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot: Googlebot/2.1; (http://www.google.com/bot.html)

You can check your user-agent string here:

http://www.whatsmyuseragent.com/

A good user-agent string list can be found here:

http://www.useragentstring.com/pages/useragentstring.php

4. BE AWARE OF HONEYPOTS

Some site designers put honeypot traps inside websites to detect web spiders, They may be links that normal user can’t see and a spider can.

When following links always take care that the link has proper visibility with no nofollow tag. Some honeypot links to detect spiders will be have the CSS style display:none or will be color disguised to blend in with the page’s background color.

5. DO NOT ALWAYS FOLLOW THE SAME CRAWLING PATTERN

Only robots follow the same crawling pattern,Sites that have intelligent anti-crawling mechanisms can easily detect spiders from finding pattern in their actions. Humans wont perform repetitive tasks a lot of times. Incorporate some random clicks on the page, mouse movements and random actions that will make a spider looks like a human client.

6. ALWAYS RESPECT THE robots.txt

All web spiders are supposed to follow rules that you place in a robots.txt file in a website, such as how frequently they are allowed to request pages, and from what directories they are allowed to crawl through. They should also be supplying a consistent valid User-Agent string that identifies the requests as a bot request.

Source: http://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/

Wednesday, 20 May 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Sunday, 17 May 2015

What is Blog Scraping Service?

Blog scraping is the one of the best service to increase the traffic of the site by commenting about blogs or writing review about blogs in SEO field. Most of the Blogs will allow their reader to review or write their own comments or suggestion or ideas or thoughts in the blogs.

Nowadays in the internet world we can find the number of blogs and sites related to various topics or various products. Main concept of this service is increase traffic of website by commenting others blogs. This is very simple and easiest method. But the main difficultly we face here is getting approval from moderator of the site which may take more time or maybe we won’t get the approval.

Hence Web Scraping seo is planning to provide this blog scraping service without approval as many moderators do not have the time to read and approved each and every comment written by various visitors. We will find the High PR pages on the various blogs related to your website content and write the own comment about those blogs and provide the link of your website or anchor text. We don’t have the option or the way to track the blogs whether it is approved by moderator or not. We will give the link with comments what we have typed on the blogs as a report. It will increase the back link and increase the traffic.

What are the features of Blog scraping Service?

•    Will provide the comments or reviews to blogs which having related niche to your product.
•    Will write comments only high density or high ranking blogs.
•    Fast and More accurate promotion compared to other service
•    Understand the Blogs by reading carefully and comment accordingly
•    This service is optimized and SEO friendly.

What are the benefits of Blog scraping Service?

•    Effect of time spending for this service is very less.
•    This service is best method to increase your site traffic with minimal effect and cost.
•    Increase your web site rank in all search engines.
•    Reach your site to more number of audiences.
•    Increase your product sale.
•    Fast and more results.

What are the advantages of using this service in Web Scraping SEO?

•    Web Scraping SEO is one the top SEO service provider in the SEO Market.
•    Expert people working on Blog commenting service will always do analysis to find the high traffic blogs.
•    Web Scraping SEO will get the approval from Blogs administrator easily.
•    Provides High Quality Service with reasonable price.
•    Provides on time delivery.
•    More flexible to clients.
•    Always met the Client expectation and Provide quality service.

Frequently Asked Questions

Q: Will you provide the approval for each comment you typed on the blogs from blogsite moderator?

A: No, we are only responsible for creating comments for your website but we won’t wait for moderation approval, because Moderator is responsible for Approval, He may take time for approval that is according to Moderator’s scope. We will give only the blog links and the comments to you as a report.

Q: Do you have any system or software to track the approval of blog?

A: We don’t have any system or software to track the approval, we do comments in those top blog sites according to the matching keyword. That is only our job approval is from moderator side.

Q: Why you can’t get the approval for comments from moderator?

A: I can clearly answer this one, Because nowadays everyone is busy particularly the blogsite Moderators for that reason our comments got approved late. But we are not going to wait for that because we have a lot of works to do, But I assure you, that with the final reports that contains how many sites we have uploaded with your comments in MS Excel format will reach you.

Q: How do you select the blogs for commenting?

A: We are going to select top ranking blog sites related to your keywords, According to the benefits of your product we will give proper and attractive comments carefully.

Source: http://www.Web Scrapingseo.com/blog-scraping-service.aspx

Wednesday, 6 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it? Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access. Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format. Once the data is in a structured format, the user can extract or manipulate the data with ease. Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different. Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today. A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer. In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics. Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them? Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads? Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them. The list goes on and on.

I should mention that web scraping is not always a bad thing. Some websites allow web scraping, but many do not. It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information. Most websites have a copyright disclosure statement that legally protects their website information. It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically. In fact, the F5.com website presents the following copyright disclosure: "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws." It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware! There have been many court cases where web scraping turned into felony offenses. One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles. This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted. Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market. Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge. The software company had to pay a $20 million fine and the guilty scraper is serving three years probation. Finally, there's the case of a medical website that hosted sensitive patient information. In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website. The website was scraped by a media-rese
arch firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world. As you can see, it's increasingly important to guard against this activity. After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today. The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family. As you can see in the screenshot below, the ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript. It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval. The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript. It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded. The session transaction anomaly detects valid sessions that visit the site much more than other clients. This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table. The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities. But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

I'm sure as you studied the screenshot above you also noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc. You might be wondering how it does all that stuff as well. Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity