Wednesday, 29 April 2015

Web Data Extraction Solutions for business Automation

Your business today is driven purely and solely by information. This vital component is carved out from data scraped from relevant sources, cleansed and compiled to form the crux of your enterprise analytics plan. Therefore, following a manual practice for Data Extraction often makes it prone to errors which may be detrimental to the health of your business. Various Web Data Extraction Solutions therefore are made available which help to not only automate Website Data Scraping, but in the process also help in the automation of several business processes.  Let us take a look at some of these business processes:

Execute Data Mining

One of the most obvious usages of the data extractor tools is the function of Web Data Mining, which, if done through manual processes is neither cost-effective nor accurate. Therefore, data extraction solutions provide simple and convenient point and click data extraction interfaces. Moreover, these do not require any additional programming knowledge.

Validate Data accuracy

Web Site Data scraping tools use advanced technology to not only extract data, but also validate their accuracy. This is particularly beneficial for businesses which are involved in background screening and credit reporting activities. Automating the process of validating records helps in increasing turnaround time of the business and ensures accuracy, two of the most crucial criteria for achieving success in their line of business.

Be Price Wise

Organizations are always involved in studying the price dynamics in order to better understand the challenges and areas of opportunities. Your awareness on these helps you to provide more competitive pricing for your products and services. Data extraction solutions are equipped with pricing intelligence technology that helps to collect data on the price that your customer’s expect, their feedback on your products and services and helps you develop insights on products being launched by competitors, their price and availability. This automated process therefore ensures the following things for your business:

•    Increased market – share

•    Enhanced product strategy

•    Informed decisions

Be Compliance Ready

If yours is a financial services firm, you must have often found it to be a major challenge to keep yourself abreast of the compliance and risk factors. Tracking the myriad watch lists, sanction lists and federal and state regulations, available on the web is not only time taking but also an expensive endeavor. The automated Web Data Mining tools ensure that you are now able to do this without impacting the time and money aspect of your business. You are now ensured of compliance with all regulations and sanction lists that are updated on a regular basis. Consequently, your business can also breathe easy with a reduced exposure to financial fraud and identity theft.

Easy Access to Customer Feedback

Your customers are in a way the real owners of your business. They define the way you need to design your product strategy. It is therefore crucial that you need to listen to what they have to say.  The automated Web extraction solutions with their ability to tap into several relevant sources help you to get access to customer sentiments and feedback on your products and services. This is a vital aspect in your organization's growth as it helps you to tackle three important factors:

•    Assimilate positive or negative vibes on your newly launched product or service which might require you to revisit your product strategy

•    Provide immediate attention to issues, if any, with any particular product

Create a trend of aggregated customer sentiments, for analysis

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html

Tuesday, 28 April 2015

Scraping a website from a windows service

Question

Hi there.  I have a windows forms application that scrapes a website to retrieve some data.  I would like to implement the same functionality as a windows service.  The reason for this is to allow the program to run 24/7 without  having a user signed in. 

To that end, my current version of the program uses a web browser control (system.windows.forms.webbrowser) to navigate the pages, click the buttons, allow scripts to do their thing, etc.  I cannot figure out a way to do the same without the web browser control, but the web browser control cannot be instantiated in a windows service (because there is no user interface in a web service).

Does anyone have any brilliant ideas on how to get around this?

Thank you very much!

Answers

Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards

All replies

You are not telling if you are using a .NET Express edition or not

You are not telling which Framework

You are not realy saying what data you are getting from the web site.

So

I made an example of service that work on any Studio edition (including the Express)

to install it, I supposed that you have at least the Framework2, so you will use something similar to:

    %SystemRoot%\Microsoft.NET\Framework\v2.0.50727\installutil /i C:\Test\MyWindowService\MyWindowService\bin\Release\MyWindowService.exe

In the example, I supposed that you are downloading some file from the site

You will need a reference to Windows.Form for the timer


Imports System.ServiceProcess

Imports System.Configuration.Install

Public Class WindowsService : Inherits ServiceBase

  Private Minute As Integer = 60000

  Private WithEvents Timer As New Timer With {.Interval = 30 * Minute, .Enabled = True}

  Public Sub New()

    Me.ServiceName = "MyService"

    Me.EventLog.Log = "Application"

    Me.CanHandlePowerEvent = True

    Me.CanHandleSessionChangeEvent = True

    Me.CanPauseAndContinue = True

    Me.CanShutdown = True

    Me.CanStop = True

  End Sub


  Private Sub Timer_Tick(ByVal sender As Object, ByVal e As System.EventArgs) Handles Timer.Tick

    If IO.File.Exists("C:\MyPath.Data") Then IO.File.Delete("C:\MyPath.Data")

    My.Computer.Network.DownloadFile("http://MyURL.com", "C:\MyPath.Data", "MyUserName", "MyPassword")

    'Do Something with the data downloaded

  End Sub

End Class

<Microsoft.VisualBasic.HideModuleName()> _

Module MainModule

  Public TheServiceName As String

  Public Sub main()

    Dim TheServiceApplication As New WindowsService

    TheServiceName = TheServiceApplication.ServiceName

    ServiceBase.Run(TheServiceApplication)

  End Sub

End Module

<System.ComponentModel.RunInstaller(True)> _

Public Class WindowsServiceInstaller : Inherits Installer

  Public Sub New()

    Dim serviceProcessInstaller As ServiceProcessInstaller = New ServiceProcessInstaller()

    Dim serviceInstaller As ServiceInstaller = New ServiceInstaller()

    serviceProcessInstaller.Account = ServiceAccount.LocalSystem

    serviceProcessInstaller.Username = Nothing

    serviceProcessInstaller.Password = Nothing

    serviceInstaller.DisplayName = "My Windows Service"

    serviceInstaller.StartType = ServiceStartMode.Automatic

    serviceInstaller.ServiceName = TheServiceName

    Me.Installers.Add(serviceProcessInstaller)

    Me.Installers.Add(serviceInstaller)

  End Sub

End Class



 Hello Andy,

Thanks for your post.

What do you want to scrape from the page? HttpWebRequest class ans WebClient class may be what you need. More information, please check:

The HttpWebRequest class provides support for the properties and methods defined in WebRequest and for additional properties and methods that enable the user to interact directly with servers using HTTP.

http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

The WebClient class provides common methods for sending data to or receiving data from any local, intranet, or Internet resource identified by a URI

http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx

If you have any concenrs, please feel free to follow up.

Best regards



Hi Andy,

What about this problem on your side now? If you have any concerns, please feel free to follow up.

Have a nice day.

Best regards



Hi Andy,

When you come back, if you need further assistance about this issue, please feel free to let us know. We will continue to work with this issue.

Have a nice day.

Best regards



Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I am using Visual Studio 2010 Ultimate Edition and the .NET framework 4.0.  Actually, I am upgrading some old code written in VB 6.0, but I can use the latest and greatest thats available.

The application uses a browser control to go to the page, fill in values, click on UI elements, read the HTML that returns, etc.  The purpose of the application is to collection useful information regularily/automatically.

I know how to create a web service, but using the web control in such a service is problematic because the web browser control was meant to be placed on a windows form.  I am not able to create a new instance of it in a project designated as a windows service.



Andy

Thank you for the reply. Sorry it has taken me so long to respond.  I did not receive any notification that someone had replied!

I thought a web request was for web services (retrieving information from them).  I am trying to retreive useful information from a website designed for interaction by a human, such as selecting items from lists and clicking buttons.   I currently use a web browser control to programmatically do what a person would do and get the pages back which in turn get parsed.

Andy



Hi Andy,

There is a tool which could let you manipulate anything you want on the website. This agile HTML parser builds a read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). More information, please check:

http://htmlagilitypack.codeplex.com/

Have a nice day.

Best regards



Thanks for the suggestion.  I will go to that link and see if it will work.  I will update this post with what I find.

I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions? If you have any concerns, please feel free to follow up.

Have a nice day.



Best regards

Hi Liliane

Thank for the follow up reply.  I don't have an answer as of yet.  Implementing this is going to take time and I haven't been given the go-ahead by my boss to spend the time to pursue it.



Hi Andy,

Never minde. You could have a try when you feel free. If you have any further questions about this issue, please feel free to let us know. We will continue to work with you on this issue.

Have a nice day.

Best regards

Source: https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5d565b1-236b-43c2-90c7-f5cc3b2c341b/scraping-a-website-from-a-windows-service

Saturday, 25 April 2015

Data Mining and Market Research

Online market research attributes to success and growth of many businesses. Online market research in simple terms we can say it is the learning of current and the latest market situations which involve surveys, web and data mining modules. To date research by use of the internet it is very important since it depends on data gathered from internet services and then one can recognize that market research keeps the business successful.

A number of managers in small businesses have a mental deem in that online market research is obligatory to big or larger companies. For true you will understand that whether the businesses is medium, large or small actually need online marketing research and this is a reason why the significance of the process and allegation will approve the targeted and potential clients. In this case Data mining progression is employed to streamline on what targeted and potential clientele needs. Areas where data mining is used:

Preferences. In any given service or product, you will learn what a customer looking for and how your product or service is different from other competitors. By use of Data mining you will be able to determine the customer preferences and you will be able to modify your services and products meet the customer choice.

Buying patterns. What is known and created for purchasing patterns from different customers. A situation can be that customers try to spend a lot on certain products and little on others. But through Data mining it is easy to understand such purchasing patterns and finally plan the appropriate techniques to be used in the marketing.

Prices. To find out whether the company is selling its products to the clients or not prices are the key factors to take into account. One should understand on the right selling price of the products. Web scrapping is easier to find the suitable pricing.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/data-mining-market-research/

Tuesday, 21 April 2015

Hand Scraped Flooring For a Natural and Unique Look

An option in hardwood flooring that is being increasingly adopted by those looking for something new, innovative and unique for their homes is hand scraped flooring. This type of wood flooring helps one achieve a distinct natural look on one's floor and also has a couple of advantages.

There are three types of scraping that you can get done on your wooden flooring: light, medium and hard. Preferably, if you have a light colored woodwork, then you should go for light scraping and if your floor has a darker shade, then you should opt for hard scraping. But, irrespective of the type of scraping you go for, you must ensure that the laborers doing the scraping are very skilled and impeccable in their job as hand scraping floors is an art that demands patience, time, talent and hard work.

Nowadays, many people tend to go for machine scraping, attracted by the lower investment involved in it. But such people are unable to achieve the requisite natural effect on their floors as machines create patterns on the floors that are easily detectable. These patterns do not emerge with hand scraping and the consequent look is as random and unique as it gets.

Though such scraped flooring is a costly option in flooring, it demands little maintenance. While with perfectly smooth surfaces, you will be always on the edge ensuring that there are no scratches, with hand scraped floors, you will not have to be concerned about this as any new scratches will only add to the already distressed appearance of the flooring.

Prefinished hand scraped wood flooring is also available in the market nowadays. These eliminate the need of any on-site scraping. But this option is of course unsuitable for those who have already got their floors installed. As it is, if you get on-site scraping done, you will have more control over things as you would be able to see the scraping as it develops and would be therefore in a position to exercise your preferences more.

Source: http://ezinearticles.com/?Hand-Scraped-Flooring-For-a-Natural-and-Unique-Look&id=4581623

Thursday, 9 April 2015

What is HTML Scraping and how it works

There are many reasons why there may be a requirement to pull data or information from other sites, and usually the process begins after checking whether the site has an official API. There are very few people who are aware about the presence of structured data that is supported by every website automatically. We are basically talking about pulling data right from the HTML, also referred to as HTML scraping. This is an awesome way of gleaning data and information from third party websites.

Any webpage content that can be viewed can be scraped without any trouble. If there is any way provided by the website to the browser of the visitor to download content and use the same in a highly structured manner, in that case, accessing of the content programmatically is possible. HTML scraping works in an amazing manner.

Before indulging in HTML scraping, one can inspect the browser for network traffic. Site owners have a couple of tricks up their sleeve to thwart this access, but majority of them can be worked around.

Before moving on to how HTML scraping works, we must understand the reasons behind the same. Why is scraping needed? Once you get a satisfactory answer to this question, you can start looking for RSS or API feeds or various other traditional structured data forms. It is significant to understand that when compared with APIs, websites are more significant.

The most important advantage of the same is the maintenance of their websites where a lot of visitors visit rather than safeguarding structured data feeds. With Tweeter, the same has been publicly seen when it clamps down on the developer ecosystem. Many times, API feeds change or move without any prior warning. Many times, it can also be a deliberate attempt, but mostly, such issues or problems erupt as there is no authority or an organization that maintains or takes care of the structured data. It is rarely noticed, if the same gets severely mangled or goes offline. In case the website has certain issues or the website no longer works, the problem is more in the form of a ball in your court requiring dealing with the same without losing any time. api-comic-image

Rate limiting is another factor that needs a lot of thinking and in case of public websites, it virtually doesn’t exist. Besides some occasional sign up pages or captchas, many business websites fail to create and built defenses against any unwarranted automated access. Many times, a single website can be scraped for four hours straight without anyone noticing. There are chances that you would not be viewed under DDOS attack unless concurrent requests are being made by you. You will be seen just as an avid visitor or an enthusiast in the logs, that too, in case anyone is looking.

Another factor in HTML scraping is that one can easily access any website anonymously. Behavior tracking can be done with a few ways by the administrator of the website and this turns out to be beneficial if you want to privately gather the data. Many times, registration is imperative with APIs in order to get key and with any request being sent, this key also needs to be sent. But, in case of simple and straightforward HTTP requests, the visitor can stay anonymous besides cookies and IP address, which can again be spoofed.

The availability of HTML scraping is universal and there is no need to wait for the opening of the site for an API or for contacting anyone in the organization. One simply needs to spend some time and browse websites at a leisurely pace until the data you want is available and then find out the basic patterns to access the same.

Now you need to don a hat of a professional scraper and simply dive in. Initially, it may take some time to work up figuring out the way the data have been structured and the way it can be accessed just as we read APIs. If there is no documentation unlike APIs, you need to be a little more smart about it and use clever tricks.

Some of the most used tricks are

Data Fetching


The first thing that is required is data fetching. Find endpoints to begin with, that is the URLs that can help in returning the data that is required. If you are pretty sure about the data and the way it should be structured so as to match your requirements, you will require a particular subset for the same and later you can indulge in site browsing using the navigation tools.

GET Parameter

The URLs must be paid attention to and see the way it changes as you indulge in clicking between the sections and the way they divide into various subsections. Before starting, the other option that can be used is to straight away go to the search functionality of the site. Certain terms can be typed and the URL needs to be focused again for watching the changes on the basis of what is being searched. A GET parameter will be probably seen like q which changes on the basis of the search term used by you. Other GET parameters that are not being used can be removed from the URL until only the ones that are needed are left for data loading. Before a query string, there must always be a “?” beginning.

Now the time has come when you would have started to come across the data that you would like to see and want to access, but sometimes, there may be certain pagination issues that require to be dealt with. Due to these issues, you may not be able to see the data in its entirety. Single requests are kept away by many APIs as well from database slamming. Many times, clicking the next page can add some offset parameter that helps in data visibility on the page. All these steps will help you succeed in HTML scraping.

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Tuesday, 7 April 2015

Rvest: Easy Web Scraping With R

rvest is new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)

lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%

  html_node("strong span") %>%

  html_text() %>%

  as.numeric()

#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%

  html_nodes("#titleCast .itemprop span") %>%

  html_text()

#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"   

#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"

#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"

#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"   

#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in a the third table on the page. We can use html_node() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%

  html_nodes("table") %>%

  .[[3]] %>%

  html_table()

#>                                              X 1            NA

#> 1 this movie is very very deep and philosophical   mrdoctor524

#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm

#> 3                         Discouraging Building?       Laestig

#> 4                              LEGO - the plural      neil-476

#> 5                                 Academy Awards   browncoatjw

#> 6                    what was the funniest part? actionjacksin

Other important functions

•    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td")).

•    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

•    Detect and repair text encoding problems with guess_encoding() and repair_encoding().

•    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/