Friday, 29 July 2016

Scraping LinkedIn Public Profiles for Fun and Profit

Reconnaissance and information gathering are part of almost every penetration testing engagement. Often, the tester will only perform network reconnaissance in an attempt to discover and learn the company's network infrastructure (e.g. IP addresses, domain names, etc.), but there are other types of reconnaissance to conduct, and no, I'm not talking about dumpster diving. Thanks to social networks like LinkedIn, OSINT/WEBINT is now yielding more information. This information can then be used to help the tester test anything from social engineering to weak passwords.

In this blog post I will show you how to use Pythonect to easily generate potential passwords from LinkedIn public profiles. If you haven't heard about Pythonect yet, it is a new, experimental, general-purpose dataflow programming language based on the Python programming language. Pythonect is most suitable for creating applications that are themselves focused on the "flow" of the data. An application that generates passwords from the public LinkedIn profiles of a given company's employees has a coherent and clear dataflow:

(1) Find all the employees' public LinkedIn profiles → (2) Scrape all the employees' public LinkedIn profiles → (3) Crunch all the data into potential passwords

Now that we have the general concept and high-level overview out of the way, let's dive into the details.

Finding all the employees' public LinkedIn profiles will be done via Google Custom Search Engine, a free service by Google that allows anyone to create their own search engine. The idea is to create a search engine that, when given a company name, returns the public LinkedIn profiles of all its employees. How? When creating a Google Custom Search Engine it's possible to restrict the search results to a specific site (the 'Sites to search' setting), and we're going to limit ours to linkedin.com. It's also possible to narrow the search results even further, e.g. to uk.linkedin.com to find only employees from the United Kingdom.

Access to the newly created Google Custom Search Engine will be made using a free API key obtained from the Google API Console. Why go through the Google API? Because it allows automation (no CAPTCHAs), and it also means that the search-result pages will be returned as JSON (as opposed to HTML). The only catch with using the free API key is that it's limited to 100 queries per day, but it's possible to buy an API key that is not limited.
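
As a rough sketch of what such a request looks like, here is a small Python 3 snippet that only builds the URL and does not call the API. It assumes the same key, cx, q and start query parameters that the scraper further down relies on; build_search_url is a hypothetical helper name, and the key and cx values are placeholders rather than real credentials:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, cx, query, start=None):
    """Return a Custom Search request URL (hypothetical helper)."""
    params = {"key": api_key, "cx": cx, "q": query}
    if start:
        # 'start' is the index of the first result wanted on later pages
        params["start"] = start
    # urlencode percent-escapes quotes, spaces and colons in the query
    return API_ENDPOINT + "?" + urlencode(params)


url = build_search_url("YOUR_API_KEY", "YOUR_CX", '"at Pythonect" inurl:"in"', start=11)
print(url)
```

Escaping the query this way keeps company names that contain characters such as & from being mistaken for extra parameters.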

Scraping the profiles is a matter of iterating over all the hCards in all the search-result pages, and extracting the employee name from each hCard. What is an hCard? hCard is a microformat for publishing the contact details of people, companies, organizations, and places. hCard is also supported by social networks such as Facebook, Google+ and LinkedIn for exporting public profiles. Google (when indexing) parses hCards and, when relevant, uses them in search-result pages. In other words, when search-result pages include LinkedIn public profiles, they appear as hCards and can be easily parsed.
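
To make the extraction step concrete, here is a minimal Python 3 sketch that runs offline against a hand-written response. The field names (items, pagemap, hcard, fn, title) are the ones the scraper further down reads; the sample JSON itself, including the title text, is made up for illustration, and extract_hcards is a hypothetical helper name:

```python
import json

# Hypothetical, trimmed-down response: only the fields the scraper reads
SAMPLE_RESPONSE = """
{
  "items": [
    {"pagemap": {"hcard": [{"fn": "Itzik Kotler", "title": "Example title at Pythonect"}]}},
    {"pagemap": {}},
    {"link": "http://www.example.com/"}
  ]
}
"""


def extract_hcards(response_text):
    """Map each result's hCard full name ('fn') to its 'title'."""
    hcards = {}
    for item in json.loads(response_text).get("items", []):
        try:
            card = item["pagemap"]["hcard"][0]
            hcards[card["fn"]] = card["title"]
        except (KeyError, IndexError):
            # Results without a usable hCard are skipped
            continue
    return hcards


print(extract_hcards(SAMPLE_RESPONSE))
```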

Let's see the implementation of the above:

#!/usr/bin/python
#
# Copyright (C) 2012 Itzik Kotler
#
# scraper.py is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# scraper.py is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with scraper.py.  If not, see <http://www.gnu.org/licenses/>.

"""Simple LinkedIn public profiles scraper that uses Google Custom Search"""

import urllib
import simplejson


BASE_URL = "https://www.googleapis.com/customsearch/v1?key=<YOUR GOOGLE API KEY>&cx=<YOUR GOOGLE SEARCH ENGINE CX>"


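def __get_all_hcards_from_query(query, index=0, hcards=None):

    # A mutable default ({}) would be shared by every call, so results
    # from earlier searches would leak into later ones

    if hcards is None:

        hcards = {}

    url = query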

    if index != 0:

        url = url + '&start=%d' % (index)

    json = simplejson.loads(urllib.urlopen(url).read())

    if json.has_key('error'):

        print "Stopping at %s due to Error!" % (url)

        print json

    else:

        for item in json['items']:

            try:

                hcards[item['pagemap']['hcard'][0]['fn']] = item['pagemap']['hcard'][0]['title']

            except KeyError as e:

                pass

        if json['queries'].has_key('nextPage'):

            return __get_all_hcards_from_query(query, json['queries']['nextPage'][0]['startIndex'], hcards)

    return hcards


def get_all_employees_by_company_via_linkedin(company):

    queries = ['"at %s" inurl:"in"', '"at %s" inurl:"pub"']

    result = {}

    for query in queries:

        _query = query % company

        # quote_plus escapes spaces, quotes and '&' so the query stays one parameter

        result.update(__get_all_hcards_from_query(BASE_URL + '&q=' + urllib.quote_plus(_query)))

    return list(result)

Replace <YOUR GOOGLE API KEY> and <YOUR GOOGLE SEARCH ENGINE CX> in the code above with your Google API Key and Google Search Engine CX respectively, save it to a file called scraper.py, and you're ready!

To kick-start, here is a simple program in Pythonect (that utilizes the scraper module) that searches for and prints the full names of all the Pythonect company employees:

"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin -> print

The output should be:

Itzik Kotler

In my LinkedIn profile, I have listed Pythonect as a company that I work for, and since no one else is working there, when searching for all the employees of the Pythonect company, only my LinkedIn profile comes up.
For demonstration purposes I will keep using this example (i.e. the "Pythonect" company and the "Itzik Kotler" employee), but go ahead and replace Pythonect with other, more popular, company names and see the results.

Now that we have a working skeleton, let's take its output and start crunching it. Keep in mind that every "password generation formula" is merely a guess. The examples below are only a sampling of what can be done. There are, obviously, many more possibilities and you are encouraged to experiment. But first, let's normalize the output, so that it is consistent before any operations are performed on it:

"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin -> string.lower(''.join(_.split()))

The normalization procedure is short and simple: convert the string to lowercase and remove any spaces, so the output should now be:

itzikkotler
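
For readers who want to try the normalization step outside Pythonect, a plain Python 3 equivalent might look like the following; normalize is a hypothetical helper name, not part of the scraper module:

```python
def normalize(full_name):
    """Lowercase a name and drop all whitespace."""
    # split() with no arguments splits on any run of whitespace
    return "".join(full_name.split()).lower()


print(normalize("Itzik Kotler"))  # itzikkotler
```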

As for data manipulation, out of the box (thanks to the Python Standard Library) we've got itertools and its combinatoric generators. Let's start by applying itertools.product:

"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin -> string.lower(''.join(_.split())) -> itertools.product(_, repeat=4) -> print

The code above will generate and print every 4-character combination of the letters in the normalized name (i, t, z, k, o, l, e and r). However, it won't cover passwords that contain uppercase letters. And so, here's a simple and straightforward implementation of a cycle_uppercase function that walks over the input letters and, for each position, yields a copy of the input with the letter at that position in uppercase:

def cycle_uppercase(i):
    s = ''.join(i)
    for idx in xrange(0, len(s)):
        yield s[:idx] + s[idx].upper() + s[idx+1:]

To use it, save it to a file called itertools2.py, and then simply add it to the Pythonect program after the itertools.product(_, repeat=4) block, as follows:

"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin \
    -> string.lower(''.join(_.split())) \
        -> itertools.product(_, repeat=4) \
            -> itertools2.cycle_uppercase \
                -> print

Now, the program will also cover passwords that include a single uppercase letter. Moving on with the data manipulation, sometimes the password might contain symbols that are not found within the scraped data. In this case, it is necessary to build a generator that will take the input and add symbols to it. Here is a short and simple one, implemented as a list comprehension:

[_ + postfix for postfix in ['123','!','$']]

To use it, simply add it to the Pythonect program after the itertools2.cycle_uppercase block, as follows:

"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin \
    -> string.lower(''.join(_.split())) \
        -> itertools.product(_, repeat=4) \
            -> itertools2.cycle_uppercase \
                -> [_ + postfix for postfix in ['123','!','$']] \
                    -> print

The result is that the program now appends the strings '123', '!', and '$' to every generated password, which increases the chances of guessing the user's password, or not, depending on the password :)
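
For comparison, here is a rough plain Python 3 sketch of the same three stages (product, single uppercase letter, suffixes). It is an approximation of the Pythonect flow rather than a translation of how Pythonect evaluates it, and candidates is a hypothetical function name:

```python
import itertools

SUFFIXES = ["123", "!", "$"]


def cycle_uppercase(chars):
    """Yield copies of the input with exactly one position uppercased."""
    s = "".join(chars)
    for idx in range(len(s)):
        yield s[:idx] + s[idx].upper() + s[idx + 1:]


def candidates(full_name, length=4, suffixes=SUFFIXES):
    """Lazily generate password guesses from a full name."""
    normalized = "".join(full_name.split()).lower()
    for combo in itertools.product(normalized, repeat=length):
        for variant in cycle_uppercase(combo):
            for suffix in suffixes:
                yield variant + suffix


# Only peek at the first few guesses; the full stream is large
first_five = list(itertools.islice(candidates("Itzik Kotler"), 5))
print(first_five)
```

Because the normalized name has 11 characters, this yields 11**4 * 4 * 3 = 175,692 guesses, including duplicates caused by the repeated letters, which is why the sketch stays lazy and only slices off the first few.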

To summarize, it's possible to take OSINT/WEBINT data on a given person or company and use it to generate potential passwords, and it's easy to do with Pythonect. There are, of course, many different ways to manipulate the data into passwords and many programs and filters that can be used. In this aspect, Pythonect being a flow-oriented language makes it easy to experiment and research with different modules and programs in a "plug and play" manner.

Source:http://blog.ikotler.org/2012/12/scraping-linkedin-public-profiles-for.html

Tuesday, 12 July 2016

Python 3 web-scraping examples with public data

Someone on the NICAR-L listserv asked for advice on the best Python libraries for web scraping. My advice below includes what I did for last spring’s Computational Journalism class, specifically, the Search-Script-Scrape project, which involved 101 web-scraping exercises in Python.

Best Python libraries for web scraping

For the remainder of this post, I assume you’re using Python 3.x, though the code examples will be virtually the same for 2.x. For my class last year, I had everyone install the Anaconda Python distribution, which comes with all the libraries needed to complete the Search-Script-Scrape exercises, including the ones mentioned specifically below:
The best package for general web requests, such as downloading a file or submitting a POST request to a form, is the simply-named requests library (“HTTP for Humans”).

Here’s an overly verbose example:

import requests
base_url = 'http://maps.googleapis.com/maps/api/geocode/json'
my_params = {'address': '100 Broadway, New York, NY, U.S.A',
             'language': 'ca'}
response = requests.get(base_url, params = my_params)
results = response.json()['results']
x_geo = results[0]['geometry']['location']
print(x_geo['lng'], x_geo['lat'])
# -74.01110299999999 40.7079445

For the parsing of HTML and XML, Beautiful Soup 4 seems to be the most frequently recommended. I never got around to using it because it was malfunctioning on my particular installation of Anaconda on OS X.
But I’ve found lxml to be perfectly fine. I believe both lxml and bs4 have similar capabilities – you can even specify lxml to be the parser for bs4. I think bs4 might have a friendlier syntax, but again, I don’t know, as I’ve gotten by with lxml just fine:

import requests
from lxml import html
page = requests.get("http://www.example.com").text
doc = html.fromstring(page)
link = doc.cssselect("a")[0]
print(link.text_content())
# More information...
print(link.attrib['href'])
# http://www.iana.org/domains/example

The standard urllib package also has a lot of useful utilities – I frequently use the methods from urllib.parse. Python 2 also has urllib but the methods are arranged differently.

Here’s an example of using the urljoin method to resolve the relative links on the California state data page for high school test scores. The use of os.path.basename is simply for saving each spreadsheet to your local hard drive:

from os.path import basename
from urllib.parse import urljoin
from lxml import html
import requests
base_url = 'http://www.cde.ca.gov/ds/sp/ai/'
page = requests.get(base_url).text
doc = html.fromstring(page)
hrefs = [a.attrib['href'] for a in doc.cssselect('a')]
xls_hrefs = [href for href in hrefs if 'xls' in href]
for href in xls_hrefs:
  print(href) # e.g. documents/sat02.xls
  url = urljoin(base_url, href)
  with open("/tmp/" + basename(url), 'wb') as f:
    print("Downloading", url)
    # Downloading http://www.cde.ca.gov/ds/sp/ai/documents/sat02.xls
    data = requests.get(url).content
    f.write(data)
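
To see what urljoin and basename do in that loop without touching the network, here is a small offline sketch. The first href is the one from the comment above; the other two hrefs are made-up examples:

```python
from os.path import basename
from urllib.parse import urljoin

base_url = 'http://www.cde.ca.gov/ds/sp/ai/'

# A relative href is resolved against the directory of the base URL
relative = urljoin(base_url, 'documents/sat02.xls')
print(relative)

# A root-relative href (made-up) replaces the whole path
root_relative = urljoin(base_url, '/ds/example.xls')
print(root_relative)

# An absolute URL (made-up) is returned unchanged
absolute = urljoin(base_url, 'http://www.example.com/data.xls')
print(absolute)

# basename keeps only the final path segment, used here as the file name
print(basename(relative))
```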

And that’s about all you need for the majority of web-scraping work – at least the part that involves reading HTML and downloading files.

Examples of sites to scrape

The 101 scraping exercises didn’t go so great, as I didn’t give enough specifics about what the exact answers should be (e.g. round the numbers? Use complete sentences?) or even where the data files actually were – as it so happens, not everyone Googles things the same way I do. And I should’ve made them do it on a weekly basis, rather than waiting till the end of the quarter to try to cram them in before finals week.

The Github repo lists each exercise with the solution code, the relevant URL, and the number of lines in the solution code.

The exercises run the gamut from simple parsing of static HTML to inspecting AJAX-heavy sites in which knowledge of the network panel is required to discover the JSON files to grab. In many of these exercises, the HTML-parsing is the trivial part – just a few lines to parse the HTML to dynamically find the URL for the zip or Excel file to download (via requests)…and then 40 to 50 lines of unzipping/reading/filtering to get the answer. That part is beyond what is typically considered “web-scraping” and falls more into “data wrangling”.

I didn’t sort the exercises on the list by difficulty, and many of the solutions are not particularly great code. Sometimes I wrote the solution as if I were teaching it to a beginner. But other times I solved the problem in the most randomly bizarre style relative to how I would normally solve it – hey, writing 100+ scrapers gets boring.

But here are a few representative exercises with some explanation:

1. Number of datasets currently listed on data.gov

I think data.gov actually has an API, but this script relies on finding the easiest tag to grab from the front page and extracting the text, i.e. the 186,569 from the text string, "186,569 datasets found". This is obviously not a very robust script, as it will break when data.gov is redesigned. But it serves as a quick and easy HTML-parsing example.
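
The text-to-number step can be sketched on its own, leaving out the page fetch and tag lookup. The input string is the one quoted above; parse_dataset_count is a hypothetical helper name and not the code from the exercise solution:

```python
import re


def parse_dataset_count(text):
    """Pull the first comma-grouped integer out of a string."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    # Drop the thousands separators before converting
    return int(match.group().replace(",", ""))


print(parse_dataset_count("186,569 datasets found"))  # 186569
```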

29. Number of days until Texas’s next scheduled execution

Texas’s death penalty site is probably one of the best places to practice web scraping, as the HTML is pretty straightforward on the main landing pages (there are several, for scheduled and past executions, and current inmate roster), which have enough interesting tabular data to collect. But you can make it more complex by traversing the links to collect inmate data, mugshots, and final words. This script just finds the first person on the scheduled list and does some math to print the number of days until the execution (I probably made the datetime handling more convoluted than it needs to be in the provided solution)
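
The date arithmetic itself can be kept short. This sketch assumes the scheduled date has already been scraped as a MM/DD/YYYY string, which is a guess about the page's format; the dates used are arbitrary examples, and days_until is a hypothetical helper name:

```python
from datetime import date, datetime


def days_until(date_text, today=None, fmt="%m/%d/%Y"):
    """Return the number of days from today to the given date string."""
    target = datetime.strptime(date_text, fmt).date()
    today = today or date.today()
    # Subtracting two dates gives a timedelta; .days is the whole-day count
    return (target - today).days


# Fixed 'today' so the result is reproducible
print(days_until("08/10/2016", today=date(2016, 7, 12)))  # 29
```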

3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days

The analytics.usa.gov site is a great place to practice AJAX-data scraping. It’s a very simple and robust site, but either you are aware of AJAX and know how to use the network panel (and in this case, locate ie.json), or you will have no clue how to scrape even a single number on this webpage. I think the difference between static HTML and AJAX sites is one of the tougher things to teach novices. But they pretty much have to learn the difference given how many of today’s websites use both static and dynamically-rendered pages.

6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees

There’s actually no HTML parsing if you assume the URLs for the data files can be hard coded. So besides the nominal use of the requests library, this ends up being a data-wrangling exercise: download two specific zip files, unzip them, read the CSV files, filter the dictionaries, then do some math.
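
The unzip, read, filter and compute steps can be sketched offline with in-memory zip files. Everything data-related here is invented for illustration: the column names, the rows and the helper names are placeholders, not the layout of the real California files:

```python
import csv
import io
import zipfile
from statistics import median


def make_zip(csv_text, name="data.csv"):
    """Build an in-memory zip so the sketch runs without downloading anything."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, csv_text)
    return buf.getvalue()


def median_from_zip(zip_bytes, column, keep):
    """Unzip, read the first CSV, keep rows where keep(row) is true, return the median."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        first_name = zf.namelist()[0]
        with zf.open(first_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = list(csv.DictReader(text))
    return median(float(row[column]) for row in rows if keep(row))


# Made-up rows standing in for the two yearly files
csv_2010 = "Entity,Type,Benefits\nA,City,100\nB,City,300\nC,County,900\n"
csv_2013 = "Entity,Type,Benefits\nA,City,150\nB,City,450\nC,County,950\n"


def is_city(row):
    return row["Type"] == "City"


before = median_from_zip(make_zip(csv_2010), "Benefits", is_city)
after = median_from_zip(make_zip(csv_2013), "Benefits", is_city)
print(after - before)  # 100.0
```

With real data, the bytes returned by requests.get(url).content would take the place of make_zip(...).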

90. The currently serving U.S. congressmember with the most Twitter followers

Another example with no HTML parsing, but probably the most complicated example. You have to download and parse Sunlight Foundation’s CSV of Congressmember data to get all the Twitter usernames. Then authenticate with Twitter’s API, then perform multiple batch lookups to get the data for all 500+ of the Congressional Twitter usernames. Then join the sorted result with the actual Congressmember identity. I probably shouldn’t have assigned this one.
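
The batching part of that workflow needs no API access to illustrate. This sketch assumes a limit of 100 names per lookup request, which should be checked against the API documentation; the usernames are placeholders and batches is a hypothetical helper name:

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder usernames standing in for the 500+ real ones
usernames = ["user%d" % i for i in range(535)]
groups = list(batches(usernames))
print(len(groups), [len(g) for g in groups])
# 6 [100, 100, 100, 100, 100, 35]
```

Each group would then be sent as one lookup request, and the responses concatenated before sorting by follower count.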

HTML is not necessary

I included no-HTML exercises because there are plenty of data programming exercises that don’t have to deal with the specific nitty-gritty of the Web, such as understanding HTTP and/or HTML. It’s not just that a lot of public data has moved to JSON (e.g. the FEC API) – but that much of the best public data is found in bulk CSV and database files. These files can be programmatically fetched with simple usage of the requests library.

It’s not that parsing HTML isn’t a whole boatload of fun – and being able to do so is a useful skill if you want to build websites. But I believe novices have more than enough to learn from in sorting/filtering dictionaries and lists without worrying about learning how a website works.

Besides analytics.usa.gov, the data.usajobs.gov API, which lists federal job openings, is a great one to explore, because its data structure is simple and the site is robust. Here’s a Python exercise with the USAJobs API; and here’s one in Bash.

There’s also the Google Maps geocoding API, which can be hit up for a bit before you run into rate limits, and you get the bonus of teaching geocoding concepts. The NYTimes API requires creating an account, but you not only get good APIs for some political data, but for content data (i.e. articles, bestselling books) that is interesting fodder for journalism-related analysis.

But if you want to scrape HTML, then the Texas death penalty pages are the way to go, because of the simplicity of the HTML and the numerous ways you can traverse the pages and collect interesting data points. Besides the previously mentioned Texas Python scraping exercise, here’s one for Florida’s list of executions. And here’s a Bash exercise that scrapes data from Texas, Florida, and California and does a simple demographic analysis.

If you want more interesting public datasets – most of which require only a minimum of HTML-parsing to fetch – check out the list I talked about in last week’s info session on Stanford’s Computational Journalism Lab.

Source URL :  http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/

Saturday, 9 July 2016

How to Avoid the Most Common Traps in Web Scraping?

A lot of industries successfully use web scraping to build massive banks of relevant, actionable data that can be used day to day to further business interests and offer better services to customers. However, web scraping has its own roadblocks and problems.

With automated scraping, you can run into many common problems. Web scraping spiders or programs present a distinctive pattern of behavior to the websites they target. Websites use this behavior to tell human users apart from scraping spiders and, based on those details, can employ certain traps to stop your efforts. Here are some of the most common traps:

How Can You Avoid These Traps?

Some measures you can take to avoid the common web scraping traps include:

• Cache the pages you have already crawled, so that you never need to load them again.
• Find out whether the website you are trying to scrape has any particular objections to web scraping tools.
• Scrape in moderate phases and take only the content you need.
• Take things slowly and do not flood the website with many parallel requests, which strain its resources.
• Try to minimize the load on every single website that you visit to scrape.
• Use a good web scraping tool that can save and test data, patterns and URLs.
• Spread your scraping over several IP addresses, or take advantage of VPN services and proxy servers. This helps reduce the risk of being trapped or blacklisted by a website.
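
Two of the measures above, caching crawled pages and slowing down requests, can be combined in a small wrapper. This is only a sketch under stated assumptions: PoliteFetcher is a hypothetical class, the 2-second default delay is an arbitrary choice, and a stand-in fetch function replaces real HTTP calls so the example runs offline:

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with a page cache and a minimum delay between requests."""

    def __init__(self, fetch, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_request = None

    def get(self, url):
        # Pages already crawled are never requested again
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_delay has passed since the last real request
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        page = self._fetch(url)
        self._cache[url] = page
        return page


# Stand-in fetch function that records which URLs were really requested
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>%s</html>" % url


fetcher = PoliteFetcher(fake_fetch, min_delay=0.01)
fetcher.get("http://www.example.com/a")
fetcher.get("http://www.example.com/a")  # served from the cache
fetcher.get("http://www.example.com/b")
print(calls)
```

In real use, the fetch argument could be something like lambda url: requests.get(url).text, and the delay should follow whatever the target site's terms or robots.txt ask for.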

Source URL :http://www.3idatascraping.com/category/web-data-scraping

Friday, 8 July 2016

ECJ clarifies Database Directive scope in screen scraping case

The European Court of Justice (ECJ) has clarified the scope of Directive 96/9/EC on the legal protection of databases (Database Directive) in a case concerning the extraction of data from a third party’s website by means of automated systems or software for commercial purposes (so-called 'screen scraping').

Flight data extracted

The case, Ryanair Ltd vs. PR Aviation BV, C-30/14, is of interest to a range of companies such as price comparison websites. It stemmed from Dutch company PR Aviation’s operation of a website where consumers can search through flight data of low-cost airlines (including Ryanair), compare prices and, on payment of a commission, book a flight. The relevant flight data is extracted from third parties’ websites by means of ‘screen scraping’ practices.

Ryanair claimed that PR Aviation’s activity:

• amounted to infringement of copyright (relating to the structure and architecture of the database) and of the so-called sui generis database right (i.e. the right granted to the ‘maker’ of the database where certain investments have been made to obtain, verify, or present the contents of a database) under the Netherlands law implementing the Database Directive;

• constituted breach of contract. In this respect, Ryanair claimed that a contract existed with PR Aviation for the use of its website. Access to the latter requires acceptance, by clicking a box, of the airline’s general terms and conditions which, amongst others, prohibit unauthorized ‘screen scraping’ practices for commercial purposes.

Ryanair asked Dutch courts to prohibit the infringement and order damages. In recent years the company has been engaged in several legal cases against web scrapers across Europe.

The Local Court, Utrecht, and the Court of Appeals of Amsterdam dismissed Ryanair’s claims on different grounds. The Court of Appeals, in particular, held that PR Aviation’s screen scraping of Ryanair’s website amounted to a “normal use” of said website within the meaning of the lawful user exceptions under Articles 6 and 8 of the Database Directive, which cannot be derogated from by contract (Article 15).

Ryanair appealed

Ryanair appealed the decision before the Netherlands Supreme Court (Hoge Raad der Nederlanden), which decided to refer the following question to the ECJ for a preliminary ruling: “Does the application of [Directive 96/9] also extend to online databases which are not protected by copyright on the basis of Chapter II of said directive or by a sui generis right on the basis of Chapter III, in the sense that the freedom to use such databases through the (whether or not analogous) application of Article[s] 6(1) and 8, in conjunction with Article 15 [of Directive 96/9] may not be limited contractually?”

The ECJ’s ruling

The ECJ (without the need for an opinion of the advocate general) ruled that the Database Directive is not applicable to databases which are not protected either by copyright or by the sui generis database right. Therefore, the exceptions to restricted acts set forth by Articles 6 and 8 of the Directive do not prevent the database owner from establishing contractual limitations on its use by third parties. In other words, the restrictions on the freedom to contract set forth by the Database Directive do not apply in cases of unprotected databases. Whether Ryanair’s website may be entitled to copyright or sui generis database right protection needs to be determined by the competent national court.

The ECJ’s decision is not particularly striking from a legal standpoint. Yet, it could have a significant impact on the business model of price comparison websites, aggregators, and similar businesses. Owners of databases that could not rely on intellectual property protection may contractually prevent extraction and use (“scraping”) of content from their online databases. Thus, unprotected databases could receive greater protection than the one granted by IP law.

Antitrust implications

However, the lawfulness of contractual restrictions prohibiting access and reuse of data through screen scraping practices should be assessed from an antitrust perspective. In this respect, in 2013 the Court of Milan ruled that Ryanair’s refusal to grant access to its database to the online travel agency Viaggiare S.r.l. amounted to an abuse of dominant position in the downstream market of information and intermediation on flights (decision of June 4, 2013 Viaggiare S.r.l. vs Ryanair Ltd). Indeed, a balance should be struck between the need to compensate the efforts and investments made by the creator of the database and the interest of third parties in being granted access to information (especially in those cases where the latter is not entitled to copyright protection).

Additionally, web scraping triggers other issues which were not considered in the ECJ’s ruling. These include, but are not limited to, trademark law (i.e., whether the use of a company’s names/logos by the web scraper without consent may amount to trademark infringement), data protection (e.g., in case the scraping involves personal data), and unfair competition.


Source URL :http://yellowpagesdatascraping.blogspot.in/2015/07/ecj-clarifies-database-directive-scope.html