Important note! Before you turn in this lab notebook, make sure everything runs as expected. In particular, make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE."
We hope the preceding exercise was painful: even with tools to help process HTML, downloading raw web pages and extracting information from them is rough going!
Can you think of any other reasons why scraping websites for data in this way is not a good idea?
Luckily, many websites provide an application programming interface (API) for querying their data or otherwise accessing their services from your programs. For instance, Twitter provides a web API for gathering tweets, Flickr provides one for gathering image data, and GitHub provides one for accessing information about repository histories.
These kinds of web APIs are much easier to use than the preceding technique, which scrapes raw web pages and then has to parse the resulting HTML. Moreover, they are more scalable, in the sense that the web servers can transmit structured data in a less verbose form than raw HTML.
As a starting example, here is some code to look at the activity on GitHub related to the public version of our course's materials.
import requests

response = requests.get('https://api.github.com/repos/cse6040/labs-fa17/events')
print("==> .headers:", response.headers, "\n")
Note the Content-Type of the response:
print(response.headers['Content-Type'])
The response is in JSON format, which is an open format for exchanging semi-structured data. (JSON stands for JavaScript Object Notation.) JSON is designed to be human-readable and machine-readable, and maps especially well in Python to nested dictionaries. Let's take a look.
See also this tutorial for a JSON primer. JSON is among the most common formats for sharing data on the web; see, for instance, https://www.sitepoint.com/10-example-json-files/.
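To see the mapping concretely before looking at the real response, here is a tiny, made-up JSON string and its Python equivalent. The field names below are modeled loosely on GitHub's user records but are chosen just for illustration:

import json

# JSON objects become dicts, arrays become lists, and JSON
# strings/numbers/booleans/null become str/int-or-float/bool/None.
doc = '{"login": "octocat", "id": 583231, "site_admin": false, "plan": null}'
record = json.loads(doc)   # parse JSON text into Python objects
print(type(record))        # <class 'dict'>
print(record['login'])     # 'octocat'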
import json

print(type(response.json()))
print(json.dumps(response.json()[:3], sort_keys=True, indent=2))
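Since each record is a nested dictionary, you can explore it with ordinary indexing. A quick sketch, assuming the response contains at least one event:

# Drill into the first record, assuming at least one event came back.
events = response.json()
if events:
    first = events[0]
    print(list(first.keys()))  # top-level fields of one event record
    print(first['type'])       # e.g., 'PushEvent' or 'WatchEvent'
    print(first['actor'])      # the nested 'actor' sub-dictionary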
Exercise 0. It should be self-evident that the JSON response above consists of a sequence of records, which we will refer to as events. Each event is associated with an actor. Write some code to extract a dictionary of all actors, where the key is the actor's login and the value is the actor's URL.
def extract_actors(json_github_events):
    """Given JSON records for events in a GitHub repo,
    returns a dictionary of the actors and their URLs.
    """
    urls = {}
    for event in json_github_events:
        actor = event['actor']['display_login']
        url = event['actor']['url']
        urls[actor] = url
    return urls
actor_urls = extract_actors(response.json())
for actor, url in actor_urls.items():
    print('{}: {}'.format(actor, url))
    assert url == "https://api.github.com/users/{}".format(actor)
Exercise 1. Write some code that goes to each actor's URL and determines their name. If an actor URL is invalid, that actor should not appear in the output.
def lookup_names(actor_urls):
    """Given a dictionary of (actor, url) pairs, looks up the JSON at
    the URL and extracts the user's name (if any). Returns a new
    dictionary of (actor, name) pairs.
    """
    names = {}
    for actor, url in actor_urls.items():
        response = requests.get(url)
        # Possible error conditions: the request failed, the response
        # is not JSON, or the record has no 'name' field.
        if not response.ok:
            continue
        if 'application/json' not in response.headers['Content-Type']:
            continue
        user = response.json()
        if 'name' not in user:
            continue
        names[actor] = user['name']
    return names
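One practical caveat: GitHub rate-limits unauthenticated API requests (at the time of writing, 60 per hour per IP address), so a loop like the one above can fail partway through on a large input. If you have a GitHub personal access token, you can raise the limit by sending it in the request headers. A minimal sketch, where `MY_TOKEN` is a hypothetical placeholder you would supply yourself:

import requests

# Sketch only: MY_TOKEN is a hypothetical placeholder for a GitHub
# personal access token; requests sent with it count against a much
# higher authenticated rate limit.
MY_TOKEN = '...'  # <-- supply your own token here
headers = {'Authorization': 'token {}'.format(MY_TOKEN)}
response = requests.get('https://api.github.com/rate_limit', headers=headers)
print(response.json()['rate'])  # shows your current limit and remaining calls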
actor_names = lookup_names(actor_urls)
for actor, name in actor_names.items():
    print("{}: {}".format(actor, name))
assert actor_names['rvuduc'] == 'Rich Vuduc (personal account)'
That's the end of this notebook. Processing JSON is fairly straightforward, because it maps very naturally to nested dictionaries in Python. You might search the web for other sources of JSON data, including this one, and do your own processing!
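If you want to keep what you have collected, the same `json` module also writes Python dictionaries back out as JSON. A small sketch, assuming the `actor_names` dictionary from Exercise 1 and a hypothetical output filename:

import json

# Assumes `actor_names` from Exercise 1; 'actors.json' is a
# hypothetical output filename chosen for this example.
with open('actors.json', 'w') as fp:
    json.dump(actor_names, fp, sort_keys=True, indent=2)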