Using Python to query data from Socrata
Feb 17, 2015
warning
This post is more than 5 years old. While math doesn't age, code and operating systems do. Please use the code/ideas with caution and expect some issues due to the age of the content. I am keeping these posts up for archival purposes because I still find them useful for reference, even when they are out of date!
I've started going to Open Oakland meetings on Tuesday nights. The group works on a variety of projects aimed at making city data more accessible and usable for Oakland citizens by building websites and apps. Check out Open Oakland Projects for more information on what's going on, and come to a meeting if you're interested in helping out.
One of the many tasks involved in getting an app up and running is obtaining the underlying data in some way. In this example, I'll focus on getting information about trees in Oakland from the Oakland data portal at data.oaklandnet.com. This portal is powered by Socrata, which provides an API (see SODA) for querying the data-- I'll do this using Python. To be clear, this type of query can be done for any data hosted at the data portal, so adapt away if there is different information that you want from data.oaklandnet.com, or any other Socrata-powered portal.
Now, let's focus on our specific example: getting information about trees in Oakland.

To do this query, we need the url for this specific dataset. This can be found by clicking on the Export button, then the SODA API tab. The API Endpoint has the url that we want:

https://data.oaklandnet.com/resource/4jcx-enxf.json

which returns data in json (JavaScript Object Notation) format. Again, this url corresponds to data about trees in Oakland, nothing else. Other data will have its own, unique url for queries.
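To get a sense of what a full request looks like before diving into the code, the query parameters described in the next paragraph are simply url-encoded and appended to that endpoint. A minimal sketch (the printed url is illustrative; the parameter order may differ):

import urllib

# build an example query url against the trees endpoint
api = "https://data.oaklandnet.com/resource/4jcx-enxf.json?"
args = {"$order": ":id", "$limit": 5, "$offset": 0}
print(api + urllib.urlencode(args))
# prints something like:
# https://data.oaklandnet.com/resource/4jcx-enxf.json?%24order=%3Aid&%24limit=5&%24offset=0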
We get the first 5 trees using the $limit and $order arguments to control the number and sorting of trees returned, and use $offset to page properly through later queries for trees 6-10, 11-15, etc. The urllib library encodes the arguments and we use urllib2 to make the call to the API. The raw form of the data is a list of json data structures-- the json library makes these data into dictionaries in Python (or strings that can be turned into dictionaries-- see example). The code below queries for tree data using the arguments we pass-- these arguments can be changed to get other information. The libraries used are available by default, at least in Python 2.7, so this should be easy for everyone to use.
from __future__ import print_function
import urllib, urllib2
import json

def get_trees(args):
    """Get trees using passed arguments."""
    api = "https://data.oaklandnet.com/resource/4jcx-enxf.json?"
    response_data = None
    try:
        # encode the query arguments and append them to the endpoint url
        url = api + urllib.urlencode(args)
        data = urllib2.urlopen(url)
        response_data = json.load(data)
        data.close()
    except urllib2.HTTPError, e:
        print("HTTP error: {}".format(e.code))
    except urllib2.URLError, e:
        print("Network error: {}".format(e.reason))
    return response_data
# make request for first five trees and order by id
args = {"$order": ":id", "$limit": 5, "$offset": 0}
five_trees = get_trees(args)
# print out the raw data
for n, tree in enumerate(five_trees):
    print("-- Tree {}:\n{}".format(n+1, tree), end='\n\n')
producing...
-- Tree 1:
{u'location_1': {u'latitude': u'37.80505999961946', u'needs_recoding':
False, u'human_address':
u'{"address":"1421","city":"Oakland","state":"Ca","zip":""}',
u'longitude': u'-122.27301999967312'}, u'wellwidth': u'4',
u'welllength': u'4', u'stname': {u'needs_recoding': False,
u'human_address': u'{"address":"FRANK H OGAWA
PZ","city":"","state":"","zip":""}'}, u'lowwell': u'None',
u'objectid': u'1', u'pareawidth': u'0', u'species': u'Platanus
acerifolia'}
-- Tree 2:
{u'location_1': {u'latitude': u'37.80505999961946', u'needs_recoding':
False, u'human_address':
u'{"address":"0","city":"Oakland","state":"Ca","zip":""}',
u'longitude': u'-122.27301999967312'}, u'wellwidth': u'5',
u'welllength': u'5', u'stname': {u'needs_recoding': False,
u'human_address': u'{"address":"11TH
ST","city":"","state":"","zip":""}'}, u'lowwell': u'None',
u'objectid': u'2', u'pareawidth': u'0', u'species': u'Platanus
acerifolia'}
-- Tree 3:
{u'location_1': {u'latitude': u'37.80505999961946', u'needs_recoding':
False, u'human_address':
u'{"address":"561","city":"Oakland","state":"Ca","zip":""}',
u'longitude': u'-122.27301999967312'}, u'wellwidth': u'3',
u'welllength': u'3', u'stname': {u'needs_recoding': False,
u'human_address': u'{"address":"22ND
ST","city":"","state":"","zip":""}'}, u'lowwell': u'Moderate',
u'objectid': u'3', u'pareawidth': u'0', u'species': u'Gleditsia
triacanthos'}
-- Tree 4:
{u'location_1': {u'latitude': u'37.80505999961946', u'needs_recoding':
False, u'human_address':
u'{"address":"3025","city":"Oakland","state":"Ca","zip":""}',
u'longitude': u'-122.27301999967312'}, u'wellwidth': u'3',
u'welllength': u'3', u'stname': {u'needs_recoding': False,
u'human_address':
u'{"address":"BROADWAY","city":"","state":"","zip":""}'}, u'lowwell':
u'Low', u'objectid': u'4', u'pareawidth': u'0', u'species': u'Photinia
sp'}
-- Tree 5:
{u'location_1': {u'latitude': u'37.80505999961946', u'needs_recoding':
False, u'human_address':
u'{"address":"1331","city":"Oakland","state":"Ca","zip":""}',
u'longitude': u'-122.27301999967312'}, u'wellwidth': u'4',
u'welllength': u'9', u'stname': {u'needs_recoding': False,
u'human_address': u'{"address":"HARRISON
ST","city":"","state":"","zip":""}'}, u'lowwell': u'None',
u'objectid': u'5', u'pareawidth': u'0', u'species': u'Lophostemon
confertus'}
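Since $offset handles the paging mentioned earlier, grabbing the next batches of trees is just a matter of bumping it by the page size. A minimal sketch reusing get_trees from above (the variable names are just for illustration):

# page through the data five trees at a time by advancing $offset
page_size = 5
for page in range(3):
    args = {"$order": ":id", "$limit": page_size, "$offset": page * page_size}
    trees = get_trees(args)
    if trees is None:
        break
    print("page {}: {} trees".format(page + 1, len(trees)))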
We can also extract just the information we want by pulling street number, street name and tree species from the query data:
for n, tree in enumerate(five_trees):
    # accessing the fields is fairly complicated
    st_num = eval(tree[u'location_1'][u'human_address'])['address']
    st_name = eval(tree[u'stname'][u'human_address'])['address']
    spec = tree[u'species']
    print("-- Tree {}: {}".format(n+1, spec))
    print("address: {} {}".format(st_num, st_name), end='\n\n')
producing...
-- Tree 1: Platanus acerifolia
address: 1421 FRANK H OGAWA PZ
-- Tree 2: Platanus acerifolia
address: 0 11TH ST
-- Tree 3: Gleditsia triacanthos
address: 561 22ND ST
-- Tree 4: Photinia sp
address: 3025 BROADWAY
-- Tree 5: Lophostemon confertus
address: 1331 HARRISON ST
Notice that I had to do some fairly complicated things to get at the street number and name. In particular, I had to use eval to create a Python dictionary from a string. As I did above, I suggest querying a small number of records and playing around with the data returned to get a feel for what is there.
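Since those human_address values are themselves json strings, json.loads (from the json library already imported above) is a safer alternative to eval for turning them into dictionaries. A small sketch of that approach:

# parse the nested human_address json string without eval
addr = json.loads(five_trees[0][u'location_1'][u'human_address'])
street = json.loads(five_trees[0][u'stname'][u'human_address'])
print("{} {}".format(addr['address'], street['address']))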
Some final notes:

- The default $limit is set to 1,000 records, with a maximum of 50,000. This means the full tree data set, which has 38,613 records, can be pulled with one query to the api:

  args2 = {"$limit": 40000, "$offset": 0, "$order": ":id"}
  all_data = get_trees(args2)

- Another Python example, using the pandas and bokeh packages (which you would have to install before running it), is available at Socrata Python example.
- Socrata has a broader, SQL-like query language called SoQL (Socrata Query Language) that allows more complicated queries. Check the links out for more of what is possible; a small example follows this list.
- The code for this example is available at this gist as ex004_socrata_api.py.
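As a taste of SoQL, the same get_trees function can take a $where filter. For example, restricting the query to one of the species seen above (the filter here is just an illustration-- any column in the dataset can be used):

# SoQL filter: only London plane trees, limited to 5 records
args = {"$where": "species = 'Platanus acerifolia'", "$limit": 5}
planes = get_trees(args)
for tree in planes:
    print(tree[u'objectid'], tree[u'species'])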
That's it for this post. As always, leave corrections and comments below.