"Learn to code," they said. So that is what I did. This blog post is about my first foray into using an API for data science, and describes in detail the issues that I encountered, as well as the code that I used. I hope that it may be helpful to someone else out there who is just starting out.

This Transmission Is Coming to You

The first thing I did was open the API technical documentation. I explored the endpoints and call options. I took a look at the sample Python code that was provided for making a request to the API. It looked like this:

from urllib.request import Request, urlopen  # Python 3
req = Request('https://gs-api.greatschools.org/nearby-schools?distance=50')
req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXX')
content = urlopen(req).read()

The above code does three things:

  1. Assigns to the variable "req" a Request object (from Python's built-in "urllib.request" module, not the third-party "requests" library), pointed at the API URL
  2. Adds a header containing the API authentication key to that request
  3. Sends the request with urlopen(), reads the response body, and assigns the result to the variable "content"

Since Python is the only language I knew at the time, I took the suggested code, sent it out into the web, and to my astonishment, something came back! I printed the variable "content" to see what came back, and lo and behold, a set or a dictionary with the key "schools" appeared, and a value for that key, which in turn was a list or an array of some kind, which contained another dictionary. It looked like this:

b'{"schools":[{"universal-id":"3605242","nces-id":"00941886","state-id":"660102997771","name":"The Karafin School","school-summary":"The Karafin School, a private school located in Mount Kisco, NY, serves grades 9-12 in the .","type":"private","level-codes":"h","level":"9,10,11,12","street":"Po Box 277","city":"Mount_Kisco","state":"NY","fipscounty":36119,"zip":"10549","phone":"(914) 666-9211","fax":null,"county":"Westchester","lat":41.204262,"lon":-73.727074,"district-name":null,"district-id":0,"web-site":null,"distance":0.3482603427492015},{"universal-id":"3607320"...

I wasn't sure what I was looking at. I ran the type() function on it, which returned "bytes" (not a data type I had ever heard of). I didn't know what to do with all of these nested dictionaries, and I started to panic a little bit, since I had a deadline. I googled around for help, and I found that Pandas comes with a method called .read_json(), which appeared to be exactly what I wanted. "These days so many tech things are fully automated, so maybe it will just work nicely and not be any trouble," I thought to myself with the optimism of inexperience.
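For anyone hitting the same wall: a "bytes" object is just raw, undecoded data, and Python's built-in json module will happily turn JSON bytes into a plain dictionary. A tiny sketch, using a made-up, heavily trimmed stand-in for the response:

```python
import json

## a hypothetical, trimmed-down version of the API's response bytes
content = b'{"schools": [{"universal-id": "3605242", "name": "The Karafin School"}]}'

print(type(content))        ## bytes, straight off the wire

data = json.loads(content)  ## json.loads accepts bytes and returns Python objects
print(type(data))           ## now an ordinary dict
print(data['schools'][0]['name'])
```
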

So I tried it:

data = pd.read_json(content)

Pandas.read_json() did not do what I thought it would do, and what some of the internet blogs told me it would do. Instead of a nice Pandas dataframe, with dictionary key:value pairs displayed as columns and values respectively, I got a mangled dataframe with only one column, the singular column index "schools," and all of the data I wanted so desperately, mashed together in a series of nested dictionaries. It looked like this (except much longer):

0 {'universal-id': '3605242', 'nces-id': '009418...
1 {'universal-id': '3607320', 'nces-id': 'A01079...
2 {'universal-id': '3601714', 'nces-id': '361995...
3 {'universal-id': '3601716', 'nces-id': '361995...
4 {'universal-id': '3615676', 'nces-id': None, '...
5 {'universal-id': '3608063', 'nces-id': 'A03023...
6 {'universal-id': '3601711', 'nces-id': '361995...

Pandas Attack

My goal as a newbie data scientist was to ingest data from the API, and ultimately to get that data into a Pandas dataframe, where I could use all of the data science tools in my familiar toolbox. But there I was, with my data in a dataframe (or rather a Pandas series), which was totally useless to me.

Should I iterate over the Pandas series somehow? Write a for loop like "for index, row in df.iterrows():" that pulls out each nested dictionary in my one column and builds another array elsewhere out of the key:value pairs? In any case, I knew I would somehow have to break out the nested dictionaries. But the more I thought about it, the more ridiculous it seemed to convert an array to a dataframe, only to turn it back into an array again. I concluded that the point of intervention had to come before Pandas ever entered the stage.

With the little knowledge I had of data structures, it was clear to me that the JSON was already some kind of array-like structure, which could yield columns and values, if I just knew how to elicit this structure from JSON. What if instead of reading the JSON into Pandas as a first step, I tried to parse the JSON first, and then (and only then) convert it to a Pandas dataframe thereafter?
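That idea, parse first and Pandas last, can be sketched in a few lines. The response bytes here are hypothetical stand-ins shaped like the real one:

```python
import json
import pandas as pd

## hypothetical response bytes, shaped like the API's (heavily trimmed)
content = (b'{"schools": ['
           b'{"universal-id": "3605242", "name": "The Karafin School", "city": "Mount Kisco"},'
           b'{"universal-id": "3607320", "name": "Another School", "city": "Bedford"}]}')

records = json.loads(content)['schools']  ## parse JSON first -> list of flat dicts
df = pd.DataFrame(records)                ## then one column per key, one row per school
```

With the parsing done up front, pd.DataFrame() has nothing left to guess at, and every key becomes its own column.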

So that's what I did. Googling, I soon found a great tutorial video on how to parse JSON from a good samaritan, John Watson Rooney, on YouTube.

I wrote a for loop to parse the JSON bytes, which I planned to insert into my API request. That way the JSON would get parsed before ever becoming a Pandas dataframe, and the parsing would happen automatically for every response from the API. I would initiate a blank master list or use a list comprehension, parse the JSON, append all of the parsed arrays together into a master list, and finally convert the master list via good ol' pd.DataFrame(). The parse code looked like this:

def parse_json(response):
    school_list = []
    for item in response['schools']:
        school = {'id': item['universal-id'],
                  'name': item['name'],
                  'city': item['city'],
                  'distance': item['distance']}
        school_list.append(school)
    return school_list
Example of a Parse JSON Script
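A quick way to sanity-check a parser like this is to feed it a hand-made response dictionary. The function is repeated here in a minimal form (fewer fields than the real one) so the snippet runs on its own:

```python
def parse_json(response):
    school_list = []
    for item in response['schools']:
        ## keep only the fields we care about, as a flat dict
        school = {'id': item['universal-id'], 'name': item['name']}
        school_list.append(school)
    return school_list

## hand-made stand-in for one API response
fake_response = {'schools': [{'universal-id': '3605242', 'name': 'The Karafin School'}]}
parsed = parse_json(fake_response)
```
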

No Pagination, No "Next-URL", No Nothin'

Luckily I soon encountered new problems, which would dwarf my old problems!

Once inserted into my API request, the parse_json function did just that! It parsed the JSON. Each JSON array corresponding to each nested dictionary from each API response was appended to a master list, and the master list converted into a beautiful Pandas dataframe. All of the columns I wanted were there! Yay! No sooner did I catch a taste of victory than I noticed that my dataframe only had 50 rows in it, whereas I had run an API request for all schools within a 30 mile radius of my hometown. I knew intuitively that the geographic area I queried should have thousands of schools at the very least. Something was very wrong!

Here is what the code looked like:

def variable_request(baseurl):
    mstr_list = []

    variable_url = "https://gs-api.greatschools.org/nearby-schools?lat=41.205069&lon=-73.720461&page=0&limit=50&distance=30"

    r = requests.get(variable_url, headers=header)
    raw = r.json()

    mstr_list.extend(parse_json(raw))
## add parsed json data to master list

    return mstr_list
results = variable_request('')

Back to the API documentation I went. There it clearly stated that there was a 50-results-per-page limit: to get the first 50 results, you would issue the query with page=0&limit=50, and to get the next 50 results, the same query with page=1&limit=50.

All of the API tutorials that I found online assumed that the API response would contain pagination information or "next url" information (including John Watson Rooney's tutorial, which I referenced above). Initially I looked for code that I could adapt to the API I was using, but everything I came across assumed that the API itself would provide more information than mine did.

Since I would not be able to pull pagination numbers from the request response, I would need to create my own. I could see right away that I would need to write a script which would increase the page number with each iteration, and would continue to iterate until there were no more results left. How actually to accomplish this was the tricky part. At first, I tried putting an if-statement into my formula. Something like "if the API response is a good one, then keep going": if r equals "[200]", then do your thing, else stop. It didn't work. I wasn't sure what data type this little "[200]" even was, or whether I was writing my if-statement correctly.
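In hindsight, the confusing "[200]" is just how the requests library prints a Response object; the number you can actually compare against lives in the integer attribute status_code. A small illustration, building a Response by hand so no network call is needed:

```python
import requests

## construct a Response object by hand (normally requests.get() returns one)
resp = requests.Response()
resp.status_code = 200

print(repr(resp))        ## the "[200]" I saw is just the object's repr
print(resp.status_code)  ## the real, comparable value is an int
print(resp.ok)           ## True for any status code below 400
```

So a status check would be "if r.status_code == 200:" rather than comparing against the printed "[200]".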

Further googling and an email to the API support people, and I concluded that I should write the inverse of an if-statement. Rather than telling the code to run if a certain condition was true, I would tell it to just run until it didn't get any response at all. Using "while True" is a dangerous proposition, since it can easily become an infinite loop, and I was honestly a bit concerned about blowing something up, but I put a "break" in it, which I hoped would stop it from exploding.
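The shape of that loop, run forever but escape on an empty page, can be sketched with a fake paginated source standing in for the real API:

```python
## toy "API": two pages of results, then an empty page
fake_pages = [
    {'schools': [{'name': 'A'}, {'name': 'B'}]},
    {'schools': [{'name': 'C'}]},
    {'schools': []},
]

def fetch(page):
    ## stand-in for requests.get(...).json()
    return fake_pages[page]

results = []
page = 0
while True:
    raw = fetch(page)
    if raw['schools'] == []:
        break              ## the escape hatch that keeps "while True" finite
    results.extend(raw['schools'])
    page += 1
```
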

Here is what the working API request code ultimately looked like:

def variable_request(baseurl):
    page = 0
    header = {'X-API-Key': 'XXXXXXXXXXXXXXXXX'}
    mstr_list = []
## initiate function at page 0. define header with API authentication key and value dictionary object

    while True:
        variable_url = f'https://gs-api.greatschools.org/nearby-schools?lat=41.205069&lon=-73.720461&page={page}&limit=50&distance=30'
## print variable URL to ensure that pagination is working

        r = requests.get(variable_url, headers=header)
        raw = r.json()
## while loop to repeat api request as long as request returns data, then parse json data as defined in parse_json.

        if raw['schools'] == []:
            print("end of results")
            break
## break loop when api no longer returns any schools in raw json data

        mstr_list.extend(parse_json(raw))
        page += 1
## add parsed json data to master list, extend master list through each iteration, increase value of page by 1

    return mstr_list
results = variable_request('')

If you look at the "while True:" block of code, you'll see that the URL (which gets passed into the API request requests.get() in the next block) is encased in an "f-string", which is just another syntax for .format(), and that page={page} is set as the formatted piece. This is the crucial part of the code, which allows it to cycle through pagination, when the API results themselves do not contain any page numbers.
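To see that the two spellings really are equivalent, with a made-up page number:

```python
page = 7
base = 'https://gs-api.greatschools.org/nearby-schools?page={page}&limit=50'

## .format() fills the {page} placeholder explicitly...
via_format = base.format(page=page)
## ...while an f-string fills it inline at the point of definition
via_fstring = f'https://gs-api.greatschools.org/nearby-schools?page={page}&limit=50'

print(via_fstring)  ## both URLs contain page=7
```
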

The above code does the following:

  1. The variable "page" gets initiated at a value of zero
  2. The URL for the API call is surrounded by an f-string, which means that the {page} retrieves the value from #1 and incorporates it into the URL
  3. Validation headers for the API key are defined
  4. The API response is converted to JSON and assigned to variable raw
  5. Raw API response data is passed to "parse_json", the function we defined earlier
  6. If the "schools" list in the JSON response is empty, the code prints "end of results" and breaks out of the loop
  7. The parsed JSON array gets appended to the master list, which we initiated as blank
  8. The page variable is increased by a value of one
  9. Finally, the function returns the master list of all parsed JSON data

To an experienced data munger, the script I wrote is probably quite elementary. I assume that it is like API 101, which would make sense, because it was my first ever API call. Nonetheless my joy was not in any way diminished when the script worked, and I ran pd.DataFrame(mstr_list) and got back a beautiful pristine dataframe with thousands upon thousands of schools in it. I was ecstatic!

Do Not Pass Go, Except...

Do not worry, dear reader, that all the fun has come to an end. Not by any stretch! There was much more fun to be had in my first ever API call experience. That is because I needed more data. I now had a master list of schools, with all of the basic identifying information for each school (ID#, name, address, etc.) and school rating. But I wanted more! In order to get student demographic information and teacher pay metrics, I would need to run another API call, this time using the ID# for each school in the request.

As before, I first ran a test call manually. I called the API using one ID number from my master list, and using "Request()" instead of "requests.get()". To my utter astonishment (again) the API returned data. It worked! As before, I felt rushed to get my data into a Pandas dataframe, I guess because it was my digital safe space. If my first request taught me anything, it should have been this: Stay out of Pandas! Pandas is a wrapper, a window dressing of sorts, which takes an array and displays it all pretty, and includes some helpful methods. Only go to Pandas when all of your data is ready!

But hard lessons take time to sink in, and this was now only my second ever API call. Also, this time the request to the API would be notably different. Instead of sending a call with an endpoint that would return thousands of rows of data, I would be calling the API with just one ID# at a time, and it would return data for just that one ID#. Intuitively, this difference made me feel like I could safely convert each response to a one-row Pandas dataframe as the first step, and then append all of the dataframes together! It just felt like it would be more manageable.

But before I could try gluing all of these little Pandas frames together, I would need to fix what I was looking at, first. Because the one-row dataframe was not exactly one row. In fact, it was kind of messed up. Every column contained a column index and values, except for one. One column contained what I thought should be column indexes, except the values were arranged within a separate column. Given my Excel background, it looked like a pivot table to me, which I desired to unpivot.

The dataframe looked something like this (just with more columns, obv):

ethnicity_name      enrollment   teacher_salary   ethnicity_percent
African_American    513          102000           .10
Caucasian           513          102000           .26
Hispanic            513          102000           .64

You can see that every single value in every column, except for "ethnicity_percent" is duplicative (I will explain further down the post why the data from the API came back like this). I wanted to "unstack" the first column (ethnicity_name) into three additional columns, and fill the values of those columns with the values from "ethnicity_percent." In other words, I wanted the final dataframe to look like this:

enrollment   teacher_salary   African_American   Caucasian   Hispanic
513          102000           .10                .26         .64

With my familiar Pandas toolbox, I could readily think of the following two ways of doing the unstacking in Pandas, both of which seemed extremely hacky to me:

  1. pd.get_dummies and then map on the original names column to the values column
  2. turn the 'ethnicity_name' column into an index, and unstack, and then map values by passing a dictionary to the columns

Besides being hacky workarounds in Pandas, the problem with the above approach would be automation. Remember that I had thousands of schools that I needed to pull data for, and by extension, I would need to repeat these steps for every single school. Thus I would need to write a script which would create a variable mapping dictionary out of the values in "ethnicity_percent" and map them to the columns I created. And that was before I even had all of the data I needed. At first I went to Stack Overflow with my question, and a good samaritan of the internets suggested I use .pivot():

df.pivot(['enrollment', 'teacher_salary'], 'ethnicity_name', 'ethnicity_percent')
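Rebuilt on the toy numbers from the tables above, the suggestion looks like this. Note that newer versions of Pandas require pivot's arguments to be passed as keywords, so the positional one-liner is spelled out here:

```python
import pandas as pd

df = pd.DataFrame({
    'ethnicity_name':    ['African_American', 'Caucasian', 'Hispanic'],
    'enrollment':        [513, 513, 513],
    'teacher_salary':    [102000, 102000, 102000],
    'ethnicity_percent': [0.10, 0.26, 0.64],
})

## the duplicated columns become the index, ethnicity names become new columns,
## and the percentages fill in the values
wide = df.pivot(index=['enrollment', 'teacher_salary'],
                columns='ethnicity_name',
                values='ethnicity_percent').reset_index()
```
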

Pivoting the dataframe was enticing, because it would not require writing a variable mapping dictionary, nor would it require creating new columns. Given my Excel background, the data that confronted me in this strange little dataframe seemed to beg to be unpivoted, as I mentioned before. So I added this pivot to my code, and it worked! I had the dataframe I wanted. Now I just needed to figure out how to append a few thousand Pandas dataframes together into one big one, and I would be well on my way. This is what I wrote:

def demographics(universal_id):
    demo_mstr = []
    for item in universal_id:
        id = item
        req = Request(f'https://gs-api.greatschools.org/schools/{id}/metrics')
        req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXX')
        content = urlopen(req).read()
        data = pd.read_json(content)
        data['id'] = id
        data = data.pivot(['enrollment',
            'average-salary','id'], 'index', 'ethnicity')
        demo_mstr.append(data)
    return demo_mstr

The kind and caring people of the internets did not like the idea of gluing Pandas dataframes together. Everywhere I looked, it seemed, whether it was StackOverflow or blogs, suggested that gluing dataframes together was bad code, which would eat up unnecessary computing resources, and would take more time. Everyone suggested that I append the underlying arrays together first, before ever so much as thought about Pandas at all. So that is what I did! And boy, did it not work. Not at all. I kept getting errors related to the dimensions of my array.

That's when it dawned on me that I would need to parse JSON again. It also dawned on me that I had made the same mistake as the first time, and that I should stay out of Pandas for as long as possible. It was a lesson now firmly tattooed into my brain, and one that I surely would not forget. So that is what I did! I wrote another parse_json() function, and I added an ID column, using an f-string to receive the ID# for each school. It looked like this:

def demographics(universal_id):
    demo_mstr = []
    for item in universal_id:
        id = item
        req = Request(f'https://gs-api.greatschools.org/schools/{id}/metrics')
        req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXX')
        content = urlopen(req).read()
        raw = parse_json(content, id)
        demo_mstr.extend(raw)
    return demo_mstr

When I ran the function on a handful of ID numbers, it worked! I was thrilled. So I fed it a test list of more ID#'s, and it stopped working. I got the following error: "HTTPError: HTTP Error 422." I googled around, and it quickly became clear that a 422 ("Unprocessable Entity") error meant the API could not return data for the request. So I began to manually check each ID#, and soon enough I found an ID that returned a 422. Not all of the schools from the master list have correlative demographic data!

While I was very much a beginner, I knew conceptually that I would need to wrap my for-loop in some kind of wrapper, which would tell it to just skip over any inputs that did not yield a response. In a short amount of time, I found many sources telling me to use something called try/except (often paired with "pass"). Which worked! Sort of. The code worked for a while, but then I started getting some strange errors again. Here is the code:

### MSTR API REQUEST FOR DEMOGRAPHICS DATA with dummy list of numbers ###

demo_mstr = []
numbers = [3607320,3615676,3601716]
for item in numbers:
    id = item
    variable_url = f'https://gs-api.greatschools.org/schools/{id}/metrics'
    try:
        r = requests.get(variable_url, headers=header)
        raw = r.json()

        list_b = parse_json([raw],f'{id}')
        demo_mstr.extend(list_b)

    except Exception:
        pass
## skip over any ids that return an error

Elliott King turned out to be a saint, who was very eager to help me with my code, out of the goodness of his own heart. I tried to offer to pay him for his time, but he flatly refused. Elliott asked me to add "Exception as e:" and "print(e)" to my code, so that the code would print an error report, which would tell me what it wasn't liking. It printed the word "Asian," which had to be somewhere in the parse_json.

Variable Dictionaries

I sifted through the raw JSON (variable name is "raw") and I could see the issue right away, given that one crucial clue from print(e). Some schools recorded Asian students in a category called "Asian," while other schools recorded their Asian students in a category called, "Asian or Pacific Islander." Not only did some schools not have correlative demographics data, but they recorded their data differently!

The other minor epiphany I had, as I sifted through the JSON: The data contained a nested dictionary. Unlike the first API call, in which the entirety of each and every response was a nested dictionary, in this case, only the "ethnicity" categories were contained in a nested dictionary. Suddenly I understood why Pandas pivoted on "ethnicity." It was trying to deal with a nested dictionary, and I guess how it does that is to create a pivot table. I made a mental note of that for the future, and moved on.

Again, I understood conceptually what I would need to achieve to get my code up and running. I had to parse JSON in such a way that it would anticipate all differences in how ethnicity categories had been recorded. But I didn't know how I could possibly know ahead of time what every category would be. And the only way to know for sure would be to manually run an API call for every single school, and make a note of any differences, which would obviously take way too long. Here I was truly stumped, and Elliott King came to my rescue.

Let's look at the parse_json code:

## define function to parse json data, to retrieve only the desired columns, which will be nested in API call##

def parse_json(response,id):
    """Parse json response data, from an API call on id numbers, and convert json array data into a dictionary"""
    demographics_list = []
    for item in response:
        demographics = {
            'id': id,
            'enrollment': item['enrollment'],
            'average-salary': item['average-salary'],
            **item['ethnicity'],
#             'ethnicity':item['ethnicity'],
#             'White': item['ethnicity']['White'],
#             'Hispanic': item['ethnicity']['Hispanic'],
#             'Asian or Pacific Islander': item['ethnicity']['Asian'],
#             'Two or more races': item['ethnicity']['Two or more races'],
#             'African American': item['ethnicity']['African American'],
#             'Native American': item['ethnicity']['Native American'],
#             'Native Hawaiian or Other Pacific Islander':item['ethnicity']['Native Hawaiian or Other Pacific Islander'],
        }
        demographics_list.append(demographics)
    return demographics_list

Here I've kept all of the code that I wrote, but I've commented out the code that didn't work with hashtags, so that you can see the difference. All of the ethnicity categories that I tried to retrieve from the nested dictionary under "ethnicity" were dropped, and were replaced by one beautiful little line of code: **item['ethnicity']. These two little asterisks did the magic trick of telling Python to just give me a column for every item it encounters under "ethnicity." I can only assume that this trick that Elliott King showed me will come in handy in major ways in the future.
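The double-asterisk is Python's dictionary-unpacking operator: inside a dict literal, **d splices every key:value pair of d into the new dictionary, whatever those keys happen to be. A standalone illustration with made-up numbers:

```python
## hypothetical parsed record with a nested "ethnicity" dictionary
item = {
    'enrollment': 513,
    'average-salary': 102000,
    'ethnicity': {'White': 0.26, 'Hispanic': 0.64, 'African American': 0.10},
}

flat = {
    'enrollment': item['enrollment'],
    **item['ethnicity'],  ## ** splices every key under 'ethnicity' into this dict
}
```

Because the keys are spliced in at runtime, the code no longer needs to know ahead of time whether a school reports "Asian" or "Asian or Pacific Islander."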

One last little hiccup before the final working code would run came when calling the parse_json() function inside of the master request for demographic data. Unlike the first parse_json() function, I was passing two arguments to the function instead of just one. For whatever reason, this confused the function as to what sort of data it was receiving in each argument. Rather than passing the variable "raw" to parse_json(), Elliott suggested I try passing [raw], which worked!
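The reason [raw] works is worth spelling out: parse_json loops with "for item in response", and looping over a bare dictionary yields its keys (strings), while looping over a one-element list yields the dictionary itself. A tiny demonstration:

```python
raw = {'enrollment': 513, 'ethnicity': {'White': 0.26}}

keys = [item for item in raw]    ## iterating a dict yields its KEYS (strings)
rows = [item for item in [raw]]  ## iterating [raw] yields the whole dict, once
```

So wrapping the response in a list makes the loop see one complete record instead of a handful of key strings.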

Here is the final working code:


demo_mstr = []
numbers = schools['id']
for item in numbers:
    id = item
    variable_url = f'https://gs-api.greatschools.org/schools/{id}/metrics'
    try:
        r = requests.get(variable_url, headers=header)
        raw = r.json()

        list_b = parse_json([raw],f'{id}')
        demo_mstr.extend(list_b)

    except Exception as e:
        print(e)

All that was left to do after this code ran was to convert demo_mstr into a Pandas dataframe using pd.DataFrame(demo_mstr). And voila! My first ever API request had worked, and I had all of the data I needed. Then the real fun could begin!