segal.go(u)ld

putting the "u" in segal-gould

automator, technologist, digital humanist, and developer

Redesigned Course Catalog Project: The Server

Context

My initial goal in the summer of 2017 was for the combined features of the course catalog parser and server to come together as my senior project. I aimed to implement improved search functionality as well as a scheduling system for finding courses that fit within one's own schedule. Because I did not ultimately make this my senior project, I only implemented a limited number of these features.

Drawing on my experience with the Twitter JSON API, I knew I wanted to use the Flask web development microframework to build a JSON API of my own. JSON is readable by both humans and machines, and I really liked the idea of some developer one day wanting to use this service to make an application of their own. If such a day ever came, my API would be of use to them.

MongoDB offered the kind of searchable database that seemed to scale and interface well with Flask, so I read enough documentation to get it working. Some of the backend queries for sorting results were particularly difficult to figure out.

The responsive templates I set up for use with Flask were not intended to be permanent solutions. I have very limited web design experience, but making my site look better than the official one was not particularly difficult. Some day soon, I would like to give the site a more modern feel.

Features

The site is live at courses.segal-gould.com. If your browser can handle the loading time, check out all 4608 courses currently in the database at courses.segal-gould.com/all.

Semesters

Browse the list of all semesters' courses available in the database. Currently the oldest semester available is Fall 2014 due to limitations in the parser.

Then browse all the courses taught in each department during that semester. The names of these departments are the same as the names used to identify them in the URLs of the official course list.

Departments

Another way to browse by department is to ignore each semester and view everything in the database from each department. For instance, there are currently 142 anthropology courses in the database across all semesters.

Professors

If you are looking for a particular course instructor, you can browse the list of all professors in the database of courses. Right now, 577 professors are in the database. Duplicates exist because often professors collaborate with each other to instruct a course, and the system treats the names of multiple professors as one single item.

Locations

Browse by all classrooms. Sometimes courses meet in different locations on different days of the week for lab sessions.

Course Codes

The official course list utilizes a system of course codes formatted as three- or four-letter department identifiers followed by three- or four-digit course identifiers (e.g. "CMSC 360"). You can browse all course codes and see which courses have used them.
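As a rough illustration of that format, a course code can be checked with a regular expression. This is only a minimal sketch based on the description above: the pattern, the single-space separator, and the `parse_course_code` helper are assumptions made for illustration and are not taken from the project's code.

```python
import re

# Three or four letters, a space, then three or four digits (e.g. "CMSC 360").
# The single space separator is an assumption based on the example above.
COURSE_CODE = re.compile(r"^([A-Z]{3,4}) (\d{3,4})$")


def parse_course_code(code):
    """Return (department, number) for a well-formed code, else None."""
    match = COURSE_CODE.match(code.strip().upper())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_course_code("CMSC 360"))
print(parse_course_code("not a code"))
```

Splitting a code into its department and number this way would make it straightforward to group or sort courses by either part.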

Distributions

Bard College requires that students complete a specific number of credits from several distributions. In recent years, the institution modified its naming system for distribution requirements from using four characters (e.g. "AART") to using two characters (e.g. "AA"). I developed the redesigned course catalog to work with both naming systems. One course often fulfills multiple requirements, so you will find duplicates.

You can browse by all courses which fulfill requirements within the old distributions naming system.

And you can browse using the new distributions naming system.

Users

I implemented a system for registering accounts with passwords safely hashed using bcrypt. You can log in after registering and also add or remove courses from your list of favorites. All users may view the list of users as well as all their favorite courses.

Sorting

All lists of courses can be sorted alphabetically by the following criteria:

Search

You can use the search bar on almost any page to look for keywords, descriptions, course titles, scheduled meeting times, and any other information associated with courses. For example, here are the results for a search for "machine" among all computer science courses.

RateMyProfessors Integration

Every course has a link in its description which will take you to a RateMyProfessors search for its instructor. I am not affiliated with RateMyProfessors.

Course Registration Numbers

If you need to know a course's CRN for clerical purposes, the redesigned course catalog lets you access a list of all courses which use a specific CRN through a link in every course table. Although course registration numbers are five digits, the official course catalog still has duplicates.

Application Programming Interface

The JSON API is totally free and currently has no rate limits. Each JSON response is an object in which the string "result" is mapped to an array of course objects. To access the API result for any redesigned course catalog page, just insert /api/ into the URL directly after the domain name. For example, the page for a specific course registration number is normally found at courses.segal-gould.com/crn/91761. The API result for the same content is at courses.segal-gould.com/api/crn/91761.
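The URL rule above can be sketched in a few lines of Python. This is a minimal sketch, not part of the site's code: `to_api_url` is a hypothetical helper, and the sample response is made up purely to show the documented shape, so the fields inside the course object are placeholders.

```python
import json


def to_api_url(page_url):
    """Insert "api/" as the first path segment of a catalog page URL."""
    scheme, sep, rest = page_url.partition("://")
    if not sep:
        # No scheme given (e.g. "courses.segal-gould.com/crn/91761").
        scheme, rest = "", page_url
    host, _, path = rest.partition("/")
    return f"{scheme}{sep}{host}/api/{path}"


# A made-up response in the documented shape: "result" maps to an array of
# course objects. The fields inside the course object are placeholders.
sample = '{"result": [{"title": "Example Course", "crn": "91761"}]}'
courses = json.loads(sample)["result"]

print(to_api_url("courses.segal-gould.com/crn/91761"))
print(len(courses))
```

A client would request the URL returned by `to_api_url` and read the array stored under "result" in the same way.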

Future Features to Implement

I managed to implement most of the features I had in mind over that one summer. Since then, I have gotten some nice feedback on the site and have had a few more ideas about what I'd like to add to the project.

With something like Selenium automating the gathering of data on the required textbooks for currently offered courses, it should be possible to match those results with official course titles. It would be really easy to find out what books to purchase for a class if Amazon links and ISBNs were available directly from the course list.

I would like to add some means of leaving comments on courses, preferably written by verified course instructors. In my experience, details such as class sizes can vary in how set in stone they are depending on the professor. Maybe additional information such as a course's syllabus could be left as a comment as well.

An interactive map could serve to direct students to their classrooms, and schedule-based searches would be really convenient. Under the official system, there is no easy way to identify courses which fit within one's own schedule.

Read more about the documentation and code for this project on GitHub. If you're interested in what I chose to pursue as my actual senior project, you can read up about that on GitHub as well.

Redesigned Course Catalog Project: The Parser

Context

In June 2017 I was preparing to enter my senior year at Bard College. Students majoring in computer science at Bard are expected to pursue a research-based senior project within the area of expertise of their adviser. My adviser, Professor Sven Anderson, encouraged me to do some work over the summer of 2017 in preparation for the coming semester.

In Professor Anderson's Mobile Application Development course from Fall 2017, my partner and I were assigned to develop a health and fitness tracker for Bard students. It would use the dining commons menu, which featured caloric data and serving sizes, to make informed suggestions about users' fitness routines and meals.

I communicated with the official dining services staff because I hoped to gain access to their API with which they pushed daily menu updates to their web site, but nobody I spoke to could confirm the existence of any API at all. Meanwhile, just observing the HTML of their site was enough to tell me that there was some "public" API being utilized in the background.

Because the application depended on daily menu access, I developed a system using Selenium to automatically scrape menu items on a daily basis and Flask to serve them in a JSON API. With that experience in web automation and scraping, I knew I'd be returning to web development.

Project

The official Bard College course list is built upon a web reporting platform called WebFOCUS from around 2003. Compared to competing liberal arts institutions, its design and features are lacking. My intention was not initially to redesign the course list, and instead I just meant to serialize it for my own records. As a self-described datahoarder, I like having backups.

I went so far as to try manually transcribing the online course lists for each semester into JSON format, and after around 40 hours of copy-pasting I decided there had to be a better solution. BeautifulSoup is a Python library for getting data from awful web pages. I knew its limits and decided it would be of use in this project.

By far, the most frustrating part of utilizing BeautifulSoup to parse the course list is that there is little consistency in the format of the tables containing course data, both across semesters and often within individual pages. In 2014 a new naming system for course distribution fulfillments was introduced, further complicating the mess of old HTML tags and table dimensions that vary widely due to differences in scheduling format.

Because the course list works using embedded iframes which contain the individual "course lists" for each department per semester, I wrote the program to take as its input text files each containing a list of individual course list URLs. This way, all the course list pages which used the same distributions naming system and scheduling format could be downloaded and parsed separately from those which did not.
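The input side of that design might look something like the following. It is a minimal sketch, assuming one URL per line in each text file; the helper names, the file name, and the example.com URLs are all hypothetical and not taken from the project.

```python
import tempfile
from pathlib import Path


def read_url_list(path):
    """Return the non-empty lines of a text file, one URL per line."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_batches(paths):
    """Map each input file name to its URLs so batches stay separate."""
    return {Path(p).name: read_url_list(p) for p in paths}


# Demo with a temporary file; the name and URLs are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "new_format.txt"
    demo.write_text(
        "https://example.com/a\n\nhttps://example.com/b\n", encoding="utf-8"
    )
    batches = load_batches([demo])

print(batches)
```

Keeping one file per combination of naming system and scheduling format means each batch can be handed to whichever parsing rules match it.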

Soon after starting my senior year, I chose to put this on hold in favor of another topic for my senior project. I came back to it in August 2018 following my graduation from Bard College because in just that single year, further inconsistencies broke the parser. I made some fixes, but one problem remains that would necessitate a complete rewrite of the way in which the parser accesses individual course descriptions from course list pages.

The parser was designed to take entire HTML pages as its input and serialize each course within each page into a JSON object. Unfortunately, because some course list pages no longer utilize the same table format across all courses they contain, the parser fails on pages with those inconsistencies. Thankfully, that only accounts for around 30 courses out of over 3900 in total from Fall 2014 to Fall 2018.
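To make the failure mode concrete, here is a minimal sketch of turning table rows into JSON objects. It is not the project's parser: it uses Python's built-in `html.parser` as a stand-in for BeautifulSoup, the three-column layout in `COLUMNS` is a hypothetical simplification, and the sample page is made up. Unlike the parser described above, it sets aside a row with the wrong number of cells rather than failing on the whole page.

```python
import json
from html.parser import HTMLParser

# Hypothetical column layout; the real tables vary by semester.
COLUMNS = ["code", "title", "professor"]


class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def serialize_courses(html):
    """Return (courses, skipped_rows) for one course list page."""
    collector = RowCollector()
    collector.feed(html)
    courses, skipped = [], []
    for row in collector.rows:
        if len(row) == len(COLUMNS):
            courses.append(dict(zip(COLUMNS, row)))
        else:
            skipped.append(row)  # inconsistent row: set aside, do not crash
    return courses, skipped


# Made-up page: the second row is missing a cell on purpose.
page = """
<table>
  <tr><td>CMSC 360</td><td>Example Title</td><td>Example Professor</td></tr>
  <tr><td>CMSC 361</td><td>Row missing a cell</td></tr>
</table>
"""
courses, skipped = serialize_courses(page)
print(json.dumps({"result": courses}, indent=2))
print(len(skipped), "row(s) skipped")
```

Collecting the mismatched rows separately would at least make it easy to see which courses need manual attention.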

Feel free to read more about this project's code and documentation on GitHub.