By Aimee Morgan (Engineering Fellow, Hackbright Academy – summer 2013 class)
Last week was the much-anticipated start of full-time work on our individual projects. Although there were a few moments of terror on Monday morning (“where do I start?”), it was mostly a very good week. I exceeded my own expectations in terms of what I got done and had a lot of fun doing it. I feel a little weird about saying that – I know that it’s still early on and I haven’t had time to get sick of my project yet.
I expect that at some point over the next few weeks the honeymoon period will end and I’ll hit a wall (or five). But for now, things are good. My general stress level is far lower now that I’m free to work at my own pace and decide on my own tasks for any given day.
Some accomplishments so far:
1. Developed a new data model for the NYPL menu collection dataset.
2. Loaded all of the data into a PostgreSQL database. Since I was reconfiguring the data model, this wasn’t as simple as importing the CSV (comma-separated values) files I got from NYPL – one of those files was split between two database tables, and a third table was cobbled together out of columns from two different files.
3. Also, in the “valuable life skills” department: learned how to dump a backup of my database, then reimport it. (More on this later.)
4. Set up an ORM using SQLAlchemy and wrote some basic methods for retrieving information from the database.
5. Decided for sure that I will use Flask for my web app and deploy on Heroku. Read a lot of Heroku documentation, which is surprisingly good. Impressed with Heroku so far and I think it will work well for my beginner-level deployment needs.
6. Worked through large chunks of the Flask mega-tutorial – we used Flask for several exercises earlier this month but just barely scratched the surface.
7. Read a whole lot about natural language processing (mostly this) and took a lot of notes on how I might use NLP techniques on my data.
8. Started work on a Python script to normalize / de-duplicate the database table containing information on restaurants. Since the restaurants table includes a column that serves as a foreign key in the menus table (so that menus are linked to the restaurants that issued them), I can’t just go in there and delete rows without updating the corresponding information in the menus table to point to the authoritative version.
9. Started compiling lexicons to use in the data processing functions (for example: if a dish contains one or more words that appear in a particular list, it is a dish that contains meat).
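The de-duplication in item 8 can be sketched roughly like this. It is a simplified illustration rather than the real script: it uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs anywhere, and the table and column names (restaurants, menus, restaurant_id), the sample rows, and the "lowercase and collapse whitespace" normalization rule are all assumptions made up for the example.

```python
import sqlite3
from collections import defaultdict


def normalize(name):
    """Collapse case and whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())


def dedupe_restaurants(conn):
    """Merge duplicate restaurants, repointing menus before deleting rows."""
    # Group restaurant ids by their normalized name.
    groups = defaultdict(list)
    for rid, name in conn.execute("SELECT id, name FROM restaurants ORDER BY id"):
        groups[normalize(name)].append(rid)

    for ids in groups.values():
        # Assumption: the lowest id is treated as the authoritative row.
        keeper, duplicates = ids[0], ids[1:]
        for dup in duplicates:
            # Repoint the foreign key first, so no menu is left orphaned...
            conn.execute(
                "UPDATE menus SET restaurant_id = ? WHERE restaurant_id = ?",
                (keeper, dup),
            )
            # ...and only then delete the duplicate restaurant row.
            conn.execute("DELETE FROM restaurants WHERE id = ?", (dup,))
    conn.commit()


def build_demo():
    """Create a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE menus (
            id INTEGER PRIMARY KEY,
            restaurant_id INTEGER NOT NULL REFERENCES restaurants(id)
        );
        INSERT INTO restaurants VALUES
            (1, 'Hotel Astor'), (2, 'hotel  astor'), (3, 'Delmonico''s');
        INSERT INTO menus VALUES (10, 1), (11, 2), (12, 3);
    """)
    return conn
```

Calling dedupe_restaurants(build_demo()) leaves two restaurant rows and moves menu 11 over to restaurant 1. The order of operations is the point: with foreign key enforcement on, deleting the duplicate before updating the menus table would fail.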
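The lexicon idea in item 9 might look something like the sketch below. The word list and the function names (MEAT_WORDS, tokenize, contains_meat) are hypothetical placeholders, and a real lexicon would be much longer.

```python
import re

# Hypothetical lexicon for illustration only; a real one would be far longer.
MEAT_WORDS = {"beef", "chicken", "ham", "lamb", "pork", "veal"}


def tokenize(dish_name):
    """Lowercase a dish name and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", dish_name.lower()))


def contains_meat(dish_name, lexicon=MEAT_WORDS):
    """Return True if any whole word in the dish name appears in the lexicon."""
    return not tokenize(dish_name).isdisjoint(lexicon)
```

Matching on whole words rather than substrings is a deliberate choice here: contains_meat("Roast Beef au Jus") is True, while contains_meat("Graham Crackers") is False even though "graham" contains "ham".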
There was one major frustration this week, but not one that was directly related to my project: yesterday morning I had to wipe out my hard drive and reinstall everything from scratch. Let this be a lesson to you: if you choose the lazy method of setting up a Windows 7 / Ubuntu dual boot system (in which your hard drive is not actually repartitioned and your Linux install lives as one giant 40 GB file in a Windows directory), it will eventually come back to haunt you. By which I mean, Windows will spontaneously eat your Linux install. And when that happens, not even a spouse with expert-level Linux skills will be able to help you.
Since I am a well-trained Hackbright student, I’ve been pushing all my project work to GitHub, so nothing was lost. (If anyone is interested, my GitHub is at https://github.com/aimeemorgan; almost everything I’ve done this summer is there.) This is where the “learning to restore a Postgres database from a dump” came in.
This disaster prompted me to finally part ways with Windows and go Ubuntu-only, so I suppose it’s a net positive. I’d been keeping Windows 7 around because there is proprietary software for my camera that I prefer to any of the Linux alternatives, but my husband just built a Windows box for gaming so I can use that when I need it.
My concerns for the coming weeks:
1. Natural language processing is a huge timesuck. And I mean that in the best possible way; I find it 100% fascinating. If I don’t watch myself, I’ll spend all of the next four weeks working on that and end up with no user interface whatsoever.
2. Another huge timesuck: playing with database queries. For example, I was fascinated to discover that the word “local” appears in descriptions of menu items only 78 times, while “imported” appears 2833 times – no doubt because the dataset is dominated by menus that predate the locavore movement.
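A query like the “local” versus “imported” comparison could be run along these lines. This is only a sketch under stated assumptions: it uses sqlite3 instead of PostgreSQL, the menu_items table, its description column, and the four sample rows are invented for the example, and it counts rows whose description contains the word as a substring rather than counting every individual occurrence.

```python
import sqlite3


def count_items_mentioning(conn, word):
    """Count menu items whose description contains `word`, ignoring case.

    Note: this is a substring match, and % or _ in `word` act as wildcards.
    """
    pattern = f"%{word.lower()}%"
    row = conn.execute(
        "SELECT COUNT(*) FROM menu_items WHERE LOWER(description) LIKE ?",
        (pattern,),
    ).fetchone()
    return row[0]


def build_sample():
    """Create an in-memory table with a few made-up menu items."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE menu_items (id INTEGER PRIMARY KEY, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO menu_items (description) VALUES (?)",
        [
            ("Imported Swiss cheese",),
            ("Imported sardines on toast",),
            ("Local oysters, half shell",),
            ("Coffee",),
        ],
    )
    conn.commit()
    return conn
```

On the sample rows, count_items_mentioning(conn, "imported") gives 2 and count_items_mentioning(conn, "local") gives 1. The numbers from the real dataset depend on exactly how matches are counted, so a query like this would not necessarily reproduce them.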
Looking forward to another week of hacking on this project, although lots of other Hackbright events will keep me away from the keyboard. We’ve got field trips to SurveyMonkey, Facebook, and Google on the calendar. And tomorrow morning is a workshop on negotiating salaries. I’m so glad Hackbright has chosen to incorporate negotiation skills into the curriculum – let’s just say that a career spent in academic libraries has not prepared me well for that kind of thing.
This post originally appeared on Aimee Morgan’s blog.