22 April 2015

Install

If you haven’t already installed csvkit, start by installing it.

Get the Files

NYC’s Department of City Planning publishes incredibly useful property maps of NYC. Not for nothing, these are available on Socrata, but if you find them on NYC Open Data you’re way (way) better off going back to the agency that provides the data. Among other things, City Planning provides clear context for their data.

Today the link for the most up-to-date MapPLUTO data is http://www.nyc.gov/html/dcp/html/bytes/dwn_pluto_mappluto.shtml but that may change. Download the CSV format. You’ll see why in a moment.

Set Up

For everyone’s sanity, I’d like everyone to use the same file and folder names. So …

  1. Download the data (or copy it from Baker)
  2. Make a folder called “pluto”
  3. Move the data into that folder and unzip it
  4. Open a terminal from that folder.
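If you would rather do the set up from the terminal, here is a rough sketch of the same steps. The archive name and the Downloads location are assumptions made for illustration; swap in whatever your download is actually called.

```shell
# Hypothetical archive name and download location; change both to match your download.
zip_path="$HOME/Downloads/pluto_csv.zip"

# Make the folder (no error if it already exists).
mkdir -p pluto

# Move and unzip only if the archive is actually where we expect it.
if [ -f "$zip_path" ]; then
  mv "$zip_path" pluto/
  (cd pluto && unzip "$(basename "$zip_path")")
fi

# Step into the folder and confirm where we are.
cd pluto
pwd
```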

Getting Around

Open your terminal. It is probably in your Applications folder under Utilities. But you might be able to right click (or control click) and launch the terminal right from a directory. Most of these commands should be familiar from your homework.

  • ls should show you a list of files. If it doesn’t, use pwd to find out what directory you are in.
  • If you aren’t in your pluto folder, use your finder to figure out where that folder is, and use cd to move into it.
  • use man wc to figure out what wc does (wc --help also works on Linux, but not with the wc that ships with macOS)
  • wc -l BK.csv and wc -l *.csv – what do you think the * does?
  • head BK.csv or tail BK.csv (what do head and tail do?)
  • what does split do? Split is incredibly useful if you find yourself in over your head. You can at least open a chunk of the data and figure out if it is even something you can work with.
  • split -l 5000 MN.csv Manhattan_chunks
  • Use wc -l Manhattan* to see the length of each.
  • Play with tab completion. What happens if you just type Man and then press Tab?
  • We actually don’t need those, so do ls Manhattan* and see what that gets you, then do rm Manhattan* to delete the files. I always use ls as a sanity check when I’m about to delete files, especially when I’m using a wildcard to catch more than one file.
  • du -h ./*.csv – this gives us the size of each of these files; the -h flag asks for a human-readable format. So these are big files, even without the polygons.
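If you want to try these commands without touching the real data, here is a small practice run. It is only a sketch: sample.csv and its contents are made up, and everything happens in a throwaway directory.

```shell
# Work in a scratch directory so nothing here touches your real pluto folder.
workdir=$(mktemp -d)
cd "$workdir"

# Build a stand-in CSV: one header row plus 12 data rows (13 lines total).
echo "id,value" > sample.csv
for i in $(seq 1 12); do
  echo "$i,row$i" >> sample.csv
done

# Count the lines, then peek at both ends of the file.
wc -l sample.csv
head -n 3 sample.csv
tail -n 3 sample.csv

# Split into pieces of at most 5 lines; each piece is named sample_chunks plus a suffix.
split -l 5 sample.csv sample_chunks

# 13 lines at 5 per piece gives 3 pieces (5 + 5 + 3).
chunk_count=$(ls sample_chunks* | wc -l | tr -d ' ')
echo "pieces: $chunk_count"

# Check what the wildcard matches before deleting with the same pattern.
ls sample_chunks*
rm sample_chunks*

# Only sample.csv should be left.
remaining=$(ls | wc -l | tr -d ' ')
echo "files left: $remaining"
```

The ls before the rm is the sanity check: the two commands take the same pattern, so whatever ls lists is exactly what rm will delete.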

csvkit

First question: let’s get a sense of what is in each of these files. Try running

csvcut -n BK.csv 

How many columns are there?

If we only want to work with the following columns:

  • LandUse
  • OwnerType
  • ZoneDist
  • AssessTot
  • ExemptTot
  • Council
  • ZipCode
  • Address
  • CD
  • Lot
  • Block
  • XCoord
  • YCoord

What column numbers are we working with?

  • LandUse 26
  • OwnerType 28 (Try csvcut -n BK.csv | grep Own)
  • ZoneDist
  • AssessTot 55
  • ExemptTot 57
  • Council 8
  • ZipCode 9
  • Address 12
  • CD 4
  • Lot 3
  • Block 2
  • XCoord 72
  • YCoord 73
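To get a feel for where those numbers come from, here is a rough stand-in for csvcut -n built from plain shell tools. It is not how csvkit works internally, and it breaks if a column name contains a comma, so stick with csvcut -n for real files. The demo.csv header is made up.

```shell
# Scratch directory and a made-up header with only five columns;
# the real files have many more.
workdir=$(mktemp -d)
cd "$workdir"
printf 'Borough,Block,Lot,CD,OwnerType\nMN,1,10,101,P\n' > demo.csv

# Take the header row, put each name on its own line, and number the lines.
head -n 1 demo.csv | tr ',' '\n' | nl

# Same pipeline, filtered down to names containing "Own".
own_number=$(head -n 1 demo.csv | tr ',' '\n' | nl | grep Own | awk '{print $1}')
echo "OwnerType is column $own_number"
```

The grep step is the same trick as csvcut -n BK.csv | grep Own: list everything, then filter the list.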

So let’s check that:

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 MN.csv

Oh man oh man oh man, there’s a ton of crap spewing down my screen like I’m caught in the matrix!!

That’s cool. We can wait. Your computer is actually working. If you need it to stop for some reason, you can always press Ctrl+C, which will stop the command running in the terminal. If you let csvcut run its course, it will stop eventually. This Matrix-looking mess on your screen is called stdout. Sometimes that’s all you want – to see the output of a command. But sometimes what you want (need, even) is to store the output in a file for later.

Redirection

We’re not going to go deep into this, but one of the nice things I can do at the command line is take the output of a command and, instead of writing it to stdout (aka the screen), redirect it: | hands the output to another command, and > writes it to a file.

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 MN.csv | head

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 MN.csv > smaller_MN.csv

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 BX.csv > smaller_BX.csv

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 SI.csv > smaller_SI.csv

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 QN.csv > smaller_QN.csv

csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 BK.csv > smaller_BK.csv

That is going to take a few minutes. That’s okay.
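Typing the same command five times is error prone, so here is one way to write it as a loop. This is a sketch that assumes the five borough files are in the current folder and that csvkit is installed; any file that is missing gets skipped.

```shell
# The column list used in the commands above.
cols="26,28,55,57,8,9,12,4,3,2,72,73"

for boro in BK BX MN QN SI; do
  # Skip any borough file that is not in the current folder.
  [ -f "$boro.csv" ] || continue

  # Same csvcut call as before, with the output redirected to a new file.
  csvcut -c "$cols" "$boro.csv" > "smaller_$boro.csv"
  echo "wrote smaller_$boro.csv"
done
```

Keeping the column list in one variable means a typo only has to be fixed in one place.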

Now try du -h ./*.csv – we’ve already shrunk our files way, way down.

That’s huge. These are now much, much more manageable files. And in this case, we are only interested in Vacant Land. From the data dictionary, I know that that’s “LandUse” code 11. So we can use another command, csvgrep to find a pattern in the data:

csvcut -n smaller_MN.csv

csvgrep -c 1 -m 11 smaller_MN.csv > vacant_MN.csv

Stacking

Now, I actually used this:

csvstack -g BK,BX,MN,QN,SI smaller_BK.csv smaller_BX.csv smaller_MN.csv smaller_QN.csv smaller_SI.csv > FiveBoro.csv

to get a single file for all five boros.

And then I used csvgrep to find the vacant lots. Check the column numbers first, because csvstack -g adds a grouping column at the front:

csvcut -n FiveBoro.csv

csvgrep -c 2 -m 11 FiveBoro.csv > vacant_FiveBoro.csv
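LandUse was column 1 in smaller_MN.csv and column 2 in FiveBoro.csv, which is an easy thing to trip over. csvgrep also accepts a column name in place of a number, which sidesteps the shift. Here is a small sketch on made-up files (the demo_ names and their rows are invented for illustration); it skips itself if csvkit is not installed.

```shell
# Scratch directory with two tiny made-up files standing in for
# smaller_MN.csv and smaller_BK.csv.
workdir=$(mktemp -d)
cd "$workdir"
printf 'LandUse,Address\n11,1 MAIN ST\n01,2 MAIN ST\n' > demo_MN.csv
printf 'LandUse,Address\n11,3 MAIN ST\n04,4 MAIN ST\n' > demo_BK.csv

if command -v csvstack >/dev/null 2>&1; then
  # -g adds a grouping column in front, so LandUse moves from column 1 to column 2.
  csvstack -g MN,BK demo_MN.csv demo_BK.csv > demo_stacked.csv
  csvcut -n demo_stacked.csv

  # Selecting by name works the same before and after stacking.
  csvgrep -c LandUse -m 11 demo_stacked.csv > demo_vacant.csv

  # Count data rows (everything after the header line).
  vacant_rows=$(tail -n +2 demo_vacant.csv | wc -l | tr -d ' ')
  echo "vacant rows: $vacant_rows"
else
  echo "csvkit is not installed; skipping the demo"
fi
```

One caveat: -m looks for the string anywhere in the value, so it is worth eyeballing the output to make sure nothing unexpected matched.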

TK TK TK

LandUse: Only 11 (vacant land)

OwnerType:

  • P (Private Ownership)
  • BLANK (possibly privately-owned)
  • I want to combine C, M, O, and X into another category, which is ownership that’s mixed or public

ZoneDist: I am primarily concerned with lots currently listed as

  • R1-1 - R10H (Residential)
  • M1-1/R5 - M1-6/R10 (Mixed Manufacturing and Residential)

Then I want to combine all the other zoning categories:

  • C1-6 - C8-4
  • M1-1 - M3-2
  • ZNA
  • ZR 11-151

Walk-thru 2: DOB Complaints

If that was fun (or, even if it wasn’t) and you want to do more, check out another good run through:

https://github.com/amandabee/CUNY-SOJ-data-storytelling/wiki/Tutorial:-CSVkit

Notes that need revising.

CSVkit

We’re going to use a little Python program called csvkit

pip install --user csvkit

Experimenting

Using the complete 311 files on Baker

Use wc -l and split

and less -N (or = while in less)

head and tail





CUNY Graduate School of Journalism

© Spring 2015 Amanda Hickman