Get the Files
NYC’s Department of City Planning publishes incredibly useful property maps of NYC. Not for nothing, these are available on Socrata, but if you find them on NYC Open Data you’re way (way) better off going back to the agency that provides the data. Among other things, City Planning provides clear context for their data.
Today the link for the most up-to-date MapPLUTO data is http://www.nyc.gov/html/dcp/html/bytes/dwn_pluto_mappluto.shtml, but that may change. Download the CSV version. You’ll see why in a moment.
For everyone’s sanity, I’d like everyone to use the same file and folder names. So …
- Download the data (or copy it from Baker)
- Make a folder called “pluto”
- Move the data into that folder and unzip it
- Open a terminal from that folder.
Open your terminal. It is probably in your Applications folder under Utilities. But you might be able to right click (or control click) and launch the terminal right from a directory. Most of these commands should be familiar from your homework.
- `ls` should show you a list of files. If it doesn’t, use `pwd` to find out what directory you are in.
- If you aren’t in your `pluto` folder, use Finder to figure out where that folder is, and use `cd` to move into it.
- `wc --help` to figure out what `wc` does
- `wc -l BK.csv` and `wc -l *.csv`: what do you think the `*` does?
- `tail BK.csv` (what does `tail` do?)
- What does `split` do? Split is incredibly useful if you find yourself in over your head. You can at least open a chunk of the data and figure out if it is even something you can work with.
- `split -l 5000 MN.csv Manhattan_chunks`
- `wc -l Manhattan*` to see the length of each.
- Play with tab completion. What happens if you just type `Man` and then press Tab?
- We actually don’t need those, so do `ls Manhattan*` and see what that gets you, then do `rm Manhattan*` to delete the files. I always use `ls` as a sanity check when I’m about to delete files, especially when I’m using a wildcard pattern like `*` to catch more than one file.
- `du -h ./*.csv`: this gives us the size of each of these files, and `-h` puts the sizes in a human-readable format. So these are big files, even without the polygons.
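If you’d rather rehearse the split, count, check, delete routine on something tiny before pointing it at the real data, here is a minimal sketch. The file name `toy.csv` and the `toy_chunks_` prefix are made up for illustration, and everything happens in a throwaway directory.

```shell
# Work in a throwaway directory so nothing real gets deleted.
workdir=$(mktemp -d)
cd "$workdir"

# Build a toy 12-line file (a stand-in for MN.csv; the name is made up).
seq 1 12 > toy.csv

# Split it into pieces of at most 5 lines each: toy_chunks_aa, toy_chunks_ab, ...
split -l 5 toy.csv toy_chunks_

# Count the lines in each piece.
wc -l toy_chunks_*

# Sanity check with ls first, then delete using the very same pattern.
ls toy_chunks_*
chunk_count=$(ls toy_chunks_* | wc -l | tr -d ' ')
rm toy_chunks_*
```

The pattern handed to `rm` is exactly the one handed to `ls` a moment earlier, so whatever `ls` listed is what gets deleted, and the original `toy.csv` is left alone.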
First question: let’s get a sense of what is in each of these files. Try running
csvcut -n BK.csv
How many columns are there?
If we only want to work with the following columns, what column numbers are we working with?
- LandUse 26
- OwnerType 28 (Try `csvcut -n BK.csv | grep Own`)
- AssessTot 55
- ExemptTot 57
- Council 8
- ZipCode 9
- Address 12
- CD 4
- Lot 3
- Block 2
- XCoord 72
- YCoord 73
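If you’re curious what a numbered column listing boils down to, you can get roughly the same thing with standard tools. This is only a sketch: `toy.csv` and its five columns are invented, and the trick assumes a simple header with no quoted commas, which is exactly the kind of thing csvkit handles properly for you.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A made-up header plus one row; the real files have many more columns.
printf 'Borough,Block,Lot,OwnerType,OwnerName\nMN,1,10,P,EXAMPLE OWNER\n' > toy.csv

# Take the header line, put each column name on its own line, and number the lines.
head -n 1 toy.csv | tr ',' '\n' | grep -n ''

# Narrow the numbered list down to the names containing "Own".
own_columns=$(head -n 1 toy.csv | tr ',' '\n' | grep -n 'Own')
echo "$own_columns"
```

On this toy header the last command prints `4:OwnerType` and `5:OwnerName`, which is the same kind of answer the `grep Own` hint above is after.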
So let’s check that:
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 MN.csv
Oh man oh man oh man, there’s a ton of crap spewing down my screen like I’m caught in the matrix!!
That’s cool. We can wait. Your computer is actually working. If you need it to stop for some reason, you can always press Ctrl+C, which will stop the command running in the terminal. If you let `csvcut` run its course, it will stop eventually. This Matrix-looking mess on your screen is called `stdout`. Sometimes that’s all you want – to see the output of a command. But sometimes what you want (need, even) is to store the output in a file for later.
We’re not going to go deep into this, but one of the nice things I can do at the command line is take the output of a command and, instead of writing it to `stdout` (aka the screen), redirect it. A pipe (`|`) hands the output to another command, like `head`, and a `>` writes it to a file.
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 MN.csv | head
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 MN.csv > smaller_MN.csv
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 BX.csv > smaller_BX.csv
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 SI.csv > smaller_SI.csv
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 QN.csv > smaller_QN.csv
csvcut -c 26,28,55,57,8,9,12,4,3,2,72,73 BK.csv > smaller_BK.csv
That is going to take a few minutes. That’s okay.
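Typing the same command five times works, but a shell loop can do the repetition for you. The sketch below shows the preview-then-redirect pattern on made-up three-column files, and it swaps in plain `cut` as a stand-in for `csvcut` so it runs without csvkit. That swap is only safe because the toy data has no quoted commas; with the real files you would keep the `csvcut -c ...` command inside the loop.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up three-column files standing in for the five borough CSVs.
for boro in BK BX MN QN SI; do
  printf 'Block,Lot,LandUse\n1,10,11\n2,20,01\n' > "$boro.csv"
done

# Piping to head previews the first lines on screen without saving anything.
cut -d, -f1,3 MN.csv | head -n 2

# Redirecting with > writes each result to a new file instead of the screen.
# (cut is a stand-in here; the walkthrough's csvcut command would go in its place.)
for boro in BK BX MN QN SI; do
  cut -d, -f1,3 "$boro.csv" > "smaller_$boro.csv"
done

ls smaller_*.csv
```

One difference to keep in mind: `cut` always emits columns in file order, whereas `csvcut -c` emits them in the order you list them, which is why LandUse ends up first in the real smaller files.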
du -h ./*.csv – we’ve already shrunk our files way, way down.
That’s huge. These are now much, much more manageable files. And in this case, we are only interested in Vacant Land. From the data dictionary, I know that that’s “LandUse” code 11. So we can use another command, `csvgrep`, to find a pattern in the data:
csvcut -n smaller_MN.csv
csvgrep -c 1 -m 11 smaller_MN.csv > vacant_MN.csv
Stacking
Now, I actually used this:
csvstack -g BK,BX,MN,QN,SI smaller_BK.csv smaller_BX.csv smaller_MN.csv smaller_QN.csv smaller_SI.csv > FiveBoro.csv
to get a single file for all five boros.
And then I used `csvgrep` to find the vacant lots. The boro label that `csvstack -g` adds becomes the first column, so LandUse is now column 2:
csvcut -n FiveBoro.csv
csvgrep -c 2 -m 11 FiveBoro.csv > vacant_FiveBoro.csv
TK TK TK
LandUse: Only 11 (vacant land)
OwnerType:
- P (Private Ownership)
- BLANK (possibly privately-owned)
- I want to combine C, M, O, and X into another category, which is ownership that’s mixed or public
ZoneDist: I am primarily concerned with lots currently listed as
- R1-1 - R10H (Residential)
- M1-1/R5 - M1-6/R10 (Mixed Manufacturing and Residential)
Then I want to combine all the other zoning categories:
- C1-6 - C8-4
- M1-1 - M3-2
- ZNA ZR 11-151
Walk-thru 2: DOB Complaints
If that was fun (or, even if it wasn’t) and you want to do more, check out another good run through:
Notes that need revising.
We’re going to use a little Python program called csvkit
pip install --user csvkit
Using the complete 311 files on Baker
wc -l and
less -N (or
= while in