Notes on Using Refine
Once you’ve installed OpenRefine, you may see an error like “… is damaged and can’t be opened. You should move it to the Trash,” the solution to is not at all intuitive: You have to Change your Privacy and Security settings to allow applications downloaded from “Anywhere” – it’s on the “general” tab.
Refine runs a little server that you can access in your browser. It will always be at http://127.0.0.1:3333 once it is running.
When you upload a file, you have to click “Create Project” in the top right to get started. You have a bunch of “parsing” options about how Refine should treat data. I haven’t really plumbed those, but that should not stop you from doing so.
Louisiana
The state publishes a complete list of all 6844 Statewide Public Officials Download the data and pivot to get a count by “Office Title”. You can see very quickly that there’s kind of a problem.
So in refine, we’ll clean it up.
- Use the pull down next to Office Titleand selectFacet > Text Facet
- Find the cluster button. Review, Select All,Merge Selected and Re-Cluster.
- Why aren’t there new matches?
- Look under Edit Cells > Common transformsand think about what “Trim whitespace” is likely to do.
- Cluster again.
- Try a new “Method” – more matches, some of which aren’t real overlaps.
- Re cluster for “City” and “City2”
- Open a Custom numeric facet...for the “Commissioned Date” and see whatvalue.toDate().datePart("year")does.
- Try splitting the zip code.
Check out the faceting documentation for more insights.
Take a look at the undo/redo history – you can create scripts to use over and over.
Keep Learning
If you want more practice with refine, you might find one of these tutorials helpful:
- School of Data uses Asian Development Bank work in Bangledesh, or Provincial Hospitals in Zimbabwe
- Enipedia uses data on Universities from Wikipedia
- Programming Historian uses museum collection metadata
- UC Berkeley has a data cleaning tutorial that just uses Excel.