Of late, I have been learning about ‘data analysis’ and ‘data visualization’. I have been working on datasets using Python, pandas, geopandas, matplotlib, etc.
Since I am a beginner, I am picking up smaller projects which seem interesting and achievable. After I decide on a project, I go looking for data relating to that topic. After finishing a few projects, I am starting to realize some things about data. Here are some of my observations about data …
- Availability – Datasets are usually not easy to find. The open data movement has picked up steam in several American and European countries, and a lot of agencies are publishing their datasets publicly. But here, the concept of open data is not that big. There are only a few sites which have open data.
- Relevance – Because I am choosing small projects which I find interesting, finding a relevant dataset for them is even more challenging. Sometimes I have ended up collecting my own sample dataset for a project.
- Recentness – Most of the open data sources here only have datasets from a few years back. Recent and up-to-date datasets are difficult to find.
- Quantity – The quantity of data is not sufficient in many cases. I mean, yes the data is available, but maybe only for 1 week, whereas the minimum amount of data required for a coherent interpretation might be for 1 year.
- Quality – This is a big pain point. Let me explain this with an example:
- I downloaded ‘particulate matter’ data from one of the government’s open data websites. Here are some of the quality-related problems I faced with this data:
- I found data available from 1987 to 2015, but each year was a separate file, and each file could only be downloaded by filling a separate form. There was no consolidated file, or even a single zip file of all the years’ files. Downloading them was a pain.
- There were a lot of N/A values, so the visualization had to be adjusted to account for them.
- Some years’ files were missing. Now if I am trying to plot a trend of the particulate matter over the years, the gap year kinda stands out.
- The data columns in the file kept changing every few years (the columns changed in 2004, 2005, 2011, and 2014). Because of this I could not read all the files together in a loop.
- The format of the ‘Sampling Date’ column kept changing every few years (‘mmddyyyy’, ‘FullMonthName-Mmmyyyy’, ‘dd-mmm-yyyy’). Because of this, date operations were difficult.
- In fact, due to so many quality issues in the dataset, I have stalled working on it. Maybe someday, when I have the time and patience to deal with these, I will work on it again.
- Visualization – The visualization can make or break the data analysis. Choosing an effective visualization is of utmost importance. Sometimes a visualization will show you information which would have stayed hidden if we had just looked at the raw data. One graph will work for a certain dataset, and some other plot will work for another dataset. Thankfully, I am not realizing this just now. This one is something which I have known for a while. While working in the industry, you learn this one very quickly. I still remember having an in-depth discussion with one of my friends (and an ex-teammate) about the usage of a ‘bar chart’ vs a ‘stacked bar chart’ for a dataset. Fun times…
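To make the quality problems above a bit more concrete, here is a minimal sketch of how the changing columns, the changing date formats, and the N/A values could be handled before reading all the yearly files in one loop. It uses only the Python standard library. The column names in `COLUMN_ALIASES`, the sample rows, and the two `strptime` patterns are assumptions made for illustration, not the actual contents of the government files, and the third date format mentioned above is left out because its exact pattern is unclear.

```python
from datetime import datetime, date

# Assumed strptime patterns for 'mmddyyyy' and 'dd-mmm-yyyy'.
# The real files may need different or additional patterns.
# Note: %b matches English month abbreviations only in an English/C locale.
DATE_FORMATS = ["%m%d%Y", "%d-%b-%Y"]

# Hypothetical header variants, mapped to one canonical column name each.
COLUMN_ALIASES = {
    "Sampling Date": "sampling_date",
    "Date of Sampling": "sampling_date",
    "PM10": "pm10",
    "RSPM/PM10": "pm10",
}


def parse_sampling_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_row(row):
    """Rename known columns, parse the date, and turn N/A readings into None."""
    out = {COLUMN_ALIASES[k]: v for k, v in row.items() if k in COLUMN_ALIASES}
    out["sampling_date"] = parse_sampling_date(out.get("sampling_date", ""))
    value = out.get("pm10", "").strip()
    out["pm10"] = None if value.upper() in ("", "NA", "N/A") else float(value)
    return out


# Made-up sample rows standing in for two files with different layouts.
rows = [
    {"Sampling Date": "01152003", "PM10": "112"},
    {"Date of Sampling": "15-Jan-2012", "RSPM/PM10": "N/A"},
]
cleaned = [normalize_row(r) for r in rows]
```

Once every row has the same column names and real date objects, the files can be combined in a single loop, and the `None` readings can be skipped or shown as gaps when plotting. pandas offers similar tools (for example `rename` and `to_datetime`), so the same idea should carry over to a DataFrame-based workflow.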
I read somewhere that ‘Data is the new Oil’, and I kinda agree with this statement.
- Oil is valuable, so is data
- Oil can cause wars, so can data
- Oil is difficult to find, so is data
- Oil is difficult to refine, so is data
- Oil spills can get one into a sticky situation (remember the oil spills), so can data (remember the data breaches in corporations)
- Oil is useful, so is data…
My journey of ‘data analysis’ and ‘data visualization’ is ongoing, and I hope to work on more interesting and more complex datasets in the future.
I also believe that this skillset will help me in analyzing data better in personal and professional life, thus enabling me to make more informed decisions.