
Data Discussion

Of late, I have been learning about ‘data analysis’ and ‘data visualization’. I have been working on datasets using Python, pandas, geopandas, matplotlib, etc.

Since I am a beginner, I am picking up smaller projects that seem interesting and achievable. After I decide on a project, I go looking for data relating to that topic. After finishing a few projects, I am starting to realize some things about data. Here are some of my observations:

  • Availability – Datasets are usually not easy to find. The open data movement has picked up steam in several American and European countries, and a lot of agencies are publishing their datasets publicly. But here, the concept of open data is not that big. There are only a few sites which have open data.
  • Relevance – Because I am choosing small projects which I find interesting, finding a relevant dataset for them is even more challenging. Sometimes I have ended up collecting my own sample dataset for a project.
  • Recency – Most of the open data sources here only have datasets from a few years back. Recent, up-to-date datasets are difficult to find.
  • Quantity – The quantity of data is not sufficient in many cases. I mean, yes, the data is available, but maybe only for 1 week, whereas a coherent interpretation might need at least 1 year of data.
  • Quality – This is a big pain point. Let me explain this with an example:
    • I downloaded ‘particulate matter’ data from one of the government’s open data websites. Here are some quality-related problems I faced with this data:
      1. I found data available from 1987 to 2015, but each year was a separate file, and each file could only be downloaded by filling in a separate form. There was no consolidated file, or even a single zip file of all the years’ files. Downloading them was a pain.
      2. There were a lot of N/A values, so the visualization had to be adjusted to account for them.
      3. The files for some years were missing. Now if I am trying to plot a trend of the particulate matter over the years, the gap years kinda stand out.
      4. The data columns in the files kept changing every few years (the columns changed in 2004, 2005, 2011, and 2014). Because of this, I could not read all the files in a single loop.
      5. The format of the ‘Sampling Date’ column kept changing every few years (‘mmddyyyy’, ‘FullMonthName-Mmmyyyy’, ‘dd-mmm-yyyy’). Because of this, date operations were difficult.
    • In fact, due to so many quality issues in the dataset, I have put it on hold. Maybe someday, when I have the time and patience to deal with these, I will work on it again.
  • Visualization – The visualization can make or break the data analysis. Choosing the most effective visualization is of utmost importance. Sometimes a visualization will show you information which would have stayed hidden if you just looked at the raw data. A chart that works for one dataset may not work for another. Thankfully, I am not realizing this just now. This one is something which I have known for a while. While working in the industry, you learn this one very quickly. I still remember having an in-depth discussion with one of my friends (and an ex-teammate) about the usage of a ‘bar chart’ vs a ‘stacked bar chart’ for a dataset. Fun times…
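For the ‘Quality’ example above (changing columns, changing date formats, N/A values), one possible way to cope is to normalize every row before combining the files. Below is a minimal sketch in plain Python, not the approach actually used on that dataset. The column names in `COLUMN_ALIASES`, the `PM10` field, and the list of date formats are all made up for illustration, since the real headers are not shown here; `parse_date` and `normalize_row` are hypothetical helper names.

```python
from datetime import datetime

# Hypothetical header variants mapped to one canonical name.
# The real column names in the yearly files are an assumption here.
COLUMN_ALIASES = {
    "Sampling Date": "sampling_date",
    "Date of Sampling": "sampling_date",
    "PM10": "pm10",
    "RSPM/PM10": "pm10",
}

# Candidate date formats, tried in order (assumed, adjust to the actual files).
# Note: %b and %B match English month names only in an English/C locale.
DATE_FORMATS = ("%m%d%Y", "%d-%b-%Y", "%B-%d-%Y")

# Strings that should be treated as a missing value.
NA_MARKERS = {"", "NA", "N/A"}


def parse_date(text):
    """Return a date for the first format that matches, or None if none do."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_row(row):
    """Rename known columns, parse the date, and turn N/A markers into None."""
    clean = {}
    for name, value in row.items():
        key = COLUMN_ALIASES.get(name.strip())
        if key is None:
            continue  # unknown column: skipped in this sketch
        value = value.strip()
        if key == "sampling_date":
            clean[key] = parse_date(value)
        elif value.upper() in NA_MARKERS:
            clean[key] = None
        else:
            clean[key] = float(value)
    return clean


# Two rows as they might appear in files from different years.
rows = [
    {"Sampling Date": "01151999", "PM10": "120"},
    {"Date of Sampling": "15-Jan-2006", "RSPM/PM10": "N/A"},
]
normalized = [normalize_row(r) for r in rows]
```

Once every row has the same keys and real date objects, the yearly files can be read in one loop and combined, and missing values show up as `None` instead of text. With pandas, the same idea would be to rename the columns of each file to the canonical names before concatenating.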

I read somewhere that ‘Data is the new Oil’, and I kinda agree with this statement.

  • Oil is valuable, so is data
  • Oil can cause wars, so can data
  • Oil is difficult to find, so is data
  • Oil is difficult to refine, so is data
  • Oil spills can get one into a sticky situation, so can data (remember the data breaches in corporations)
  • Oil is useful, so is data…

My journey of ‘data analysis’ and ‘data visualization’ is ongoing, and I hope to work on more interesting and more complex datasets in the future.

I also believe that this skillset will help me analyze data better in my personal and professional life, and so make more informed decisions.
