{"id":69,"date":"2019-07-05T16:50:27","date_gmt":"2019-07-05T16:50:27","guid":{"rendered":"http:\/\/preritkumar.com\/?p=69"},"modified":"2019-07-07T19:13:57","modified_gmt":"2019-07-07T19:13:57","slug":"data-discussion","status":"publish","type":"post","link":"http:\/\/preritkumar.com\/index.php\/2019\/07\/05\/data-discussion\/","title":{"rendered":"Data Discussion"},"content":{"rendered":"\n<p>Of late, I have been learning about &#8216;data analysis&#8217; and &#8216;data visualization&#8217;. I have been working on datasets, using Python, pandas, geopandas, matplotlib, etc &#8230;<\/p>\n\n\n\n<p>Since I am a beginner, I am picking up smaller projects, which seem interesting and achievable. After I decide on a project, I go looking for data relating to that topic. After finishing a few projects, I am starting to realize some things about data. Here are some of my observations about data &#8230;<\/p>\n\n\n\n<ul><li><strong>Availability<\/strong> &#8211; Datasets are not usually easy to come by. The open data movement has gathered steam in several American and European countries, and a lot of agencies are publishing their datasets publicly. But here, the concept of open data is not that big. There are only a few sites which have open data.<\/li><li><strong>Relevance<\/strong> &#8211; Because I am choosing small projects which I find interesting, finding a relevant dataset for them is even more challenging. Sometimes I have ended up collecting my own sample dataset for projects.<\/li><li><strong>Recentness<\/strong> &#8211; Most of the open data sources here only have datasets from a few years back. Recent and up-to-date datasets are difficult to find.<\/li><li><strong>Quantity<\/strong> &#8211; The quantity of data is not sufficient in many cases. I mean, yes, the data is available, but maybe only for 1 week, whereas the minimum amount of data required for a coherent interpretation might be 1 year.
<\/li><li><strong>Quality<\/strong> &#8211; This is a big pain point. Let me explain this with an example:<ul><li>I downloaded &#8216;particulate matter&#8217; data from one of the government&#8217;s open data websites. Here are some quality-related problems I faced with this data:<ol><li>I found data available from 1987 to 2015, but each year was a separate file, and each file could only be downloaded by filling out a separate form. There was no consolidated file, or even a single zip file of all the years&#8217; files. Downloading them was a pain.<\/li><li>There were a lot of N\/A values, so the visualization had to be adjusted to account for them.<\/li><li>Some years&#8217; files were missing. Now if I am trying to plot a trend of the particulate matter over the years, the gap year kinda stands out.<\/li><li>The data columns in the files kept changing every few years (the columns changed in 2004, 2005, 2011, and 2014). Because of this, I could not read all the files together in a loop.<\/li><li>The format of the &#8216;Sampling Date&#8217; column kept changing every few years (&#8216;mmddyyyy&#8217;, &#8216;FullMonthName-Mmmyyyy&#8217;, &#8216;dd-mmm-yyyy&#8217;). Because of this, date operations were difficult.<\/li><\/ol><\/li><li>In fact, due to so many quality issues in the dataset, I have stalled working on it. Maybe someday, when I have the time and patience to deal with these, I will work on it again.<\/li><\/ul><\/li><li><strong>Visualization<\/strong> &#8211; The visualization can make or break the data analysis. Choosing the <s>right<\/s> <s>best<\/s> effective visualization is of utmost importance. Sometimes a visualization will show you information which would have stayed hidden if we had just looked at the data. Some graph would work for a certain dataset, and some other plot would work for some other dataset. Thankfully, I am not realizing this just now. This one is something which I have known for a while. 
While working in the industry, you learn this one very quickly. I still remember having an in-depth discussion with one of my friends (and an ex-team member) about the usage of a &#8216;bar chart&#8217; vs a &#8216;stacked bar chart&#8217; for a dataset. Fun times&#8230;<\/li><\/ul>\n\n\n\n<p><\/p>\n\n\n\n<p>I read somewhere that &#8216;Data is the new Oil&#8217;, and I kinda agree with this statement. <\/p>\n\n\n\n<ul><li>Oil is valuable, so is data<\/li><li>Oil can cause wars, so can data<\/li><li>Oil is difficult to find, so is data<\/li><li>Oil is difficult to refine, so is data<\/li><li>Oil spills can get one into a sticky situation (remember the oil spills), so can data (remember the data breaches in corporations)<\/li><li>Oil is useful, so is data&#8230;<\/li><\/ul>\n\n\n\n<p>My journey of &#8216;data analysis&#8217; and &#8216;data visualization&#8217; is ongoing, and I hope to work on more interesting and more complex datasets in the future. <\/p>\n\n\n\n<p>I also believe that this skillset will help me analyze data better in my personal and professional life, thus enabling me to make more informed decisions. 
<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>My learnings about data, while working  on &#8216;data analysis&#8217; and &#8216;data visualization&#8217;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[21,44,24,23,28,25,27,22,10],"_links":{"self":[{"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/posts\/69"}],"collection":[{"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/comments?post=69"}],"version-history":[{"count":9,"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/posts\/69\/revisions"}],"predecessor-version":[{"id":87,"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/posts\/69\/revisions\/87"}],"wp:attachment":[{"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/media?parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/categories?post=69"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/preritkumar.com\/index.php\/wp-json\/wp\/v2\/tags?post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}