Joshua Jorgensen

Senior Consultant

The importance of equity in the data science community

One of the most important things we can do is help our communities, and it is worth remembering that data science has a community of its own. One of the most pressing topics today is ensuring equity for all. As a society, it is crucial that everyone has an equal opportunity to learn and explore; however, it is easy to forget that data is not always accessible to everyone, nor are the technologies and methods needed to analyze it. Companies collect enormous amounts of data and use it to find market trends, drive business decisions, and support many other applications. Some of that data holds key information for the company that collected it; some does not. With that in mind, I'd like to ask: what would happen if we started making some of this data available to the public and openly discussing how it can best be used and analyzed?

Why should we make data and code public?

We have seen the impact of open source packages and data on the data science community. An open source package is software released to the public under a license that grants any user the right to study, change, and distribute the software and its source code to anyone for any purpose. Open source packages have empowered individuals to create advanced models and drive innovation. They have also allowed the community to give back by contributing new ideas, functionality, and methods. By publicly sharing our packages and code, we encourage and empower others to explore new possibilities, pick up where we left off, and offer innovations we didn't know were possible.

Moreover, combining this practice with publicly available data sources shows how much people can achieve when they work together to create accurate and innovative models and methods for data analytics. We've seen this in practice when hackathons use AI and machine learning to solve real-world problems, yielding impressive models that have provided a great deal of assistance.

Rest assured, I'm not advocating for companies to give away key insights or trade secrets openly. But by opening up more sets of data, packages, and code to the public, we can expect new and innovative ideas to be created and faster evolution of our data science community by allowing those interested to learn and grow with us. We will not only be helping our community but driving exploratory research for real-world problems that can have a positive impact on the world.

I posted my code and data; that's enough, right?

While getting our data and resources out to the public is a commendable first step, it is not enough, because posting data and code alone doesn't truly embrace equity. For example, we all remember the first time we worked on a data project and had questions: What can I do with the data? How do I clean the data? What model should I build? When you are just beginning, this quickly becomes daunting. Simply posting our code does not explain its inner workings or show that it is less daunting than it looks. Nor does it address the fact that the work may have required a tremendous amount of computing resources that are not accessible to everyone.

While I do not know how to address all of this thoroughly, I believe we can start tackling this monster by helping individuals understand the work and giving them the confidence to try it themselves. If everyone wrote small papers explaining their methodology and code, and shared them publicly, we could encourage and empower others to try. We often take for granted the ideas and concepts about data manipulation that we learned from others. We can nurture and grow our data science community by paying this forward. Many people see topics such as machine learning as scary and daunting. By putting more explanations out there, we can begin to demystify these topics and encourage high school students, college students, and anyone else who wants to learn data science, young or old, to try their hand at it. While this may be a small step for each individual, it can have a significant impact.

The takeaway? Make data publicly available, share code and explain your thinking.

Making sure that data science is equitable for everyone is not easy, and I don't have all the answers. However, one small step we can all take that will have a huge impact on the data science community and the world is to make more data publicly available, share our code and packages, and explain our thinking. By making these available to the community, we can start truly harnessing the power of 1 + 1 = 3, or in this case, 1,000,000 + 1,000,000 = 3,000,000. By fostering creative thinking and new ideas, we can see innovations that we couldn't have thought of ourselves, and that will benefit both humankind and our own projects.

I genuinely believe that "one random act of kindness will spark another," so I encourage anyone in the data field to pick up their keyboard and write a small paper explaining something they find interesting or exciting. This action may encourage the next generation of data scientists to try it themselves, thus giving back to our ever-growing community.

Learn about our methodology for designing and implementing data-driven insights with CGI Data2Diamonds.

About this author

Joshua Jorgensen

Senior Consultant

Joshua Jorgensen is an innovative data science professional with ten years of experience. Joshua has worked in the healthcare, banking and real estate mass appraisal industries, pioneering new and innovative ways to analyze ...