It all started in the summer of 2021, when a group of high school students from across the U.S. attended the Wharton Data Science Academy, a then-remote Wharton Global Youth program led by statistics professor Linda Zhao.
For their final project, the group of five chose to collect and analyze data on gun ownership, resulting in data-driven research on “The Demographic Effects on Gun Ownership in New York City.”
The topic struck a chord with this group of socially minded youth. “Gun violence is a dangerous and alarming issue that has been growing exponentially over the past couple of years across the U.S., and especially in New York City,” says Gracia Chen, a senior at Miramonte High School in California who likes using mathematical methods to analyze society and human behavior. “We explored gun ownership in New York City to gain a clearer understanding of how different demographic indicators impact whether or not someone chooses to own firearms. We studied New York City because of the high concentration of gun violence in that region.”
In February 2022, the four women on the Data Science Academy team presented their findings during the virtual Women in Data Science (WiDS) @ Penn Conference, hosted by the Wharton School and Penn Engineering. You can watch their video presentation below!
Meanwhile, here are some highlights from their adventures in data wrangling:
📊 Prepping the Numbers. To prepare and clean the data for the project, the Data Science Academy team loaded two datasets: one on the number of handgun permits in New York City by zip code, and the other on the demographics of the city’s 59 community districts, filtered to keep only the most recent data, from 2019. The numbers were taken directly from the Citizens’ Committee for Children of New York and the New York Police Department. “During the first few days we definitely faced more challenges, like finding proper datasets, merging the datasets, and cleaning the data,” notes Anushka Acharya, a junior at Amity Regional High School in Connecticut. “However, as we continued on, we were able to overcome these challenges, and move quicker with the project due to our growing understanding of data science.”
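For readers curious what that filter-and-merge step can look like, here is a minimal sketch in Python. It is not the team's code (they mention working in R), and every district name, column name, and number below is made up for illustration. It also assumes the permit counts have already been rolled up from zip codes to community districts.

```python
# Hypothetical rows for illustration only; not the team's actual data.
demographics = [
    {"district": "District A", "year": 2018, "population": 118000, "homeowner_pct": 41.0},
    {"district": "District A", "year": 2019, "population": 120000, "homeowner_pct": 42.5},
    {"district": "District B", "year": 2019, "population": 200000, "homeowner_pct": 18.0},
    {"district": "District C", "year": 2019, "population": 150000, "homeowner_pct": 30.0},
]
permits = [
    {"district": "District A", "handgun_permits": 600},
    {"district": "District B", "handgun_permits": 300},
]


def keep_year(rows, year):
    """Keep only the rows recorded for the given year."""
    return [row for row in rows if row["year"] == year]


def merge_on_district(demo_rows, permit_rows):
    """Attach permit counts to demographic rows that share a district name.

    Returns the merged rows plus the districts that had no permit record,
    so that gaps in the data are visible instead of silently dropped.
    """
    permits_by_district = {r["district"]: r["handgun_permits"] for r in permit_rows}
    merged, unmatched = [], []
    for row in demo_rows:
        name = row["district"]
        if name in permits_by_district:
            merged.append({**row, "handgun_permits": permits_by_district[name]})
        else:
            unmatched.append(name)
    return merged, unmatched


latest = keep_year(demographics, 2019)
merged, unmatched = merge_on_district(latest, permits)
```

Returning the unmatched districts separately is a design choice in this sketch: it makes the cleaning problems the team describes easier to spot before any analysis begins.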
📊 Visualizing the Data. Next came some Exploratory Data Analysis, where the team created various graphic representations of the data to understand it better. These included a bubble map of the 59 community districts representing the number of handgun permits per 100,000 people, a bar graph showing the 10 community districts with the highest number of handgun permits, a bar graph of the 10 districts with the highest percentage of homeowners, and a scatter plot of handgun permits per 100,000 people versus the number of reported violent felonies for each community district.
“The bubble map was a challenge,” recalls Yulan Wang, a junior at Theodore Roosevelt High School in Ohio. “Due to the format of our gun permits data, we could not use a pre-made R-package to merge the coordinates (longitude and latitude) of the districts to make the map. We were on the verge of abandoning the bubble map idea until we came up with the solution to manually add in each of the 59 community district’s coordinates. I stayed on Zoom with Anushka while we divided the task of putting in the coordinates. I remember the excitement I felt when I shared the bubble map with my team members!”
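The two ideas behind the bubble map, a permit rate per 100,000 people and a hand-entered table of coordinates, can be sketched in a few lines of Python. This is an illustration under assumptions, not the team's code: the district names, counts, and coordinates are placeholders, and `bubble_rows` is a hypothetical helper name.

```python
# Placeholder coordinates typed in by hand, one entry per district.
# The values are illustrative, not real community district centers.
DISTRICT_COORDS = {
    "District A": (40.85, -73.85),
    "District B": (40.75, -73.98),
}

# Hypothetical permit counts and populations.
DISTRICTS = [
    {"district": "District A", "handgun_permits": 600, "population": 120000},
    {"district": "District B", "handgun_permits": 300, "population": 200000},
]


def permits_per_100k(permit_count, population):
    """Scale a raw permit count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return permit_count * 100_000 / population


def bubble_rows(districts, coords):
    """Build one row per district: where to draw the bubble and how big."""
    rows = []
    for d in districts:
        lat, lon = coords[d["district"]]  # KeyError flags a missing coordinate
        rate = permits_per_100k(d["handgun_permits"], d["population"])
        rows.append({"district": d["district"], "lat": lat, "lon": lon, "rate": rate})
    # Largest rates first, which also gives a "top 10"-style ranking.
    return sorted(rows, key=lambda r: r["rate"], reverse=True)


rows = bubble_rows(DISTRICTS, DISTRICT_COORDS)
```

Using a rate rather than a raw count keeps districts with very different populations comparable, which is why the map and the scatter plot are both expressed per 100,000 people.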
📊 Doing the Analysis. The team started its data analysis with lasso regression, a technique that shrinks a model’s coefficients toward zero and sets some of them exactly to zero, which makes it useful for choosing which variables to keep. “We used the lasso function to select the most significant variables in our data,” notes Joy An, a junior at Choate Rosemary Hall in Connecticut who appreciated applying her computational skills to analyze real-world problems. “Next, we created a linear regression model using the variables that were selected by lasso, and finally we used the Anova function to generate a significance value for each variable and perform backward selection to eliminate insignificant variables.” Anushka recalls what she describes as a memorable outcome. “My favorite insight from the data was the results of the lasso regression. We found that a person’s job and home status have a significant impact on whether or not they own a gun.”
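To make the variable-selection idea concrete, here is a small from-scratch sketch of lasso using coordinate descent with soft-thresholding, written in plain Python. It is not the function the team used. The data is synthetic and the variable names (`homeowner_pct`, `unrelated`) are assumptions chosen for illustration. The sketch also assumes no column is constant, since standardizing divides by the standard deviation.

```python
def soft_threshold(value, penalty):
    """Shrink a value toward zero; anything within the penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def standardize(column):
    """Center a column and scale it to unit variance (assumes it is not constant)."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def lasso(columns, y, penalty, n_iter=100):
    """Minimize (1/2n) * squared error + penalty * sum(|coefficients|)."""
    n = len(y)
    y_mean = sum(y) / n
    X = [standardize(col) for col in columns]
    beta = [0.0] * len(X)
    residual = [v - y_mean for v in y]  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j, xj in enumerate(X):
            # Fit of column j to the residual with its own contribution added back.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(xj, residual)) / n
            new_beta = soft_threshold(rho, penalty)
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(xj, residual)]
                beta[j] = new_beta
    return beta


# Synthetic example: the outcome depends only on the first variable.
names = ["homeowner_pct", "unrelated"]
homeowner_pct = [1, 2, 3, 4, 5, 6, 7, 8]
unrelated = [1, -1, -1, 1, 1, -1, -1, 1]
permit_rate = [3 * v for v in homeowner_pct]

beta = lasso([homeowner_pct, unrelated], permit_rate, penalty=0.5)
selected = [name for name, b in zip(names, beta) if b != 0.0]
```

In this toy setup the coefficient of the unrelated variable is driven to exactly zero, so only `homeowner_pct` survives into `selected`. That is the behavior the team relied on before fitting an ordinary linear regression on the surviving variables.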
They rounded off the analysis using random forest, an algorithm that builds many decision trees on random samples of the data and combines their outputs, which typically gives more accurate predictions than any single tree. For this, Yulan spent hours sitting at a coffee shop reading an academic paper to understand how random forest works. “It’s easy to run a line of code and apply random forest to the data, but understanding the mechanism behind it is much more difficult,” she notes. “It also made me realize all the different ways data can be analyzed by applying various machine learning methods. When people think of data science, most would think it’s boring doing the same tests over different datasets. But data science can involve so much creativity.”
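For a sense of the mechanism Yulan was reading about, here is a deliberately tiny random forest for regression in plain Python. It is a sketch under assumptions, not the team's code or any library's implementation. Each tree is trained on a bootstrap sample (rows drawn with replacement), each split considers only a random subset of features, and the forest averages the trees' predictions. The data and parameter names are made up for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared differences from the mean (lower means purer)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, rng, n_try, depth=0, max_depth=3, min_size=2):
    """Grow one regression tree; leaves are plain numbers, splits are dicts."""
    if depth >= max_depth or len(rows) < 2 * min_size or len(set(targets)) == 1:
        return mean(targets)
    best = None
    # Only a random subset of features is considered at each split.
    for f in rng.sample(range(len(rows[0])), n_try):
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    if best is None:
        return mean(targets)
    _, f, threshold = best
    lo = [i for i, r in enumerate(rows) if r[f] <= threshold]
    hi = [i for i, r in enumerate(rows) if r[f] > threshold]
    grow = lambda idx: build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                  rng, n_try, depth + 1, max_depth, min_size)
    return {"feature": f, "threshold": threshold, "left": grow(lo), "right": grow(hi)}


def predict_tree(node, row):
    while isinstance(node, dict):
        side = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node


def fit_forest(rows, targets, n_trees=25, n_try=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap: draw with replacement
        forest.append(build_tree([rows[i] for i in sample],
                                 [targets[i] for i in sample], rng, n_try))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Synthetic data: the target depends on the first feature only.
rows = [[x, x % 2] for x in range(1, 13)]
targets = [10 if x <= 6 else 20 for x in range(1, 13)]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [2, 0])
high = predict_forest(forest, [11, 0])
```

Here `low` comes out smaller than `high`, reflecting the step in the synthetic target. Individual trees may disagree because each sees a different sample and different features, and averaging them is what steadies the final prediction.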
‘Data Science Can Help Us Better Our World’
The team’s data wrangling and analysis yielded some unique observations about the effects of demographics on gun ownership in New York City. For example, the community districts with the highest concentration of handgun permits are in Staten Island and the Bronx, in areas with relatively low population density. The most densely populated New York City borough, Manhattan, does not appear on the top 10 list.
“It surprised all of us to see that the number of gun permits in the most populated district, Manhattan, was not more than the least populated districts like Throgs Neck in [the Bronx], a family-oriented area with more homeowners. It’s interesting to think about the causes behind this: the age group of the people living there and city apartment restrictions on guns,” says Yulan.
The project also underscored the importance of identifying biases in data. For instance, gun permits do not accurately reflect gun ownership, since some people own guns illegally. Also, this particular study focused only on handguns. Ownership of other types of guns may be shaped by different demographic factors.
This team of aspiring data scientists has a message for other teens about the power of data.
“Data science can help us better our world by studying issues that need to be addressed, like gun ownership and gun violence,” says Yulan, who is currently using Natural Language Processing methods to research Twitter public discourse on COVID vaccines as a way to understand conversations about the vaccine and provide information to public health policymakers. She hopes to pursue a career in cognitive science. “No matter what your interests are — whether that’s business, health care, law, psychology, or literature — data science can be utilized and implemented into your studies. Data science can reveal trends and insights that we can all use to make better decisions.”
Gracia would encourage high school students to take on a bigger data science project and explore their skills. “Personally, this gun ownership project was a fun experience for me,” she says, adding that this was her first time analyzing data on this scale. “I think that the group effort made it manageable, exciting, and created a more in-depth analysis than I could have done on my own.”
Anushka, who lately spends at least part of her days analyzing data to identify relationships between gene expression and chromatin accessibility, shares one final personal takeaway from her Data Science Academy experience: “Data science is an exciting field that anyone can and should be a part of.”