Author: Chaojie Gong CMSC320 Final Tutorial
Many people believe that universities in the United States enjoy a strong reputation worldwide, so pursuing a degree in the U.S. is often their first choice.
In fact, with over 4,000 colleges and universities, the United States has more institutions of higher learning than any other country in the world. Many of them are highly ranked, offering top-notch educational programs, opportunities for hands-on learning, and cutting-edge research at the graduate and undergraduate levels. Many professors at U.S. institutions have terminal degrees in their field of expertise, are internationally recognized for their scholarship, and represent a diversity of ethnicities and cultural backgrounds. Besides, a significant number of the teaching staff have traveled or lived abroad, which contributes to an enriched classroom experience. Moreover, graduates from a U.S. university or college often find enormous success in the international job market. Employers recognize the value of such an education and the unique skills and qualities that these graduates possess. In short, a degree from a U.S. institution opens doors and is recognized around the world.
The QS World University Rankings include 150 universities from the United States, the top international study destination. More than 1.18 million international students were studying in the US in 2017, 77% of them from Asia. According to the Institute of International Education’s Open Doors report, the most popular fields of study are Business and Management, Computer Science, Engineering, and Mathematics. The most popular study destinations for students are New York, Texas, and California.
The main highlight of US universities is their focus on research-oriented learning. Researchers are always at the forefront and always looking to develop something new. Innovation and creativity remain at the core of their educational philosophy. In the US, regular testing, homework, and classroom participation are mandatory for getting a good result. Students are encouraged to discuss issues and contribute ideas.
In this project, I analyze world university ranking data to better understand why U.S. universities have such a good reputation, how top universities in other countries perform, and how top universities are distributed around the world.
College and university rankings order institutions of higher education on the basis of various combinations of factors. None of the rankings gives a comprehensive overview of the strengths of the institutions ranked, because all select a range of easily quantifiable characteristics to base their results on. Rankings have most often been conducted by magazines, newspapers, websites, governments, or academics. In addition to ranking entire institutions, organizations perform rankings of specific programs, departments, and schools. Various rankings consider combinations of measures of funding and endowment, research excellence and influence, specialization expertise, admissions, student options, award numbers, internationalization, graduate employment, industrial linkage, historical reputation, and other criteria. Some rankings mostly evaluate institutional output by research. Some rankings evaluate institutions within a single country, while others assess institutions worldwide.
Every published ranking uses multiple factors. Some factors are arguably less causative and/or less correlated than others. Some rankings rely on publicly available data, while others give weight to surveys and/or comments from students, parents, and admission staff.
Errors or misreporting can happen, which may affect results. A recent book indicates that true shifts in the top 25-30 schools would require significant funds over time, and thus are unlikely to occur. Easily gathered data may not be the most valuable. Arbitrary weighting of specific factors may also skew results.
Here, I mainly use data from the Center for World University Rankings (CWUR). CWUR publishes the only global university ranking that measures the quality of education and training of students as well as the prestige of the faculty members and the quality of their research without relying on surveys and university data submissions.
CWUR uses seven objective and robust indicators to rank the world's universities:
1) Quality of Education, measured by the number of a university's alumni who have won major academic distinctions relative to the university's size (25%)
2) Alumni Employment, measured by the number of a university's alumni who have held top executive positions at the world's largest companies relative to the university's size (25%)
3) Quality of Faculty, measured by the number of faculty members who have won major academic distinctions (10%)
4) Research Performance:
   i) Research Output, measured by the total number of research papers (10%)
   ii) High-Quality Publications, measured by the number of research papers appearing in top-tier journals (10%)
   iii) Influence, measured by the number of research papers appearing in highly-influential journals (10%)
   iv) Citations, measured by the number of highly-cited research papers (10%)
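To make the weighting concrete, here is a minimal sketch of how seven indicator scores could be combined using the weights listed above. The `weighted_score` helper and the indicator values are hypothetical, and CWUR's actual scoring procedure may normalize the indicators differently, so this only illustrates the arithmetic of a weighted sum.

```python
# Weights in percent, as listed above
CWUR_WEIGHTS = {
    "Quality of Education": 25,
    "Alumni Employment": 25,
    "Quality of Faculty": 10,
    "Research Output": 10,
    "High-Quality Publications": 10,
    "Influence": 10,
    "Citations": 10,
}


def weighted_score(indicators):
    """Combine indicator scores (assumed to be on a 0-100 scale) into one weighted score."""
    return sum(CWUR_WEIGHTS[name] * indicators[name] for name in CWUR_WEIGHTS) / 100


# Hypothetical indicator scores for an imaginary university (not real CWUR data)
example = {
    "Quality of Education": 80,
    "Alumni Employment": 60,
    "Quality of Faculty": 90,
    "Research Output": 70,
    "High-Quality Publications": 70,
    "Influence": 70,
    "Citations": 70,
}

print(weighted_score(example))
```

Because education and employment each carry 25%, a change in either of them moves the combined score two and a half times as much as the same change in any single research indicator.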
! pip3 install lxml
import pandas as pd
url = "https://cwur.org/2020-21.php"
tables = pd.read_html(url)
df = tables[0]
df
# I will focus on the top 10 countries that have the most universities posted on the ranking list
df_Country = df["Location"].value_counts()
df_Country.head(10)
From the table, we can clearly see the number of top universities located in each country. Not surprisingly, the U.S. accounts for a large proportion of the list, leading second-placed China by almost 100 universities. The table also gives me a better sense of how much weight the other countries carry.
# Calculate the mean score for each of the top 10 countries
top10 = df_Country.head(10)
country_means = df.groupby("Location")["Score"].mean().loc[top10.index]
# Attach each top-10 country's mean score to its universities (other countries get NaN)
df["Mean"] = df["Location"].map(country_means)
df
df.rename(columns={'Location':'Country'}, inplace=True)
df.head(10)
# Combine the number of universities and the mean score for each country
df_1 = pd.DataFrame({'Country': top10.index,
                     'Number of University on the list': top10.values,
                     'Mean Score': country_means.values})
df_1.sort_values("Mean Score", inplace = True)
df_1
However, quantity does not directly reflect quality. After counting how many universities each country has, I calculate each country's average score by summing the scores of its universities and dividing by the number of its universities, and I add the result to the table. After sorting by mean score, I have a more complete picture of which countries have the strongest universities on average.
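The count-then-average step described above can also be written as a single grouped aggregation. The sketch below uses a small made-up table (the country labels and scores are hypothetical, not CWUR data) to show the idea with `groupby` and `agg`.

```python
import pandas as pd

# Hypothetical toy ranking table (not real CWUR data)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Score": [90.0, 80.0, 70.0, 85.0, 95.0, 60.0],
})

# Count the universities and average their scores per country in one pass
summary = (
    toy.groupby("Country")["Score"]
    .agg(["count", "mean"])
    .rename(columns={"count": "Number of Universities", "mean": "Mean Score"})
    .sort_values("Mean Score")
)
print(summary)
```

Country A has the most universities in this toy table, yet country B ends up with the higher mean score, which is exactly the quantity-versus-quality distinction made above.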
# Add an external country average income resource to help with the analysis
url = "https://www.worlddata.info/average-income.php"
table_income = pd.read_html(url)
df_2 = table_income[0]
df_2["Country"] = df_2["Country"].replace(["United States"], "USA")
df_2.head()
# Use inner-join to combine two tables for better visualization
df_income = pd.merge(df_1, df_2, on ='Country', how ='inner')
df_income.rename(columns={'Average incomeannually':'Country Average Income Annually ($)'}, inplace=True)
df_income.pop("Monthly")
df_income.pop("Rank")
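# Strip the dollar sign and thousands separators, then convert to integers
# (regex=False makes sure "$" is treated as a literal character, not as a regex anchor)
df_income['Country Average Income Annually ($)'] = df_income['Country Average Income Annually ($)'].str.replace('$', '', regex=False)
df_income['Country Average Income Annually ($)'] = df_income['Country Average Income Annually ($)'].str.replace(',', '', regex=False)
df_income['Country Average Income Annually ($)'] = df_income['Country Average Income Annually ($)'].astype(int)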
df_income.sort_values("Country Average Income Annually ($)", inplace = True)
df_income
The earnings benefit of schooling matters for a variety of social issues, including economic and social policy, racial and ethnic discrimination, gender discrimination, income distribution, and the determinants of the demand for education. The link between education and earnings is formally made in the calculation of the rate of return to investment in education.
I suspect there is a relationship between a country's education level and its people's income. Therefore, I used an external source of average income by country and added it to the table as a variable.
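One simple way to put a number on such a relationship is the Pearson correlation coefficient. The sketch below uses made-up values for five imaginary countries (they are not taken from the tables in this tutorial) and only shows how the coefficient could be computed with NumPy.

```python
import numpy as np

# Hypothetical values for five imaginary countries (not real data)
mean_score = np.array([70.0, 72.0, 75.0, 78.0, 80.0])
average_income = np.array([10000.0, 25000.0, 30000.0, 42000.0, 50000.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(mean_score, average_income)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. With only ten countries in the real table, any such coefficient should be read with caution.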
# Make a scatter plot to visualize the data through the chart
import matplotlib.pyplot as plt
import numpy as np
x = df_income['Number of University on the list']
y = df_income['Country Average Income Annually ($)']
plt.xlabel("Number of University on the list")
plt.ylabel("Country Average Annual Income")
plt.title("Country Top University Number vs Country Income Level")
plt.plot(x, y, 'o')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b)
plt.grid()
plt.show()
In the scatter plot with its line of best fit, three points stand out as outliers: India, China, and Brazil. China has the second-largest number of top universities in the world, yet its average income level does not match. Notably, all three outliers are developing countries, which may explain why they sit far from the line of best fit.
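Rather than judging outliers by eye, the distance of each point from the line of best fit can be computed as a residual. This is a minimal sketch on made-up points (the arrays are hypothetical, with the last point deliberately placed far from the trend), using the same `np.polyfit` call as the plot above.

```python
import numpy as np

# Hypothetical (x, y) points; the last one is deliberately far from the trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.0, 10.1, 2.0])

# Fit a straight line and measure how far each point lies from it
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# Index of the point with the largest absolute residual
farthest = int(np.argmax(np.abs(residuals)))
print(farthest, residuals.round(2))
```

Note that a single extreme point also pulls the fitted line toward itself, so the residuals of the remaining points grow as well. With only ten countries, this effect can be large.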
# Compare the data with previous year's data
url = "https://cwur.org/2019-20.php"
tables1 = pd.read_html(url)
df_previous = tables1[0]
df.rename(columns={'Score':'Score 2020-2021'}, inplace=True)
df_previous.rename(columns={'Score':'Score 2019-2020'}, inplace=True)
df_previous
# Drop the columns that are not needed for the comparison
df_previous.drop(columns=["World Rank", "Location", "National Rank", "Quality\xa0of Education",
                          "Alumni Employment", "Quality\xa0of Faculty", "Research Performance"],
                 inplace=True)
# Use inner-join to combine two tables for better visualization
df_new = pd.merge(df, df_previous, on ='Institution', how ='inner')
df_new
# Compute each university's rating change
df_new["Floating Ratio"] = (df_new["Score 2020-2021"] - df_new["Score 2019-2020"])/df_new["Score 2020-2021"] * 100
df_new
# Use boxplot to visualize the data
boxplot = df_new.boxplot(column=['Floating Ratio'])
The boxplot shows how each university's score changed relative to 2019-2020. There are a few outliers, but not many; these are universities whose ranking score changed substantially. With a condition on the table such as the one below, we can easily list them.
# Filter for universities whose score rose by more than 5%
df_new.loc[df_new["Floating Ratio"] > 5]
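The fixed threshold of 5 above is one way to pick out large changes. As an alternative sketch, the same 1.5 × IQR rule that matplotlib's boxplot uses by default to draw its whiskers can be applied directly, which flags unusual values in both directions. The `changes` values below are hypothetical, not real CWUR data.

```python
import pandas as pd

# Hypothetical percentage changes (not real CWUR data)
changes = pd.Series([-1.0, -0.5, 0.0, 0.2, 0.4, 0.5, 1.0, 6.0])

# Interquartile range and the usual 1.5 * IQR fences
q1 = changes.quantile(0.25)
q3 = changes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the points a boxplot would draw as outliers
outliers = changes[(changes < lower) | (changes > upper)]
print(outliers)
```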
# Add a map to help with better data visualization
! pip install pycountry-convert
from pycountry_convert import country_alpha2_to_continent_code, country_name_to_country_alpha2
def get_continent(col):
    cn_a2_code = country_name_to_country_alpha2(col)
    cn_continent = country_alpha2_to_continent_code(cn_a2_code)
    return (cn_a2_code, cn_continent)
# Add the country code and continent code according to the country name
import pandas as pd
pd.options.mode.chained_assignment = None
df["Country Code"] = "Unknown"
df["Continent Code"] = "Unknown"
count = 0
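while count < df.shape[0]:
    # Look up the ISO alpha-2 country code, then the continent code, for this row
    cn_a2_code = country_name_to_country_alpha2(df.loc[count, "Country"])
    cn_continent = country_alpha2_to_continent_code(cn_a2_code)
    # Assign with .loc so the values are written to df itself, not to a temporary copy
    df.loc[count, "Country Code"] = cn_a2_code
    df.loc[count, "Continent Code"] = cn_continent
    count += 1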
df
# Combine the country code and continent code into one tuple per row for further processing
df["Code"] = list(zip(df["Country Code"], df["Continent Code"]))
df.pop("National Rank")
df.pop("Alumni Employment")
df.pop("Quality\xa0of Education")
df.pop("Quality\xa0of Faculty")
df.pop("Research Performance")
df
# Get the latitude and longitude upon given country
! pip install geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent = "--")
def geolocate_latitude(country):
    loc = geolocator.geocode(country)
    return loc.latitude
def geolocate_longitude(country):
    loc = geolocator.geocode(country)
    return loc.longitude
# Add the latitude and longitude to the dataframe
# Start with float columns so the geocoded coordinates are stored without a dtype conflict
df["Latitude"] = 0.0
df["Longitude"] = 0.0
# From here I will add the map feature for the top 20 universities only, since the whole ranking list is very large
for p in range(0, 20):
    df.loc[p, "Latitude"] = geolocate_latitude(df.loc[p, "Country Code"])
    df.loc[p, "Longitude"] = geolocate_longitude(df.loc[p, "Country Code"])
df.head(20)
# Combine the latitude and longitude into a single coordinate pair for the top 20 rows
df["Geolocate"] = pd.Series("Unknown", index=df.index, dtype=object)
for r in range(20):
    df.at[r, "Geolocate"] = (df.at[r, "Latitude"], df.at[r, "Longitude"])
df.head(20)
# Keep only the top 20 rows of the table
df = df.head(20).copy()
df
# Visualize the university data on the map
! pip install folium
import folium
from folium.plugins import MarkerCluster
world_map= folium.Map(tiles="cartodbpositron")
marker_cluster = MarkerCluster().add_to(world_map)
for i in range(len(df)):
    lat = df.iloc[i]['Latitude']
    long = df.iloc[i]['Longitude']
    radius = 10
    popup_text = """Country : {}<br>
    Institution : {}<br>"""
    popup_text = popup_text.format(df.iloc[i]['Country'], df.iloc[i]['Institution'])
    folium.CircleMarker(location = [lat, long], radius = radius, popup = popup_text, fill = True).add_to(marker_cluster)
world_map
Finally, I integrate a map with the data for better visualization, just like in project 4. I take the top 20 universities from the list and put their locations on the map. North America has 16 universities, all of them from the United States. The United Kingdom has three universities in the top 20, and Japan has one.
Overall, I applied what I learned in CMSC320 and benefited a lot from this final project. Because the project is an open topic, I could choose a subject I am interested in without restrictions. In building this assignment, I retained more knowledge, not only by figuring out how to approach each problem, but also by practicing how to start from scratch with no set direction.