Equating SQL and Pandas (Part-3)
2015-02This post is the third part in a four part series covering practice exercises to learn pandas (by comparing it with SQL). The first post and second post cover different exercises.
Here, we try another assignment from coursera introduction to databases course.
SQL Socail-Network Query Exercises (core set)
In this part we'll use a database called 'social'. I downloaded it from the Introduction to databases course in coursera. The database has three tables ( 'Highschooler', 'Friend', 'Likes' ). The schema is shown below.
Highschooler table | ID | name | grade |
Friend table | ID1 | ID2 |
Likes table | ID1 | ID2 |
In [13]: import pandas
import mysql.connector
# set up connections to the DB
conn = mysql.connector.Connect(host='localhost',user='root',\
password='',database='social')
In [14]: import pandas
import mysql.connector
# set up connections to the DB
conn = mysql.connector.Connect(host='localhost',user='root',\
password='',database='social')
#LOAD the SQL tables into DF
qryFr = """
SELECT * From Friend
"""
qryHs = """
SELECT * From Highschooler
"""
qryLi = """
SELECT * From Likes
"""
frndDF = pandas.read_sql( qryFr, conn )
hsDF = pandas.read_sql( qryHs, conn )
likesDF = pandas.read_sql( qryLi, conn )
Question-1 : Find the names of all students who are friends with someone named Gabriel.
Solution using SQL.
In [15]: #SQL query
qry = """
SELECT hs1.name FROM Highschooler hs1
INNER JOIN Friend fr1 ON fr1.ID1=hs1.ID
INNER JOIN Highschooler hs2 ON hs2.ID=fr1.ID2
WHERE hs2.name='Gabriel'
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name
0 Jordan
1 Alexis
2 Cassandra
3 Andrew
4 Jessica
[5 rows x 1 columns]
Solution using Pandas.
Methods used : merge(inner, using lefton, righton), tolist(), isin()
In [16]: # First merge the ID col of hsDF with ID1 of frndDF
resDF = pandas.merge( hsDF, frndDF, \
left_on = 'ID',\
right_on = 'ID1', how='inner')
# select the 'ID2' cols for rows which have name='Gabriel'
frndIDs = resDF[ resDF['name'] == 'Gabriel' ]['ID2']\
.tolist()
# Now from hsDF get the names of frndIDs
frndNames = hsDF[ hsDF['ID'].isin(frndIDs) ]
print frndNames['name']
#SQL query
0 Jordan
3 Cassandra
5 Andrew
8 Alexis
11 Jessica
Name: name, dtype: object
Question-2 : For every student who likes someone 2 or more grades younger than themselves, return that student's name and grade, and the name and grade of the student they like.
Solution using SQL.
In [17]: #SQL query
qry = """
SELECT hs1.name, hs1.grade, hs2.name, hs2.grade FROM Likes l1
INNER JOIN Highschooler hs1 ON l1.ID1=hs1.ID
INNER JOIN Highschooler hs2 ON l1.ID2=hs2.ID
WHERE hs1.grade-hs2.grade>=2
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade name grade
0 John 12 Haley 10
[1 rows x 4 columns]
Solution using Pandas.
Methods used : merge(inner, using lefton and righton)
In [18]: # First merge the ID col of hsDF with ID1 of likesDF
resDF = pandas.merge( hsDF, likesDF, \
left_on = 'ID',\
right_on = 'ID1', how='inner')
# Now merge hsDF again with resDF on ID2
resDF = pandas.merge( resDF, hsDF, \
left_on = 'ID2',\
right_on = 'ID', how='inner')
# Now select rows which satisfy greater than 2 grades condition
resDF = resDF[ resDF['grade_x'] - resDF['grade_y'] >= 2 ].\
reset_index(drop=True)
# select the required cols
resDF = resDF[ [ 'name_x', 'grade_x', 'name_y', 'grade_y' ] ]
print resDF
name_x grade_x name_y grade_y
0 John 12 Haley 10
[1 rows x 4 columns]
Question-3 : For every pair of students who both like each other, return the name and grade of both students. Include each pair only once, with the two names in alphabetical order.
Solution using SQL.
In [19]: #SQL query
qry = """
SELECT hs1.name, hs1.grade, hs2.name, hs2.grade FROM Likes l1
INNER JOIN Likes l2 ON l1.ID1=l2.ID2 AND l1.ID2=l2.ID1
INNER JOIN Highschooler hs1 ON hs1.ID=l1.ID1
INNER JOIN Highschooler hs2 ON hs2.ID=l2.ID1
WHERE hs1.name < hs2.name
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade name grade
0 Cassandra 9 Gabriel 9
1 Jessica 11 Kyle 12
[2 rows x 4 columns]
Solution using Pandas.
Methods used : merge(inner, self, using lefton and righton)
In [20]: # First merge likesDF with itself
resDF = pandas.merge( likesDF, likesDF, \
left_on = 'ID1',\
right_on = 'ID2', how='inner' )
# select the rows where ID2_x = ID1_y
resDF = resDF[ resDF['ID2_x'] == resDF['ID1_y'] ].\
reset_index(drop=True)
# merge hsDF with resDF on ID1_x and also with ID2_x
resDF = pandas.merge( resDF, hsDF, \
left_on = 'ID1_x',\
right_on = 'ID', how='inner' )
resDF = pandas.merge( resDF, hsDF, \
left_on = 'ID2_x',\
right_on = 'ID', how='inner' )
# get name_x, name_y cols from rows where name_x < name_y
resDF = resDF[ resDF['name_x'] < resDF['name_y'] ]\
[ ['name_x', 'grade_x','name_y', 'grade_y'] ]
print resDF
name_x grade_x name_y grade_y
1 Cassandra 9 Gabriel 9
2 Jessica 11 Kyle 12
[2 rows x 4 columns]
Question-4 : Find all students who do not appear in the Likes table (as a student who likes or is liked) and return their names and grades. Sort by grade, then by name within each grade.
Solution using SQL.
In [21]: #SQL query
qry = """
SELECT hs.name, hs.grade FROM Highschooler hs
LEFT JOIN Likes l1 ON l1.ID1=hs.ID OR l1.ID2=hs.ID
WHERE l1.ID1 IS NULL AND l1.ID2 IS NULL
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade
0 Jordan 9
1 Tiffany 9
2 Logan 12
[3 rows x 2 columns]
Solution using Pandas.
Methods used : tolist(), isin()
In [22]: # Get a unique list of all IDs (both ID1 and ID2) in likesDF
likeIds = set( likesDF['ID1'].tolist() + likesDF['ID2'].tolist() )
# get names/grades from hsDF
resDF = hsDF[ ~hsDF['ID'].isin(likeIds) ][['name', 'grade']].\
reset_index(drop=True)
print resDF
name grade
0 Jordan 9
1 Tiffany 9
2 Logan 12
[3 rows x 2 columns]
Question-5 : For every situation where student A likes student B, but we have no information about whom B likes (that is, B does not appear as an ID1 in the Likes table), return A and B's names and grades.
Solution using SQL.
In [23]: #SQL query
qry = """
SELECT hs1.name, hs1.grade, hs2.name, hs2.grade
FROM Highschooler hs1
INNER JOIN Likes l1 ON l1.ID1=hs1.ID
INNER JOIN Highschooler hs2 ON l1.ID2=hs2.ID
WHERE hs2.ID NOT IN ( SELECT ID1 FROM Likes )
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade name grade
0 John 12 Haley 10
1 Brittany 10 Kris 10
2 Alexis 11 Kris 10
3 Austin 11 Jordan 12
[4 rows x 4 columns]
Solution using Pandas.
Methods used : tolist(), merge(inner, using lefton and righton)
In [24]: # Get list of all ID1 from likesDF
likesId1 = set( likesDF['ID1'].tolist() )
# merge likesDF with hsDF on ID1
resDF = pandas.merge( hsDF, likesDF, \
left_on = 'ID',\
right_on = 'ID1', how='inner' )
# Now merge resDF with hsDF on ID2
resDF = pandas.merge( resDF, hsDF, \
left_on = 'ID2',\
right_on = 'ID', how='inner' )
# Now get all rows which dont have likesId1 in ID2 col
resDF = resDF[ ~resDF['ID2'].isin(likesId1) ].\
reset_index(drop=True)
# filter required cols
resDF = resDF[ ['name_x', 'grade_x', 'name_y', 'grade_y' ] ]
print resDF
name_x grade_x name_y grade_y
0 Brittany 10 Kris 10
1 Alexis 11 Kris 10
2 Austin 11 Jordan 12
3 John 12 Haley 10
[4 rows x 4 columns]
Question-6 : Find names and grades of students who only have friends in the same grade. Return the result sorted by grade, then by name within each grade.
Solution using SQL.
In [25]: #SQL query
qry = """
SELECT name, grade FROM Highschooler WHERE ID NOT IN
(SELECT hs1.ID FROM Highschooler hs1
INNER JOIN Friend fr ON fr.ID1=hs1.ID
INNER JOIN Highschooler hs2 ON hs2.ID=fr.ID2
WHERE hs2.grade-hs1.grade != 0 )
ORDER BY grade, name
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade
0 Jordan 9
1 Brittany 10
2 Haley 10
3 Kris 10
4 Gabriel 11
5 John 12
6 Logan 12
[7 rows x 2 columns]
Solution using Pandas.
Methods used : tolist(), merge(), isin(), drop_duplicates()
In [26]: # Merge frndDF with hsDF on ID1 and ID2 seperately
resDF = pandas.merge( hsDF, frndDF, \
left_on = 'ID',\
right_on = 'ID1', how='inner' )
resDF = pandas.merge( resDF, hsDF, \
left_on = 'ID2',\
right_on = 'ID', how='inner' )
# select the rows where friends are in different grades
id1List = resDF[ resDF['grade_x'] - resDF['grade_y'] != 0 ]['ID1']
# Now select rows which are not in rowNums
# also filter the cols
resDF = resDF[ ~resDF['ID1'].isin(id1List) ]\
[ ['name_x', 'grade_x'] ]\
.drop_duplicates()\
.reset_index(drop=True)
print resDF.sort( ['grade_x', 'name_x'] )
name_x grade_x
0 Jordan 9
4 Brittany 10
3 Haley 10
1 Kris 10
2 Gabriel 11
6 John 12
5 Logan 12
[7 rows x 2 columns]
Question-7 : For each student A who likes a student B where the two are not friends, find if they have a friend C in common (who can introduce them!). For all such trios, return the name and grade of A, B, and C.
Solution using SQL.
In [27]: #SQL query
qry = """
SELECT hs1.name, hs1.grade, hs3.name, hs3.grade , hs2.name, hs2.grade
FROM Highschooler hs1
INNER JOIN Likes l1 ON l1.ID1=hs1.ID
LEFT JOIN Friend fl ON fl.ID1=l1.ID1 AND fl.ID2=l1.ID2
INNER JOIN Friend f1 ON f1.ID1=l1.ID1
INNER JOIN Friend f2 ON f2.ID1=l1.ID2 AND f2.ID2=f1.ID2
INNER JOIN Highschooler hs2 ON hs2.ID = f2.ID2
INNER JOIN Highschooler hs3 ON hs3.ID = f2.ID1
WHERE fl.ID2 IS NULL
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade name grade name grade
0 Andrew 10 Cassandra 9 Gabriel 9
1 Austin 11 Jordan 12 Andrew 10
2 Austin 11 Jordan 12 Kyle 12
[3 rows x 6 columns]
Solution using Pandas.
Methods used : merge()
In [28]: # Merge frndDF with hsDF on ID1 and ID2 seperately
resFrndDF = pandas.merge( hsDF, frndDF, \
left_on = 'ID',\
right_on = 'ID1', how='inner' )
resFrndDF = pandas.merge( resFrndDF, hsDF, \
left_on = 'ID2',\
right_on = 'ID', how='inner' )
# Merge likesDF with hsDF on ID1 and ID2 seperately
resLikesDF = pandas.merge( hsDF, likesDF, \
left_on = 'ID',\
right_on = 'ID1', how='inner' )
resLikesDF = pandas.merge( resLikesDF, hsDF, \
left_on = 'ID2',\
right_on = 'ID', how='inner' )
# Merge the resLikesDF, resFrndDF on ID1, ID2
resDF = pandas.merge( resLikesDF, resFrndDF, \
left_on = ['ID1', 'ID2'],\
right_on = ['ID1', 'ID2'], how='left' )
# Get the rows with Null values in ID_x_y
# Merge resDF with frndDF
resDF = resDF[ resDF['ID_x_y'].isnull() ].\
reset_index(drop=True)
# Now we have a DF containing students in like table
# but not are not in frnds table.
# drop un-necessary cols
resDF = resDF[ ['ID_x_x', 'name_x_x', \
'grade_x_x', 'ID_y_x', 'name_y_x', \
'grade_y_x'] ]
# Make a new DF merging resDF with Frnd on ID1, ID_x_x
resDFIdx = pandas.merge( resDF, frndDF,\
left_on = 'ID_x_x',\
right_on = 'ID1', how='inner' )
# Make another DF merging resDF with Frnd on ID1, ID_y_x
resDFIdy = pandas.merge( resDF, frndDF,\
left_on = 'ID_y_x',\
right_on = 'ID1', how='inner' )
# Merge the new DF's on ID2
resDF = pandas.merge( resDFIdx, resDFIdy,\
on = 'ID2', how='inner' )
# to actually get the names of common friends
# merge resDF with hsDF ( on ID2 )
resDF = pandas.merge( resDF, hsDF,\
left_on = 'ID2',\
right_on = 'ID', how='inner' )
# filter the results
resDF = resDF[ (resDF['name_x_x_x'] == resDF['name_x_x_y']) & \
(resDF['name_y_x_x'] == resDF['name_y_x_y']) ]
# get the req cols
resDF = resDF[ ['name_x_x_x', 'grade_x_x_x',\
'name_y_x_x', 'grade_y_x_x',\
'name', 'grade'] ]
print resDF
name_x_x_x grade_x_x_x name_y_x_x grade_y_x_x name grade
1 Andrew 10 Cassandra 9 Gabriel 9
2 Austin 11 Jordan 12 Kyle 12
4 Austin 11 Jordan 12 Andrew 10
[3 rows x 6 columns]
Question-8 : Find the difference between the number of students in the school and the number of different first names.
Solution using SQL.
In [29]: #SQL query
qry = """
SELECT COUNT(DISTINCT ID)-COUNT(DISTINCT name) FROM Highschooler
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
COUNT(DISTINCT ID)-COUNT(DISTINCT name)
0 2
[1 rows x 1 columns]
Solution using Pandas.
Methods used : nunique()
In [30]: # We can use nunique function on columns
print hsDF['ID'].nunique() - hsDF['name'].nunique()
2
Question-9 : Find the name and grade of all students who are liked by more than one other student.
Solution using SQL.
In [31]: #SQL query
qry = """
SELECT name, grade FROM Highschooler
WHERE ID IN
( SELECT ID2 FROM Likes
GROUP BY ID2
HAVING COUNT(ID1) > 1 )
"""
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
name grade
0 Cassandra 9
1 Kris 10
[2 rows x 2 columns]
Solution using Pandas.
Methods used : groupby(), count(), isin()
In [32]: # Groupby likesDF
likeGroups = likesDF.groupby( 'ID2' )
# get the num of likes for each person
nLikes = likeGroups['ID1'].count()
# get ID's where count > 1
nLikes = nLikes[ nLikes > 1 ]
# get the index values from nLikes
# then get corresponding details from hsDF
# about the ID
idList = nLikes.index.values
resDF = hsDF[ hsDF['ID'].isin(idList) ]
print resDF[ ['name', 'grade'] ].reset_index(drop=True)
name grade
0 Cassandra 9
1 Kris 10
[2 rows x 2 columns]