Equating SQL and Pandas (Part-4)
2015-02This post is the final part in the four part series covering practice exercises to learn pandas (by comparing it with SQL). The first post, second post and third post cover different exercises.
Here, we try another assignment from coursera introduction to databases course.
SQL Socail-Network Query Exercises (core set)
In this part we'll use a database called 'social'. I downloaded it from the Introduction to databases course in coursera. The database has three tables ( 'Highschooler', 'Friend', 'Likes' ). The schema is shown below.
Highschooler table | ID | name | grade |
Friend table | ID1 | ID2 |
Likes table | ID1 | ID2 |
In [8]: import pandas
import mysql.connector
# set up connections to the DB
conn = mysql.connector.Connect(host='localhost',user='root',\
                        password='',database='social')In [9]: import pandas
import mysql.connector
# set up connections to the DB
conn = mysql.connector.Connect(host='localhost',user='root',\
                        password='',database='social')
#LOAD the SQL tables into DF
qryFr = """
        SELECT * From Friend
      """
qryHs = """
        SELECT * From Highschooler
      """
qryLi = """
        SELECT * From Likes
      """
frndDF = pandas.read_sql( qryFr, conn )
hsDF = pandas.read_sql( qryHs, conn )
likesDF = pandas.read_sql( qryLi, conn )Question-1 : For every situation where student A likes student B, but student B likes a different student C, return the names and grades of A, B, and C.
Solution using SQL.
In [10]: #SQL query
qry = """
        SELECT hs1.name, hs1.grade, hs2.name, hs2.grade, hs3.name, hs3.grade
        FROM Highschooler hs1
        INNER JOIN Likes l1 ON l1. ID1 = hs1.ID
        INNER JOIN Likes l2 ON l1.ID2=l2.ID1 AND l1.ID1 != l2.ID2
        INNER JOIN Highschooler hs2 ON hs2.ID=l2.ID1
        INNER JOIN Highschooler hs3 ON hs3.ID=l2.ID2
      """
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
      name  grade       name  grade     name  grade
0   Andrew     10  Cassandra      9  Gabriel      9
1  Gabriel     11     Alexis     11     Kris     10
[2 rows x 6 columns]Solution using Pandas.
Methods used : merge(inner, using lefton, righton), drop()
In [11]: # Merge likesDF on itself
resDF = pandas.merge( likesDF, likesDF, \
                     left_on = 'ID2',\
                     right_on = 'ID1', how='inner')
# Now take out the ID's who mutually like each other
resDF = resDF[ resDF['ID1_x'] != resDF['ID2_y'] ]
# drop unnecessary columns and merge resDF with hsDF
# to get the name and grade details from all the IDs
resDF.drop( 'ID1_y', 1, inplace=True )
resDF = pandas.merge( resDF, hsDF, \
                     left_on = 'ID1_x',\
                     right_on = 'ID', how='inner')
resDF.drop( ['ID1_x', 'ID'], 1, inplace=True )
resDF = pandas.merge( resDF, hsDF, \
                     left_on = 'ID2_x',\
                     right_on = 'ID', how='inner')
resDF.drop( ['ID2_x', 'ID'], 1, inplace=True )
resDF = pandas.merge( resDF, hsDF, \
                     left_on = 'ID2_y',\
                     right_on = 'ID', how='inner')
resDF.drop( ['ID2_y', 'ID'], 1, inplace=True )
print resDF
    name_x  grade_x     name_y  grade_y     name  grade
0   Andrew       10  Cassandra        9  Gabriel      9
1  Gabriel       11     Alexis       11     Kris     10
[2 rows x 6 columns]Question-2 : Find those students for whom all of their friends are in different grades from themselves. Return the students' names and grades.
Solution using SQL.
In [12]: #SQL query
qry = """
        SELECT name, grade FROM Highschooler WHERE ID NOT IN
        (SELECT hs1.ID FROM Highschooler hs1
        INNER JOIN Friend f1 ON f1.ID1=hs1.ID
        INNER JOIN Highschooler hs2 ON f1.ID2=hs2.ID
        WHERE hs1.grade-hs2.grade = 0 )
      """
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
     name  grade
0  Austin     11
[1 rows x 2 columns]Solution using Pandas.
Methods used : merge, drop(), tolist(), isin()
In [13]: # Merge hsDF with frndDF
resDF = pandas.merge( hsDF, frndDF, \
                     left_on = 'ID',\
                     right_on = 'ID1', how='inner')
resDF = pandas.merge( resDF, hsDF, \
                     left_on = 'ID2',\
                     right_on = 'ID', how='inner')
resDF.drop( ['ID1', 'ID2'], 1, inplace=True )
# select all students who have frnds in same grade
resDF = resDF[ resDF['grade_x'] - resDF['grade_y'] == 0 ]
# get a list of their IDs
idList = resDF['ID_x'].tolist()
# get the name and grades of students 
# not in the list from hsDF
resDF = hsDF[ ~hsDF['ID'].isin(idList) ]
print resDF[ ['name', 'grade'] ].reset_index(drop=True)
     name  grade
0  Austin     11
[1 rows x 2 columns]Question-3 : What is the average number of friends per student? (Your result should be just one number.)
Solution using SQL.
In [14]: #SQL query
qry = """
        SELECT AVG(T.cnt_by_frnd) FROM ( SELECT ID1, COUNT(ID2) cnt_by_frnd FROM Friend
        GROUP BY ID1 ) T
      """
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
   AVG(T.cnt_by_frnd)
0                 2.5
[1 rows x 1 columns]Solution using Pandas.
Methods used : groupby(), count(), mean()
In [15]: # groupby frndDF on ID1 and get counts of ID2
frndGrps = frndDF.groupby( ['ID1'] )
# get number of frnds for each student
nFrnds = frndGrps['ID2'].count()
print nFrnds.mean()
2.5Question-4 : Find the number of students who are either friends with Cassandra or are friends of friends of Cassandra. Do not count Cassandra, even though technically she is a friend of a friend.
Solution using SQL.
In [16]: #SQL query
qry = """
        SELECT COUNT(ID2) FROM Friend WHERE ID1 IN ( SELECT ID2 FROM Friend 
        WHERE ID2 != ( SELECT ID FROM Highschooler 
        WHERE name = 'Cassandra' ) 
        AND (ID1 = ( SELECT ID FROM Highschooler 
        WHERE name = 'Cassandra' ) ) 
        OR ID1 =  ( SELECT ID FROM Highschooler 
        WHERE name = 'Cassandra' ) )
      """
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
   COUNT(ID2)
0           7
[1 rows x 1 columns]Solution using Pandas.
Methods used : tolist()
In [17]: # Get the ID of 'Cassandra'
idCsn = hsDF[ hsDF['name'] == 'Cassandra' ]\
        ['ID'].values[0]
# get all friends of this particular ID
frndList = frndDF[ frndDF['ID1'] == idCsn ]\
            ['ID2'].tolist()
# get friends of friends as well and make 
# it a set to get unique values
frndList = set( frndList + frndDF\
        [ ( frndDF['ID1'].isin(frndList) )\
         & (frndDF['ID2'] != idCsn) ]['ID2'].\
        tolist() )
print len(frndList)
7Question-5 : Find the name and grade of the student(s) with the greatest number of friends.
Solution using SQL.
In [18]: #SQL query
qry = """
        SELECT name, grade FROM Highschooler
        INNER JOIN ( 
        SELECT ID1 FROM Friend
        GROUP BY ID1
        HAVING COUNT(ID1) =
        ( SELECT COUNT(ID2) FROM Friend
        GROUP BY ID1
        ORDER BY COUNT(ID2) DESC 
        LIMIT 1 )
        ) T
        ON ID = T.ID1
      """
# get the data
qDF = pandas.read_sql( qry, conn )
# print the data
print qDF
     name  grade
0  Andrew     10
1  Alexis     11
[2 rows x 2 columns]Solution using Pandas.
Methods used : groupby(), count(), isin()
In [19]: # Groupby ID1 and get max frnd number
frndGrps = frndDF.groupby( 'ID1' )
maxFrnds = frndGrps['ID2'].count().max()
# get ID with max frnds
countIdSer = frndGrps['ID2'].count()
idMaxFrnd = countIdSer[ countIdSer == maxFrnds ]
# get name, grade for the IDs
resDF = hsDF[ hsDF['ID'].isin(idMaxFrnd.index.values) ]\
        [ ['name','grade'] ]
print resDF.reset_index(drop=True)
     name  grade
0  Andrew     10
1  Alexis     11
[2 rows x 2 columns]