Python 03: find out intersection of two guest lists

Problem
Two guest lists from Meetup: DL and ML. Who is going to both meetups?

Data
The guest list of DL is an xlsx file.

The guest list of ML will be extracted from the web.
(Open any Meetup group page online and you will see a similar list.)

So we need the list of names on the right side, under “going”.
The browser's “Inspect” tool helps a lot when targeting a certain tag: use it to find the identifying features (that is, which tag you need and its class name).

Here, “Irene Li” sits inside an <a> tag that has no class name, so we look one level out: the enclosing <h5> tag has the class name “padding-none member-name”, a unique name. So the idea is to first find all <h5> tags with that class name, then read the contents of the <a> tag inside each <h5> tag.
Normally you can fetch a page's contents with urllib, but this page requires a login (I didn't research how to handle that), so I saved the page as an HTML file and used it as the input.
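To make the tag-matching idea concrete, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. It is written for Python 3, and the sample markup, the GuestNameParser class, and the second and third list entries are made up for illustration; only the <h5>/<a> structure and the class name come from the page described above.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modeled on the structure described above.
SAMPLE_HTML = """
<ul>
  <li><h5 class="padding-none member-name"><a href="#">Irene Li</a></h5></li>
  <li><h5 class="other"><a href="#">Not a guest</a></h5></li>
  <li><h5 class="padding-none member-name"><a href="#">SY</a></h5></li>
</ul>
"""

TARGET_CLASSES = {"padding-none", "member-name"}


class GuestNameParser(HTMLParser):
    """Collects the text of <a> tags nested inside the target <h5> tags."""

    def __init__(self):
        super().__init__()
        self.in_h5 = False  # currently inside an <h5> carrying both classes
        self.in_a = False   # currently inside an <a> within such an <h5>
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h5":
            classes = set((dict(attrs).get("class") or "").split())
            self.in_h5 = TARGET_CLASSES <= classes
        elif tag == "a" and self.in_h5:
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "h5":
            self.in_h5 = False
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a matching <a>.
        if self.in_a and data.strip():
            self.names.append(data.strip())


parser = GuestNameParser()
parser.feed(SAMPLE_HTML)
guest_names = parser.names
print(guest_names)
```

The <h5> with a different class is skipped, so only the names under the target class end up in guest_names.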

Tool
Python, or more precisely Jupyter Notebook, which is a powerful tool.

Code

Libs you might need:

import pyexcel as pe
import pyexcel.ext.xls # import it to handle xls file
import pyexcel.ext.xlsx # import it to handle xlsx file
import urllib # you might not need it
from bs4 import BeautifulSoup

get data from DL, the excel sheet:


print 'hello, they are going:'
records = pe.get_records(file_name="DL.xlsx")

a=[]
# for each row, we need information from only two columns
for row in records:
    # if the guest is going, then we keep the name
    yes=row['RSVPed Yes']
    if yes == long(1):
        a.append(row['Name'])

print type(a)
print a

Output looks like this:

hello, they are going:
[u'AB', u'AM', u'AK', u'....]

Btw, “type” is useful, since I can't always remember an object's type…
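The filtering loop can also be written as a list comprehension. Below is a small sketch that assumes each record is a dict keyed by the column headers, as in the loop above; the rows are made-up stand-ins for the spreadsheet, and the name going is hypothetical.

```python
# Made-up rows standing in for the records read from DL.xlsx:
# one dict per row, keyed by the column headers used in this post.
records = [
    {"Name": "AB", "RSVPed Yes": 1},
    {"Name": "XX", "RSVPed Yes": 0},
    {"Name": "AM", "RSVPed Yes": 1},
]

# Keep the name of every guest whose RSVP column is 1.
going = [row["Name"] for row in records if row["RSVPed Yes"] == 1]
print(going)
```

Comparing against plain 1 also matches a long or float value of 1, so it should behave like the long(1) check as long as the column holds numbers.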

get data from the html file:

html = open("ML.html",'r').read()

soup = BeautifulSoup(html,"html.parser")
#for older versions, it should be: soup = BeautifulSoup(html)

# tags = soup('a')

tags = soup.find_all("h5", class_="padding-none member-name")

# print tag
went = []
for tag in tags:
    atag = tag.find("a").contents
    # the type of atag here is list! so we only need the first item!
    went.append(atag[0])
print type(went)
print went

And the output looks like this:

[u'Irene Li', u'SY', u'AD',...]

So make sure the two outputs have the same type (list, or set if you want).

Then let’s find out the intersection:

list(set(a) & set(went))

Output:

[u'JG',
 u'GJ',
 u'an',
 u'AM']
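A plain set intersection only matches names that are spelled exactly the same in both sources, and the result comes back in no particular order. If the two lists might differ in capitalization or stray spaces, something like the sketch below could help. It is written for Python 3; normalize and common_guests are hypothetical helper names, and the sample lists are made up rather than taken from the real guest data.

```python
def common_guests(first, second):
    """Return cleaned names from `first` that also appear in `second`,
    ignoring case and extra whitespace, in the order they occur in `first`."""

    def normalize(name):
        # Collapse runs of whitespace, trim the ends, and ignore case.
        return " ".join(name.split()).casefold()

    second_keys = {normalize(n) for n in second}
    seen = set()
    result = []
    for name in first:
        key = normalize(name)
        if key in second_keys and key not in seen:
            seen.add(key)
            result.append(" ".join(name.split()))
    return result


# Made-up sample lists, not the real guest data.
dl_names = ["Irene Li", "AM ", "JG", "AB"]
ml_names = ["irene  li", "AM", "GJ", "JG"]

both = common_guests(dl_names, ml_names)
print(both)
```

Whether this extra step is worth it depends on how consistently people write their names on the two pages; different people who share a name will still be counted as one.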
