Problem
Two guest lists from Meetup: DL and ML. Who is going to both meetups?
Data
The DL guest list is an xlsx file:
The ML guest list will be extracted from the web:
(Meetup groups are online; search for one at random and you will see an example.)
So we need the list of names on the right side, under “going”.
Use the browser's “Inspect” tool, which helps a lot when targeting a certain tag. Just find the features you need (say, which tag, and what its class name is).
Here, in this case, “Irene Li” is in an <a> tag with no class name, so we look outside of it: there is an <h5> tag whose class name is “padding-none member-name”, a unique name. So the idea is, first we find all the <h5> tags with that class name. Then we get the contents of the <a> tag inside each <h5> tag.
Normally you can fetch a URL's contents with urllib, but this page requires a login (I didn't look into how to handle that), so I saved the page as an HTML file and used it as the input file.
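To make the "find the <h5>, then read the <a> inside it" idea concrete, here is a minimal sketch. It is not the approach used later in this post (that one uses BeautifulSoup); it uses only Python 3's standard html.parser module so it runs with no extra installs. The class name MemberNameParser and the sample_html snippet are made up for illustration, and the sketch assumes the real page's markup looks like the structure described above.

```python
from html.parser import HTMLParser


class MemberNameParser(HTMLParser):
    """Collect the text of <a> tags nested inside <h5 class="... member-name ...">."""

    def __init__(self):
        super().__init__()
        self.in_h5 = False  # True while inside a matching <h5>
        self.in_a = False   # True while inside an <a> within that <h5>
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may be missing or None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h5" and "member-name" in classes:
            self.in_h5 = True
        elif tag == "a" and self.in_h5:
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "h5":
            self.in_h5 = False
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        # Keep only non-empty text found inside the targeted <a>
        if self.in_a and data.strip():
            self.names.append(data.strip())


# Made-up snippet shaped like the markup described above (an assumption).
sample_html = """
<h5 class="padding-none member-name"><a href="#">Irene Li</a></h5>
<h5 class="other"><a href="#">Not a guest</a></h5>
<h5 class="padding-none member-name"><a href="#">SY</a></h5>
"""

parser = MemberNameParser()
parser.feed(sample_html)
print(parser.names)
```

The second <h5> is skipped because its class does not match, which is exactly why the class name has to be unique to the guest entries.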
Tool
Python, or rather Jupyter Notebook, which is a powerful one.
Code
Libs you might need:
import pyexcel as pe
import pyexcel.ext.xls # import it to handle xls file
import pyexcel.ext.xlsx # import it to handle xlsx file
import urllib # you might not need it
from bs4 import BeautifulSoup
Get data from DL, the Excel sheet:
print 'hello, they are going:'
records = pe.get_records(file_name="DL.xlsx")
a=[]
# for each row, we need information from only two columns
for row in records:
    # if the guest is going, then we keep the name
    yes=row['RSVPed Yes']
    if yes == long(1):
        a.append(row['Name'])
print type(a)
print a
Output looks like this:
hello, they are going:
[u'AB', u'AM', u'AK', u'....]
By the way, “type” is useful, since I sometimes cannot remember an object's type.
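The filtering step above does not depend on the spreadsheet library, so it can be tried on plain data. The sketch below is written for Python 3, where long no longer exists and int covers the same values. The records list is made up; it only assumes that each row behaves like a dict keyed by column header, as in the loop above. The function name going_names is hypothetical.

```python
# Hypothetical rows: one dict per spreadsheet row, keyed by column header.
records = [
    {"Name": "AB", "RSVPed Yes": 1},
    {"Name": "CD", "RSVPed Yes": 0},
    {"Name": "AM", "RSVPed Yes": 1},
]


def going_names(rows):
    """Return the names of guests whose 'RSVPed Yes' cell equals 1."""
    return [row["Name"] for row in rows if row["RSVPed Yes"] == 1]


going = going_names(records)
print(type(going))  # shows the object type, same trick as above
print(going)
```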
Get data from the HTML file:
html = open("ML.html",'r').read()
soup = BeautifulSoup(html,"html.parser")
# for older versions, it should be: soup = BeautifulSoup(html)
# tags = soup('a')
tags = soup.find_all("h5", class_="padding-none member-name")
# print tag
went = []
for tag in tags:
    atag = tag.find("a").contents
    # the type of atag here is list! so we only need the first item!
    went.append(atag[0])
print type(went)
print went
And the output looks like this:
[u'Irene Li', u'SY', u'AD',...]
So make sure the two outputs have the same type (list, or set if you want).
Then let’s find out the intersection:
list(set(a) & set(went))
Output:
[u'JG', u'GJ', u'an', u'AM']
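The set intersection above matches names exactly, so “JG ” (with a trailing space) and “jg” would not count as the same guest. Whether the two real lists differ in this way is an assumption; if they do, a sketch like the following (Python 3) may help. The helpers normalize and common_guests and the sample lists are made up for illustration.

```python
def normalize(name):
    """Collapse whitespace and ignore case so near-identical names compare equal."""
    return " ".join(name.split()).casefold()


def common_guests(first, second):
    """Return names from `first` that also appear in `second`, sorted."""
    second_keys = {normalize(n) for n in second}
    return sorted({n.strip() for n in first if normalize(n) in second_keys})


# Made-up sample lists, not the real guest lists
dl = ["AM", "JG ", "GJ", "XY"]
ml = ["Irene Li", "jg", "AM", "GJ"]

both = common_guests(dl, ml)
print(both)
```

The result keeps the spelling from the first list, and sorting makes the output stable, which a bare set does not guarantee.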