About Protocols
Transmission Control Protocol (TCP) works on the transport layer.
TCP port numbers…
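To make the idea of a port number a bit more concrete, here is a small sketch that runs entirely on the local machine. It is written for Python 3 (unlike the Python 2 listings in this post), and the helper names echo_once and loopback_roundtrip are made up for illustration. A throwaway server asks the operating system for any free port, and a client then connects to that same address and port.

```python
import socket
import threading


def echo_once(server_sock):
    """Accept one connection and send back whatever arrives first."""
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def loopback_roundtrip(message):
    """Send `message` to a throwaway local TCP server and return the echo."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]  # the port number the OS assigned

        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()

        # The client needs both the host and the port to reach the server.
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(message)
            reply = client.recv(1024)

        worker.join()
    return port, reply


port, reply = loopback_roundtrip(b"hello")
print(port, reply)
```

The printed port changes from run to run because the operating system chooses it. A web server, by contrast, normally listens on the well-known port 80, which is why the socket example in this post connects to port 80.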
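    # Sockets in Python (Python 2 syntax, as in the rest of this post)
    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('www.py4inf.com', 80))
    mysock.send('GET http://py4inf.com/code/romeo.txt HTTP/1.0\n\n')

    # Read the response in chunks of up to 512 bytes until the server closes
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print data

    mysock.close()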
The result:
HTTP/1.1 200 OK
Date: Mon, 09 Nov 2015 21:19:27 GMT
Server: Apache
Last-Modified: Fri, 07 Aug 2015 16:39:14 GMT
ETag: "20a1817f-a7-51cbb46b621a7"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fai
r sun and kill the envious moon
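Who is already sick and pale with grief
A socket is a low-level layer, so the output keeps the HTTP header information. The break in the middle of "fair" is where one 512-byte recv() chunk ends and print starts a new line.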
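Since the socket hands back the headers together with the document, one way to separate them is to split the raw bytes at the first blank line. The sketch below is written for Python 3, where recv() returns bytes. The function name split_response and the shortened sample response are made up for illustration. With a real socket you would first join all the received chunks into one bytes object.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    # A blank line (CRLF CRLF) marks the end of the headers.
    head, _sep, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# A shortened, hand-written response for illustration (not captured output).
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"But soft what light through yonder window breaks\n"
)

status, headers, body = split_response(sample)
print(status)
print(headers["content-type"])
print(body.decode())
```

Header names are lower-cased here because HTTP header names are case-insensitive. This sketch keeps only the last value if a header appears more than once.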
Using urllib
    import urllib

    fhand = urllib.urlopen('http://py4inf.com/code/romeo.txt')
    for line in fhand:
        print line.strip()
urllib takes care of the HTTP request and the headers, so reading the page works just like reading an open file.
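For readers on Python 3, where urllib.urlopen moved to urllib.request.urlopen, a rough equivalent might look like the sketch below. The function name read_lines is made up for illustration, and the demo uses a data: URL so that it runs without a network connection.

```python
from urllib.request import urlopen


def read_lines(url):
    """Return the lines of the document at `url`, decoded and stripped."""
    with urlopen(url) as fhand:
        # In Python 3 each line arrives as bytes, so decode before stripping.
        return [line.decode().strip() for line in fhand]


# A data: URL keeps the demo offline; an http:// URL is read the same way.
demo_url = "data:text/plain,But%20soft%0AIt%20is%20the%20east"
for line in read_lines(demo_url):
    print(line)
```

Calling read_lines('http://py4inf.com/code/romeo.txt') would be the counterpart of the listing above, assuming that address is still being served.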
Parsing HTML with BeautifulSoup lib
Regular expressions can be used to parse HTML, but the easier way is to use Beautiful Soup.
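For comparison, the regular-expression route could look something like this sketch (Python 3). The pattern is deliberately simple and only catches href values in double quotes, which hints at why real-world HTML is easier to handle with a proper parser. The names HREF_PATTERN and find_links and the sample string are made up for illustration.

```python
import re

# Deliberately simple pattern: double-quoted href values only.
HREF_PATTERN = re.compile(r'href="(.+?)"')


def find_links(html):
    """Return every double-quoted href value found in `html`."""
    return HREF_PATTERN.findall(html)


sample = '<p><a href="http://www.dr-chuck.com/">Home</a> <a href="/about">About</a></p>'
print(find_links(sample))
```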
Place BeautifulSoup.py in the same folder as your other Python code.
Download it here: http://www.crummy.com/software/BeautifulSoup/
(I am using version 4.1.)
Unzip the file and install it from the command line:
    python setup.py install
If you are using PyDev in Eclipse, you will find that it detects the newly installed library automatically.
The code follows:
    import urllib
    from bs4 import BeautifulSoup

    url = raw_input('Enter - ')
    html = urllib.urlopen(url).read()
    # For older versions, it should be: soup = BeautifulSoup(html)
    soup = BeautifulSoup(html, "html.parser")

    # Find all anchor tags and print the href attribute of each
    tags = soup('a')
    for tag in tags:
        print tag.get('href', None)
The code finds all hyperlink (anchor) tags and prints the URL of each.
The result:
Enter - http://www.dr-chuck.com/
http://www.dr-chuck.com/csev-blog/
http://www.si.umich.edu/
http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1159280
http://www.dr-chuck.com/csev-blog/
http://www.twitter.com/drchuck/
…
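If Beautiful Soup is not installed, a similar link listing can be sketched with Python 3's built-in html.parser module. This is a swapped-in technique rather than the Beautiful Soup approach above, and the names LinkCollector and extract_links, as well as the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self.links.append(dict(attrs).get("href"))


def extract_links(html):
    """Return the href of each anchor tag, or None where it has no href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """
<a href="http://www.si.umich.edu/">UMSI</a>
<a name="top">no link here</a>
<A HREF='http://www.twitter.com/drchuck/'>Twitter</A>
"""
print(extract_links(sample))
```

As with tag.get('href', None) in the Beautiful Soup version, an anchor without an href shows up as None.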
Reference:
https://www.coursera.org/learn/python-network-data/home/welcome
Off topic:
This is the 20th post on my blog. I don't think I will be a purely serious blogger; well, I won't only talk about techniques.
I registered for another module, Regression Models, but have not had time to start studying it seriously. (Winter makes people lazy…)
I am busy preparing for a two-week business trip, sorting out things like visas (wtf, passport courier fees are killing me) and tickets. I will spend one week on holiday in December and then go back to work. Hopefully I will survive the whole winter, with more and better blog posts.
Thank you all.