Python 02: Interacting with the Internet

About Protocols
Transmission Control Protocol (TCP): works on the transport layer.
TCP port numbers…

# Sockets in Python

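import socket

# Create a TCP socket and connect to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
# Send an HTTP GET request; the blank line ends the request
mysock.send('GET http://py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    # Read the response in chunks of up to 512 bytes
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print data

mysock.close()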

The result:

HTTP/1.1 200 OK
Date: Mon, 09 Nov 2015 21:19:27 GMT
Server: Apache
Last-Modified: Fri, 07 Aug 2015 16:39:14 GMT
ETag: "20a1817f-a7-51cbb46b621a7"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=604800, public
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, x-requested-with, content-type
Access-Control-Allow-Methods: GET
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fai
r sun and kill the envious moon
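Who is already sick and pale with grief

A socket is a low-level layer, so the output keeps the HTTP header information. The line break inside "Arise fair sun" is not in the file: the headers plus the text up to "Arise fai" fill the first 512-byte chunk exactly, and print adds a newline after each chunk.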
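
Since the raw socket hands back headers and text together, it can help to collect the chunks first and split them afterwards. Here is a minimal sketch, assuming Python 3 (where sockets send and receive bytes). The names recv_all and split_response are made up for this illustration, the sample response is a shortened, hypothetical version of the output above, and a local socket pair stands in for the real server so that no network access is needed:

```python
import socket


def recv_all(sock, chunk_size=512):
    """Read from sock until the other side closes; return all bytes joined."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if len(data) < 1:  # an empty result means the connection was closed
            break
        chunks.append(data)
    return b"".join(chunks)


def split_response(raw):
    """Split a raw HTTP response into (status line, headers dict, body bytes)."""
    # A blank line (CRLF CRLF) separates the headers from the body
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


# Hypothetical sample response, shortened from the real output
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"But soft what light through yonder window breaks\n"
)

# Stand-in for the server: a connected pair of local sockets
left, right = socket.socketpair()
left.sendall(sample)
left.close()  # closing makes recv() on the other end return b"" at the end
raw = recv_all(right, chunk_size=16)
right.close()

status, headers, body = split_response(raw)
print(status)                   # HTTP/1.1 200 OK
print(headers["content-type"])  # text/plain
print(body.decode())
```

Joining the chunks before printing also avoids the stray line breaks that appear when each chunk is printed separately.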


Using urllib

import urllib
fhand = urllib.urlopen('http://py4inf.com/code/romeo.txt')

for line in fhand:
    print line.strip()

Just like opening a file.
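
The snippet above is Python 2. As a rough sketch, assuming Python 3 (where urlopen lives in urllib.request and each line arrives as bytes), the same idea might look like this. fetch_lines is a hypothetical helper name, not part of urllib, and the UTF-8 decoding is an assumption about the page:

```python
import urllib.request


def fetch_lines(url):
    """Return the lines of the document at url, decoded and stripped."""
    # The response object can be looped over line by line, like a file
    with urllib.request.urlopen(url) as fhand:
        return [line.decode("utf-8").strip() for line in fhand]


# Usage (needs network access):
# for line in fetch_lines('http://py4inf.com/code/romeo.txt'):
#     print(line)
```

Unlike the raw socket version, urlopen handles the HTTP headers itself, so only the text of the file comes back.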


Parsing HTML with BeautifulSoup lib
Regular expressions can be used to parse HTML, but the easier way is to use “Beautiful Soup”.
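
To make the regular-expression route concrete, here is a small sketch in Python 3. The HTML string is made up for illustration, and the pattern only catches double-quoted http:// links, so treat it as a rough approach rather than a real parser:

```python
import re

# Hypothetical page content, made up for this illustration
html = (
    '<h1>The First Page</h1>\n'
    '<p>Visit <a href="http://www.dr-chuck.com/page2.htm">the second page</a>\n'
    'or <a href="http://www.si.umich.edu/">the school site</a>.</p>\n'
)

# Non-greedy match: capture everything between href=" and the next quote
links = re.findall(r'href="(http://.*?)"', html)
for link in links:
    print(link)
```

Links written with single quotes, relative paths, or https:// would slip past this pattern, which is one reason a dedicated HTML parser is the easier way.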

Place BeautifulSoup.py in the same folder as your other Python code, or install it as described below.
Download it here: http://www.crummy.com/software/BeautifulSoup/
(I am using version 4.1.)

Unzip the file and install it from the command line:
>> python setup.py install
If you are using PyDev in Eclipse, you will find that it detects the change automatically.

The code is as follows:

import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
#for older versions, it should be: soup = BeautifulSoup(html)

tags = soup('a')

for tag in tags:
    print tag.get('href',None)

This code finds all hyperlink (anchor) tags and prints the URL of each.

The result:

Enter - http://www.dr-chuck.com/
http://www.dr-chuck.com/csev-blog/
http://www.si.umich.edu/
http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1159280
http://www.dr-chuck.com/csev-blog/
http://www.twitter.com/drchuck/
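
For comparison, here is a minimal sketch of the same link extraction using only the standard library's html.parser instead of Beautiful Soup, assuming Python 3. LinkCollector is a hypothetical class name and the HTML string is made up; this is a swapped-in technique for when bs4 is not installed, not what the code above does internally:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            # Like tag.get('href', None): None when the tag has no href
            self.links.append(dict(attrs).get("href"))


# Hypothetical page content, made up for this illustration
html = (
    '<a href="http://www.dr-chuck.com/csev-blog/">Blog</a> '
    '<a name="top">no link here</a> '
    '<a href="http://www.si.umich.edu/">SI</a>'
)

collector = LinkCollector()
collector.feed(html)
for link in collector.links:
    print(link)
```

Beautiful Soup is still the more forgiving choice for messy real-world pages; this version just shows what "find every a tag and read its href" amounts to.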


Reference:
https://www.coursera.org/learn/python-network-data/home/welcome

Off topic:
This is the 20th post on my blog. I am thinking that I will not be a serious blogger, or at least not one who only talks about techniques.
I registered for another module, Regression Models, but have not had time to start learning seriously. (Winter makes people lazy…)
I am busy preparing for a two-week business trip: visas (wtf, passport courier fees are killing me) and tickets. I will spend one week on holiday in December and then go back to work. Hopefully I will survive the whole winter, with more and better blog posts.
Thank you all.

Published by Irene

Keep calm and update blog.
