How to parse and script the web in Python

Parsing: extracting information from a source

Scripting: automating a behavior

Goal of this talk: an introduction to scripting and parsing the web in Python, with pointers toward more advanced uses.

Tools covered:

  • beautifulsoup (4)
  • urllib(2)
  • requests
  • robobrowser (an improvement on mechanize)
  • browser developer tools (Firebug, the browser's built-in ones, or Chrome's)
  • ipython-beautifulsoup

The web: HTTP + HTML

HTML

General structure:

<tag>
</tag>
<tag>
    <sub_tag>
    </sub_tag>
</tag>
<self_closing_tag />
<also_valid_nowadays>
<tag_with_attributes attribute="value" other_attribute="pouet">
</tag_with_attributes>

HTML

Example of a typical web page

<!DOCTYPE html>
<html>
 <head>
  <title>
   Exemple de HTML
  </title>
 </head>
 <body>
  Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.
  <p>
   Ceci est un paragraphe où il n’y a pas d’hyperlien.
  </p>
 </body>
</html>
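
To make the tag / attribute / text structure concrete, here is a minimal sketch that walks a shortened copy of the page above with `html.parser` from the standard library. Assumptions: it is written for Python 3 (the slides use Python 2), it uses the standard-library parser rather than BeautifulSoup, and the `StructureRecorder` class name is made up for this illustration.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
 <head><title>Exemple de HTML</title></head>
 <body>
  Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.
 </body>
</html>"""


class StructureRecorder(HTMLParser):
    """Hypothetical helper: records the events the parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "cible.html")]
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.events.append(("text", text))


recorder = StructureRecorder()
recorder.feed(PAGE)
for event in recorder.events:
    print(event)
```

Each opening tag comes with its attributes as a dict, and the text between tags shows up as separate events. BeautifulSoup builds its tree of objects on top of this kind of event stream.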

HTTP

[Image: http.png]

HTTP: example exchange

Client:

GET / HTTP/1.1
User-Agent: curl/7.35.0
Host: 0.0.0.0:5000
Accept: */*

Server:

HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.6
Date: Sun, 06 Mar 2016 18:35:04 GMT
Content-type: text/html; charset=UTF-8
Content-Length: 450

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html>
<title>Directory listing for /</title>
<body>
<h2>Directory listing for /</h2>
<hr>
<ul>
<li><a href=".ipynb_checkpoints/">.ipynb_checkpoints/</a>
<li><a href="http.png">http.png</a>
<li><a href="requirements.txt">requirements.txt</a>
<li><a href="slides.ipynb">slides.ipynb</a>
<li><a href="slides.slides.html">slides.slides.html</a>
<li><a href="ve/">ve/</a>
</ul>
<hr>
</body>
</html>
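
The exchange above is plain text over a TCP connection, which can be reproduced by hand. The sketch below is an illustration under assumptions, not the demo from the talk: it is Python 3, it starts a throwaway local server (the `HelloHandler` class is made up for this example) so that nothing depends on an external site, and it then writes the request with a raw socket.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<p>Salut toi!</p>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The request is plain text: a request line, headers, then an empty line.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: 127.0.0.1:{}\r\n"
    "Accept: */*\r\n"
    "Connection: close\r\n"
    "\r\n"
).format(port)

with socket.create_connection(("127.0.0.1", port)) as conn:
    conn.sendall(request.encode("ascii"))
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # the server closed the connection
            break
        chunks.append(chunk)

server.shutdown()
server.server_close()

# Headers and body are separated by the first empty line.
response = b"".join(chunks).decode("utf-8")
headers, _, body = response.partition("\r\n\r\n")
print(headers)
print(body)
```

The printed headers start with a status line such as the `HTTP/1.0 200 OK` seen above, followed by the body after the empty line.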

HTTP: demo with nc (netcat)

$ ~ nc -l 5000
GET / HTTP/1.1
Host: localhost:5000
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: csrftoken=PDY9ZJdnPufZJebI3cGrWEi3JeczTlJw

HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.6
Date: Sun, 06 Mar 2016 18:35:04 GMT
Content-type: text/html; charset=UTF-8
Content-Length: 450

<p>Salut toi!</p>

Parsing HTML: BeautifulSoup

Turns HTML (a data structure) into Python objects.

Installation:

pip install beautifulsoup4  # "beautifulsoup" is the package for the old version

Parsing HTML: BeautifulSoup

Usage:

In [19]:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>du super <b>html !</b></p>", "html.parser")
In [20]:
print soup
<p>du super <b>html !</b></p>
In [21]:
print soup.prettify()
<p>
 du super
 <b>
  html !
 </b>
</p>

Parsing HTML: querying

In [32]:
html = """
 <head>
  <title>
   Exemple de HTML
  </title>
 </head>
 <body>
  <p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>
  <p>Ceci est un paragraphe où il n’y a pas d’hyperlien.</p>
  <p class="style">Ceci est un autre paragraphe <b>qui contient du text en gras</b>
     et <a href="http://perdu.com">un autre lien</a>
  </p>
 </body>
"""
In [33]:
soup = BeautifulSoup(html, "html.parser")
In [34]:
soup.find("p")  # first <p>
Out[34]:
<p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>
In [35]:
soup.find_all("p")
Out[35]:
[<p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>,
 <p>Ceci est un paragraphe o\xf9 il n\u2019y a pas d\u2019hyperlien.</p>,
 <p class="style">Ceci est un autre paragraphe <b>qui contient du text en gras</b>\n     et <a href="http://perdu.com">un autre lien</a>\n</p>]
In [36]:
print soup
<head>
<title>
   Exemple de HTML
  </title>
</head>
<body>
<p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>
<p>Ceci est un paragraphe où il n’y a pas d’hyperlien.</p>
<p class="style">Ceci est un autre paragraphe <b>qui contient du text en gras</b>
     et <a href="http://perdu.com">un autre lien</a>
</p>
</body>

In [37]:
soup.find("a")
Out[37]:
<a href="cible.html">hyperlien</a>
In [38]:
soup.find("a", href="http://perdu.com")
Out[38]:
<a href="http://perdu.com">un autre lien</a>
In [39]:
soup.find("p", "style")  # second argument matches the CSS class ("class" is a reserved keyword in Python)
Out[39]:
<p class="style">Ceci est un autre paragraphe <b>qui contient du text en gras</b>\n     et <a href="http://perdu.com">un autre lien</a>\n</p>
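
Because `class` is reserved, BeautifulSoup 4 also accepts the `class_` keyword and an `attrs` dict. A minimal sketch, assuming Python 3 and a shortened copy of the HTML above:

```python
from bs4 import BeautifulSoup

html = """
<p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>
<p class="style">Ceci est un autre paragraphe et <a href="http://perdu.com">un autre lien</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to match on the CSS class.
by_position = soup.find("p", "style")
by_keyword = soup.find("p", class_="style")
by_attrs = soup.find("p", attrs={"class": "style"})
print(by_position is by_keyword is by_attrs)

# find() returns None when nothing matches, so test before indexing.
missing = soup.find("a", href="nowhere.html")
print(missing)
```

All three calls return the same tag object from the tree. Checking for `None` before doing `["href"]` or `.text` avoids the classic `TypeError` / `AttributeError` when a page changes.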

Parsing HTML: extracting information

In [41]:
soup.find("a")["href"]
Out[41]:
u'cible.html'
In [45]:
soup.find("p").text
Out[45]:
u'Ceci est une phrase avec un hyperlien.'
In [49]:
soup.find("p").contents
Out[49]:
[u'Ceci est une phrase avec un ', <a href="cible.html">hyperlien</a>, u'.']
In [51]:
result = []
for p in soup.find_all("p"):
    result.append(p)
    
print result
[<p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>, <p>Ceci est un paragraphe o\xf9 il n\u2019y a pas d\u2019hyperlien.</p>, <p class="style">Ceci est un autre paragraphe <b>qui contient du text en gras</b>\n     et <a href="http://perdu.com">un autre lien</a>\n</p>]
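
In practice the loop usually pulls a value out of each tag rather than storing the tag itself. A small sketch along those lines, assuming Python 3 and a shortened copy of the HTML above:

```python
from bs4 import BeautifulSoup

html = """
<p>Ceci est une phrase avec un <a href="cible.html">hyperlien</a>.</p>
<p>Ceci est un paragraphe où il n’y a pas d’hyperlien.</p>
<p class="style">Ceci est un autre paragraphe et <a href="http://perdu.com">un autre lien</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# (link text, target) pairs; .get() returns None if the attribute is missing,
# whereas a["href"] would raise KeyError.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)

# Text of the paragraphs that contain no link at all.
plain = [p.get_text() for p in soup.find_all("p") if p.find("a") is None]
print(plain)
```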

Parsing HTML: other tips and tricks

In [52]:
# shortcuts
soup.p == soup.find("p")
soup("p") == soup.find_all("p")

# in other cases
soup.next_sibling
soup.next_siblings

# css selector
soup.select
soup.select(".style")
Out[52]:
[<p class="style">Ceci est un autre paragraphe <b>qui contient du text en gras</b>\n     et <a href="http://perdu.com">un autre lien</a>\n</p>]
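
A slightly fuller sketch of the CSS selectors and sibling navigation listed above, assuming Python 3 and a made-up three-paragraph document. The HTML is written without whitespace between tags on purpose: on real pages, `next_sibling` is often a whitespace-only text node rather than the next tag.

```python
from bs4 import BeautifulSoup

html = (
    "<body>"
    "<p>premier</p>"
    '<p class="style">second <a href="http://perdu.com">lien</a></p>'
    "<p>dernier</p>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# CSS selectors: select() returns a list, select_one() the first match or None.
print(soup.select("p.style a"))
print(soup.select_one("p.style a")["href"])

# Navigation relative to a tag rather than from the top of the document.
first = soup.p
print(first.next_sibling)
print([p.get_text() for p in first.find_next_siblings("p")])
print(soup.a.parent.name)
```

`find_next_siblings("p")` skips any text nodes in between, which makes it the safer choice on real-world markup.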

HTTP: talking to a website

Reminder:

[Image: http.png]

HTTP: talking to a website

In the Python standard library: urllib2.

Usage:

In [ ]:
from urllib2 import urlopen

html = urlopen("http://neutrinet.be/index.php").read()  # returns a string
soup = BeautifulSoup(html)
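
In Python 3, `urllib2` no longer exists: `urlopen` lives in `urllib.request`, and `read()` returns bytes rather than a string. The sketch below shows the equivalent call under those assumptions. To stay runnable offline it fetches from a throwaway local server (the `PageHandler` class is made up for this example) instead of neutrinet.be.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real website."""

    def do_GET(self):
        body = "<p>Ceci est un paragraphe où il n’y a pas d’hyperlien.</p>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the per-request log line


server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:{}/".format(server.server_address[1])

with urlopen(url, timeout=5) as response:
    status = response.status
    raw = response.read()  # bytes, not text

server.shutdown()
server.server_close()

html = raw.decode("utf-8")  # decode explicitly to get a str
print(status, type(raw).__name__, html)
```

BeautifulSoup accepts either the bytes or the decoded string, so `BeautifulSoup(raw, "html.parser")` would work just as well here.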

Demo

Parse the events from http://neutrinet.be/index.php?title=Past_Events (with a tour of the browser debugging tools)

In [57]:
from urllib2 import urlopen
from bs4 import BeautifulSoup


"""
<table>
  <tr>
    <td>column 1</td>
    <td>column 2</td>
  </tr>
</table>
"""

soup = BeautifulSoup(urlopen("http://neutrinet.be/index.php?title=Past_Events").read())

line = soup.find("table").find_all("tr")[1]
line
Out[57]:
<tr>\n<td><a href="/index.php?title=Event:Meeting_2016/03" title="Event:Meeting 2016/03">Meeting 2016/03</a> </td>\n<td> 25 February 2016 19:30:00 </td>\n<td> </td>\n<td> bram </td>\n<td> 123 rue royale 1000 Bruxelles Belgique\n</td></tr>
In [59]:
line.find("a").text  # line.a.text
Out[59]:
u'Meeting 2016/03'
In [60]:
line.find("a")["href"]
Out[60]:
u'/index.php?title=Event:Meeting_2016/03'
In [62]:
line.find_all("td")[1].text.strip()
Out[62]:
u'25 February 2016 19:30:00'
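
Putting the pieces of the demo together, each row can be turned into a dict keyed by the column headers. This is a sketch under assumptions: Python 3, an inline table written to mirror the structure seen above (one row only, no network access), and `url` / `when` as key names chosen for this example.

```python
from datetime import datetime

from bs4 import BeautifulSoup

html = """
<table>
 <tr><th>Name</th><th>Date</th><th>Tags</th><th>Organizer</th><th>Location</th></tr>
 <tr>
  <td><a href="/index.php?title=Event:Meeting_2016/03">Meeting 2016/03</a> </td>
  <td> 25 February 2016 19:30:00 </td>
  <td> </td>
  <td> bram </td>
  <td> 123 rue royale 1000 Bruxelles Belgique
</td>
 </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

events = []
for row in rows[1:]:  # rows[0] is the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    event = dict(zip(headers, cells))  # {"Name": ..., "Date": ..., ...}
    event["url"] = row.find("a")["href"]
    # Month names are parsed according to the current locale (English by default).
    event["when"] = datetime.strptime(event["Date"], "%d %B %Y %H:%M:%S")
    events.append(event)

print(events[0]["Name"], events[0]["when"].isoformat())
```

Empty cells (here the Tags column) simply come out as empty strings.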

HTTP: GET vs POST

[Image: get_post.jpg]

HTTP: GET vs POST

GET


$ ~ curl "localhost:3456?query=plop&chocolat=chocapic"
# in another terminal
$ ~ nc -l 3456
GET /?query=plop&chocolat=chocapic HTTP/1.1
User-Agent: curl/7.35.0
Host: localhost:3456
Accept: */*

POST


$ ~ curl -X POST localhost:3456 --data "query=plop&chocolat=chocapic"
# in another terminal
$ ~ nc -l 3456
POST / HTTP/1.1
User-Agent: curl/7.35.0
Host: localhost:3456
Accept: */*
Content-Length: 28
Content-Type: application/x-www-form-urlencoded

query=plop&chocolat=chocapic
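
In both cases the parameters are encoded the same way; only their location changes (URL for GET, body for POST). A minimal standard-library sketch, assuming Python 3. Nothing is sent over the network: the `Request` objects are only built and inspected.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

params = {"query": "plop", "chocolat": "chocapic"}
encoded = urlencode(params)
print(encoded)

# GET: parameters travel in the URL, after the "?".
get_request = Request("http://localhost:3456/?" + encoded)
print(get_request.get_method(), get_request.full_url)

# POST: the same string travels in the request body, as bytes.
post_request = Request("http://localhost:3456/", data=encoded.encode("ascii"))
print(post_request.get_method(), len(post_request.data))

# What the receiving side does to get the values back.
print(parse_qs(urlsplit(get_request.full_url).query))
```

The body length printed for the POST matches the `Content-Length: 28` header captured by nc above.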

HTTP: requests

urllib2 is just too crappy and complicated; requests has a great API by comparison.

http://docs.python-requests.org

In [ ]:
import requests
In [76]:
# roughly the same result as urllib2's urlopen
result = requests.get("http://localhost:3456/?query=plop&chocolat=chocapic")
In [74]:
result = requests.post("http://localhost:3456", data={"query": "plop", "chocolat": "chocapic"})
In [75]:
result.content;  # raw body of the response
result.json;     # a method: call result.json() to decode a JSON body
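
requests can also build a request without sending it, which is handy for checking what `params=` and `data=` actually put on the wire. A minimal sketch, assuming Python 3 and a reasonably recent version of requests. No server needs to be listening on port 3456 for this to run.

```python
import requests

params = {"query": "plop", "chocolat": "chocapic"}

# Build the requests without sending them, to inspect what would go on the wire.
get_request = requests.Request("GET", "http://localhost:3456/", params=params).prepare()
post_request = requests.Request("POST", "http://localhost:3456/", data=params).prepare()

print(get_request.url)   # params= ends up in the URL
print(get_request.body)  # and a GET has no body
print(post_request.url)
print(post_request.body)  # data= ends up in the body
print(post_request.headers["Content-Type"])
```

Passing a dict to `params=` is usually nicer than hand-writing the query string, since requests takes care of the URL encoding.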

Demo of requests.post

On https://searx.laquadrature.net/ (show the <form method="POST">)

In [97]:
import requests

result = requests.post("https://searx.laquadrature.net", data={"q": "chocolat"})

result.content[:500]
Out[97]:
'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n<head>\n    <meta charset="UTF-8" />\n    <meta name="description" content="Searx - a privacy-respecting, hackable metasearch engine" />\n    <meta name="keywords" content="searx, search, search engine, metasearch, meta search" />\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="generator" content="searx/0.8.1">\n    <meta name="referrer" content="no-referrer">\n    <meta name="viewport" conte'
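
To know which keys to put in `data=`, the form itself can be parsed. The sketch below assumes Python 3 and uses an invented form: the markup, the `/search` action and the `lang` field are illustrative only and are not taken from the searx page.

```python
from bs4 import BeautifulSoup

# Hypothetical form, written for this example.
html = """
<form method="POST" action="/search">
  <input type="text" name="q" value="">
  <input type="hidden" name="lang" value="all">
  <input type="submit" value="Search">
</form>
"""
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

method = form.get("method", "GET").upper()
action = form.get("action", "")

# Only inputs that have a name are sent; start from their default values.
fields = {
    tag["name"]: tag.get("value", "")
    for tag in form.find_all("input")
    if tag.has_attr("name")
}
fields["q"] = "chocolat"  # fill in the search box
print(method, action, fields)
```

The resulting dict is what would be passed as `data=` to `requests.post`, with the form's action resolved against the page URL.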

HTTP: AJAX

[Image: ajax.png]

Example in JavaScript/jQuery:

$.get("/stuff.json").done(function(data) { /* do some stuff */ })
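
A script can make the same call as the page's JavaScript and decode the JSON directly, with no HTML parsing involved. The sketch below is a Python 3 counterpart of the jQuery line under assumptions: the `JsonHandler` class and the payload are made up, and a throwaway local server stands in for the site's `/stuff.json`.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class JsonHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint standing in for the site's /stuff.json."""

    def do_GET(self):
        body = json.dumps({"events": ["Meeting 2016/03"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the per-request log line


server = HTTPServer(("127.0.0.1", 0), JsonHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:{}/stuff.json".format(server.server_address[1])

with urlopen(url, timeout=5) as response:
    data = json.loads(response.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(data["events"])
```

On a real site, the network tab of the browser developer tools shows which URLs the page requests this way.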

HTTP: AJAX

[Image: ajax2.png]

RoboBrowser: simulating a browser's behavior

Like mechanize, but built on requests and BeautifulSoup. It keeps sessions alive and parses links for you.

https://github.com/jmcarp/robobrowser

from robobrowser import RoboBrowser

browser = RoboBrowser(history=True)
browser.open('http://website.com/')

# takes the same kind of arguments as .find()
form = browser.get_form(action='/search')

form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# same as BeautifulSoup
browser.find
browser.find_all

# follow links
browser.follow_link(browser.find("a"))

# like a browser
browser.back()

For something lighter, see the Session objects in requests: http://docs.python-requests.org/en/master/user/advanced/#session-objects

import requests

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

Ipython-beautifulsoup

For slackers, and above all because it's easier. (Spoiler: I wrote it.)

https://github.com/Psycojoker/ipython-beautifulsoup

In [80]:
%load_ext soup
Monkey patch BeautifulSoup with custom rendering
See `configure_ipython_beautifulsoup?` for configuration information
Push 'BeautifulSoup' from 'bs4' into current context
Push 'urlopen' from 'urllib2' into current context
Push 'p' shortcut into current context
Push 'requests' into current context
In [82]:
configure_ipython_beautifulsoup(show_html=True, show_css=True, show_js=True)
In [ ]:
soup = p("http://neutrinet.be/index.php?title=Past_Events")
In [84]:
soup.find("table")
Out[84]:
Name | Date | Tags | Organizer | Location
Meeting 2016/03 | 25 February 2016 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Members meeting 4 | 21 February 2016 14:00:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2016/02 | 11 February 2016 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Members meeting 3 | 17 January 2016 14:00:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2016/01 | 14 January 2016 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Members meeting 2 | 20 December 2015 14:00:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2015/15 | 3 December 2015 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2015/14 | 3 November 2015 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2015/13 | 22 October 2015 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Members meetings | 18 October 2015 13:02:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Neutrinet au café numérique S07 | 7 October 2015 19:00:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Neutrinet at "Le Temp des Communs" | 5 October 2015 20:00:00 |  | bram | World Trade Center I Boulevard du Roi Albert II 30 bte 5 B-1000 Bruxelles
Meeting 2015/12 (General Assembly) | 24 September 2015 19:30:00 |  | bram | World Trade Center, tower I, floor 25, 28-30 boulevard du roi Albert II, Bruxelles
Stand FFDN (donc dont Neutrinet) à la Braderie de Lille | 5 September 2015 13:00:00 |  | bram | Lille
Meeting 2015/11 | 20 August 2015 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2015/10 | 23 July 2015 19:30:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Meeting 2015/09 | 8 July 2015 19:00:00 |  | bram | 123 rue royale 1000 Bruxelles Belgique
Install Party | 2 July 2015 19:00:00 |  | Bram | 123 rue royale
Meeting 2015/08 | 18 June 2015 20:00:00 |  |  | Rue du Canal 45, BXL
Meeting 2015/07 | 7 June 2015 14:00:00 |  | Yvan | Poissonnerie (Bruxelles)
Meeting 2015/06 | 21 May 2015 19:30:00 |  | Bram | F/LAT 45 rue du canal Bruxelles
Meeting 2015/05 | 16 April 2015 19:30:00 |  | Bram | F/LAT
Meeting 2015/04 | 2 April 2015 19:30:00 |  |  | F/LAT
Meeting 2015/03 | 12 March 2015 00:00:00 |  |  | F/LAT
Meeting 2015/02 | 5 February 2015 19:30:00 |  |  | 45 rue du canal
FOSDEM DIY-ISP meetup | 1 February 2015 16:00:00 |  |  | Urlab, Avenue Buyl 131, 1050 Ixelles
Meeting 2015/01 -- Assemblée Générale | 15 January 2015 19:30:00 |  |  | 123 Rue Royale, BXL
Meeting 2014/18 | 17 December 2014 19:30:00 |  | Bram | 123 Rue Royale, BXL
Meeting 2014/17 | 10 December 2014 19:30:00 |  |  | 123 Rue Royale, BXL
Meeting 2014/16 | 20 November 2014 19:30:00 |  | Neutrinet | 123 rue Royale - 1000 Bruxelles - 4eme étage local FEBUL
Meeting 2014/15 | 16 October 2014 19:30:00 |  | Neutrinet | 45 Rue du Canal, BXL
Meeting 2014/14 | 25 September 2014 19:30:00 |  | Neutrinet | 45 rue du Canal, 1000 BXL
Meeting 2014/13 | 29 August 2014 19:30:00 |  | Neutrinet | 123 Rue Royale, BXL
Meeting 2014/12 | 31 July 2014 19:30:00 |  | Neutrinet | 123 Rue Royale BXL
Meeting 2014/11 | 7 July 2014 19:30:00 |  |  | 123 Rue Royale, BXL
Meeting 2014/5 | 13 March 2014 19:30:00 |  |  | 123 Rue Royale, BXL
Meeting 2013/11 | 21 November 2013 18:00:00 | AG, meeting |  | 123 rue royale (Brussels)
Meeting 2013/10 | 7 November 2013 19:30:00 | meeting |  | 123 rue royale (Brussels)
Meeting 2013/8 | 10 October 2013 19:30:00 | meeting |  | UrLab
Meeting 2013/7 | 26 September 2013 19:30:00 | meeting |  | UrLab
Meeting 2013/4 | 15 August 2013 19:30:00 | meeting |  | 123 rue royale (Brussels)
Meeting 2013/3 | 8 August 2013 19:00:00 | meeting |  | HSBXL
Meeting 2013/2 | 25 July 2013 19:30:00 | meeting |  | Chez Toone
Meeting 2013/1 | 18 July 2013 19:00:00 | meeting |  | UrLab

<table class="wikitable" style="width:100%;">
 <tr>
  <th style="background:black; color:white;">
   Name
  </th>
  <th style="background:black; color:white;">
   Date
  </th>
  <th style="background:black; color:white;">
   Tags
  </th>
  <th style="background:black; color:white;">
   Organizer
  </th>
  <th style="background:black; color:white;">
   Location
  </th>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2016/03" title="Event:Meeting 2016/03">
    Meeting 2016/03
   </a>
  </td>
  <td>
   25 February 2016 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Members_meeting_4" title="Event:Members meeting 4">
    Members meeting 4
   </a>
  </td>
  <td>
   21 February 2016 14:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2016/02" title="Event:Meeting 2016/02">
    Meeting 2016/02
   </a>
  </td>
  <td>
   11 February 2016 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Members_meeting_3" title="Event:Members meeting 3">
    Members meeting 3
   </a>
  </td>
  <td>
   17 January 2016 14:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2016/01" title="Event:Meeting 2016/01">
    Meeting 2016/01
   </a>
  </td>
  <td>
   14 January 2016 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Members_meeting_2" title="Event:Members meeting 2">
    Members meeting 2
   </a>
  </td>
  <td>
   20 December 2015 14:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/15" title="Event:Meeting 2015/15">
    Meeting 2015/15
   </a>
  </td>
  <td>
   3 December 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/14" title="Event:Meeting 2015/14">
    Meeting 2015/14
   </a>
  </td>
  <td>
   3 November 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/13" title="Event:Meeting 2015/13">
    Meeting 2015/13
   </a>
  </td>
  <td>
   22 October 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Members_meetings" title="Event:Members meetings">
    Members meetings
   </a>
  </td>
  <td>
   18 October 2015 13:02:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Neutrinet_au_caf%C3%A9_num%C3%A9rique_S07" title="Event:Neutrinet au café numérique S07">
    Neutrinet au café numérique S07
   </a>
  </td>
  <td>
   7 October 2015 19:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Neutrinet_at_%22Le_Temp_des_Communs%22" title='Event:Neutrinet at "Le Temp des Communs"'>
    Neutrinet at "Le Temp des Communs"
   </a>
  </td>
  <td>
   5 October 2015 20:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   World Trade Center I Boulevard du Roi Albert II 30 bte 5 B-1000 Bruxelles
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/12_(General_Assembly)" title="Event:Meeting 2015/12 (General Assembly)">
    Meeting 2015/12 (General Assembly)
   </a>
  </td>
  <td>
   24 September 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   World Trade Center, tower I, floor 25, 28-30 boulevard du roi Albert II, Bruxelles
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Stand_FFDN_(donc_dont_Neutrinet)_%C3%A0_la_Braderie_de_Lille" title="Event:Stand FFDN (donc dont Neutrinet) à la Braderie de Lille">
    Stand FFDN (donc dont Neutrinet) à la Braderie de Lille
   </a>
  </td>
  <td>
   5 September 2015 13:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   Lille
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/11" title="Event:Meeting 2015/11">
    Meeting 2015/11
   </a>
  </td>
  <td>
   20 August 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/10" title="Event:Meeting 2015/10">
    Meeting 2015/10
   </a>
  </td>
  <td>
   23 July 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/09" title="Event:Meeting 2015/09">
    Meeting 2015/09
   </a>
  </td>
  <td>
   8 July 2015 19:00:00
  </td>
  <td>
  </td>
  <td>
   bram
  </td>
  <td>
   123 rue royale 1000 Bruxelles Belgique
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Install_Party" title="Event:Install Party">
    Install Party
   </a>
  </td>
  <td>
   2 July 2015 19:00:00
  </td>
  <td>
  </td>
  <td>
   Bram
  </td>
  <td>
   123 rue royale
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/08" title="Event:Meeting 2015/08">
    Meeting 2015/08
   </a>
  </td>
  <td>
   18 June 2015 20:00:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   Rue du Canal 45, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/07" title="Event:Meeting 2015/07">
    Meeting 2015/07
   </a>
  </td>
  <td>
   7 June 2015 14:00:00
  </td>
  <td>
  </td>
  <td>
   Yvan
  </td>
  <td>
   Poissonnerie (Bruxelles)
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/06" title="Event:Meeting 2015/06">
    Meeting 2015/06
   </a>
  </td>
  <td>
   21 May 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   Bram
  </td>
  <td>
   F/LAT 45 rue du canal Bruxelles
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/05" title="Event:Meeting 2015/05">
    Meeting 2015/05
   </a>
  </td>
  <td>
   16 April 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
   Bram
  </td>
  <td>
   F/LAT
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/04" title="Event:Meeting 2015/04">
    Meeting 2015/04
   </a>
  </td>
  <td>
   2 April 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   F/LAT
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/03" title="Event:Meeting 2015/03">
    Meeting 2015/03
   </a>
  </td>
  <td>
   12 March 2015 00:00:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   F/LAT
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/02" title="Event:Meeting 2015/02">
    Meeting 2015/02
   </a>
  </td>
  <td>
   5 February 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   45 rue du canal
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:FOSDEM_DIY-ISP_meetup" title="Event:FOSDEM DIY-ISP meetup">
    FOSDEM DIY-ISP meetup
   </a>
  </td>
  <td>
   1 February 2015 16:00:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   Urlab, Avenue Buyl 131, 1050 Ixelles
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2015/01_--_Assembl%C3%A9e_G%C3%A9n%C3%A9rale" title="Event:Meeting 2015/01 -- Assemblée Générale">
    Meeting 2015/01 -- Assemblée Générale
   </a>
  </td>
  <td>
   15 January 2015 19:30:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   123 Rue Royale, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/18" title="Event:Meeting 2014/18">
    Meeting 2014/18
   </a>
  </td>
  <td>
   17 December 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
   Bram
  </td>
  <td>
   123 Rue Royale, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/17" title="Event:Meeting 2014/17">
    Meeting 2014/17
   </a>
  </td>
  <td>
   10 December 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   123 Rue Royale, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/16" title="Event:Meeting 2014/16">
    Meeting 2014/16
   </a>
  </td>
  <td>
   20 November 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
   Neutrinet
  </td>
  <td>
   123 rue Royale - 1000 Bruxelles - 4eme étage local FEBUL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/15" title="Event:Meeting 2014/15">
    Meeting 2014/15
   </a>
  </td>
  <td>
   16 October 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
   Neutrinet
  </td>
  <td>
   45 Rue du Canal, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/14" title="Event:Meeting 2014/14">
    Meeting 2014/14
   </a>
  </td>
  <td>
   25 September 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
   Neutrinet
  </td>
  <td>
   45 rue du Canal, 1000 BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/13" title="Event:Meeting 2014/13">
    Meeting 2014/13
   </a>
  </td>
  <td>
   29 August 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
   Neutrinet
  </td>
  <td>
   123 Rue Royale, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/12" title="Event:Meeting 2014/12">
    Meeting 2014/12
   </a>
  </td>
  <td>
   31 July 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
   Neutrinet
  </td>
  <td>
   123 Rue Royale BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/11" title="Event:Meeting 2014/11">
    Meeting 2014/11
   </a>
  </td>
  <td>
   7 July 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   123 Rue Royale, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2014/5" title="Event:Meeting 2014/5">
    Meeting 2014/5
   </a>
  </td>
  <td>
   13 March 2014 19:30:00
  </td>
  <td>
  </td>
  <td>
  </td>
  <td>
   123 Rue Royale, BXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/11" title="Event:Meeting 2013/11">
    Meeting 2013/11
   </a>
  </td>
  <td>
   21 November 2013 18:00:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/AG" title="Special:SearchByProperty/Tags/AG">
    AG
   </a>
   ,
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   123 rue royale (Brussels)
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/10" title="Event:Meeting 2013/10">
    Meeting 2013/10
   </a>
  </td>
  <td>
   7 November 2013 19:30:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   123 rue royale (Brussels)
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/8" title="Event:Meeting 2013/8">
    Meeting 2013/8
   </a>
  </td>
  <td>
   10 October 2013 19:30:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   UrLab
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/7" title="Event:Meeting 2013/7">
    Meeting 2013/7
   </a>
  </td>
  <td>
   26 September 2013 19:30:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   UrLab
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/4" title="Event:Meeting 2013/4">
    Meeting 2013/4
   </a>
  </td>
  <td>
   15 August 2013 19:30:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   123 rue royale (Brussels)
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/3" title="Event:Meeting 2013/3">
    Meeting 2013/3
   </a>
  </td>
  <td>
   8 August 2013 19:00:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   HSBXL
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/2" title="Event:Meeting 2013/2">
    Meeting 2013/2
   </a>
  </td>
  <td>
   25 July 2013 19:30:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   Chez Toone
  </td>
 </tr>
 <tr>
  <td>
   <a href="/index.php?title=Event:Meeting_2013/1" title="Event:Meeting 2013/1">
    Meeting 2013/1
   </a>
  </td>
  <td>
   18 July 2013 19:00:00
  </td>
  <td>
   <a href="/index.php?title=Special:SearchByProperty/Tags/meeting" title="Special:SearchByProperty/Tags/meeting">
    meeting
   </a>
  </td>
  <td>
  </td>
  <td>
   UrLab
  </td>
 </tr>
</table>

[Image: ipbs.png]

Before wrapping up, a few pointers for going further

  • Selenium: http://www.seleniumhq.org/
  • asyncio for websockets
  • asyncio for downloading like a maniac in parallel
  • multiprocessing.Pool.map for poor man's parallelism
  • bring out PyPy if you need performance
  • before anything else, check whether the site offers an API, for example http://hackeragenda.be/#faq
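
To give an idea of the asyncio pointer, here is a minimal sketch of running several downloads concurrently. Assumptions: Python 3.7 or later, and `fake_download` is a made-up stand-in that just waits instead of doing real HTTP (a real version would need an asynchronous HTTP client such as aiohttp). The example.org URLs are placeholders and are never contacted.

```python
import asyncio


async def fake_download(url, delay):
    """Hypothetical stand-in for a real download: waits, then returns."""
    await asyncio.sleep(delay)
    return "content of " + url


async def main():
    urls = [
        "http://example.org/a",
        "http://example.org/b",
        "http://example.org/c",
    ]
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(fake_download(url, 0.05) for url in urls))


pages = asyncio.run(main())
print(pages)
```

The three waits overlap instead of adding up, which is where the speedup comes from when the time is spent waiting on the network.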

THE END

Any questions?