Introduction

Few month ago a youtuber presented a small bot on twitter which tweet everyday, the sum of delayed train in France. This gaves me the idea of wishing a happy birthday to celebrities with a specific account automatically. In this notebook, we will extract birthday and twitter account from a website using web scrapping and store data in a small database to deplay the bot afterward.

In [ ]:
import sqlite3
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
pattern = re.compile("[0-9]+")

DB creation

First let's create the SQLite db. It's small, easy to deploy without any installation so perfect for this. We will need 1 table to store :

  • name
  • birthday
  • twitter account
  • url on the website we scrap (optionnal but I'll use it for updates)
  • country (for Now I'll send tweet only to French celebrities
  • if the person is dead or not
In [ ]:
# conn = sqlite3.connect('F:/data/birthday.db')
# c = conn.cursor()
# c.execute('''DROP TABLE IF EXISTS birthday''')
# c.execute('''CREATE TABLE birthday
#              (birthday date_text, name text, twitter text, url text, country text, dead integer)''')
# # c.execute("INSERT INTO birthday VALUES ('1990-05-15','Nicolas M.','@Coni631', '', 'France', '0')")
# conn.commit()
# conn.close()

We can now check if it works.

In [ ]:
# conn = sqlite3.connect('F:/data/birthday.db')
# c = conn.cursor()
# c.execute('SELECT * FROM birthday')
# print(c.fetchone())
# conn.close()

And it's fine

Scapping

For the scrapping, we will use https://anniversaire-celebrite.com/ as a website to scrap the info. First of all, we will go throught every day of the year to gather all persons having a birthday at this date. For each persons, we will open their own page to scrap the account if we have it.To do so let's create few fonctions :

In [ ]:
def check_if_exist(name):
    conn = sqlite3.connect('F:/data/birthday.db')
    c = conn.cursor()
    c.execute("SELECT * FROM birthday WHERE name=?", (name,))
    if c.fetchone() is None:
        res = False
    else:
        res = True
    conn.close()
    return res
    
def get_Twitter(url):
    actor_page = requests.get(url)
    DOM = BeautifulSoup(actor_page.text, "html.parser")
    counter = DOM.find("div", attrs={"class" : "twitter"})
    if counter is not None:
        url = counter.find("a")["href"]
        account = "@" + url.split("/")[-1]
        return account
    return ""

def add_data(birthday, name, twitter, url, country, dead):
    conn = sqlite3.connect('F:/data/birthday.db')
    c = conn.cursor()
    q = "INSERT INTO birthday VALUES ('{}','{}','{}', '{}', '{}', '{}')".format(birthday, name, twitter, url, country, dead)
    c.execute(q)
    conn.commit()
    conn.close()

def page_to_actor(DOM, month, day):
    birthday_block = DOM.find_all("div", attrs={"class":"columns"})[3]
    actors = birthday_block.find_all("div", attrs={"class":"column col-2 col-xl-2 col-lg-3 col-md-4 col-xs-6"})
    for actor in actors:
        data = actor_to_data(actor, month, day)
    
def actor_to_data(actor, month, day):
    result = {}
    alts = actor.find_all("img")
    year_txt = actor.find("div", {"class" : "celannee tiny"}).getText()
    year = int(pattern.search(year_txt).group(0))
    result["name"] = re.sub("'","''", alts[0]['alt'])
    result["country"] = re.sub("'","''", alts[1]['alt'])
    result["birthday"] = "{}-{:02d}-{:02d}".format(year, month, day)
    result["dead"] = len(actor.find("div", {"class" : "celage"}).find_all("strong")) - 1
    result["url"] = "https://anniversaire-celebrite.com/" + actor.find("a")["href"]
    
    alread_in = check_if_exist(result["name"])
    if not alread_in:
        if result["dead"] == 0:
            result["twitter"] = get_Twitter(result["url"])
        else:
            result["twitter"] = ""
        add_data(**result)
    else:
        print("Skip " + result["name"])

Now we just have to run it for all date of the year

In [ ]:
done = True
for month in range(1, 13):
    for day in range(1, 32):
        if month == 1 and day == 10:
            done = False
            
        if not done:
            print("extracting {:02d}/{:02d}".format(day, month))
            url = "https://anniversaire-celebrite.com/ok,annee,{:02d},{:02d}.html".format(month, day)
            page = requests.get(url)
            soup = BeautifulSoup(page.text, "html.parser")
            page_to_actor(soup, month, day)

Add new persons

Sometimes, the website add celebrities to their website, we can create a small function to check it and store new ones.

In [ ]:
def getDataFromProfil(url):
    actor_page = requests.get(url)
    DOM = BeautifulSoup(actor_page.text, "html.parser")
    birthday = DOM.find(itemprop="birthDate")["datetime"]
    dead = 1 if DOM.find(itemprop="deathDate") else 0
    counter = DOM.find("div", attrs={"class" : "twitter"})
    if counter is not None:
        url = counter.find("a")["href"]
        account = "@" + url.split("/")[-1]
        return birthday, account, dead
    return birthday, "", dead
In [ ]:
url = "https://anniversaire-celebrite.com/last.html"
page = requests.get(url)
DOM = BeautifulSoup(page.text, "html.parser")
birthday_block = DOM.find_all("div", attrs={"class":"columns"})[3]
actors = birthday_block.find_all("div", attrs={"class":"column col-2 col-xl-2 col-lg-3 col-md-4 col-xs-6"})
for actor in actors:
    result = {}
    alts = actor.find_all("img")
    year_txt = actor.find("div", {"class" : "celannee tiny"}).getText()
    year = int(pattern.search(year_txt).group(0))
    result["name"] = re.sub("'","''", alts[0]['alt'])
    result["country"] = re.sub("'","''", alts[1]['alt'])
    result["url"] = "https://anniversaire-celebrite.com/" + actor.find("a")["href"]
    
    alread_in = check_if_exist(result["name"])
    if not alread_in:
        birthday, twitter, dead = getDataFromProfil(result["url"])
        result["twitter"] = twitter
        result["birthday"] = birthday
        result["dead"] = dead
        print("Add " + result["name"])
        add_data(**result)
    else:
        print("Skip " + result["name"])

Check Profil for update

Another thing to consider is the case where there is changes. For example, we have several celebrities without twitter account. We could also create a function to check all celebrities without official twitter regularly (that's why I saved the url) and add the account if it's added. Some people may also die and we should not wish them a Happy Birthday :(

In [ ]:
conn = sqlite3.connect('F:/data/birthday.db')
c = conn.cursor()
c.execute("SELECT name, url FROM birthday WHERE twitter = '' AND dead == 0")
l = c.fetchall()
conn.close()
In [ ]:
l
In [ ]:
for name, url in l:
    print("Checking", name)
    birthday, twitter, dead = getDataFromProfil(url)
    if twitter != "" or dead == 1:
        conn = sqlite3.connect('F:/data/birthday.db')
        c.execute("UPDATE birthday SET twitter = ?, dead = ? WHERE name = ?", (twitter, dead, name))
        conn.commit()
        conn.close()

Manual update

When there is missing celebrities or a mistake, we can fix manully some informations. For exemple there were several celebrities without the twitter account but I knew that they have one so we can add it

In [ ]:
def manual_update(name, twitter):
    conn = sqlite3.connect('F:/data/birthday.db')
    c = conn.cursor()
    c.execute("SELECT * FROM birthday WHERE name = ?", (name,))
    print(c.fetchone())
    c.execute("UPDATE birthday SET twitter = ? WHERE name = ?", (twitter, name))
    conn.commit()
    c.execute("SELECT * FROM birthday WHERE name = ?", (name,))
    print(c.fetchone())
    conn.close()
In [ ]:
# manual_update(" Cartouche", "@cartouche")
In [ ]:
# manual_update("Sandrine Quétier", "@SandQuetierOff")
In [ ]:
# manual_update(" Comte de Bouderbala", "@comtebouderbala")
In [ ]:
# manual_update(" Jul", "@jul")
In [ ]:
# manual_update(" Natoo", "@Nato_o")
In [ ]:
# manual_update("Laurent Baffie", "@lolobababa")
In [ ]:
# manual_update("Alain Bernard", "@alainbernard")
In [ ]:
# manual_update("Florence Parly", "@florence_parly")
In [ ]:
# manual_update("Christian Estrosi", "@cestrosi")
In [ ]:
# manual_update("Bernard Laporte", "@BernardLaporte_")
In [ ]:
# manual_update("Patrick Timsit", "@patricktimsit")
In [ ]:
# manual_update("Vincent Moscato", "@VINCENT_MOSCATO")
In [ ]:
# manual_update("Ève Angeli", "@eveangelioff")
In [ ]:
# add_data("1986-08-30", "Raphaël Carlier", "@Raphael_Carlier", "", "France", "0")
In [ ]:
# add_data("1986-07-07", "David Coscas", "@levraimcfly", "", "France", "0")
In [ ]:
# add_data("1992-08-07", "Damien Laguionie", "@terracid", "", "France", "0")
In [ ]:
# add_data("1992-01-08", "Thomas Iturralde", "@laink", "", "France", "0")

Change birthday

Finally, there is maybe wrong date on the website. It was the case with Willy Rovelli. This section is done to update this

In [ ]:
name = "Willy Rovelli"
date = "1980-03-18"
In [ ]:
conn = sqlite3.connect('F:/data/birthday.db')
c = conn.cursor()
c.execute("SELECT * FROM birthday WHERE name = ?", (name,))
print(c.fetchone())
c.execute("UPDATE birthday SET birthday = ? WHERE name = ?", (date, name))
conn.commit()
c.execute("SELECT * FROM birthday WHERE name = ?", (name,))
print(c.fetchone())
conn.close()

Check proposition

One improvement I've done is also host a website (simple one) to let people propose new celebrities. I'll try to review taht sometimes and this part is done for that purpose.

In [ ]:
# conn = sqlite3.connect('site/databases/proposition.db')
# df = pd.read_sql_query("SELECT * FROM birthday", conn)
# conn.close()
# df.head()
In [ ]:
# conn = sqlite3.connect('site/databases/proposition.db')
# c = conn.cursor()
# c.execute('''DROP TABLE IF EXISTS birthday''')
# c.execute('''CREATE TABLE birthday
#              (birthday date_text, name text, twitter text, url text, country text, dead integer)''')
# conn.commit()
# conn.close()

Conclusion

This Notebook is quite simple (and not very clean - a refactoring would be great...) but we saw here how to scrap data from a website and store it in a SQLite database. In the next Notebook, we will quickly explore the data we have.

Regarding the website, it's available @ http://nicolasmine.pythonanywhere.com/.