Alteryx Designer Desktop Discussions

Webscraping: How do I add "Country" & "Competition" columns to the output (Python Question)

HW1
9 - Comet

This is not a Designer question; apologies for posting it here. I asked in the Off-Topic forum but did not get any response. I guess it's not visited as much as this one.

 

I started with the Alteryx Python tool, but I find that pure Python is easier and much more manageable in this case, hence the detour.

 

I have code that scrapes OddsPortal using Selenium:

 

from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome()

class GameData:

    def __init__(self):
        self.dates = []
        self.games = []
        self.scores = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []


def parse_data(url):
    browser.get(url)
    df = pd.read_html(browser.page_source, header=0)[0]
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.dates.append(game_date)
        game_data.games.append(row[2])
        game_data.scores.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])

    return game_data


# A list (rather than a set) keeps the scraping order predictable.
urls = [
    "https://www.oddsportal.com/soccer/australia/a-league/results/",
    "https://www.oddsportal.com/soccer/europe/champions-league/results/",
    "https://www.oddsportal.com/soccer/europe/europa-league/results/",
]

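if __name__ == '__main__':

    # Collect one DataFrame per URL and combine them in a single step.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    frames = []

    for url in urls:
        game_data = parse_data(url)
        frames.append(pd.DataFrame(game_data.__dict__))

    results = pd.concat(frames, ignore_index=True)

    # Close the browser once all pages have been scraped.
    browser.quit()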

 

The output is in the format:

|    |   Unnamed: 0 | dates       | games                    | scores   |   home_odds |   draw_odds |   away_odds |
|----|--------------|-------------|--------------------------|----------|-------------|-------------|-------------|
|  0 |            0 | 24 Feb 2018 | Slovacko - Sparta Prague | 1:1      |        4.27 |        3.14 |        1.93 |
|  1 |            1 | 24 Feb 2018 | Brno - Sigma Olomouc     | 1:0      |        2.93 |        3.14 |        2.45 |
|  2 |            2 | 24 Feb 2018 | Liberec - Mlada Boleslav | 1:0      |        1.91 |        3.46 |        3.89 |
|  3 |            3 | 23 Feb 2018 | Dukla Prague - Jablonec  | 0:1      |        2.65 |        3.25 |        2.6  |
|  4 |            4 | 18 Feb 2018 | Sparta Prague - Liberec  | 2:0      |        1.51 |        3.86 |        6.67 |

How can I add "Country" and "Competition" columns to the output?

The page's Inspect Element view shows the league information, but I am unsure how to extract it.

 

[Screenshot: Inspect Element]
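
One idea that avoids the page HTML entirely, as a minimal sketch: each URL in `urls` already seems to carry the country and competition in its path. The helper below, `league_from_url`, is a hypothetical name, and it assumes every URL follows the pattern `/soccer/<country>/<competition>/results/`; a URL with a different layout would need its own handling.

```python
from urllib.parse import urlparse


def league_from_url(url):
    """Return (country, competition) taken from a results URL.

    Assumption: the path looks like /soccer/<country>/<competition>/results/.
    """
    # Split the path and drop the empty pieces left by leading/trailing slashes.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3:
        raise ValueError(f"Unexpected URL layout: {url}")
    # parts[0] is the sport, parts[1] the country, parts[2] the competition.
    return parts[1], parts[2]


if __name__ == "__main__":
    country, competition = league_from_url(
        "https://www.oddsportal.com/soccer/australia/a-league/results/"
    )
    print(country, competition)
```

Inside the `for url in urls` loop, the two values could then be attached to each per-URL DataFrame before the results are combined, for example with `result.assign(Country=country, Competition=competition)`. The values come back as URL slugs (such as `a-league`), so they may need tidying for display.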
