From Scraping to Data Analysis in Python: A Crawler for a NetEase Cloud Classroom Course Bundle

Goal

Scrape the data first, then analyze and summarize it.

Target URL: https://study.163.com/series/1202851606.htm

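Scrape:

The title, link, and price of every course;
The lesson list of every course.

Analyze:

The number of lessons in each course;
The average and total lesson duration of each course.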

1. Scrape the bundle page

First import the required modules and initialize selenium.

import re
import time

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

driver = Chrome()

url = "https://study.163.com/series/1202851606.htm"

driver.get(url)
driver.maximize_window()
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')

Here driver.maximize_window() maximizes the browser window, and driver.execute_script runs a JavaScript snippet that scrolls the page to the bottom.

2. Parse the link of each sub-course

def get_course_id(course_url):
    """Extract the course ID from a course URL."""
    pattern = r".*/(\d+)\.htm"
    match = re.search(pattern, course_url)
    return match.group(1)


# Each course title on the bundle page is an <a> element with data-name='课程名称'
course_links = driver.find_elements(By.XPATH, "//a[@data-name='课程名称']")
courses = []
for course_link in course_links:
    course_href = course_link.get_attribute("href")
    course_title = course_link.text
    print(course_title, course_href)
    course_id = get_course_id(course_href)
    detail_url = f"https://study.163.com/course/introduction.htm?courseId={course_id}#/courseDetail?tab=1"
    courses.append([course_href, course_title, detail_url])

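The links collected here lead to each course's home page. What we want to scrape is the lesson list, so we extract the course ID with a regular expression and use it to build the URL of the course's table-of-contents page.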
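
The ID extraction and URL construction described above can be tried in isolation, without a browser. The following is a minimal sketch: the sample URL is hypothetical, and the helper names extract_course_id and build_detail_url are illustrative, not part of the original script. Unlike the version in the script, it returns None when the URL does not match instead of raising an error.

```python
import re
from typing import Optional

# Matches the numeric ID right before ".htm", e.g. ".../1234567.htm"
COURSE_ID_PATTERN = re.compile(r"/(\d+)\.htm")


def extract_course_id(course_url: str) -> Optional[str]:
    """Return the numeric course ID, or None if the URL has no such ID."""
    match = COURSE_ID_PATTERN.search(course_url)
    return match.group(1) if match else None


def build_detail_url(course_id: str) -> str:
    """Build the table-of-contents URL using the same template as the script."""
    return (
        "https://study.163.com/course/introduction.htm"
        f"?courseId={course_id}#/courseDetail?tab=1"
    )


# Hypothetical course link, used only to illustrate the pattern
sample_url = "https://study.163.com/course/introduction/1234567.htm"
course_id = extract_course_id(sample_url)
if course_id is not None:
    print(build_detail_url(course_id))
```

Returning None lets the calling loop skip links that are not course pages rather than stopping the whole run.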

3. Scrape the data from each detail page

Loop over the detail pages and scrape each one.

# Note: get_course_info (shown further down) must be defined before this loop runs
course_infos = []
all_sections = []
for course_href, course_title, detail_url in courses:
    print("Scraping:", detail_url, course_title)
    course_info, sections = get_course_info(detail_url)
    course_infos.append(course_info)
    all_sections.extend(sections)

The get_course_info function used in the loop fetches the data from one detail page.

def get_course_info(detail_url):
    """Fetch the course-level info and the lesson rows from a detail page."""
    print()
    print("detail_url:", detail_url)
    driver.get(detail_url)
    driver.maximize_window()
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')

    wait = WebDriverWait(driver, 10)
    wait.until(lambda x: "人学过" in x.page_source and "关于我们" in x.page_source)
    time.sleep(2)
    course_title = driver.find_element(By.XPATH, "//h2//span[@class='u-coursetitle_title']").text
    user_count = driver.find_element(By.XPATH, "//span[@class='hot f-fs0']").text
    price = driver.find_element(By.XPATH, "//span[@class='price']").text

    chapters = driver.find_elements(By.XPATH, "//div[@class='chapter']")
    sections = []
    for chapter in chapters:
        print("#" * 10)
        chaptertitle = chapter.find_element(By.XPATH, ".//span[contains(@class, 'chaptertitle')]").text
        chaptername = chapter.find_element(By.XPATH, ".//span[contains(@class, 'chaptername')]").text

        # Lesson fields
        ks = chapter.find_element(By.XPATH, ".//span[contains(@class, 'ks')]").text
        type_title = chapter.find_element(By.XPATH, ".//span[contains(@class, 'type-title')]").text
        ksname = chapter.find_element(By.XPATH, ".//span[contains(@class, 'ksname')]").text
        kstime = chapter.find_element(By.XPATH, ".//span[contains(@class, 'kstime')]").text
        print(chaptertitle, chaptername, ks, type_title, ksname, kstime)
        sections.append([detail_url, course_title, chaptertitle, chaptername, ks, type_title, ksname, kstime])
    course_info = [detail_url, course_title, user_count, price]
    return course_info, sections
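
A single detail page can time out or lack an expected element, and an unhandled exception would then abort the whole loop. One way to guard against this is a small retry wrapper. The sketch below is an assumption rather than part of the original script: scrape_all, its parameters, and the fake fetch function in the usage lines are all hypothetical names. In the real script, get_course_info would be passed as fetch.

```python
import time
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    retries: int = 2,
    delay_seconds: float = 1.0,
) -> Tuple[List[T], List[str]]:
    """Call fetch(url) for every URL, retrying on failure.

    Returns the successful results and the URLs that still failed
    after all attempts.
    """
    results: List[T] = []
    failed: List[str] = []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                results.append(fetch(url))
                break  # success, move on to the next URL
            except Exception as exc:
                print(f"attempt {attempt + 1} failed for {url}: {exc}")
                if attempt < retries:
                    time.sleep(delay_seconds)
        else:
            # The inner loop never hit "break": every attempt failed
            failed.append(url)
    return results, failed


# Usage with a fake fetch function standing in for get_course_info
def fake_fetch(url: str) -> str:
    if url == "bad":
        raise RuntimeError("element not found")
    return url.upper()


results, failed = scrape_all(["a", "bad", "b"], fake_fetch, retries=1, delay_seconds=0)
print(results, failed)
```

Collecting the failed URLs separately makes it possible to rerun only those pages later.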

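Note that the scraped data has two levels of granularity:

Information about the course itself;
Information about each lesson of the course.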
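
The two levels are linked by detail_url, which is the first field of both row types. The sketch below uses hypothetical sample rows in the same column order as the lists built by the scraper, and counts lesson rows per course to show how the two levels relate.

```python
from collections import Counter

# Hypothetical sample data; real rows come from get_course_info
course_infos = [
    ["https://example.com/course-1", "Course 1", "user count text", "price text"],
]
all_sections = [
    ["https://example.com/course-1", "Course 1", "Chapter 1", "Intro",
     "Lesson 1", "视频", "Welcome", "03:20"],
    ["https://example.com/course-1", "Course 1", "Chapter 1", "Intro",
     "Lesson 2", "视频", "Setup", "10:05"],
]

# One course row corresponds to many lesson rows with the same detail_url
lessons_per_course = Counter(row[0] for row in all_sections)
for detail_url, course_title, *_ in course_infos:
    print(course_title, lessons_per_course[detail_url])
```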

4. Save to Excel files

course_columns = ["detail_url", "course_title", "user_count", "price"]
pd.DataFrame(course_infos, columns=course_columns).to_excel("课程信息.xlsx", index=False)

section_columns = ["detail_url", "course_title", "chaptertitle", "chaptername", "ks", "type_title", "ksname", "kstime"]
pd.DataFrame(all_sections, columns=section_columns).to_excel("课程-章节信息.xlsx", index=False)

This saves two Excel workbooks: 课程信息.xlsx (course information) and 课程-章节信息.xlsx (course chapter and lesson information).

5. Data analysis

import pandas as pd

df_course = pd.read_excel("课程信息.xlsx", engine="openpyxl")
df_sections = pd.read_excel("课程-章节信息.xlsx", engine="openpyxl")

# Keep only the video lessons; copy so the new column is added to an independent frame
df_sections = df_sections[df_sections["type_title"] == "视频"].copy()


# Convert a "mm:ss" duration string to seconds
def get_time_seconds(time_string):
    minutes, seconds = time_string.split(":")
    return int(minutes) * 60 + int(seconds)


df_sections["seconds"] = df_sections["kstime"].map(get_time_seconds)
df_agg = df_sections.groupby("course_title").apply(lambda x: pd.Series({
    "课时数目": len(x),
    "总时长分钟": round(sum(x["seconds"]) / 60, 2),
    "平均时长分钟": round(sum(x["seconds"]) / len(x) / 60, 2)
})).reset_index()

df_merge = pd.merge(df_course, df_agg, left_on="course_title", right_on="course_title")
df_merge.to_excel("统计数据.xlsx", index=False)
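
The duration conversion in the code above assumes every kstime value has exactly two colon-separated parts. If some lessons run longer than an hour and the page shows them as "hh:mm:ss" (an assumption, not something confirmed by the source), a more tolerant parser could look like the following sketch. The name duration_to_seconds is hypothetical.

```python
def duration_to_seconds(text: str) -> int:
    """Convert "ss", "mm:ss", or "hh:mm:ss" to a number of seconds."""
    parts = [int(part) for part in text.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected duration format: {text!r}")
    seconds = 0
    for part in parts:
        # Each step to the right is worth 1/60 of the previous unit
        seconds = seconds * 60 + part
    return seconds


print(duration_to_seconds("05:30"))
print(duration_to_seconds("01:02:03"))
```

It can be used as a drop-in replacement in the map call, for example df_sections["kstime"].map(duration_to_seconds).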

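Results:

The last few columns of 统计数据.xlsx are the computed statistics: the lesson count (课时数目), the total duration in minutes (总时长分钟), and the average duration in minutes (平均时长分钟).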
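
The same three statistics can also be computed with pandas named aggregation instead of groupby().apply(). The sketch below is an alternative, not the original author's method. The sample rows and the English column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; real rows come from 课程-章节信息.xlsx
df_sections = pd.DataFrame({
    "course_title": ["A", "A", "B"],
    "seconds": [90, 150, 600],
})

# Named aggregation: output_column="aggregation function"
df_agg = (
    df_sections.groupby("course_title")["seconds"]
    .agg(lesson_count="size", total_seconds="sum", mean_seconds="mean")
    .reset_index()
)

# Convert seconds to minutes, rounded to two decimals as in the original code
df_agg["total_minutes"] = (df_agg["total_seconds"] / 60).round(2)
df_agg["mean_minutes"] = (df_agg["mean_seconds"] / 60).round(2)
print(df_agg)
```

Here course A has two lessons totaling 240 seconds, so its total is 4 minutes and its average is 2 minutes per lesson.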

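Summary

This example combines scraping with data analysis. The scraping part uses selenium; note that the window should be maximized and the page scrolled to the bottom, so that hidden elements do not cause the scraper to miss data. The scraped data is collected into lists and then saved to Excel files. The final statistics step reads the Excel files with Pandas and computes the required metrics.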

If you want to learn web scraping, consider 蚂蚁老师's own crawler course, which includes Q&A support and a WeChat discussion group.