Goal: first scrape the data, then run statistical analysis on it.
Target URL: https://study.163.com/series/1202851606.htm
Scraping: (screenshot)
Statistics: (screenshot)
1. Scrape the series page
First import the required modules and initialize Selenium.

```python
import re
import time

import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

driver = Chrome()
url = "https://study.163.com/series/1202851606.htm"
driver.get(url)
driver.maximize_window()
# Scroll to the bottom so lazy-loaded elements get rendered
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
```
Here, driver.maximize_window() maximizes the browser window, and driver.execute_script runs a snippet of JavaScript that scrolls the page to the bottom.
2. Parse the link of each sub-course

```python
def get_course_id(course_url):
    """Extract the course ID from a URL"""
    pattern = r".*/(\d+)\.htm"
    match = re.search(pattern, course_url)
    return match.groups()[0]

course_links = driver.find_elements(By.XPATH, "//a[@data-name='课程名称']")
courses = []
for link in course_links:
    course_href = link.get_attribute("href")
    course_title = link.text
    print(course_title, course_href)
    course_id = get_course_id(course_href)
    detail_url = f"https://study.163.com/course/introduction.htm?courseId={course_id}#/courseDetail?tab=1"
    courses.append([course_href, course_title, detail_url])
```
The links found here point to each course's home page, but what we want to scrape is the list of lessons. So we extract the course ID with a regular expression and assemble the URL of the syllabus page from it.
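As a quick sanity check, the regex logic can be exercised on its own. The course URL below is made up for illustration; the pattern and the URL template are the ones used above.

```python
import re

def get_course_id(course_url):
    """Extract the trailing numeric course ID from a course URL"""
    match = re.search(r".*/(\d+)\.htm", course_url)
    return match.groups()[0]

# Hypothetical course URL, for illustration only
sample = "https://study.163.com/course/1234567.htm"
course_id = get_course_id(sample)
detail_url = f"https://study.163.com/course/introduction.htm?courseId={course_id}#/courseDetail?tab=1"
print(course_id)     # the digits before ".htm"
print(detail_url)
```

The greedy `.*` ensures the pattern captures the last path segment, so extra digits earlier in the URL do not interfere.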
3. Scrape the data of each detail page
Loop over the courses and scrape each page.

```python
course_infos = []
all_sections = []
for course_href, course_title, detail_url in courses:
    print("Scraping:", detail_url, course_title)
    course_info, sections = get_course_info(detail_url)
    course_infos.append(course_info)
    all_sections.extend(sections)
```
The get_course_info function used here fetches the data from a detail page.
```python
def get_course_info(detail_url):
    """Fetch course info and its section list from a detail page"""
    print()
    print("detail_url:", detail_url)
    driver.get(detail_url)
    driver.maximize_window()
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    # Wait until both the enrollment count and the page footer have loaded
    wait = WebDriverWait(driver, 10)
    wait.until(lambda x: "人学过" in x.page_source and "关于我们" in x.page_source)
    time.sleep(2)

    # Course-level fields
    course_title = driver.find_element(By.XPATH, "//h2//span[@class='u-coursetitle_title']").text
    user_count = driver.find_element(By.XPATH, "//span[@class='hot f-fs0']").text
    price = driver.find_element(By.XPATH, "//span[@class='price']").text

    # Section-level fields, one record per lesson
    chapters = driver.find_elements(By.XPATH, "//div[@class='chapter']")
    sections = []
    for chapter in chapters:
        print("#" * 10)
        chaptertitle = chapter.find_element(By.XPATH, ".//span[contains(@class, 'chaptertitle')]").text
        chaptername = chapter.find_element(By.XPATH, ".//span[contains(@class, 'chaptername')]").text
        # 课时 (lesson) fields
        ks = chapter.find_element(By.XPATH, ".//span[contains(@class, 'ks')]").text
        type_title = chapter.find_element(By.XPATH, ".//span[contains(@class, 'type-title')]").text
        ksname = chapter.find_element(By.XPATH, ".//span[contains(@class, 'ksname')]").text
        kstime = chapter.find_element(By.XPATH, ".//span[contains(@class, 'kstime')]").text
        print(chaptertitle, chaptername, ks, type_title, ksname, kstime)
        sections.append([detail_url, course_title, chaptertitle, chaptername,
                         ks, type_title, ksname, kstime])

    course_info = [detail_url, course_title, user_count, price]
    return course_info, sections
```
Note that the scraped data has two granularities: course-level info (one record per course) and section-level info (one record per lesson).
4. Save to Excel files

```python
course_columns = ["detail_url", "course_title", "user_count", "price"]
pd.DataFrame(course_infos, columns=course_columns).to_excel("课程信息.xlsx", index=False)

section_columns = ["detail_url", "course_title", "chaptertitle", "chaptername",
                   "ks", "type_title", "ksname", "kstime"]
pd.DataFrame(all_sections, columns=section_columns).to_excel("课程-章节信息.xlsx", index=False)
```
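As a sanity check on the column layout, the same DataFrame construction can be tried on a couple of made-up rows (the URLs and values below are illustrative only, not real scraped data):

```python
import pandas as pd

course_columns = ["detail_url", "course_title", "user_count", "price"]
# Hypothetical sample rows in the same shape as course_infos
sample_courses = [
    ["https://example.com/detail1", "Course A", "1000人学过", "¥99"],
    ["https://example.com/detail2", "Course B", "500人学过", "¥59"],
]
df = pd.DataFrame(sample_courses, columns=course_columns)
print(df.shape)  # one row per course, four columns
```

Each inner list must match the column list in length and order, which is why the scraping step appends fields in a fixed order.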
The two saved Excel spreadsheets:
5. Data analysis

```python
import pandas as pd

df_course = pd.read_excel("课程信息.xlsx", engine="openpyxl")
df_sections = pd.read_excel("课程-章节信息.xlsx", engine="openpyxl")

# Keep only the video lessons
df_sections = df_sections[df_sections["type_title"] == "视频"]

# Convert an "mm:ss" duration string to seconds
def get_time_seconds(time_string):
    minutes, seconds = time_string.split(":")
    return int(minutes) * 60 + int(seconds)

df_sections["seconds"] = df_sections["kstime"].map(get_time_seconds)

# Per-course aggregation: lesson count, total minutes, average minutes
df_agg = df_sections.groupby("course_title").apply(lambda x: pd.Series({
    "课时数目": len(x),
    "总时长分钟": round(sum(x["seconds"]) / 60, 2),
    "平均时长分钟": round(sum(x["seconds"]) / len(x) / 60, 2),
})).reset_index()

df_merge = pd.merge(df_course, df_agg, on="course_title")
df_merge.to_excel("统计数据.xlsx", index=False)
```
The statistics result:
The last few columns are the computed lesson count, total duration, and average duration.
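The aggregation logic above can be checked on a tiny in-memory sample. The course titles and durations below are made up; the conversion function and the groupby-apply are the same as in the analysis step.

```python
import pandas as pd

def get_time_seconds(time_string):
    """Convert an 'mm:ss' duration string to seconds"""
    minutes, seconds = time_string.split(":")
    return int(minutes) * 60 + int(seconds)

# Made-up sample: course "A" has two lessons, "B" has one
df = pd.DataFrame({
    "course_title": ["A", "A", "B"],
    "kstime": ["10:00", "20:00", "05:30"],
})
df["seconds"] = df["kstime"].map(get_time_seconds)

# Same aggregation as in the analysis step
df_agg = df.groupby("course_title").apply(lambda x: pd.Series({
    "课时数目": len(x),
    "总时长分钟": round(sum(x["seconds"]) / 60, 2),
    "平均时长分钟": round(sum(x["seconds"]) / len(x) / 60, 2),
})).reset_index()

print(df_agg)
```

For course "A" this yields 2 lessons, 30.0 total minutes, 15.0 average minutes, which matches the hand calculation (600 s + 1200 s = 1800 s = 30 min).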
Summary
This case study combines scraping with data analysis. The scraping part uses Selenium; note that you need to maximize the window and scroll to the bottom of the page, otherwise lazy-loaded elements stay hidden and cannot be scraped. The scraped data is collected into lists and then saved to Excel files. Finally, the analysis part reads the Excel files back with pandas and computes the desired statistics.
If you want to learn web scraping, consider 蚂蚁老师's own scraping course, which comes with Q&A support and a WeChat discussion group.