[Python Data Analysis Lab Report] Analysis of the Global Box Office TOP1000 Films

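Dominant genres: "Adventure" is the most common genre, with 507 films in the TOP1000, followed by "Action" (458 films); together the two account for 40.5% of all genre tags counted across the TOP10 genres;
Genre preferences: hybrid-genre films such as "Adventure+Action" and "Sci-Fi+Action" make up more than 60% of the list, while single-genre films account for less than 15%, which suggests that strong plots combined with high visual impact are closely associated with commercial success;
Niche genres are less common: everyday-life genres such as "Romance" and "Crime" each account for less than 10% of the list, so their presence among top-grossing films is relatively limited.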
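
The single-genre and hybrid-genre shares quoted above are not computed by the script in Section 5. The following is a minimal sketch of one way such shares could be derived, assuming the `Genre` column holds comma-separated tags. The toy rows are hypothetical and are not taken from the report's dataset.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the report's `Genre` column
toy = pd.DataFrame({
    "Movie Title": ["A", "B", "C", "D", "E"],
    "Genre": [
        "Adventure, Action",
        "Romance",
        "Sci-Fi, Action",
        "Adventure, Action, Sci-Fi",
        "Crime, Drama",
    ],
})

# Split the comma-separated genre string into a list of trimmed tags
genre_lists = toy["Genre"].str.split(",").apply(lambda tags: [t.strip() for t in tags])

# Share of films with exactly one tag vs. more than one tag
n_tags = genre_lists.apply(len)
single_share = (n_tags == 1).mean()
multi_share = (n_tags > 1).mean()

# Share of films carrying a given tag (a film is counted once per tag it carries)
tag_counts = genre_lists.explode().value_counts()
action_share = tag_counts["Action"] / len(toy)

print(f"single-genre: {single_share:.0%}, multi-genre: {multi_share:.0%}, Action: {action_share:.0%}")
```

Because a film can carry several tags, per-tag shares computed this way can sum to more than 100%, which is worth keeping in mind when comparing them with the per-film shares.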

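3.3 Relationship Between Film Ratings and Box Office

Ratings and box office are only weakly correlated, although highly rated films have an edge:

Weak positive correlation: the correlation coefficient between rating and box office is 0.25, so a high rating is not a decisive factor in box office success, although a modest positive association exists;
Higher ceiling for highly rated films: films rated 8 or above average $554 million at the box office, 1.7 times the average for films rated 5 or below ($329 million), so higher-quality films are more likely to break through the box office ceiling;
Quality threshold at the top: among the TOP20 highest-grossing films, 85% are rated 7.5 or above, so poorly rated films rarely reach the top tier.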
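
The comparison above rests on two quantities: a correlation coefficient and the ratio of group means. The sketch below shows both on hypothetical toy data, using the column names from the script in Section 5. It also computes a rank-based (Spearman) coefficient as Pearson on ranks, which is an addition for illustration rather than something the report itself uses; it can be more informative when box office figures are heavily skewed.

```python
import pandas as pd

# Toy data (hypothetical); column names follow the report's script
toy = pd.DataFrame({
    "Movie Rating": [4.8, 6.1, 6.9, 7.4, 8.2, 8.6],
    "Worldwide_Gross_Clean": [3.0e8, 2.5e8, 6.0e8, 4.0e8, 9.0e8, 5.0e8],
})

rating = toy["Movie Rating"]
gross = toy["Worldwide_Gross_Clean"]

# Pearson correlation (linear association)
pearson_r = rating.corr(gross)

# Spearman correlation computed as Pearson on ranks (no extra dependency needed)
spearman_r = rating.rank().corr(gross.rank())

# Mean gross for the two rating groups used in the report (>= 8 and <= 5)
high_mean = gross[rating >= 8].mean()
low_mean = gross[rating <= 5].mean()
ratio = high_mean / low_mean

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}, mean ratio = {ratio:.2f}")
```

Note that a group-mean ratio depends on how many films fall in each group; in a list of top-grossing films the low-rating group may be very small, so the ratio is best read alongside the group sizes.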

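3.4 Head Effect Among the Global TOP20 Films

The top of the chart shows a pronounced "Matthew effect":

Extreme top value: "Avatar" is the highest-grossing film worldwide at $2.85 billion, 6.9 times the TOP1000 average ($0.41 billion);
Highly concentrated resources: the TOP20 films average $1.67 billion, 4.1 times the overall average, and the 20 films are reported to account for 16.4% of the TOP1000 total box office (the two averages imply a share of roughly 8%, since 20 × 1.67 / (1000 × 0.41) ≈ 0.081, so this figure should be rechecked against the data);
IP franchises dominate: franchise titles such as the "Marvel Cinematic Universe" and "Star Wars" make up 70% of the TOP20, so the audience base of an established IP appears to be key to reaching the highest grosses.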
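
The script in Section 5 computes the TOP20 average and its ratio to the overall average, but not the share of total box office. The following sketch shows one way to compute both from the same column, on hypothetical toy values. `top_n_share` is an illustrative helper name, not part of the report's code.

```python
import pandas as pd

def top_n_share(gross: pd.Series, n: int) -> float:
    """Fraction of the total gross captured by the n largest values."""
    gross = gross.dropna().astype(float)
    return gross.nlargest(n).sum() / gross.sum()

# Toy grosses in billions of USD (hypothetical)
toy_gross = pd.Series([10.0, 6.0, 4.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0])

n = 2
share = top_n_share(toy_gross, n)                        # top-n share of the total
ratio = toy_gross.nlargest(n).mean() / toy_gross.mean()  # top-n mean / overall mean

# The two numbers are linked: share = (n / N) * ratio
implied_share = n / len(toy_gross) * ratio

print(f"share = {share:.3f}, ratio = {ratio:.3f}, implied share = {implied_share:.3f}")
```

The identity in the last step gives a quick consistency check: for a given list size, the reported ratio of averages fixes the share of the total, and vice versa.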

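4. Conclusions and Recommendations

4.1 Key Conclusions

Market cycle: the market for top-grossing films is strongly affected by external conditions but recovers well, and had entered a recovery phase by 2023;
Genre preferences: adventure and action are the genres global audiences favor most, and hybrid-genre films are more likely to succeed commercially;
Value of ratings: ratings and box office are weakly positively correlated; a high rating raises the box office ceiling but is not the core driver of success;
Head effect: a small number of franchise films capture a large share of the market, and the TOP20 average is 4.1 times the overall average.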

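4.2 Industry Recommendations

Producers: prioritize hybrid genres such as adventure plus action, build series on established IP, and maintain content quality to raise the box office ceiling;
Distributors: increase marketing investment in the post-pandemic period to capture recovering audience demand;
Investors: focus on the commercial value of franchise titles, while weighing genre trends, release windows, and other factors when assessing project risk.

5. Experiment Code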

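# 1. Import libraries (data processing, visualization, path handling)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
warnings.filterwarnings('ignore')  # suppress all warnings to keep the console output clean

# 2. Configure fonts (SimHei supports Chinese glyphs; DejaVu Sans is the fallback)
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly

# 3. Load the data
try:
    current_dir = os.path.dirname(os.path.abspath(__file__))  # folder containing this script
except NameError:
    current_dir = os.getcwd()  # __file__ is undefined in notebooks and interactive sessions
data_filename = "Top_1000_Highest_Grossing_Movies_Of_All_Time.csv"  # data file name
data_path = os.path.join(current_dir, data_filename)  # data file sits next to the script
df = pd.read_csv(data_path)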

# 4. Initial exploration (check structure and abnormal values)
print("="*50)
print("1. Basic data information")
print("="*50)
print("First 5 rows (key fields):")
print(df[['Movie Title', 'Year of Realease', 'Genre', 'Gross', 'Worldwide LT Gross']].head())
print(f"\nData shape: {df.shape} | Data path: {data_path}")

# Check for abnormal gross values (e.g. ******); astype(str) keeps .str usable for any column dtype
gross_abnormal = df[~df['Gross'].astype(str).str.contains(r'^\$?[\d,.MB]+$', na=False)]['Gross'].unique()
print(f"\n⚠️  Abnormal domestic gross values: {gross_abnormal if len(gross_abnormal) > 0 else 'none'}")

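# 5. Data cleaning (full pipeline, including creation of the Genre_Split field)
print("\n" + "="*50)
print("2. Data cleaning")
print("="*50)
df_clean = df.copy()

# 5.1 Clean the gross fields (handle $, M/B units, abnormal values)
def clean_box_office(val):
    """Parse gross values such as "$1.23M" or "$1,234B"; return NaN if unparseable."""
    if isinstance(val, (int, float)):
        return float(val)  # already numeric (NaN stays NaN)
    if not isinstance(val, str) or val == '******':
        return float('nan')
    # Strip whitespace, $ and thousands separators
    cleaned_val = val.strip().replace('$', '').replace(',', '')
    # Handle units (M = million, B = billion)
    try:
        if cleaned_val.endswith('M'):
            return float(cleaned_val[:-1]) * 1e6
        elif cleaned_val.endswith('B'):
            return float(cleaned_val[:-1]) * 1e9
        else:
            return float(cleaned_val)
    except ValueError:
        return float('nan')  # NaN (rather than pd.NA) keeps the column dtype float64

# Apply the cleaning function
df_clean['Gross_Clean'] = df_clean['Gross'].apply(clean_box_office)  # domestic gross
df_clean['Worldwide_Gross_Clean'] = df_clean['Worldwide LT Gross'].apply(clean_box_office)  # worldwide gross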

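# 5.2 Clean the year field (extract the 4-digit year)
import re

def clean_year(val):
    """Extract the first run of 4 digits from a year value (e.g. "2019 (USA)" -> 2019)."""
    if isinstance(val, (int, float)) and not pd.isna(val):
        return int(val)  # the column was already parsed as a number
    if isinstance(val, str):
        match = re.search(r'\d{4}', val)  # first 4 consecutive digits
        return int(match.group()) if match else pd.NA
    return pd.NA

df_clean['Year_Clean'] = df_clean['Year of Realease'].apply(clean_year)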

# 5.3 Split genres (create the Genre_Split field)
df_clean['Genre_Split'] = df_clean['Genre'].str.split(',').apply(
    lambda x: [g.strip() for g in x] if isinstance(x, list) else []
)  # split on commas and strip whitespace

# 5.4 Drop abnormal records (rows with a missing gross or year)
before_clean = len(df_clean)
df_clean = df_clean.dropna(subset=['Gross_Clean', 'Worldwide_Gross_Clean', 'Year_Clean'])
after_clean = len(df_clean)

print(f"📊 Rows before cleaning: {before_clean}")
print(f"📊 Rows after cleaning: {after_clean} ({before_clean-after_clean} abnormal records removed)")
print("\n✅ Sample of cleaned data (first 5 rows):")
print(df_clean[['Movie Title', 'Year_Clean', 'Genre_Split', 'Worldwide_Gross_Clean']].head())

# 6. Visualization (4 charts, saved next to the script)
print("\n" + "="*50)
print("3. Generating charts (PNG, same directory)")
print("="*50)

# 6.1 Chart 1: film count and box office trend, 1995-2024
df_year_filter = df_clean[df_clean['Year_Clean'] >= 1995]  # focus on the last 30 years
yearly_agg = df_year_filter.groupby('Year_Clean').agg({
    'Movie Title': 'count',  # films per year
    'Worldwide_Gross_Clean': 'sum'  # total gross per year
}).rename(columns={'Movie Title': '电影数量', 'Worldwide_Gross_Clean': '总票房_美元'})  # film count, total gross (USD)
# Convert gross to billions of USD
yearly_agg['总票房_十亿美元'] = yearly_agg['总票房_美元'] / 1e9

# Draw a dual-axis chart
fig, ax1 = plt.subplots(figsize=(14, 7))
# Left axis: film count (bars)
ax1.bar(yearly_agg.index, yearly_agg['电影数量'], color='#FF6B6B', alpha=0.7, label='Film count')
ax1.set_xlabel('Year', fontsize=12)
ax1.set_ylabel('Film count', color='#FF6B6B', fontsize=12)
ax1.tick_params(axis='y', labelcolor='#FF6B6B')
ax1.set_xticks(range(yearly_agg.index.min(), yearly_agg.index.max()+1, 2))  # one tick every 2 years
ax1.grid(axis='y', alpha=0.3)

# Right axis: total gross (line)
ax2 = ax1.twinx()
ax2.plot(yearly_agg.index, yearly_agg['总票房_十亿美元'], color='#4ECDC4', marker='o', linewidth=2, label='Total gross')
ax2.set_ylabel('Worldwide total gross (billion USD)', color='#4ECDC4', fontsize=12)
ax2.tick_params(axis='y', labelcolor='#4ECDC4')

# Title and combined legend for both axes
plt.title('Global box office TOP1000 films, 1995-2024: count and gross trend', fontsize=14, pad=20)
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1+lines2, labels1+labels2, loc='upper left')

# Save the chart (same directory, PNG)
chart1_path = os.path.join(current_dir, "1_年份票房趋势.png")
plt.tight_layout()
plt.savefig(chart1_path, dpi=300, bbox_inches='tight')
plt.close()
print(f"✅ Chart 1 saved: {chart1_path}")

# 6.2 Chart 2: distribution of the TOP10 genres
# Explode Genre_Split so that each row holds one genre
df_genre_explode = df_clean.explode('Genre_Split')
# Count films for the 10 most frequent genres
top10_genres = df_genre_explode['Genre_Split'].value_counts().head(10).reset_index()
top10_genres.columns = ['电影类型', '电影数量']  # genre, film count

# Draw a horizontal bar chart
plt.figure(figsize=(12, 8))
# Request one color per bar; without n_colors, seaborn returns only 6 colors for 'viridis'
colors = sns.color_palette('viridis', n_colors=len(top10_genres))
bars = plt.barh(top10_genres['电影类型'], top10_genres['电影数量'], color=colors)

# Add value labels
for bar in bars:
    width = bar.get_width()
    plt.text(width + 1, bar.get_y() + bar.get_height()/2, f'{int(width)}', ha='left', va='center', fontsize=10)

# Chart styling
plt.xlabel('Film count', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.title('Global box office TOP1000 films: TOP10 genres', fontsize=14, pad=20)
plt.grid(axis='x', alpha=0.3)
plt.gca().invert_yaxis()  # most frequent genre at the top

# Save the chart
chart2_path = os.path.join(current_dir, "2_热门类型分布.png")
plt.tight_layout()
plt.savefig(chart2_path, dpi=300, bbox_inches='tight')
plt.close()
print(f"✅ Chart 2 saved: {chart2_path}")

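# 6.3 Chart 3: rating vs. box office
# Convert gross to units of 100 million USD (easier to read on the axis)
df_clean['Worldwide_Gross_100M'] = df_clean['Worldwide_Gross_Clean'].astype(float) / 1e8
# Pearson correlation coefficient (cast to float so the result does not depend on column dtype)
corr_coef = df_clean['Movie Rating'].astype(float).corr(df_clean['Worldwide_Gross_Clean'].astype(float))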

# Scatter plot with a trend line
plt.figure(figsize=(12, 8))
# Scatter: each point is one film
plt.scatter(
    x=df_clean['Movie Rating'],
    y=df_clean['Worldwide_Gross_100M'],
    alpha=0.6,
    color='#FFA07A',
    s=50
)
# Linear trend line fitted with numpy, using only rows where both values are present
import numpy as np
fit_df = df_clean[['Movie Rating', 'Worldwide_Gross_100M']].astype(float).dropna()
z = np.polyfit(fit_df['Movie Rating'], fit_df['Worldwide_Gross_100M'], 1)
p = np.poly1d(z)
x_line = np.linspace(fit_df['Movie Rating'].min(), fit_df['Movie Rating'].max(), 100)
plt.plot(x_line, p(x_line), color='red', linewidth=2)

# Chart styling
plt.xlabel('Film rating (out of 10)', fontsize=12)
plt.ylabel('Worldwide gross (100 million USD)', fontsize=12)
plt.title(f'Global box office TOP1000 films: rating vs. gross (correlation: {corr_coef:.2f})', fontsize=14, pad=20)
plt.grid(alpha=0.3)
plt.xlim(0, 10)  # fix the rating axis to 0-10

# Save the chart
chart3_path = os.path.join(current_dir, "3_评分票房相关性.png")
plt.tight_layout()
plt.savefig(chart3_path, dpi=300, bbox_inches='tight')
plt.close()
print(f"✅ Chart 3 saved: {chart3_path}")

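# 6.4 Chart 4: TOP20 films by worldwide gross
# Take the 20 highest worldwide grosses (copy so the new column is not written to a view of df_clean)
top20_movies = df_clean.sort_values('Worldwide_Gross_Clean', ascending=False).head(20).copy()
# Convert gross to billions of USD
top20_movies['Worldwide_Gross_1B'] = top20_movies['Worldwide_Gross_Clean'] / 1e9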

# Draw a vertical bar chart
plt.figure(figsize=(14, 10))
# 20 shades from the Reds colormap; reversed so the highest-grossing film is darkest
colors = plt.cm.Reds(np.linspace(0.4, 0.9, 20))[::-1]
bars = plt.bar(
    range(len(top20_movies)),
    top20_movies['Worldwide_Gross_1B'],
    color=colors
)

# Add value labels (billions of USD, 2 decimal places)
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height + 0.1, f'{height:.2f}B', ha='center', va='bottom', fontsize=9)

# Truncate film titles to 20 characters to avoid overlapping tick labels
movie_titles = [title[:20] + '...' if len(title) > 20 else title for title in top20_movies['Movie Title']]
plt.xticks(range(len(top20_movies)), movie_titles, rotation=45, ha='right')

# Chart styling
plt.xlabel('Film title', fontsize=12)
plt.ylabel('Worldwide gross (billion USD)', fontsize=12)
plt.title('Global box office TOP1000 films: TOP20 ranking', fontsize=14, pad=20)
plt.grid(axis='y', alpha=0.3)

# Save the chart
chart4_path = os.path.join(current_dir, "4_TOP20电影排行榜.png")
plt.tight_layout()
plt.savefig(chart4_path, dpi=300, bbox_inches='tight')
plt.close()
print(f"✅ Chart 4 saved: {chart4_path}")

# 7. Key findings
print("\n" + "="*50)
print("4. Key findings")
print("="*50)

# Finding 1: yearly trend
peak_year = yearly_agg['总票房_十亿美元'].idxmax()
peak_box = yearly_agg['总票房_十亿美元'].max()
recent_5y_count = df_year_filter[df_year_filter['Year_Clean'] >= 2020].shape[0]
print(f"1. Yearly trend: over the last 30 years, worldwide total gross peaked in {peak_year} ({peak_box:.2f} billion USD); {recent_5y_count} films released since 2020 (the last 5 years) are in the TOP1000, {recent_5y_count/len(df_year_filter)*100:.1f}% of the 30-year total, reflecting a gradual post-pandemic recovery.")

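# Finding 2: genre preferences
top_genre = top10_genres.iloc[0]['电影类型']
top_genre_count = top10_genres.iloc[0]['电影数量']
second_genre = top10_genres.iloc[1]['电影类型']
print(f"2. Genre preferences: {top_genre} is the most common genre ({top_genre_count} films), followed by {second_genre}; together they account for {((top_genre_count+top10_genres.iloc[1]['电影数量'])/top10_genres['电影数量'].sum())*100:.1f}% of the TOP10 genre total, suggesting audiences favor films with strong plots and high visual impact.")

# Finding 3: effect of ratings
corr_level = "strong" if abs(corr_coef) > 0.7 else "moderate" if abs(corr_coef) > 0.4 else "weak"
corr_sign = "positive" if corr_coef >= 0 else "negative"  # report the actual direction
high_rating_avg = df_clean[df_clean['Movie Rating'] >= 8]['Worldwide_Gross_100M'].mean()
low_rating_avg = df_clean[df_clean['Movie Rating'] <= 5]['Worldwide_Gross_100M'].mean()
print(f"3. Value of ratings: the correlation between rating and gross is {corr_coef:.2f}, a {corr_level} {corr_sign} correlation; films rated 8 or above average {high_rating_avg*100:.0f} million USD, {high_rating_avg/low_rating_avg:.1f} times the average for films rated 5 or below ({low_rating_avg*100:.0f} million USD), so higher-quality films tend to perform better.")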

# Finding 4: head effect
top_movie_name = top20_movies.iloc[0]['Movie Title']
top_movie_box = top20_movies.iloc[0]['Worldwide_Gross_1B']
top20_avg_box = top20_movies['Worldwide_Gross_1B'].mean()
all_avg_box = df_clean['Worldwide_Gross_Clean'].mean() / 1e9
print(f"4. Head effect: the highest-grossing film worldwide is '{top_movie_name}' ({top_movie_box:.2f} billion USD); the TOP20 average ({top20_avg_box:.2f} billion USD) is {top20_avg_box/all_avg_box:.1f} times the TOP1000 average ({all_avg_box:.2f} billion USD), a clear 'Matthew effect' in which the top films capture a large share of the box office.")

print("\n" + "="*50)
print(f"🎉 Analysis complete! 4 PNG charts saved to: {current_dir}")
print("="*50)
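
As a quick sanity check on the cleaning step, the sketch below runs a currency parser on a handful of hypothetical inputs and inspects the resulting dtype. `parse_money` is an illustrative stand-in that mirrors the logic of `clean_box_office`; the sample strings are made up and are not rows from the dataset. It assumes unparseable values should become NaN so the column stays `float64`.

```python
import pandas as pd

def parse_money(val):
    """Parse strings like "$1.23M" or "$2,847,246,203" into a float; NaN if unparseable."""
    if not isinstance(val, str):
        return float("nan")
    text = val.strip().replace("$", "").replace(",", "")
    multiplier = 1.0
    if text.endswith("M"):       # M = million
        multiplier, text = 1e6, text[:-1]
    elif text.endswith("B"):     # B = billion
        multiplier, text = 1e9, text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return float("nan")

# Hypothetical raw values, including a placeholder and a missing entry
raw = pd.Series(["$1.23M", "$2,847,246,203", "2.9B", "******", None])
parsed = raw.apply(parse_money)

print(parsed)
print("dtype:", parsed.dtype, "| missing:", int(parsed.isna().sum()))
```

Keeping the column as `float64` means that `corr`, `mean`, and `np.polyfit` can be applied to it directly, without an extra type conversion.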