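Stage 7: Python Database Operations and Data Processing

This stage looks at how Python is used for database operations and data processing. It begins with a review of SQL basics, moves on to hands-on SQLite work, and then introduces the core libraries for data processing and analysis: NumPy, Pandas, and Matplotlib/Seaborn.

7.1 Database Operations

SQL Basics Review

Before diving into database programming in Python, here is a quick review of the basic SQL (Structured Query Language) operations, which are the core of working with relational databases: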

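SELECT: retrieves data from a database.

SQL

SELECT column1, column2 FROM table_name WHERE condition;

INSERT: adds new rows to a table.

SQL

INSERT INTO table_name (column1, column2) VALUES (value1, value2);

UPDATE: modifies the data in existing rows of a table. Without a WHERE clause it changes every row.

SQL

UPDATE table_name SET column1 = new_value WHERE condition;

DELETE: removes rows from a table. Without a WHERE clause it removes every row.

SQL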

DELETE FROM table_name WHERE condition;

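SQLite Database

SQLite is a lightweight, serverless, self-contained, transactional SQL database engine. Because the sqlite3 module ships with Python's standard library, it is an excellent choice for learning database operations.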
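
The four statements reviewed above can be tried without creating any file. The following is a minimal sketch, assuming an in-memory database and a hypothetical table named items (neither appears in the original text); values are passed through ? placeholders rather than pasted into the SQL string.

```python
import sqlite3

# Assumption: an in-memory database and a hypothetical table "items",
# used only so the four statements can be tried without creating a file.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")

# INSERT: values are bound through ? placeholders
cur.execute("INSERT INTO items (name, qty) VALUES (?, ?)", ("pen", 3))
cur.execute("INSERT INTO items (name, qty) VALUES (?, ?)", ("book", 1))

# UPDATE: change an existing row
cur.execute("UPDATE items SET qty = ? WHERE name = ?", (5, "pen"))

# DELETE: remove a row
cur.execute("DELETE FROM items WHERE name = ?", ("book",))

# SELECT: read back what is left
cur.execute("SELECT name, qty FROM items WHERE qty > ?", (0,))
remaining = cur.fetchall()
print(remaining)

conn.close()
```

Binding values with placeholders lets sqlite3 handle quoting, which is the usual way to avoid SQL injection when values come from user input.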

Below is a complete example of database operations using the sqlite3 module:

Python

import sqlite3

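# 1. Connect to the database
# If the database file does not exist, sqlite3 creates it automatically
conn = sqlite3.connect('example.db')
cursor = conn.cursor()  # Get a cursor object, used to execute SQL commands

# 2. Create a table
# A triple-quoted string makes multi-line SQL easy to write
cursor.execute('''
    CREATE TABLE IF NOT EXISTS users (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        email TEXT UNIQUE,
        age INTEGER
    )
''')
conn.commit()  # Commit the transaction to save the change

print("Table 'users' created, or it already existed.")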

# 3. Insert data
try:
    cursor.execute("INSERT INTO users (name, email, age) VALUES (?, ?, ?)", ('Alice', 'alice@example.com', 30))
    cursor.execute("INSERT INTO users (name, email, age) VALUES (?, ?, ?)", ('Bob', 'bob@example.com', 25))
    cursor.execute("INSERT INTO users (name, email, age) VALUES (?, ?, ?)", ('Charlie', 'charlie@example.com', 35))
    conn.commit()
    print("Data inserted successfully.")
except sqlite3.IntegrityError as e:
    print(f"Insert failed: {e} (the email may already exist)")
    conn.rollback()  # Roll back the transaction when an error occurs

# 4. Query data
print("\n--- All users ---")
cursor.execute("SELECT id, name, email, age FROM users")
rows = cursor.fetchall()  # Fetch every row of the result
for row in rows:
    print(f"ID: {row[0]}, Name: {row[1]}, Email: {row[2]}, Age: {row[3]}")

print("\n--- Users older than 30 ---")
cursor.execute("SELECT name, email FROM users WHERE age > ?", (30,))
for row in cursor.fetchall():
    print(f"Name: {row[0]}, Email: {row[1]}")

# 5. Update data
cursor.execute("UPDATE users SET age = ? WHERE name = ?", (31, 'Alice'))
conn.commit()
print("\nAlice's age has been updated.")

print("\n--- Alice's record after the update ---")
# Use a placeholder here too, consistent with the other queries
cursor.execute("SELECT name, age FROM users WHERE name = ?", ('Alice',))
alice_info = cursor.fetchone()  # Fetch the first row of the result, or None
if alice_info:
    print(f"Name: {alice_info[0]}, Updated Age: {alice_info[1]}")

# 6. Delete data
cursor.execute("DELETE FROM users WHERE name = ?", ('Bob',))
conn.commit()
print("\nBob's record has been deleted.")

print("\n--- All users after deleting Bob ---")
cursor.execute("SELECT name FROM users")
for row in cursor.fetchall():
    print(f"Name: {row[0]}")

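# 7. Transaction example
# A transaction is a group of operations that either all succeed or all fail,
# which keeps the data consistent.
try:
    cursor.execute("INSERT INTO users (name, email, age) VALUES (?, ?, ?)", ('David', 'david@example.com', 40))
    # Uncomment the next line to simulate an error and trigger the rollback
    # raise ValueError("Simulated error to trigger a rollback")
    cursor.execute("INSERT INTO users (name, email, age) VALUES (?, ?, ?)", ('Eve', 'eve@example.com', 28))
    conn.commit()
    print("\nTransaction: David and Eve inserted successfully.")
except Exception as e:
    print(f"\nTransaction failed: {e}. Rolling back.")
    conn.rollback()  # Undo every change made since the last commit

# Query again to confirm whether David and Eve were inserted
print("\n--- All users after the transaction ---")
cursor.execute("SELECT name FROM users")
for row in cursor.fetchall():
    print(f"Name: {row[0]}")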


# 8. Close the database connection
conn.close()
print("\nDatabase connection closed.")
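
The transaction step above never actually fails, so the rollback path is not exercised. The sketch below is one way to see it happen, assuming an in-memory database and a hypothetical helper named add_users (not part of the original example). It relies on the sqlite3 connection acting as a context manager, which commits when the block succeeds and rolls back when it raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, email TEXT UNIQUE)")

def add_users(connection, users):
    """Hypothetical helper: insert all users in one transaction; True on success."""
    try:
        # "with connection" commits if the block succeeds and rolls back if it raises
        with connection:
            connection.executemany(
                "INSERT INTO users (name, email) VALUES (?, ?)", users
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok_first = add_users(conn, [("David", "david@example.com")])
# The second batch repeats David's email, so the whole batch is rolled back,
# including Eve, who was inserted before the failing row.
ok_second = add_users(conn, [("Eve", "eve@example.com"), ("Dave", "david@example.com")])

names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY name")]
print(ok_first, ok_second, names)
conn.close()
```

Note that the context manager only manages the transaction; it does not close the connection, so conn.close() is still needed.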

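ORM Concepts (optional, a pointer for further study)

ORM (Object-Relational Mapping) is a programming technique for converting data between an object-oriented programming language and a relational database. Put simply, it lets developers work with the database through objects in their language of choice (such as Python) instead of writing raw SQL statements.

Benefits:

Abstraction: hides the low-level database details and SQL syntax so developers can focus on business logic.
Development speed: reduces repetitive SQL writing.
Maintainability: the database model sits closer to the code logic, which makes it easier to maintain.
Cross-database compatibility: many ORM frameworks support several databases, and switching often requires only a configuration change.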
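
To make the idea of mapping rows to objects concrete, here is a hand-written sketch using only the standard library. It is not how SQLAlchemy or Django ORM work internally; the User dataclass and the UserRepository class are hypothetical names invented for this illustration. The point is only that the calling code handles objects while the SQL stays hidden in one place.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    name: str
    age: int
    id: Optional[int] = None

class UserRepository:
    """Hypothetical helper: hides the SQL behind methods that take and return User objects."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL, age INTEGER)"
        )

    def add(self, user):
        cur = self.conn.execute(
            "INSERT INTO users (name, age) VALUES (?, ?)", (user.name, user.age)
        )
        self.conn.commit()
        user.id = cur.lastrowid  # copy the generated primary key back onto the object
        return user

    def older_than(self, age):
        rows = self.conn.execute(
            "SELECT name, age, id FROM users WHERE age > ? ORDER BY id", (age,)
        )
        return [User(*row) for row in rows]

repo = UserRepository(sqlite3.connect(":memory:"))
repo.add(User("Alice", 30))
repo.add(User("Charlie", 35))
older = repo.older_than(30)
print(older)
```

A real ORM generates this kind of mapping code from the model definition, which is where the savings in repetitive SQL come from.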

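Common Python ORM frameworks:

SQLAlchemy: a powerful, flexible, and widely used ORM framework, suitable for applications of any size.
Django ORM: the ORM bundled with the Django web framework, tightly integrated with the Django ecosystem and easy to use.

Once you have a solid grasp of SQL basics and sqlite3, exploring an ORM framework is the next step toward more efficient database programming and better code quality.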

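7.2 Introduction to Data Processing and Analysis

Much of Python's strength in data science comes from its rich set of data processing and analysis libraries. This section introduces the basics of NumPy, Pandas, and Matplotlib/Seaborn.

NumPy Basics

NumPy (Numerical Python) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object (ndarray) and tools for working with these arrays.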

Python

import numpy as np

# 1. Creating arrays
# A one-dimensional array from a list
arr1 = np.array([1, 2, 3, 4, 5])
print("arr1:", arr1)
print("arr1 shape:", arr1.shape)  # (5,): one dimension, 5 elements

# A two-dimensional array from a nested list
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("\narr2:\n", arr2)
print("arr2 shape:", arr2.shape)  # (2, 3): 2 rows, 3 columns

# An array of zeros
zeros_arr = np.zeros((2, 4))
print("\nzeros_arr:\n", zeros_arr)

# An array of ones
ones_arr = np.ones((3, 3))
print("\nones_arr:\n", ones_arr)

# An evenly spaced sequence
range_arr = np.arange(0, 10, 2)  # From 0 up to but not including 10, step 2
print("\nrange_arr:", range_arr)

# A random array
random_arr = np.random.rand(2, 3)  # 2x3 array of random floats in [0, 1)
print("\nrandom_arr:\n", random_arr)

# 2. Indexing and slicing
data = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

print("\nOriginal data:\n", data)

# Index a single element
print("Element at (0, 0):", data[0, 0])  # 10
print("Element at (1, 2):", data[1, 2])  # 60

# Row slices
print("First row:", data[0, :])  # [10 20 30]
print("Last row:", data[-1, :])  # [70 80 90]

# Column slices
print("First column:", data[:, 0])  # [10 40 70]
print("Second and third column:\n", data[:, 1:])  # [[20 30], [50 60], [80 90]]

# Sub-array slice
print("Sub-array:\n", data[0:2, 1:3])  # [[20 30], [50 60]]

# Boolean indexing
print("Elements greater than 50:\n", data[data > 50])  # [60 70 80 90]

# 3. Mathematical operations
arr_a = np.array([[1, 2], [3, 4]])
arr_b = np.array([[5, 6], [7, 8]])

print("\narr_a:\n", arr_a)
print("arr_b:\n", arr_b)

# Element-wise addition
print("Addition:\n", arr_a + arr_b)

# Element-wise multiplication
print("Multiplication (element-wise):\n", arr_a * arr_b)

# Matrix multiplication (for 2-D arrays, arr_a @ arr_b gives the same result)
print("Matrix multiplication (dot product):\n", np.dot(arr_a, arr_b))

# Scalar operations
print("Scalar addition:\n", arr_a + 10)
print("Scalar multiplication:\n", arr_a * 2)

# Statistical functions
print("Sum of all elements in arr_a:", np.sum(arr_a))
print("Mean of arr_a:", np.mean(arr_a))
print("Max of arr_a:", np.max(arr_a))
print("Min of arr_a:", np.min(arr_a))

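# 4. Broadcasting
# Broadcasting lets NumPy do arithmetic on arrays of different shapes.
# Shapes are compared from the last dimension backwards; two dimensions
# are compatible when they are equal or one of them is 1.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
vector = np.array([10, 20, 30])

print("\nMatrix:\n", matrix)
print("Vector:", vector)

# The vector is added to every row of the matrix
result_broadcast = matrix + vector
print("Matrix + Vector (Broadcasting):\n", result_broadcast)

# Another broadcasting example: a scalar is broadcast to every element
matrix_b = np.array([[1, 2], [3, 4]])
scalar = 5
print("\nMatrix_b:\n", matrix_b)
print("Scalar:", scalar)
print("Matrix_b * Scalar:\n", matrix_b * scalar)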

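The broadcasting example above adds a vector to each row. As a small complementary sketch, the following shows a column-shaped array being added to each column instead, and what happens when shapes are not compatible. The names row and col are chosen here for illustration, and np.broadcast_shapes assumes NumPy 1.20 or later.

```python
import numpy as np

matrix = np.arange(1, 10).reshape(3, 3)   # shape (3, 3), values 1..9
row = np.array([10, 20, 30])              # shape (3,), treated as (1, 3)
col = np.array([[100], [200], [300]])     # shape (3, 1)

by_row = matrix + row   # row is added to every row of matrix
by_col = matrix + col   # col is added to every column of matrix
print(by_row)
print(by_col)

# np.broadcast_shapes reports the result shape without doing any arithmetic
print(np.broadcast_shapes((3, 3), (3,)))
print(np.broadcast_shapes((3, 1), (1, 3)))

# Shapes (3, 3) and (2,) are incompatible: the trailing dimensions 3 and 2
# differ and neither is 1, so NumPy raises ValueError
try:
    matrix + np.array([1, 2])
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```
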
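Pandas Basics

Pandas is a data analysis library built on NumPy. It provides high-performance, easy-to-use data structures such as Series (a one-dimensional labeled array) and DataFrame (two-dimensional labeled tabular data).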

Python

import pandas as pd
import numpy as np

# 1. Creating a Series
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print("Series s:\n", s)
print("Value at index 'c':", s['c'])
print("Values from 'b' to 'd':\n", s['b':'d'])  # Label slices include the end label

# 2. Creating a DataFrame
# Build a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, np.nan, 35, 28],  # np.nan marks a missing value
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston'],
    'Salary': [70000, 80000, 60000, 90000, np.nan]
}
df = pd.DataFrame(data)
print("\nOriginal DataFrame df:\n", df)

# 3. Loading data (CSV, Excel)
# Write a small sample file named 'sample_data.csv', then read it back.
# Empty fields (Charlie's Age, Eve's Salary) are loaded as NaN.
csv_content = """Name,Age,City,Salary
Alice,25,New York,70000
Bob,30,Los Angeles,80000
Charlie,,Chicago,60000
David,35,New York,90000
Eve,28,Houston,
Frank,40,Seattle,100000
"""
with open('sample_data.csv', 'w', encoding='utf-8') as f:
    f.write(csv_content)

print("\n--- Loading data from a CSV file ---")
df_from_csv = pd.read_csv('sample_data.csv')
print(df_from_csv)

# For Excel files, use pd.read_excel('your_file.xlsx')
# (reading .xlsx needs an engine such as openpyxl to be installed)

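# 4. Data cleaning
# Handling missing values (NaN)
print("\n--- Handling missing values ---")
# Count missing values per column
print("Missing values before handling:\n", df.isnull().sum())

# Fill missing values: mean for 'Age', 'Unknown' for 'City', 0 for 'Salary'
df_filled = df.copy()  # Work on a copy so the original DataFrame is unchanged
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which recent pandas versions warn about and which may not modify df_filled
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())
df_filled['City'] = df_filled['City'].fillna('Unknown')
df_filled['Salary'] = df_filled['Salary'].fillna(0)
print("\nDataFrame after filling missing values:\n", df_filled)

# Drop rows that contain missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with missing values:\n", df_dropped_rows)

# Drop columns that contain missing values
df_dropped_cols = df.dropna(axis=1)  # axis=1 means columns
print("\nDataFrame after dropping columns with missing values:\n", df_dropped_cols)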

# Handling duplicates
df_duplicates = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'w', 'w']
})
print("\nOriginal DataFrame with duplicates:\n", df_duplicates)
print("Are there any duplicates?", df_duplicates.duplicated().any())

df_no_duplicates = df_duplicates.drop_duplicates()
print("DataFrame after dropping duplicates:\n", df_no_duplicates)

# 5. Filtering data
print("\n--- Filtering data ---")
# People aged 30 or older
older_than_30 = df[df['Age'] >= 30]
print("People older than or equal to 30:\n", older_than_30)

# People from New York with a salary above 75000
ny_high_salary = df[(df['City'] == 'New York') & (df['Salary'] > 75000)]
print("People from New York with high salary:\n", ny_high_salary)

# .loc selects by label, .iloc selects by position
print("\nUsing .loc and .iloc:")
print("Alice's record by label:\n", df.loc[0])  # Row whose index label is 0
print("Bob's Age and City by label:\n", df.loc[1, ['Age', 'City']])  # 'Age' and 'City' of the row labeled 1

print("First two rows by position:\n", df.iloc[0:2])
print("First row, first two columns by position:\n", df.iloc[0, 0:2])

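# 6. Grouping and aggregation
print("\n--- Grouping and aggregation ---")
# Group by city; compute mean age, mean salary, and a row count
avg_by_city = df.groupby('City').agg(
    Avg_Age=('Age', 'mean'),
    Avg_Salary=('Salary', 'mean'),
    Count=('Name', 'count')
)
print("Average Age and Salary by City:\n", avg_by_city)

# Group by city and age band
# right=False makes each bin include its left edge, so 30 falls in '30-39'
# (with the default right=True, age 30 would land in the first bin)
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100], right=False, labels=['<30', '30-39', '40+'])
print("\nDataFrame with Age_Group:\n", df)

# observed=False keeps every age band as a column, even those with no rows
grouped_by_age_city = df.groupby(['City', 'Age_Group'], observed=False).size().unstack(fill_value=0)
print("\nCount of people by City and Age Group:\n", grouped_by_age_city)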

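How pd.cut assigns a value that sits exactly on a bin edge depends on its right parameter, which matters for the age bands used in the grouping step above. The sketch below compares the two settings on three sample ages chosen for illustration; the variable names are assumptions, not part of the original example.

```python
import pandas as pd

ages = pd.Series([25, 30, 40])
bins = [0, 30, 40, 100]

# Default right=True: each bin includes its right edge,
# so 30 falls in the first bin and 40 in the second
right_closed = pd.cut(ages, bins=bins)

# right=False: each bin includes its left edge,
# so 30 falls in the second bin and 40 in the third
left_closed = pd.cut(ages, bins=bins, right=False)

# .cat.codes gives the position of the bin each value was assigned to
print(right_closed.cat.codes.tolist())
print(left_closed.cat.codes.tolist())
```

Choosing labels that match the closed side of the bins (for example '30-39' with right=False) keeps the printed groups from misleading the reader.
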
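Simple Visualization with Matplotlib/Seaborn

Matplotlib is a library for creating static, animated, and interactive visualizations. Seaborn is a data visualization library built on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.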

Python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

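# Font settings for Chinese text in Matplotlib
# Only needed when titles or labels contain Chinese; the SimHei font must be installed
plt.rcParams['font.sans-serif'] = ['SimHei']  # Set the default font
plt.rcParams['axes.unicode_minus'] = False  # Render the minus sign correctly with this font

# Create a sample DataFrame
data = {
    'Category': ['A', 'B', 'C', 'D', 'E'],
    'Value': [10, 25, 15, 30, 20],
    'Quantity': [5, 12, 7, 15, 10],
    'Sales': [100, 250, 150, 300, 200]
}
df_plot = pd.DataFrame(data)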

# 1. Line plot
print("\n--- Drawing a line plot ---")
plt.figure(figsize=(8, 5))
plt.plot(df_plot['Category'], df_plot['Value'], marker='o', linestyle='-', color='skyblue')
plt.title('Categories Value Trend')
plt.xlabel('Category')
plt.ylabel('Value')
plt.grid(True)
plt.show()

# 2. Scatter plot
print("\n--- Drawing a scatter plot ---")
plt.figure(figsize=(8, 5))
plt.scatter(df_plot['Value'], df_plot['Sales'], color='salmon', alpha=0.7)
plt.title('Value vs Sales')
plt.xlabel('Value')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

# 3. Bar chart
print("\n--- Drawing a bar chart ---")
plt.figure(figsize=(8, 5))
plt.bar(df_plot['Category'], df_plot['Quantity'], color='lightgreen')
plt.title('Category Quantity')
plt.xlabel('Category')
plt.ylabel('Quantity')
plt.show()

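# 4. More polished charts with Seaborn
print("\n--- Drawing charts with Seaborn ---")

# Scatter plot (Seaborn example): color by Category, marker size by Quantity
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Value', y='Sales', data=df_plot, hue='Category', size='Quantity', sizes=(50, 400), palette='viridis')
plt.title('Value vs Sales (Seaborn)')
plt.xlabel('Value')
plt.ylabel('Sales')
plt.show()

# Bar chart (Seaborn example)
# seaborn 0.13+ warns when palette is passed without hue, so hue is set to the
# x variable and the redundant legend is hidden; on older versions, drop
# hue= and legend= and keep only palette=
plt.figure(figsize=(8, 5))
sns.barplot(x='Category', y='Sales', data=df_plot, hue='Category', palette='coolwarm', legend=False)
plt.title('Total Sales by Category (Seaborn)')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()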

# More chart examples, using randomly generated numeric data
np.random.seed(42)
data_dist = pd.DataFrame({
    'Feature1': np.random.normal(loc=0, scale=1, size=100),
    'Feature2': np.random.uniform(low=0, high=10, size=100)
})

# Histogram: shows the distribution of values
print("\n--- Drawing a histogram ---")
plt.figure(figsize=(8, 5))
sns.histplot(data_dist['Feature1'], kde=True, color='purple')
plt.title('Distribution of Feature1')
plt.xlabel('Feature1 Value')
plt.ylabel('Frequency')
plt.show()

# Box plot: shows the spread of the data and any outliers
print("\n--- Drawing a box plot ---")
plt.figure(figsize=(8, 5))
sns.boxplot(y=data_dist['Feature2'], color='orange')
plt.title('Box Plot of Feature2')
plt.ylabel('Feature2 Value')
plt.show()

# Clean up the sample CSV file created in the Pandas section
import os
if os.path.exists('sample_data.csv'):
    os.remove('sample_data.csv')
    print("\n'sample_data.csv' has been deleted.")
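
The examples above display each figure with plt.show(), which needs a graphical environment. When running on a server or inside a script, one option is to write the figure to a file instead. The sketch below assumes the non-interactive Agg backend and a hypothetical output file named value_trend.png in a temporary directory; it also uses Matplotlib's object-oriented interface (fig, ax) rather than the plt.* calls used above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: draws to files, opens no window
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D", "E"]
values = [10, 25, 15, 30, 20]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(categories, values, marker="o")
ax.set_title("Categories Value Trend")
ax.set_xlabel("Category")
ax.set_ylabel("Value")
ax.grid(True)

# Hypothetical output location: a fresh temporary directory
# (it is not removed automatically)
out_path = os.path.join(tempfile.mkdtemp(), "value_trend.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)  # release the figure once it has been written

print(os.path.getsize(out_path) > 0)
```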
