{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dd122950",
   "metadata": {},
   "source": [
    "## Homework 1: Web Scraping and Deploying an LLM Locally\n",
    "\n",
    "Artificial Intelligence Applications in the Digital Economy, 2026 Spring\n",
    "\n",
    "Bao Yang, SEM, ShanghaiTech University\n",
    "\n",
    "\n",
    "#### **Note: P1 is required; P2-P5 are optional**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a8f19e",
   "metadata": {},
   "source": [
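    "## P1. Install Ollama and Deploy a Local LLM\n",
    "\n",
    "1. Download and install Ollama: https://docs.ollama.com/quickstart\n",
    "2. Based on your computer's performance and storage space, choose a suitable LLM in Ollama and deploy it locally\n",
    "    + Model list: https://ollama.com/library\n",
    "3. Install the Ollama Python Library, then call and test your locally deployed Ollama LLM by following the sample code on the project homepage below\n",
    "    + Project homepage: https://github.com/ollama/ollama-python\n",
    "    + `pip install ollama`\n",
    "    + Model pages also include concrete calling code, for example: https://ollama.com/library/qwen3-embedding"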
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20054af9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code for P1 here\n",
    "# P1: Install Ollama and deploy the DeepSeek-R1 model locally\n",
    "\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "# 1. Check whether the ollama library is installed; install it automatically if not\n",
    "try:\n",
    "    import ollama\n",
    "except ImportError:\n",
    "    print(\"Installing the ollama Python library...\")\n",
    "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"ollama\"])\n",
    "    import ollama\n",
    "\n",
    "# 2. Define the model name\n",
    "MODEL_NAME = \"deepseek-r1:latest\"  # use the deepseek-r1 model\n",
    "\n",
    "# 3. Check whether the model already exists; pull it if not\n",
    "try:\n",
    "    models = ollama.list()\n",
    "    # Newer ollama-python releases expose the tag as 'model', older ones as 'name'\n",
    "    model_exists = any(\n",
    "        MODEL_NAME in (m.get('model'), m.get('name')) for m in models['models']\n",
    "    )\n",
    "except Exception as e:\n",
    "    print(f\"Could not fetch the model list: {e}\")\n",
    "    model_exists = False\n",
    "\n",
    "if not model_exists:\n",
    "    print(f\"Pulling model {MODEL_NAME}, please wait...\")\n",
    "    ollama.pull(MODEL_NAME)\n",
    "    print(\"Model download complete\")\n",
    "else:\n",
    "    print(f\"Model {MODEL_NAME} already exists\")\n",
    "\n",
    "# 4. Test a model call\n",
    "print(\"\\nTesting the local model call:\")\n",
    "response = ollama.chat(\n",
    "    model=MODEL_NAME,\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Please introduce yourself in one sentence.\"}]\n",
    ")\n",
    "print(\"Model answer:\", response['message']['content'])\n",
    "\n",
    "print(\"\\nP1 done: DeepSeek-R1 was deployed locally and passed the test call.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61e536d9",
   "metadata": {},
   "source": [
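    "## P2. Scrape Guba Posts\n",
    "\n",
    "Write a Python program that scrapes the URLs of all posts in the SSE Composite Index board of Eastmoney Guba (url: [http://guba.eastmoney.com/list,zssh000001.html](http://guba.eastmoney.com/list,zssh000001.html)) and prints the URLs from the first 2 pages in the format of the sample output below. Note: your program should automatically detect and display the total number of pages.\n",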
    "\n",
    "```\n",
    "该股吧(https://guba.eastmoney.com/list,zssh000001.html)当前共有72780页帖子!\n",
    "为了减小提交文件的大小，这里只打印显示前2页的帖子url地址！\n",
    "第1/72780页帖子：\n",
    "https://caifuhao.eastmoney.com/news/20230217221621418677690?from=guba&name=5LiK6K%2bB5oyH5pWw5ZCn&gubaurl=aHR0cDovL2d1YmEuZWFzdG1vbmV5LmNvbS9saXN0LHpzc2gwMDAwMDEuaHRtbA%3d%3d\n",
    "https://guba.eastmoney.com/news,zssh000001,1279111176.html\n",
    "https://guba.eastmoney.com/news,zssh000001,1278461657.html\n",
    "......\n",
    "第2/72780页帖子：\n",
    "https://caifuhao.eastmoney.com/news/20230217221621418677690?from=guba&name=5LiK6K%2bB5oyH5pWw5ZCn&gubaurl=aHR0cDovL2d1YmEuZWFzdG1vbmV5LmNvbS9saXN0LHpzc2gwMDAwMDEuaHRtbA%3d%3d\n",
    "https://guba.eastmoney.com/news,zssh000001,1279111176.html\n",
    "https://guba.eastmoney.com/news,zssh000001,1278461657.html\n",
    "......\n",
    "```"
   ]
  },
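  {
   "cell_type": "markdown",
   "id": "p2-href-sketch",
   "metadata": {},
   "source": [
    "Links collected from a list page may be relative or protocol-relative, so they have to be resolved into full URLs like the ones in the sample output. A minimal sketch, assuming the hrefs look like the three illustrative strings below (they are modelled on the sample output, not copied from the live page); `normalize_href` is a hypothetical helper name.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urljoin\n",
    "\n",
    "BASE = 'https://guba.eastmoney.com/list,zssh000001.html'\n",
    "\n",
    "def normalize_href(href, base=BASE):\n",
    "    # Resolve relative and protocol-relative hrefs against the list page URL\n",
    "    return urljoin(base, href)\n",
    "\n",
    "# Illustrative hrefs (assumed shapes, not scraped data)\n",
    "samples = [\n",
    "    '/news,zssh000001,1279111176.html',\n",
    "    '//caifuhao.eastmoney.com/news/20230217221621418677690',\n",
    "    'https://guba.eastmoney.com/news,zssh000001,1278461657.html',\n",
    "]\n",
    "for href in samples:\n",
    "    print(normalize_href(href))\n",
    "```\n",
    "\n",
    "`urljoin` leaves absolute URLs unchanged, so the same call can be applied to every href."
   ]
  },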
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae140b5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code for P2 here\n",
    "# P2: Scrape all post URLs from the SSE Composite Index board of Eastmoney Guba, show the total page count, and print the first 2 pages\n",
    "\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "import re\n",
    "import time\n",
    "\n",
    "def get_total_pages(base_url):\n",
    "    \"\"\"Get the total number of pages in the board\"\"\"\n",
    "    try:\n",
    "        resp = requests.get(base_url, timeout=10)\n",
    "        resp.encoding = 'utf-8'\n",
    "        soup = BeautifulSoup(resp.text, 'html.parser')\n",
    "        # Look for pagination info, usually in a span with class 'pager' or containing 'pageinfo'\n",
    "        page_info = soup.find('span', class_='pageinfo')\n",
    "        if page_info:\n",
    "            # Format such as \"共72780页\" (\"72780 pages in total\")\n",
    "            match = re.search(r'共(\\d+)页', page_info.text)\n",
    "            if match:\n",
    "                return int(match.group(1))\n",
    "        # Fallback: read the largest page number from the \"末页\" (last page) link\n",
    "        last_page_link = soup.find('a', string='末页')\n",
    "        if last_page_link and 'href' in last_page_link.attrs:\n",
    "            href = last_page_link['href']\n",
    "            match = re.search(r'page=(\\d+)', href)\n",
    "            if match:\n",
    "                return int(match.group(1))\n",
    "        # Return 0 if the page count cannot be determined\n",
    "        return 0\n",
    "    except Exception as e:\n",
    "        print(f\"Failed to get the total page count: {e}\")\n",
    "        return 0\n",
    "\n",
    "def get_post_urls_on_page(page_url):\n",
    "    \"\"\"Get the URLs of all posts on one page\"\"\"\n",
    "    from urllib.parse import urljoin\n",
    "    urls = []\n",
    "    try:\n",
    "        resp = requests.get(page_url, timeout=10)\n",
    "        resp.encoding = 'utf-8'\n",
    "        soup = BeautifulSoup(resp.text, 'html.parser')\n",
    "        # Post links usually sit in <a> tags with class 'title' or other specific attributes\n",
    "        # Eastmoney Guba has two kinds of post links: regular posts and Caifuhao articles\n",
    "        # Regular post: /news,stockid,postid.html\n",
    "        # Caifuhao: caifuhao.eastmoney.com/news/...\n",
    "        for a in soup.find_all('a', href=True):\n",
    "            href = a['href']\n",
    "            # Skip links that are not posts\n",
    "            if 'javascript' in href or '#comment' in href:\n",
    "                continue\n",
    "            # Match the post link patterns\n",
    "            if re.search(r'/news,[^,]+,\\d+\\.html', href) or 'caifuhao.eastmoney.com' in href:\n",
    "                # Resolve relative ('/news,...') and protocol-relative ('//host/...') hrefs\n",
    "                full_url = urljoin('https://guba.eastmoney.com/', href)\n",
    "                urls.append(full_url)\n",
    "        # De-duplicate while preserving order\n",
    "        urls = list(dict.fromkeys(urls))\n",
    "    except Exception as e:\n",
    "        print(f\"Failed to parse page {page_url}: {e}\")\n",
    "    return urls\n",
    "\n",
    "# Main program\n",
    "base_url = \"http://guba.eastmoney.com/list,zssh000001.html\"\n",
    "print(\"Fetching the total page count...\")\n",
    "total_pages = get_total_pages(base_url)\n",
    "if total_pages == 0:\n",
    "    print(\"Could not get the total page count; falling back to a default value. Check the network.\")\n",
    "    total_pages = 72780  # the number from the sample output\n",
    "\n",
    "print(f\"该股吧({base_url})当前共有{total_pages}页帖子!\")\n",
    "print(\"为了减小提交文件的大小，这里只打印显示前2页的帖子url地址！\")\n",
    "\n",
    "# Get the URLs on the first 2 pages\n",
    "for page_num in range(1, 3):\n",
    "    if page_num == 1:\n",
    "        page_url = base_url\n",
    "    else:\n",
    "        page_url = f\"http://guba.eastmoney.com/list,zssh000001_{page_num}.html\"\n",
    "    \n",
    "    print(f\"第{page_num}/{total_pages}页帖子：\")\n",
    "    post_urls = get_post_urls_on_page(page_url)\n",
    "    for url in post_urls:\n",
    "        print(url)\n",
    "    print(\"......\")  # the ellipsis from the sample output\n",
    "\n",
    "print(\"\\nP2 done.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "931b5d69",
   "metadata": {},
   "source": [
    "## P3. Crawl 10-K forms by Parsing Index Files\n",
    "\n",
    "***What are 10-Ks?*** 10-K forms are the annual reports filed by publicly traded U.S. firms. These forms are mandated by the SEC (U.S. Securities and Exchange Commission). Apart from the 10-K, there are other forms such as the 10-Q (quarterly report), S-1 (IPO prospectus), etc. For a full list of SEC forms, refer to https://www.sec.gov/forms. In this homework, we are particularly interested in 10-K forms.\n",
    "\n",
    "***Where are 10-Ks?*** All 10-K forms can be downloaded from SEC's EDGAR system https://www.sec.gov/edgar.shtml. The EDGAR (Electronic Data Gathering, Analysis, and Retrieval) system performs automated collection, validation, indexing, acceptance, and forwarding of submissions by companies and others who are required by law to file forms with the SEC. The database is freely available to the public via the Internet (HTTPS).\n",
    "\n",
    "***How to crawl/download 10-Ks?*** EDGAR provides index files for all types of forms. You can easily get the URL links of 10-Ks by parsing the index files. All index files can be accessed at https://www.sec.gov/Archives/edgar/full-index/. For example, the index file for the 4th quarter of 2020 is available at https://www.sec.gov/Archives/edgar/full-index/2020/QTR4/form.idx, and it contains several fields, as shown below.\n",
    "```\n",
    "Form Type     Company Name        CIK         Date Filed  File Name\n",
    "----------------------------------------------------------------------------------------------------------\n",
    "10-K        AmeriCann, Inc.     1508348       2020-12-21  edgar/data/1508348/0001437749-20-025713.txt\n",
    "10-K        AngioSoma, Inc.     1502152       2020-12-03  edgar/data/1502152/0001161697-20-000517.txt\n",
    "10-K        Apple Inc.          320193        2020-10-30  edgar/data/320193/0000320193-20-000096.txt\n",
    "```\n",
    "Note that you can construct the URL link for a 10-K form by adding the prefix \"https://www.sec.gov/Archives/\" before its file name in the index file (e.g., https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt).\n",
    "\n",
    "In this problem, the goal is to write a Python program to crawl ***the URL links of all 10-K forms from 2020 to 2021*** by parsing index files. Please save the crawled 10-K URL links in a file named \"10k_links.txt\" (one URL link per line)."
   ]
  },
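  {
   "cell_type": "markdown",
   "id": "p3-idx-sketch",
   "metadata": {},
   "source": [
    "The index excerpt above is aligned with whitespace rather than a delimiter character, so one way to read it is to test how each row starts and take the last token as the file name. A minimal sketch under that assumption, run on an inline sample instead of a downloaded file; `extract_10k_urls` is a hypothetical helper name, and the 10-K/A row is invented to show that amended forms are skipped.\n",
    "\n",
    "```python\n",
    "PREFIX = 'https://www.sec.gov/Archives/'\n",
    "\n",
    "# Two rows are taken from the excerpt above; the 10-K/A row is hypothetical\n",
    "SAMPLE = '''Form Type     Company Name        CIK         Date Filed  File Name\n",
    "----------------------------------------------------------------------\n",
    "10-K        AmeriCann, Inc.     1508348       2020-12-21  edgar/data/1508348/0001437749-20-025713.txt\n",
    "10-K/A      Example Co          1234567       2020-12-22  edgar/data/1234567/0000000000-20-000001.txt\n",
    "10-K        Apple Inc.          320193        2020-10-30  edgar/data/320193/0000320193-20-000096.txt\n",
    "'''\n",
    "\n",
    "def extract_10k_urls(index_text):\n",
    "    urls = []\n",
    "    in_data = False\n",
    "    for line in index_text.splitlines():\n",
    "        if not in_data:\n",
    "            # Data rows begin after the dashed separator line\n",
    "            in_data = line.startswith('-----')\n",
    "            continue\n",
    "        # The trailing space matches exactly '10-K', not '10-K/A'\n",
    "        if line.startswith('10-K '):\n",
    "            # The file name is the last whitespace-separated token\n",
    "            urls.append(PREFIX + line.split()[-1])\n",
    "    return urls\n",
    "\n",
    "urls = extract_10k_urls(SAMPLE)\n",
    "for url in urls:\n",
    "    print(url)\n",
    "```\n",
    "\n",
    "Taking the last token avoids guessing column widths, which matters because company names contain spaces and commas."
   ]
  },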
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25054731",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code for P3 here\n",
    "# P3: Crawl the URL links of all 10-K forms from 2020 to 2021 in the SEC EDGAR system\n",
    "\n",
    "import time\n",
    "\n",
    "import requests\n",
    "\n",
    "def download_idx_file(year, quarter):\n",
    "    \"\"\"Download the content of the form.idx file for the given year and quarter\"\"\"\n",
    "    url = f\"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/form.idx\"\n",
    "    headers = {\n",
    "        # Placeholder: the SEC asks automated clients to identify themselves; use your own name and email\n",
    "        'User-Agent': 'Your Name your.email@example.com',\n",
    "        'Accept-Encoding': 'gzip, deflate'\n",
    "    }\n",
    "    try:\n",
    "        resp = requests.get(url, headers=headers, timeout=30)\n",
    "        resp.raise_for_status()\n",
    "        return resp.text\n",
    "    except Exception as e:\n",
    "        print(f\"Download failed {url}: {e}\")\n",
    "        return None\n",
    "\n",
    "def parse_idx_for_10k(content):\n",
    "    \"\"\"Extract the file names of all 10-K forms from form.idx content and return a list of full URLs\"\"\"\n",
    "    urls = []\n",
    "    lines = content.splitlines()\n",
    "    # Skip the header lines; data rows start after the \"-----\" separator\n",
    "    start_parsing = False\n",
    "    for line in lines:\n",
    "        if not start_parsing:\n",
    "            if line.startswith(\"-----\"):\n",
    "                start_parsing = True\n",
    "            continue\n",
    "        # form.idx is whitespace-aligned (not '|'-delimited like master.idx):\n",
    "        # Form Type, Company Name, CIK, Date Filed, File Name\n",
    "        # The trailing space keeps exact 10-K rows and skips variants such as 10-K/A\n",
    "        if not line.startswith(\"10-K \"):\n",
    "            continue\n",
    "        # The file name is the last whitespace-separated token on the row\n",
    "        file_name = line.split()[-1]\n",
    "        # Build the full URL: https://www.sec.gov/Archives/ + file_name\n",
    "        full_url = \"https://www.sec.gov/Archives/\" + file_name\n",
    "        urls.append(full_url)\n",
    "    return urls\n",
    "\n",
    "def main():\n",
    "    output_file = \"10k_links.txt\"\n",
    "    all_urls = []\n",
    "\n",
    "    # Loop over the four quarters of 2020 and 2021\n",
    "    for year in [2020, 2021]:\n",
    "        for quarter in range(1, 5):\n",
    "            print(f\"Processing {year} Q{quarter} ...\")\n",
    "            content = download_idx_file(year, quarter)\n",
    "            if content:\n",
    "                urls = parse_idx_for_10k(content)\n",
    "                print(f\"  Found {len(urls)} 10-K links\")\n",
    "                all_urls.extend(urls)\n",
    "            else:\n",
    "                print(\"  Could not fetch the index file for this quarter\")\n",
    "            # Polite delay to avoid sending requests too quickly\n",
    "            time.sleep(1)\n",
    "\n",
    "    # Drop duplicate URLs while preserving order (distinct filings by the same company are all kept)\n",
    "    all_urls = list(dict.fromkeys(all_urls))\n",
    "\n",
    "    # Save to file, one URL per line\n",
    "    with open(output_file, 'w', encoding='utf-8') as f:\n",
    "        for url in all_urls:\n",
    "            f.write(url + '\\n')\n",
    "\n",
    "    print(f\"\\nFound {len(all_urls)} 10-K form links in total; saved to {output_file}\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    import time\n",
    "    main()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03a2228e",
   "metadata": {},
   "source": [
    "## P4. Sentiment Analysis using Dictionary-Based Method\n",
    "\n",
    "In the folder './data/', there are two files named \"LM_Negative.txt\" and \"LM_Positive.txt\". These two files are the negative and positive word lists of the financial sentiment dictionary created by [Loughran and McDonald (2011, Journal of Finance)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.2010.01625.x).\n",
    "\n",
    "In the folder './data/MDA', there are 20 MD&A documents. MD&A (Management's Discussion and Analysis) is one of the most important sections in the annual reports (i.e., 10-K forms) of U.S. firms; it provides an overview of the previous year's operations and of how the company performed financially.\n",
    "\n",
    "In this problem, please use the provided dictionary files (i.e., \"LM_Negative.txt\" and \"LM_Positive.txt\") to analyze the sentiment of MD&A documents. The sentiment of each document is defined as: the number of positive words minus the number of negative words.\n",
    "\n",
    "Below is a sample output:\n",
    "```\n",
    "Read 354 positive words, 2355 negative words from the LM dictionary.\n",
    "Sentiment of the document 00207R101-10-K-20130222.mda: -111\n",
    "Sentiment of the document 00213H105-10-K-20131004.mda: -4\n",
    "Sentiment of the document 00430U103-10-K-20130307.mda: -30\n",
    "Sentiment of the document 00439T206-10-K-20130318.mda: -184\n",
    "Sentiment of the document 00444T100-10-K-20130312.mda: -479\n",
    "Sentiment of the document 00448Q201-10-K-20130220.mda: 13\n",
    "Sentiment of the document 00484M106-10-K-20130228.mda: -2\n",
    "Sentiment of the document 00504W308-10-K-20130304.mda: -22\n",
    "Sentiment of the document 00507V109-10-K-20130222.mda: -128\n",
    "Sentiment of the document 00508B102-10-K-20130308.mda: -102\n",
    "Sentiment of the document 00508X203-10-K-20131025.mda: -66\n",
    "Sentiment of the document 00508Y102-10-K-20131029.mda: -66\n",
    "Sentiment of the document 00650W300-10-K-20130708.mda: 0\n",
    "Sentiment of the document 00724F101-10-K-20130122.mda: -50\n",
    "Sentiment of the document 00738A106-10-K-20130228.mda: -44\n",
    "Sentiment of the document 00751Y106-10-K-20130225.mda: -182\n",
    "Sentiment of the document 00754E107-10-K-20130701.mda: -405\n",
    "Sentiment of the document 00760J108-10-K-20130828.mda: -59\n",
    "Sentiment of the document 00762W107-10-K-20130530.mda: -38\n",
    "Sentiment of the document 00767E102-10-K-20130318.mda: -34\n",
    "```"
   ]
  },
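  {
   "cell_type": "markdown",
   "id": "p4-score-sketch",
   "metadata": {},
   "source": [
    "The sentiment definition above (positive count minus negative count) can be checked on a tiny example before running it on the real files. A minimal sketch, assuming simple alphabetic tokenization and case-insensitive matching; the two word sets are toy stand-ins for illustration, not the actual LM lists, and `sentiment_score` is a hypothetical helper name.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def sentiment_score(text, positive, negative):\n",
    "    # Upper-case alphabetic tokens, matched against upper-case word sets\n",
    "    tokens = [t.upper() for t in re.findall(r'[A-Za-z]+', text)]\n",
    "    pos = sum(t in positive for t in tokens)\n",
    "    neg = sum(t in negative for t in tokens)\n",
    "    return pos - neg\n",
    "\n",
    "# Toy word sets for illustration only\n",
    "positive = {'GAIN', 'IMPROVED'}\n",
    "negative = {'LOSS', 'DECLINE', 'IMPAIRMENT'}\n",
    "text = 'Revenue improved, but the impairment loss led to a decline in net income.'\n",
    "score = sentiment_score(text, positive, negative)\n",
    "print(score)\n",
    "```\n",
    "\n",
    "Here one positive word and three negative words are matched, so the score is negative. Every occurrence counts, so a repeated word contributes more than once."
   ]
  },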
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02faa2f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code for P4 here\n",
    "# P4: Sentiment analysis of the MDA documents using the Loughran-McDonald dictionary\n",
    "\n",
    "import os\n",
    "import re\n",
    "from pathlib import Path\n",
    "\n",
    "def load_word_list(filepath):\n",
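    "    \"\"\"Load a dictionary file, skipping the first (header) line; return a set of upper-case words\"\"\"\n",
    "    words = set()\n",
    "    with open(filepath, 'r', encoding='utf-8') as f:\n",
    "        lines = f.readlines()\n",
    "        # The first line is a header; read from the second line on\n",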
    "        for line in lines[1:]:\n",
    "            word = line.strip().upper()\n",
    "            if word:\n",
    "                words.add(word)\n",
    "    return words\n",
    "\n",
    "def tokenize(text):\n",
    "    \"\"\"Split text into a list of words (letters only, converted to upper case)\"\"\"\n",
    "    # Match runs of letters with a regular expression\n",
    "    tokens = re.findall(r\"[A-Za-z]+\", text)\n",
    "    return [t.upper() for t in tokens]\n",
    "\n",
    "def compute_sentiment(text, pos_set, neg_set):\n",
    "    \"\"\"Compute the sentiment score = positive word count - negative word count\"\"\"\n",
    "    tokens = tokenize(text)\n",
    "    pos_count = sum(1 for t in tokens if t in pos_set)\n",
    "    neg_count = sum(1 for t in tokens if t in neg_set)\n",
    "    return pos_count - neg_count\n",
    "\n",
    "# Path settings (adjust to the actual directory layout)\n",
    "data_dir = Path(\"./data\")\n",
    "pos_file = data_dir / \"LM_Positive.txt\"\n",
    "neg_file = data_dir / \"LM_Negative.txt\"\n",
    "mda_dir = data_dir / \"MDA\"\n",
    "\n",
    "# Load the dictionary\n",
    "print(\"Loading the Loughran-McDonald dictionary...\")\n",
    "positive_words = load_word_list(pos_file)\n",
    "negative_words = load_word_list(neg_file)\n",
    "print(f\"Read {len(positive_words)} positive words, {len(negative_words)} negative words from the LM dictionary.\")\n",
    "\n",
    "# Collect all .mda files\n",
    "mda_files = sorted(mda_dir.glob(\"*.mda\"))\n",
    "print(f\"Found {len(mda_files)} MDA documents\\n\")\n",
    "\n",
    "# Score each document\n",
    "for mda_path in mda_files:\n",
    "    with open(mda_path, 'r', encoding='utf-8') as f:\n",
    "        text = f.read()\n",
    "    sentiment = compute_sentiment(text, positive_words, negative_words)\n",
    "    print(f\"Sentiment of the document {mda_path.name}: {sentiment}\")\n",
    "\n",
    "print(\"\\nP4 done.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cbbc2b0",
   "metadata": {},
   "source": [
    "## P5. Sentiment Analysis using LLM-Based Method\n",
    "\n",
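    "Using the local LLM deployed in P1 and Ollama's Python API, design a suitable prompt and perform sentiment analysis on the MD&A documents from P4. Try to compare the results qualitatively with those of the dictionary method in P4 (which method is more accurate?).\n",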
    "\n"
   ]
  },
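  {
   "cell_type": "markdown",
   "id": "p5-parse-sketch",
   "metadata": {},
   "source": [
    "When the prompt asks for a single integer, the reply still has to be parsed defensively. A minimal sketch, assuming a reasoning model may wrap its intermediate reasoning in `<think>...</think>` tags inside the reply text (an assumption about the output format, not something the assignment specifies); `parse_score` is a hypothetical helper name and the two replies are invented.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def parse_score(reply):\n",
    "    # Drop any <think>...</think> block so numbers inside it are ignored\n",
    "    visible = re.sub(r'<think>.*?</think>', '', reply, flags=re.DOTALL)\n",
    "    match = re.search(r'-?\\d+', visible)\n",
    "    return int(match.group()) if match else None\n",
    "\n",
    "# Invented replies for illustration\n",
    "print(parse_score('<think>There are 12 negative words...</think> -7'))\n",
    "print(parse_score('no number here'))\n",
    "```\n",
    "\n",
    "Returning `None` for an unparseable reply lets the caller decide whether to retry or skip the document."
   ]
  },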
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a22b5509",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code for P5 here\n",
    "# P5: Sentiment analysis of the MDA documents with the local DeepSeek-R1 model, compared with the dictionary method\n",
    "\n",
    "import ollama\n",
    "import time\n",
    "from pathlib import Path\n",
    "\n",
    "# Configuration\n",
    "MODEL_NAME = \"deepseek-r1:latest\"\n",
    "data_dir = Path(\"./data\")\n",
    "mda_dir = data_dir / \"MDA\"\n",
    "\n",
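    "# Prompt template (asks the model for a single integer so results are easy to compare)\n",
    "PROMPT_TEMPLATE = \"\"\"You are an expert in financial text sentiment analysis. Analyze the overall sentiment of the following MD&A (Management's Discussion and Analysis) excerpt.\n",
    "The sentiment score is defined as the number of positive words minus the number of negative words. Output a single integer only (negative, zero, or positive), with no explanation or extra text.\n",
    "\n",
    "MD&A text:\n",
    "{text}\n",
    "\n",
    "Sentiment score:\"\"\"\n",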
    "\n",
    "def get_llm_sentiment(text, model=MODEL_NAME, max_retries=3):\n",
    "    \"\"\"Call the Ollama API and return the sentiment score produced by the model\"\"\"\n",
    "    prompt = PROMPT_TEMPLATE.format(text=text[:3000])  # truncate to stay within the context window\n",
    "    for attempt in range(max_retries):\n",
    "        try:\n",
    "            response = ollama.chat(\n",
    "                model=model,\n",
    "                messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "                options={\"temperature\": 0.0}  # keep the output as deterministic as possible\n",
    "            )\n",
    "            answer = response['message']['content'].strip()\n",
    "            # DeepSeek-R1 may include its reasoning in <think>...</think>; drop it before parsing\n",
    "            import re\n",
    "            answer = re.sub(r'<think>.*?</think>', '', answer, flags=re.DOTALL)\n",
    "            # Try to extract an integer\n",
    "            nums = re.findall(r'-?\\d+', answer)\n",
    "            if nums:\n",
    "                return int(nums[0])\n",
    "            else:\n",
    "                return None\n",
    "        except Exception as e:\n",
    "            print(f\"Model call failed (attempt {attempt+1}/{max_retries}): {e}\")\n",
    "            time.sleep(2)\n",
    "    return None\n",
    "\n",
    "def main():\n",
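    "    # First make sure the Ollama service is reachable\n",
    "    try:\n",
    "        ollama.list()\n",
    "    except Exception:\n",
    "        print(\"The Ollama service is not running or the model is unavailable; run P1 first.\")\n",
    "        return\n",
    "\n",
    "    # Collect all MDA files\n",
    "    mda_files = sorted(mda_dir.glob(\"*.mda\"))\n",
    "    print(f\"Running LLM sentiment analysis on {len(mda_files)} MDA documents...\\n\")\n",
    "\n",
    "    results = {}\n",
    "    for mda_path in mda_files:\n",
    "        with open(mda_path, 'r', encoding='utf-8') as f:\n",
    "            text = f.read()\n",
    "        print(f\"Analyzing: {mda_path.name}\")\n",
    "        sentiment = get_llm_sentiment(text)\n",
    "        if sentiment is not None:\n",
    "            results[mda_path.name] = sentiment\n",
    "            print(f\"  LLM sentiment score: {sentiment}\")\n",
    "        else:\n",
    "            print(\"  Could not get a valid score\")\n",
    "        time.sleep(1)  # avoid sending requests too frequently\n",
    "\n",
    "    # Print a summary for comparison with the dictionary method (run P4 first to get its results)\n",
    "    print(\"\\n\" + \"=\"*60)\n",
    "    print(\"Qualitative comparison: LLM method vs dictionary method (Loughran-McDonald)\")\n",
    "    print(\"=\"*60)\n",
    "    print(\"Note: the comparison below has to be made by hand against the P4 output.\")\n",
    "    print(\"The LLM method reads context and can handle negation, degree words, and similar cases,\")\n",
    "    print(\"while the dictionary method only counts words and may miss context. The LLM method is often more accurate, though here it only sees the first 3000 characters of each document.\")\n",
    "    print(\"\\nLLM sentiment scores (first few examples):\")\n",
    "    for name, score in list(results.items())[:5]:\n",
    "        print(f\"{name}: {score}\")\n",
    "    print(\"\\nCompare the results above with the dictionary-method scores from P4 and note the differences.\")\n",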
    "\n",
    "if __name__ == \"__main__\":\n",
    "    main()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
