[Code Walkthrough] fMRI Caption Prediction and Visual Question Answering from Scratch

For anyone who wants some basic groundwork before attempting image reconstruction from brain signals, or who cannot do reconstruction (hardware/time constraints) and turns to image caption prediction and visual question answering instead.

Table of contents (full post so far)

1. Dataset Download

1.1. Useful Links

1.2. Downloading the Data with Code

1.3. Caption Prediction Data Download

1.4. Visual Question Answering Data Download

2. Code Download

2.1. Verify the Hypothesis! Match the Data!

2.2. Data Processing / Packaging / Loading

2.3. Verifying the Alignment

2.4. Obtaining Image Embeddings for the LLM

2.5. Aligning fMRI Embeddings with Image Embeddings

2.6. Training the Caption Prediction Network

2.7. Prediction

-Dataset used: The Natural Scenes Dataset (NSD)

-What is in NSD? fMRI recordings of subjects (subject/participant) viewing visual stimulus images (stimuli/image)

-Task 1, caption prediction: use the fMRI data to predict a description (caption/annotation) of the viewed image. The images come from MSCOCO!! When NSD was collected, everything the subjects viewed was an MSCOCO image

-Task 2, visual question answering: answer questions about the image based on the fMRI

-Image reconstruction (I won't cover code for this task here, but the underlying idea is very similar to the two tasks above)

Recommended only if you have several A40/A100/V100-class GPUs, and even then it can take several days

1. Dataset Download

1.1. Useful Links

        ①NSD dataset homepage: Natural Scenes Dataset

        ②NSD official documentation and download page: [数据集]The Natural Scenes Dataset (NSD)介绍,申请及使用方法_nsd数据集-CSDN博客 (I can't post the link directly; you have to fill in a usage application. Either follow the official site's instructions or see my blog post.)

        ③Original NSD paper: A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence | Nature Neuroscience

(This is the Nature page; you need institutional or personal access to the journal)

A massive 7T fMRI dataset to bridge cognitive and computational neuroscience | bioRxiv (if you have no subscription, try downloading the PDF from here)

(It is about seventy pages, so read at your own risk; it is optional)

        ④Got the data! The dataset itself is introduced in my other post: [数据集]The Natural Scenes Dataset (NSD)介绍,申请及使用方法_nsd数据集-CSDN博客 (skip this if you already know it~)

1.2. Downloading the Data with Code

(1) Manually downloading files one by one from the web page is tedious. No problem, the code below downloads everything in one go:

# https://github.com/ozcelikfu/brain-diffuser
import os
os.system('ls -l')

# Download Experiment Infos
os.system('aws s3 cp s3://natural-scenes-dataset/nsddata/experiments/nsd/nsd_expdesign.mat nsddata/experiments/nsd/ --no-sign-request ')
os.system('aws s3 cp s3://natural-scenes-dataset/nsddata/experiments/nsd/nsd_stim_info_merged.pkl nsddata/experiments/nsd/ --no-sign-request ')

# Download Stimuli
os.system('aws s3 cp s3://natural-scenes-dataset/nsddata_stimuli/stimuli/nsd/nsd_stimuli.hdf5 nsddata_stimuli/stimuli/nsd/ --no-sign-request ')

# Download Betas
for sub in [1,2,3,4,5,6,7,8]:  # for sub in [1,2,5,7]:
    for sess in range(1,38):
        os.system('aws s3 cp s3://natural-scenes-dataset/nsddata_betas/ppdata/subj{:02d}/func1pt8mm/betas_fithrf_GLMdenoise_RR/betas_session{:02d}.nii.gz nsddata_betas/ppdata/subj{:02d}/func1pt8mm/betas_fithrf_GLMdenoise_RR/ --no-sign-request '.format(sub,sess,sub))

# Download ROIs
for sub in [1,2,3,4,5,6,7,8]:
    os.system('aws s3 cp --no-sign-request  s3://natural-scenes-dataset/nsddata/ppdata/subj{:02d}/func1pt8mm/roi/ nsddata/ppdata/subj{:02d}/func1pt8mm/roi/  --recursive'.format(sub,sub))

For those who skipped the dataset introduction, here is a quick recap:

        ①First, if you want to download other files, browse the NSD download page (it is laid out as folders), find the folder you need, and reuse the pattern above with the path changed.

        ②What is nsd_expdesign.mat?

The variable subjectim stores, for the eight subjects (eight rows), the 10,000 images each of them viewed (10,000 columns). Every entry is an index: for example, if the entry for subject 1's first image is 2951, that image is slice (2951, :, :, :) of nsd_stimuli.hdf5, the NSD-built image file of shape (73000, 425, 425, 3), where (425, 425, 3) is one image and there are 73,000 stimulus images in total. In subjectim, the first 1,000 columns are identical across the eight subjects and serve as their test set, while the remaining 9,000 images differ per subject.

The variable masterordering stores the shuffled trial order. Why shuffle? Although each subject viewed only 10,000 distinct images, every image was shown three times, so each subject actually went through 30,000 trials! You cannot simply copy subjectim three times as-is, so the image list was tripled and shuffled, and all eight subjects follow the same shuffled order: masterordering[0] tells you which column of subjectim every subject's first trial shows. This is a two-level index and can be confusing: masterordering gives a position (column) in subjectim, and the value of subjectim at that position is the index along dimension 0 of nsd_stimuli.hdf5.

(Scary? Totally lost? Don't worry, the code later will do the matching automatically.)
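
If a concrete example helps, here is a minimal sketch of that two-level lookup (the path matches the F:/NSD layout used later; adjust it to your own):

import scipy.io as spio

exp = spio.loadmat(r'F:/NSD/nsd_expdesign.mat', squeeze_me=True)
subjectim = exp['subjectim']            # (8, 10000), 1-based indices into imgBrick
masterordering = exp['masterordering']  # (30000,), 1-based column indices into subjectim

sub = 1     # subject 1
trial = 0   # first trial of the experiment
col = masterordering[trial] - 1       # which of the 10000 images this trial shows
nsd_id = subjectim[sub - 1, col] - 1  # 0-based index along dimension 0 of nsd_stimuli.hdf5
print(col, nsd_id)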

        ③What is nsd_stim_info_merged.pkl?

It is the index from the 73,000 images in nsd_stimuli.hdf5 to COCO image IDs! You really only need two columns: the image's position in nsd_stimuli.hdf5 (nsdId) and the corresponding COCO image ID (cocoId); the code later also uses cocoSplit to tell train2017 from val2017.
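
A minimal lookup sketch (the path matches the layout used in section 2.2; adjust it). One thing worth double-checking on your own copy is whether nsdId counts from 0 or from 1 relative to imgBrick; the eyeball check in section 2.1 is exactly for catching that kind of off-by-one:

import pandas as pd

stim_info = pd.read_csv('F:/NSD/nsd_stim_info_merged.csv')
row = stim_info[stim_info.nsdId == 46003]              # the example image used in section 2.1
print(row.cocoId.values[0], row.cocoSplit.values[0])   # expected: 412931 and its COCO split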

        ④What is nsd_stimuli.hdf5?

It is the image collection (variable name imgBrick) of shape (73000, 425, 425, 3), i.e. the visual stimuli the subjects viewed. Note that (425, 425, 3) is not necessarily the original COCO size; the NSD paper says the images were cropped and/or resized (I don't remember the exact details). nsd_stimuli.hdf5 is 36.8 GB. That is not huge, but it is a single file, and loading it all at once will most likely blow up your RAM, so I don't recommend trying it on an ordinary laptop.

My laptop certainly threw an error when I tried.

(Even 36.8 GB of RAM is not necessarily enough; that was only an AI estimate, and in practice, after moving to an A40 machine, loading it took about 80 GB of RAM. That is CPU RAM, not GPU memory.)
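
One practical note: h5py only reads what you slice, so peeking at a single image is cheap; it is the full [:] read (used in the processing script in 2.2) that needs the ~80 GB. A minimal sketch, with the path as an assumption:

import h5py

with h5py.File(r'F:/NSD/NSD dataset_mass/nsd_stimuli.hdf5', 'r') as f:
    img = f['imgBrick'][46003]       # reads a single (425, 425, 3) image from disk
    # all_imgs = f['imgBrick'][:]    # this is the line that needs ~80 GB of RAM
print(img.shape, img.dtype)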

        ⑤What is in the betas_fithrf_GLMdenoise_RR folder?

Whole-brain beta values, 750 trials per session file. Each file has shape (81, 104, 83, 750): the first three dimensions are the brain volume and the last one is the time points (trials).

        ⑥What is in the roi folder?

Brain atlases (brain atlas/parcellation)~ My example code only uses nsdgeneral.nii.gz from the roi folder (note: this is sub 01's!! Other subjects' files are different!):

In that volume, white voxels are 1, grey voxels are 0, and the black background is -1. Counting the values with code gives:

-1 -> 592088
0 -> 91380
1 -> 15724

The NSD paper explains that the value-1 voxels are the ones with the highest average beta activation across the 73,000 viewed images; looking at the nii volume, they sit mostly in posterior visual cortex.
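
The counts above came from something like this minimal sketch (assuming sub 01's nsdgeneral.nii.gz sits at the path below):

import numpy as np
import nibabel as nib

mask = nib.load(r'F:/NSD/nsdgeneral.nii.gz').get_fdata()
values, counts = np.unique(mask, return_counts=True)
for v, c in zip(values, counts):
    print(int(v), '->', c)   # expected: -1 -> 592088, 0 -> 91380, 1 -> 15724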

(2) Although this is the download method most projects provide... it didn't work for me: every file I downloaded overwrote the previous one, so I downloaded everything manually from the web page instead!

1.3. Caption Prediction Data Download

(1) URL: COCO – Common Objects in Context (you only need captions_train2017.json and captions_val2017.json)

(2) Preview of captions_train2017.json:

{"info": {"description": "COCO 2017 Dataset","url": "http://cocodataset.org","version": "1.0","year": 2017,"contributor": "COCO Consortium","date_created": "2017/09/01"},"licenses": [{"url": "http://creativecommons.org/licenses/by-nc-sa/2.0/","id": 1,"name": "Attribution-NonCommercial-ShareAlike License"},{"url": "http://creativecommons.org/licenses/by-nc/2.0/","id": 2,"name": "Attribution-NonCommercial License"},{"url": "http://creativecommons.org/licenses/by-nc-nd/2.0/","id": 3,"name": "Attribution-NonCommercial-NoDerivs License"},{"url": "http://creativecommons.org/licenses/by/2.0/","id": 4,"name": "Attribution License"},{"url": "http://creativecommons.org/licenses/by-sa/2.0/","id": 5,"name": "Attribution-ShareAlike License"},{"url": "http://creativecommons.org/licenses/by-nd/2.0/","id": 6,"name": "Attribution-NoDerivs License"},{"url": "http://flickr.com/commons/usage/","id": 7,"name": "No known copyright restrictions"},{"url": "http://www.usa.gov/copyright.shtml","id": 8,"name": "United States Government Work"}],"images": [{"license": 3,"file_name": "000000391895.jpg","coco_url": "http://images.cocodataset.org/train2017/000000391895.jpg","height": 360,"width": 640,"date_captured": "2013-11-14 11:18:45","flickr_url": "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg","id": 391895},{"license": 4,"file_name": "000000522418.jpg","coco_url": "http://images.cocodataset.org/train2017/000000522418.jpg","height": 480,"width": 640,"date_captured": "2013-11-14 11:38:44","flickr_url": "http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg","id": 522418}

...

{"image_id": 193449,"id": 354839,"caption": "a bed covered in heaps of rags and dirty bedding"},{"image_id": 88983,"id": 354849,"caption": "view of the room from a remotes perspective."},{"image_id": 34869,"id": 354896,"caption": "A tray of sprinkled doughnuts with creme icing."},{"image_id": 354174,"id": 354905,"caption": "A man holding a tennis racquet next to a crowd of people."},{"image_id": 410587,"id": 354944,"caption": "A long couch with many pillows, a table and some seat cushions around it."},{"image_id": 388601,"id": 354955,"caption": "A woman sleeping on a giant piece of pizza."},{"image_id": 85340,"id": 355036,"caption": "Two women and a man with sandwich in basket."},{"image_id": 73333,"id": 355075,"caption": "A girl surfing on a small ocean wave."},{"image_id": 51054,"id": 355119,"caption": "A woman sitting on a bed talking on the phone."},{"image_id": 309341,"id": 355134,"caption": "A white surfboard sitting in a room before a set of track lighting."},{"image_id": 314074,"id": 355169,"caption": "People are watching a man cut a wedding cake with a puppet."},{"image_id": 34869,"id": 355205,"caption": "a container of donuts kept closed with a rubberband"},{"image_id": 437594,"id": 355236,"caption": "A woman standing in front of a computer keyboard."},{"image_id": 192306,"id": 355242,"caption": "A man standing on a tennis court near a large net."},{"image_id": 63804,"id": 355259,"caption": "A woman eating a doughnut sitting at a laptop."},{"image_id": 522622,"id": 355302,"caption": "A group of young children running on a field."},{"image_id": 507826,"id": 355392,"caption": "A man holding a tennis racquet on a tennis court."},{"image_id": 388974,"id": 355421,"caption": "A girl holding a hot dog and a cup of juice."},{"image_id": 44504,"id": 355486,"caption": "A display of toothbrushes and other dental hygiene products"},{"image_id": 278462,"id": 355575,"caption": "A surfer surfing on a small wave in the ocean."},{"image_id": 224757,"id": 355577,"caption": "A group of young children standing around a field."},{"image_id": 6473,"id": 355625,"caption": "A blue cake topped with a beach scene."}

The file splits into the image entries above, which carry URLs, and the plain annotations below.

1.4. Visual Question Answering Data Download

(1) VQA: Visual Question Answering

(2) Preview of v2_mscoco_train2014_annotations.json (an example entry is shown in section 2.1 ④):

2. Code Download

(1) I won't walk through any single project on its own and may mix them when explaining. Reference code (in short, I used BrainCaptioning's data processing and the model from Brain Captioning with GPT-2):

        ①Brain Captioning with GPT-2: GitHub – slavaheroes/brain-captioning-with-gpt2: Decoding of fMRI signals of stimuli images into the captions

        ②BrainCaptioning: GitHub – enomodnara/BrainCaptioning

(In short, BrainCaptioning aligns everything row by row, so the first row of fMRI matches the first caption and the first image, but its model code is a mess; Brain Captioning with GPT-2 keeps dedicated index files instead of re-aligning the data by rows, and its model code works a bit better. Not by much, but it'll do.)

2.1. Verify the Hypothesis! Match the Data!

(1) How do we find the COCO image corresponding to an NSD image, and then find that COCO image's caption?

        ①Pick a random image via nsd_stim_info_merged.csv:

I picked (46003, :, :, :) from nsd_stimuli.hdf5; nsd_stim_info_merged.csv shows the corresponding COCO ID is 412931.

        ②Dump image 46003 as a jpg:

import h5py
import numpy as np
from PIL import Image
import os

# File paths (change these to your own if you copy this)
hdf5_path = r'F:/NSD/NSD dataset_mass/nsd_stimuli.hdf5'
output_dir = r'F:/NSD/extracted_images'  # output directory
os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists

# Read the HDF5 file and extract the image
try:
    with h5py.File(hdf5_path, 'r') as hdf:
        # The images are stored in the dataset named 'imgBrick';
        # change the name here if yours differs
        if 'imgBrick' in hdf:
            image_data = hdf['imgBrick'][46003, :, :, :]  # extract the image at the given index

            # Check the image's shape and dtype
            print(f"Image shape: {image_data.shape}")
            print(f"Dtype: {image_data.dtype}")

            # Make sure the data is in the 0-255 range (uint8)
            if image_data.dtype != np.uint8:
                if np.max(image_data) <= 1.0:  # floats in [0, 1]
                    image_data = (image_data * 255).astype(np.uint8)
                else:  # other cases may need special handling
                    image_data = image_data.astype(np.uint8)

            # Save as a .jpg file
            output_path = os.path.join(output_dir, 'nsd_image_46003.jpg')
            Image.fromarray(image_data).save(output_path)
            print(f"Image saved to: {output_path}")
        else:
            print("Error: dataset 'imgBrick' not found in the file")
except Exception as e:
    print(f"Error: {str(e)}")

        ③Look up this image's annotations in captions_train2017.json (indexed by 412931):

{"image_id": 412931,"id": 611940,"caption": "A tennis player striking the ball with his shadow underneath him."}

Looks exactly right (in reality there are five annotations; I'm only showing one).
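
A minimal sketch of that lookup (assuming captions_train2017.json sits in the annotations folder used in section 2.2):

import json

with open(r'F:/NSD/annotations/captions_train2017.json', 'r') as f:
    train_cap = json.load(f)

caps = [a['caption'] for a in train_cap['annotations'] if a['image_id'] == 412931]
for c in caps:
    print(c)   # prints all (usually five) captions for COCO image 412931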

        ④Look up the questions in v2_OpenEnded_mscoco_train2014_questions.json (still indexed by 412931):

{"image_id": 412931, "question": "What game is being played?", "question_id": 412931000}, {"image_id": 412931, "question": "Is the sun shining?", "question_id": 412931001}, {"image_id": 412931, "question": "Is the man inside or outside of the line?", "question_id": 412931002}, {"image_id": 412931, "question": "Can you see the person's head?", "question_id": 412931003}, {"image_id": 412931, "question": "Which sport is this?", "question_id": 412931004}, {"image_id": 412931, "question": "How is the ground like?", "question_id": 412931005}

Then look up the answers in v2_mscoco_train2014_annotations.json (indexed by 412931):

"image_id": 412931, "question_type": "what", "question_id": 412931000}, {"question_type": "is the", "multiple_choice_answer": "yes", "answers": [{"answer": "yes", "answer_confidence": "yes", "answer_id": 1}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 2}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 3}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 4}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 5}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 6}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 7}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 8}, {"answer": "yes", "answer_confidence": "maybe", "answer_id": 9}, {"answer": "yes", "answer_confidence": "yes", "answer_id": 10}], "image_id": 412931, "answer_type": "yes/no", "question_id": 412931001}, {"answer_type": "other", "multiple_choice_answer": "inside", "answers": [{"answer": "outside", "answer_confidence": "yes", "answer_id": 1}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 2}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 3}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 4}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 5}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 6}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 7}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 8}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 9}, {"answer": "inside", "answer_confidence": "yes", "answer_id": 10}], "image_id": 412931, "question_type": "is the man", "question_id": 412931002}, {"question_type": "can you", "multiple_choice_answer": "no", "answers": [{"answer": "no", "answer_confidence": "yes", "answer_id": 1}, {"answer": "no", "answer_confidence": "yes", "answer_id": 2}, {"answer": "no", "answer_confidence": "yes", "answer_id": 3}, {"answer": "no", "answer_confidence": "yes", "answer_id": 4}, {"answer": "no", "answer_confidence": "yes", "answer_id": 5}, {"answer": "no", "answer_confidence": "maybe", "answer_id": 6}, {"answer": "no", "answer_confidence": "yes", "answer_id": 7}, {"answer": "no", "answer_confidence": "yes", "answer_id": 8}, {"answer": "no", "answer_confidence": "yes", "answer_id": 9}, {"answer": "no", "answer_confidence": "yes", "answer_id": 10}], "image_id": 412931, "answer_type": "yes/no", "question_id": 412931003}, {"answer_type": "other", "multiple_choice_answer": "tennis", "answers": [{"answer": "tennis", "answer_confidence": "yes", "answer_id": 1}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 2}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 3}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 4}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 5}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 6}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 7}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 8}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 9}, {"answer": "tennis", "answer_confidence": "yes", "answer_id": 10}], "image_id": 412931, "question_type": "which", "question_id": 412931004}, {"question_type": "how", "multiple_choice_answer": "clay", "answers": [{"answer": "hard", "answer_confidence": "yes", "answer_id": 1}, {"answer": "sandy", "answer_confidence": "yes", "answer_id": 2}, {"answer": "pressed dirt", "answer_confidence": "yes", "answer_id": 3}, {"answer": "brown", "answer_confidence": 
"yes", "answer_id": 4}, {"answer": "clay", "answer_confidence": "yes", "answer_id": 5}, {"answer": "sandy", "answer_confidence": "yes", "answer_id": 6}, {"answer": "clay", "answer_confidence": "yes", "answer_id": 7}, {"answer": "hard clay", "answer_confidence": "yes", "answer_id": 8}, {"answer": "sand", "answer_confidence": "yes", "answer_id": 9}, {"answer": "clay", "answer_confidence": "yes", "answer_id": 10}], "image_id": 412931, "answer_type": "other", "question_id": 412931005}

Everything checks out so far.
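
The same kind of lookup works for the VQA files; a minimal sketch (the paths are placeholders, point them at wherever you saved the two json files):

import json

with open('v2_OpenEnded_mscoco_train2014_questions.json', 'r') as f:
    questions = json.load(f)['questions']
with open('v2_mscoco_train2014_annotations.json', 'r') as f:
    annotations = json.load(f)['annotations']

answers = {a['question_id']: a['multiple_choice_answer'] for a in annotations}
for q in questions:
    if q['image_id'] == 412931:
        print(q['question'], '->', answers[q['question_id']])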

2.2. Data Processing / Packaging / Loading

(1) Using BrainCaptioning/prepare_nsddata_captions.py:

        ①Load the files (change the paths to your own here, don't use mine!):

import os
import sys
import numpy as np
import h5py
import scipy.io as spio
import nibabel as nib
import pandas as pd
import json
from os.path import join as opj
import tqdm


import argparse
parser = argparse.ArgumentParser(description='Argument Parser')
parser.add_argument("-sub", "--sub",help="Subject Number",default=1)
args = parser.parse_args()
sub=int(args.sub)


base_path="F:/"
timeseries_path=opj(base_path,"NSD")
betas_path=opj(base_path,"NSD","nsddata_betas","ppdata","subj{:02d}","func1pt8mm")
stim_info_path=opj(base_path,"NSD","nsd_stim_info_merged.csv")

stimuli_path=opj(base_path,"NSD","NSD dataset_mass")
stim_file_path=opj(stimuli_path,"nsd_stimuli.hdf5")
stim_captions_train_path=opj(base_path,"NSD","annotations",f"captions_train2017.json")
stim_captions_val_path=opj(base_path,"NSD","annotations",f"captions_val2017.json")

stim_file = stim_file_path

#for captions
stim_info=pd.read_csv(stim_info_path)

with open(stim_captions_train_path,'rb') as f:
    train_cap=json.load(f)

with open(stim_captions_val_path,'rb') as f:
    val_cap=json.load(f)


caption_train_df=pd.DataFrame.from_dict(train_cap["annotations"])
caption_val_df=pd.DataFrame.from_dict(val_cap["annotations"])

        ②Split the images into training and test sets over subject 1's 37 sessions (750 trials each):

def loadmat(filename):
    '''
    this function should be called instead of direct spio.loadmat
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function check keys to cure all entries
    which are still mat-objects
    '''
    def _check_keys(d):
        '''
        checks if entries in dictionary are mat-objects. If yes
        todict is called to change them to nested dictionaries
        '''
        for key in d:
            if isinstance(d[key], spio.matlab.mio5_params.mat_struct):
                d[key] = _todict(d[key])
        return d

    def _todict(matobj):
        '''
        A recursive function which constructs from matobjects nested dictionaries
        '''
        d = {}
        for strg in matobj._fieldnames:
            elem = matobj.__dict__[strg]
            if isinstance(elem, spio.matlab.mio5_params.mat_struct):
                d[strg] = _todict(elem)
            elif isinstance(elem, np.ndarray):
                d[strg] = _tolist(elem)
            else:
                d[strg] = elem
        return d

    def _tolist(ndarray):
        '''
        A recursive function which constructs lists from cellarrays
        (which are loaded as numpy ndarrays), recursing into the elements
        if they contain matobjects.
        '''
        elem_list = []
        for sub_elem in ndarray:
            if isinstance(sub_elem, spio.matlab.mio5_params.mat_struct):
                elem_list.append(_todict(sub_elem))
            elif isinstance(sub_elem, np.ndarray):
                elem_list.append(_tolist(sub_elem))
            else:
                elem_list.append(sub_elem)
        return elem_list
    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)



stim_order_f = r'F:/NSD/nsd_expdesign.mat'
stim_order = loadmat(stim_order_f)


## Selecting ids for training and test data

sig_train = {}
sig_test = {}
num_trials = 37*750
for idx in range(num_trials):
    ''' nsdId as in design csv files'''
    nsdId = stim_order['subjectim'][sub-1, stim_order['masterordering'][idx] - 1] - 1
    if stim_order['masterordering'][idx]>1000:
        if nsdId not in sig_train:
            sig_train[nsdId] = []
        sig_train[nsdId].append(idx)
    else:
        if nsdId not in sig_test:
            sig_test[nsdId] = []
        sig_test[nsdId].append(idx)


train_im_idx = list(sig_train.keys())
test_im_idx = list(sig_test.keys())

At this point train_im_idx has length 8859 and test_im_idx has length 982; each subject saw 8859 + 982 = 9841 distinct images in these 37 sessions.

Sample of train_im_idx; each number is the index of a viewed image in the NSD image collection:

[61882, 828, 67573, 16020, 40422, 51517, 62325, 50610, 55065, 37398, 18039, 67533, 21822, 35405, 21690, 28278, 10459, 2293, 44325, 38218, 30032, 65255, 64919, 12469, 43078...]

Sample of test_im_idx:

[46002, 48617, 44980, 32625, 53052, 4930, 6431, 70335, 36576, 57046, 7659, 30373, 25959, 65414, 42171, 5602, 21601, 62302, 5301, 15492, 25287, 6558, 16723, 40575, 45595, 9917, 26598, 60305, 4786, 19181, 72080, 36067, 71753, 58144, 11942, 38817...]

As for every image being viewed three times: sig_train records in which trials a given image was shown, in a format like:

14327: [26952, 27123, 27244], 3224: [26961, 27040]
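
Continuing in the same script, a quick sanity check (these lines just print; the expected numbers are the ones above):

print(len(train_im_idx), len(test_im_idx))      # 8859 982
print(len(train_im_idx) + len(test_im_idx))     # 9841 distinct images seen
some_id = train_im_idx[0]
print(some_id, sig_train[some_id])              # the (up to three) trials that showed this image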

        ③Match the fMRI data to the image data row by row, while separating training and test sets:

roi_dir = r'F:/NSD/'.format(sub)
betas_dir = r'F:/NSD/nsddata_betas/ppdata/subj{:02d}/func1pt8mm/'.format(sub)

mask_filename = 'nsdgeneral.nii.gz'
mask = nib.load(roi_dir+mask_filename).get_fdata()
num_voxel = mask[mask>0].shape[0]

fmri = np.zeros((num_trials, num_voxel)).astype(np.float32)
for i in range(37):
    beta_filename = "betas_session{0:02d}.nii.gz".format(i+1)
    beta_f = nib.load(betas_dir+beta_filename).get_fdata().astype(np.float32)
    fmri[i*750:(i+1)*750] = beta_f[mask>0].transpose()
    del beta_f
    print(i)
    
print("fMRI Data are loaded.")

f_stim = h5py.File(r'F:/NSD/NSD dataset_mass/nsd_stimuli.hdf5', 'r')
stim = f_stim['imgBrick'][:]

print("Stimuli are loaded.")

num_train, num_test = len(train_im_idx), len(test_im_idx)
vox_dim, im_dim, im_c = num_voxel, 425, 3
fmri_array = np.zeros((num_train,vox_dim))
stim_array = np.zeros((num_train,im_dim,im_dim,im_c))
for i,idx in enumerate(train_im_idx):
    stim_array[i] = stim[idx]
    fmri_array[i] = fmri[sorted(sig_train[idx])].mean(0)
    print(i)

np.save('NSD/processed_data/subj{:02d}/nsd_train_fmriavg_nsdgeneral_sub{}.npy'.format(sub,sub),fmri_array )
np.save('NSD/processed_data/subj{:02d}/nsd_train_stim_sub{}.npy'.format(sub,sub),stim_array )

print("Training data is saved.")

fmri_array = np.zeros((num_test,vox_dim))
stim_array = np.zeros((num_test,im_dim,im_dim,im_c))
for i,idx in enumerate(test_im_idx):
    stim_array[i] = stim[idx]
    fmri_array[i] = fmri[sorted(sig_test[idx])].mean(0)
    print(i)

np.save('NSD/processed_data/subj{:02d}/nsd_test_fmriavg_nsdgeneral_sub{}.npy'.format(sub,sub),fmri_array )
np.save('NSD/processed_data/subj{:02d}/nsd_test_stim_sub{}.npy'.format(sub,sub),stim_array )

print("Test data is saved.")

fmri = np.zeros((num_trials, num_voxel)) has shape (27750, 15724): 27750 trials in total, each one holding the beta values of the 15724 voxels where subject 1's nsdgeneral.nii.gz equals 1. So nsd_train_fmriavg_nsdgeneral_sub{}.npy should end up with shape (8859, 15724) (I'm not certain of this!).

nsd_train_stim_sub{}.npy holds the matched images and should be (8859, 425, 425, 3) (not certain, but very likely! If in doubt, print it yourself, e.g. with the snippet below).
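
A quick way to check instead of guessing (paths follow the np.save calls above):

import numpy as np

for name in ['nsd_train_fmriavg_nsdgeneral_sub1.npy', 'nsd_train_stim_sub1.npy',
             'nsd_test_fmriavg_nsdgeneral_sub1.npy', 'nsd_test_stim_sub1.npy']:
    arr = np.load('NSD/processed_data/subj01/' + name)
    print(name, arr.shape)   # expected: (8859, 15724), (8859, 425, 425, 3), (982, 15724), (982, 425, 425, 3)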

        ④Match the COCO captions:

train_captions = np.empty((len(train_im_idx), 5), dtype=object)  # np.object is removed in recent NumPy, plain object works everywhere
i = 0
for nsdId in tqdm.tqdm(train_im_idx):
    cocoId = stim_info[stim_info.nsdId == nsdId].cocoId.values[0]
    split = stim_info[stim_info.nsdId == nsdId].cocoSplit.values[0]

    if split == "train2017":
        captions = caption_train_df[caption_train_df.image_id == cocoId].caption.values
    elif split == "val2017":
        captions = caption_val_df[caption_val_df.image_id == cocoId].caption.values
    train_captions[i, :] = captions[:5]
    i += 1

np.save('NSD/processed_data/subj{:02d}/nsd_train_cap_sub{}.npy'.format(sub, sub), train_captions)

test_captions = np.empty((len(test_im_idx), 5), dtype=object)
i = 0
for nsdId in tqdm.tqdm(test_im_idx):
    cocoId = stim_info[stim_info.nsdId == nsdId].cocoId.values[0]
    split = stim_info[stim_info.nsdId == nsdId].cocoSplit.values[0]

    if split == "train2017":
        captions = caption_train_df[caption_train_df.image_id == cocoId].caption.values
    elif split == "val2017":
        captions = caption_val_df[caption_val_df.image_id == cocoId].caption.values
    test_captions[i, :] = captions[:5]
    i += 1

np.save('NSD/processed_data/subj{:02d}/nsd_test_cap_sub{}.npy'.format(sub, sub), test_captions)

print("Caption data are saved.")

Preview of nsd_test_cap_sub1.npy (the shape should be (982, 5)):

100%|██████████| 982/982 [00:01<00:00, 889.94it/s]
[['White cows eating grass under trees and the sky'
  'Many cows in a pasture with trees eating grass.'
  'A herd of cows graze on a field of sparse grass.'
  'a herd of white cows grazing on brush among the trees'
  'A herd of mostly white cows in a field with some trees.']
 ['A plane on the runway under cloudy skies.'
  'Airplane boards on an extremely dark and gloomy day.'
  'There is a plane pulled into a port under the clouds.'
  'An American Airlines airplane is preparing for take off. '
  'An airplane sits at the airport waiting to be loaded.']
 ['A passenger jet coming in for a landing over a big city.'
  'An aeroplane flying in the sky over the buildings at sunset.'
  'An airplane flying in the air above a city.'
  'an airplane flying about many tall buildings and cars '
  'A blue jet airliner flying over a city.']
 ...
 ['A  dog sitting on a purple folding lawn chair'
  'A brown and white dog sitting in a purple and white strip lawn chair.'
  'A dog relaxing comfortably in a collapsable chair.'
  'A dog sitting in a purple and white striped chair.'
  'The dog lies inside a purple and white collapsable chair.']
 ['A male skier with an Olympic number bib on and flags and a Vancouver 2010 sign in the background.'
  'A man is competing in the 2010 Olympics skiing.'
  'A man is snow boarding in winter Olympic.'
  'A man on snow skis is pushing himself with ski poles.'
  'A skier in a bright colored suit in the snow.']
 ['A man wearing a tie and glasses looks down'
  'A man in a tie looking at the camera '
  'A man wearing a tie and glasses smiling for the camera.'
  'A man with a tie and glasses sitting down.'
  'A man in glasses and a tie is making an unhappy face.']]

2.3. Verifying the Alignment

(1) How do we confirm that BrainCaptioning's processing really lines the data up row by row? Print the first three entries of the fMRI data nsd_test_fmriavg_nsdgeneral_sub1.npy, the stimulus images nsd_test_stim_sub1.npy, and the captions nsd_test_cap_sub1.npy (a minimal check is sketched below):

It looks fine. The images and captions clearly correspond; the fMRI cannot be judged by eye, so let's take it as aligned too.
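
A minimal sketch of that check (paths follow the processed_data layout from 2.2); the images are written to disk so you can compare them with the printed captions by eye:

import numpy as np
from PIL import Image

fmri = np.load('NSD/processed_data/subj01/nsd_test_fmriavg_nsdgeneral_sub1.npy')
stim = np.load('NSD/processed_data/subj01/nsd_test_stim_sub1.npy')
caps = np.load('NSD/processed_data/subj01/nsd_test_cap_sub1.npy', allow_pickle=True)

for i in range(3):
    print(i, caps[i][0], fmri[i][:5])   # first caption plus the first few voxel values
    Image.fromarray(stim[i].astype(np.uint8)).save('check_{}.jpg'.format(i))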

2.4. Obtaining Image Embeddings for the LLM

        ①Use brain-captioning-with-gpt2/parse_dinov2_embeds.py to get the image embeddings (the code below may not run as-is and needs adjusting; it is shown for reference only. I won't cover those modifications here and have only added some comments.) (DINOv2 has to be downloaded from Hugging Face.)

import torch
from torchvision import transforms

from PIL import Image
import h5py
from tqdm import tqdm
import numpy as np

import pickle

annots_cur = np.load('../data/annots/COCO_73k_annots_curated.npy')
print("Annotations are loaded.", annots_cur.shape)
# annots_cur.shape = (73000, 5); 73000 distinct images, 5 captions each, but some caption slots are empty

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)

f_stim = h5py.File('../data/nsddata_stimuli/stimuli/nsd/nsd_stimuli.hdf5', 'r')
stim = f_stim['imgBrick'][:]

print("Stimuli are loaded.", stim.shape)

device = "cuda:6" if torch.cuda.is_available() else "cpu"
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_lc')

model.to(device)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
    
])

all_embeddings = []
all_captions = []

for i in tqdm(range(stim.shape[0])):
    image = stim[i].astype(np.uint8)
    
    captions = annots_cur[i] 
    captions = [c.strip() for c in captions if c.strip() != '']

    with torch.no_grad():
        image = preprocess(Image.fromarray(image)).unsqueeze(0).to(device)
        prefix = model.backbone.forward_features(image)['x_norm_clstoken']  # one image: (1, 3, 224, 224) -> (1, 1536)

    all_embeddings.append(prefix.cpu())  # list that will eventually hold 73000 tensors
    all_captions.append(captions)

    if (i+1) % 1000 == 0 or i == stim.shape[0]-1:  # note: the original listing had (i+1)%1000 here, which would save on nearly every step
        # Save the embeddings and captions
        with open('../processed_data/stimuli_original_dino_vision.pkl', 'wb') as f:
            pickle.dump(torch.cat(all_embeddings, dim=0), f)  # saved as a single tensor of shape (73000, 1536)

        with open('../processed_data/stimuli_original_captions.pkl', 'wb') as f:
            pickle.dump(all_captions, f)

# all_captions is a list of length 73000; each element is itself a list holding that image's
# captions (empty strings removed), so the number of captions per image is not uniform.
# Example for one image: ["A person kitesurfing over the waves of the ocean's shore.",
#   'A man is flying up in the air and having fun.', 'A guy is waterboarding in the ocean on a windy day.']
# Every image has at least one caption.
print(len(all_embeddings), len(all_captions))
print("All done.")

This yields DINOv2 embeddings for every image; restricted to the training and test images they become stimuli_train_dino_vision.pkl and stimuli_test_dino_vision.pkl, with shapes (8859, 1536) and (982, 1536). In other words, the raw inputs (8859, 425, 425, 3) and (982, 425, 425, 3) are turned into 1536-dimensional embeddings by the image feature extractor DINOv2; that dimension is fixed by the backbone and cannot be changed.
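
The repo produces the per-split embedding files its own way; purely as a hedged sketch, here is one way they could be derived from the full (73000, 1536) tensor saved above, reusing train_im_idx and test_im_idx from section 2.2 (the file names and paths here are assumptions, not the repo's exact script):

import pickle
import torch

# train_im_idx / test_im_idx: the 0-based image index lists built in section 2.2
with open('../processed_data/stimuli_original_dino_vision.pkl', 'rb') as f:
    all_embeds = pickle.load(f)                            # torch tensor, (73000, 1536)

train_embeds = all_embeds[torch.tensor(train_im_idx)]      # (8859, 1536)
test_embeds = all_embeds[torch.tensor(test_im_idx)]        # (982, 1536)

with open('../processed_data/stimuli_train_dino_vision.pkl', 'wb') as f:
    pickle.dump(train_embeds, f)
with open('../processed_data/stimuli_test_dino_vision.pkl', 'wb') as f:
    pickle.dump(test_embeds, f)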

2.5. Aligning fMRI Embeddings with Image Embeddings

        ①In brain-captioning-with-gpt2-master/train_linear_ridge.py I modified the original code so that the index files are no longer needed and everything is matched row by row.

        ②The output is linear_regression_sub0{args.sub}_test_dinov2_preds.pkl, the predicted image embeddings for the test set.

        ③Regression result: the MSE between the ridge-predicted fMRI embeddings and the true image embeddings. A minimal sketch of this step follows:
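
Since the screenshots of this step did not survive, here is a hedged sketch of what the ridge step boils down to; this is not the repo's exact train_linear_ridge.py, and the alpha, paths, and file names are illustrative:

import pickle
import numpy as np
from sklearn.linear_model import Ridge

train_fmri = np.load('NSD/processed_data/subj01/nsd_train_fmriavg_nsdgeneral_sub1.npy')
test_fmri = np.load('NSD/processed_data/subj01/nsd_test_fmriavg_nsdgeneral_sub1.npy')
with open('../processed_data/stimuli_train_dino_vision.pkl', 'rb') as f:
    train_embeds = pickle.load(f).numpy()    # (8859, 1536), assuming a torch tensor was pickled
with open('../processed_data/stimuli_test_dino_vision.pkl', 'rb') as f:
    test_embeds = pickle.load(f).numpy()     # (982, 1536)

reg = Ridge(alpha=1e4)                       # illustrative regularization strength
reg.fit(train_fmri, train_embeds)            # map fMRI vectors to DINOv2 embeddings
preds = reg.predict(test_fmri)               # (982, 1536) predicted image embeddings
print('test MSE:', np.mean((preds - test_embeds) ** 2))

with open('linear_regression_sub01_test_dinov2_preds.pkl', 'wb') as f:
    pickle.dump(preds, f)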

2.6. Training the Caption Prediction Network

        ①In brain-captioning-with-gpt2-master/train_captioner.py (if you followed BrainCaptioning's row-wise alignment, you again need to strip the indexing here and match by rows).

        ②The key code in the caption module that either trains only the mapping network or trains both the mapping network and the captioner:

    if config['mapping_network']['only_prefix']:
        # train only mapping network
        model = ClipCaptionPrefix(
            prefix_length=config['mapping_network']['prefix_length'],
            clip_length=config['mapping_network']['clip_length'],
            prefix_size=config['mapping_network']['prefix_size'],
            num_layers=config['mapping_network']['num_layers'],
            mapping_type=config['mapping_network']['mapping_type'],
            gpt2_type=config['gpt2_type']
        )
        
    else:
        # train both captioner and mapping network
        model = ClipCaptionModel(
            prefix_length=config['mapping_network']['prefix_length'],
            clip_length=config['mapping_network']['clip_length'],
            prefix_size=config['mapping_network']['prefix_size'],
            num_layers=config['mapping_network']['num_layers'],
            mapping_type=config['mapping_network']['mapping_type'],
            gpt2_type=config['gpt2_type']
        )

This corresponds to the mapping-network block (the red box) in the paper's architecture figure.

        ③Code when both are trained (I added extra comments that are not in the original):

class ClipCaptionModel(nn.Module):

    def get_dummy_token(self, batch_size: int, device: torch.device) -> torch.Tensor:
        return torch.zeros(batch_size, self.prefix_length, dtype=torch.int64, device=device)

    def forward(self, tokens: torch.Tensor, prefix: torch.Tensor, mask: Optional[torch.Tensor] = None,
                labels: Optional[torch.Tensor] = None):
        embedding_text = self.gpt.transformer.wte(tokens)  # tokens (9841, 32) -> embeddings (9841, 32, 768)
        prefix_projections = self.clip_project(prefix).view(-1, self.prefix_length, self.gpt_embedding_size)  # each row of brain signal is mapped to a 10 x 768 matrix used as "prefix token" embeddings for GPT-2, so (9841, 10, 768)
        embedding_cat = torch.cat((prefix_projections, embedding_text), dim=1)  # (9841, 42, 768)
        if labels is not None:  # labels is None here, so this branch is not taken
            dummy_token = self.get_dummy_token(tokens.shape[0], tokens.device)
            labels = torch.cat((dummy_token, tokens), dim=1)  # (batch, 10 + 32)
        out = self.gpt(inputs_embeds=embedding_cat, labels=labels, attention_mask=mask)  # out.logits is (9841, 42, 50257), mask is (9841, 42); labels is None, so GPT-2 falls back to its defaults
        return out

    def __init__(self, prefix_length: int, clip_length: Optional[int] = None, prefix_size: int = 512,
                 num_layers: int = 8, mapping_type: str = 'mlp', gpt2_type: str = 'gpt2'):
        super(ClipCaptionModel, self).__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained(gpt2_type)
        self.gpt_embedding_size = self.gpt.transformer.wte.weight.shape[1]  # fixed at 768 for GPT-2
        if mapping_type == 'mlp':
            self.clip_project = MLP((prefix_size, (self.gpt_embedding_size * prefix_length) // 2,
                                     self.gpt_embedding_size * prefix_length))
        else:
            self.clip_project = TransformerMapper(prefix_size, self.gpt_embedding_size, prefix_length,
                                                                     clip_length, num_layers)

        ④Code when only the mapping network is trained:

class ClipCaptionPrefix(ClipCaptionModel):

    def parameters(self, recurse: bool = True):
        return self.clip_project.parameters()  # return only the mapping network's parameters

    def train(self, mode: bool = True):
        super(ClipCaptionPrefix, self).train(mode)  # put ClipCaptionPrefix into training mode
        self.gpt.eval()  # keep GPT-2 frozen; its parameters are not updated
        return self

        ⑤Running this produces the weight files.

2.7. Prediction

        ①predict.py in the brain-captioning-with-gpt2-master folder runs the METEOR evaluation:

It reports a METEOR score of 0.2163 along with caption predictions for the first two images. The model here is the reproduced brain-captioning-with-gpt2 model, unchanged except that I did not use its indexing scheme.

-First image:

Ground-truth caption: White cows eating grass under trees and the sky

Predicted caption: A group of animals that are standing in the water (pretty bad)

-Second image:

Ground-truth caption: A plane on the runway under cloudy skies

Predicted caption: A group of people standing on top of an airport tarmac (like the one above, only half right)

        ②final_evaluation.ipynb in the brain-captioning-with-gpt2-master folder provides more metrics.
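
predict.py and final_evaluation.ipynb ship their own evaluation code; purely as an illustration of how METEOR can be computed, a hedged NLTK sketch using the first test image's captions from above:

import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

nltk.download('punkt')     # newer NLTK versions may also need 'punkt_tab'
nltk.download('wordnet')

references = ['White cows eating grass under trees and the sky',
              'Many cows in a pasture with trees eating grass.']    # ground-truth captions
hypothesis = 'A group of animals that are standing in the water'    # predicted caption

score = meteor_score([word_tokenize(r) for r in references], word_tokenize(hypothesis))
print(round(score, 4))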


The visual question answering part will be covered in a follow-up post.
