人工智能 AI Law

律师在AI医疗项目尽职调查实践

Lawyer Practice in Due Diligence for AI Medical Projects

2024.11.22


国家药监局的官方指导文件已经允许AI用于病灶性质、用药、治疗等医疗行为的辅助决策，还包括一些流程性的非辅助决策，但是强调审查数据来源的合规性、数据分布的合理性、有效性和准确性。美国食品药品监督管理局（FDA）也在逐步完善针对AI医疗产品的审批流程，这为行业带来了更多的规范性保障，也为合法合规的AI医疗技术提供了广阔的应用空间。

AI医疗市场预计在未来几年将保持快速增长。据市场研究，全球AI医疗市场的年均增长率（CAGR）可能会达到40%以上，市场规模到2030年将达到数千亿美元。

AI在医学影像分析中的应用，如X光、CT、MRI等，显著提高了诊断的准确性和效率，甚至在某些情况下超过了人类医生的水平。目前AI在医疗影像分析中的应用最为成熟，也是投资的热门领域。许多初创公司和大企业正在开发AI驱动的诊断工具，市场潜力巨大。但技术尚处于高速迭代发展阶段，可能存在不准确或偏差的问题。投资者需要警惕技术和市场的双重风险。

平衡

对AI医疗项目投资的法律风险评估应当依据AI项目的特点开展，风险与收益需要平衡，数据审查也需要根据AI项目特点进行平衡。

据报告，在某数据平台上有人公开贩卖二型糖尿病人群队列基因位点数据、头皮脑电、颅内脑电、脑功能数据、用于反向虚拟筛选的蛋白质数据库等，如果采用未经验证的、非法渠道获取的公共数据集用于AI决策训练可能导致严重的偏差风险，也是律师进行外部审查的重点。律师需要针对AI医疗的特点审查AI医疗系统背后大数据训练所依据的诊断数据来源。实践中，因为涉及医学、法学和计算机科学三个学科的交叉领域，律师应当从多个角度考虑，以确保训练源数据的合法性、隐私保护和真实性。

律师应当核查AI医疗系统所使用的数据是否是合法获取的，是否经过了病人或数据主体的同意。检查相关的知情同意书和数据共享协议，确保数据的采集过程符合《个人信息保护法》以及其他相关行业规则，同时律师还应当评估医疗数据去识别化（脱敏）措施是否到位。AI系统所使用的医疗数据应经过去标识化处理，以确保即使数据被泄露，也不会轻易识别出个人身份。律师对数据来源的审查是重要的工作，也需要律师对AI医疗系统有着充分的了解，因为行业的专业性，律师团队应独立聘请第三方医疗行业专家对医疗数据的具体内容进行辅助判断，律师应要求AI医疗系统的开发者或投资方提供数据来源的详细记录，并核实这些数据是否来源于可信的医疗机构。需要评估数据的记录方式、存储方法以及版本管理情况。

一、如何在海量数据中确定抽样数据

在审查用于AI医疗的海量训练数据时，律师不可能逐一核查每一条数据，因此合理的抽样策略是必要的。在大数据系统中，数据通常被分割成多个数据块，分布在不同的存储节点上。每个数据块包含一部分完整的数据记录。为了进行有效的抽样，律师需要理解这些数据块的组织和分布。

律师应当根据AI医疗机构提供的相关数据授权文件（来自医院或中介机构），查找对应的数据块（源数据），再从其中随机抽取若干个数据块进行审查。每个数据块内部再随机选择若干条数据进行详细核查。这种方法有助于在不偏向任何特定区域的情况下获取整个源数据集的代表性样本。

如果训练数据有不同的层次或类别（如不同医院、不同地区或不同的病种），可以采用分层抽样方法。在每个层次中随机抽取数据块进行审查，这样可以确保各层数据的代表性。

对于有序存储的数据（如按诊断时间、患者ID排序），可以采用系统抽样法，即按照一定的间隔抽取数据块（例如每1000个数据块抽取1个），确保覆盖整个源数据集的分布。

确定抽样的基本单位（如数据块、时间窗口、患者群体等）。根据数据的总体量和期望的抽样精度，确定需要审查的样本量。一般来说，样本量越大，审查结果越具代表性，但也要平衡律师的工作量以及投资人确定的审查期限。

根据确定的抽样方法和样本量，提取相应的样本数据块。可以使用脚本或大数据工具来自动化抽样过程，确保随机性和公平性。以下为Hadoop工具使用示例：

Hadoop的MapReduce框架是处理大规模数据的核心工具，通过编写Map和Reduce函数来实现数据抽样。在Map函数中，可以对每条记录进行处理。比如，生成一个随机数来决定该记录是否被选中作为样本。对于被选中的记录，将其标记为样本并输出到中间结果。Reduce函数接收来自Map阶段的样本数据块，并将它们聚合在一起。如果样本量过大，可能需要进一步随机选取一部分样本输出。最终的样本数据可以输出到HDFS的特定目录，供后续的律师审查使用。以下为Hadoop代码示例：

如果样本量不足或过大，可以调整sampleRate参数重新运行作业。必要时，可以采用分层抽样或系统抽样方法，以获得更加精准的样本。而后使用Hadoop的Job类来配置和运行抽样作业。可以指定输入路径（数据集所在的HDFS目录）和输出路径（样本数据输出的HDFS目录），并设置MapReduce作业的其他参数，如Mapper和Reducer的数量。将作业提交给Hadoop集群运行。Hadoop将自动分配资源，执行MapReduce作业，最终生成样本数据。以下为示例：

抽样完成后，律师团队可使用常用的工具如Excel或数据库管理系统进行进一步的审查。

二、对多个数据来源的源数据进行审查

律师在核对AI医疗所依赖的基础数据时，根据投资项目的特点，可能需要核对来自两个源数据来源（例如A医院和B医院）的数据授权协议，两个授权协议分别对应数据库C和数据库D, C 和D 是先后两个时间节点取得的，在AI医疗机构的大数据库中，两家医院的大量数据已经合并入一个大型数据库中，以Hadoop为例，律师需要以源数据授权协议的时间线查看未合并前C和D的存储时间和存储合并操作日志，以此确定授权协议和数据的对应关系，也是确认源数据来源和数据真实性的必要步骤。

要查看未合并前C和D的存储时间和存储合并操作的日志，Hadoop系统中可以通过以下步骤进行操作。这些步骤主要涉及HDFS（Hadoop分布式文件系统）和YARN（Hadoop的资源管理框架）来跟踪文件的存储时间和合并操作的日志。

首先，通过HDFS命令行工具可以列出文件在HDFS中的详细信息，包括文件的创建时间和最后修改时间。运行以下命令查看C和D两个数据库对应文件的存储时间：

hdfs dfs -ls /path/to/database/C/

hdfs dfs -ls /path/to/database/D/

该命令会列出指定路径下所有文件的详细信息，包括权限、大小、所有者、组、修改时间和文件名。你可以通过查看这些信息，确定C和D数据库中各文件的存储时间。

如果需要更详细的信息，可以使用Hadoop提供的hdfs fsck工具，它可以显示文件的状态以及块信息。

hdfs fsck /path/to/database/C/ -files -blocks -locations

hdfs fsck /path/to/database/D/ -files -blocks -locations

这将显示C和D数据库中每个文件的块详细信息，包括创建时间、修改时间、存储位置等。

如果C和D的合并是通过MapReduce作业实现的，可以通过YARN查看具体的作业日志。YARN负责管理Hadoop集群中的资源和作业执行，可以帮助找到合并操作的相关日志。首先，在YARN的ResourceManager Web界面（通常默认地址为http://:8088/cluster）中查找相关的MapReduce作业。你可以通过时间范围、作业名称或用户来过滤相关的作业。点击作业ID，进入详细信息页面，查看该作业的各个阶段日志，包括Map阶段、Reduce阶段和合并操作的详细日志。

如果存储合并操作直接在HDFS层面完成（例如通过hdfs dfs -mv或hdfs dfs -cp命令），律师可以查看HDFS的NameNode日志，这些日志记录了所有文件系统级别的操作。

在日志中搜索相关时间范围内的操作日志，以查找C和D数据库的合并操作记录。通过这些日志可以追踪到具体的数据操作，包括合并、插入、复制、删除等。

在核对存储时间和合并操作日志后，律师团队应将C和D数据库的存储时间记录下来，并核实这些时间与相应的数据授权协议的时间是否一致。通过YARN和HDFS日志确认数据合并操作的时间和操作过程，确保合并操作符合协议约定，并且数据在合并前后的完整性和一致性得到了保证。

通过上述方法，律师可以有效地查找和验证C和D数据库的数据存储时间和合并操作的相关日志。这些信息对于核对数据授权协议的合规性和审查数据处理过程中的潜在法律风险非常重要。这将帮助律师确保AI医疗系统的数据处理过程合法、透明，并符合数据授权协议中的相关规定。

三、律师应如何平衡虚假数据和必要的合成数据

AI医疗系统必须能够访问足够量的全面源数据用于训练和识别，这既可以提高其系统性能，也可以避免形成有缺陷的诊断结论并对医生造成误导，但是因为客观条件的限制，AI医疗机构可能无法一次性拥有大量专有源数据访问权限，因此程序员可能会通过算法人工合并大量数据并用于AI诊断系统的训练（数据增强），合成数据对程序员来讲是必要的，但是从律师的角度则违反了真实性的原则，律师应当依照以下原则进行审查，以在AI医疗项目中合理平衡数据真实性与技术需求的矛盾，确保项目在法律合规和技术有效性之间取得最佳平衡。

对于确定来源不明、未获授权或无法确认真实性的源数据，律师审查工作中会认为AI机构对投资人包含“欺诈”或故意“隐瞒”，在这个过程中，应注意区分不完整数据（真实诊疗过程中部分因为不能追溯原因造成的数据不完整）对律师审查的影响，但是对于合成数据，律师首先需要与程序员合作，确认合成数据在AI医疗系统中的具体应用场景，以及作为补充和扩展真实数据集，是否是在真实数据不足的情况下的必要性。

从律师审查的角度，AI医疗系统所依赖的数据应当以真实的源数据为主，因为这些数据直接影响AI模型的输出质量和准确性。律师应当要求项目在可能的情况下优先使用真实数据，并确保这些数据的合法性和可追溯性。合成数据应在数据集内以特殊字段以清晰标识为“合成”或“人工生成”，并应记录生成过程，以确保数据审计时能够区分真实源数据和合成数据。AI机构有义务向投资人或尽调律师报告使用了多少比例的合成数据，并解释其生成和使用的原因。

实践中，部分AI机构的数据来源可能包含了来自第三方的合并数据，律师应审查AI医疗机构、数据供应商以及技术开发方签订合同时，合同中是否明确规定关于数据真实性的要求，以及合成数据的使用范围和限制、合成数据导致AI系统出现错误诊断等法律责任。

合成数据在AI医疗整体数据集中所占的比例应受到严格控制，以确保模型仍然能够在大部分情况下基于真实数据而进行训练。对于使用了合成数据的AI医疗项目，律师应协助项目方准备详尽的合规报告，向监管机构如实报告数据使用情况，并解释合成数据的使用背景和必要性。合成数据的生成必须经过严格的验证，以确保其不会引入偏差或错误。

四、AI医疗源数据来源核实的其他方法

通常情况下AI医疗机构拿到的源数据来源比较复杂，有的是从中介机构拿到的，有的从医疗机构例如医院直接获取，而且在大数据层层传递的过程中，部分涉及个人信息的敏感数据都被屏蔽或者加密，这使得律师无法回溯这些数据并最终从医疗机构的医生处核实某个抽样医疗数据的真实性，数据隐私和数据来源的复杂性确实给律师带来了挑战。律师可以通过以下几种方式和线索来查证数据的真实性：

首先，要求AI医疗机构提供详细的数据供应链记录，包括数据的来源、传输过程以及中介机构的角色。透明的数据供应链有助于追溯数据的原始来源，尽管无法直接访问医院的数据，但可以检查数据传输和处理的合法性。

其次，对比AI医疗系统中使用的数据与其他独立的、公开的数据来源，如公共健康报告、医院公开的接诊人数、科室收入、特定疾病的发病率、治疗数据等，以检查源数据的一致性和合理性。

抽取样本数据进行质量检查，与真实的医疗案例进行对比，确认数据是否符合医学常识和实际情况。可以通过律师事务所雇佣的第三方独立医学专家进行这项检查，以确保数据的医学合理性。

再次，律师需要对数据中介机构进行背景调查，确认其合法性、信誉和历史记录。了解这些机构的操作模式和数据处理能力，有助于评估数据的真实性。

最后，根据获取的抽样信息，律师可以使用统计分析工具检测数据中的异常情况，如数据分布不正常、过于一致或不符合实际医疗情况的情况，这些异常可能表明数据存在问题。

五、律师尽调访谈的其他事项

源数据加密、访问控制和审计控制。
源数据标签的来源以及可靠性（例如肿瘤区域的划定）。
不同来源的非结构化数据处理过程。
整体计算架构是否嵌入来自第三方的算法模块。
医疗源数据获取授权以及医疗源数据（资产）买卖的两个维度差异。
评估AI医疗机构的数据备份和恢复策略，确保在数据损坏或丢失的情况下能够迅速恢复。备份数据应存储在安全的异地位置，并进行定期测试。
机构的防火墙、入侵检测和防御系统（IDS/IPS）、DDoS防护、VPN使用等。律师可以要求AI医疗机构提供这些措施的实施细节和审计记录。
评估机构是否对数据访问和修改行为进行全面的日志记录，并且是否设置了实时监控和报警系统，能够及时发现并响应异常行为。
律师应确认AI医疗机构是否有完善的事故响应计划，包括数据泄露处理程序、通知流程、法律责任的界定等。这些计划应经过定期演练，以确保在紧急情况下能够有效执行。
数据泄露处理机制：审查数据泄露的处理机制，确保其符合相关法律法规的要求，如数据泄露通知义务。
评估AI医疗机构是否定期对员工进行安全培训，确保所有员工了解数据安全的重要性，并能识别和应对常见的安全威胁，如网络钓鱼、社交工程攻击等。

Official guidance from China's National Medical Products Administration (NMPA) already permits AI to be used for assisted decision-making in medical acts, such as characterizing lesions, medication, and treatment, as well as for certain procedural functions that do not involve assisted decision-making. The guidance stresses review of the compliance of data sources, the reasonableness of data distribution, and validity and accuracy. The U.S. Food and Drug Administration (FDA) is likewise gradually refining its approval process for AI medical products, which gives the industry greater regulatory certainty and leaves broad room for lawful, compliant AI medical technology.

The AI medical market is expected to keep growing rapidly over the next few years. According to market research, its global compound annual growth rate (CAGR) could exceed 40%, with the market reaching hundreds of billions of US dollars by 2030.

AI applied to medical imaging analysis, including X-ray, CT, and MRI, has significantly improved diagnostic accuracy and efficiency, in some cases outperforming human doctors. Medical imaging analysis is currently the most mature application of AI in healthcare and a popular area for investment. Many startups and large enterprises are developing AI-driven diagnostic tools, and the market potential is considerable. The technology is still iterating rapidly, however, and its output may be inaccurate or biased. Investors need to stay alert to both technical and market risk.

BALANCE

The legal risk assessment for an investment in an AI medical project should be tailored to the characteristics of that project. Risk must be weighed against return, and the depth of data review must likewise be calibrated to the project.

According to reports, sellers on one data platform have openly offered genetic locus data for a type 2 diabetes cohort, scalp EEG, intracranial EEG, brain function data, and protein databases used for reverse virtual screening. Training AI decision-making on unverified public datasets obtained through illegal channels can introduce serious bias, and this is a focus of the lawyer's external review. Lawyers need to examine where the diagnostic data behind the system's large-scale training comes from, taking the characteristics of AI medicine into account. Because the work sits at the intersection of medicine, law, and computer science, lawyers should approach it from several angles to confirm that the training source data is lawful, privacy-protective, and authentic.

Lawyers should verify that the data used by the AI medical system was lawfully obtained and that patients or other data subjects consented to its use. This means checking the relevant informed consent forms and data sharing agreements to confirm that collection complied with the Personal Information Protection Law and other applicable industry rules. Lawyers should also assess whether de-identification (data masking) measures are adequate: medical data used by the system should be de-identified so that individuals cannot readily be identified even if the data leaks. Reviewing data sources is important work that requires a solid understanding of the AI medical system. Because the field is highly specialized, the legal team should independently engage third-party medical experts to help assess the substance of the medical data. Lawyers should require the system's developers or investors to provide detailed records of data sources, verify that the data comes from trustworthy medical institutions, and evaluate how the data is recorded, stored, and versioned.

1. How to Select Sample Data from Massive Datasets

When reviewing massive volumes of training data for AI medical systems, lawyers cannot check every record, so a sound sampling strategy is essential. In big data systems, data is usually split into blocks distributed across storage nodes, with each block holding part of the full set of records. To sample effectively, lawyers need to understand how these blocks are organized and distributed.

Starting from the data authorization documents supplied by the AI medical institution (issued by hospitals or intermediaries), lawyers should locate the corresponding data blocks (the source data) and randomly draw several blocks for review. Within each selected block, they then randomly select a number of records for detailed verification. This two-step approach helps produce a sample that represents the whole source dataset without favoring any particular region of it.
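The two-step draw described above (blocks first, then records within each block) can be sketched as follows. This is a minimal illustration, not the tooling used in any actual engagement: the block identifiers, record names, and the `two_stage_sample` function are hypothetical, and the blocks are assumed to fit in memory.

```python
import random

def two_stage_sample(blocks, n_blocks, n_records, seed=0):
    """Draw n_blocks blocks at random, then n_records records from each.

    blocks: dict mapping a block identifier to its list of records.
    A fixed seed makes the draw reproducible for the review file.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(blocks), min(n_blocks, len(blocks)))
    sample = {}
    for block_id in chosen:
        records = blocks[block_id]
        sample[block_id] = rng.sample(records, min(n_records, len(records)))
    return sample

# Hypothetical source data: 20 blocks of 50 records each.
blocks = {
    f"block-{i:03d}": [f"rec-{i}-{j}" for j in range(50)]
    for i in range(20)
}
picked = two_stage_sample(blocks, n_blocks=4, n_records=5, seed=42)
for block_id, records in picked.items():
    print(block_id, records)
```

Recording the seed alongside the sample lets another reviewer reproduce the same selection later.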

If the training data falls into distinct strata or categories (for example, different hospitals, regions, or disease types), stratified sampling can be used: data blocks are drawn at random within each stratum so that every stratum is represented.
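A minimal sketch of stratified sampling, assuming each record carries a field that identifies its stratum. The `hospital` field, the record layout, and the `stratified_sample` function are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=0):
    """Group records by stratum, then sample independently in each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    return {
        name: rng.sample(items, min(per_stratum, len(items)))
        for name, items in sorted(strata.items())
    }

# Hypothetical records from two hospitals of unequal size.
records = [{"id": i, "hospital": "A" if i % 3 else "B"} for i in range(30)]
by_hospital = stratified_sample(
    records, stratum_of=lambda r: r["hospital"], per_stratum=5, seed=7
)
for name, items in by_hospital.items():
    print(name, [r["id"] for r in items])
```

Sampling a fixed number per stratum keeps small sources visible. Sampling in proportion to stratum size is an equally valid design when the goal is to mirror the overall dataset.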

For data stored in order (for example, sorted by diagnosis time or patient ID), systematic sampling can be used: data blocks are drawn at a fixed interval (for example, 1 in every 1,000 blocks) so that the sample covers the full range of the source dataset.
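Systematic sampling reduces to picking a random starting position and then stepping through the ordered blocks at a fixed interval. The sketch below returns block positions only. The function name and the figure of 10,000 blocks are assumptions for illustration.

```python
import random

def systematic_sample(n_items, interval, seed=0):
    """Return positions start, start + interval, ... below n_items.

    The random start within the first interval avoids always picking
    the same positions (for example, the first block of every batch).
    """
    rng = random.Random(seed)
    start = rng.randrange(interval)
    return list(range(start, n_items, interval))

# Hypothetical dataset of 10,000 ordered blocks, sampling 1 in every 1,000.
positions = systematic_sample(n_items=10_000, interval=1_000, seed=1)
print(positions)
```

If the stored order has a periodic pattern (for example, one hospital's uploads arriving every thousandth block), a fixed interval can over-sample or miss it, which is one reason to combine this method with the stratified approach.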

Next, determine the basic sampling unit (for example, data blocks, time windows, or patient groups) and, from the total data volume and the desired precision, the sample size to be reviewed. In general, a larger sample makes the findings more representative, but this must be balanced against the lawyers' workload and the review deadline set by the investor.
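The article does not prescribe a formula for sample size. One common statistical convention, shown here purely as an assumption, is Cochran's formula for estimating a proportion (such as the share of records with a defect), with a finite population correction. The confidence level, margin of error, and function name are illustrative choices.

```python
import math

def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran's sample size for a proportion, with finite population correction.

    population: total number of sampling units (records or blocks)
    margin:     acceptable margin of error (0.05 means plus or minus 5 points)
    z:          z-score for the confidence level (1.96 is roughly 95%)
    p:          expected proportion; 0.5 is the most conservative choice
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

for total in (1_000, 1_000_000):
    print(total, required_sample_size(total))
```

The result barely grows once the population is large, which is why reviewing a few hundred well-chosen records can be defensible even for very large datasets. A tighter margin raises the number quickly.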

The sample data blocks are then extracted according to the chosen sampling method and sample size. Scripts or big data tools can automate the extraction and help ensure randomness and fairness. Hadoop is used as the example here.


Hadoop's MapReduce framework is a core tool for processing large-scale data, and sampling can be implemented by writing Map and Reduce functions. The Map function processes each record, for example by generating a random number to decide whether the record is selected. Selected records are marked as samples and written to the intermediate output. The Reduce function receives the sampled records from the Map phase and aggregates them. If the sample is too large, a further random subset can be taken before output. The final sample can be written to a designated HDFS directory for the lawyers' subsequent review.
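The Map and Reduce logic described above can be imitated in a few lines of plain Python to show the idea. This is an in-process sketch, not Hadoop code and not the article's original example: `map_sample`, `reduce_sample`, and `run_sampling_job` are hypothetical names, `sample_rate` stands in for the sampleRate parameter the article mentions, and the input splits are ordinary lists rather than HDFS blocks.

```python
import random

def map_sample(records, sample_rate, rng):
    """Map step: keep each record with probability sample_rate."""
    for record in records:
        if rng.random() < sample_rate:
            yield ("sample", record)

def reduce_sample(pairs, max_samples, rng):
    """Reduce step: gather selected records, trimming at random if too many."""
    selected = [record for _key, record in pairs]
    if len(selected) > max_samples:
        selected = rng.sample(selected, max_samples)
    return selected

def run_sampling_job(splits, sample_rate, max_samples, seed=0):
    """Run the map step over every split, then a single reduce step."""
    rng = random.Random(seed)
    intermediate = []
    for split in splits:
        intermediate.extend(map_sample(split, sample_rate, rng))
    return reduce_sample(intermediate, max_samples, rng)

# Hypothetical input: 4 splits of 2,500 records each.
splits = [[f"rec-{s}-{i}" for i in range(2_500)] for s in range(4)]
sample = run_sampling_job(splits, sample_rate=0.01, max_samples=50, seed=3)
print(len(sample), sample[:3])
```

On a real cluster each mapper would draw its random numbers independently, and the cap in the reduce step only holds globally if a single reducer is used.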

If the resulting sample is too small or too large, the sampleRate parameter can be adjusted and the job rerun. Where necessary, stratified or systematic sampling can be used to obtain a more precise sample. The sampling job is configured and run with Hadoop's Job class, which specifies the input path (the HDFS directory holding the dataset), the output path (the HDFS directory for the sample), and other job parameters such as the number of reduce tasks (the number of map tasks follows from how the input is split). Once submitted, the job is scheduled on the Hadoop cluster, which allocates resources, executes it, and produces the sample data.

After sampling, the legal team can review the sample with familiar tools such as Excel or a database management system.

2. Reviewing Source Data from Multiple Data Sources

When verifying the underlying data on which an AI medical system relies, lawyers may, depending on the project, need to check data authorization agreements from two sources (for example, Hospital A and Hospital B). Suppose the two agreements correspond to Database C and Database D, which were obtained at two successive points in time, and that large volumes of data from both hospitals have since been merged into a single large database held by the AI medical institution. Taking Hadoop as an example, lawyers need to follow the timeline of the authorization agreements and examine the storage times of C and D before the merge, together with the logs of the merge operation. This establishes which data corresponds to which agreement, and it is a necessary step in confirming the provenance and authenticity of the source data.

In a Hadoop system, the pre-merge storage times of C and D and the merge logs can be examined with the steps below, which mainly rely on HDFS (the Hadoop Distributed File System) and YARN (Hadoop's resource management framework).

First, the HDFS command-line tools can list details of the files stored in HDFS, including each file's last modification time (HDFS tracks modification and access times but no separate creation time). Run the following commands to view the storage times of the files behind Databases C and D:

hdfs dfs -ls /path/to/database/C/

hdfs dfs -ls /path/to/database/D/

Each command lists every entry under the given path with its permissions, replication factor, owner, group, size, modification date and time, and path. These timestamps indicate when the files in Databases C and D were stored.
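To compare many files against agreement dates, the listing can be captured to a text file and parsed. The sketch below assumes the usual column layout of `hdfs dfs -ls` output. The sample listing, owner name, and file names are made up for illustration, and `parse_ls_line` is a hypothetical helper.

```python
from datetime import datetime

def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls` output, or return None for other lines.

    Expected columns: permissions, replication, owner, group, size,
    date, time, path.
    """
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0][0] not in "-d":
        return None
    perms, _replication, owner, _group, size, day, clock, path = parts
    return {
        "path": path,
        "is_dir": perms.startswith("d"),
        "owner": owner,
        "size": int(size),
        "modified": datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M"),
    }

# Illustrative listing only; real output comes from the commands above.
LS_OUTPUT = """Found 2 items
-rw-r--r--   3 etl supergroup  134217728 2023-03-01 10:15 /path/to/database/C/part-00000
-rw-r--r--   3 etl supergroup   52428800 2023-03-01 10:17 /path/to/database/C/part-00001
"""

entries = []
for line in LS_OUTPUT.splitlines():
    entry = parse_ls_line(line)
    if entry is not None:
        entries.append(entry)

for entry in entries:
    print(entry["modified"], entry["size"], entry["path"])
```

Paths containing spaces survive because the split stops after the seventh column.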

If more detail is needed, Hadoop's hdfs fsck tool reports file status and block information:

hdfs fsck /path/to/database/C/ -files -blocks -locations

hdfs fsck /path/to/database/D/ -files -blocks -locations

This displays, for each file in Databases C and D, its size, its block IDs and lengths, the replication status, and the DataNodes on which each block is stored. It does not report timestamps, so storage times still come from the -ls listing.

If C and D were merged by a MapReduce job, the job logs can be viewed through YARN, which manages resources and job execution in a Hadoop cluster. First, locate the relevant MapReduce job in the YARN ResourceManager web interface (default address usually http://:8088/cluster), filtering by time range, job name, or user. Then click the job ID to open its detail page and review the logs for each stage, including the Map phase, the Reduce phase, and the merge operation itself. The ResourceManager keeps only a limited history of finished applications, so for older jobs the MapReduce JobHistory Server may need to be consulted instead.

If the merge was carried out directly at the HDFS level (for example, with hdfs dfs -mv or hdfs dfs -cp), lawyers can consult the NameNode logs, in particular the HDFS audit log where it is enabled, which records file system-level operations.

Search the logs for the relevant time range to find the records of the merge of Databases C and D. These logs make it possible to trace specific data operations, including merges, insertions, copies, and deletions.
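Searching a large log by hand is error-prone, so a short script can filter it by path, operation, and time window. The sketch below assumes that HDFS audit logging is enabled and that entries use the common key=value layout. The exact format depends on the Hadoop version and logging configuration, and the sample lines, user name, and paths are invented for illustration.

```python
import re
from datetime import datetime

FIELD = re.compile(r"(\w+)=(\S+)")

def parse_audit_line(line):
    """Extract the timestamp and key=value fields from one audit entry."""
    if "FSNamesystem.audit" not in line:
        return None
    fields = dict(FIELD.findall(line))
    fields["time"] = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    return fields

def find_operations(lines, path_prefix, start, end,
                    commands=("rename", "create", "delete")):
    """Return entries touching path_prefix within [start, end]."""
    hits = []
    for line in lines:
        entry = parse_audit_line(line)
        if entry is None:
            continue
        touches = (entry.get("src", "").startswith(path_prefix)
                   or entry.get("dst", "").startswith(path_prefix))
        in_window = start <= entry["time"] <= end
        if touches and in_window and entry.get("cmd") in commands:
            hits.append(entry)
    return hits

# Illustrative entries only, written in the assumed layout.
AUDIT_LINES = [
    "2023-03-05 02:10:11,123 INFO FSNamesystem.audit: allowed=true\tugi=etl (auth:SIMPLE)\tip=/10.0.0.5\tcmd=rename\tsrc=/path/to/database/C\tdst=/path/to/merged/C\tperm=etl:supergroup:rwxr-xr-x\tproto=rpc",
    "2023-03-05 02:11:00,456 INFO FSNamesystem.audit: allowed=true\tugi=etl (auth:SIMPLE)\tip=/10.0.0.5\tcmd=getfileinfo\tsrc=/path/to/database/C\tdst=null\tperm=null\tproto=rpc",
    "2023-06-01 09:00:00,789 INFO FSNamesystem.audit: allowed=true\tugi=etl (auth:SIMPLE)\tip=/10.0.0.5\tcmd=delete\tsrc=/path/to/database/D\tdst=null\tperm=null\tproto=rpc",
]

hits = find_operations(AUDIT_LINES, "/path/to/database/",
                       datetime(2023, 3, 1), datetime(2023, 3, 31, 23, 59, 59))
for hit in hits:
    print(hit["time"], hit["cmd"], hit["src"], "->", hit["dst"])
```

The default command list is a starting point. A copy typically appears as reads of the source and creates at the destination rather than as a single entry, so the filter may need widening for a given cluster.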

After checking the storage times and merge logs, the legal team should record the storage times of Databases C and D and verify that they are consistent with the dates of the corresponding data authorization agreements. The YARN and HDFS logs should also be used to confirm when and how the merge was performed, that it complied with the agreements, and that the data remained complete and consistent before and after the merge.
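The consistency check between storage times and agreement dates can be expressed as a simple comparison. In this sketch the agreement periods, storage dates, and the `check_against_agreement` function are all hypothetical, and the rule applied (data must be stored within the authorization period) is an assumption that a real review would take from the agreement text.

```python
from datetime import date

def check_against_agreement(stored_on, effective, expires=None):
    """Classify a storage date relative to an authorization period."""
    if stored_on < effective:
        return "stored before authorization took effect"
    if expires is not None and stored_on > expires:
        return "stored after authorization expired"
    return "within authorization period"

# Hypothetical agreement periods and observed storage dates.
agreements = {
    "C": {"effective": date(2023, 2, 15), "expires": date(2025, 2, 14)},
    "D": {"effective": date(2023, 9, 1), "expires": None},
}
observed = {"C": date(2023, 3, 1), "D": date(2023, 8, 20)}

findings = {
    name: check_against_agreement(observed[name], **agreements[name])
    for name in agreements
}
for name, finding in findings.items():
    print(f"Database {name}: {finding}")
```

In this made-up case Database D would be flagged, because its files were stored before the Hospital B authorization took effect.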

With these methods, lawyers can locate and verify the storage times of Databases C and D and the logs of the merge. This information is important for checking compliance with the data authorization agreements and for identifying legal risks in data processing, and it helps lawyers confirm that the AI medical system's data processing is lawful, transparent, and consistent with the terms of those agreements.

3. How Lawyers Should Weigh False Data Against Necessary Synthetic Data

An AI medical system needs access to a sufficient volume of comprehensive source data for training and recognition. This both improves system performance and avoids flawed diagnostic conclusions that could mislead doctors. In practice, however, an AI medical institution may not be able to obtain access to large amounts of proprietary source data all at once, so its programmers may generate large amounts of data algorithmically and use it to train the diagnostic system (data augmentation). To programmers, synthetic data is a necessity. From a lawyer's perspective, it conflicts with the principle of authenticity. Lawyers should review it under the following principles so that the project strikes a reasonable balance between data authenticity and technical need, and between legal compliance and technical effectiveness.

Where source data is of unknown origin, unauthorized, or cannot be confirmed as authentic, the lawyer's review will treat it as indicating “fraud” or intentional “concealment” by the AI institution toward the investor. Here, care should be taken to distinguish incomplete data (gaps in genuine diagnosis and treatment records whose cause can no longer be traced), which affects the review differently. Synthetic data is a separate case: lawyers first need to work with the programmers to confirm the specific scenarios in which it is used in the AI medical system and whether, as a supplement to and extension of the real dataset, it is genuinely necessary because real data is insufficient.

From the standpoint of legal review, an AI medical system should rely primarily on real source data, because that data directly determines the quality and accuracy of the model's output. Lawyers should require the project to prioritize real data wherever possible and to ensure that it is lawful and traceable. Synthetic data should be clearly marked as “synthetic” or “artificially generated” in a dedicated field of the dataset, and its generation process should be documented, so that an audit can tell real source data from synthetic data. The AI institution is obliged to report to the investor or the due diligence lawyers what proportion of the data is synthetic and to explain why it was generated and used.
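Where records carry such a marker field, the synthetic share can be computed directly from the sample. The sketch below assumes a boolean field named `is_synthetic`, which is a hypothetical name, and it counts unlabeled records separately because a missing label is itself a finding.

```python
def synthetic_share(records, flag_field="is_synthetic"):
    """Summarize how many labeled records are marked synthetic."""
    labeled = [r for r in records if flag_field in r]
    synthetic = sum(1 for r in labeled if r[flag_field])
    return {
        "total": len(records),
        "unlabeled": len(records) - len(labeled),
        "synthetic": synthetic,
        "share": synthetic / len(labeled) if labeled else 0.0,
    }

# Hypothetical sample: 7 real, 2 synthetic, 1 record with no label at all.
sampled = (
    [{"id": i, "is_synthetic": False} for i in range(7)]
    + [{"id": 7, "is_synthetic": True}, {"id": 8, "is_synthetic": True}]
    + [{"id": 9}]
)
report = synthetic_share(sampled)
print(report)
```

The share computed from a sample can then be compared with the proportion the AI institution has reported for the full dataset.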

In practice, some AI institutions' data sources may include merged data from third parties. Lawyers should review whether the contracts among the AI medical institution, its data suppliers, and its technology developers expressly address data authenticity requirements, the permitted scope and limits of synthetic data use, and liability for misdiagnoses caused by synthetic data.

The share of synthetic data in the overall dataset should be strictly controlled so that the model is still trained predominantly on real data. For projects that use synthetic data, lawyers should help the project party prepare a detailed compliance report that truthfully discloses data usage to regulators and explains the background to, and necessity of, the synthetic data. Synthetic data generation must also be rigorously validated to ensure that it does not introduce bias or errors.

4. Other Methods for Verifying AI Medical Source Data Provenance

The source data held by an AI medical institution usually has a complicated provenance: some comes through intermediaries, and some comes directly from medical institutions such as hospitals. As the data passes from hand to hand, sensitive personal information is often masked or encrypted, so lawyers cannot trace a sampled record back to the treating doctor at the originating institution to verify its authenticity. Data privacy and complex sourcing therefore pose a real challenge. Lawyers can use the following methods and leads to verify authenticity:

First, require the AI medical institution to provide detailed data supply chain records covering the sources of the data, how it was transferred, and the role of each intermediary. A transparent supply chain helps trace data to its origin: even without direct access to hospital data, the lawfulness of its transfer and processing can be checked.

Second, compare the data used in the AI medical system with independent public sources, such as public health reports, hospitals' published patient volumes, department revenues, disease incidence rates, and treatment data, to check that the source data is consistent and plausible.

Third, draw sample data for a quality check and compare it with real medical cases to confirm that it is consistent with medical common sense and actual practice. Independent third-party medical experts engaged by the law firm can carry out this check to confirm that the data is medically plausible.

Fourth, conduct background checks on the data intermediaries to confirm their legal standing, reputation, and track record. Understanding how these intermediaries operate and what data processing capabilities they have helps in assessing the authenticity of the data.

Finally, lawyers can run statistical analysis on the sampled data to detect anomalies, such as abnormal distributions, values that are suspiciously uniform, or patterns inconsistent with real medical practice. Such anomalies may indicate a problem with the data.
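Two of these checks, near-identical values and one value dominating a field, can be screened with the standard library. The sketch below is illustrative only: the thresholds are arbitrary assumptions, the measurements are invented, and `screen_numeric_field` is a hypothetical helper. A flag is a prompt for the medical expert to look closer, not a conclusion.

```python
import statistics
from collections import Counter

def screen_numeric_field(values, min_rel_spread=0.01, max_top_share=0.5):
    """Flag a numeric field whose values look suspiciously uniform."""
    flags = []
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if mean != 0 and spread / abs(mean) < min_rel_spread:
        flags.append("values are almost identical")
    top_value, top_count = Counter(values).most_common(1)[0]
    if top_count / len(values) > max_top_share:
        flags.append(
            f"value {top_value} accounts for over {max_top_share:.0%} of records"
        )
    return flags

# Invented measurements: one plausible series, one suspiciously uniform.
plausible = [5.1, 6.3, 7.8, 5.9, 9.2, 6.6, 8.4, 7.1]
suspicious = [7.0] * 9 + [7.01]

print(screen_numeric_field(plausible))
print(screen_numeric_field(suspicious))
```

Appropriate thresholds depend on the clinical variable, so they are best set with the medical expert rather than fixed in advance.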

5. Other Matters for Lawyer Due Diligence Interviews

Encryption, access control, and audit controls for source data.
The source and reliability of source data labels (for example, the delineation of tumor regions).
How unstructured data from different sources is processed.
Whether the overall computing architecture embeds algorithm modules from third parties.
The distinction between obtaining authorization to use medical source data and buying or selling that data as an asset.
Data backup and recovery: assess whether the institution's strategy allows rapid recovery if data is damaged or lost. Backups should be stored at a secure off-site location and tested regularly.
Network security measures, such as firewalls, intrusion detection and prevention systems (IDS/IPS), DDoS protection, and VPN use. Lawyers can ask the institution for implementation details and audit records for these measures.
Logging and monitoring: assess whether the institution logs all data access and modification, and whether real-time monitoring and alerting are in place to detect and respond to abnormal behavior promptly.
Incident response: confirm that the institution has a complete incident response plan covering data breach handling procedures, notification processes, and the allocation of legal responsibility. The plan should be rehearsed regularly so that it can be executed effectively in an emergency.
Data breach handling mechanism: review whether it meets the requirements of applicable laws and regulations, such as data breach notification obligations.
Security training: assess whether the institution trains employees regularly so that all staff understand the importance of data security and can recognize and respond to common threats such as phishing and social engineering.