Lawyer Practice in Due Diligence for AI Medical Projects
Official guidance documents from the National Medical Products Administration already permit AI to be used for assisted decision-making in medical activities such as determining the nature of lesions, medication, and treatment, as well as for some procedural, non-assisted decision-making functions. The guidance places emphasis on reviewing the compliance of data sources, the reasonableness of data distribution, and validity and accuracy. The U.S. Food and Drug Administration (FDA) is also gradually improving its approval process for AI medical products, which gives the industry a more standardized footing and leaves broad room for legally compliant AI medical technology.
The AI medical market is expected to maintain rapid growth in the coming years. According to market research, the global AI medical market’s compound annual growth rate (CAGR) may reach over 40%, with the market size reaching hundreds of billions of dollars by 2030.
The application of AI to medical imaging analysis, such as X-rays, CT, and MRI, has significantly improved the accuracy and efficiency of diagnosis, even surpassing human doctors in some cases. Medical imaging analysis is currently the most mature application of AI in medicine and a popular investment field. Many startups and large enterprises are developing AI-driven diagnostic tools, and the market potential is enormous. However, the technology is still iterating rapidly and may produce inaccurate or biased results. Investors need to be vigilant about the dual risks of technology and market.
BALANCE
Legal risk assessment for investment in AI medical projects should be tailored to the characteristics of those projects: risks must be weighed against benefits, and the depth of data review must likewise be calibrated to the project.
According to reports, sellers on a certain data platform have publicly offered gene locus data from Type 2 diabetes patient populations, scalp EEG, intracranial EEG, brain function data, protein databases used for reverse virtual screening, and similar datasets. If unverified or illegally obtained datasets are used to train AI decision-making, the resulting models may be seriously biased, which is why this is a focus of lawyers’ external review. Lawyers need to review the sources of the diagnostic data used to train an AI medical system in light of that system’s characteristics. In practice, because the work spans medicine, law, and computer science, lawyers should approach it from multiple angles to ensure the legality, privacy protection, and authenticity of the training source data.
Lawyers should verify whether the data used by the AI medical system was legally obtained and whether patients or other data subjects consented to its use. This means checking the relevant informed consent forms and data sharing agreements to ensure that data collection complies with the Personal Information Protection Law and other relevant industry rules. Lawyers should also evaluate whether de-identification or anonymization measures are in place: medical data used by AI systems should be processed so that, even if the data is leaked, individuals cannot easily be identified. Reviewing data sources is important work that also requires a thorough understanding of the AI medical system. Because the field is highly specialized, the legal team should independently engage third-party medical experts to help assess the specific content of the medical data. Lawyers should require the AI medical system’s developers or investors to provide detailed records of data sources and should verify whether the data comes from trustworthy medical institutions. How the data is recorded, stored, and version-managed also needs to be evaluated.
1. How to Select Sample Data from Massive Datasets
When reviewing massive training data for AI medical purposes, it is impossible for lawyers to verify each piece of data one by one, so a reasonable sampling strategy is necessary. In big data systems, data is usually divided into multiple data blocks distributed across different storage nodes. Each data block contains a portion of complete data records. To conduct effective sampling, lawyers need to understand the organization and distribution of these data blocks.
Lawyers should use the data authorization documents (from hospitals or intermediary institutions) provided by the AI medical institution to locate the corresponding data blocks (source data), and then randomly draw several of those blocks for review. Within each selected block, several records are randomly chosen for detailed verification. This approach helps obtain a sample that is representative of the entire source dataset rather than skewed toward any particular part of it.
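The two-stage draw described above (random blocks first, then random records inside each block) can be sketched in a few lines of Python. This is only an illustration under assumptions: the block IDs, record IDs, and the function name `two_stage_sample` are hypothetical, and a real review would read block contents from the institution’s storage rather than from an in-memory dictionary.

```python
import random

def two_stage_sample(blocks, n_blocks, n_records, seed=None):
    """Pick n_blocks blocks at random, then n_records records within each.

    `blocks` maps a block ID to the list of records stored in that block.
    """
    # A fixed seed makes the draw reproducible, so it can be documented in the audit file.
    rng = random.Random(seed)
    chosen = rng.sample(sorted(blocks), min(n_blocks, len(blocks)))
    return {
        block_id: rng.sample(blocks[block_id], min(n_records, len(blocks[block_id])))
        for block_id in chosen
    }

# Hypothetical data: 20 blocks holding 50 record IDs each.
blocks = {f"blk_{i:03d}": [f"rec_{i:03d}_{j:02d}" for j in range(50)] for i in range(20)}
sample = two_stage_sample(blocks, n_blocks=4, n_records=5, seed=2024)
```

Recording the seed alongside the sample lets another reviewer repeat exactly the same draw later.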
If the training data falls into distinct strata or categories (such as different hospitals, regions, or disease types), stratified sampling can be used: data blocks are drawn at random from each stratum so that every stratum is represented in the review.
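As a rough sketch of stratified sampling, the Python below groups a block catalog by stratum and draws the same number of blocks from each group. The catalog structure, the `hospital` field, and the function name `stratified_sample` are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(blocks, stratum_of, per_stratum, seed=None):
    """Draw up to per_stratum blocks at random from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for block in blocks:
        by_stratum[stratum_of(block)].append(block)
    return {
        stratum: rng.sample(members, min(per_stratum, len(members)))
        for stratum, members in sorted(by_stratum.items())
    }

# Hypothetical catalog: 12 blocks from hospital A, 6 from B, 2 from C.
catalog = [
    {"block_id": f"blk_{i:03d}", "hospital": hospital}
    for i, hospital in enumerate(["A"] * 12 + ["B"] * 6 + ["C"] * 2)
]
picked = stratified_sample(catalog, lambda b: b["hospital"], per_stratum=3, seed=7)
```

A small stratum (hospital C here, with only two blocks) is reviewed in full rather than being missed, which is the point of stratifying.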
For data stored in order (for example, sorted by diagnosis time or patient ID), systematic sampling can be used: data blocks are extracted at a fixed interval (for example, 1 out of every 1000 data blocks) so that the sample covers the whole distribution of the source dataset.
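Systematic sampling reduces to choosing a random starting block and then stepping through the ordered blocks at a fixed interval. The sketch below assumes blocks can be addressed by a zero-based index; the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(n_blocks, interval, seed=None):
    """Return block indices start, start + interval, ... using a random start."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    start = random.Random(seed).randrange(interval)  # random offset within the first interval
    return list(range(start, n_blocks, interval))

# 1 out of every 1000 blocks from a hypothetical store of 10,000 ordered blocks.
indices = systematic_sample(n_blocks=10_000, interval=1000, seed=1)
```

One caution: if the stored order has a periodic pattern that happens to line up with the interval, the sample can be skewed, so the interval is worth checking against how the data was written.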
First determine the basic sampling unit (such as data blocks, time windows, or patient groups). Then, based on the total data volume and the desired sampling precision, determine the sample size to be reviewed. Generally, the larger the sample, the more representative the review results, but this must be balanced against the lawyers’ workload and the review period set by investors.
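One common way to turn "total data volume and desired precision" into a number is Cochran’s sample-size formula for estimating a proportion, with a finite-population correction. The article does not prescribe this formula; it is offered here as an assumption, and the defaults (95% confidence, 5% margin of error, worst-case proportion of 0.5) are illustrative only.

```python
import math

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Records to review to estimate a proportion within +/- margin.

    population: total number of sampling units
    margin:     acceptable margin of error (0.05 = 5 percentage points)
    z:          z-score for the confidence level (1.96 is about 95%)
    p:          expected proportion; 0.5 is the most conservative choice
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)        # finite-population correction
    return math.ceil(n)

print(sample_size(1_000_000))  # 385
print(sample_size(1_000))      # 278
```

The required sample grows very slowly with the size of the dataset, which is why sampling stays workable even for very large training sets. Tightening the margin of error is what drives the workload up.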
Based on the chosen sampling method and sample size, the corresponding sample data blocks are then extracted. Scripts or big data tools such as Hadoop can be used to automate the sampling process, ensuring randomness and fairness.
Hadoop’s MapReduce framework is a core tool for processing large-scale data, and sampling can be implemented by writing Map and Reduce functions. The Map function processes each record, for example by generating a random number to decide whether the record is selected as a sample; selected records are marked as samples and written to the intermediate results. The Reduce function receives the sampled records from the Map phase and aggregates them. If the sample is too large, a further random selection may be needed before output. The final sample data can be written to a specific directory in HDFS for subsequent lawyer review.
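The Map and Reduce logic described above can be illustrated with plain Python. This is a local simulation only, not Hadoop API code: `map_sample`, `reduce_sample`, and `max_samples` are hypothetical names, and `sample_rate` stands in for the sampling-rate parameter of the job.

```python
import random

def map_sample(records, sample_rate, rng):
    """Map step: emit each record with probability sample_rate."""
    for record in records:
        if rng.random() < sample_rate:
            # A single shared key sends every sampled record to one reducer.
            yield ("sample", record)

def reduce_sample(pairs, max_samples, rng):
    """Reduce step: gather the emitted records and cap the total at max_samples."""
    collected = [record for _key, record in pairs]
    if len(collected) > max_samples:
        collected = rng.sample(collected, max_samples)  # second random cut
    return collected

rng = random.Random(42)
records = [f"record-{i}" for i in range(100_000)]
samples = reduce_sample(
    map_sample(records, sample_rate=0.01, rng=rng), max_samples=500, rng=rng
)
```

With a rate of 0.01 over 100,000 records the Map step emits roughly 1,000 records, so the Reduce step trims the result to the 500-record cap.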
Hadoop’s Job class is then used to configure and run the sampling job. The input path (the HDFS directory where the dataset is located) and the output path (the HDFS directory for the sample data) can be specified, along with other parameters of the MapReduce job such as the number of Mappers and Reducers. Once the job is submitted to the Hadoop cluster, Hadoop automatically allocates resources, executes the MapReduce job, and generates the sample data. If the resulting sample is too small or too large, the sampleRate parameter can be adjusted and the job rerun. If necessary, stratified or systematic sampling can be used to obtain a more precise sample.
After sampling is completed, the lawyer team can use common tools such as Excel or database management systems for further review.
2. Reviewing Source Data from Multiple Data Sources
When lawyers verify the underlying data relied upon by an AI medical system, the characteristics of the investment project may make it necessary to verify data authorization agreements from two data sources (for example, Hospital A and Hospital B). The two authorization agreements correspond to Database C and Database D respectively, and C and D were obtained at two successive points in time. In the AI medical institution’s main database, large amounts of data from both hospitals have been merged into one large database. Taking Hadoop as an example, lawyers need to follow the timeline of the data authorization agreements and examine both the storage times of C and D before merging and the logs of the merge operation, in order to establish which data corresponds to which agreement. This is a necessary step in confirming the provenance and authenticity of the source data.
The storage times of C and D before merging and the logs of the merge operation can be examined in a Hadoop system through the following steps. These steps mainly involve HDFS (Hadoop Distributed File System) and YARN (Hadoop’s resource management framework), which are used to track file storage times and merge operation logs.
First, the HDFS command-line tools can list detailed information about files in HDFS, including each file’s last modification time (the listing does not show a separate creation time). Run the following commands to view the storage time of the files corresponding to Databases C and D:
hdfs dfs -ls /path/to/database/C/
hdfs dfs -ls /path/to/database/D/
This command lists detailed information for all files in the specified path, including permissions, size, owner, group, modification time, and filename. The modification time indicates when the files in Databases C and D were stored, provided the files have not been rewritten or copied since.
If more detailed information is needed, the hdfs fsck tool provided by Hadoop can be used to display file status and block information.
hdfs fsck /path/to/database/C/ -files -blocks -locations
hdfs fsck /path/to/database/D/ -files -blocks -locations
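This displays block-level information for each file in Databases C and D, including block IDs, block sizes, replication status, and the DataNodes where the blocks are stored. It does not report timestamps, so storage times should still be taken from the hdfs dfs -ls output.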
If C and D were merged by MapReduce jobs, the specific job logs can be viewed through YARN. YARN manages resources and job execution in Hadoop clusters and can help locate logs related to the merge. First, find the relevant MapReduce jobs in YARN’s ResourceManager web interface (by default usually at http://<ResourceManager-host>:8088/cluster). Jobs can be filtered by time range, job name, or user. Clicking a job ID opens its detail page, where the logs of each stage can be viewed, including the Map phase, the Reduce phase, and the detailed logs of the merge operation. For jobs that finished some time ago, the logs may only be available through the MapReduce JobHistory Server or aggregated YARN logs, depending on the cluster’s retention settings.
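If the merge was carried out directly at the HDFS level (for example, through hdfs dfs -mv or hdfs dfs -cp commands), lawyers can examine the NameNode’s logs, in particular the HDFS audit log where audit logging is enabled, which records file system-level operations such as create, rename, and delete together with the user and time.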
Search operation logs within the relevant time range to find merge operation records of Databases C and D. Through these logs, specific data operations including merges, insertions, copies, deletions, etc. can be traced.
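Searching an exported audit log for operations on C and D within a time window can be scripted. The sketch below assumes the commonly seen HDFS audit log layout (a timestamp, the marker `FSNamesystem.audit:`, then tab-separated `key=value` fields such as `cmd`, `src`, and `dst`); the actual layout depends on the cluster’s logging configuration and should be confirmed with the institution’s engineers. The sample lines, user name, and paths are hypothetical.

```python
from datetime import datetime

def parse_audit_line(line):
    """Split one audit line into a timestamp and a dict of key=value fields."""
    head, _, body = line.partition("FSNamesystem.audit:")
    if not body:
        return None  # not an audit line
    stamp = datetime.strptime(head[:19], "%Y-%m-%d %H:%M:%S")
    fields = dict(p.split("=", 1) for p in body.strip().split("\t") if "=" in p)
    return stamp, fields

def merge_events(lines, prefixes, start, end, commands=("rename", "create", "delete")):
    """Return write-type operations touching the given path prefixes in [start, end]."""
    events = []
    for line in lines:
        parsed = parse_audit_line(line)
        if parsed is None:
            continue
        stamp, f = parsed
        touches = any((f.get(k) or "").startswith(prefixes) for k in ("src", "dst"))
        if start <= stamp <= end and f.get("cmd") in commands and touches:
            events.append((stamp, f["cmd"], f.get("src"), f.get("dst")))
    return events

def audit(ts, cmd, src, dst):
    """Build a hypothetical audit line for the demonstration below."""
    return (f"{ts},000 INFO FSNamesystem.audit: allowed=true\tugi=etl (auth:SIMPLE)\t"
            f"ip=/10.0.0.5\tcmd={cmd}\tsrc={src}\tdst={dst}\tperm=null\tproto=rpc")

log_lines = [
    audit("2023-03-02 09:00:01", "rename", "/path/to/database/C/part-00000", "/path/to/merged/part-00000"),
    audit("2023-03-02 09:05:00", "open", "/path/to/database/C/part-00001", "null"),
    audit("2023-06-01 10:00:00", "rename", "/path/to/database/D/part-00000", "/path/to/merged/part-00100"),
    audit("2023-03-05 11:00:00", "rename", "/tmp/x", "/tmp/y"),
]
events = merge_events(
    log_lines,
    prefixes=("/path/to/database/C/", "/path/to/database/D/"),
    start=datetime(2023, 3, 1),
    end=datetime(2023, 3, 31, 23, 59, 59),
)
```

In this made-up log only the first line qualifies: the read (`open`) is ignored, the June rename falls outside the window, and the last rename touches neither database.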
After verifying storage time and merge operation logs, the lawyer team should record the storage time of Databases C and D and verify whether these times are consistent with the corresponding data authorization agreements. Confirm the time and operation process of data merge operations through YARN and HDFS logs, ensuring that merge operations comply with agreement stipulations and that data integrity and consistency before and after merging are guaranteed.
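The comparison between storage times and the authorization agreement can also be made mechanical. The sketch below parses saved `hdfs dfs -ls` output and flags files whose modification time falls outside an agreement’s validity window. The listing, the user and group names, and the window dates are hypothetical, and the column layout assumed is the usual one (permissions, replication, owner, group, size, date, time, path).

```python
from datetime import datetime

def parse_ls_line(line):
    """Return (modification time, path) from one file line of `hdfs dfs -ls` output."""
    parts = line.split()
    if len(parts) < 8:
        return None  # header such as "Found 2 items", or a malformed line
    stamp = datetime.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
    return stamp, " ".join(parts[7:])

def files_outside_window(ls_output, valid_from, valid_to):
    """List files whose modification time is not within the agreement window."""
    flagged = []
    for line in ls_output.splitlines():
        parsed = parse_ls_line(line)
        if parsed and not (valid_from <= parsed[0] <= valid_to):
            flagged.append(parsed)
    return flagged

# Hypothetical listing for Database C and a hypothetical agreement window.
listing = """Found 2 items
-rw-r--r--   3 etl hadoop    1048576 2023-03-01 10:15 /path/to/database/C/part-00000
-rw-r--r--   3 etl hadoop    2097152 2022-11-20 08:30 /path/to/database/C/part-00001"""

flagged = files_outside_window(
    listing, valid_from=datetime(2023, 1, 1), valid_to=datetime(2023, 12, 31, 23, 59)
)
```

A flagged file is a lead for further inquiry rather than a conclusion, since a modification time reflects the last write and can change when files are copied or rewritten.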
Through the above methods, lawyers can locate and verify the storage times and merge operation logs for Databases C and D. This information is important for verifying compliance with the data authorization agreements and for identifying legal risks in data processing, and it helps lawyers confirm that the AI medical system’s data processing is legal, transparent, and consistent with those agreements.
3. How Should Lawyers Balance False Data and Necessary Synthetic Data
AI medical systems must have access to a sufficient amount of comprehensive source data for training and recognition. This both improves system performance and avoids flawed diagnostic conclusions that could mislead doctors. However, because of practical constraints, AI medical institutions may not be able to obtain large amounts of proprietary source data all at once, so programmers may use algorithms to synthesize large amounts of data artificially and use it to train AI diagnostic systems (data augmentation). Synthetic data is necessary from the programmers’ point of view, but from a lawyer’s perspective it conflicts with the principle of authenticity. Lawyers should conduct their review according to the following principles in order to balance data authenticity against technical needs, so that the project achieves the best balance between legal compliance and technical effectiveness.
Where source data has unclear provenance, lacks authorization, or cannot be confirmed as authentic, the lawyers’ review will treat the AI institution as having committed “fraud” or intentional “concealment” toward investors. In making this assessment, care should be taken to distinguish incomplete data (data that is incomplete for untraceable reasons arising in real diagnosis and treatment) and its effect on the review. For synthetic data, by contrast, lawyers first need to work with programmers to confirm the specific scenarios in which it is used in the AI medical system and whether it is genuinely necessary as a supplement to and expansion of real datasets when real data is insufficient.
From a lawyer’s review perspective, the data relied upon by AI medical systems should primarily be real source data, because this data directly affects the output quality and accuracy of AI models. Lawyers should require projects to prioritize real data wherever possible and to ensure its legality and traceability. Synthetic data should be clearly marked as “synthetic” or “artificially generated” in a dedicated field of the dataset, and the generation process should be recorded, so that real source data and synthetic data can be distinguished during data audits. AI institutions are obligated to report to investors or due diligence lawyers what proportion of synthetic data was used and to explain why it was generated and used.
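If records carry such a marker, the proportion of synthetic data can be checked directly on a sample. The sketch below assumes a boolean field named `is_synthetic`; that field name is hypothetical, and records with no marker at all are counted separately because unlabeled records are themselves an audit finding.

```python
from collections import Counter

def synthetic_share(records, flag_field="is_synthetic"):
    """Return the share of real, synthetic, and unlabeled records."""
    counts = Counter()
    for record in records:
        if flag_field not in record:
            counts["unlabeled"] += 1
        elif record[flag_field]:
            counts["synthetic"] += 1
        else:
            counts["real"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("real", "synthetic", "unlabeled")}

# Hypothetical sample: 6 real, 3 synthetic, 1 record with no marker.
records = (
    [{"id": i, "is_synthetic": False} for i in range(6)]
    + [{"id": i, "is_synthetic": True} for i in range(6, 9)]
    + [{"id": 9}]
)
shares = synthetic_share(records)
```

The result can then be compared with the proportion the institution has reported to investors.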
In practice, some AI institutions’ data sources may include synthetic data from third parties. Lawyers should review whether the contracts between AI medical institutions, data suppliers, and technology developers clearly stipulate requirements for data authenticity, the scope and limits of synthetic data use, and legal responsibility for erroneous diagnoses caused by synthetic data.
The proportion of synthetic data in the overall AI medical dataset should be strictly controlled to ensure that the model can still be trained based on real data in most situations. For AI medical projects using synthetic data, lawyers should assist project parties in preparing detailed compliance reports, truthfully reporting data usage situations to regulatory authorities, and explaining the background and necessity of synthetic data use. The generation of synthetic data must undergo strict verification to ensure it does not introduce bias or errors.
4. Other Methods for Verifying AI Medical Source Data Provenance
The source data obtained by AI medical institutions is usually quite complex: some is obtained from intermediary institutions, and some directly from medical institutions such as hospitals. Moreover, as big data passes through multiple layers of transmission, some sensitive data involving personal information is masked or encrypted, so lawyers cannot trace it back and verify the authenticity of particular sampled medical records with the doctors at the originating medical institution. Data privacy and the complexity of data sources do pose challenges for lawyers. Lawyers can verify data authenticity through the following methods and clues:
First, require AI medical institutions to provide detailed data supply chain records, including data sources, transmission processes, and the role of intermediary institutions. A transparent data supply chain helps trace data to its original source. Even where direct access to hospital data is not possible, the legality of data transmission and processing can still be checked.
Second, compare data used in AI medical systems with other independent, public data sources, such as public health reports, publicly reported patient numbers by hospitals, department revenues, incidence rates of specific diseases, treatment data, etc., to check the consistency and reasonableness of source data.
Third, extract sample data for quality inspection and compare it with real medical cases to confirm whether the data conforms to medical common sense and actual practice. This inspection can be conducted by third-party independent medical experts hired by the law firm to ensure the medical reasonableness of the data.
Fourth, lawyers need to conduct background investigations on data intermediary institutions to confirm their legality, reputation, and track record. Understanding the operating models and data processing capabilities of these institutions helps assess the authenticity of the data.
Finally, based on the sampled information, lawyers can use statistical analysis tools to detect anomalies in the data, such as abnormal distributions, overly consistent data, or values that do not match actual medical situations. Such anomalies may indicate problems with the data.
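Two simple checks of this kind are sketched below: the rate of exact duplicate records (one sign of overly consistent data) and a z-score test for implausible numeric values. The field names, the sample values, and the threshold of 3 standard deviations are illustrative assumptions. Real reviews would choose checks together with the medical experts.

```python
from collections import Counter
from statistics import mean, pstdev

def exact_duplicate_rate(rows):
    """Share of rows that are exact copies of an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(rows) if rows else 0.0

def outliers(values, threshold=3.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out by this test
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sample: patient ages with one impossible value, and four records
# of which two are identical in every field.
ages = list(range(40, 69)) + [430]
rows = [
    {"age": 52, "dx": "T2DM"},
    {"age": 52, "dx": "T2DM"},
    {"age": 61, "dx": "T2DM"},
    {"age": 47, "dx": "HTN"},
]
suspect_ages = outliers(ages)
dup_rate = exact_duplicate_rate(rows)
```

Here the age of 430 is flagged and a quarter of the rows are duplicates. Findings like these are starting points for questions to the institution, not proof of fabrication.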
5. Other Matters for Lawyer Due Diligence Interviews
- Source data encryption, access control, and audit control.
- Source and reliability of source data labels (for example, delineation of tumor regions).
- Processing workflows for unstructured data from different sources.
- Whether the overall computing architecture embeds algorithm modules from third parties.
- The distinction between authorization to acquire medical source data and the buying or selling of medical source data as an asset.
- Evaluate the data backup and recovery strategies of AI medical institutions to ensure rapid recovery in case of data damage or loss. Backup data should be stored in secure off-site locations and regularly tested.
- Institutional firewalls, intrusion detection and prevention systems (IDS/IPS), DDoS protection, VPN usage, and similar measures. Lawyers can require AI medical institutions to provide implementation details and audit records for these measures.
- Evaluate whether institutions comprehensively log data access and modification, and whether real-time monitoring and alerting systems are in place to discover and respond to abnormal behavior promptly.
- Lawyers should confirm whether AI medical institutions have complete incident response plans, including data breach handling procedures, notification processes, and the allocation of legal responsibilities. These plans should be drilled regularly to ensure they can be executed effectively in an emergency.
- Data breach handling mechanism: review the mechanism to ensure it complies with the requirements of relevant laws and regulations, such as data breach notification obligations.
- Evaluate whether AI medical institutions regularly conduct security training for employees, so that all employees understand the importance of data security and can identify and respond to common security threats such as phishing and social engineering attacks.