Research - (2024) Volume 12, Issue 1
Received: May 23, 2024, Manuscript No. IJCSMA-24-136856; Editor assigned: May 26, 2024, Pre QC No. IJCSMA-24-136856 (PQ); Reviewed: May 29, 2024, QC No. IJCSMA-24-136856 (Q); Revised: Jun 01, 2024, Manuscript No. IJCSMA-24-136856; Published: Jun 14, 2024
Artificial Intelligence (AI) has emerged as a powerful tool in drug discovery and development, revolutionizing how new medicines are identified, optimized, and brought to market. Several drug discovery processes, including peptide synthesis, structure-based virtual screening, ligand-based virtual screening, toxicity prediction, drug monitoring and release, pharmacophore modeling, quantitative structure-activity relationship modeling, drug repositioning, polypharmacology, and physicochemical activity prediction, have leveraged AI to achieve their goals. This review provides an in-depth analysis of AI's various applications in the pharmaceutical industry, including virtual screening, molecular modeling, target identification, and drug repurposing. We discuss the challenges and opportunities associated with AI in drug discovery and its impact on the future of medicine.
Drug discovery; Artificial intelligence; Machine learning; Deep learning; Virtual screening; Molecular modelling; Target identification; Drug repurposing
Drug development consists of four key phases: drug discovery, pre-clinical research, clinical development and market approval, and post-approval surveillance (Phase IV). Clinical development itself proceeds in three phases: Phase I evaluates pharmacokinetics, safety, and tolerability in healthy volunteers; Phase II tests efficacy and dose response in a small cohort of patients with the target disease; and Phase III comprises large-scale studies to verify safety and effectiveness. The initial drug discovery stage involves identifying and creating new chemical compounds that target specific protein structures related to diseases and medical conditions. Drug design is vital in drug discovery, refining potential drug compounds through lead optimization [1].
Introducing AI techniques into drug discovery and development processes shows potential for accelerating timelines and reducing costs [2]. The substantial financial and time investments required to bring a new drug to market, with an estimated average cost of 2.6 billion USD and a timeline exceeding ten years, illustrate the significance of streamlining drug development efforts. Additionally, the low success rate of new therapeutic agents reaching the market from Phase I clinical trials, estimated at less than 10%, highlights the challenges faced in this field [3-7]. The initial phase of the drug discovery process centers on identifying pertinent targets, such as specific genes and proteins linked to disease pathways, followed by the search for suitable pharmaceutical compounds or drug analogs capable of interacting with these targets [8, 9]. The abundance of large-scale biomedical data repositories is a valuable resource in facilitating these endeavors. The concept of big data pertains to datasets that exceed the scope of conventional data analysis tools and techniques, owing to their substantial size, rapid data generation rates, and diverse data modalities. Concurrently, the evolution of AI technologies has streamlined data analysis practices, enabling the application of various Machine Learning (ML) methodologies to explore, interpret, and discern essential insights, patterns, and relationships within extensive biomedical datasets [10]. Researchers can use ML, Deep Learning (DL), and other AI tools to pinpoint promising drug candidates, forecast their potential efficacy and safety profiles, and refine their molecular structures to amplify therapeutic potency [11].
ML algorithms push forward drug discovery, providing significant benefits to pharmaceutical companies. These algorithms have been instrumental in creating predictive models for evaluating chemical, biological, and physical properties of compounds in drug development [12-14]. ML algorithms are versatile tools that can be integrated into all stages of the drug discovery process. They have been utilized to uncover new applications for existing drugs, forecast interactions between drugs and proteins, assess drug efficacy, determine safety markers, and enhance the bioactivity of molecules [15-18]. Commonly used ML algorithms in drug discovery include Random Forest (RF), Naive Bayesian (NB), and Support Vector Machine (SVM) [19-21].
This review article provides a comprehensive overview of the critical applications of AI in drug discovery and development. We discuss the use of AI in virtual screening, which involves rapidly screening large compound libraries to identify potential drug candidates [5, 22-24]. We also explore how AI is being utilized in molecular modeling to predict the binding affinity of drug candidates to their target proteins, as well as in target identification to identify novel therapeutic targets for specific diseases. Additionally, we examine the growing trend of drug repurposing, where AI is used to discover new indications for existing drugs.
AI is a branch of computer science that focuses on developing machine systems capable of performing tasks that typically require human intelligence. AI systems are designed to learn from data, identify patterns, make decisions, and solve complex problems [25]. Various techniques and approaches are used in AI, such as ML, DL, Natural Language Processing (NLP), and computer vision [26, 27]. ML algorithms enable AI systems to improve their performance over time by learning from data without being explicitly programmed. DL, a subset of ML, uses neural networks to process large amounts of data and extract meaningful patterns [28]. It is important to note that ML algorithms are not uniform within AI. There are two primary categories of ML algorithms: supervised and unsupervised learning. Supervised learning involves training with labeled data to predict labels for new samples, while unsupervised learning identifies patterns in unlabeled data. Unsupervised learning often transforms the data into a lower-dimensional space to facilitate pattern recognition when working with high-dimensional data. This dimension reduction enhances efficiency and aids in the interpretation of patterns. Additionally, the fusion of supervised and unsupervised learning occurs in semi-supervised and reinforcement learning approaches, offering flexibility for diverse datasets.
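The contrast between supervised and unsupervised learning can be illustrated with a minimal sketch. It assumes two hypothetical numeric descriptors per compound and invented activity labels; the nearest-neighbour vote and the two-cluster k-means below are chosen only for brevity and are not methods prescribed by this review.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Supervised: predict a label for `query` from labelled (features, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def two_means(points, iterations=10):
    """Unsupervised: split unlabelled points into two clusters (k-means with k=2)."""
    # Deterministic start: the lexicographically smallest and largest points.
    centers = [min(points), max(points)]
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for p in points:
            idx = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters

# Hypothetical descriptor vectors with made-up activity labels.
labelled = [
    ((0.10, 0.20), "inactive"), ((0.20, 0.10), "inactive"), ((0.15, 0.25), "inactive"),
    ((0.90, 0.80), "active"), ((0.80, 0.90), "active"), ((0.85, 0.75), "active"),
]
unlabelled = [features for features, _ in labelled]

print(knn_predict(labelled, (0.70, 0.80)))         # uses the labels
print([len(c) for c in two_means(unlabelled)])     # ignores the labels
```

The supervised function needs the labels to make a prediction, whereas the clustering function recovers the two groups from the descriptor values alone.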
The success of ML algorithms in drug discovery heavily relies on the availability of vast quantities of high-quality data and well-defined training sets. This requirement is particularly crucial in precision medicine, where a meticulous understanding of various pan-omic data types (such as genomic, transcriptomic, and proteomic) is essential for tailoring effective personalized therapies. Well-characterized data and thorough training sets are vital for the development, refinement, and efficacy of ML algorithms supporting drug discovery in the modern era [28, 29]. The principles of drug discovery and computer-assisted drug design methods are covered in the computer-assisted drug design literature [29, 30].
Drug discovery projects typically commence when there is a lack of effective drugs for a specific disease or when existing treatments exhibit limited efficacy or substantial side effects [31]. The initial phase involves formulating a hypothesis that manipulating a particular target, such as an enzyme or receptor, will result in therapeutic benefits for the disease. This process includes identifying and validating the target. Subsequently, rigorous assays are conducted to identify potential compounds (hits) and develop them into potential drug candidates (leads) through hit discovery, hit-to-lead transformation, and lead optimization stages. These candidates then undergo preclinical testing and clinical trials. Successful candidates can eventually be approved and marketed as medical treatments for the targeted disease (Figure 1).
Figure 1: Drug Discovery Process Pipeline.
Since the 1980s, High-Throughput Screening (HTS) has been utilized to expedite the discovery of small-molecule drugs, enhancing efficiency by leveraging automation and large chemical libraries [32, 33]. HTS generates significant Structure-Activity Relationship (SAR) datasets, which enrich chemical databases like PubChem and ZINC [34, 35]. Virtual Screening (VS) uses computational methods to sift through these vast chemical libraries for potentially active compounds for further in vitro and in vivo testing, relying on knowledge of the target (structure-based VS) or known active molecules (ligand-based VS) [36]. This method aims to speed up the identification of active molecules, based on the hypothesis that modulating a specific target can treat a disease. Agonists and antagonists are two major classes of drugs with different mechanisms of action: agonists activate the target to evoke a biological response, while antagonists bind to the target to block this response [37]. Quantifying activity involves measuring affinity, potency, and efficacy. Affinity reflects how strongly a molecule binds to a target, potency indicates the amount needed to elicit a specific effect, and efficacy describes the magnitude of the effect, such as inhibiting an enzyme by a particular percentage.
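The difference between affinity, potency, and efficacy can be made concrete with a small numerical sketch. It assumes simple 1:1 equilibrium binding and the standard Hill (sigmoidal Emax) dose-response model; the review does not name a specific model, and all parameter values below are invented for illustration.

```python
def fractional_occupancy(ligand_conc, kd):
    """Affinity: fraction of target bound at equilibrium for 1:1 binding.

    `kd` is the dissociation constant; a smaller kd means tighter binding.
    Both arguments must use the same concentration unit.
    """
    return ligand_conc / (ligand_conc + kd)

def hill_response(conc, ec50, emax, hill_coefficient=1.0):
    """Effect at a given concentration under the Hill model.

    `ec50` captures potency (concentration giving half of the maximal effect)
    and `emax` captures efficacy (the maximal achievable effect).
    """
    scaled = conc ** hill_coefficient
    return emax * (scaled / (ec50 ** hill_coefficient + scaled))

# Hypothetical compound: Kd = 10 nM, EC50 = 50 nM, maximal inhibition = 80 %.
print(fractional_occupancy(10e-9, kd=10e-9))          # 0.5: half the target is bound at Kd
print(hill_response(50e-9, ec50=50e-9, emax=80.0))    # 40.0: half of Emax at EC50
```

Two compounds can share the same `emax` yet differ in `ec50`, which is why potency and efficacy are reported separately.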
An ideal drug candidate needs to have sufficient activity as a ligand and exhibit binding specificity to specific targets to avoid unexpected side effects [38]. High selectivity is desired to prevent binding to multiple targets [39]. A combination of criteria, including physicochemical, pharmacokinetic, and pharmacodynamic properties, must be considered for drug candidates. Other properties like the Synthetic Accessibility Score (SAS) and the Quantitative Estimate of Drug-likeness (QED) are also crucial during compound synthesis. Based on the molecular fragments involved, the SAS indicates the complexity of synthesizing a molecule, ranging from 1 (easy) to 10 (difficult). The QED is a measure, on a scale of 0 to 1, of how likely a molecule is to be a suitable drug. Drug discovery therefore involves multi-objective optimization, with predictive models such as Quantitative Structure-Activity Relationship (QSAR) models used to map molecular structure to property values. QSAR models can also be used in reverse to identify structural features for optimal properties and guide drug design from scratch, known as de novo drug design [40].
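One common way to handle several objectives at once is to keep only the candidates that are not outperformed on every criterion, the so-called Pareto front. The sketch below applies that idea to two of the properties described above, treating a higher QED and a lower SAS as better. The molecule names and scores are made up, and Pareto filtering is used here as an illustrative assumption rather than as the procedure of any cited study.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated on (higher QED, lower SAS).

    `candidates` is a list of (name, qed, sas) tuples. A candidate is dominated
    if another one is at least as good on both criteria and strictly better on one.
    """
    front = []
    for name, qed, sas in candidates:
        dominated = any(
            (q >= qed and s <= sas) and (q > qed or s < sas)
            for other, q, s in candidates
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical molecules with invented (QED, SAS) values.
candidates = [
    ("mol_A", 0.82, 2.1),
    ("mol_B", 0.75, 1.8),
    ("mol_C", 0.60, 4.5),   # worse than mol_A on both criteria
    ("mol_D", 0.90, 6.0),   # most drug-like but hardest to synthesize
]

print(pareto_front(candidates))  # ['mol_A', 'mol_B', 'mol_D']
```

The surviving molecules represent different trade-offs between drug-likeness and ease of synthesis, so choosing among them still requires a project-specific preference.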
Drug design goes beyond screening existing chemical libraries to explore the vast chemical space, comprising all potential small molecules [41, 42]. This space is estimated to be extensive, and exploring it involves a continuous Design-Make-Test-Analyze (DMTA) cycle of iterative organic synthesis and property testing [43]. To effectively navigate this chemical space, quantitative drug design has been proposed since the late 1970s [44]. Drug design is centered on two main questions: whether molecular properties can be inferred from molecular structures and which structural characteristics are relevant for specific molecular properties. The former question forms the basis of VS, while QSAR models address the latter. Drug design can be seen as an extension of virtual screening, encompassing tasks such as predicting molecular properties and generating molecules, which are essential components of current AI-driven drug discovery processes [45].
Calculating the affinity between a ligand and a biological target is crucial in drug discovery. Various virtual screening techniques can quickly classify millions of compounds as 'active' or 'inactive' for a specific target. Understanding the binding affinity and interaction strength between a ligand and target protein is a vital aspect of drug discovery, as it aids in identifying potential drug candidates. Computational methods for predicting binding affinity offer significant time and cost savings compared to traditional laboratory experiments [46]. In the initial stages of small molecule drug discovery, rational drug design targets specific proteins, with potential molecules chosen as hits based on their binding affinity. While numerous computational methods have been developed for predicting protein-ligand binding affinity, they often rely on simplifying assumptions that lead to inaccuracies and high false-positive rates in hit identification [47-52]. Additionally, docking scores for hit prioritization during virtual screening may provide unreliable information. Therefore, accurately predicting binding affinity and identifying target hits through structure-based methods remain challenging. Using ML approaches involves training models with experimental protein-ligand data to predict the binding affinity of new protein-ligand complexes. However, challenges persist in accurately representing protein-ligand interactions, accounting for protein flexibility, selecting appropriate descriptors, and dealing with the wide range of ligand affinity values [53]. Among ML models, the RF method has shown improved affinity prediction compared to other techniques, with models like RF-Score and SFCscoreRF being developed.
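For reference, binding affinity is usually reported as a dissociation constant (Kd), its negative logarithm (pKd), or a binding free energy, and these are related by the standard thermodynamic identity ΔG = RT ln(Kd) for a 1 M standard state. The short sketch below converts between them; the 1 nM example value and the temperature of 298.15 K are assumptions chosen only for illustration.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant R in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol: dG = R * T * ln(Kd).

    `kd_molar` is the dissociation constant in mol/L (1 M standard state).
    Tighter binding (smaller Kd) gives a more negative value.
    """
    return GAS_CONSTANT_KCAL * temperature_k * math.log(kd_molar)

def pkd(kd_molar):
    """Negative base-10 logarithm of Kd; larger values mean tighter binding."""
    return -math.log10(kd_molar)

# Hypothetical ligand with Kd = 1 nM.
print(round(binding_free_energy(1e-9), 1))  # -12.3 kcal/mol
print(round(pkd(1e-9), 2))                  # 9.0
```

Because affinities span many orders of magnitude, ML models are typically trained on logarithmic quantities such as pKd rather than on raw Kd values, which relates to the challenge of the wide range of ligand affinity values noted above.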
DL models are gaining popularity for their high performance in various fields, including drug discovery and visual and speech recognition [54, 55]. Several DL methods have been proposed for predicting ligand affinity and designing new drugs. For example, DEELIG, a DL model by Ahmed et al., employs Convolutional Neural Networks (CNN) to identify spatial relationships within data, using a 3D grid of atoms to represent protein-ligand complexes. Li et al. developed DeepAtom, which is based on a CNN that extracts atom interaction features from the voxelized structure of protein-ligand complexes [56, 57].
Additionally, Jiménez et al. introduced KDEEP, a CNN-based model that uses 3D voxel representations of proteins and ligands [58]. Limbu et al. introduced a novel Hybrid Neural Network (HNN) DL model that includes the 'HNN-denovo' and 'HNN-affinity' methods [46]. These methods leverage distinct learning frameworks to improve prediction accuracy in de novo drug design.
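The voxel representations mentioned above can be pictured as a 3D grid in which each cell records which atoms fall inside it. The following is a deliberately simplified sketch of that idea, assuming atoms given as (element, x, y, z) tuples in ångströms and a hypothetical grid origin, size, and resolution. It only counts atoms per element and does not reproduce the featurization of DEELIG, DeepAtom, KDEEP, or any other cited model.

```python
def voxelize(atoms, origin=(0.0, 0.0, 0.0), grid_size=8, resolution=1.0):
    """Map atoms onto a sparse cubic grid of per-element counts.

    Returns {(i, j, k): {element: count}} for occupied voxels only.
    Atoms falling outside the grid_size x grid_size x grid_size box are skipped.
    """
    grid = {}
    for element, x, y, z in atoms:
        # Integer voxel index along each axis, measured from the grid origin.
        idx = tuple(int((c - o) // resolution) for c, o in zip((x, y, z), origin))
        if all(0 <= i < grid_size for i in idx):
            cell = grid.setdefault(idx, {})
            cell[element] = cell.get(element, 0) + 1
    return grid

# Hypothetical coordinates; the last atom lies outside the 8 x 8 x 8 box.
atoms = [
    ("C", 0.5, 0.5, 0.5),
    ("N", 1.2, 0.4, 0.9),
    ("O", 0.7, 0.2, 0.1),
    ("C", 9.5, 0.0, 0.0),
]

print(voxelize(atoms))
# {(0, 0, 0): {'C': 1, 'O': 1}, (1, 0, 0): {'N': 1}}
```

A CNN would consume a dense version of such a grid, with one channel per atom type or property, in the same way an image model consumes colour channels.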
Modeling diseases and identifying targets are vital early stages in drug discovery, and they play a crucial role in determining the success of drug development. Target identification can be categorized into experimental, multiomic, and computational strategies. Combined, these methods can create innovative therapeutic ideas in preliminary target identification, leading to a deeper comprehension of intricate diseases [59].
Experimental Target Identification:
• Affinity-based biochemical methods: Small-molecule affinity probes stand out for their ability to label proteins without leaving a trace when interacting with their ligands [60].
• Comparative profiling: A widely used quantitative proteomics technique called Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC) utilizes stable isotope-labeled amino acids to distinguish between cellular proteomes accurately [61]. Various studies across different cancer types, such as hepatocellular carcinoma, multiple myeloma, endometrial cancer, and colorectal cancer, have shown SILAC's effectiveness in identifying critical factors in disease development [62-65].
• Chemical/genetic screening: Chemical/genetic screening methods, like RNA interference (RNAi) and CRISPR-Cas9 gene editing, have long fascinated biologists due to their specificity and efficiency. CRISPR technology has significantly advanced our understanding of molecular and pharmacological aspects of human diseases [66].
Multiomic Target Identification:
Multiomic data offers researchers a comprehensive view of molecular information from various sources, encompassing both stable genomic data and dynamic expression and metabolic profiles across space and time [67]. Genomics, the oldest and most established omics field, primarily focuses on genetic variations within the DNA sequence [68]. Through large-scale Genome-Wide Association Studies (GWAS) driven by advanced sequencing technologies, numerous links have been discovered between genetic variations and complex diseases or traits. These discoveries have led to groundbreaking therapeutic advancements, such as cystic fibrosis modulator drugs targeting CFTR mutations and new treatments for inflammatory bowel disease that target the disease-related gene IL23A [69, 70]. Furthermore, recent meta-analyses of extensive GWAS data have unveiled previously unknown genetic loci associated with various diseases, creating opportunities for drug repurposing [71].
Computational Target Identification:
• Traditional experimental methods for identifying targets are time-consuming and resource-intensive, prompting the exploration of computational techniques as efficient alternatives. Various approaches, such as pharmacophore screening, reverse docking, and structure similarity analysis, predict potential biological targets for small molecules based on factors like protein structure and compound chemical structure [72-74]. In pharmaceutical research, DL techniques, including generative adversarial networks and recurrent neural networks, have gained substantial traction due to their ability to process data and extract features through multiple layers of nodes [75]. These advanced algorithms have found applications in small-molecule design, aging studies, and predicting drug responses from gene expression data [76-78]. By leveraging diverse data sources and text analysis, DL methods address pressing medical challenges, including severe and unmet medical conditions.
Moreover, advanced language models play a crucial role in expediting biomedical text mining for therapeutic target discovery. These models, like BioGPT by Microsoft and ChatPandaGPT by Insilico Medicine, are trained on vast text corpora to connect diseases, genes, and biological processes [79]. They facilitate the quick identification of disease mechanisms, potential drug targets, and biomarkers. While these language models excel at understanding complex scientific concepts and accelerating disease hypothesis generation, they may unknowingly perpetuate human biases and lack the discernment needed to validate the accuracy of input data. Additionally, their reliance on published information might restrict their ability to uncover innovative targets. Therefore, it is recommended to acknowledge these model limitations and complement their use with other approaches to ensure the discovery of genuinely novel therapeutic targets.
Using AI-generated synthetic data is beneficial for target identification in medical research. Synthetic data, created by AI algorithms to mimic real-world patterns, can help researchers explore a broader range of scenarios, especially in areas with limited experimental data like rare diseases. By generating synthetic data based on existing knowledge, AI can reveal potential therapeutic targets that may have been overlooked, aiding in the validation of predictions and addressing data imbalance issues [80-82]. However, it's essential to acknowledge the limitations of synthetic data, as models may not capture unknown complexities, and ethical concerns could arise from simulating under-represented populations. Robust validation and quality control measures are crucial to ensuring the reliability and relevance of AI-generated synthetic data in biomedical research [83-85].
Drug repurposing, a feasible and promising approach, has garnered increasing interest from governments and pharmaceutical companies for its excellent track record in saving time and money. It involves identifying new medical uses for existing drugs initially developed for different purposes. This approach presents a quicker and more cost-efficient way to create new treatments [86, 87].
Drug repurposing strategies can be categorized into drug-based and disease-based approaches. The common premise is that a drug may effectively treat multiple diseases that share similarities or interconnections [88, 89]. Target associations can be complex due to the multifunctional nature of drugs, and computational drug repurposing faces challenges in distinguishing drug targets from other gene products indirectly involved in target activity [88]. Traditional methods may lack sufficient datasets and may not account for environmental variations, potentially leading to inaccuracies. Still, the growing volume of biomedical and pharmaceutical data has improved computational approaches to drug repurposing, such as data mining and ML [90, 91]. These advanced methods help uncover therapeutic opportunities by analyzing interactions among biological entities like genes, proteins, drugs, and diseases within complex networks. Some ML techniques for repurposing drugs include k-nearest neighbors, RF, and SVM [92-94].
Various studies have used ML techniques like collaborative filtering to predict new drug-disease associations based on gene expression patterns [95, 96]. One study created drug similarity datasets to identify potential repurposed uses for FDA-approved drugs by analyzing known and unknown associations [97]. They used SVM classification and collaborative filtering to predict novel drug-disease links. Another study proposed a computational framework integrating different data sources to predict similarities between drugs, diseases, and drug-disease pairs [98]. They used strategies like block coordinate descent and causal inference-probabilistic matrix factorization to classify new drug-disease associations. These approaches help identify potential off-target drug interactions and discover new connections within the drug-disease network.
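The intuition behind similarity-driven collaborative filtering can be sketched in a few lines: a drug is scored for a disease according to how similar it is to other drugs already associated with that disease. The example below assumes each drug is described by a hypothetical set of features (for instance targets or substructures) and uses Tanimoto similarity with a similarity-weighted vote. It is a neighbourhood-based illustration only, not the SVM, block coordinate descent, or matrix factorization methods of the cited studies, and all names and data are invented.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets, in [0, 1]."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def score_pair(drug, disease, features, known):
    """Similarity-weighted vote of the other drugs already linked to `disease`.

    `features` maps drug -> set of features; `known` maps drug -> set of
    diseases it is already associated with. Returns a score in [0, 1].
    """
    weighted_hits = 0.0
    total_similarity = 0.0
    for other, other_features in features.items():
        if other == drug:
            continue
        sim = tanimoto(features[drug], other_features)
        if disease in known.get(other, set()):
            weighted_hits += sim
        total_similarity += sim
    return weighted_hits / total_similarity if total_similarity else 0.0

# Hypothetical drugs, features, and known drug-disease associations.
features = {
    "drug_A": {"f1", "f2", "f3"},
    "drug_B": {"f1", "f2", "f4"},
    "drug_C": {"f5", "f6"},
}
known = {"drug_B": {"disease_X"}, "drug_C": {"disease_Y"}}

print(score_pair("drug_A", "disease_X", features, known))  # 1.0
print(score_pair("drug_A", "disease_Y", features, known))  # 0.0
```

Here drug_A shares features only with drug_B, so it inherits a high score for disease_X and none for disease_Y; real pipelines combine many similarity sources and validate predictions experimentally.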
Due to their ability to uncover intricate patterns, DL models are well-suited for analyzing complex data such as electronic health records, the entire proteome, and the human genome [99]. This makes them particularly applicable in life sciences fields like drug repurposing. DL stands out from traditional ML techniques for the flexibility of its neural networks [100]. DL methods offer various advantages over conventional approaches, particularly their ability to automatically model and learn complex features, and have proven beneficial in identifying repurposed drugs. Nonetheless, several challenges persist, including the opaque, "black box" nature of DL models and the resulting lack of interpretability, the dependence of outputs on the quality of input data, and the need for extensive standardized biochemical datasets to achieve optimal learning and performance. Despite the availability of vast amounts of data, extracting and preparing standardized data for ML and DL applications remains a significant challenge. It is hoped that the development of extensive datasets in the future will facilitate the establishment of standardized DL models for drug repurposing [101].
In conclusion, AI has the potential to revolutionize the field of drug discovery and development, transforming the way medicines are discovered, optimized, and brought to market. By harnessing the power of ML and other AI technologies, researchers can accelerate the drug development process, reduce costs, and improve the success rates of new therapeutic interventions. As AI continues to evolve and mature, its integration into the pharmaceutical industry will be crucial in advancing personalized medicine and improving healthcare outcomes for patients worldwide.