Why is QSAR so far behind other forms of machine learning, and what can be done to close the gap?

The key feature of successful machine learning that is missing in QSAR

In a previous post, we showed that the prediction error of QSAR (quantitative structure-activity relationship) models increases with the distance to the nearest element of the training set. This trend holds across a variety of machine learning algorithms and distance metrics. In contrast, prediction error is not correlated with distance to the training set on conventional ML (machine learning) tasks like image classification: modern deep learning algorithms are able to extrapolate far from their training data. 

A favorable resolution of this apparent contradiction would have significant practical consequences. QSAR algorithms that generalize widely and accurately could efficiently identify new potent and selective molecules for difficult target product profiles, considerably reducing the time and expense of drug discovery.

To reconcile the disparity in generalization between QSAR and conventional machine learning tasks, we need to identify its cause. Three potential explanations present themselves:

  1. Algorithms have been better aligned with conventional ML tasks,
  2. Better datasets have been brought to bear on standard ML tasks, or
  3. QSAR is intrinsically more difficult than typical ML tasks.

In a succession of posts, we will explore each of these possibilities in turn, beginning with the first.

QSAR algorithms do not capture the structure of ligand-protein binding

Machine learning algorithms can correctly extrapolate some functions only because they are unable to extrapolate others (Wolpert & Macready, 1997). An algorithm that can represent every possible mapping from inputs to outputs equally well will memorize the training set, while producing arbitrary outputs on any other input. Correspondingly, ML algorithms generalize best when their architecture enforces the structure and regularities of the underlying problem domain. Such algorithms have difficulty representing irrelevant input-output mappings that violate the problem domain regularities, while easily capturing the true input-output mapping. Counterintuitively, recent state-of-the-art ML performance appears to be driven by increasingly flexible ML architectures, which are applicable across problem domains.

This post will identify the key structural constraint that remains embedded in these ML algorithms: the consistent application of linear filters to local input patches. We will also show that the machine learning algorithms currently applied to small molecule potency prediction do not respect this structural constraint. The integration of local linear filters into QSAR models represents an untapped opportunity to improve potency prediction accuracy.

ML architectures must be matched to the problem

The visual domain exhibits significant structure and regularities. For instance, the semantic content of an image (e.g., whether or not an image contains a cat) is invariant to small shifts (e.g., vertical or horizontal), rotations, and scalings of the image. Critical features (e.g., edges and curves) in an image are composed of local, contiguous groups of pixels.

Conventional neural networks, also called multi-layer perceptrons (MLPs), are not constrained to reflect the structure of images. Each unit (neuron) in an MLP computes a weighted sum of the units in the previous layer, followed by an activation function such as a rectified linear unit (ReLU; f(x) = max(0, x)) or a sigmoid (f(x) = 1/(1 + e^(-x))), as shown in Figure 1. Even the simplest one-hidden-layer MLP is unlimited in its expressive power: if the hidden layer is large enough, it can represent any function mapping inputs (e.g., molecular structure) to outputs (e.g., log IC50 against a protein target) (Hornik, et al., 1989).

Figure 1: Architecture of a single unit in a multi-layer perceptron (MLP), and common activation functions. Many copies of this building block are repeated within and across layers. (Feng, et al., 2019)
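
As a concrete illustration, here is a minimal NumPy sketch of a one-hidden-layer MLP; the dimensions, random weights, and choice of ReLU are arbitrary placeholders, not a reference implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # f(x) = max(0, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # f(x) = 1 / (1 + e^(-x)), an alternative activation

def one_hidden_layer_mlp(x, W1, b1, W2, b2):
    """Each unit computes a weighted sum of the previous layer, followed
    by an activation; with enough hidden units, this family can represent
    essentially any input-output mapping."""
    hidden = relu(W1 @ x + b1)
    return W2 @ hidden + b2

rng = np.random.default_rng(0)
x = rng.normal(size=32)                          # e.g., a molecular descriptor vector
W1, b1 = rng.normal(size=(64, 32)), np.zeros(64)
W2, b2 = rng.normal(size=(1, 64)), np.zeros(1)
print(one_hidden_layer_mlp(x, W1, b1, W2, b2))   # e.g., a predicted log IC50
```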

When an MLP is applied to a vision task, shifting an image up and to the right can completely transform the output, since an entirely different set of weights is applied to each input pixel. On the other hand, if the pixels are rearranged before training (in a manner consistent between images), an MLP is unaffected; it simply permutes its weights in the same pattern. This lack of structure is evident in the filters (weights) learned by MLPs when applied to vision tasks. As shown in Figure 2, they do not contain any identifiable semantic features like edges, but rather look like high-frequency noise. Because MLPs do not embody enough structure of the problem domain, they generalize poorly (Bachmann, et al., 2024).

Figure 2: First layer linear filters from an unconstrained MLP (Bachmann, et al., 2024), a convolutional neural network (Krizhevsky, et al., 2012), and a vision Transformer (patch embedding; Dosovitskiy, et al., 2020).

The deep learning revolution was largely driven by the rise of convolutional neural networks (CNNs), which prominently embody the structure of the vision domain (Krizhevsky, et al., 2012). Convolutional networks apply a set of small pattern-matching filters (e.g., 3 pixels by 3 pixels) to every location in the image (Figure 3). In the early layers, these filters detect edges, as shown in Figure 2. Shifting the image up and to the right simply shifts the outputs of the filters. Eventually, filter outputs at nearby locations are pooled together, which virtually eliminates the effect of the shift.

Figure 3: Architecture of a single unit of a convolutional neural network (CNN). Each unit applies a shared local, linear filter to a spatially-aligned patch of its input layer, followed by an activation function such as a ReLU. (Robson, 2017)
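
The shift equivariance described above can be verified directly. The following sketch applies a fixed Sobel-style edge filter (a stand-in for a learned filter) at every location of a toy image, and confirms that shifting the image simply shifts the filter responses; all sizes and values are illustrative.

```python
import numpy as np

def conv2d(image, kernel):
    """Apply the same local linear filter at every spatial location
    (valid padding): the defining operation of a convolutional layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel-style vertical edge detector, standing in for a learned filter.
edge_filter = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])

rng = np.random.default_rng(0)
image = rng.random((8, 8))
shifted = np.roll(image, 1, axis=1)  # shift the image one pixel to the right

# Shift equivariance: the response to the shifted image is just the
# shifted response (away from the wrap-around boundary).
resp, resp_shifted = conv2d(image, edge_filter), conv2d(shifted, edge_filter)
print(np.allclose(resp[:, :-1], resp_shifted[:, 1:]))  # True
```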

Convolutional neural networks explicitly capture the invariance of semantic content to small shifts, as well as the fact that critical features (e.g., edges and curves) are composed of local, contiguous groups of pixels. They cannot easily produce classifications that change when the input is translated a few pixels to the right, or process images where the pixels have been arbitrarily rearranged, even if the permutation is consistent across all images. Correspondingly, CNNs achieve significantly better performance than MLPs on tasks like image classification (Bachmann, et al., 2024).

Devising new ML architectures to capture domain-specific features is difficult

There is a long tradition of efforts to design new ML architectures that better capture the structure of the problem domain. Virtually none of these attempts have been successful. 

In the visual domain, the semantic content of an image is also invariant to moderate changes in scaling (zoom), rotation (including out of plane), translation (shift), reflection (mirror image), lighting, and object pose. Architectures have been developed to capture these invariances, ranging from the scattering transform (invariant to rotation and scaling; Bruna & Mallat, 2013) to group equivariant CNNs (invariant to rotations and reflections; Cohen & Welling, 2016), geometric CNNs (invariant to pose; Bronstein, et al., 2017), and capsule networks (invariant to viewpoint and pose; Sabour, et al., 2017). Despite their intuitive appeal, these approaches have not improved performance. The current state-of-the-art is based upon the vision Transformer, which eliminates even some of the explicit translational invariance of convolutional neural networks (Dosovitskiy, et al., 2020).

In natural language, early work was dominated by parse trees. Recursive neural networks recapitulate the structure of parse trees by successively merging the representations of connected words and phrases (Socher, et al., 2010). This heavy, domain-specific structure was rendered unnecessary by the ascendance of recurrent neural networks (e.g., LSTMs and GRUs), which were inspired by the human ability to process words one at a time, in sequence (Hochreiter & Schmidhuber, 1997; Mikolov, et al., 2010; Cho, et al., 2014). They compute the summary of the first n words based only upon the preceding summary of the first n-1 words, combined with the nth word. Despite exhaustive explorations across the space of possible recurrent neural network architectures (e.g., Greff, et al., 2016), they have been supplanted by Transformers, which further weaken the architectural constraint by allowing the representation of each word to be directly influenced by every other word (Vaswani, et al., 2017).
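
To make the recurrence concrete, here is a minimal sketch of a vanilla recurrent cell (LSTMs and GRUs add gating to the same basic pattern); the dimensions and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_word = 16, 8
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # recurrent weights
W_x = rng.normal(scale=0.1, size=(d_hidden, d_word))    # input weights

def rnn_summary(words):
    """The summary h_n of the first n words depends only on the summary
    h_{n-1} of the first n-1 words and the n-th word itself."""
    h = np.zeros(d_hidden)
    for x in words:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

sentence = [rng.normal(size=d_word) for _ in range(5)]
print(rnn_summary(sentence).shape)  # (16,)
```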

Flexible architectures are almost unbeatable

Most of the significant architectural advances in machine learning merely ensure that the learned function is as simple as possible. Examples include non-saturating activation functions (e.g., ReLU), dropout, normalization layers (e.g., batch norm), approximate second-order gradient descent (e.g., Adam) with learning rate decay and weight decay, residual connections, inverse bottleneck layers, label smoothing, attentional layers and gating (e.g., Transformers), and diffusion models. These architectural elements hold the learned function close to the identity transformation, which leaves the input unchanged, with deviations that are smooth or roughly linear.
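
A residual connection illustrates this bias toward the identity; a minimal sketch, with a placeholder ReLU deviation:

```python
import numpy as np

def residual_block(x, W):
    """y = x + f(x): the block defaults to the identity transformation,
    so training only needs to learn a (typically small) deviation."""
    return x + np.maximum(0.0, W @ x)   # placeholder ReLU deviation

rng = np.random.default_rng(0)
x = rng.normal(size=8)
print(np.allclose(residual_block(x, np.zeros((8, 8))), x))  # zero weights: exact identity
```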

Other essential ML architectures capture only the most basic structure of the problem domain. Convolutional/recurrent networks enforce locality in and consistency across space/sequence, and have largely been supplanted by Transformers. Data augmentations (e.g., shifts, mixup; primarily applicable to visual data) enforce invariance to common distortions, including compositionality (adding a second object to an image doesn’t change the identity of the first object). Generative modeling (including masked auto-encoding, as in BERT) learns the joint distribution of the entire dataset, rather than the distribution of some variables (output) conditioned on other variables (input). 

There are additional architectural elements that help the algorithm make maximal use of the training data: e.g., variational autoencoders, adversarial training, contrastive learning, and distillation. Other architectures, such as equivariant graph neural networks and reinforcement learning methods, are specific to specialized applications. Nevertheless, this limited set of concepts is basically sufficient to understand how ChatGPT works its magic, and makes it clear that most advances in machine learning do not embed the detailed structure of the problem domain.

The modest constraints imposed by domain-specific architectures are manifestations of Rich Sutton’s bitter lesson (Sutton, 2019):

1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.

ML requires consistent local filters

At the same time, some domain knowledge actually is required. All successful ML architectures apply a consistent set of local filters across the input. These filters extract key features in a manner robust to small perturbations or spurious correlations between distant inputs. In images, these filters must be able to detect local edges; in text, they must learn common n-grams. Without the domain knowledge implicit in local filters, performance collapses. 

Convolutional neural networks explicitly apply local filters across the image (Figure 3). Vision Transformers construct the same filters implicitly through their self-attention mechanism (Cordonnier, et al., 2019; d’Ascoli, et al., 2021; for text Transformers, see Vig & Belinkov, 2019). The initial patch embedding in vision Transformers additionally applies explicit edge filters analogous to those of convolutional neural networks (Dosovitskiy, et al., 2020), as shown in Figure 2.

MLP-Mixers eliminate even self-attention, and merely alternate between fully-connected layers across hidden units and spatial position, without significantly reducing performance (Tolstikhin, et al., 2021). These fully-connected layers implement non-local filters that cover the entire input, and are not consistent between different positions. However, MLP-Mixers begin with a patchification layer, which performs convolution (applies consistent local filters) with stride (offset between successive applications of the filter) equal to the filter size (Figure 4). The patchification layer differs from a conventional convolutional layer only in that there is no overlap between the regions to which the filters are applied.

Figure 4: MLP-Mixers apply a patchification layer, which differs from a conventional convolution only in that the filter is shifted by a distance equal to its width between applications. (Bachmann, et al., 2024)
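
A minimal sketch of patchification: a shared linear filter bank applied to non-overlapping patches, i.e., a convolution whose stride equals its filter size. The patch size and embedding dimension below are arbitrary.

```python
import numpy as np

def patchify(image, p):
    """Split the image into non-overlapping p×p patches: a convolution
    whose stride equals its filter size."""
    h, w, c = image.shape
    patches = image.reshape(h // p, p, w // p, p, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
W = rng.normal(size=(16 * 16 * 3, 128))   # one shared (consistent) filter bank
tokens = patchify(image, 16) @ W          # the same local filters applied to every patch
print(tokens.shape)                       # (196, 128): one embedding per patch
```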

With its patchification layer, MLP-Mixer achieves a top-1 accuracy of 84.15% on ImageNet (a standard image classification dataset with 1.2M images spread across 1k categories; Tolstikhin, et al., 2021), comparable to vision Transformers (85.30% accuracy; Dosovitskiy, et al., 2020) and state-of-the-art convolutional networks (87.8% accuracy; Liu, et al., 2022) when all algorithms are pre-trained on ImageNet-21k. Even if communication is not allowed between the patches before the final classification, patchification during the initial embedding facilitates reasonable performance: 56.5% top-1 accuracy on ImageNet even without extra pre-training data (Touvron, et al., 2022). If the patchification layer is removed, eliminating all consistent local filtering, the performance of the resulting MLP collapses to 31.7% (no pre-training; Bachmann, et al., 2024). MLP accuracy decays further to 8.7% without data augmentation, such as random shifts and flips, which implicitly enforce the invariance captured explicitly by convolution/patchification.

Convolution or patchification layers, which apply consistent local linear filters, are the key driver of prediction accuracy. A variety of different strategies for passing information between spatial positions yield similar performance, so long as local linear filtering is also performed (Liu, et al., 2021; Tolstikhin, et al., 2021; Touvron, et al., 2022; Yu, et al., 2022).

Current ML architectures cannot produce consistent local filters for small molecules

It is difficult to construct an analogous filtering or patchification operation on molecular graphs. For local filters to be useful, the precise relationship and alignment between nearby regions must be evaluated. In the vision domain, filter alignment captures the difference between two nearby edges that form a single line and orthogonal edges that belong to different shapes. Computer vision algorithms such as convolutional neural networks and vision Transformers can easily evaluate alignment between rotationally asymmetric local filters (e.g., edge detectors) because the filters are defined relative to a global orientation: “up” is consistent between patches and images. 

Molecular graphs do not have a privileged global orientation against which rotationally asymmetric local filters can be aligned. Imposing an arbitrary global orientation would require the algorithm to separately learn every possible orientation, rather than embodying invariance to global orientation in the consistency of the local filters. In the image domain, this would be analogous to using unconstrained MLPs without convolution or patchification layers (eliminating the consistency of local filters), and instead applying random shifts to every image to train the unconstrained MLP to be invariant to these transformations. As we saw above, removing this explicit embodiment of shift-invariance significantly hurts performance. 

A naive extension of patches to molecular graphs might uniquely encode each radius-n sub-graph. Extended connectivity fingerprints (ECFPs) approximate this process, but hash to a smaller number of bits, so the same representation is assigned to multiple unrelated patches (Rogers & Hahn, 2010). ECFPs are approximated in turn by graph convolutional networks, to the extent that the per-node feedforward layer(s) approximate a hash function (Xu, et al., 2019). This patchification is radially symmetric, and does not align representations between patches. Since even similar patches (radius-n sub-graphs) have maximally distant representations in these approaches, they do not facilitate generalization. In contrast, in state-of-the-art ML algorithms, image patches are processed by aligned linear filters, for which the output changes smoothly as the input is altered. 
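
For illustration, a minimal RDKit sketch of ECFP generation (assuming RDKit is installed; the molecule and parameters are arbitrary):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an arbitrary example

# Each atom-centered neighborhood up to radius 2 is hashed into a
# 2048-bit vector; the hashing means unrelated sub-graphs can collide on
# the same bit, while similar sub-graphs can land on unrelated bits.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```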

Many modern QSAR algorithms are based upon SMILES strings, which cut the rings of a molecule to form a tree, and then walk through the tree, reading out atoms and bonds as they are encountered (Chithrananda, et al., 2020; Ross, et al., 2022). There are many valid SMILES strings that equally represent a single molecule. Each such SMILES string highlights some bonds, by presenting the bonded atoms consecutively; and deemphasizes others, for which the atoms are distant in the SMILES string. In any single SMILES string, some local fragments of the molecule will be split up across faraway sections of the SMILES string, and thus not subject to processing by local filters on the SMILES string (Figure 5).

Figure 5: Multiple SMILES strings of the same molecule, with the traversal path of the SMILES string overlaid on the molecular graph. Each SMILES string represents a different subset of bonded atoms using consecutive tokens.
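
The multiplicity shown in Figure 5 is easy to reproduce with RDKit (assuming it is installed); the molecule below is an arbitrary stand-in:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("NCCc1ccccc1")  # phenethylamine, as an arbitrary example

# Each random traversal yields a different valid SMILES string for the
# same molecule, placing different bonded atoms adjacent in the string.
for _ in range(3):
    print(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
```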

Graph neural networks (Kipf & Welling, 2016; Veličković, et al., 2018; Xu, et al., 2019) and graph Transformers (Ying, et al., 2021; Dwivedi, et al., 2023) are often hailed as the obvious architectures for molecular property prediction (Atz, et al., 2021; Müller, et al., 2023), but they do not consistently surpass even simple architectures like random forests on extended connectivity fingerprints for experimental activity prediction (Cichońska, et al., 2021; Huang, et al., 2021; Deng, et al., 2023; Luukkonen, et al., 2023). Graph neural networks effectively apply a filter that takes the average of the connected nodes. In the visual domain, this would correspond to only using the filter

(1/9) × [ 1 1 1 ]
        [ 1 1 1 ]
        [ 1 1 1 ]

which averages nearby pixels and smooths away fine details (Figure 6). This extremely limited filter only achieves good performance in the visual domain when used in conjunction with multiple patchification layers, which apply complicated, consistent local filters (Yu, et al., 2022). Without separate patchification, graph neural networks constitute low-pass filters, and suffer from the over-smoothing problem: the representations of all nodes converge to a common value exponentially quickly in the number of layers (Oono & Suzuki, 2020; Rusch, et al., 2023). After a few layers, all but one node can be safely ignored, since the nodes are all identical. Attentional mechanisms do not resolve this limitation (Wu, et al., 2024).

Figure 6: A 3×3 low-pass filter only smooths an image, and does not identify important features. Graph neural networks effectively apply such a low-pass filter, rather than using trained filters as in a traditional convolutional neural network.
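
Over-smoothing can be demonstrated directly. Stripped of weights and nonlinearities, message passing reduces to repeated neighborhood averaging; in this toy sketch (a 4-node path graph with self-loops, an arbitrary example), the node representations converge exponentially:

```python
import numpy as np

# A 4-node path graph with self-loops; P replaces each node's features
# with the mean over its neighborhood, as in a weightless GNN layer.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))               # initial node features

for layer in range(1, 31):
    H = P @ H                             # one round of message passing
    if layer % 10 == 0:
        spread = np.ptp(H, axis=0).max()  # largest gap between any two nodes
        print(f"layer {layer:2d}: node spread = {spread:.2e}")
# The spread decays exponentially: all node representations converge.
```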

Moreover, graph neural networks are structurally incapable of distinguishing between pairs of molecules with significant differences, such as those in Figure 7 (Xu, et al., 2019). They cannot even count the number of each functional group present in a molecule (Chen, et al., 2020), a task which should be easily solvable using consistent local filters (by constructing a filter for each functional group, and summing the filtered outputs).

Figure 7: Graph neural networks, including Graph Isomorphism Networks (GIN; Xu, et al., 2019) and Graph Attention Networks (GAT; Veličković, et al., 2018), are unable to distinguish between these two molecules. They produce identical predictions, regardless of how the network is trained. The confusion arises because there is a correspondence between the atoms (indicated by matching colors in the figure), such that the neighborhoods of corresponding atoms also correspond. For example, in both molecules, a blue carbon is connected to a blue carbon, a magenta carbon, and a green fragment.
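
The underlying limitation is that message-passing GNNs distinguish at most what the 1-Weisfeiler-Lehman (WL) test distinguishes (Xu, et al., 2019). The sketch below runs WL color refinement on a classic indistinguishable pair, one hexagon versus two triangles (a simpler stand-in for the molecules in Figure 7), and shows that the color histograms never diverge:

```python
from collections import Counter

def wl_colors(adj, rounds=5):
    """1-WL color refinement: iteratively hash each node's color together
    with the multiset of its neighbors' colors."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
                  for v in adj}
    return Counter(colors.values())

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}              # one 6-ring
triangles = {i: [3 * (i // 3) + (i + 1) % 3, 3 * (i // 3) + (i + 2) % 3]
             for i in range(6)}                                          # two 3-rings
print(wl_colors(hexagon) == wl_colors(triangles))  # True: indistinguishable
```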

Transformers break their input into discrete tokens, which are all processed in the same way. In text, each word (or word part) constitutes a distinct token. On images, tokens typically comprise large (e.g., 16 x 16) pixel patches (Dosovitskiy, et al., 2020). Since tokens are processed uniformly, the position of each token within the text or image must be captured in the token’s embedding (the vector representation assigned to the token). This is achieved via a positional encoding, which is added to the encoding of the content before the first Transformer layer. The traditional positional encoding is a set of sinusoids, which captures the relative distance between each pair of tokens along the key axes (sequence for text; x and y axes for images). 
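
A minimal sketch of the traditional sinusoidal positional encoding (Vaswani, et al., 2017); the sequence length and model dimension are placeholders:

```python
import numpy as np

def sinusoidal_encoding(n_tokens, d_model):
    """Each dimension oscillates at a different frequency along the
    sequence axis, so relative distances between tokens are recoverable
    from the encodings."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((n_tokens, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

print(sinusoidal_encoding(50, 64).shape)  # (50, 64), added to the token embeddings
```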

Corresponding positional encodings cannot be easily constructed for molecular graphs, since they lack unique axes that are consistent between different molecules. (There is no well-defined “up” direction.) The most common graph positional encodings include those based on the path distance (the number of bonds between two atoms, using either shortest paths or random walks; Ying, et al., 2021; Dwivedi, et al., 2021), or the eigenvectors of the graph Laplacian (the vibrational modes if the atoms were point masses and the bonds were springs; Dwivedi, et al., 2023).

A positional encoding based upon path distances can only construct rotationally symmetric filters. On an image (with each pair of adjacent pixels connected, yielding a square-grid graph), such a positional encoding cannot produce the edge filters required for image classification.

The eigenvectors of the graph Laplacian do not have a well-defined sign: they define axes, but do not distinguish between the two directions along each axis (Huang, et al., 2023). Moreover, they do not distinguish between rotations amongst eigenvectors with repeated eigenvalues, and can be unstable to even small changes in the graph (Figure 8; Wang, et al., 2022). Between dissimilar molecular graphs, there is no obvious relationship amongst corresponding eigenvectors. As a result, the eigenvectors of the graph Laplacian do not define consistent axes between different molecular graphs. 

Figure 8: Corresponding eigenvectors (ordered by eigenvalue) of similar and dissimilar molecules are visualized on the molecular graphs using a cold-to-hot color scale. Eigenvector 6 is mostly consistent between the similar molecules. The two other eigenvectors are reordered or significantly different. Eigenvectors have an even weaker correspondence between dissimilar molecules.
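
The sign ambiguity is easy to verify: for any Laplacian eigenvector v, -v is an equally valid eigenvector with the same eigenvalue. A minimal sketch on a 5-cycle (an arbitrary toy graph):

```python
import numpy as np

# Graph Laplacian of a 5-cycle: L = D - A.
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)

# Both v and -v satisfy L v = λ v, so the sign of each positional axis is
# arbitrary; the cycle's repeated eigenvalues additionally allow arbitrary
# rotations within each eigenvalue pair.
v = eigvecs[:, 1]
print(np.allclose(L @ v, eigvals[1] * v), np.allclose(L @ -v, eigvals[1] * -v))
```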

Some early efforts applied 3D convolutional neural networks to the voxelization of 3D molecular poses (Wallach, et al., 2015; Ragoza, et al., 2017; Jiménez, et al., 2018; Stepniewska-Dziubinska, et al., 2018). While they do explicitly utilize consistent local filters, and are robust to shifts, 3D CNNs have no intrinsic invariance to rotations. The set of rotations to which the network must be robust is much larger than in the case of computer vision, since there is no global “up” direction to which inputs can be aligned. Computational requirements and voxel sparsity increase cubically with the grid resolution, so such 3D CNNs have typically used coarse grids, rendering them blind to perturbations as large as 0.5 or 1 Å. Moreover, a single conformer must be chosen, which may bear little resemblance to the bound pose of the ligand to any particular target. Because of these difficulties, few modern QSAR algorithms are based on 3D CNNs (Wallach, et al., 2024).
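
A minimal sketch of voxelization, illustrating the cubic growth of the grid with resolution (the box size, resolution, and coordinates are arbitrary placeholders):

```python
import numpy as np

def voxelize(coords, box=24.0, resolution=1.0):
    """Map 3D atom coordinates (Å, centered on the binding site) onto an
    occupancy grid. Memory grows cubically with 1/resolution, and any
    perturbation smaller than the voxel size is invisible."""
    n = int(box / resolution)
    grid = np.zeros((n, n, n))
    idx = ((coords + box / 2) / resolution).astype(int)
    idx = idx[(idx >= 0).all(axis=1) & (idx < n).all(axis=1)]  # drop atoms outside the box
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

coords = np.random.default_rng(0).uniform(-10, 10, size=(30, 3))  # 30 mock atoms
print(voxelize(coords, resolution=1.0).shape)                     # (24, 24, 24)
print(voxelize(coords, resolution=0.5).size
      / voxelize(coords, resolution=1.0).size)                    # 8.0: halving the voxel size costs 8× memory
```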

Conclusion

The success of machine learning in domains like vision and natural language depends upon the consistent application of local linear filters across the input. Unfortunately, standard QSAR approaches based upon molecular fingerprints, SMILES strings, graph neural networks, and graph Transformers fail to perform a corresponding filtering operation on molecular graphs. The absence of consistent local filters may explain the relatively poor performance of powerful deep learning algorithms on QSAR tasks. Efforts should be directed towards developing techniques to apply consistent local filters to molecular graphs.

Acknowledgements

Thanks to Mehran Khodabandeh for making most of the figures in this post.

References

Atz, K., Grisoni, F., & Schneider, G. (2021). Geometric deep learning on molecular representations. Nature Machine Intelligence, 3(12), 1023-1032.

Bachmann, G., Anagnostidis, S., & Hofmann, T. (2024). Scaling MLPs: A tale of inductive bias. Advances in Neural Information Processing Systems, 36.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., & Vandergheynst, P. (2017). Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4), 18-42.

Bruna, J., & Mallat, S. (2013). Invariant scattering convolution networks. IEEE transactions on pattern analysis and machine intelligence, 35(8), 1872-1886.

Chen, Z., Chen, L., Villar, S., & Bruna, J. (2020). Can graph neural networks count substructures?. Advances in neural information processing systems, 33, 10383-10395.

Chithrananda, S., Grand, G., & Ramsundar, B. (2020). ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885.

Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014, October). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1724-1734).

Cichońska, A., Ravikumar, B., Allaway, R. J., Wan, F., Park, S., Isayev, O., … & Challenge organizers. (2021). Crowdsourced mapping of unexplored target space of kinase inhibitors. Nature communications, 12(1), 3307.

Cohen, T., & Welling, M. (2016, June). Group equivariant convolutional networks. In International conference on machine learning (pp. 2990-2999). PMLR.

Cordonnier, J. B., Loukas, A., & Jaggi, M. (2019). On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584.

d’Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., & Sagun, L. (2021, July). Convit: Improving vision transformers with soft convolutional inductive biases. In International conference on machine learning (pp. 2286-2296). PMLR.

Deng, J., Yang, Z., Wang, H., Ojima, I., Samaras, D., & Wang, F. (2023). A systematic study of key elements underlying molecular property prediction. Nature Communications, 14(1), 6395.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Dwivedi, V. P., Luu, A. T., Laurent, T., Bengio, Y., & Bresson, X. (2021). Graph neural networks with learnable structural and positional representations. arXiv preprint arXiv:2110.07875.

Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., & Bresson, X. (2023). Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43), 1-48.

Feng, J., He, X., Teng, Q., Ren, C., Chen, H., & Li, Y. (2019). Reconstruction of porous media from extremely limited information using conditional generative adversarial networks. Physical Review E, 100(3), 033308.

Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016). LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10), 2222-2232.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5), 359-366.

Huang, K., Fu, T., Gao, W., Zhao, Y., Roohani, Y., Leskovec, J., … & Zitnik, M. (2021). Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548.

Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., & Li, P. (2023). On the stability of expressive positional encodings for graph neural networks. arXiv preprint arXiv:2310.02579.

Jiménez, J., Skalic, M., Martinez-Rosell, G., & De Fabritiis, G. (2018). KDEEP: protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of chemical information and modeling, 58(2), 287-296.

Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.

Liu, H., Dai, Z., So, D., & Le, Q. V. (2021). Pay attention to MLPs. Advances in neural information processing systems, 34, 9204-9215.

Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11976-11986).

Luukkonen, S., Meijer, E., Tricarico, G. A., Hofmans, J., Stouten, P. F., van Westen, G. J., & Lenselink, E. B. (2023). Large-scale modeling of sparse protein kinase activity data. Journal of Chemical Information and Modeling, 63(12), 3688-3696.

Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., & Khudanpur, S. (2010, September). Recurrent neural network based language model. In Interspeech (Vol. 2, No. 3, pp. 1045-1048).

Müller, L., Galkin, M., Morris, C., & Rampášek, L. (2023). Attending to graph transformers. arXiv preprint arXiv:2302.04181.

Oono, K., & Suzuki, T. (2020). Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947.

Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J., & Koes, D. R. (2017). Protein–ligand scoring with convolutional neural networks. Journal of chemical information and modeling, 57(4), 942-957.

Robson, R. (2017). Convolutional neural networks – Basics. MLNotebook. https://mlnotebook.github.io/post/CNN1/

Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5), 742-754.

Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., & Das, P. (2022). Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12), 1256-1264.

Rusch, T. K., Bronstein, M. M., & Mishra, S. (2023). A survey on oversmoothing in graph neural networks. arXiv preprint arXiv:2303.10993.

Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. Advances in neural information processing systems, 30.

Socher, R., Manning, C. D., & Ng, A. Y. (2010, December). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 deep learning and unsupervised feature learning workshop (Vol. 2010, pp. 1-9).

Stepniewska-Dziubinska, M. M., Zielenkiewicz, P., & Siedlecki, P. (2018). Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics, 34(21), 3666-3674.

Sutton, R. (2019). The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … & Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP architecture for vision. Advances in neural information processing systems, 34, 24261-24272.

Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., … & Jégou, H. (2022). ResMLP: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 5314-5321.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. arXiv preprint arXiv:1710.10903.

Vig, J., & Belinkov, Y. (2019). Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284.

Wallach, I., Dzamba, M., & Heifets, A. (2015). AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.

Wallach, I., & The Atomwise AIMS Program. (2024). AI is a viable alternative to high throughput screening: a 318-target study. Scientific Reports, 14, 7526.

Wang, H., Yin, H., Zhang, M., & Li, P. (2022). Equivariant and stable positional encoding for more powerful graph neural networks. arXiv preprint arXiv:2203.00199.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1(1), 67-82.

Wu, X., Ajorlou, A., Wu, Z., & Jadbabaie, A. (2024). Demystifying oversmoothing in attention-based graph neural networks. Advances in Neural Information Processing Systems, 36.

Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., … & Liu, T. Y. (2021). Do transformers really perform badly for graph representation?. Advances in neural information processing systems, 34, 28877-28888.

Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., … & Yan, S. (2022). Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10819-10829).