Incorporating LLM Embeddings for Variation Across the Human Genome

arXiv:2509.20702v2 (replace-cross)

Abstract: Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus on gene-level information. We present one of the first systematic frameworks for generating genetic variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we construct functional text descriptions for 8.9 billion possible variants and generate embeddings at three scales: 1.5 million HapMap3/MEGA variants, 90 million imputed UK Biobank (UKB) variants, and all 9 billion possible variants. Embeddings were produced with general-purpose models, including OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B. Baseline quality-control experiments demonstrate high predictive accuracy for variant-level properties, validating the embeddings as structured representations of genomic variation. We further apply them to embedding-augmented genetic risk prediction, demonstrating the utility of LLM embeddings in polygenic risk score (PRS)-style prediction on UK Biobank cohort data. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.
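The pipeline described above, turning curated annotations into a functional text description and then embedding that text, can be sketched as follows. This is a minimal illustration, not the paper's actual code: the field names, description template, and annotation values are hypothetical, and the embedding call is shown only as a comment.

```python
# Hypothetical sketch of variant-to-text construction. The dict keys
# (chrom, pos, ref, alt, gene, consequence, clinvar) and the phrasing of
# the description are assumptions, not the authors' actual schema.
def variant_description(variant):
    """Compose a functional text description for one variant from
    curated annotation fields (e.g. drawn from FAVOR, ClinVar,
    or the GWAS Catalog)."""
    parts = [
        f"Variant {variant['chrom']}:{variant['pos']} "
        f"{variant['ref']}>{variant['alt']}"
    ]
    if variant.get("gene"):
        parts.append(f"located in gene {variant['gene']}")
    if variant.get("consequence"):
        parts.append(f"predicted consequence: {variant['consequence']}")
    if variant.get("clinvar"):
        parts.append(f"ClinVar significance: {variant['clinvar']}")
    return "; ".join(parts)

# Example variant record (illustrative values only).
v = {"chrom": "7", "pos": 117559590, "ref": "CTT", "alt": "C",
     "gene": "CFTR", "consequence": "inframe deletion",
     "clinvar": "pathogenic"}
text = variant_description(v)

# The description would then be passed to a general-purpose embedding
# model, e.g. (not executed here):
#   client.embeddings.create(model="text-embedding-3-large", input=text)
```

At genome scale (billions of variants), descriptions like this would be batched and embedded with either the OpenAI API or a locally hosted open-source model such as Qwen3-Embedding-0.6B, as the abstract describes.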
