Search & lookup terms#

Entities and ontologies can be complex, and a single entity often has many different identifiers.

Here we show Bionty’s lookup model for species, genes, proteins and cell markers. You’ll see how to:

  • list the fields of an ontology reference via .fields

  • access the reference table via .df()

  • look up an entity term with autocompletion via .lookup()

  • search for an entity term with free text via .search()

import bionty as bt

.fields: fields of an ontology reference#

gene_bt = bt.Gene()

gene_bt
Gene
Species: human
Source: ensembl, release-110
#terms: 77043

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🪜 Gene.diff(): difference between two versions
🔗 Gene.ontology: Pronto.Ontology object
gene_bt.fields
{'biotype',
 'description',
 'ensembl_gene_id',
 'ncbi_gene_id',
 'symbol',
 'synonyms'}

Fields can also be accessed as attributes, which enables autocompletion. You can pass these attributes, instead of strings, to the field parameter of any Bionty function:

gene_bt.ncbi_gene_id
ncbi_gene_id

.df(): reference table#

Every entity has a reference table, returned as a pandas DataFrame, that contains all the fields.

df = gene_bt.df()
df.head()
   ensembl_gene_id  symbol    ncbi_gene_id  biotype         description                                        synonyms
0  ENSG00000000003  TSPAN6    7105          protein_coding  tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]  TM4SF6|T245|TSPAN-6
1  ENSG00000000005  TNMD      64102         protein_coding  tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]    TEM|MYODULIN|CHM1L|TENDIN|BRICD4
2  ENSG00000000419  DPM1      8813          protein_coding  dolichyl-phosphate mannosyltransferase subunit...  CDGIE|MPDS
3  ENSG00000000457  SCYL3     57147         protein_coding  SCY1 like pseudokinase 3 [Source:HGNC Symbol;A...  PACE-1|PACE1
4  ENSG00000000460  C1orf112  55732         protein_coding  chromosome 1 open reading frame 112 [Source:HG...  FLJ10706|APOLO1|FLIP

To access the records for several gene symbols at once, for example, select the corresponding rows with pandas:

df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
        ensembl_gene_id  ncbi_gene_id  biotype         description                                        synonyms
symbol
LMNA    ENSG00000160789  4000          protein_coding  lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636]       CMD1A|LGMD1B|LMNL1|MADA|LMN1|PRO1|HGPS
LMNA    LRG_254          None          LRG_gene        lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636]       CMD1A|LGMD1B|LMNL1|MADA|LMN1|PRO1|HGPS
TCF7    ENSG00000081059  6932          protein_coding  transcription factor 7 [Source:HGNC Symbol;Acc...  TCF-1
BRCA1   ENSG00000012048  672           protein_coding  BRCA1 DNA repair associated [Source:HGNC Symbo...  BRCC1|FANCS|RNF53|PPP1R53
BRCA1   LRG_292          None          LRG_gene        BRCA1 DNA repair associated [Source:HGNC Symbo...  BRCC1|FANCS|RNF53|PPP1R53

.lookup(): Look up terms and records with autocompletion#

A lookup object lets you find terms with autocompletion.

lookup = gene_bt.lookup()

The lookup object provides a dot accessor (.) for normalized terms (lowercase, containing only alphanumeric characters and underscores):

lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')

To look up terms by their exact original strings, convert the lookup object to a dict and use the bracket accessor ([]), which also supports autocompletion:

lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
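
The two access styles above can be mimicked with the standard library. This is a minimal sketch, not Bionty's implementation: normalize_key and the record layout are hypothetical names introduced for illustration, and the exact normalization rules Bionty applies may differ.

```python
import re
from types import SimpleNamespace


def normalize_key(term: str) -> str:
    """Hypothetical normalizer: lowercase, non-alphanumeric runs become '_'."""
    return re.sub(r"[^0-9a-z]+", "_", term.lower()).strip("_")


# Two records copied from the reference table above, keyed by gene symbol.
records = {
    "TCF7": {"ensembl_gene_id": "ENSG00000081059", "ncbi_gene_id": "6932"},
    "BRCA1": {"ensembl_gene_id": "ENSG00000012048", "ncbi_gene_id": "672"},
}

# Dot access needs valid attribute names, so the keys are normalized.
lookup = SimpleNamespace(**{normalize_key(k): v for k, v in records.items()})

# Bracket access keeps the exact original strings as keys.
lookup_dict = dict(records)

print(lookup.tcf7["ensembl_gene_id"])      # ENSG00000081059
print(lookup_dict["TCF7"] is lookup.tcf7)  # True: both point to the same record
print(normalize_key("TCF-1"))              # tcf_1
```

Keeping both views is a deliberate trade-off: the dot accessor is convenient in an interactive session, while the dict preserves strings such as "TCF-1" that are not valid Python attribute names.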

By default, lookup keys are generated from the entity's name field (the gene symbol in the example above).

You can specify another field to look up:

lookup = gene_bt.lookup(gene_bt.ncbi_gene_id)

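If a key matches multiple entries, they are returned as a list. The key below matches a single record; note its bt_ prefix in the dot accessor, since a Python attribute name cannot start with a digit: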

lookup.bt_100126572
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
lookup_dict = lookup.dict()
lookup_dict["100126572"]
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
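
To make the list behaviour and the bt_ prefix concrete, here is a small sketch using only the standard library. build_lookup and attribute_key are hypothetical helpers written for illustration (they are not Bionty functions), and the prefix rule is inferred from the bt_100126572 key shown above.

```python
def build_lookup(rows, field):
    """Group records by `field`; one match stays a record, several become a list."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[field], []).append(row)
    return {key: hits[0] if len(hits) == 1 else hits for key, hits in grouped.items()}


def attribute_key(value, prefix="bt_"):
    """Prefix keys that start with a digit so they are valid attribute names."""
    return prefix + value if value[:1].isdigit() else value


# Rows copied from the reference table: LMNA has two records, TCF7 has one.
rows = [
    {"symbol": "LMNA", "ensembl_gene_id": "ENSG00000160789"},
    {"symbol": "LMNA", "ensembl_gene_id": "LRG_254"},
    {"symbol": "TCF7", "ensembl_gene_id": "ENSG00000081059"},
]

by_symbol = build_lookup(rows, "symbol")
print(type(by_symbol["LMNA"]).__name__, len(by_symbol["LMNA"]))  # list 2
print(type(by_symbol["TCF7"]).__name__)                          # dict
print(attribute_key("100126572"))                                # bt_100126572
```

Code that consumes a lookup result should therefore handle both shapes, for example by checking isinstance(result, list) before accessing attributes.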

.search(): Search a term against a field#

celltype_bt = bt.CellType()


Matching scores are stored in the __ratio__ column:

celltype_bt.search("cytotoxic T cells").head(3)
                           ontology_id  definition                                         synonyms                                           parents       __ratio__
name
cytotoxic T cell           CL:0000910   A Mature T Cell That Differentiated And Acquir...  cytotoxic T lymphocyte|cytotoxic T-lymphocyte|...  [CL:0000911]  96.969697
obsolete cytotoxic T cell  CL:0000491   Obsolete: A Cell Responsible For Spontaneous C...  None                                               []            76.190476
Tc2 cell                   CL:0000918   A Cd8-Positive, Alpha-Beta Positive T Cell Exp...  Th2 non-TFH CD8-positive T cell|Th2 CD8-positi...  [CL:0000908]  76.190476
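
To give a feel for what a matching score measures, the sketch below ranks a few names with a generic string-similarity ratio from Python's standard library. It is an illustration only: difflib is a stand-in, the scorer Bionty uses internally may differ, and the lowercasing step is an assumption.

```python
from difflib import SequenceMatcher


def ratio(query, term):
    """Case-insensitive similarity on a 0-100 scale (stand-in scorer)."""
    return 100 * SequenceMatcher(None, query.lower(), term.lower()).ratio()


# Names copied from the results in this guide.
names = ["obsolete cytotoxic T cell", "cytotoxic T cell", "nodal myocyte"]

# Score every name against the query and sort from best to worst match.
ranked = sorted(((ratio("cytotoxic T cells", n), n) for n in names), reverse=True)
for score, name in ranked:
    print(f"{score:6.2f}  {name}")
```

For "cytotoxic T cell" the ratio is 2 × 16 matching characters over 33 total characters, about 96.97, which agrees with the first row of the table above.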

By default, search also matches against each of the synonyms:

celltype_bt.search("P cell").head(3)
                                       ontology_id  definition                                         synonyms                                           parents                   __ratio__
name
nodal myocyte                          CL:0002072   A Specialized Cardiac Myocyte In The Sinoatria...  myocytus nodalis|P cell|cardiac pacemaker cell     [CL:0002086]              100.000000
double-positive, alpha-beta thymocyte  CL:0000809   A Thymocyte Expressing The Alpha-Beta T Cell R...  DP cell|DP thymocyte|double-positive, alpha-be...  [CL:0000790]              92.307692
PP cell                                CL:0000696   A Cell That Stores And Secretes Pancreatic Pol...  type F enteroendocrine cell                        [CL:0000167, CL:0000164]  92.307692

You can turn off synonym matching with synonyms_field=None:

celltype_bt.search("P cell", synonyms_field=None).head(3)
          ontology_id  definition                                         synonyms                     parents                   __ratio__
name
PP cell   CL:0000696   A Cell That Stores And Secretes Pancreatic Pol...  type F enteroendocrine cell  [CL:0000167, CL:0000164]  92.307692
cap cell  CL:0000676   None                                               None                         [CL:0000378, CL:0000548]  85.714286
GIP cell  CL:0002278   An Enteroendocrine Cell Of Duodenum And Jejunu...  type K enteroendocrine cell  [CL:0000167, CL:0000164]  85.714286
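
The effect of synonym matching can be sketched as taking the best score over a term's name and each of its pipe-separated synonyms. This is a hedged illustration, not Bionty's code: the search function below, its use_synonyms flag, and the difflib-based scorer are assumptions made for the example.

```python
from difflib import SequenceMatcher


def ratio(query, term):
    """Case-insensitive similarity on a 0-100 scale (stand-in scorer)."""
    return 100 * SequenceMatcher(None, query.lower(), term.lower()).ratio()


def search(query, terms, use_synonyms=True):
    """Rank names by their best score over the name and, optionally, its synonyms."""
    scored = []
    for name, synonyms in terms.items():
        candidates = [name]
        if use_synonyms and synonyms:
            candidates += synonyms.split("|")  # synonyms are pipe-separated
        scored.append((max(ratio(query, c) for c in candidates), name))
    return sorted(scored, reverse=True)


# Names and synonyms copied from the search results above.
terms = {
    "nodal myocyte": "myocytus nodalis|P cell|cardiac pacemaker cell",
    "PP cell": "type F enteroendocrine cell",
    "cap cell": None,
}

print(search("P cell", terms)[0])                         # (100.0, 'nodal myocyte')
print(search("P cell", terms, use_synonyms=False)[0][1])  # PP cell
```

With synonyms enabled, "nodal myocyte" ranks first because one of its synonyms is an exact match; with synonyms disabled, only names are compared and "PP cell" takes the top spot, mirroring the two result tables above.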

Match against another field (default is “name”):

celltype_bt.search("CD8 postive alpha beta T cells", field=celltype_bt.definition).head(
    3
)
                                                                                                 ontology_id  name                                    synonyms                                           parents       __ratio__
definition
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor.                        CL:0000625   CD8-positive, alpha-beta T cell         CD8-positive, alpha-beta T lymphocyte|CD8-posi...  [CL:0000791]  95.081967
A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor.  CL:0000624   CD4-positive, alpha-beta T cell         CD4-positive, alpha-beta T lymphocyte|CD4-posi...  [CL:0000791]  91.803279
A Cd8-Positive, Alpha-Beta T Cell That Has Differentiated Into A Memory T Cell.                  CL:0000909   CD8-positive, alpha-beta memory T cell  CD8-positive, alpha-beta memory T lymphocyte|C...  []            85.294118