# protr

## Introduction

The protr package offers a comprehensive toolkit for protein sequence descriptor calculation and similarity computation. The descriptors included in protr are widely used in bioinformatics and chemogenomics research. protr is developed by the Computational Biology and Drug Design (CBDD) Group, Central South University ([http://cbdd.csu.edu.cn/](http://cbdd.csu.edu.cn/)).

## Package Features

### Qualitative Descriptors

The commonly used qualitative protein sequence features implemented in protr include:

  * Amino Acid Composition
  
    * Amino Acid Composition
    * Dipeptide Composition
    * Tripeptide Composition

  * Autocorrelation
  
    * Normalized Moreau-Broto Autocorrelation
    * Moran Autocorrelation
    * Geary Autocorrelation

  * CTD
  
    * Composition
    * Transition
    * Distribution

  * Conjoint Triad

  * Quasi-sequence-order Descriptors
  
    * Sequence-order-coupling Number
    * Quasi-sequence-order Descriptors
  
  * Pseudo Amino Acid Composition
  
    * Pseudo Amino Acid Composition
    * Amphiphilic Pseudo Amino Acid Composition
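
To make the simplest of these descriptors concrete, here is a minimal Python sketch of what amino acid composition and dipeptide composition compute. It is an illustration only, not protr's implementation or API: the function names `amino_acid_composition` and `dipeptide_composition` are hypothetical, and the sketch assumes the input contains only the 20 standard one-letter residue codes.

```python
from itertools import product

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 values)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the fraction of each ordered residue pair among adjacent positions (400 values)."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    total = len(seq) - 1  # number of adjacent pairs
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are ignored
            counts[pair] += 1
    return {pair: c / total for pair, c in counts.items()}
```

For the toy sequence `"ACAC"`, the amino acid composition assigns 0.5 to both `A` and `C`, and the dipeptide composition assigns 2/3 to `AC` and 1/3 to `CA`; all other entries are zero.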

### Quantitative Descriptors

The quantitative descriptors implemented in protr, commonly used in proteochemometric (PCM) modeling, include:

  * Generalized Scales-Based Descriptors derived by Principal Components Analysis

    * Generalized Scales-Based Descriptors derived by AA-Properties (AAindex)
    * Generalized Scales-Based Descriptors derived by 20+ classes of 2D and 3D Molecular Descriptors (Topological, WHIM, VHSE, etc.)

  * Generalized Scales-Based Descriptors derived by Factor Analysis

  * Generalized Scales-Based Descriptors derived by Multidimensional Scaling
  
  * Generalized BLOSUM and PAM Matrix-Derived Descriptors

### Parallelized Sequence Alignment Similarity Computation

Local and global sequence alignment for protein sequences:

  * Between two protein sequences
  * Parallelized pairwise similarity calculation with a list of protein sequences
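
As a rough illustration of alignment-based similarity, the Python sketch below scores a global alignment with a Needleman-Wunsch recurrence and builds a pairwise similarity matrix from a list of sequences. Several things here are assumptions made for the example and are not taken from protr: the function names, the simple match/mismatch/gap scores (real tools typically use a substitution matrix such as BLOSUM), the normalization by the geometric mean of the self-alignment scores, and the serial loop (no parallelization is shown).

```python
from math import sqrt


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_similarity(seqs):
    """Symmetric similarity matrix: score(i, j) / sqrt(score(i, i) * score(j, j))."""
    # Self-alignment scores are positive for non-empty sequences when match > 0.
    self_scores = [global_alignment_score(s, s) for s in seqs]
    n = len(seqs)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = global_alignment_score(seqs[i], seqs[j])
            sim[i][j] = sim[j][i] = score / sqrt(self_scores[i] * self_scores[j])
    return sim
```

Because each pair `(i, j)` is scored independently, the inner loop is the natural place to distribute work across processes when the list of sequences is long.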

### Parallelized Gene Ontology (GO) Semantic Similarity Computation

GO semantic similarity measures:

  * Between two groups of GO terms / two Entrez Gene IDs
  * Parallelized pairwise similarity calculation with a list of GO terms / Entrez Gene IDs

### Miscellaneous Tools and Datasets

  * Retrieve protein sequences from UniProt
  
  * Read protein sequences in FASTA format

  * Read protein sequences in PDB format
  
  * Sanity check of the amino acid types appearing in the protein sequences
  
  * Protein sequence segmentation

  * Auto Cross Covariance (ACC) for generating scales-based descriptors of the same length

  * 20+ Pre-computed 2D and 3D Descriptor Sets for the 20 Amino Acids to use with the Generalized Scales-Based Descriptors

  * BLOSUM and PAM Matrices for the 20 Amino Acids

  * Meta Information of the 20 Amino Acids
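
Two of these utilities, reading FASTA input and checking amino acid types, are simple enough to sketch in a few lines of Python. This is a hedged illustration of the idea rather than protr's own code: `parse_fasta` and `is_standard_protein` are hypothetical names, and the check assumes that "valid" means every residue is one of the 20 standard one-letter codes.

```python
# The 20 standard amino acids as one-letter codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header line to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


def is_standard_protein(seq):
    """Return True if seq is non-empty and contains only the 20 standard residues."""
    return len(seq) > 0 and set(seq) <= STANDARD_RESIDUES
```

Running such a check before computing descriptors is useful because sequences containing ambiguous or non-standard codes (for example `X` or `B`) do not map cleanly onto descriptors defined over the 20 standard amino acids.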

## Web Service

ProtrWeb, the web service built on protr, is located at:

[http://cbdd.csu.edu.cn:8080/protrweb/](http://cbdd.csu.edu.cn:8080/protrweb/)

ProtrWeb requires no programming knowledge: it is a user-friendly, one-click online platform for computing the protein features provided by the protr package.

## Links

  * CRAN Page: http://cran.r-project.org/web/packages/protr/

  * Track Development: https://github.com/road2stat/protr

  * Report Bugs: https://github.com/road2stat/protr/issues

## Authors

  * Nan Xiao <road2stat@gmail.com>

  * Qing-Song Xu <dasongxu@gmail.com>

  * Dong-Sheng Cao <oriental-cds@163.com>

## Publication

  * (to appear)
