# protr

## Introduction

The protr package is a comprehensive toolkit for generating numerical representations of protein sequences. The descriptors it provides are widely used in bioinformatics and chemogenomics research.

## Package Description

### Commonly used descriptors

  * Amino acid composition
  
    * Amino acid composition
    * Dipeptide composition
    * Tripeptide composition

  * Autocorrelation
  
    * Normalized Moreau-Broto autocorrelation
    * Moran autocorrelation
    * Geary autocorrelation

  * CTD
  
    * Composition
    * Transition
    * Distribution

  * Conjoint Triad

  * Quasi-sequence-order descriptors
  
    * Sequence-order-coupling number
    * Quasi-sequence-order descriptors
  
  * Pseudo amino acid composition
  
    * Pseudo amino acid composition
    * Amphiphilic pseudo amino acid composition

  * Profile-based descriptors

    * Profile-based descriptors derived from PSSM (Position-Specific Scoring Matrix)
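
To make the simplest descriptors in the list above concrete, here is a minimal sketch of amino acid composition and dipeptide composition. It is written in plain Python for illustration only and is not protr's R implementation; the function names and the toy sequence are assumptions introduced for this example.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 values)."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the fraction of each ordered residue pair among the
    len(seq) - 1 adjacent pairs in seq (400 values)."""
    total = len(seq) - 1
    pairs = Counter(seq[i:i + 2] for i in range(total))
    return {a + b: pairs[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}


seq = "MKVLAAGIVK"  # hypothetical toy sequence, not real data
aac = amino_acid_composition(seq)
dc = dipeptide_composition(seq)
print(len(aac), len(dc))
```

Both functions return a fixed-length vector regardless of sequence length, which is what allows sequences of different lengths to be compared or used as model inputs.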

### Proteochemometric (PCM) modeling descriptors

  * Generalized scales-based descriptors derived by principal components analysis

    * Generalized scales-based descriptors derived from amino acid properties (AAindex)
    * Generalized scales-based descriptors derived from 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.)

  * Generalized scales-based descriptors derived by factor analysis

  * Generalized scales-based descriptors derived by multidimensional scaling
  
  * Generalized BLOSUM and PAM matrix-derived descriptors

### Similarity Computation

Local and global pairwise sequence alignment for protein sequences:

  * Between two protein sequences
  * Parallelized pairwise similarity calculation with a list of protein sequences
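
As a rough illustration of what a global pairwise alignment score is, the sketch below implements the Needleman-Wunsch recurrence with a simple match/mismatch/gap scheme and builds a pairwise score matrix serially. This is a conceptual Python example, not the method used by protr: the scoring values, function names, and sequences are all assumptions, and a practical tool would normally use a substitution matrix such as BLOSUM and could distribute the pairs across workers.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Only the score is computed (no traceback), so two rows of the
    dynamic-programming table are enough.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                            prev[j] + gap,               # gap in b
                            curr[j - 1] + gap))          # gap in a
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every pair in seqs; the result is a symmetric matrix."""
    n = len(seqs)
    return [[global_alignment_score(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]


seqs = ["ACD", "AD", "ACDE"]  # hypothetical toy sequences
scores = pairwise_scores(seqs)
print(scores)
```

Each pair is scored independently of the others, which is why this kind of computation parallelizes naturally over a list of sequences.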

GO semantic similarity measures:

  * Between two groups of GO terms / two Entrez Gene IDs
  * Parallelized pairwise similarity calculation with a list of GO terms / Entrez Gene IDs

### Miscellaneous tools and datasets

  * Retrieve protein sequences from UniProt
  
  * Read protein sequences in FASTA format

  * Read protein sequences in PDB format
  
  * Sanity check of the amino acid types appearing in the protein sequences
  
  * Protein sequence segmentation

  * Auto cross covariance (ACC) for generating scales-based descriptors of the same length

  * 20+ pre-computed 2D and 3D descriptor sets for the 20 amino acids to use with the generalized scales-based descriptors

  * BLOSUM and PAM matrices for the 20 amino acids

  * Meta information of the 20 amino acids
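
Two of the utilities listed above, reading FASTA input and checking that a sequence contains only the 20 standard amino acid types, can be sketched as follows. This is an illustrative Python example under stated assumptions, not protr's R code: the function names and the FASTA records are hypothetical.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence.

    Sequence lines belonging to one record are concatenated; blank lines
    and any lines before the first header are ignored.
    """
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def is_standard_protein(seq):
    """True if seq is non-empty and uses only the 20 standard residues."""
    return len(seq) > 0 and set(seq) <= STANDARD


# Hypothetical records, not real proteins.
fasta_text = """>seq1 hypothetical example
MKVLAAGIVK
LLDE
>seq2 contains a non-standard symbol
MKXLA
"""
records = parse_fasta(fasta_text)
for name, seq in records.items():
    print(name, is_standard_protein(seq))
```

Running such a check before computing descriptors avoids undefined results for sequences that contain ambiguous or non-standard symbols such as `X`.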

## Web Server

ProtrWeb, the web server built on protr, is located at:

[http://cbdd.csu.edu.cn:8080/protrweb/](http://cbdd.csu.edu.cn:8080/protrweb/)

ProtrWeb does not require any knowledge of R programming. It is a user-friendly online platform for computing the descriptors provided by the protr package.

## Links

  * CRAN page: http://cran.r-project.org/web/packages/protr/

  * Track development: https://github.com/road2stat/protr/

  * Bug report: https://github.com/road2stat/protr/issues/

## Authors

  * Nan Xiao <road2stat@gmail.com>

  * Qing-Song Xu <dasongxu@gmail.com>

  * Dong-Sheng Cao <oriental-cds@163.com>

## Publication

  * (to appear)
