Last updated: 2020-02-05

In this vignette we load the data, pre-process it, and analyse the top gene ranked by the p_beta parameter.


Here we load the raw dataset, filter it by cell and individual parameters, and order the genes by p_beta, smallest first.

library(dplyr) # provides left_join()

# Raw counts, the combined cell/individual filter, and the mean table
raw_data <- read.table("D:/Uchicago/Thesis/Real_data/scqtl-counts/scqtl-counts.txt", header = TRUE, sep = "", dec = ".")
combined_filter <- read.table("D:/Uchicago/Thesis/Real_data/combined_filter.txt", header = FALSE, sep = ",", dec = ".")
combined_filter[, 1] <- combined_filter[, 1] + 2 # adjusting index from python to R
mean_txt_file <- read.table("D:/Uchicago/Thesis/Real_data/mean/mean.txt", header = TRUE, sep = "", dec = ".", row.names = NULL)
# Join the gene-level table, sort by p_beta (smallest first), then drop the last column
data <- left_join(data, gene_filter, by = "gene")
data <- data[order(data$p_beta), ]
clean_data <- data[, 1:(ncol(data) - 1)]
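
The join, sort and column-drop steps above can be tried on a small made-up table. This is only a sketch: the gene names, cell columns and p_beta values below are invented for illustration, and it assumes that p_beta arrives through the joined table and therefore ends up as the last column.

```r
library(dplyr)

# Toy count table: one row per gene, one column per cell (values are made up)
counts <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  NA18501.cell1 = c(3, 0, 7),
  NA18501.cell2 = c(1, 2, 5),
  stringsAsFactors = FALSE
)

# Toy gene-level table carrying the ranking statistic (values are made up)
gene_filter <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  p_beta = c(0.20, 0.001, 0.05),
  stringsAsFactors = FALSE
)

# Attach p_beta to every gene, sort by it, then drop it again
joined <- left_join(counts, gene_filter, by = "gene")
joined <- joined[order(joined$p_beta), ]
clean <- joined[, 1:(ncol(joined) - 1)]

# The first row is now the gene with the smallest p_beta
top_gene <- clean$gene[1]
print(top_gene)
```

Dropping the last column by position only works while p_beta is the single column added by the join; selecting it by name would be more robust if the gene-level table grows.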

Maximum likelihood estimation for one gene

Below we take the gene with the smallest p_beta and estimate the parameters for all individuals that passed quality control.

est <- matrix(NA, nrow = (length(unique(data_ind)) - 1), ncol = 3)
info <- matrix(NA, nrow = (length(unique(data_ind)) - 1), ncol = 2)
for (i in unique(data_ind)) {
  if (i == "NA18498") {
    x <- full_data[, grepl(i, names(full_data))]
hist(log(estimates[,1]), main="Distribution of k_on estimates", xlab="", ylab="Number of individuals")
hist(log(estimates[,2]), main="Distribution of k_off estimates", xlab="", ylab="Number of individuals")
hist(log(estimates[,3]), main="Distribution of k_r estimates", xlab="", ylab="Number of individuals")
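
For readers who want to experiment with the estimation step, here is a minimal sketch of maximum likelihood for one individual. It assumes the standard beta-Poisson model (p ~ Beta(k_on, k_off), x | p ~ Poisson(k_r * p)), which may differ from the routine used for the results in this vignette. The functions `bp_negloglik` and `fit_bp` are hypothetical names, the integral over p is approximated by averaging over Beta quantiles, and the data are simulated.

```r
# Negative marginal log-likelihood of the beta-Poisson model (assumed here):
#   p ~ Beta(k_on, k_off),  x | p ~ Poisson(k_r * p)
# The integral over p is approximated by averaging over m Beta quantiles.
bp_negloglik <- function(log_par, x, m = 200) {
  par <- exp(log_par) # optimise on the log scale so parameters stay positive
  p <- qbeta((seq_len(m) - 0.5) / m, par[1], par[2])
  # rows = observations, columns = quadrature nodes
  lik <- rowMeans(outer(x, par[3] * p, dpois))
  -sum(log(pmax(lik, .Machine$double.xmin)))
}

# Fit (k_on, k_off, k_r) for one vector of counts with Nelder-Mead
fit_bp <- function(x) {
  start <- log(c(1, 1, max(x) + 1))
  fit <- optim(start, bp_negloglik, x = x, method = "Nelder-Mead")
  est <- exp(fit$par)
  names(est) <- c("k_on", "k_off", "k_r")
  list(estimate = est, loglik = -fit$value, convergence = fit$convergence)
}

# Simulated counts standing in for one individual's cells (not real data)
set.seed(1)
p_true <- rbeta(300, 2, 3)
x_sim <- rpois(300, 50 * p_true)

fit <- fit_bp(x_sim)
print(fit$estimate)
```

Running `fit_bp` once per individual and stacking the `estimate` vectors row-wise would give a matrix with one row per individual and three columns, the shape the histograms above expect.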

Distribution for one individual

Next we take the individual with the most cells (NA18501) and look at the estimated parameters:

The first plot is the real protein level distribution for individual NA18501. The second and third plots are the beta and Poisson distributions with the estimated \(k_{on}\), \(k_{off}\) and \(k_r\) parameters.
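
A three-panel figure of this kind can be sketched as follows. The parameter values are placeholders rather than the estimates for NA18501, the counts are simulated, and the Poisson rate is assumed to be \(k_r\) times the mean of the beta distribution, which may not match the choice made for the original figure.

```r
# Placeholder parameter values, not estimates from the real data
k_on <- 2
k_off <- 3
k_r <- 50

# Simulated counts standing in for one individual's cells
set.seed(1)
counts <- rpois(300, k_r * rbeta(300, k_on, k_off))

# Assumed Poisson rate: k_r times the Beta mean k_on / (k_on + k_off)
lambda <- k_r * k_on / (k_on + k_off)
xs <- 0:max(counts)

op <- par(mfrow = c(1, 3))
hist(counts, main = "Observed counts", xlab = "Count", ylab = "Number of cells")
curve(dbeta(x, k_on, k_off), from = 0, to = 1,
      main = "Beta(k_on, k_off)", xlab = "p", ylab = "Density")
plot(xs, dpois(xs, lambda), type = "h",
     main = "Poisson(lambda)", xlab = "Count", ylab = "Probability")
par(op)
```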

Boxplots by genotype

We can get the genotype of the individuals and compare it to the estimated \(k_{on}\), \(k_{off}\) and \(k_{r}\) parameters.
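
The comparison can be drawn as one boxplot per parameter, grouped by genotype. The sketch below uses an invented table, assumes genotype is coded as 0, 1 or 2, and the data frame `toy` is a hypothetical stand-in for the real estimates.

```r
# Toy table: one row per individual (all values are made up)
toy <- data.frame(
  genotype = factor(c(0, 0, 0, 1, 1, 1, 2, 2, 2)),
  k_on  = c(1.1, 0.9, 1.0, 1.4, 1.6, 1.5, 2.1, 1.9, 2.0),
  k_off = c(3.0, 3.2, 2.9, 3.1, 2.8, 3.0, 3.3, 2.9, 3.1),
  k_r   = c(40, 42, 41, 50, 52, 49, 60, 61, 58)
)

# One panel per parameter; boxplot() invisibly returns the plotted statistics
op <- par(mfrow = c(1, 3))
bp <- lapply(c("k_on", "k_off", "k_r"), function(p) {
  boxplot(toy[[p]] ~ toy$genotype, main = p,
          xlab = "Genotype", ylab = "Estimate")
})
par(op)
```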

Finally, we can fit a regression to estimate the linear relationship between genotype and the estimated parameters.
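
As a minimal sketch of such a fit, the example below regresses one parameter on genotype with `lm`. The values are invented, genotype is assumed to be coded numerically as 0, 1 or 2, and treating the parameter as the response is an assumption about the direction of the model.

```r
# Toy table: one row per individual (all values are made up)
toy <- data.frame(
  genotype = c(0, 0, 1, 1, 2, 2),
  k_r      = c(10, 12, 20, 22, 30, 32)
)

# Linear model: estimated parameter as a function of genotype
fit <- lm(k_r ~ genotype, data = toy)

# The slope is the change in k_r per unit change in genotype
slope <- coef(fit)[["genotype"]]
print(summary(fit)$coefficients)
```

The same call can be repeated for \(k_{on}\) and \(k_{off}\), possibly on the log scale given how skewed the estimates tend to be.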

