#----------------------------------------#
# TileProbe README
#----------------------------------------#

#----------------------------------------#
# 0. Introduction
#----------------------------------------#
TileProbe can be used at two levels. Experienced users can build probe models from their own data or from data downloaded from GEO, provided that an adequate number of samples is available. To do this, first download all the relevant samples for the platform of interest. The samples must be organized into the correct groups (based on similar experimental conditions, as described in the paper) so that the appropriate within-group variances can be accounted for.

In what follows, the data used to train the probe effect model are called the training data, and the data that users want to analyze (e.g. to detect true transcription factor binding sites) are called the data of primary interest. The entire TileProbe procedure includes:

(1) applying the original MAT correction to all arrays, 
(2) building the residual probe effect model with the training data, 
(3) applying the TileProbe correction (either TPM or TPV) to the data of primary interest, using the previously-built model, and 
(4) peak detection. 

For detailed instructions on this procedure, please start from the first section, "Building New Models".

A more likely use of TileProbe is for users to take advantage of our existing models. We have precompiled probe models for several commonly used Affymetrix arrays for which there are enough samples to build the probe effect model robustly. These include the human tiling and promoter arrays, as well as the mouse tiling and promoter arrays. At this level, users only need to supply a list of their data (the data of primary interest). The abbreviated protocol for analyzing these data involves:

(1) applying the original MAT correction to only the data the user supplies (i.e. the data of primary interest), 
(2) applying the TileProbe correction (either TPM or TPV) to the data, using the provided corresponding model, and 
(3) peak detection. 

For detailed instructions on this protocol, see the second section, "Using Existing Models".

#----------------------------------------#
# I. Building New Models
#----------------------------------------#

------------
Quick Start:
	
1. tileprobe_mat mat_arg.txt
2. tileprobe_buildbarmodel_v2 -i trainlist.txt -o modelname
3. a) tileprobe_barnorm -i samplename.bar -o ./ -m modelname
or b) tileprobe_barnorm -i testlist.txt -c 1 -o ./ -m modelname -b 0.0
4. a) tileprobe_peak tileprobe_peak_arg.txt
or b) tilemapv2 tilemapv2_arg.txt


-------------
STEP 1. Applying the MAT correction to all arrays (and converting the .CEL files to .bar format)

Command: tileprobe_mat modelX_mat_arg.txt

modelX_mat_arg.txt (the mat_arg.txt file of the Quick Start): for an example of this argument file, please see the website. Briefly, the file should include:

1.	Project title: filename to label results
2.	CEL directory: where the .CEL files are stored
3.	BPMAP directory: folder name where the bpmap is stored
	NOTE: we used the bpmap used by MAT, which is available at http://liulab.dfci.harvard.edu/MAT/ 
4.	GenomeGrp: which species’ genome (consistent with MAT)
	(e.g. Mm for mouse; Hg for human)
5.	Working directory: folder to export results
6.	Number of Libraries: number of array types
7.	Libraries: name of the bpmap file
8.	Number of samples: the total number of arrays.
9.	Arrays: a list of filenames for all the arrays. Each array should be on a new line, preceded by a line consisting of “>N” where N is the number. Here you can include both the training data and the data of primary interest.
10.	Option to “remove masked cells in the CEL files”: choose 0 for no.
11.	Option to “remove outlier cells in CEL files”: choose 0 for no.


-------------
STEP 2. Building the model using only the training data

Command: tileprobe_buildbarmodel_v2 -i trainlist.txt -o ModelX

(i) The .txt file following the -i option is the list of training datasets (to model the MAT residual probe effect). 

1.	Each array is on a separate line, and includes
	a. A number specifying the appropriate group for the samples (grouped by similar experimental conditions as described in the paper).
	b. the full pathname of the file.
2.	The files are the bar files which are the output of tileprobe_mat. These files contain MAT corrected probe intensities. 
3.	Example:

1	/home/project/abMyc_1.bar
1	/home/project/abMyc_2.bar
2	/home/project/bOct4_1.bar
2	/home/project/bOct4_2.bar
...

(ii)	The file specified following the -o option is the name of the output model.
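
The training list can be generated rather than typed by hand, which helps avoid stray spaces in pathnames. This is a minimal sketch: the helper name write_train_list is ours, the paths are the illustrative ones from the example above, and the tab between the group number and the pathname is an assumption based on that example.

```python
from pathlib import Path


def write_train_list(groups, list_path):
    """Write one "<group><TAB><path>" line per training .bar file.

    groups maps a group number to the .bar files in that group
    (samples from similar experimental conditions). The caller is
    expected to pass full pathnames. The tab separator is an assumption.
    """
    lines = []
    for group_id in sorted(groups):
        for bar_file in groups[group_id]:
            # Strip whitespace so a stray space cannot end up in the path.
            lines.append(f"{group_id}\t{str(bar_file).strip()}")
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    write_train_list(
        {
            1: ["/home/project/abMyc_1.bar", "/home/project/abMyc_2.bar"],
            2: ["/home/project/bOct4_1.bar", "/home/project/bOct4_2.bar"],
        },
        "trainlist.txt",
    )
```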

-------------
STEP 3. Applying the TileProbe correction to only the data of primary interest, using the model from step 2

Command1: tileprobe_barnorm -i sample.bar -o ./ -m ModelX (for 1 sample)
OR
Command2: tileprobe_barnorm -i testlist.txt -c 1 -o ./ -m ModelX -b 0.0 (for a list of samples)

(i)	The file after the -i option is either a single .bar sample to be analyzed, as in Command1, or a .txt file listing the arrays, as in Command2. The .txt list file has one line per array, containing its full pathname.
(ii)	-c: use -c 1 if the input is a .txt list. Otherwise, leave it out.
(iii)	-o: specifies the output pathname. Here, ./ represents the current directory.
(iv)	The file specified after the -m option represents the model built in step 2.
(v)	-b: use -b 0.0 for TPV; -b 1 for TPM.
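
The options above can be assembled programmatically. The sketch below only uses the flags described in this step (-i, -c, -o, -m, -b). The helper names barnorm_command and write_test_list are ours, and the -b flag is omitted unless a method is requested, as in Command1.

```python
from pathlib import Path

# -b values as listed in option (v): 0.0 selects TPV, 1 selects TPM.
B_VALUES = {"TPV": "0.0", "TPM": "1"}


def write_test_list(bar_files, list_path):
    """Write one full pathname per line, as expected for a .txt list."""
    Path(list_path).write_text("".join(f"{p}\n" for p in bar_files))


def barnorm_command(input_path, model, out_dir="./", is_list=False, method=None):
    """Build the tileprobe_barnorm argument list for one sample or a list."""
    cmd = ["tileprobe_barnorm", "-i", str(input_path)]
    if is_list:
        cmd += ["-c", "1"]  # only needed when the input is a .txt list
    cmd += ["-o", out_dir, "-m", model]
    if method is not None:
        cmd += ["-b", B_VALUES[method]]  # raises KeyError for unknown methods
    return cmd
```

For example, barnorm_command("testlist.txt", "ModelX", is_list=True, method="TPV") reproduces the arguments of Command2. The returned list can be passed to subprocess.run(cmd, check=True).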

-------------
STEP 4. Peak Detection

Command1 (MAT peak calling procedure): tileprobe_peak tileprobe_peak_arg.txt
OR
Command2 (TileMap peak calling procedure): tilemapv2 tilemapv2_arg.txt

For Command1:
tileprobe_peak_arg.txt: for an example of this argument file, please see the website. Briefly, the file should include:

1.	Output path: folder to export results
2.	Bandwidth: distance to extend on either side of the probe. TileProbe collects all probes within this sliding window and computes the MAT score for the probe. MAT uses 300.
3.	MaxGap: maximum distance between peaks before they are merged 
4.	MinProbe: minimum number of significant probes within a peak before it is dropped
5.	Var: variance from the control samples, as defined by the original MAT peak detection procedure. MAT uses 0 as default.
6.	Cutoff: arbitrary MAT score cutoff. MAT scores larger than this number form regions, which are then assigned FDRs.
7.	IP: immunoprecipitation samples. First line is IP=N where N is the number of IP samples. It is followed by a line for each .bar file, containing the entire pathname of the file.
8.	CT: control samples. First line is CT=N where N is the number of control samples. It is followed by a line for each .bar file, containing the entire pathname of the file.
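
The IP and CT blocks (items 7 and 8) are easy to get out of sync with the actual number of files. The sketch below generates only these two blocks, with the count derived from the file list. The helper name sample_block is ours, and the exact spelling of the remaining fields is not shown here and should be taken from the example file on the website.

```python
def sample_block(label, bar_files):
    """Return the lines of an IP or CT block: "<label>=N", then N pathnames."""
    if label not in ("IP", "CT"):
        raise ValueError("label must be 'IP' or 'CT'")
    # N is computed from the list, so the count always matches the files.
    return [f"{label}={len(bar_files)}"] + [str(p) for p in bar_files]


if __name__ == "__main__":
    # Illustrative paths only.
    for line in sample_block("IP", ["/home/project/ip_1.bar", "/home/project/ip_2.bar"]):
        print(line)
    for line in sample_block("CT", ["/home/project/ct_1.bar"]):
        print(line)
```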
 

For Command2: 
tilemapv2_arg.txt: for an example of this argument file, please see http://www.biostat.jhsph.edu/~hji/cisgenome/index_files/sample_tilemap_arg.txt. Briefly, the file should include:

1.	Comparison Type: one, two, or multiple samples to be compared
2.	Working directory: pathname to export results
3.	Project title: filename to label results
4.	Number of libraries: number of array types
5.	Number of samples: sum of IP + CT samples (Affymetrix) or sum of IP/CT pairs (NimbleGen or Agilent) 
6.	Number of groups: should be consistent with comparison type
7.	Data: each group of samples starts with a line [N] ->[Alias] where N is the group id and Alias is the name for the sample. This is followed by [M] lines that contain array file names, where M is the number of libraries. Files should be ordered according to the order of libraries.
8.	Patterns of interest: relates to comparison type
9.	Masking bad data points: option to remove outliers
10.	Truncation lower bound: lower bound for truncation
11.	Truncation upper bound: upper bound for truncation
12.	Transformation: option to transform by identity, log2, logit, or exp(x)/(1+exp(x))
13.	Common variance groups: specify the number of groups thought to have different variances. Set to 1 if all groups are thought to share the same variance.
14.	Method to combine neighboring probes: choose either HMM or MA
15.	Method to compute FDR: choose to estimate from left tail (most common), permutation test, UMS, or none.
16.	Half window size: used to determine the number of probes to derive the MA statistic
17.	Window boundary: maximum distance in base pairs for a probe to be included in MA statistic
18.	Standardization of MA statistic: yes or no
19.	Region boundary cutoff: MA statistic threshold for peak detection
20.	Expected hybridization length: expected number of probes within an average peak
21.	Posterior probability cutoff value: HMM cutoff
22.	G0 selection criteria: p for UMS computation. See TileMap paper.
23.	G1 selection criteria: q for UMS computation. See TileMap paper.
24.	Selection offset: set = 1 for HMM; W+1 for MA
25.	Grid size: number of bins to group probe in UMS. See TileMap paper.
26.	Number of permutations: if permutation is used to compute FDR, specify the number of permutations.
27.	Number of exchangeable groups: if permutation is used to compute FDR, specify the number of exchangeable groups.
28.	Maximum gaps within a region: before the peaks are merged
29.	Maximum number of insignificant probes within a region: before the peaks are split
30.	Minimum length of a peak: before it is dropped
31.	Minimum number of significant probes within a peak: before it is dropped
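
For reference, the four transformations named in item 12 can be written out as follows. This sketch only illustrates the mathematics. How tilemapv2 encodes the choice in the argument file is not described here, so see the sample argument file for that. The function names are ours.

```python
import math


def identity(x):
    """Leave the value unchanged."""
    return x


def log2(x):
    """Base-2 logarithm; defined for x > 0."""
    return math.log2(x)


def logit(x):
    """log(x / (1 - x)); defined for 0 < x < 1."""
    return math.log(x / (1.0 - x))


def inverse_logit(x):
    """exp(x) / (1 + exp(x)); maps any real x into (0, 1)."""
    return math.exp(x) / (1.0 + math.exp(x))
```

The last two are inverses of each other: logit maps values in (0, 1) to the real line, and exp(x)/(1+exp(x)) maps them back.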


#----------------------------------------#
# II. Using Existing Models
#----------------------------------------#

-------------
Quick Start:
1. tileprobe_mat testdata_mat_arg.txt
2. a) tileprobe_barnorm -i samplename.bar -o ./ -m modelname
or b) tileprobe_barnorm -i testlist.txt -c 1 -o ./ -m modelname -b 0.0
3. a) tileprobe_peak tileprobe_peak_arg.txt
or b) tilemapv2 tilemapv2_arg.txt

-------------
STEP 1. Applying the MAT correction to the data to be analyzed

Command: tileprobe_mat testdata_mat_arg.txt

testdata_mat_arg.txt: contains data to be analyzed. See "Building New Models" STEP 1 for the file format. 

-------------
STEP 2. Applying the TileProbe correction to the data to be analyzed, using the corresponding model provided

Command1: tileprobe_barnorm -i sample.bar -o ./ -m ModelX (for 1 sample)
OR
Command2: tileprobe_barnorm -i testlist.txt -c 1 -o ./ -m ModelX -b 0.0 (for a list of samples)

This is the same as STEP 3 from the "Building New Models" instructions.

-------------
STEP 3. Peak Detection

Command1: tileprobe_peak tileprobe_peak_arg.txt
OR
Command2: tilemapv2 tilemapv2_arg.txt

This is the same as STEP 4 from the "Building New Models" instructions.