Knowledge Base: where you should look before asking

06-01-2016
  • Radhouane Aniba created a KB record : Detailed traceback information (text format)
    This is one of the standard library utilities that is not given the attention it deserves. cgitb gives you detailed information about an error, and I find it extremely helpful
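    The record's own snippet is not reproduced in this feed; as a minimal sketch of the idea, cgitb can be enabled as a global hook or used to format a single exception (note that cgitb was removed from the standard library in Python 3.13):

```python
import cgitb
import sys

# Enable detailed, plain-text tracebacks for uncaught exceptions.
cgitb.enable(format="text")

def divide(a, b):
    return a / b

# You can also format an exception yourself without the global hook:
try:
    divide(1, 0)
except ZeroDivisionError:
    report = cgitb.text(sys.exc_info())
    print(report)  # includes the values of local variables in each frame
```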

    Python Analysis Tools
    Number of solutions Radhouane accepted : 0 out of 0

04-19-2016
  • Radhouane Aniba created a KB record : Checking if modules exist before importing them in Python
    Every time I write something in Python, I like to make it possible for people to get an informative message when they fail to run a tool because they are missing a dependency.

    To do this I use try/except with the `ImportError` exception, like so : 
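    A small self-contained sketch of the pattern (the helper name and message are illustrative, not from the original record):

```python
import importlib
import sys

def require(module_name, hint):
    """Import module_name, or exit with an informative message instead of a
    bare ImportError traceback."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        sys.exit(f"Missing dependency '{module_name}'. {hint}")

# json ships with Python, so this succeeds; a missing module would exit cleanly.
json = require("json", "It is part of the standard library.")
```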

    Python Generalities
    Number of solutions Radhouane accepted : 0 out of 0

02-03-2016
  • Radhouane Aniba created a KB record : ggplot style in matplotlib
    Another example of rendering plots on CodersCrowd, this time using matplotlib to mimic a plot generated by R's ggplot
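    The record's plot code is not shown here; as a minimal sketch, matplotlib ships a built-in 'ggplot' style that mimics ggplot2's defaults:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("ggplot")  # built-in style mimicking R's ggplot2 look

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.legend()

out_path = os.path.join(tempfile.gettempdir(), "ggplot_style_demo.png")
fig.savefig(out_path)
```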

    Python Visualization tools
    Number of solutions Radhouane accepted : 0 out of 0

02-02-2016
  • Radhouane Aniba created a KB record : Testing Scikit-Learn in a container
    This is an example of running a linear regression inside a container, using an example published on the scikit-learn website
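    The container setup is not reproduced in this feed; as a minimal sketch of the regression itself (the tiny synthetic dataset is made up, not the scikit-learn example referenced above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: y = 3x + 1 with no noise.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```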

    Python Machine Learning
    Number of solutions Radhouane accepted : 1 out of 1

07-24-2015
  • shyam Ingle created a KB record : calculate primer temperature
    def check_Tm(seq):
        """Primer melting temperature by the Wallace rule: Tm = 2*(A+T) + 4*(G+C)."""
        seq = seq.upper()
        at = seq.count("A") + seq.count("T")
        gc = seq.count("G") + seq.count("C")
        tm = (at * 2) + (gc * 4)
        print("Tm:", tm)
        return tm

    seq = input("Enter Seq: ")
    check_Tm(seq)


    Python Next Generation Sequencing
    Number of solutions shyam accepted : 0 out of 1

07-17-2015
  • Pranavathiyani G created a KB record : Printing statements
    First code in perl. Trying to print statements

    Perl Generalities
    Number of solutions Pranavathiyani accepted : 0 out of 0

07-14-2015
  • Radhouane Aniba created a KB record : Serving proxy logging to ruffus tasks
    This is an example where we use ruffus proxy logging to share a log between different tasks. Of course we could use the traditional logging machinery, but it is not recommended to pass the logging object itself between jobs; proxy logging comes in handy here and keeps the log synced between processes.

    Python Analysis Tools
    Number of solutions Radhouane accepted : 0 out of 0

07-13-2015
  • Radhouane Aniba created a KB record : Display git log tree in terminal
    This is a quick way of displaying the git log as a nice tree in the terminal.
    I will be cloning the Google Genomics repo on GitHub as an example


    Shell Generalities
    Number of solutions Radhouane accepted : 0 out of 0

07-08-2015
  • Radhouane Aniba created a KB record : Disk Usage alert utility in python
    This is a small tool I added lately on GitHub. I had to monitor disk usage for our group members on our servers, so I developed a small cron job that sends an alert when a 2.5 T cap is hit by one of the team members. I could have done this with a simple shell script, but it wouldn't be fun.
    The code is hosted on GitHub as well if you want to contribute: https://github.com/radaniba/QuotaWatcher
    The code imports its logging setup from another Python script:

    import logging
    import datetime
    
    def init_log():
        current_time = datetime.datetime.now()
        logger = logging.getLogger(__name__)  # the module's name, not the literal string '__name__'
        logger.setLevel(logging.INFO)
        handler = logging.FileHandler(current_time.isoformat()+'_quotawatcher.log')
        handler.setLevel(logging.INFO)
        # create a logging format
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        return logger
    
    Python Generalities
    Number of solutions Radhouane accepted : 0 out of 0

04-10-2015
  • Sebastian Raschka created a KB record : Shuffling multiple NumPy arrays in unison
    A function to shuffle multiple NumPy arrays in unison. One typical use case could be the shuffling of training data and the corresponding class labels prior to feeding data to a machine learning algorithm based on a stochastic gradient descent optimization algorithm.
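    Sebastian's original function is not shown in this feed; a sketch of the idea, applying one shared permutation to every array (function name and seed handling are illustrative):

```python
import numpy as np

def shuffle_in_unison(*arrays, seed=None):
    """Shuffle several arrays with the same permutation along axis 0,
    so e.g. training samples and their class labels stay paired."""
    lengths = {len(a) for a in arrays}
    if len(lengths) != 1:
        raise ValueError("all arrays must have the same length")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(lengths.pop())
    return tuple(a[perm] for a in arrays)

X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([1, 2, 3])
Xs, ys = shuffle_in_unison(X, y, seed=0)
```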

    Python Machine Learning
    Number of solutions Sebastian accepted : 0 out of 2

02-10-2015
  • Ishmael martin created a KB record : wish list
    I don't know why it's not working

    Python Generalities
    Number of solutions Ishmael accepted : 1 out of 2

02-06-2015
  • Sebastian Raschka created a KB record : Converting large CSV files to SQLite databases using Python pandas
    I just struggled a bit converting ~10 GB CSVs to SQLite databases for more effective processing because of memory issues :P

    I came up with a short workaround using standard tools like pandas and wanted to share it in hope that it could be useful to you. I would welcome any comments and discussion about alternatives and "better" ways :)
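    Sebastian's exact workaround is not reproduced here; a sketch of the general shape, streaming the CSV in chunks so memory stays bounded (function and table names are illustrative):

```python
import sqlite3

import pandas as pd

def csv_to_sqlite(csv_path, db_path, table, chunksize=100_000):
    """Stream a large CSV into an SQLite table chunk by chunk."""
    with sqlite3.connect(db_path) as conn:
        for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
            # Replace the table on the first chunk, append afterwards.
            chunk.to_sql(table, conn,
                         if_exists="replace" if i == 0 else "append",
                         index=False)
```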

    Python Generalities
    Number of solutions Sebastian accepted : 0 out of 0

01-25-2015
  • Sebastian Raschka created a KB record : Generalizing First and Last Names
    I am heavily into sports predictions these days where I collect raw data from various different sources. One of the challenges is to merge the information based on a shared "key" (here: the name of a soccer player). I wrote a simple function that tries to generalize the names. There are probably some exceptional cases where this function might fail. I am looking forward to some ideas and suggestions to improve this function!
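    The original function is not shown in this feed; as a hypothetical sketch of one way to build such a merge key (strip accents, lowercase, keep last name plus first initial; the function name and key format are made up):

```python
import unicodedata

def generalize_name(name):
    """Hypothetical sketch: reduce a player's name to a 'lastname,f' merge key,
    so 'Zlatan Ibrahimović' and 'Z. Ibrahimovic' map to the same key."""
    # Strip accents: decompose characters, then drop the combining marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    parts = ascii_name.replace(".", " ").lower().split()
    if len(parts) == 1:
        return parts[0]
    return f"{parts[-1]},{parts[0][0]}"

print(generalize_name("Zlatan Ibrahimović"))  # ibrahimovic,z
```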

    Python Information retrieval
    Number of solutions Sebastian accepted : 1 out of 1

01-25-2015
  • Sebastian Raschka created a KB record : File search in Python
    The shell commands like `find` in Linux/Unix environments are quite powerful and useful. However, I often want to get a bunch of file paths as an iterable in Python for further processing. Here, I wrote a simple `find_files` function that performs simple substring matching to find and return a list of files that can be used as input for further downstream processing in Python.
    I am curious about your opinion.
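    A sketch of what such a function might look like; the exact signature of Sebastian's `find_files` may differ (the `check_ext` parameter here is an assumption):

```python
import os

def find_files(substring, path=".", check_ext=None):
    """Walk `path` and return file paths whose names contain `substring`,
    optionally restricted to a given extension."""
    results = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if substring in name and (check_ext is None or name.endswith(check_ext)):
                results.append(os.path.join(root, name))
    return results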

    Python Generalities
    Number of solutions Sebastian accepted : 0 out of 0

01-25-2015
  • Sebastian Raschka created a KB record : An EnsembleClassifier for scikit-learn
    Hi,
    I was recently competing in several kaggle competitions and implemented a weighted majority-rule EnsembleClassifier. The idea is to combine several classifier objects to use the majority class label or weighted probability for predictions. Although this is a very simple concept, it works quite well in practice.

    In a nutshell, it works like this:

            
        # initialize some classifiers
        clf1 = LogisticRegression()
        clf2 = RandomForestClassifier()
        clf3 = GaussianNB()

        # make an ensemble classifier
        eclf = EnsembleClassifier(clfs=[clf1, clf2, clf3], weights=[1,1,1])
    
        # evaluate performance
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))


    Here, I just want to share my code. Examples and more detailed description can be found in my blog post:
    http://sebastianraschka.com/Articles/2014_ensemble_classifier.html

    Python Machine Learning
    Number of solutions Sebastian accepted : 0 out of 0

01-23-2015
  • Radhouane Aniba created a KB record : Printing nice tables in python
    I found this package useful in case we want to display nice tables as output in some python programs, so I tried it and decided to add it to CodersCrowd, here is an example of how to use it

    Python Generalities
    Number of solutions Radhouane accepted : 0 out of 0

01-15-2015
  • Radhouane Aniba created a KB record : Nearest Flanking genes
    This is a small task that we may do frequently, given a regions list we would like, for each region, get the nearest flanking gene, although the task is trivial, it seems there is no discussion on the net to do that quickly and efficiently, so I thought it might be useful to whoever wants to do that.

    First you'll need to download RefSeq , there are a lot of options to do that, for example you can use mysql command line to dump all the records from RefSeq and coupled to some awk magic, you can manage to have RefSeq in bed format somehow, example

    chr1 1736 4272 DDX11L1 +
    chr1 4224 19233 WASH7P -
    chr1 4224 7502 LOC100288778 -
    chr1 7231 7299 MIR6859-1 -
    chr1 7231 7299 MIR6859-2 -
    chr1 7231 7299 MIR6859-3 -
    chr1 7231 7299 MIR6859-4 -
    Now that you have your two bed files you can enjoy the bedtools closestBed to get your task done in a couple of lines, you may want to wrap all your command in a shell script as well if you want to make it more general
    The code below shows the protocol

    Cheers

    Shell Information retrieval
    Number of solutions Radhouane accepted : 0 out of 0

01-11-2015
01-10-2015
  • jiannis bonatakis bonatakis created a KB record : python poetry
    '''
    The circumstances
    '''
    def bemoan():
            print('Alas!')
    our_lives_must = ['end']
    the_suffering = [True]        # Read as "the suff'ring"
    she = 'loves you'
    love = 'a string of memories'
    alone = bemoan
    alas = alone
    
    '''
    The lament
    '''
    for poetry in the_suffering:
            bemoan()
    for variables in our_lives_must:
            pass
    if None and 1 or 1 and None:
            alone()
    if love.split() or she.replace('you', ''):
            alas()
    try:
            the_suffering.escape() and love.admit()
    except:
            for one_day in our_lives_must:
                    quit()


    Python Generalities
    Number of solutions jiannis bonatakis accepted : 0 out of 2

01-06-2015
  • Radhouane Aniba created a KB record : Handy dataframe
    this is one of the snippets I have configured with my Sublime Editor, I like to have some snippets and code like that to play with the data or if I have to test something on a data structure like a dataframe

    Python Generalities
    Number of solutions Radhouane accepted : 0 out of 0

12-11-2014
  • Severin Zinöcker created a KB record : Part 2: Parse output from IMGT/V-Quest annotated Ig DNA sequences (Part 1)
    This pipeline was designed and written in Perl by Severin Zinöcker and Michael Waisberg at National Institutes of Health in 2013 and 2014.

    The purpose of this package is to correctly annotate DNA sequence from human immunoglobulin (Ig) genes using the online repository IMGT/V-Quest and retrieve information about certain parameters of the genetic sequence in a summarized form that can be used for further analysis.  Multiple Ig heavy (H), kappa (H) and lambda (L) genes can be dealt with simultaneously.

    The code package comes in 3 parts, each of which deals with a separate step in the process of annotating, parsing and summarizing human Ig sequences.  At each step, editing and re-formatting of the output is possible, but may require editing and re-writing of code. 

    Script 2 is based on Ig sequence data derived from certain cellular phenotypes (single atypical and classical memory B cells) which are automatically assigned a label in the output table, column 'phenotype' based on a system of sequence ID numbering.  This part of the code should be edited for your own purposes.

    Script 2 will extract the best match of your Ig sequence alignment with its germline sequence along with un/productive rearrangement, indels, frameshift, stop codons, as well as percentage of sequence overlap, V, D and J segments, D gene reading frame, predicted CDR3 amino acid sequence, length of FWR and CDR amino acid sequences, number of total/non-silent/silent mutations in the V gene segment as well as non-silent/silent mutations in FWR and CDR, from the output data file generated previously (Part 1) from your IMGT/V-QUEST query together with the original nucleotide sequence and its length and saves this output as a table in CSV format.


    Perl Analysis package with multiple programs
    Number of solutions Severin accepted : 0 out of 0

12-10-2014
  • Severin Zinöcker created a KB record : Part 3: Analyze statistics from parsed output results (Part 2)
    This pipeline was designed and written in Perl by Severin Zinöcker and Michael Waisberg at National Institutes of Health in 2013 and 2014.

    The purpose of this package is to correctly annotate DNA sequence from human immunoglobulin (Ig) genes using the online repository IMGT/V-Quest and retrieve information about certain parameters of the genetic sequence in a summarized form that can be used for further analysis.  Multiple Ig heavy (H), kappa (H) and lambda (L) genes can be dealt with simultaneously.

    The code package comes in 3 parts, each of which deals with a separate step in the process of annotating, parsing and summarizing human Ig sequences.  At each step, editing and re-formatting of the output is possible, but may require editing and re-writing of code. 

    Script 3 is based on Ig sequence data derived from certain cellular phenotypes (single atypical and classical memory B cells) as defined in the input table from Part 2, in column 'phenotype'.

    Script 3 will analyze the input table containing summarized data of annotated Ig sequences (Part 2) and extract statistics on V gene usage, non/productive rearrangement, CDR composition, somatic hypermutations and sequences with very long CDR3s for all sequences in the input table and generate output files in CSV format stored in the local workdirectory.


    Perl Analysis package with multiple programs
    Number of solutions Severin accepted : 0 out of 0

12-10-2014
  • Severin Zinöcker created a KB record : Part 1: Annotate raw Ig DNA sequences in FASTA format via IMGT/V-Quest online
    This pipeline was designed and written in Perl by Severin Zinöcker and Michael Waisberg at National Institutes of Health in 2013 and 2014.

    The purpose of this package is to correctly annotate DNA sequence from human immunoglobulin (Ig) genes using the online repository IMGT/V-Quest and retrieve information about certain parameters of the genetic sequence in a summarized form that can be used for further analysis.  Multiple Ig heavy (H), kappa (H) and lambda (L) genes can be dealt with simultaneously.

    The code package comes in 3 parts, each of which deals with a separate step in the process of annotating, parsing and summarizing human Ig sequences.  At each step, editing and re-formatting of the output is possible, but may require editing and re-writing of code. 

    To run this code requires having installed certain packages (please confer source code).

    Script 1 will send queries of individual (single) Ig sequence data either from single or, one by one, from multiple Ig nucleotide sequences in FASTA format to the IMGT/V-QUEST server online and save the results of each IMGT/V-QUEST annotation in one individual output file in TXT format per Ig nucleotide sequence.

    Perl Analysis package with multiple programs
    Number of solutions Severin accepted : 0 out of 0

12-03-2014
  • Ryan Johnson created a KB record : Finding Fur DNA binding sequences in H.pylori
    I am a novice programmer. I made this script to search for DNA Fur-binding sequences (ATTATTANTAATAAT or TAATAATNATTATTA), allowing for up to 2 mismatches. The program runs well and prints the results to a text file, but I'm always looking for ways in which my coding strategy could be improved. The script uses a "sliding window" technique to scan the H.pylori genome for sequences with similarity to the fur binding sequences. Let me know what you think. Thanks everyone! 

    Note: To run, you muse have a fasta file containing the H.pylori strain G27 genome in the same directory ("Hp_G27.fasta").


    Python Similarity and Homology
    Number of solutions Ryan accepted : 2 out of 2

12-02-2014
  • Radhouane Aniba created a KB record : When exit code is not enough
    There is a lot of tools out there, very useful for command line usage, and very widely used in Bioinformatics, which rapidly turns out to be annoying (may be, sometimes) if we are writing a pipeline that cares about I/O connection and each tasks' exit status.
    I am writing a pipeline using samtools, and samtools turns out to be a little bit annoying in the I/O management, because sometimes it generates an output file, but you don't really explicitly name that file. Sometimes, other tools don't even prompt for output files, or some other tools ask users to provide paths literally which adds up more turnarounds that need to be introduced and this can be a bit frustrating.
    Here is an example using samtools. I am wrapping a call to samtools on a file that does not exist, the command is failing but the exit code is still zero which is a bit misleading if we care about reporting the status of the entire pipeline, which means here it means that the sort went Ok and this will trigger other tasks, which is wrong 



    Python Generalities
    Number of solutions Radhouane accepted : 1 out of 1

12-01-2014
11-28-2014
  • Radhouane Aniba created a KB record : Decorator to check if a module is imported before running a bloc
    This can be very useful, even though we always import needed modules in the beggining, sometimes when you review a code or change it you might want to add another function that you tested before but with no prior import in the current code, so using this as a control if a given module is imported can be useful

    Python Generalities
    Number of solutions Radhouane accepted : 2 out of 2

11-14-2014
11-11-2014
  • Ninan Sajeeth Philip Sajeeth Philip created a KB record : Difference Boosting Neural Network - DBNN Code
    This is a Bayesian classifier that handles conditional dependence by splitting the feature space into pairs and approximating the cummulative probability by treating the  pairs as conditionally independent. That is, given N features that may or may not be conditionally dependent on each other, probability is estimated for a pair such that given the value of one feature, what is the probable value for the other features is. This is done for all combination of feature pairs and the cummulative probability is taken as the product of the pair-wise probabilities. It works well on most problems.

    The code also redefines prior as a derived knowledge about the importance that is to be given to each feature based on the experience it has in making the correct decisions.  As an example, assume that the objective is to differentiate monkeys from humans. Since both has amny similar features, the highest priority in this case should be given to the differences, rather than the similarities. So the code boosts the differences between the features of the objects in two given classes until they fall apart as desired. That is why it is called a difference boosting network. This knowledge about the difference goes as prior when predictions are to be made.

    The original code was developed in the late 90s and has been improved several times. As a result, there are many additions and deletions. But it works and can be further improved. I post it here expecting the community to play with it and if possible, port it to other platforms like Python.
    Cheers,
    -Sajeeth
    http://www.iucaa.ernet.in/~nspp

    C++ Machine Learning
    Number of solutions Ninan Sajeeth Philip accepted : 0 out of 0

11-04-2014
  • Amir Zadeh created a KB record : Visualize Flagstat Files
    Visualizing the statistics reported in flagstat files

    Python Visualization tools
    Number of solutions Amir accepted : 0 out of 1

10-24-2014
  • Radhouane Aniba created a KB record : Get coverage from a Bam file at regions from a file
    This is a quite usual task we always run through, we have a target (region) list with location and we want to get the coverage from a bam file on these regions
    So this is how I do in python
    dependencies
    pybedtools
    bedtools
    numpy


    Python Next Generation Sequencing
    Number of solutions Radhouane accepted : 0 out of 1

09-26-2014
09-26-2014
08-06-2014
  • vamshidhar gangu created a KB record : Handling fastq files in perl
    Reading and Writing fastq compressed files is done frequently in NGS analysis. 
    The below code helps to handle both the fastq compressed/uncompressed files at one go!


    Perl Generalities
    Number of solutions vamshidhar accepted : 0 out of 0

07-29-2014
  • Radhouane Aniba created a KB record : Extract average sequence length from a fastq file
    In this code I am parsing a fastq file and I am extracting all sequences length to an array, that I feed to numpy and then got the average sequence length as a result, this might be useful prior to any analysis involving fastq files, I had to wrap this up because I am using thresholds with bowtie2 as for the reads length to be considered for the alignment. It can be useful to have an idea on the average length to adjust that threshold

    Rad

    usage : python read_average_length.py file.fastq  --minq 30

    Python Next Generation Sequencing
    Number of solutions Radhouane accepted : 0 out of 0

06-24-2014
  • Radhouane Aniba created a KB record : Extract mapped and unmapped reads from a bamfile
    One of the usual tasks when it comes to examine an alignment file is to check the number of reads that successfully mapped to the reference genome and those who failed. Here is a small python function that uses samtools to do that and I am sure that there is a lot of other methods to do that (picard may be ?)
    If you have other alternative you can rewrite the code and use the same bam file that I am providing in the code

    Credits : Matt Shirly https://gist.github.com/mdshw5/73c6591237c7a9f88518

    Python Next Generation Sequencing
    Number of solutions Radhouane accepted : 0 out of 0

05-26-2014
  • Ahmed S. Chakroun created a KB record : Read count within a genomic window
    Hi,

    Pseudo usage:
    python code.py INPUT_BAM="infile.bam" WINDOW_SIZE="integer" MAPPING_QUAL_THRESHOLD="integer"

    The purpose of code.py would be the reporting, in a "chromosome.tsv" file, of the read count (with MQ>=threshold) into the non-overlapping window of the given size, all along each chromosome of the infile.bam, based on their starting position. For instance, given the CHR01 and for a window size of 100, the CHR01.tsv file will include the number of all the reads mapped with a starting position between 0 and 100 pb and having a MQ >= threshold, then the nubmer of all the reads mapped with a starting position between 101 and 200 pb and so on . Each "chromosome.tsv" output file would look like this :

    First_Pos    Last_Pos    Reads
    1            100        150
    101          200         250
    201          301         159
    ...
    Cheers.

    Python Next Generation Sequencing
    Number of solutions Ahmed accepted : 1 out of 1

05-21-2014
  • Radhouane Aniba created a KB record : Fastest way to intersect dataframes in R ?
    I came through this approach to intersect dataframes in R, I am sure there is a lot of other way of doing this but this solved my problem today :), if you have any other ideas feel free to hack this code

    R Statistics
    Number of solutions Radhouane accepted : 0 out of 0

05-16-2014
  • Radhouane Aniba created a KB record : Easily spot information from a large output in python
    I am developing a pipeline, and the program is outputting a lot of information in stderr that I didnt want to store in log file, I just need one or two information from the entire output so I was wondering how to spot what I need in the large output message displayed in the terminal
    So I thought it could be great to color or tag these information programmatically and I found termcolor that it solved my problem.

    So here is an example of how this work (It will not display here but this is how this looks like in your terminal)


    Python Analysis Tools
    Number of solutions Radhouane accepted : 0 out of 0

05-15-2014
05-15-2014
  • Radhouane Aniba created a KB record : Testing Ruffus pipelines
    I added Ruffus as a runnable instalnce at CodersCrowd in order to showcase all good things that can be built with this library, and also to solve problems :)

    This is a simple code to make sure everything is working great

    Let the code flow :)

    Python ToolKits/Programming
    Number of solutions Radhouane accepted : 0 out of 0

05-15-2014
  • Radhouane Aniba created a KB record : Timing a code in python
    Today I wanted to monitor a code block in python and I was interested in knowing basic information on elapsed time per function. A lot of solutions exist out there to do that kind of tasks for benchmarking or other purposes but the library `time` was enough for me to do what I wanted

    Basically, for a simplistic usage, we set a  starting anchor, and an end anchor and then we simply calculate the difference (elapsed time)

    Here is a simple code illustrating this.

    Python Analysis Tools
    Number of solutions Radhouane accepted : 1 out of 1

05-14-2014
  • Radhouane Aniba created a KB record : Extract ggplot2 color information in a vector
    I am using the default color palette from ggplot2 and would like to be abe to extract the colors mapped to my variablemy function is like follows :
    p = ggplot(varf2, aes(factor(variable), value)) +
        geom_boxplot(aes(fill = factor(variable))) +
        facet_grid(sID ~ .) +
        scale_x_discrete(limits=ordre$V1) +
        geom_text(aes(y = 0.8, label = gsub("[;,]","\n",mix), size=2))
    my varf2 looks like this one :
      variable            sID       value V2    V3                  mix          V5
    1   CM0001 SP_SA036,SA040 0.492537313  1 A21Y0 hTERT=1.0;HCT116=0.0 Cell Number
    2   CM0001 SP_SA036,SA040 0.479564033  1 A21Y0 hTERT=1.0;HCT116=0.0 Cell Number
    3   CM0001 SP_SA036,SA040 0.559139785  1 A21Y0 hTERT=1.0;HCT116=0.0 Cell Number
    I would like to be able to get a dataframe mapping my variable IDs to the colors used, is there a way to extract the default palette colors used for factor(varf2$variable)Thanks

    R Visualization tools
    Number of solutions Radhouane accepted : 2 out of 2

05-07-2014
  • Andrew McPherson created a KB record : Read a fasta file
    Function to read a fasta file.  Yields id, sequence pairs.  Its a generator so sequences can be read into memory and discarded, or stored in a dictionary.

    Python Generalities
    Number of solutions Andrew accepted : 0 out of 0

04-24-2014
  • Radhouane Aniba created a KB record : Using bedtools to get difference in features between overlapping positions
    Here is an example of something I wanted to be implemented directly in intersectBed that can sometimes be useful. It would be great to adda flavor of groupBy into interesectBed to achieve a substraction and here is an example of what I mean, see comments in the code

    Shell Information retrieval
    Number of solutions Radhouane accepted : 1 out of 1

04-17-2014
04-15-2014
  • Radhouane Aniba created a KB record : Order a dataframe according to a target vector
    Here is something I do frequently, I just thought it might be useful to share. Sometimes I do create dataframe and do some graphic output using ggplot for example, and then collaborators want  a specific order on the plots for the axis, depending on an order list they provide, not that hard to do with R :) 

    R Generalities
    Number of solutions Radhouane accepted : 0 out of 0

04-14-2014
04-10-2014
  • Radhouane Aniba created a KB record : RangedData object : changing chromosome names
    Hello all,

    I am using a pachage from Bioconductor TEQC, and when I load my reads as a bam file, the object created (class RangedData) looks like this

    RangedData with 6 rows and 1 value column across 84 spaces
         space             ranges |                                           ID
      <factor>          <IRanges> |                                  <character>
    1        1 [8404005, 8404254] | M00897:53:000000000-A4HU0:1:1101:19250:16725
    2        1 [8404005, 8404254] | M00897:53:000000000-A4HU0:1:1101:11505:18314
    3        1 [8404006, 8404255] |  M00897:53:000000000-A4HU0:1:1101:7157:14416
    4        1 [8404006, 8404255] |  M00897:53:000000000-A4HU0:1:1102:19113:7115

    which creates a space column with chr IDs like (1) instead of chr1
     I am not familiar with this class yet, so I was wondering how to switch the IDs by appending "chr" on them, because It creates a lot of problems in the plotting part of the analysis

    R Gene set enrichment
    Number of solutions Radhouane accepted : 1 out of 2

04-07-2014
  • Radhouane Aniba created a KB record : Extract columns from a list of files
    Here is a one useful awk line to extract a column from a list of files and create a tabular file out of them.
    Very easy and very fast

    Rad

    Shell Information retrieval
    Number of solutions Radhouane accepted : 0 out of 0

04-03-2014
  • Radhouane Aniba created a KB record : Cut adaptors, trim sequence from fastq files
    This is a code to get some sanity check on fastq files. The code contains routines to check for adaptors using Biopython.
    The code is developed by Brad Chapman and can be found on Github 

    Pros : python (ideal for my the task I would like to implement into a pipeline)
    Cons : depends on Biopython ( I still need to test it on cluster to see if that takes a lot of time )

    According to Brad Chapman, this code is working but not probably scalable



    Python Next Generation Sequencing
    Number of solutions Radhouane accepted : 1 out of 1

04-01-2014
  • Radhouane Aniba created a KB record : Extract the last 25 bp from each read in fastq and output fasta
    This is an example of a task I had to do and I opted for using awk. I had to do that for custom local alignment of these end of reads. The script simulates a fastq file
    You can create an alias with the awk command and name it like "fq_to_25bp_fa"

    Shell Next Generation Sequencing
    Number of solutions Radhouane accepted : 0 out of 0

04-01-2014
03-31-2014
  • Radhouane Aniba created a KB record : Running bedtools from python
    As any process that you would run on python, I used to use subprocess on my python codes to run bedtools utilities such as intersectBed, until I met pybedtools, that I need to work with a little bit before adding it to the virtual machine. It looks interesting, but would be interested in knowing how fast it is, in comparison to the subprocess method.

    I am listing this simple snippet as an example.


    Python Next Generation Sequencing
    Number of solutions Radhouane accepted : 0 out of 0

03-31-2014
  • Duarte Molha created a KB record : Script to overcome common bash utilities pitfalls in matching and joining text files
    As I am sure many of you, I use extensively join, sort and grep programs in the shell.

    However, although very fast, these programs do have some limitations that make what should be a relatively simple task much more cumbersome

    Join:
    1. only joins files sorted in the joining fields
    2. default delimiter is space and to change it you have to defined them as -t $'\t'  >> WTF!
    3. joined field is always output as the 1st column

    grep:

    one of the most useful things I do with grep in in fact a matching join-type operation:

    grep -F -f file_with_things_to_search.txt File_to_search_those_things.txt

    However, I am not sure if it is my computer configuration or something else, but this command always does partial matching. What I mean by this is
    that if my query line is "test", the command line above will consider a line containing the word "testing" a hit.
    Maybe I am just not very good, but according to the man pages of grep this should not occur!

    Tired of trying to understand it, I decided to write my own utility to do joins and fixed-string-style matches...

    So here it is. Do with it what you will, but please let me know of any improvements you make to the code so that I can also benefit :)

    I tried to explain all the different options in the usageMessage, but if something isn't very clear please let me know... or better yet, improve it
    yourself :)

    Cheers everyone!
    Duarte
    1. Edited the code... grep mode is now much faster and less resource-intensive, since it only keeps the search key from the 1st file (the rest of the data from that file is not required in grep mode)
    2. Added a field-separator checking mechanism, so you can pass the separator as -ofs "\t" and have it correctly interpolated as the tab character... no longer do we have to force interpolation with -ofs $'\t' as in join :)
    3. Added the ability to select which output columns from each file you want (parameters -afr and -bfr; e.g. to match on field 4 and output the 1st, 2nd, 3rd and 4th fields from file 1 and the 5th field from file 2: match_records.pl -a file1.txt -b file2.txt -1 4 -2 4 -j -afr 1,2,3,4 -bfr 5)
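    Two side notes on the grep issue above: grep's -x flag forces whole-line matches (grep -Fxf queries.txt file.txt), which may be all that is needed. And the hash-join idea behind match_records.pl can be sketched in Python like this (function and parameter names are mine, not Duarte's; unlike join, the inputs need not be sorted):

```python
def exact_join(a_rows, b_rows, a_field, b_field, sep="\t"):
    """Hash join on exact field values; yields the combined field lists."""
    index = {}
    for line in b_rows:
        fields = line.rstrip("\n").split(sep)
        index.setdefault(fields[b_field], []).append(fields)
    for line in a_rows:
        fields = line.rstrip("\n").split(sep)
        for match in index.get(fields[a_field], []):  # exact match only
            yield fields + match
```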

    Perl Generalities
    Number of solutions Duarte accepted : 0 out of 0

03-21-2014
  • Radhouane Aniba created a KB record : Alignment in Python
    Here is an implementation (one of many on the web) of local and global alignment in Python. I am posting this code with errors introduced intentionally, to demo one cool feature that is coming soon at CodersCrowd.
    I am keeping the code open for now; if you find the bug feel free to submit a solution :) I will post the correct code in a couple of days, with the new release of CodersCrowd

    I will add the reference to the code later
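    Not the record's (intentionally buggy) code, but for reference, a minimal score-only Needleman-Wunsch global alignment sketch in Python; the scoring values are illustrative defaults, not those of the original:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    # prev holds the previous DP row; row 0 is all-gap alignment of b's prefix
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * gap]  # column 0: all-gap alignment of a's prefix
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```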

    Python Similarity and Homology
    Number of solutions Radhouane accepted : 1 out of 2

02-10-2014
12-29-2013
  • Djekidel Mohamed Nadhir created a KB record : Downloading Chip-Seq data from ChipBase database by cell-line
    ChIPBase contains already-processed ChIP-seq peaks, so sometimes one wants to use its results directly without going through the peak-calling and pre-processing steps.
    This script can be used to download the ChIP-seq peaks for a given cell line from the ChIPBase database. As far as I know, you otherwise need to do this manually for each file, which is time-consuming.
    You can run the script as follows:

    perl parseChipBase.pl -cl <cell_line>

    I hope it will be helpful to some people

    Perl Chip-Seq
    Number of solutions Djekidel accepted : 0 out of 0

12-12-2013
12-10-2013
  • Tim Kacprowski created a KB record : Convert csv files to xlsx files
    This is a small python script to convert csv files into xlsx files.
    It is nothing fancy. Nevertheless, it comes in handy when you are working with people who regularly want you to hand over Excel files while you prefer to work with csv files.
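    The original script's body isn't reproduced here; a hedged sketch of the same idea, assuming the third-party openpyxl package (which may or may not be what the original uses):

```python
import csv
from openpyxl import Workbook  # third-party: pip install openpyxl

def csv_to_xlsx(csv_path, xlsx_path, delimiter=","):
    """Copy every row of a CSV file into a fresh XLSX workbook."""
    wb = Workbook()
    ws = wb.active
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            ws.append(row)  # one CSV row becomes one worksheet row
    wb.save(xlsx_path)
```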

    Python Generalities
    Number of solutions Tim accepted : 0 out of 0

12-09-2013
  • Radhouane Aniba created a KB record : Gene list KEGG enrichment using GOstats
    A script for gene-set enrichment over KEGG. The Perl script automates the analysis of several gene-set files inside a directory; the R script does the analysis itself using the Bioconductor GOstats package (a Perl script was added for bulk analysis). The fdrtool package is used for q-value calculations.

    R Gene set enrichment
    Number of solutions Radhouane accepted : 0 out of 0

12-04-2013
12-03-2013
11-20-2013
  • Jack Bawer created a KB record : Extract SNP flanking region
    I am trying to extract the flanking regions of a list of SNPs. I implemented my code in R using Bioconductor, and even though the process seems straightforward, I am having some problems getting my code to work the right way. I installed the needed library with biocLite("BSgenome.Hsapiens.UCSC.hg18"), but I get errors about getting the right coordinates from hg19.
    My input file has this format :

    rs10012775 chr4 122896328 0.45819 T C 0.358

    I read each line, get the chromosome name and the position, and specify an offset in the script to go upstream and downstream of the position.
    But I still have some bugs.
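    The sequence retrieval itself happens on the R/BSgenome side, but the coordinate arithmetic for the flanking window can be sketched in Python (function name and default offset are mine, based on the input format shown above):

```python
def snp_flanks(line, offset=50):
    """Parse 'rsID chrom pos ...' and return (chrom, start, end) of the flank window."""
    fields = line.split()
    chrom, pos = fields[1], int(fields[2])
    # clamp the upstream edge so we never go below position 1
    return chrom, max(1, pos - offset), pos + offset
```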

    R Generalities
    Number of solutions Jack accepted : 1 out of 2

11-01-2013
  • Matt Shirley created a KB record : Parse CIGAR string
    CIGAR strings are a mandatory field in the Sequence Alignment/Map (SAM) specification. This field describes the relationship between the alignment's sequence coordinates and the genomic coordinates. This function parses matches, deletions, and insertions from a CIGAR string in a SAM-format alignment. It returns a tuple of key, value pairs describing the genomic position offset as well as the alignment start and end for all ungapped alignments represented in the CIGAR string.
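    Matt's original function isn't reproduced here; a minimal sketch of the same parsing idea (names are mine):

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (operation, length) pairs."""
    return [(op, int(n)) for n, op in CIGAR_OP.findall(cigar)]

def reference_span(cigar):
    """Number of reference bases consumed (M, D, N, = and X operations)."""
    return sum(n for op, n in parse_cigar(cigar) if op in "MDN=X")
```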

    Python Next Generation Sequencing
    Number of solutions Matt accepted : 0 out of 0

09-04-2013
  • Radhouane Aniba created a KB record : Extract bigWig data for a bed file in Bulk
    This script calls the program bigWigSummary to extract enrichment values from a set of bigWig files (for example, many ChIP-seq experiments) for a set of intervals stored in a bed file. You need two inputs to run the script: a bed file containing the regions to be mined, and a directory containing all the bigWig files. The output will be a tabular file containing:
    - Chr
    - Start
    - End 
    - ID (if your file contains one)
    - BigWig1 value
    - BigWig 2 value
    - etc ...
    What this script does is collect all the bigWig files and, for each interval in the bed file, use the UCSC tool bigWigSummary to get the average enrichment.

    I will leave this code open for a while (it is functional), but if you want to add a version of your own, feel free.
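    The record's script is Perl; the same loop can be sketched in Python, assuming the standard bigWigSummary invocation (file chrom start end dataPoints). Function names are mine:

```python
import subprocess

def bed_rows(bed_lines):
    """Yield (chrom, start, end, id) tuples from bed-format lines."""
    for line in bed_lines:
        if line.strip():
            f = line.split()
            yield f[0], int(f[1]), int(f[2]), (f[3] if len(f) > 3 else "")

def summary_cmd(bigwig, chrom, start, end, data_points=1):
    """Command line for UCSC bigWigSummary: file chrom start end dataPoints."""
    return ["bigWigSummary", bigwig, chrom, str(start), str(end), str(data_points)]

def mean_signal(bigwig, chrom, start, end):
    """Run bigWigSummary and return the mean value (tool must be on PATH)."""
    out = subprocess.run(summary_cmd(bigwig, chrom, start, end),
                         capture_output=True, text=True, check=True).stdout
    return float(out.strip())
```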



    Perl Next Generation Sequencing
    Number of solutions Radhouane accepted : 1 out of 1

08-09-2013
  • Radhouane Aniba created a KB record : Parse Bowtie output
    This program reads a bowtie alignment file and writes an output file with information about the chromosome, the position, the strand of the alignment, the GC content, and the number of mismatches for the reads that align uniquely to the reference yeast genome.
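    A hedged sketch of one such record parser, assuming bowtie's legacy default tab-delimited output (read name, strand, reference, 0-based offset, sequence, qualities, extra-alignment count, mismatch descriptors); the field order is my assumption, not taken from the original script:

```python
def parse_bowtie_line(line):
    """One record from bowtie's legacy tab-delimited output (assumed field order)."""
    f = line.rstrip("\n").split("\t")
    name, strand, chrom, pos, seq = f[0], f[1], f[2], int(f[3]), f[4]
    # last column lists comma-separated mismatch descriptors, may be absent/empty
    mismatches = f[7].split(",") if len(f) > 7 and f[7] else []
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return {"name": name, "chrom": chrom, "pos": pos, "strand": strand,
            "gc": gc, "n_mismatches": len(mismatches)}
```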

    Python Next Generation Sequencing
    Number of solutions Radhouane accepted : 0 out of 0

07-29-2013
  • Alex Frieden created a KB record : Groovy HTTP REST Cypher Query on Neo4j
    Groovy code to run a Cypher graph query on a Neo4j instance. It uses the server's REST endpoint, so no Neo4j client libraries are needed and the code is independent of the Neo4j version.

    Java Information retrieval
    Number of solutions Alex accepted : 0 out of 0

07-26-2013
  • Jack Bawer created a KB record : Using sed to remove newline from file
    Hello 
    I am trying to use sed to reformat a file, specifically to remove newlines (\n) and replace them with spaces, joining each line onto the previous one.
    My original file has this format:

    title1
    _1 texthere
    title2
    _2 texthere

    See my code in the Source code below.
    It is not working for some reason
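    For reference, one GNU sed idiom for this is sed ':a;N;$!ba;s/\n_/ _/g' (slurp the file, then replace newline-before-underscore with a space; behavior may differ across sed implementations). The same transformation is easy to check in Python (function name is mine):

```python
def join_underscore_lines(lines):
    """Append each line starting with '_' to the previous line, space-separated."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("_") and out:
            out[-1] += " " + line  # continuation: glue onto previous line
        else:
            out.append(line)       # a title line starts a new output line
    return out
```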


    Shell Information retrieval
    Number of solutions Jack accepted : 1 out of 1

07-24-2013
  • Radhouane Aniba created a KB record : Extract sequence by ID
    Obtaining the actual sequence of a gene from NCBI is simple in a browser, but not so much when you want to do it in batch. This script obtains the nucleotide sequences, mRNAs, CDSs and protein sequences associated with a list of gene IDs.
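    The original script isn't shown here; one common route for this kind of batch fetch is NCBI's E-utilities efetch endpoint (the original may use Biopython or something else entirely). A sketch that only builds the request URL:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(ids, db="nuccore", rettype="fasta"):
    """Build an NCBI E-utilities efetch URL for a batch of record IDs."""
    return EUTILS + "?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"})
```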

    Python Information retrieval
    Number of solutions Radhouane accepted : 0 out of 0

07-24-2013
07-24-2013
  • Radhouane Aniba created a KB record : Multifasta Parser in C#
    Multifasta Parser is a new parser for fasta files; basically, it allows you to extract individual fasta sequences from a multifasta file.
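    The record's parser is in C#; since that source isn't reproduced here, a minimal sketch of the same idea in Python:

```python
def parse_multifasta(text):
    """Map each header (without '>') to its concatenated sequence."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records
```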

    C# Information retrieval
    Number of solutions Radhouane accepted : 0 out of 0

07-24-2013
07-24-2013
07-24-2013
  • Radhouane Aniba created a KB record : Parsing Genbank report using regexp.
    A homemade GenBank report parser using regular expressions. Once the desired data is captured, it can be printed in any format.
    WI Bioinformatics course - Feb 2002 - Lecture 6
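    The course script itself is Perl and isn't reproduced here; an illustrative Python sketch of the regexp idea, which only captures the first line of each field (multi-line fields like DEFINITION would need continuation handling):

```python
import re

def genbank_fields(record):
    """Pull the first line of a few GenBank flat-file fields (illustrative only)."""
    out = {}
    for key in ("LOCUS", "DEFINITION", "ACCESSION", "ORGANISM"):
        m = re.search(rf"^\s*{key}\s+(.+)$", record, re.MULTILINE)
        if m:
            out[key] = m.group(1).strip()
    return out
```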

    Perl Information retrieval
    Number of solutions Radhouane accepted : 0 out of 0

07-24-2013
  • Radhouane Aniba created a KB record : Computing Pearson Correlation
    Correlation is one of the most common tasks in computational biology. While packages like R make it very simple, sometimes we need to compute it on the fly in other languages; here is a good example of doing it in Perl.

    Script written by Didier Gonze
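    The Perl script isn't reproduced here; the same computation in Python, from the textbook definition (cov(x, y) divided by the product of the standard deviations):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```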

    Perl Statistics
    Number of solutions Radhouane accepted : 0 out of 0

07-24-2013
  • Radhouane Aniba created a KB record : Parse Blast Output in Perl
    This is one of many methods to parse BLAST output, written in Perl. See the comments in the script for usage.
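    The Perl parser isn't shown here, and it likely targets the classic text report; for comparison, parsing BLAST's tabular output (-outfmt 6, whose 12-column layout is fixed) is much simpler. A Python sketch (the sample line in the test is made up for illustration):

```python
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")

def parse_blast_tab(lines):
    """Parse BLAST tabular output (-outfmt 6) into dicts, numeric where it matters."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comment lines
        row = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        yield row
```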

    Perl Similarity and Homology
    Number of solutions Radhouane accepted : 0 out of 0