Seed oversampling

From Dynamo

Seed oversampling is a technique used with models that parametrise the overall region where the actual particles are located, such as membranes or pseudo-crystals. The crop_points generated by such models after the geometric computation stage are not expected to be realistically centered on a copy of the macromolecule of interest.

One possible procedure is to define a distribution of boxes denser than the expected distribution of physical particles. This way, after particle extraction you will end up with more subtomograms than actual particles, but you make sure that every particle is actually included in some subtomogram. This might require some planning ahead, playing with the relationship between the expected physical distance between particles, the sampling distance that you impose through the model, the physical size of the particle, and the sidelength that you request for particle cropping.
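As a rough illustration of this planning step (the functions below are hypothetical rules of thumb, not part of Dynamo): on a 2D surface such as a membrane, the ratio of seeds to physical particles grows with the square of the spacing ratio, and the cropping sidelength has to absorb the worst-case offset between a seed and the particle it lands on.

```python
# Illustrative back-of-the-envelope checks (not Dynamo functions).

def oversampling_factor(particle_spacing, seed_spacing):
    """Approximate ratio of seeds to physical particles on a surface.

    Both spacings are in the same unit (e.g. pixels); on a 2D surface
    the density scales with the inverse square of the spacing.
    """
    return (particle_spacing / seed_spacing) ** 2

def minimal_sidelength(particle_size, seed_spacing):
    """Rough lower bound on the cropping sidelength (in pixels):
    a particle can sit up to about one seed spacing away from the
    nearest seed, so the box must be larger than the particle itself.
    """
    return particle_size + seed_spacing

# Particles expected every ~20 px, seeds placed every ~10 px:
# roughly 4 seeds per physical particle.
print(oversampling_factor(20.0, 10.0))  # → 4.0

# A 30 px particle cropped from seeds placed every 10 px
# needs a sidelength of at least ~40 px.
print(minimal_sidelength(30, 10))  # → 40
```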

Oversampling in an alignment project

When you crop your particles using seed oversampling, you should avoid carrying all the "extra" subtomograms through the whole iterative procedure.

You can run a couple of alignment iterations to allow the subtomograms to converge onto positions with actual particles, and then eliminate repetitions (we call this procedure trimming). This can be done manually or through the alignment project.

Oversampling trimming through the alignment project

The parameter separation_in_tomogram can be tuned (in pixels) at the end of each iteration to eliminate particles that are closer to each other than a given threshold. The particle with the highest correlation will be kept, and any other particle within that radius will be eliminated.
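The trimming logic can be sketched as follows (an illustrative re-implementation, not Dynamo's actual code): candidates are visited from best to worst correlation, and a candidate survives only if it is farther than the threshold from every particle already kept.

```python
# Sketch of the distance-based trimming described above.

import math

def trim_by_separation(particles, threshold):
    """particles: list of dicts with 'position' (x, y, z) and 'cc'.

    Returns the surviving particles, highest correlation first.
    """
    kept = []
    # Visit candidates from best to worst correlation.
    for p in sorted(particles, key=lambda p: p['cc'], reverse=True):
        if all(math.dist(p['position'], q['position']) >= threshold
               for q in kept):
            kept.append(p)
    return kept

seeds = [
    {'position': (0, 0, 0), 'cc': 0.9},
    {'position': (2, 0, 0), 'cc': 0.7},   # within 4 px of the best seed
    {'position': (10, 0, 0), 'cc': 0.8},
]
print([p['cc'] for p in trim_by_separation(seeds, threshold=4)])  # → [0.9, 0.8]
```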

Manual oversampling trimming

To get a closer view of how the alignment handles the particles, it is advisable to run a few iterations with the full oversampled data set, then explore the refined table visually and trim it manually, i.e., impose a separation threshold on the refined table by hand, and then initiate a new alignment project based on the resulting trimmed table.

Oversampling strategies

Imagine this typical situation: you have many viruses already defined on many tomograms (each one as a surface), and you are trying to average glycoproteins from the virus surface. So you create an oversampled data set of several thousand subtomograms (which are probably especially large precisely because you are oversampling!), try to average all of them together, and then classify to get rid of the seeds that don't contain particles... This approach may or may not work in one shot, and each iteration cycle can consume a lot of computation time.

So it is frequently a good idea to spend some time figuring out the most appropriate way to analyze your own data set: which sidelength will work for the particles, and which sampling parameters are sensible.

You can consider a single membrane and test a multireference project there. You don't need to rely on blind alignment: you can create geometrical models for the membrane alone and for the membrane with an additional mass in the center, protruding from the membrane. Then you can create a single multireference alignment project using these two shapes as initial templates. If the density of the proteins on the membrane is high enough, a single alignment project followed by a PCA might actually suffice to locate the seeds containing actual protein.

You might find the classes dpktomo.examples.motiveTypes.Membrane and dpktomo.examples.motiveTypes.MembraneWithRod useful. Notice that when using this approach based on geometrical shapes, it is advisable to impose a strong low-pass filter on the generated templates.
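As a toy stand-in for these two geometric templates (this is an assumed construction for illustration, not the Dynamo Membrane / MembraneWithRod classes), one can build a flat membrane slab and the same slab with a rod of extra mass protruding from its centre:

```python
# Toy construction of the two geometric templates described above.

import numpy as np

def membrane_template(sidelength, slab_thickness):
    """A flat density slab normal to the z axis."""
    vol = np.zeros((sidelength,) * 3)
    mid = sidelength // 2
    half = slab_thickness // 2
    vol[:, :, mid - half:mid + half + 1] = 1.0
    return vol

def membrane_with_rod_template(sidelength, slab_thickness, rod_radius, rod_length):
    """The same slab plus a cylindrical rod protruding along +z."""
    vol = membrane_template(sidelength, slab_thickness)
    mid = sidelength // 2
    x, y = np.ogrid[:sidelength, :sidelength]
    disk = (x - mid) ** 2 + (y - mid) ** 2 <= rod_radius ** 2
    vol[disk, mid:mid + rod_length] = 1.0
    return vol

m = membrane_template(32, 5)
r = membrane_with_rod_template(32, 5, rod_radius=3, rod_length=10)
print(r.sum() > m.sum())  # → True: the rod adds mass outside the slab
```

A strong low-pass on such binary volumes (as recommended above) smooths the artificial hard edges before they are used as alignment templates.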

Command tools

The function that "filters" the refined table created by an iteration is dpktbl.exclusionPerVolume. You can use it to analyze the results of an iteration manually.

Note that this function has not been optimized and can take some time for more than 5k particles. Let us know if that is bottlenecking your computation.

function [newTable,o] = dpktbl.exclusionPerVolume(table,distanceThreshold,columnVolume,columnCC)

Unless you are doing something non-standard, columnVolume should be 20 and columnCC should be 10.
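As an illustration of what exclusionPerVolume does (a sketch, not Dynamo's implementation): rows are grouped by the tomogram column, and the greedy distance trimming is applied independently inside each group, keeping the highest-correlation row. The coordinate columns are left as a parameter, and the compact 5-column layout in the example below is made up for the demo.

```python
# Illustrative per-volume exclusion, mirroring the behaviour described
# above. Column numbers are 1-indexed, as in the wiki text.

import math
from collections import defaultdict

def exclusion_per_volume(rows, threshold, col_volume, col_cc, col_xyz):
    """rows: list of table rows.

    col_volume / col_cc follow the convention above (20 and 10 in a
    standard Dynamo table); col_xyz gives the three coordinate columns.
    """
    by_volume = defaultdict(list)
    for row in rows:
        by_volume[row[col_volume - 1]].append(row)

    kept = []
    for group in by_volume.values():
        survivors = []
        # Best correlation first, inside this tomogram only.
        for row in sorted(group, key=lambda r: r[col_cc - 1], reverse=True):
            pos = [row[c - 1] for c in col_xyz]
            if all(math.dist(pos, [s[c - 1] for c in col_xyz]) >= threshold
                   for s in survivors):
                survivors.append(row)
        kept.extend(survivors)
    return kept

# Toy layout: columns 1-3 = x, y, z; column 4 = cc; column 5 = tomogram.
rows = [
    [0, 0, 0, 0.9, 1],
    [2, 0, 0, 0.7, 1],  # too close to the first row, lower cc: removed
    [2, 0, 0, 0.8, 2],  # same spot, but in a different tomogram: kept
]
trimmed = exclusion_per_volume(rows, threshold=4,
                               col_volume=5, col_cc=4, col_xyz=(1, 2, 3))
print(len(trimmed))  # → 2
```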