Thrust error on min_element or max_element

Problem

I was compiling some old CUDA code with a recent version of the CUDA SDK. I got these errors on Thrust methods:

error: namespace thrust has no member max_element
error: namespace thrust has no member min_element

Solution

In recent versions of the CUDA SDK, these Thrust methods have been moved to the thrust/extrema.h header file. The errors go away once you include that header.
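
For example, a minimal sketch that uses these methods after including the header:

#include <thrust/device_vector.h>
#include <thrust/extrema.h>

thrust::device_vector<int> vec;
// ... fill vec ...

// These calls now resolve correctly
thrust::device_vector<int>::iterator minIt = thrust::min_element(vec.begin(), vec.end());
thrust::device_vector<int>::iterator maxIt = thrust::max_element(vec.begin(), vec.end());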

Tried with: CUDA SDK 7.5 and Ubuntu 14.04

How to generate random numbers in Thrust

Thrust has support for many random number generator engines and distributions. For example, the code below generates floating point numbers from a uniform_real_distribution using the default_random_engine. By default, the numbers generated by the uniform real distribution are in the range [0.0, 1.0).
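
A minimal sketch of such code:

#include <thrust/random.h>
#include <iostream>

int main()
{
    // The engine produces raw random numbers; the distribution maps them to a range
    thrust::default_random_engine rng;
    thrust::uniform_real_distribution<float> dist; // Default range is [0.0f, 1.0f)

    for (int i = 0; i < 10; ++i)
        std::cout << dist(rng) << std::endl;

    return 0;
}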

Tried with: CUDA 5.5

execution_policy.h missing error on CUDA compilation

Problem

When I tried to compile CUDA code downloaded from the web, I got a missing file error on <thrust/system/cuda/execution_policy.h>.

Solution

Thrust execution policies were introduced in Thrust 1.7. CUDA 5 and older versions ship with an older Thrust that does not have them. Update to CUDA 5.5, or update to the most recent version of Thrust from Github.
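
For example, with Thrust 1.7 or newer, a minimal sketch that uses the thrust::cuda::par policy provided by this header:

#include <thrust/system/cuda/execution_policy.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>

thrust::device_vector<int> vec(100);

// Explicitly ask for the algorithm to run on the CUDA device backend
thrust::sort(thrust::cuda::par, vec.begin(), vec.end());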

Tried with: CUDA 5 and Ubuntu 12.04 LTS

Unload of CUDA runtime error with Thrust

Problem

Your CUDA program that uses the Thrust library executes correctly, but throws this error at the end:

terminate called after throwing an instance of 'thrust::system::system_error'
what():  unload of CUDA runtime failed

Solution

This cryptic error typically occurs if you are using a Thrust device_vector in your program. I found that it occurs when a device_vector is destroyed after the CUDA runtime has been unloaded.

For example, this can happen if you have declared a vector as a global variable. At the end of the program, the CUDA runtime is unloaded. If the destructor of the vector runs after this, it cannot fully destroy the vector, since it needs the CUDA runtime to free the device memory. Moving these global vectors inside functions or classes should fix the error.
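
A minimal sketch of the problematic pattern and its fix, assuming a simple program structure:

#include <thrust/device_vector.h>

// Problematic: the destructor of a global device_vector runs after main()
// returns, when the CUDA runtime may already have been unloaded
// thrust::device_vector<int> gVec(100);

int main()
{
    // Fix: keep the vector in function scope, so its destructor runs
    // while the CUDA runtime is still loaded
    thrust::device_vector<int> vec(100);

    return 0;
}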

Tried with: CUDA 5 and Ubuntu 12.04 LTS

Thrust: Compact Multiple Vectors Using Predicate

From this earlier post, we can see that the zip_iterator makes it really easy to compact multiple vectors (of the same size) based on the duplicate values in one of those vectors.

Another scenario that arises frequently is the need to compact multiple vectors (of the same size) based on testing the value in one of those vectors using a predicate. We may want to remove all the elements for which the predicate is true.

For example, say I have 2 vectors. A previous kernel might have invalidated some of the values in the first vector by setting them to -1. Now, I want to compact these 2 vectors such that the elements corresponding to -1 in the first vector are removed from both vectors.

The code required for the above compaction is very similar to that in the earlier blog post:

#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>

// Many vectors
thrust::device_vector< int > vec0;
thrust::device_vector< int > vec1;

// Make zip_iterator easy to use
typedef thrust::device_vector< int >::iterator  IntDIter;
typedef thrust::tuple< IntDIter, IntDIter >     IntDIterTuple2;
typedef thrust::zip_iterator< IntDIterTuple2 >  ZipDIter;

// Remove elements in many vectors if element in vec0 is negative
ZipDIter newEnd = thrust::remove_if(    thrust::make_zip_iterator( thrust::make_tuple( vec0.begin(), vec1.begin() ) ),
                                        thrust::make_zip_iterator( thrust::make_tuple( vec0.end(), vec1.end() ) ),
                                        isTuple2Negative() );

// Erase the removed elements from the vectors
IntDIterTuple2 endTuple = newEnd.get_iterator_tuple();
vec0.erase( thrust::get<0>( endTuple ), vec0.end() );
vec1.erase( thrust::get<1>( endTuple ), vec1.end() );

The only extra work needed is to carefully write a predicate that does what we want:

// Make predicate easy to write
typedef thrust::tuple< int, int > IntTuple2;

// Predicate
struct isTuple2Negative
{
    __host__ __device__ bool operator() ( const IntTuple2& tup )
    {
        const int x = thrust::get<0>( tup );
        return ( x < 0 );
    }
};

That is it: the compaction works like magic! 🙂

Tried with: CUDA 4.0 and Thrust 1.4.0

Thrust: Remove Duplicates in Multiple Vectors

With the magical thrust::zip_iterator, duplicates in multiple vectors can be easily removed and the vectors trimmed in Thrust.

Consider two vectors, one of key values and the other holding their values. There can be many values associated with each key. The keys are sorted and the values associated with each key are also sorted. Finding duplicates in these vectors boils down to finding duplicate pairs and removing them. Here is how to achieve this easily using thrust::unique and thrust::zip_iterator:

#include <thrust/device_vector.h>
#include <thrust/unique.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>

typedef thrust::device_vector< int >                IntVector;
typedef IntVector::iterator                         IntIterator;
typedef thrust::tuple< IntIterator, IntIterator >   IntIteratorTuple;
typedef thrust::zip_iterator< IntIteratorTuple >    ZipIterator;

IntVector keyVector;
IntVector valVector;

// Remove duplicate pairs
ZipIterator newEnd = thrust::unique( thrust::make_zip_iterator( thrust::make_tuple( keyVector.begin(), valVector.begin() ) ),
                                     thrust::make_zip_iterator( thrust::make_tuple( keyVector.end(), valVector.end() ) ) );

IntIteratorTuple endTuple = newEnd.get_iterator_tuple();

// Trim the vectors
keyVector.erase( thrust::get<0>( endTuple ), keyVector.end() );
valVector.erase( thrust::get<1>( endTuple ), valVector.end() );

Tried with: Thrust 1.3 and CUDA 3.2

How to pass Thrust device vector to CUDA kernel

Thrust makes it convenient to handle data with its device_vector. But things get messy when the device_vector needs to be passed to your own kernel. Thrust data types are not understood by a CUDA kernel, so the vector needs to be converted back to its underlying pointer. This is done using thrust::raw_pointer_cast:

#include <thrust/device_vector.h>

__global__ void fooKernel(int* iArray); // Kernel declared elsewhere

thrust::device_vector<int> iVec;

// Get the raw device pointer underlying the device_vector
int* iArray = thrust::raw_pointer_cast(&iVec[0]);

// x and y are the grid and block dimensions
fooKernel<<<x, y>>>(iArray);

Inside a kernel, I typically need not only a pointer to the array, but also the length of the array. Thus, I find it is useful to convert device_vector to a structure that holds both a pointer to the array and its length. (Very much like a vector itself.) Creating such a structure and its conversion function is easy thanks to templates:

// Template structure to pass to kernel
template <typename T>
struct KernelArray
{
    T*  _array;
    int _size;
};

// Function to convert device_vector to structure
template <typename T>
KernelArray<T> convertToKernel(thrust::device_vector<T>& dVec)
{
    KernelArray<T> kArray;
    kArray._array = thrust::raw_pointer_cast(&dVec[0]);
    kArray._size  = (int) dVec.size();

    return kArray;
}

Passing device_vector to kernels and accessing its array inside the kernel is easy thanks to this infrastructure:

thrust::device_vector<int> iVec;

fooKernel<<<x, y>>>(convertToKernel(iVec)); // Explicit conversion from iVec to KernelArray<int>

__global__ void fooKernel(KernelArray<int> inArray)
{
    for (int i = 0; i < inArray._size; ++i)
    {
        int val = inArray._array[i]; // Access elements through the raw pointer
        // ... use val ...
    }

    return;
}

You can take it a notch higher and make the conversion from device_vector to KernelArray implicit. This can be done by adding a constructor to KernelArray that takes one input parameter of type device_vector. (See Stephen’s comment below the post.)
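
A minimal sketch of such a constructor, along those lines:

// Template structure with a constructor that enables implicit conversion
template <typename T>
struct KernelArray
{
    T*  _array;
    int _size;

    // Constructor allows implicit conversion from device_vector
    KernelArray(thrust::device_vector<T>& dVec)
    {
        _array = thrust::raw_pointer_cast(&dVec[0]);
        _size  = (int) dVec.size();
    }
};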

With such a constructor, you can now pass a device_vector seamlessly to the kernel:

thrust::device_vector<int> iVec;

fooKernel<<<x, y>>>(iVec); // Implicit conversion from iVec to KernelArray<int>

__global__ void fooKernel(KernelArray<int> inArray)
{
    // ...
}

Tried with: Thrust 1.3 and CUDA 3.2

Thrust: Installation Directory

Thrust consists of only header files and can be installed anywhere you want. However, it is easiest to use if it is installed in the %CUDA_INC_PATH% directory. This directory is included by all CUDA Build Rules, so nothing extra needs to be set in Visual Studio projects that intend to use Thrust.
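
As a quick sanity check that the headers are being picked up, a minimal sketch that prints the installed Thrust version:

#include <thrust/version.h>
#include <iostream>

int main()
{
    // These macros are defined in thrust/version.h
    std::cout << "Thrust " << THRUST_MAJOR_VERSION << "." << THRUST_MINOR_VERSION << std::endl;
    return 0;
}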

Tried with: Thrust 1.3.0 and CUDA 3.2

Thrust: zip_iterator

Thrust is a CUDA library of some of the most basic parallel algorithms that can be applied on data. It is written using C++ templates, looks like the STL and thus consists of only header files!

One of the coolest yet most confusing features of Thrust is the zip_iterator. Most Thrust functions accept one or two vectors. The zip_iterator is useful when you want to apply a function to more than one or two vectors. It can be understood with a simple example.

Consider an application where each integer key is associated with a Foo object. In old-school serial applications, you would place the key and Foo together in a structure and use a vector of that structure. For parallel applications, such information is easier to manipulate if it is represented separately as a vector of keys and a vector of Foo objects:

#include <thrust/device_vector.h>
typedef thrust::device_vector<int> IntVec;
typedef thrust::device_vector<Foo> FooVec;
IntVec keyVec;
FooVec fooVec;

These keys and their corresponding Foo objects can be sorted easily using thrust::sort_by_key. This function takes two vectors: a key vector and a value vector. It sorts the keys in the key vector and rearranges the values in the value vector to match the sorted order of the keys:

#include <thrust/sort.h>
// Assuming keyVec and fooVec are of equal size ...
thrust::sort_by_key( keyVec.begin(), keyVec.end(), fooVec.begin() );

What if the application has two or more vectors that need to be sorted along with the key vector? For example:

#include <thrust/device_vector.h>
typedef thrust::device_vector<int> IntVec;
typedef thrust::device_vector<Foo> FooVec;
typedef thrust::device_vector<Bar> BarVec;
IntVec keyVec;
FooVec fooVec;
BarVec barVec;

This is where zip_iterator is useful! It can be used to convert multiple vectors into tuples, so that they can be manipulated as one element. Sorting both the Foo vector and the Bar vector along with the key vector is now easy:

#include <thrust/sort.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
// Assuming keyVec, fooVec and barVec are of equal size ...
thrust::sort_by_key( keyVec.begin(), keyVec.end(),
                     thrust::make_zip_iterator( thrust::make_tuple( fooVec.begin(), barVec.begin() ) ) );

That is it! Any number of vectors can be zipped up like this into a tuple and can be used with any of Thrust’s functions! 🙂

Tried with: Thrust 1.3.0 and CUDA 3.2