Nearest-neighbor Access in Linear Texture-filtering Mode


You may not have noticed that, in both CUDA and OpenCL, you can perform a nearest-neighbor texture lookup (as if the texture were in cudaFilterModePoint) on a texture that is configured for linear filtering (cudaFilterModeLinear). Why not simply set the texture to nearest-neighbor mode? Because sometimes a single texture has channels that must not be interpolated alongside channels that should be (packing them together lets the texture cache serve both more efficiently). And you don't want to implement the linear interpolation yourself, because that roughly halves performance.

Suppose your texture lookup code for point access mode is as follows:


float2 data = tex3D(tex, x, y, z);
id = data.x;
alpha = data.y;

Here id is a color ID used for a later lookup, so it must not be blended.

Suppose you want interpolation on the second channel but not the first. Here is how, for a texture in cudaFilterModeLinear and cudaReadModeElementType mode with unnormalized coordinates:

id = tex3D(tex, floor(x+.5f)+.5f, floor(y+.5f)+.5f, floor(z+.5f)+.5f).x; // nearest neighbor
alpha = tex3D(tex, x,y,z).y; // linear

The .5f inside floor rounds to the nearest integer instead of rounding down.
The outer +.5f addresses the texel center, so the fetch returns the un-interpolated value stored in the device texture.
This assumes x, y, z are unnormalized.

For normalized texture coordinates it's more complex, and I would not recommend it:

id = tex3D(tex, (floor(x+.5f)+.5f)/w, (floor(y+.5f)+.5f)/h, (floor(z+.5f)+.5f)/d).x; // nearest neighbor
alpha = tex3D(tex, x/w, y/h, z/d).y; // linear


Although there is more computation to perform, the performance is still much better than doing your own interpolation. One drawback is quality: you lose accuracy compared to manually interpolating the six neighboring colors referenced by the IDs around the sample point.

CUDA operator/ overload for vector types


Although the CUDA utilities provide convenient vector-type functions, they got the vector-type operator/ overloads wrong.
In the cutil_math.h you can find:


inline __host__ __device__ float4 operator/(float4 a, float s)
{
    float inv = 1.0f / s;
    return a * inv;
}
inline __host__ __device__ float4 operator/(float s, float4 a)
{
    float inv = 1.0f / s;
    return a * inv;
}


See? The function bodies are identical! The second overload should compute s divided by each component, not the other way around. So I always got wrong values when I wrote:
    float4 a, inv_a;
    inv_a = 1/a;


No wonder someone told me not to use the CUDA utilities.

PS. The code has been corrected in CUDA 4.0+.

Cache matters


In a homework for my CS parallelism class we were given a simple nested for-loop and asked to make it faster without changing its semantics. In other words, we only consider increasing the cache hit rate. Here is the code:


 
#define N (128)
double A[N][N][N], B[N][N], C[N][N][N],  CC[N][N][N]; 
...

  for (i=0; i<N; i++)
   for (j=0; j<N; j++)
    for (k=0; k<N; k++)
     for (l=0; l<N; l++)
       C[i][j][k] += A[l][i][j]*B[l][k];
 

By permuting and unrolling the loops and checking whether the running time dropped, here is the best version I found:

  
for (i=0; i<N; i++)
   for (l=0; l<N; l+=2)
     for (j=0; j<N; j+=2)   
      for (k=0; k<N; k+=2)
      {

      CC[i][j][k] += A[l][i][j]*B[l][k];
      CC[i][j][k+1] += A[l][i][j]*B[l][k+1];
 
      CC[i][j][k] += A[l+1][i][j]*B[l+1][k];
      CC[i][j][k+1] += A[l+1][i][j]*B[l+1][k+1];
 
      CC[i][j+1][k] += A[l][i][j+1]*B[l][k];
      CC[i][j+1][k+1] += A[l][i][j+1]*B[l][k+1];

      
      CC[i][j+1][k] += A[l+1][i][j+1]*B[l+1][k];
      CC[i][j+1][k+1] += A[l+1][i][j+1]*B[l+1][k+1];
      }
 

 
And here is the result (Base is the original code, Test is my modified version):

Tensor Size = 128
Base-TensorMult: 113.4 MFLOPS; Time = 4.736 sec; 
Test-TensorMult: 1863.8 MFLOPS; Time = 0.288 sec;
No differences found between base and test versions
 

You see? It can be more than 16 times faster!! So we really need to take caches into consideration when we code, even though the result can be really ugly :(

I found an ultimate method to share files among apps on iPad (and iPhone)


Background
The impression Apple's promotions give is that it's easy to produce documents on an iPad alone. It is easy, but Apple seems to have messed up file sharing among apps and among devices. We need to share files among apps because one application has features that another lacks. For example, some applications are good at downloading/uploading files but are poor readers/editors, and some are the other way around.

Yes, Apple provides tools for developers to send files to another app, and users can directly access the Documents directory inside an application's own sandbox through iTunes. But this is still handicapped in two ways:
1. Files are not organized. Our working files are spread across several apps, some duplicated in different versions. It's a nightmare to recall every time which app holds the latest file we edited, and with many documents we have to scan through the files to find the exact one.
2. It wastes storage space. As mentioned above, sending a file to another application makes a copy in that application's own directory, so we waste space keeping two or more copies on our iDevice, while Apple charges a lot for extra storage.

So the ultimate solution to the above problems is to let all regular applications see one shared directory. If they share the same directory, there are no more duplicated files and we can organize everything in one place, once and for all! I have come up with an elegant and simple way to do this, but it requires a jailbroken device. I believe it's theoretically doable on non-jailbroken iDevices too, though I guess Apple will never legally allow this method or such an app on our devices (they could consider providing more flexible file sharing now, though).

Problem
So I want all my applications to be able to see a shared directory.

Solution
The simplest way, after jailbreaking, is to use a Unix symbolic link (like an alias on Mac OS, or a shortcut on Windows) to redirect applications to our shared directory. So which directory is shared among all regular applications and is both readable and writable? I can find only one: the shared photo album directory, /var/mobile/Media/DCIM.
The idea is to create a directory, say CrossAppDocs, inside that shared directory, then create links to it in the Documents directory of every application that should have access. Here are two methods:

Method
Method #1. Using iFile (Cydia)'s graphical interface (to keep this short I won't list every tiny step):
a. Browse to /var/mobile/Media/DCIM.
b. Tap Edit and then '+' to create a new directory named CrossAppDocs. Change the Group permission to readable and writable, then tap Create to save.
c. Still in directory edit mode, select the directory just created and tap the icon in the lower right corner to 'copy/link'.
d. Tap Done to finish editing.
e. Browse to /var/mobile/Applications and go into each application you want to link to CrossAppDocs.
f. Go to its Documents directory, tap Edit, tap the same lower-right icon, and choose 'create link'. You have now created a link to the shared documents!


Method #2: using a terminal/SSH
1. Go to /var/mobile/Media/DCIM
2. mkdir CrossAppDocs
3. Since you log in as root, change the permissions and ownership:
- chmod 775 CrossAppDocs
- chown mobile CrossAppDocs
4. Go to the Documents directory of each application you want to link: /var/mobile/Applications/(some hash code)/Documents
5. Type: ln -s /var/mobile/Media/DCIM/CrossAppDocs ./CrossAppDocs
That's it; the link to the shared documents is created.

Result Images:

GoodReadersLite sees CrossAppDocs


Downloads Lite sees CrossAppDocs




Discussion
Note that some applications use their Documents directory differently; you may need to add the link inside their subdirectories.

Adding a link in each app's Documents directory is just one use of the Unix link feature. If an application cannot read/write files in subdirectories of Documents, you can even try turning the Documents directory itself into a link to a folder inside your shared documentation folder.
Here is a list of applications that work very well:
- iBooks (only shows PDF)
- GoodReader
- Office2
- iAnnotate
- Downloads
- FileBrowser(in data sub-directory)
- Files
- PrintCentral
And more you can imagine

Unfortunately the iWork products currently do not work well with this scheme because they do not see subdirectories in Documents. The only way, and it is not recommended, is to turn the Documents directory itself into a link to the shared folder.

How does iTunes backup handle the links? I don't know, but iTunes doesn't seem to complain. Maybe it simply backs up the link itself.

Going further: can we add a link in an app's Documents directory without jailbreaking? iTunes has supported file uploading since iOS 3.2 (and in iOS 4.0), so I wonder whether we could somehow send a locally created link to our iDevices. Directly uploading a link does not work. Could it be done through some iTunes API?

This article is just a first idea for cross-application file sharing. If you have more advanced ideas, please feel free to leave a comment.

Fast way to back up Time Machine on the Windows network for the first time


Keywords: Mac, Time Machine, Windows sharing, fast

If you have just migrated from PC to Mac (especially to a MacBook), you probably have an old PC you seldom use, or one you use to share files with the Mac. I turned my old PC into a file server through basic Windows sharing, and have been wondering: why not back up my Time Machine to that file server? Then I wouldn't have to plug in a USB external drive every weekend. Here are some articles that really work (but not well... I'll explain why):


OSX Time Machine and Samba/Windows share
Time Machine Backup to Network Device for Mac

I followed the first guide and successfully started Time Machine, but found the first backup might take forever (I was backing up to my PC over 802.11n).

However, I know Time Capsule allows us to back up locally first and then move the drive to the network. Is there a similar way to do this with a Windows share? Unfortunately the Mac doesn't seem to allow connecting to a locally shared directory. The only way I can think of is to use a virtual machine like VMware Fusion and share from inside the VM; the Mac then sees the share as a remote one.

It took me a week of trial and error. It's not trivial, as Time Machine sometimes halts and there is nothing to do but restart. Finally I found some tricks and successfully backed up to Windows XP in VMware. It took just half a day to back up 114 GB of data.

Some things I have confirmed:
1. You can back up Time Machine through Mac sharing (Netatalk/AFP protocol) or Windows sharing (Samba/CIFS protocol). The backup formats differ, so you cannot switch between them after backing up. Note that Mac sharing takes less hacking (more reliability, less complexity), but you need a second Mac. I don't have one, so I focus on backing up to a PC network. (If you have another Mac for backup, don't follow this post.)

2. I used NTFS on the PC. I guess FAT32 works too, since the Mac's sparsebundle is actually a directory containing many small files.


How to:
First stage - preparation:
1. Run Windows in the virtual machine and set the network interface to Host-only (it sees only the host) to prevent unwanted network traffic.

2. Connect the external backup drive, create a backup folder, say TM, and share it (remember to set the permissions to allow read/write). Time Machine will later back up everything into this folder.

3. Check the virtual machine's IP address by typing "ipconfig" in the command line. Say you get 192.168.1.5

4. Back on the Mac, run Terminal, type "ping 192.168.1.5", and leave it running until the backup is done.
- This ensures the virtual machine does not sleep and miss packets from the host.

Second stage - first backup (I followed this post
and added comments to give more information):

1. Run Terminal and type:

defaults write com.apple.systempreferences TMShowUnsupportedNetworkVolumes 1

2. Mount the network share TM (Apple+K in Finder) for backup.
Note:
- The partition should have free space of roughly twice the size of your data.

3. Get the Ethernet address of your Mac. You can either do this in Terminal by typing "ifconfig en0 | grep ether", or go to System Profiler (About This Mac --> More Info), click Network, then Ethernet.


4. Back to Terminal, type based on your situation:
sudo hdiutil create -size 199.5g -type SPARSEBUNDLE -nospotlight -volname "Time Machine" -fs HFS+J -verbose YourMacName_YourMacAddress.sparsebundle -tgtimagekey sparse-band-size=262144
Note:
- Do not use the GUI application Disk Utility, because there you cannot specify "nospotlight" and "imagekey". My Time Machine once halted mid-backup because Spotlight tried to index the shared drive. It's stupid.
- Do this locally. Even though you specify a 199.5 GB image, this step creates an image of less than 500 MB; you can move it to the network drive later.
- 199.5g: change this to match the free space on your network drive.
- YourMacName is your Mac's name. Some articles say you should not use a name longer than 8 characters.
- YourMacAddress is the address from the previous step with the colons removed (12 characters).
- tgtimagekey: I'm not sure it improves your chance of a successful backup; just try it.


5. After the sparsebundle is generated, the Mac automatically mounts the image.

6. Unmount / Eject the image.

7. Move the .sparsebundle disk image file you created to your mounted network location.

8. Open up the time machine preferences, and click on "Choose Backup Disk".

9. Select your new disk image and then click on "Use for Backup".


10. Your Time Machine should start the initial backup process.
Note:
- You can hit "Back Up Now" in Time Machine's status-bar dropdown to start immediately.
- Never hit the X icon in the status bar or "Stop Time Machine" to pause the backup. Time Machine would restart from zero next time and never run correctly; you would have to start over by creating a new sparsebundle image.
- Leave the pinging Terminal open until the backup finishes. Do not run anything else in the virtual machine; let it focus on saving the packets to your external hard drive.
- Unfortunately, even when I did everything right, Time Machine still sometimes halted during backup. I guess it waits for a reply from the client whose packets got lost. You can tell by opening Task Manager in Windows and switching to the Networking tab: if the graph shows flat zero, Time Machine has probably halted. Here is what I do: put the Mac to sleep, then wake it up. Time Machine magically continues.
- Open Utilities => Activity Monitor and check the backup speed on the Disk Activity tab at the bottom. The write speed should not be much less than the read speed (though my read speed was actually low: just 2-10 MB/s, sometimes dropping to KB/s). If it's too slow, do the magic step again: put the Mac to sleep and wake it up.

Last stage
1. Once done, move the external drive back to your normal PC and share the TM folder again (remember to allow write permission).
2. Back on the Mac, mount the TM network share. In Time Machine, hit "Select Disk" and choose the mounted network drive. Then you are done; enjoy the mobile backup environment.

ldexp - Multiply by 2 to the power n


Check the OpenCL spec and see if your calculation can be substituted by one of its built-in functions.

I used to write

y = x * (1 << n);

where x is a float and n is an integer. But I just found there is a built-in function that does exactly this job! And FYI, C's <math.h> has it too.

CUDPP :: GPGPU.org


Their library is good for CUDA. Maybe soon we can use the library in OpenCL.

clEnqueueWriteImage segmentation fault


I got a segmentation fault at clEnqueueWriteImage but could not find any mistake in the arguments.
Eventually I tried writing just a {1,1,1} region and still got the error. So obviously something else was causing the segmentation fault, and it was clCreateImage3D!

I had created a large image (1024x1024x1024) with an RGBA UNORM_INT8 format. I still don't know why this size exceeded the limit, or why nVidia's OpenCL library didn't return an error from clCreateImage3D.
In the end I used 512x512x512 and it worked fine. Strange thing +1.

OpenCL buffer usage


OpenCL buffer creation is tedious, and every time it costs me a lot of time to figure out why it doesn't work. Here are some notes I learned after several hours of work.

__constant limitation
You cannot pass a large buffer as a __constant kernel argument. nVidia GPUs will read such a buffer erroneously (with no error message, just wrong values!) because constant memory is a small dedicated region that cannot hold a large array.
Solution: use __global instead. To improve performance, create the buffer with CL_MEM_READ_ONLY in clCreateBuffer; this increases the speed a lot.

Data re-use
If an array is modified in an OpenCL kernel and re-used the next time the kernel runs, simply creating one CL_MEM_READ_WRITE buffer is not efficient.
Solution: create two buffers instead, CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY, and pass both as __global kernel arguments. After each round, have the host copy (clEnqueueCopyBuffer) the written buffer back into the read-only one.
In other words, never use CL_MEM_READ_WRITE! That serializes away your parallelism!
* Revision: I ended up using CL_MEM_READ_WRITE after all and found it worked well for holding the previous state and re-using it in the next clEnqueueNDRangeKernel call, which saves the buffer copies. I'm not sure why it ran slowly before; maybe an updated driver, or I simply became aware of memory coalescing.

nVidia Driver 196.21


Noticed that nVidia released new drivers a couple of weeks ago. But for the 9400M on my laptop it's still 195.xx. Sad...

Currently nVidia's OpenCL runs slowly; it's slower than the same task implemented in GLSL on the same graphics card. I really wonder when the efficiency will improve, or whether it ever will...

Notes on using clCreateImage



cl_mem clCreateImage3D(
cl_context context,
cl_mem_flags flags,
const cl_image_format *image_format,
size_t image_width,
size_t image_height,
size_t image_depth,
size_t image_row_pitch,
size_t image_slice_pitch,
void *host_ptr,
cl_int *errcode_ret)


The units for image_width, image_height, and image_depth are pixels,
but the units for image_row_pitch and image_slice_pitch are bytes!

Example:
mem_volume = clCreateImage3D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, &volume_format,
    vrParam.volSize[0], vrParam.volSize[1], vrParam.volSize[2],                                // units in pixels
    vrParam.volSize[0]*sizeof tmpBlock, vrParam.volSize[0]*vrParam.volSize[1]*sizeof tmpBlock, // units in bytes
    tmpBlock, &err);

Besides, the spec says the element size "must be a power of 2 in bytes", which did not seem mandatory in my experiments.

host_ptr

A pointer to the image data that may already be allocated by the application.
The size of the buffer that host_ptr points to must be greater than or equal to
image_slice_pitch * image_depth. The size of each element in bytes must be a power
of 2. The image data specified by host_ptr is stored as a linear sequence of
adjacent 2D slices. Each 2D slice is a linear sequence of adjacent scanlines. Each
scanline is a linear sequence of image elements.

OpenCL / OpenGL Interop.


It took me a week to figure out.

First download the 3.0 beta driver here (only this version supports CL/GL interop)
http://forums.nvidia.com/index.php?showtopic=149959

Then follow this page to modify the sample code:
http://oscarbg.blogspot.com/2009/11/amd-opencl-samples-on-nvidia-195-opencl_05.html

OpenCL & volume rendering - GF 8800 GTX


OK, I moved the same program to a PC with a GeForce 8800 GTX.
It runs pretty fast!
12.41 ms (80.51 fps) to render a skull (85x96x134 voxels), and
29.19 ms (34.25 fps) with 169x192x268 voxels.
The screen size is again 500x500.

PS. The volume renderer has no lighting and no transfer-function lookup.


GPU info:
CL_DEVICE_NAME: GeForce 8800 GTX
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 195.62
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 16
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1350 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 192 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 768 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

CL_DEVICE_IMAGE 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048

CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query


CL_DEVICE_COMPUTE_CAPABILITY_NV: 1.0
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_FALSE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_FALSE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_FALSE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_ CHAR 1, SHORT 1, INT 1, FLOAT 1, DOUBLE 1

OpenCL & volume rendering


I implemented a simple volume renderer in OpenCL, which can load the mouse dataset from the project.
On WinXP on my laptop (Boot Camp, nVidia 9400M),
it takes around 250 ms (4 fps) to render a skull (85x96x134 voxels) and
500 ms (2 fps) with 169x192x268 voxels.
The screen size is 500x500.

----
GPU info:


CL_DEVICE_NAME: GeForce 9400M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 195.62
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1100 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 253 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

CL_DEVICE_IMAGE 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048

CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics


CL_DEVICE_COMPUTE_CAPABILITY_NV: 1.1
CL_DEVICE_REGISTERS_PER_BLOCK_NV: 8192
CL_DEVICE_WARP_SIZE_NV: 32
CL_DEVICE_GPU_OVERLAP_NV: CL_FALSE
CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV: CL_TRUE
CL_DEVICE_INTEGRATED_MEMORY_NV: CL_TRUE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_ CHAR 1, SHORT 1, INT 1, FLOAT 1, DOUBLE 1