The first assignment exploited data parallelism along a single dimension – with a direct
one-to-one or one-to-many correlation between each thread and the data it processes.
In the second assignment we will be operating on 2-D data and look at ways to adapt
some common operations for the GPU.
The official CUDA Documentation is the best resource for implementation details and
API specifics. The NVIDIA Developer Blog has a number of guides which showcase the
latest and best CUDA programming practices.
We strongly recommend reading through the CUDA-related links within this
document before you start writing code.
This lab is a significant departure from the previous lab. There are a number of unique
ways to parallelize each task, and each choice comes with its own set of tradeoffs. The
set of possible implementations is large, and we cannot wait to see your
ingenious and unique takes on the problems.
To boldly go where no one has gone before!
— Captain James T. Kirk, USS Enterprise
PART A: Filtering Noise from Images
Median Filtering
The median filter is a non-linear digital filtering technique, often used to remove noise
from an image or signal [1]. For small to moderate levels of Gaussian noise, the median
filter is demonstrably better than Gaussian blur at removing noise whilst preserving
edges for a given, fixed window size.
While a number of efficient implementations of median filtering exist [3], we will be
using the relatively compute-intensive “simple” implementation that precisely computes
the median. Furthermore, to simplify the implementation, we will be using a centrally
symmetric rectangular / square window such that the pixel at the centre will be replaced
by the median of all the pixels in the window.
A simple description of the algorithm can be found here. The curious, interested in
exploring Median Filtering in the context of recent (and past) academic work, can
consult the references listed at the end of this document.
The principle of the median filter is to replace the gray level of each pixel by the median
of the gray levels in a neighborhood of that pixel, instead of using the average
operation. A simplified pseudo-code implementation of the median filtering algorithm is
shown below.
for each pixel:
    load filter window – filterH x filterW centred @ pixel
    sort the window
    store median value as output
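The pseudo-code above can be fleshed out as a minimal single-channel C++ sketch. The function name `medianFilterSketch`, its signature, the assumption of odd window dimensions and the choice to skip window taps that fall outside the image are all illustrative assumptions; this is not the interface or behaviour of medianFilter_cpu() in cpuLib.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of cpuLib): median filter over one 8-bit
// channel stored in row-major order. Assumes odd filterH and filterW.
// Window taps that fall outside the image are skipped, so the window
// simply shrinks near the borders.
std::vector<uint8_t> medianFilterSketch(const std::vector<uint8_t>& in,
                                        int height, int width,
                                        int filterH, int filterW) {
    std::vector<uint8_t> out(in.size());
    std::vector<uint8_t> window;
    window.reserve(static_cast<std::size_t>(filterH) * filterW);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            window.clear();
            // Load the filterH x filterW window centred on (row, col).
            for (int dy = -filterH / 2; dy <= filterH / 2; ++dy) {
                for (int dx = -filterW / 2; dx <= filterW / 2; ++dx) {
                    int y = row + dy;
                    int x = col + dx;
                    if (y >= 0 && y < height && x >= 0 && x < width)
                        window.push_back(in[y * width + x]);
                }
            }
            // Sort the window and store the middle element as the output.
            std::sort(window.begin(), window.end());
            out[row * width + col] = window[window.size() / 2];
        }
    }
    return out;
}
```

The two outer loops are independent per pixel, while the two inner loops and the sort form the per-pixel work; that split is one natural starting point when deciding what to hand to individual GPU threads.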
A naive CPU implementation of the same can be found in
medianFilter_cpu() in cpuLib.cpp. You must implement an efficient algorithm for the GPU.
Think carefully about which loops to parallelize and which to execute sequentially inside
the GPU kernel.
Make sure you handle the edges of the image, where the filter window may extend
beyond the dimensions of the image.
Try to solve this at the algorithmic level instead of padding the image.
As an optional exercise, you can try to pad the image on the CPU or GPU and report
the overall performance impact of this additional step.
You may implement this in medianFilter_cpu() prior to starting on the GPU version.
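One way to handle borders at the algorithmic level is to clamp window coordinates into the valid range, which behaves like edge replication without allocating a padded copy. The assignment does not prescribe a border policy, so treat this as one option among several; `clampCoord` and `readClamped` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: clamp a coordinate into [0, size - 1].
inline int clampCoord(int v, int size) {
    return std::min(std::max(v, 0), size - 1);
}

// Hypothetical helper: read a row-major, single-channel pixel as if the
// image were edge-replicated. Out-of-range coordinates map to the
// nearest valid row / column.
inline uint8_t readClamped(const std::vector<uint8_t>& img,
                           int height, int width, int y, int x) {
    return img[clampCoord(y, height) * width + clampCoord(x, width)];
}
```

Clamping keeps every window the same size, which avoids a data-dependent window length inside the kernel; skipping out-of-range taps instead changes the median near the borders. Either choice should be stated in the report.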
1. Complete the CPU implementation
    o Verify visually using the script.
2. Write runGpuMedianFilter() as the entrypoint function.
3. Write the kernel
4. Write the kernel sort_gpu() to sort the filter window
    o Choose the sorting algorithm of your choice.
    o Carefully weigh (or better yet, measure; optional) the effects of a sequential
      vs. recursive implementation (if applicable), in particular on the SIMT stack.
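As a point of comparison for the sequential option in step 4, here is a minimal loop-based insertion sort in C++. It is only a stand-in sketch: the name `insertionSortSketch` and its signature are assumptions, and in CUDA the same body would need a `__device__` qualifier. Because it uses no recursion and no dynamic allocation, it does not grow a per-thread call stack.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a window sort: in-place, iterative insertion
// sort over `count` 8-bit values. Suited to small windows (e.g. 3x3, 5x5).
inline void insertionSortSketch(uint8_t* window, int count) {
    for (int i = 1; i < count; ++i) {
        uint8_t key = window[i];
        int j = i - 1;
        // Shift larger elements one slot to the right.
        while (j >= 0 && window[j] > key) {
            window[j + 1] = window[j];
            --j;
        }
        window[j + 1] = key;
    }
}
```

Note that the inner `while` loop runs a data-dependent number of iterations, so threads in a warp may diverge; whether that matters for your window sizes is worth measuring rather than assuming.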
Optimizing the Kernels
This assignment is tightly coupled with the course material – utilize your
understanding of the GPU architecture and the memory hierarchy to squeeze out
the best performance for your kernel.
Repeated profiling with nvprof will help you compare the design options and
evaluate the impact of your choices.
Think deeply and explore the design space in the following aspects:
    o Careful and optimal usage of shared memory resources [8]
    o CPU-to-GPU data transfer, GPU memory access and coalescing patterns [6] [7]
    o Threadblock and grid assignment to shaders, and how that influences the
      working set in each cache
    o The data storage format and access patterns, to help select which loops
      to parallelize
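When weighing shared memory usage, a quick size estimate helps. The sketch below assumes one possible strategy, in which each thread block stages its output tile plus the halo needed by the filter window; the helper name `tileBytesWithHalo`, odd filter dimensions and 1-byte samples are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes of shared memory needed when a thread block of
// blockH x blockW outputs stages its inputs plus the surrounding halo.
// Assumes odd filter dimensions and 1-byte samples per channel.
constexpr std::size_t tileBytesWithHalo(int blockH, int blockW,
                                        int filterH, int filterW,
                                        int channels) {
    return static_cast<std::size_t>(blockH + filterH - 1) *
           static_cast<std::size_t>(blockW + filterW - 1) *
           static_cast<std::size_t>(channels);
}
```

For a 16 x 16 block with a 3 x 3 window on RGB data this gives 18 * 18 * 3 = 972 bytes per block. Comparing such figures against the shared memory available per block on your device indicates how many blocks can be resident at once; whether staging actually pays off is something to confirm with profiling.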
Insufficient facts always invite danger, Captain.
— Cdr. Spock, USS Enterprise
Setting Up a Python Virtual Environment
Use the following commands to set up and use a Python virtual environment. This step is
essential to install Python packages on scholar without modifying the built-in Python
installation.
# Create a new python virtual environment in local directory `./python`
$ python3 -m venv ./python
# Activate your python virtual environment
# **NOTE** This step must be repeated every time to select your venv as the default
# python distribution
$ source ./python/bin/activate
# This step selects your “local” python environment for use instead of the system
# python installation
# Verify python path

$ which python3
/home/ … /lab2/python/bin/python3
# Install required python packages
$ pip3 install numpy pillow
# Learn about the command line arguments
$ python3 --help
info: No GUI element found – disabling GUI
usage: [-h] [-f FILE] [-l LEVEL] [-r RATIO]
Prepare Image for CUDA Programming Lab 2
optional arguments:
  -h, --help   show this help message and exit
  -f FILE      input image file path
  -l LEVEL     added noise intensity
  -r RATIO     Salt vs Pepper Ratio

# Run script
$ python -f ./resources/lena512color.tiff
info: No GUI element found – disabling GUI
TIFF 512×512 RGB
(512, 512, 3)
Storing bytes to – /……./resources/lena512color.tiff.bytes
786448 bytes written
… Done !
Using the .bytes file format
Modern images are stored in compressed formats like JPEG, which have their own
storage formats and compression/decompression schemes. A custom RAW pixel format
(.bytes) has been defined for the purpose of this class to enable you to quickly convert
any image into RGB pixels and use it with the code developed as part of this
assignment. The required Python scripts for interconversion and for viewing the original
and processed images have been provided. One script can be used to convert any given
image (tested with .jpg and .tiff formats) into a .bytes file. The other can be used in a
GUI-enabled environment to view the contents of a .bytes image.
For details about the command line arguments that can be passed to each script use
$ python3 --help
# or
$ python3 --help

NOTE: Viewing the images through any of these scripts requires you to run the scripts
on a GUI-enabled workspace (ThinClient or your local machine).
The scripts still work on console-only workspaces but will have the GUIs disabled.
In non-GUI environments, the scripts take console arguments and
directly store the output into a .bytes file.
// .bytes metadata (4 x 32-bit fields = 16 bytes)
typedef struct ImageDim_t {
    uint32_t height;
    uint32_t width;
    uint32_t channels;
    uint32_t pixelSize;
} ImageDim;
The .bytes format packs any image in a raw 8-bit pixel format comprising 16 bytes of
metadata followed by RGB interleaved data in little-endian format [5].
For the purpose of this assignment, all images will have 3 channels and 8-bit
(1-Byte) pixels.
The first 16 bytes of any .bytes file hold the metadata required to interpret
the contents of the rest of the file (it can be used to calculate dimensions and sizes).
The remaining bytes in the file hold the interleaved RGB pixels.
Pixels are stored in Row-Major format.
Look at the contents of loadImageBytes() in cpuLib.cpp to get an idea of the storage format.

MetaData   Px (0,0)   ...   Px (0,N-1)   ...   Px (N-1,N-1)
16 bytes   3 bytes    ...   3 bytes      ...   3 bytes
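The layout described above can be turned into simple offset arithmetic. The following C++ sketch is not loadImageBytes(); the struct name `ImageDimSketch` and the helpers `parseHeader`, `expectedFileSize` and `sampleOffset` are hypothetical. It assumes a little-endian host and that pixelSize is the number of bytes per channel sample (1 for this assignment).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Mirrors the metadata struct shown above: four 32-bit fields = 16 bytes.
struct ImageDimSketch {
    uint32_t height;
    uint32_t width;
    uint32_t channels;
    uint32_t pixelSize;  // assumed: bytes per channel sample
};

// Hypothetical helper: copy the 16-byte header out of a raw .bytes buffer.
// Assumes the host is little-endian, matching the file.
inline ImageDimSketch parseHeader(const std::vector<uint8_t>& file) {
    ImageDimSketch dim{};
    std::memcpy(&dim, file.data(), sizeof(dim));
    return dim;
}

// Total file size implied by the header: metadata plus all samples.
inline std::size_t expectedFileSize(const ImageDimSketch& d) {
    return sizeof(ImageDimSketch) +
           static_cast<std::size_t>(d.height) * d.width * d.channels * d.pixelSize;
}

// Byte offset of one channel sample in a row-major, interleaved image.
inline std::size_t sampleOffset(const ImageDimSketch& d,
                                uint32_t row, uint32_t col, uint32_t ch) {
    return sizeof(ImageDimSketch) +
           ((static_cast<std::size_t>(row) * d.width + col) * d.channels + ch) *
               d.pixelSize;
}
```

For a 512 x 512 RGB image this gives 16 + 512 * 512 * 3 = 786448 bytes, which matches the byte count printed by the conversion script in the log earlier in this document.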

What to include in your report
Your parallelization strategy and any optimizations you have included (e.g. shared
memory resources, memory coalescing etc.)
Images showing the effectiveness of your implementation of median filtering.
Drawbacks (if any) of your implementation.
PART B: Pooling
A quick overview of the pooling operation in Neural Networks can be found here. Read
carefully to understand the operation of a pooling layer and its various configuration
parameters.
Pooling is a feature commonly incorporated into Convolutional Neural Network (CNN)
architectures. The main idea behind a pooling layer is to “accumulate” features from
maps generated by convolving a filter over an image.
Formally, its function is to progressively reduce the spatial size of the representation to
reduce the number of parameters and the amount of computation in the network. The
most common form of pooling is max pooling.
Max pooling is done, in part, to help prevent over-fitting by providing an abstracted form
of the representation. It also reduces the computational cost by reducing the number of
parameters to learn, and provides basic translation invariance to the internal
representation. Max pooling is done by applying a max filter to (usually) non-overlapping
subregions of the initial representation.
The next two labs will introduce a deep neural network in its entirety and provide you
the opportunity to implement a foundational work on DNNs which has been cited > 76k
times: AlexNet. For this part, however, you are only required to implement the pooling
layer.
An illustration of the pool operation can be found below.
CPU Implementation
A basic, incomplete implementation of a Max-Pool layer is provided inside cpuLib.h. It
lays down the API and (after completion) can be used for checking your GPU
implementation later.
1. Complete the CPU code for max pooling
2. Add provisions for padding and strides
    o Implement padding by using conditional statements rather than by
      actually padding the data.
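The two steps above can be sketched for a single channel as follows. This is not the API laid down in cpuLib.h: `pooledSize` and `maxPoolSketch` are hypothetical names, the output-size formula is the common floor((in + 2*pad - pool) / stride) + 1 convention, and padded positions are simply ignored by a bounds check (which assumes pad is smaller than the pool size, so every window touches at least one real element).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: output length along one dimension.
inline int pooledSize(int inSize, int poolSize, int stride, int pad) {
    return (inSize + 2 * pad - poolSize) / stride + 1;
}

// Hypothetical single-channel max pool over a row-major input.
// Padding is handled with a conditional instead of a padded copy.
std::vector<float> maxPoolSketch(const std::vector<float>& in, int inH, int inW,
                                 int poolH, int poolW, int stride, int pad) {
    const int outH = pooledSize(inH, poolH, stride, pad);
    const int outW = pooledSize(inW, poolW, stride, pad);
    std::vector<float> out(static_cast<std::size_t>(outH) * outW);
    for (int oy = 0; oy < outH; ++oy) {
        for (int ox = 0; ox < outW; ++ox) {
            float best = std::numeric_limits<float>::lowest();
            for (int py = 0; py < poolH; ++py) {
                for (int px = 0; px < poolW; ++px) {
                    const int y = oy * stride - pad + py;
                    const int x = ox * stride - pad + px;
                    // Taps that land in the padding region are skipped.
                    if (y >= 0 && y < inH && x >= 0 && x < inW)
                        best = std::max(best, in[y * inW + x]);
                }
            }
            out[oy * outW + ox] = best;
        }
    }
    return out;
}
```

Skipping padded taps is equivalent to padding with negative infinity; zero padding would give different results for windows whose real elements are all negative, so the convention you pick should match the reference you check against.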
GPU Implementation
The naive CPU implementation is a set of nested loops – each loop presents an
opportunity for parallelization. Choose wisely.
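One possible mapping, shown here only as a host-side C++ sketch, is to collapse the loops over output elements into a single flat index and give each GPU thread one output element. The names `OutCoord`, `decodeIndex` and `blocksNeeded` are hypothetical; on the GPU the flat index would typically come from `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <cassert>

// Hypothetical: coordinates of one pooled output element.
struct OutCoord {
    int channel;
    int row;
    int col;
};

// Decompose a flat work-item index into (channel, row, col) for an output
// of shape channels x outH x outW stored channel-major, then row-major.
inline OutCoord decodeIndex(int flat, int outH, int outW) {
    OutCoord c;
    c.col = flat % outW;
    c.row = (flat / outW) % outH;
    c.channel = flat / (outW * outH);
    return c;
}

// Number of thread blocks needed to cover all work items (ceiling division).
inline int blocksNeeded(int workItems, int threadsPerBlock) {
    return (workItems + threadsPerBlock - 1) / threadsPerBlock;
}
```

With this mapping the pooling-window loops stay sequential inside each thread. Because the block count is rounded up, the last block may contain threads whose flat index is past the end of the output, so a kernel using this scheme needs a bounds check before writing.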
Reporting your Progress
Refer to the Lab1 Readme, especially the sections on Building and Executing the …
and Reporting your Progress, to build, test and profile your application.
All the instructions provided in Lab 1 also apply to this Lab.
Each time you login
Load the gcc, cuda and cmake modules. To avoid typing this every time, consider
adding this command to your shell startup file.
$ module load gcc cuda cmake
Activate the python virtual environment setup within the assignment directory
$ cd <LAB2_ROOT_DIR>

$ source ./python/bin/activate
> (python) <user>@<host>:~/…../lab2 $

Successful loading of the venv will be indicated by (python) in front of the shell
prompt, as shown in the last line above.
Performance and Speedup
Time is the only true measure of performance
Please report all metrics and speedups in terms of absolute execution time
(in μs or ms as applicable). A percentage does not convey the actual execution time.
If you happen to use Unified Virtual Memory (cudaMallocManaged), report separate sets of
data for managed and unmanaged device memory.
An example of the stacked bar chart format expected for reporting the breakup of
workload execution time for the report is given below.
The figure is indicative of the format only and
has no correlation with any profiling graph, real or fictional.
Here are some ideas for the graphs that you can use to present your efforts and results
in the report.
1. Stacked bar chart (as in the figure) showing execution time vs problem size
2. Line or bar chart showing speedup vs naive CPU
3. Chart showing speedup vs optimized CPU (if you make any changes)
4. Chart with performance vs shared memory strategy (size or organization) – same
or varying problem size
5. Chart with the performance gain achieved by any techniques you have applied
Be terse and lucid when explaining the trends you observe.
[1] Wikipedia – Median Filtering
[2] Arias-Castro, Ery; Donoho, David L. Does median filtering truly preserve edges better than linear filtering?. Ann. Statist. 37 (2009), no. 3, 1172–1206. doi:10.1214/08-
[3] T. Huang, G. Yang and G. Tang, “A fast two-dimensional median filtering algorithm,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 1, pp. 13-18, February 1979, doi:
[4] Arce, G.R. (2004). Weighted Median Filters. In Nonlinear Signal Processing, G.R. Arce
[5] Wikipedia – Endianness
NVIDIA Developer Blog
[10] DeepAI Max-Pooling