Home Exam 2: Video Encoding on Tegra X1 using the CUDA framework
In this assignment, you will take advantage of the computing power available on a graphics processor to accelerate video encoding.
You are supposed to:
- Optimize the c63 encoder using CUDA and the GPUs on the Nvidia Jetson TX1 boards.
- Write a report where you discuss your results, and how you got to the final version.
- Create a poster (2 A3 pages) and participate in the poster session on October 26th.
Codec63
Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It is not compliant with any standard by itself, so the precode contains both an example encoder and a decoder (which converts an encoded file back to YUV). C63's inter-frame prediction works by deciding, independently for every macroblock, whether it is encoded with a motion vector or not. If a motion vector is used, it refers to the previous frame.
Macroblocks are encoded according to the JPEG standard [1] if no motion vector is used, and stored in the output file. If a motion vector is used, the residual is encoded in the same manner, and the motion vector is stored right before the encoded residual. An illustrative overview of the steps involved during JPEG encoding can be found at Wikipedia [2].
It is your task to optimize the c63 encoder using the CUDA framework.
The c63 encoder is very basic and shows behavior that you would not accept in a standard encoder. This concerns in particular the Huffman tables and the unconditional use of motion vectors in non-I-frames. You should not modify these Huffman tables. You can decide to use conditional motion vectors, but you must search for motion vectors, and you must write code that potentially uses the whole motion vector search range (hard-coded to 16 in the precode).
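To make the search-range requirement concrete, here is a minimal sketch of an exhaustive motion vector search over the whole range. It is not the precode's implementation: the 8x8 block size, the sum-of-absolute-differences (SAD) cost, the inclusive window bounds, and the names sad_8x8 and full_search_8x8 are all assumptions made for illustration.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8 /* assumed block size (JPEG works on 8x8 blocks) */

/* Sum of absolute differences between two 8x8 blocks that live in
 * planes with the same row stride. Hypothetical helper. */
static int sad_8x8(const uint8_t *a, const uint8_t *b, int stride)
{
    int sad = 0;
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x)
            sad += abs(a[y * stride + x] - b[y * stride + x]);
    return sad;
}

/* Exhaustive search for the block whose top-left pixel is (bx, by).
 * Every candidate within +/- range pixels is tested; the window is
 * clipped so that candidates stay inside the reference plane.
 * The best offset is returned through mv_x / mv_y. */
void full_search_8x8(const uint8_t *cur, const uint8_t *ref,
                     int width, int height, int bx, int by, int range,
                     int *mv_x, int *mv_y)
{
    int left   = bx - range < 0 ? 0 : bx - range;
    int top    = by - range < 0 ? 0 : by - range;
    int right  = bx + range > width - BLOCK ? width - BLOCK : bx + range;
    int bottom = by + range > height - BLOCK ? height - BLOCK : by + range;

    const uint8_t *cur_block = cur + by * width + bx;
    int best = INT_MAX;
    *mv_x = 0;
    *mv_y = 0;

    for (int y = top; y <= bottom; ++y) {
        for (int x = left; x <= right; ++x) {
            int sad = sad_8x8(cur_block, ref + y * width + x, width);
            if (sad < best) { /* strict '<' keeps the first best match */
                best = sad;
                *mv_x = x - bx;
                *mv_y = y - by;
            }
        }
    }
}
```

Each candidate position is evaluated independently of the others, and so is each block, which is the kind of structure worth looking for when you identify parallelization options.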
The video scenario is live streaming. You should not have an encoder pipeline of more than 3 frames. In addition, you should not use parallelization techniques that severely degrade the video quality.
You should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms provide large speedup potential, but they distract from the main goal of this home exam, which is to identify and implement parallelization options.
Two test sequences in YUV format are available in the /mnt/sdcard directory on the lab machines:
- foreman (352x288) CIF
- tractor (1920x1080) 1080p
These should be used as input to the provided c63 encoder, and can be used to test your implementations.
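When reading the raw input it helps to know how many bytes one frame occupies. The sketch below assumes planar YUV 4:2:0 with 8-bit samples, which is the usual layout for these test sequences but is not stated above; the function name yuv420_frame_bytes is hypothetical.

```c
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame with 8-bit samples (assumed
 * layout): a full-resolution luma plane followed by two chroma
 * planes subsampled by two in each direction. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    size_t luma   = width * height;
    size_t chroma = ((width + 1) / 2) * ((height + 1) / 2);
    return luma + 2 * chroma;
}
```

Under that assumption, dividing the size of a .yuv file by yuv420_frame_bytes(width, height) gives the number of frames in the sequence.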
Precode
The precode consists of the reference c63 code including:
- an encoder
- a decoder
- the command c63pred (which extracts the prediction buffer for debugging purposes)
The precode is written in C. You are not required to touch the decoder or c63pred.
The precode can be downloaded from a Git repository here:
git clone https://bitbucket.org/mpg_code/inf5063-codec63.git
You must log in to the Jetson TX1 devkit assigned to your group for this assignment. You should have received an email from the course administrators about which kits to use. Information about how to access the kits can be found in the GPU FAQ.
You are free to adapt, modify or completely rewrite the provided encoder to take full advantage of the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation or DCT/iDCT. You are not allowed to paste any other pre-written code into your implementation. You are also not allowed to post any code from the home exam on the Internet.
Start by profiling the encoder to see which parts are the bottlenecks. Remember that after optimizing one part of the code, more profiling might be needed to find new bottlenecks. Use Nvidia's optimization guides as a starting point.
Some usage examples:
To encode the foreman test sequence
$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv
To decode a sequence
$ ./c63dec /tmp/test.c63 /tmp/test.yuv
To playback a raw yuv file
$ mplayer /tmp/test.yuv -demuxer rawvideo -rawvideo w=352:h=288
Evaluation
Write a short report where you discuss your results. The exam will be graded on how well you are able to take advantage of the GPU architecture to solve the task at hand.
In evaluation, we will consider (in order):
- A program that works (on the provided Jetson TX1). (**)
- Effective use of the GPU architecture:
  - At least Motion Estimation and DCT/iDCT offloaded to the GPU. (*)
  - Use of the parallelization potential on the GPU.
  - Understanding the SoC architecture, and minimizing the overhead of moving data between the CPU and GPU.
  - Correctness of memory use on the GPU (memory types, bank conflicts).
  - GPU code optimization with regard to branching.
  - Use of the parallelization potential of a 3-frame pipeline.
- Good documentation:
  - Readable, well-commented code.
  - Optimization steps and performance results.
  - Comparison of / reflection about alternative approaches.
  - Complete and well-presented document.
- The output video has a PSNR and file size similar to or better than the reference encoder's.
- Bonus points for other non-obvious optimizations such as Motion Compensation and/or offloading parts of VLC.
(*) Automatic fail if this is not fulfilled. (**) We do not debug code before testing; correctness and effectiveness are not evaluated if this is not fulfilled.
Machine Setup
The Jetson TX1 devkits are situated at Simula Research Laboratory. Machine names and how to access them can be found in the GPU FAQ. If you have reported your group to the course administration, you should have been assigned to a devkit and provided with a username and a password.
Contact inf5063@ifi.uio.no if you have problems logging in.
Formal Information
The deadline for handing in your assignment is: Monday, October 24th at (15:00:00.00).
Deliver your code and report (as PDF) at https://devilry.ifi.uio.no/. Submit the poster (as PDF) to inf5063@ifi.uio.no.
The groups should also prepare a poster (2 x A3 pages) and a quick 2-minute talk (without slides) where you pitch your poster to the class on October 26th. Name the poster with your group name, and email the poster to inf5063@ifi.uio.no no later than noon (12:00) on October 25th. We will then print the poster for you.
For questions and course related chatter, we have created a Slack space: https://inf5063.slack.com
There will be a prize for best poster/presentation (awarded by an independent panel and independent of the grade).
Please check the GPU FAQ page for updates and answers to frequently asked questions.
For questions please contact:
inf5063@ifi.uio.no
[1] http://www.w3.org/Graphics/JPEG/itu-t81.pdf
[2] http://en.wikipedia.org/wiki/JPEG#JPEG_codec_example