Home Exam 1: Video Encoding on ARMv8 using ARM NEON Instructions
Assignment
In this assignment, we will take advantage of the parallelization options available in a single ARMv8 NEON-enabled core to accelerate Codec63 (c63) video encoding.
You are supposed to:
- Profile and analyze the encoder, and write a short Design Review (max 1 x A4 page) that the group will present on Thursday, February 15th.
- Optimize the c63 encoder using ARM NEON vector instructions.
- Create a poster (PowerPoint or PDF slide) that the group will present on Thursday, February 29th.
- Write a short report describing which optimizations you implemented and discussing your results. You should not describe optimizations you considered or planned but did not test.
Additional details
The exam will be graded on how well you can use ARM NEON instructions to solve the task.
You are not supposed to make the encoder multi-threaded. Your implementation should be single-threaded and optimized to use the parallelism available through NEON vector instructions.
The encoder must accept the test sequences in YUV format and generate the format understood by c63's unmodified decoder.
Start by profiling the encoder to find its bottlenecks. Remember that more profiling may be needed to find new bottlenecks after optimizing one piece of the code. One operation (e.g., motion vector search) may remain the dominant cost even after it has been optimized as much as possible; if you cannot optimize one operation further, move on to another.
Based on your profiling, you should optimize different parts of the code: (1) structurally and (2) with NEON instructions. There is no definite answer to which sections of the code you have to optimize, and no definite answer to which instructions you must use. Look for SIMD-friendly cases where the same operation needs to execute on many similar data elements. You are NOT supposed to change or replace any algorithms; only reimplement the existing algorithms using vector instructions.
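A classic example of such a SIMD-friendly case is the sum of absolute differences (SAD) in the inner loop of motion search: 64 independent byte subtractions per 8x8 block. The sketch below shows one possible NEON formulation, guarded so the file also builds on non-ARM hosts; function and parameter names are illustrative, not taken from the precode:

```c
#include <stdint.h>
#include <stdlib.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* 8x8 SAD: sum of |block1[i] - block2[i]| over an 8x8 block,
 * where rows are `stride` bytes apart in both buffers. */
static int sad_8x8(const uint8_t *block1, const uint8_t *block2, int stride)
{
#ifdef __ARM_NEON
    uint16x8_t acc = vdupq_n_u16(0);
    for (int v = 0; v < 8; ++v) {
        uint8x8_t a = vld1_u8(block1 + v * stride);
        uint8x8_t b = vld1_u8(block2 + v * stride);
        acc = vabal_u8(acc, a, b);   /* acc += |a - b|, widened to u16 */
    }
    return vaddvq_u16(acc);          /* horizontal add (ARMv8/AArch64) */
#else
    /* Scalar fallback with the same result. */
    int result = 0;
    for (int v = 0; v < 8; ++v)
        for (int u = 0; u < 8; ++u)
            result += abs(block1[v * stride + u] - block2[v * stride + u]);
    return result;
#endif
}
```

The widening accumulate (`vabal_u8`) cannot overflow here: at most 8 x 8 x 255 = 16320 fits comfortably in 16 bits.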
Write a report detailing your profiling results, the instructions you used, and your changes to the precode. The report should also detail and explain both the positive and negative performance results (in research, it is also essential to learn what not to do). If you found several alternatives for solving a problem, or tried several dead ends before solving a challenge successfully, discuss them in your report.
Codec63
Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It provides a variety of parallelization opportunities that exist in modern codecs without the complexity of those full-fledged codecs. It is not compliant with any standards by itself, so the precode contains both an example of an encoder and a decoder (which converts an encoded file back to YUV).
C63's inter-frame prediction works by encoding every macroblock independently, whether or not it uses a motion vector. If a macroblock uses a motion vector, the vector refers to the previous frame and is stored right before the encoded residual; the residual itself is stored in the same manner as for a non-predicted macroblock.
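Conceptually, the per-macroblock bookkeeping this implies is a flag plus a vector into the previous frame, roughly as sketched below. The field names here are illustrative; check `c63.h` in the precode for the actual definition:

```c
#include <stdint.h>

/* Hypothetical per-macroblock state mirroring the description above:
 * a flag for whether a motion vector into the previous frame is used,
 * and the vector itself (stored right before the encoded residual). */
struct macroblock {
    int    use_mv; /* nonzero: mv_x/mv_y are valid for this macroblock */
    int8_t mv_x;   /* horizontal offset into the previous (reference) frame */
    int8_t mv_y;   /* vertical offset */
};
```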
If no motion vector is used, the macroblocks are encoded according to the JPEG standard [1] and stored in the output file. An illustrative overview of the steps involved in JPEG encoding can be found on Wikipedia [2].
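Of those JPEG steps, quantization is a typical SIMD-friendly candidate: the same divide-and-round is applied to all 64 DCT coefficients of a block. A scalar sketch of the pattern (the table values used in the test are illustrative, not c63's actual quantization tables):

```c
#include <stdint.h>

/* JPEG-style quantization of one 8x8 block: each of the 64 DCT
 * coefficients is divided by its quantization-table entry and rounded
 * to the nearest integer -- the same independent operation on many
 * elements, i.e. an ideal target for vectorization. */
static void quantize_block(const float dct[64], const uint8_t quant_tbl[64],
                           int16_t out[64])
{
    for (int i = 0; i < 64; ++i) {
        float q = dct[i] / (float)quant_tbl[i];
        /* Round to nearest, away from zero, without needing <math.h>. */
        out[i] = (int16_t)(q >= 0.0f ? q + 0.5f : q - 0.5f);
    }
}
```

Because every iteration is independent, the loop maps directly onto NEON lane-wise arithmetic once the data is laid out contiguously.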
The c63 encoder is basic and exhibits behavior that you would not accept from a standard encoder. This concerns, in particular, the Huffman tables and the use of motion vectors in non-I-frames. You should not modify these Huffman tables. You may decide to use conditional motion vectors, but you must search for motion vectors, and you must write code that can use the whole motion vector search range (hard-coded to 16 in the precode).
The video scenario is live streaming. You should not use parallelization techniques that severely degrade the video quality.
It is your task to optimize the c63 encoder. As mentioned above, you should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms may provide substantial speedups. Still, they distract from the primary goal of this home exam, which is to identify and implement parallelization using vector instructions.
Two test sequences in YUV format are available in the /mnt/sdcard directory on the Tegra-machines:
- foreman (352x288), CIF
- tractor (1920x1080), 1080p
Use these as input to the provided c63 encoder and to test your implementations.
Precode
The precode consists of the following:
- The reference c63 encoder (c63enc).
- Decoder for c63 (c63dec).
- The command c63pred (which extracts the prediction buffer for debugging purposes).
The precode is written in C, and you should also write your solution in C. You are not required to touch the decoder (c63dec) or c63pred. We recommend keeping two separate repositories: one where you modify the encoder, and one unmodified copy that you use to test your implementation.
You can download the precode from a Git repository:
git clone https://bitbucket.org/mpg_code/in5050-codec63.git
You must log in to the Jetson AGX Xavier devkit assigned to your group for this assignment. You should have received an email from the course administrators about which kits to use. Information about accessing the devkit can be found in the ARM FAQ.
You are free to entirely adapt, modify, or rewrite the provided encoder to take full advantage of the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation, or DCT/iDCT. You are also not allowed to paste/reuse any other pre-written code in your implementation.
Some command usage examples:
To encode the foreman test sequence
$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv
To decode the test sequence:
$ ./c63dec /tmp/test.c63 /tmp/test.yuv
To dump the prediction buffer (used to test motion estimation):
$ ./c63pred /tmp/test.c63 /tmp/test.yuv
Playback of raw YUV videos:
$ mplayer /tmp/test.yuv -demuxer rawvideo -rawvideo w=352:h=288
Report
You must write the results as a technical report of no more than four pages in ACM format. The report should serve as a guide to the code modifications you have made and the resulting performance changes.
Evaluation
In the evaluation, we will consider and give points for (most important first):
- Motion Estimation & DCT/iDCT algorithmic functions in the source code have been NEONized.
  - Document the bottleneck and the effect of your optimization.
- A program that works (on the Jetson AGX Xavier provided).
  - The program runs to completion. (*)
  - Encodes the foreman (CIF) video correctly.
  - Output video has a PSNR and file size similar to those of the unmodified encoder.
- Readable, well-commented code.
- Effect of the parallelization (SIMDification).
  - Most costly algorithms identified and NEONized.
  - NEON instructions are used effectively.
  - Bonus points for non-obvious optimizations.
- The quality of the report that accompanies the code.
  - Clear and structured report of the performance changes caused by your modifications to the precode.
  - References to the relevant parts of the accompanying program code (to aid the reviewer of the submitted assignment).
  - Graphical presentation of the optimization steps and performance results (plots of performance changes).
  - Comparison of / reflection about the alternative approaches tried by your group.
(*) We do not debug code before testing. There will be no points for the correctness of videos if the code does not work.
Machine Setup
The Jetson AGX Xavier devkits are at IFI. Machine names and how to access them can be found in the ARM FAQ. If you have reported your group to the course administration, you should have been assigned to a devkit and provided with a username and a password.
Contact in5050@ifi.uio.no if you have problems logging in.
Formal information
The deadline for handing in your assignment:
- Design: Thursday, February 15th at 12:00
- Code: Thursday, February 29th at 23:59
- Report: Monday, March 4th at 14:00
Deliver your code and report (as PDF) at https://devilry.ifi.uio.no/.
Submit the design review and poster (as PDF) to in5050@ifi.uio.no.
The groups should prepare a poster (PowerPoint or PDF slide) and a quick 5-minute talk where you pitch your implementation to the class on February 29th. Name the poster with your group name and email it to in5050@ifi.uio.no. There will be a prize for the best poster/presentation (awarded by an independent panel, and independent of the grade).
For questions and course-related chatter, we have created a Mattermost space: https://mattermost.uio.no/ifi-undervisning/channels/in5050
Please check the ARM FAQ page for updates.
For questions, please contact:
in5050@ifi.uio.no