My thoughts on the qualification round of Google Hash Code 2019. If you are interested in the past Online Qualification Round and Finals problem statements, you should check this page.
General information
Based on the problem statement, your team of 4 need to solve 5 different cases of the problem. Each case is represented by an input text file. An output file should be generated for each input file and submitted to the Judge System. The Judge System will give a score for each output file. You can submit as many as you can, but only the highest score will be counted for each case. The summation of all the scores is the final score of the team. The top teams will be invited to the final round which will be held at Google Dublin on April 27, 2019.
The problem is always NP-hard, which means that you can never find the best solution in polynomial time. Fortunately, you can make the result better and better. The greedy algorithm and dynamic programming are very helpful.
Problem statement
- A slideshow contains a list of slides
- A slide contains either one horizontal photo or two vertical photos
- A photo contains a list of tags (string type)
- The tag set of a slide is the set of all tags of the photos in the slide
- If the slide containing 1 horizontal photos:
- If the slide containing 2 vertical photos:
- The slideshow is scored based on how interesting the transitions between each pair of subsequent slides are
- For two subsequent slides and , the score is the minimum (the smallest number of the three) of:
- the number of common tags between and
- the number of tags in but not in
- the number of tags in but not in
- Score formula:
For simplicity, horizontal slide refers to a slide containing 1 horizontal slide and vertical slide refers to a slide containing 2 vertical slides
Solution
First of all, let’s look at how the tags are distributed in the input files. This may help us understand the problem better.
Orientation | Metric | A | B | C | D | E |
---|---|---|---|---|---|---|
H | # photos | 2 | 80000 | 500 | 30000 | NA |
H | # distict tags | 4 | 840000 | 1558 | 220 | NA |
H | # tags per photo | 2.5 | 18.0 | 9.43 | 10.04 | NA |
H | # photos per tag | 1.25 | 1.71 | 3.02 | 1369.09 | NA |
V | # photos | 2 | NA | 500 | 60000 | 80000 |
V | # distict tags | 3 | NA | 1548 | 220 | 500 |
V | # tags per photo | 2 | NA | 9.52 | 10.01 | 19.09 |
V | # photos per tag | 1.3 | NA | 3.07 | 2732.06 | 3055.96 |
Some observations:
- B, D, E will come up with a large amount of score because there are many photos in these cases
- B contains horizontal photos only, which would be a good start point to try ideas
- B has many distinct tags which are far more than D, E
- In B and E, photos have twice more tags than D
- In term of photos per tag, D and E are thousands of times larger than B
- E contains vertical photos only which would be the last challenge
Intuition
Actually, we can consider the problem as building a sequence of slides and maximizing its overall transition score. Given the tag set of the previous slide, we can find a slide which maximizes the current transition. If every subsequent transition is optimized, we could eventually optimize the final score.
Greedy Algorithm
prevSlide <- create a slide of 1 horizontal unused photo or 2 vertical unused photos
result += prevSlide
while some photos are unused:
(hSlide, hScore) <- Best horizontal slide w.r.t prevSlide
(vSlide, vScore) <- Best vertical slide w.r.t prevSlide
if both hSlide and vSlide are found, then
if hScove >= vScore, then
prevSlide <- hScove
else
prevSlide <- vScove
else if hSlide not found, then
prevSlide <- vSlide
else if vSlide not found, then
prevSlide <- hSlide
else
preSlide <- create a slide of 1 horizontal unused photo or 2 vertical unused photos
res += prevSlide
done
return result
How to find the best horizontal slide w.r.t the previous slide?
Obviously, we just iterate over the horizontal photos and take the photo with the highest score after the previous slide .
Actually, we don’t need to search all the photos in the pool of horizontal photos hPool
, instead, we only need to look for those having at least one common tag with the previous slide, otherwise, the score must be zero. The search space is restricted in those candidates
Essentially, we can keep track of the photos sharing the same tag. We store this information in a HashMap[String, HashSet[Photo]]
call hDict
, which is a mapping from tag to a list of photos containing the tag.
How to find the best vertical slide w.r.t the previous slide?
This is the most tricky part of the problem.
One solution: we can iterate over all the combination of 2 vertical photos and choose the best vertical slide as we did for horizontal slide. However, the time complexity for each step is very high, . If you compute the candidates
size for each previous slides, we will find that it is, in average, about 7200 vertical photos in case D and 60000 in case E. The brute force solution may work for D, but never works for E.
A more reasonable solution is heuristic. Given a previous slide, we can first take the best vertical photo, denoted as , and then find the second vertical photo maximizing the score between the previous slide and the vertical slide of and .
Can we do better?
We notice that the sequence can be built in parallel. More precisely, we can hash partition the input photos into blocks. Each block of photos can be used to build a sub-sequence independently, and then all the sub-sequences could be joined into a sequence. This is a classical map-reduce pattern which cam largely improve the performance.
Code
You may find the code here (comments included)
https://github.com/invkrh/google-hash-code/blob/master/src/main/scala/q2019/SlideShowSolver.scala
Result
With the heuristic algorithm and parallelization trick, we can obtain the following result. The final score is 1112255 which around top 20 in the qualification round, see here. Moreover, the time for D and E are no more than 5 mins which are also acceptable during the competition.
Case: a_example
Time: 119 ms
Score: 2
Case: b_lovely_landscapes
Time: 17569 ms
Score: 201792
Case: c_memorable_moments
Time: 165 ms
Score: 1736
Case: d_pet_pictures
Time: 228491 ms
Score: 428041
Case: e_shiny_selfies
Time: 294582 ms
Score: 480684
Final score = 1112255
Conclusion
The competition has been well organized as usual. However, I am a little disappointed at the problem itself. The provided data set is very easy to be cracked brute forced. One can just randomize the slide sequence several times and return the best one. Some teams did that and had a good result. Fortunately, these randomized brute force solutions are not good enough for the finalist. Anyway, this reminds me again the maxim Done is better than perfect.
Auxiliary
The CPU information of the computer on which the programme has been run.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
Stepping: 3
CPU MHz: 800.015
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
...