Google Hash Code 2019

My thoughts on the qualification round of Google Hash Code 2019. If you are interested in the past Online Qualification Round and Finals problem statements, you should check this page.

General information

Based on the problem statement, your team of 4 need to solve 5 different cases of the problem. Each case is represented by an input text file. An output file should be generated for each input file and submitted to the Judge System. The Judge System will give a score for each output file. You can submit as many as you can, but only the highest score will be counted for each case. The summation of all the scores is the final score of the team. The top teams will be invited to the final round which will be held at Google Dublin on April 27, 2019.

The problem is always NP-hard, which means that you can never find the best solution in polynomial time. Fortunately, you can make the result better and better. The greedy algorithm and dynamic programming are very helpful.

Problem statement

A slideshow contains a list of slides
A slide contains either one horizontal photo or two vertical photos
A photo contains a list of tags (string type)
The tag set of a slide is the set of all tags of the photos in the slide
- If the slide containing 1 horizontal photos: $S = tags(h)$
- If the slide containing 2 vertical photos: $S = tags(v_1) \cup tags(v_2)$
The slideshow is scored based on how interesting the transitions between each pair of subsequent slides are
For two subsequent slides $S_i$ $S_{i}$ and $S_{i+1}$ $S_{i + 1}$ , the score is the minimum (the smallest number of the three) of:
1. the number of common tags between $S_i$ and $S_{i+1}$
2. the number of tags in $S_i$ but not in $S_{i+1}$
3. the number of tags in $S_{i+1}$ but not in $S_i$
Score formula: $min(\vert S_i \cap S_{i+1} \vert, \vert S_i \setminus S_{i+1} \vert, \vert S_{i+1} \setminus S_i \vert)$

For simplicity, horizontal slide refers to a slide containing 1 horizontal slide and vertical slide refers to a slide containing 2 vertical slides

Solution

First of all, let’s look at how the tags are distributed in the input files. This may help us understand the problem better.

Orientation	Metric	A	B	C	D	E
H	# photos	2	80000	500	30000	NA
H	# distict tags	4	840000	1558	220	NA
H	# tags per photo	2.5	18.0	9.43	10.04	NA
H	# photos per tag	1.25	1.71	3.02	1369.09	NA
V	# photos	2	NA	500	60000	80000
V	# distict tags	3	NA	1548	220	500
V	# tags per photo	2	NA	9.52	10.01	19.09
V	# photos per tag	1.3	NA	3.07	2732.06	3055.96

Some observations:

B, D, E will come up with a large amount of score because there are many photos in these cases
B contains horizontal photos only, which would be a good start point to try ideas
B has many distinct tags which are far more than D, E
In B and E, photos have twice more tags than D
In term of photos per tag, D and E are thousands of times larger than B
E contains vertical photos only which would be the last challenge

Intuition

Actually, we can consider the problem as building a sequence of slides and maximizing its overall transition score. Given the tag set of the previous slide, we can find a slide which maximizes the current transition. If every subsequent transition is optimized, we could eventually optimize the final score.

Greedy Algorithm

prevSlide <- create a slide of 1 horizontal unused photo or 2 vertical unused photos
result += prevSlide

while some photos are unused:
    (hSlide, hScore) <- Best horizontal slide w.r.t prevSlide
    (vSlide, vScore) <- Best vertical slide w.r.t prevSlide
    if both hSlide and vSlide are found, then
        if hScove >= vScore, then
            prevSlide <- hScove
        else
            prevSlide <- vScove
    else if hSlide not found, then
        prevSlide <- vSlide
    else if vSlide not found, then
        prevSlide <- hSlide
    else
        preSlide <- create a slide of 1 horizontal unused photo or 2 vertical unused photos
    res += prevSlide
done

return result

How to find the best horizontal slide w.r.t the previous slide?

Obviously, we just iterate over the horizontal photos $h_i$ and take the photo with the highest score after the previous slide $p$ .

h_{i}^{*} = \underset{h_i \in H}{\operatorname{argmax}} score(p, h_i)

Actually, we don’t need to search all the photos in the pool of horizontal photos hPool, instead, we only need to look for those having at least one common tag with the previous slide, otherwise, the score must be zero. The search space is restricted in those candidates

Essentially, we can keep track of the photos sharing the same tag. We store this information in a HashMap[String, HashSet[Photo]] call hDict, which is a mapping from tag to a list of photos containing the tag.

How to find the best vertical slide w.r.t the previous slide?

This is the most tricky part of the problem.

One solution: we can iterate over all the combination of 2 vertical photos and choose the best vertical slide as we did for horizontal slide. However, the time complexity for each step is very high, $O(N_v^2)$ . If you compute the candidates size for each previous slides, we will find that it is, in average, about 7200 vertical photos in case D and 60000 in case E. The brute force solution may work for D, but never works for E.

A more reasonable solution is heuristic. Given a previous slide, we can first take the best vertical photo, denoted as $v_i$ , and then find the second vertical photo $v_j$ maximizing the score between the previous slide $p$ and the vertical slide of $v_i$ and $v_j$ .

v_{i}^{*} = \underset{v_i \in V}{\operatorname{argmax}} score(p, v_i)

v_{j}^{*} = \underset{v_j \in (V - v_i^*)}{\operatorname{argmax}} score(p, v_{i}^{*} + v_j)

Can we do better?

We notice that the sequence can be built in parallel. More precisely, we can hash partition the input photos into $M$ blocks. Each block of photos can be used to build a sub-sequence independently, and then all the sub-sequences could be joined into a sequence. This is a classical map-reduce pattern which cam largely improve the performance.

Code

You may find the code here (comments included)
https://github.com/invkrh/google-hash-code/blob/master/src/main/scala/q2019/SlideShowSolver.scala

Result

With the heuristic algorithm and parallelization trick, we can obtain the following result. The final score is 1112255 which around top 20 in the qualification round, see here. Moreover, the time for D and E are no more than 5 mins which are also acceptable during the competition.

Case: a_example
Time: 119 ms
Score: 2

Case: b_lovely_landscapes
Time: 17569 ms
Score: 201792

Case: c_memorable_moments
Time: 165 ms
Score: 1736

Case: d_pet_pictures
Time: 228491 ms
Score: 428041

Case: e_shiny_selfies
Time: 294582 ms
Score: 480684

Final score = 1112255

Conclusion

The competition has been well organized as usual. However, I am a little disappointed at the problem itself. The provided data set is very easy to be cracked brute forced. One can just randomize the slide sequence several times and return the best one. Some teams did that and had a good result. Fortunately, these randomized brute force solutions are not good enough for the finalist. Anyway, this reminds me again the maxim Done is better than perfect.

Auxiliary

The CPU information of the computer on which the programme has been run.

$ lscpu   
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
Stepping:              3
CPU MHz:               800.015
CPU max MHz:           4000.0000
CPU min MHz:           800.0000
...