Thank you to all those who participated in the 2018 Challenge! Between the two tracks, a total of 123 participants formed 24 registered teams and made a combined 151 valid submissions. Each team was allowed one submission per day and 10 in total over the entire competition. Teams were ranked by the score of their best submission, where the score is the average of Top-1 accuracy and Top-5 accuracy. The 2018 winners are listed below:
Congratulations to all the teams! See below for the official leaderboard and submission reports.
Reports are informal essays optionally submitted by participants for academic exchange only;
they are not considered proceedings papers or publications.
Rank | Team Name | Entry Description | Top-1 Acc. | Top-5 Acc. | Score |
---|---|---|---|---|---|
1 | DEEP-HRI | Ensemble-C3D [report] | 0.3864 | 0.6719 | 0.5291 |
2 | Megvii | Spatial Temporal: 2D + 3D + flow + audio [report] | 0.3750 | 0.6503 | 0.5126 |
3 | Qiniu | [report] | 0.3641 | 0.6371 | 0.5006 |
4 | Alibaba-Venus Video Analysis | structure: i3d, nonlocal, trn, vlad+; modality: rgb, flow, acoustic [report] | 0.3551 | 0.6366 | 0.4959 |
5 | Xtract AI | Xtract Boosted Fusion [report] | 0.3199 | 0.5983 | 0.4591 |
6 | SSS | v2 3 models [report] | 0.3195 | 0.5756 | 0.4476 |
7 | CMU-AML | [report] | 0.3103 | 0.5842 | 0.4473 |
8 | UNSW-Data-Science | [report] | 0.3038 | 0.5490 | 0.4398 |
9 | fengwuxuan | trn: inceptionv3 | 0.2861 | 0.5490 | 0.4176 |
10 | SYSU_isee | Dynamic fusion [report] | 0.2731 | 0.5386 | 0.4058 |
11 | Moments in Time Team | Pretrained InceptionV3 Temporal Relation Network (TRNmultiscale with 8 segments) | 0.2731 | 0.5386 | 0.4047 |
12 | STAIR Lab | Our method combines multiple predictions from different models. Models are 2DCNN, 3DCNN, Audio, and Caption, respectively. Combination function is either average, MLP or SVM. [report] | 0.2721 | 0.5357 | 0.4039 |
13 | AR Team | I3D | 0.2959 | 0.5074 | 0.4016 |
14 | mmmm | RGB Stream | 0.2704 | 0.5282 | 0.3993 |
15 | FR | - | 0.2525 | 0.5098 | 0.3811 |
16 | SIAT_MMLAB | [report] | 0.2496 | 0.5093 | 0.3794 |
17 | w452261940 | 3D Conv [report] | 0.2590 | 0.4983 | 0.3787 |
18 | SCZH | CNN | 0.2454 | 0.4953 | 0.3704 |
19 | pingchuan | pingchuan network architecture | 0.2340 | 0.4831 | 0.3585 |
20 | Lee99 | - | 0.2315 | 0.4718 | 0.3516 |
21 | LQQ | - | 0.2281 | 0.4656 | 0.3468 |
22 | cdy MIT team | [report] | 0.2247 | 0.4574 | 0.3411 |
23 | j4f | RGB | 0.2507 | 0.3661 | 0.3084 |
24 | TYY | Ensemble method | 0.1897 | 0.4063 | 0.2980 |
25 | IBM ARL | Action recognition using deep 3D conv nets. It is based on DenseNet, pre-trained with ImageNet, but is extended to 3D (spatial + temporal dimensions). | 0.1749 | 0.3953 | 0.2851 |
26 | AIST | 3D ResNeXt pretrained on Kinetics-400 [report] | 0.1800 | 0.3843 | 0.2821 |
27 | Indy_500 | C3D_svm - Spatio-temporal features are extracted from image. Linear svm classifier trained. Image features are extracted from videos. Classifier is then trained. | 0.1130 | 0.2676 | 0.1903 |
28 | mms_2000 | We use a combination of audio and video features to train the classifier. | 0.1073 | 0.2376 | 0.1724 |
29 | MB | Resnet Feature extraction with temporal model learning | 0.0033 | 0.0152 | 0.0092 |
Rank | Team Name | Entry Description | Top-1 Acc. | Top-5 Acc. | Score |
---|---|---|---|---|---|
1 | SYSU_isee | [report] | 0.3316 | 0.6228 | 0.4772 |
2 | beihang university | CNN | 0.3132 | 0.5966 | 0.4549 |
3 | MiRA | Earlyfusion [report] | 0.3159 | 0.5861 | 0.4510 |
4 | cdy MIT team | TRN: InceptionV3——model fuse [report] | 0.3059 | 0.5813 | 0.4436 |
5 | The Dragon Warrior | TRN based method: using trn and p3d to classify MiT dataset | 0.3046 | 0.5792 | 0.4419 |
6 | Cardinal Vision | Multi-Stream Pipeline: ensemble model with multiple prediction streams, including ResNet, TRN, and YOLO. | 0.2892 | 0.5196 | 0.4044 |
7 | j4f | | 0.2765 | 0.5282 | 0.4024 |
8 | Moments in Time Team | | 0.2422 | 0.4803 | 0.3613 |
9 | HERO_AN | Multiple Segments Relation Network (MSRN) [report] | 0.2096 | 0.4504 | 0.3300 |
10 | Big FIsh | I3D pretrained on ImageNet and Kinetics, fine-tuned on Moments in Time. | 0.1886 | 0.3956 | 0.2921 |
11 | Activity Recognition in Large Scale Short Videos | Visual text features based on ResNeXt architecture [report] | 0.1838 | 0.3816 | 0.2827 |
12 | USTC | Refined I3D | 0.0036 | 0.0247 | 0.0142 |
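
The scoring and ranking rules described at the top of this page can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function names (`top_k_accuracy`, `challenge_score`, `rank_teams`) and the input layout (one list of per-class scores per video) are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def challenge_score(top1: float, top5: float) -> float:
    """Average of Top-1 and Top-5 accuracy, as stated in the challenge rules."""
    return (top1 + top5) / 2


def rank_teams(best_of: Dict[str, List[float]]) -> List[str]:
    """Order teams by the score of their best submission, highest first.

    `best_of` maps a (hypothetical) team name to the scores of all its submissions.
    """
    return sorted(best_of, key=lambda team: max(best_of[team]), reverse=True)


if __name__ == "__main__":
    # Accuracies taken from the first row of the first leaderboard above.
    print(challenge_score(0.3864, 0.6719))
```

Because only a team's best submission counts, `rank_teams` takes the maximum over each team's scores before sorting. The leaderboard values are shown to four decimal places, so recomputing a score from the displayed accuracies may differ from the listed score in the last digit.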