# A complete explanation of the core foundations of Yolox (Yolo series)

2021-08-27 09:27:54

When Yolov4 and Yolov5 had just come out, Dabai wrote articles about Yolov3, Yolov4 and Yolov5 and made some explanatory videos, and the response was pretty good.

From Yolov1 in 2015, Yolov2 in 2016 and Yolov3 in 2018 to Yolov4 and Yolov5 in 2020, the Yolo series has kept evolving.

Just when everyone was wondering how Yolo could be improved further, Megvii published its research on an improved algorithm, Yolox.

Dabai studied the Yolox paper and the related code and found many improvements in them.

Methods such as Decoupled Head and SimOTA work very well and are worth learning. However, it is hard to learn them directly and visually, and to understand how Yolox differs from the earlier Yolo algorithms.

Therefore, in this article Dabai compares some details of Yolox with the earlier Yolov3, Yolov4 and Yolov5 algorithms, explains them in simple terms, and discusses and studies them together with you.

1 Yolox basics

1.1 Yolox paper and code

Title of the Yolox paper: "YOLOX: Exceeding YOLO Series in 2021"

1.2 Network structure diagrams of each Yolox version

When learning an algorithm, it is best to start from a visual view of it. If you only read the code, you will probably get confused. Yolox also comes in many network structures, for example those of the weight files below.

So you can convert each model file to onnx format and then open it with the netron tool to study the network structure visually.

1.2.1 The Netron tool

For readers who are not familiar with netron, Dabai has written up the detailed installation process.

See another article by Dabai: "Detailed installation process of the network visualization tool netron".

1.2.2 The onnx file of each Yolox version

Each onnx file can be produced with the tools/export_onnx.py script in the code.

In addition, the official code already provides converted onnx files for all versions in this section, so you can also download them directly.

1.2.3 Network structure diagram of each Yolox version

Some readers may find it inconvenient to view the models with netron. Dabai has therefore also uploaded a picture of each network structure diagram as opened in netron, so you can view them directly.

（1）Yolox-Nano

Yolox-Nano is the smallest structure in the Yolox series, with only 0.91M network parameters.

（2）Yolox-Tiny

（3）Yolox-Darknet53

Yolox-Darknet53 is an improvement built on Yolov3, and it is one of the network structures this article mainly covers.

（4）Yolox-s

Yolox-s is an improvement built on Yolov5-s, and it is also one of the network structures this article mainly covers.

（5）Yolox-m

（6）Yolox-l

（7）Yolox-x

2 Yolox core knowledge points

2.1 Network structure diagrams of Yolov3, Yolov4 and Yolov5

Before studying Yolox, let's first get to know the network structure diagrams of Yolov3, Yolov4 and Yolov5, because the Yolox networks are all extended from them.

① Yolov3 network structure diagram

Yolov3 was proposed in 2018 and is a widely used object detection algorithm in industry.

However, in the Yolox series, the baseline network adopted for the Yolox-Darknet53 model is not the plain Yolov3 version but the improved Yolov3_spp version.

The difference between the two is that Yolov3_spp adds an SPP component after the Yolov3 backbone network, which is worth noting.

② Yolov4 network structure diagram

The figure above shows the Yolov4 algorithm, proposed by DarknetAB in 2020. In this algorithm, many parts of the network were improved:

Input: Mosaic data augmentation;

Backbone: CSPDarknet53, the Mish activation function, Dropblock and other methods;

Neck: SPP (following the DarknetAB settings) and the FPN+PAN structure;

Output: CIOU_Loss and DIOU_Nms.

So we can see that Yolov4 integrates and innovates a great deal in every part of Yolov3.

③ Yolov5 network structure diagram

Unlike Yolov4, the biggest innovation of the Yolov5 network is that the author turned the network structure into an optional configuration.

For example, depending on the width and depth of the backbone network, it is split into the Yolov5s, Yolov5m, Yolov5l and Yolov5x versions. In the field of object detection, this change set off a wave of splitting networks into variants.

The Yolox algorithm discussed in this article follows the same idea and splits the Yolox model into several optional networks, namely standard and lightweight network structures.

（1） Standard network structures: Yolox-s, Yolox-m, Yolox-l, Yolox-x, Yolox-Darknet53.

（2） Lightweight network structures: Yolox-Nano, Yolox-Tiny.

In a real project, you can choose among them according to the project's needs.

2.2 Yolox basic knowledge points

From the description above, we can summarize the overall improvement approach of Yolox:

（1） Benchmark model: Yolov3_spp

Choose the Yolov3_spp structure and add some common improvements to form the Yolov3 baseline benchmark model;

（2）Yolox-Darknet53

Add a variety of tricks, such as Decoupled Head and SimOTA, to the Yolov3 baseline to obtain the Yolox-Darknet53 version;

（3）Yolox-s、Yolox-m、Yolox-l、Yolox-x series

Apply these effective tricks to the four Yolov5 versions one by one to obtain the four versions Yolox-s, Yolox-m, Yolox-l and Yolox-x;

（4） Lightweight networks

Design the lightweight networks Yolox-Nano and Yolox-Tiny, and test the applicability of some of the tricks;

Overall, the paper does a lot of work. Following the points above, let's go through the network structure of the Yolox algorithm and each of its innovations.

2.2.1 Benchmark model: Yolov3_spp

When designing an algorithm, a benchmark model is usually needed in order to compare how good or bad each improvement trick is.

When choosing the benchmark model for Yolox, the authors reasoned as follows:

From the perspective of anchor-based algorithms, the Yolov4 and Yolov5 series may be somewhat over-optimized, so they finally chose the Yolov3 series.

However, they did not pick the standard Yolov3 algorithm within that series. Instead, they chose the Yolov3_spp version, which adds an SPP component and performs better.

The paper explains it as follows:

Considering YOLOv4 and YOLOv5 may be a little over-optimized for the anchor-based pipeline, we choose YOLOv3 [25] as our start point (we set YOLOv3-SPP as the default YOLOv3).

To help you understand, Dabai added the SPP component to the earlier Yolov3 structure diagram, turning it into the Yolov3_spp network shown in the figure below.

As you can see, an SPP component is added after the Backbone network.

Of course, on top of this, many aspects of the training process were also improved, for example:

（1） Training techniques such as EMA weight updating and a cosine learning-rate schedule were added

（2） The reg branch is trained with an IOU loss, and the cls and obj branches are trained with a BCE loss

（3） RandomHorizontalFlip, ColorJitter and multi-scale data augmentation were added, and RandomResizedCrop was removed.

With these changes, Yolov3_spp reaches an AP of 38.5, which is the Yolov3 baseline in the figure below.

When studying that figure, though, one small doubt arises:

YOLOv3_ultralytics has an AP of 44.3, and the paper cites it as the most accurate Yolov3_spp version at present (the current best practice of YOLOv3).

Looking at that code confirms what the paper says: it is a Yolov3_spp version with many more tricks added, and its AP is 44.3.

The Yolox benchmark model, in contrast, is the original Yolov3_spp version, which reaches an AP of 38.5 after the series of improvements above.

On top of that, five tricks were added (Strong augmentation, Decoupled head, anchor-free, multi positives and SimOTA), giving a final AP of 47.3.

But here is the question:

If the Yolov3_spp version from YOLOv3_ultralytics were used instead, with the four tricks above added (leaving out strong augmentation, because that code already includes it), would the AP improve even more?

2.2.2 Yolox-Darknet53

As described above, after obtaining the Yolov3 baseline, the authors added a series of tricks, and the final improved result is the Yolox-Darknet53 network structure.

The picture above shows the Yolox-Darknet53 network structure diagram.

To make the improvements easier to analyze, we split the Yolox-Darknet53 network structure into four parts:

① Input: Strong augmentation (data augmentation)

② Backbone: the backbone network is unchanged and is still Darknet53.

③ Neck: unchanged; the Neck of the Yolov3 baseline is still the FPN structure.

④ Prediction: Decoupled Head, Anchor Free, label assignment and Loss calculation.

After this series of improvements, Yolox-Darknet53 finally reaches an AP of 47.3.

Below, we take apart the four parts of Yolox-Darknet53 in detail: input, Backbone, Neck and Prediction.

2.2.2.1 Input

（1）Strong augmentation

At the network input, Yolox mainly uses two data augmentation methods: Mosaic and Mixup.

These two augmentations alone raise the Yolov3 baseline by 2.4 percentage points.

① Mosaic data augmentation

Mosaic augmentation is a very effective strategy introduced in the ultralytics ("U version") YOLOv3.

It is also widely used in the Yolov4 and Yolov5 algorithms.

Images are stitched together with random scaling, random cropping and random arrangement, which noticeably improves the detection of small targets.

② MixUp data augmentation

MixUp is an extra augmentation strategy added on top of Mosaic.

It comes from a 2017 paper at the top conference ICLR, "mixup: Beyond Empirical Risk Minimization". At the time it was mainly used for image classification, where it steadily improved classification accuracy by 1 percentage point with almost no extra computational cost.

In Yolox it is also applied to object detection; the code is in the file yolox/datasets/mosaicdetection.py.

The method is actually very simple. Take a face detection task as an example.

First read a picture, pad it on the left and right sides and scale it to 640*640; this is Image_1, and its face detection box is red.

Then pick another picture at random, pad it on the top and bottom and likewise scale it to 640*640; this is Image_2, and its face detection box is blue.

Then set a fusion coefficient, for example 0.5 as in the picture above, and blend Image_1 and Image_2 with that weight to obtain the final Image on the right.

As the picture on the right shows, the red and blue face boxes are both overlaid on it. As noted above, Mosaic and Mixup together raise the Yolov3 baseline by 2.4 percentage points.
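The weighted fusion can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name `mixup`, the tiny 2x2 grayscale "images" and the box tuples are hypothetical, and the real implementation in yolox/datasets/mosaicdetection.py differs in detail.

```python
def mixup(image_1, boxes_1, image_2, boxes_2, ratio=0.5):
    """Blend two same-sized images pixel by pixel and keep both sets of boxes."""
    if len(image_1) != len(image_2) or len(image_1[0]) != len(image_2[0]):
        raise ValueError("both images must already be resized to the same size")
    mixed = [
        [ratio * p1 + (1.0 - ratio) * p2 for p1, p2 in zip(row_1, row_2)]
        for row_1, row_2 in zip(image_1, image_2)
    ]
    # The labels are not blended: the boxes of both images are simply kept.
    return mixed, list(boxes_1) + list(boxes_2)


# Hypothetical 2x2 grayscale images with one face box (x1, y1, x2, y2) each.
image_1 = [[100, 200], [50, 0]]
image_2 = [[0, 100], [150, 200]]
mixed_image, mixed_boxes = mixup(image_1, [(0, 0, 1, 1)], image_2, [(1, 0, 2, 2)])
```

With a fusion coefficient of 0.5 every output pixel is the average of the two inputs, and the result carries the boxes of both pictures, which matches the overlaid red and blue boxes described above.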

But there are two things to note:

（1） In the last 15 epochs of training, these two augmentations are turned off.

Before that, both Mosaic and Mixup are switched on. This detail needs attention.

（2） Because of the stronger data augmentation, the authors found that ImageNet pre-training no longer brings any benefit, so all models are trained from scratch.

2.2.2.2 Backbone

The Backbone of Yolox-Darknet53 is the same as that of the original Yolov3 baseline.

Both use the Darknet53 network structure.

2.2.2.3 Neck

The Neck structure of Yolox-Darknet53 is also the same as that of the Yolov3 baseline: both use the FPN structure.

As shown in the figure below, FPN works top-down: it passes high-level feature information down through upsampling and fuses it to obtain the feature maps used for prediction.

Note that Yolov4, Yolov5, and even the later Yolox-s, Yolox-l and other versions all use the FPN+PAN form instead.

2.2.2.4 Prediction layer

The output layer is covered from four aspects: Decoupled Head, Anchor Free, label assignment and Loss calculation.

（1）Decoupled Head

Let's look at the Decoupled Head first. Similar designs are already used in many one-stage networks, such as RetinaNet and FCOS.

In Yolox, the authors added three Decoupled Heads, commonly called "decoupled heads".

Dabai explains the Decoupled Head from two aspects:

① Why use the Decoupled Head?

In the Prediction part at the top right of the figure, we can see three Decoupled Head branches.

Before looking at how it works, let's first understand why the change was made. Why was the original Yolo head replaced with the Decoupled Head?

Let's first look at a table from the paper:

From the benchmark network in 2.2.1, we know that the Yolov3 baseline has an AP of 38.5.

The authors wanted to improve it further, for example by making the output End-to-end (that is, without NMS). Unexpectedly, the AP after this change was only 34.3.

Yet in "End-to-End Object Detection with Fully Convolution Network", published by Megvii in December 2020, FCOS modified to work without NMS achieved performance on COCO comparable to FCOS with NMS.

So it was strange: why did the same change make Yolo drop so much?

By chance, the authors replaced the Yolo Head in the End-to-End version with a Decoupled Head.

To their surprise, the AP of End-to-end Yolo rose from 34.3 to 38.8.

If it works for End-to-end, does it also help the Yolov3 baseline?

So the authors also replaced the Yolo Head in the Yolov3 baseline with a Decoupled Head.

The AP rose from 38.5 to 39.6.

The authors also found in their experiments that accuracy is not the only gain: after switching to the Decoupled Head, the network also converges faster.

From this we can draw a key conclusion:

* The detection head currently used in the Yolo series may lack expressive power; it is less expressive than the Decoupled Head.

The curves show that the Decoupled Head converges faster and reaches higher accuracy.

One thing to note, however: decoupling the detection head increases the computational complexity.

The authors therefore traded off speed against performance. The final design first uses one 1x1 convolution to reduce the channel dimension, and then uses two 3x3 convolutions in each of the two following branches, so that only a small number of network parameters is added.

Decoupling also matters at a deeper level:

The Yolox network architecture can be integrated with many other algorithm tasks:

（1）YOLOX + Yolact/CondInst/SOLO: end-to-end instance segmentation.

（2）YOLOX + 34-layer output: on-device detection of 17 human-body keypoints.

② What are the details of the Decoupled Head?

Now that we understand where the Decoupled Head comes from, let's look at its details.

We extract Decoupled Head① from Yolox-Darknet53. After the preceding Neck layer, the input to Decoupled Head① has a height and width of 20*20.

As the diagram shows, there are three branches before the Concat:

（1）cls_output: predicts the score for the category of the target box. Because the COCO dataset has 80 categories and the prediction consists of N binary classifications, the output becomes 20*20*80 after the Sigmoid activation function.

（2）obj_output: judges whether the target box is foreground or background, so after Sigmoid it becomes 20*20*1.

（3）reg_output: predicts the coordinate information (x, y, w, h) of the target box, so its size is 20*20*4.

Finally, the three outputs are merged by Concat to give 20*20*85 feature information.

Of course, this is only Decoupled Head①; Decoupled Head② and ③ are processed in the same way.

Decoupled Head② outputs its feature information, which is concatenated into 40*40*85 feature information.

Decoupled Head③ outputs its feature information, which is concatenated into 80*80*85 feature information.

The three outputs ①②③ are then reshaped and concatenated as a whole to give 8400*85 prediction information.

After one Transpose, this becomes a two-dimensional array of size 85*8400.

Here 8400 is the number of prediction boxes, and 85 is the information for each prediction box (reg, obj, cls).
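The Reshape, Concat and Transpose steps can be traced with plain nested lists. This is a shape-only sketch: the helper `flatten_head` is hypothetical and fills the outputs with zeros, since only the dimensions matter here.

```python
def flatten_head(height, width, channels=85):
    """Reshape an H x W x C head output into H*W rows of C values (zeros as placeholders)."""
    return [[0.0] * channels for _ in range(height * width)]


# The three Decoupled Heads for a 640*640 input: 20*20, 40*40 and 80*80 grids.
heads = [flatten_head(20, 20), flatten_head(40, 40), flatten_head(80, 80)]

# Overall Concat: stack the rows of all three heads, giving 8400 x 85.
merged = [row for head in heads for row in head]

# Transpose: swap rows and columns, giving 85 x 8400.
transposed = [list(column) for column in zip(*merged)]
```

The row count 8400 is simply 400 + 1600 + 6400, one row per grid cell across the three heads, and each row holds 4 reg values, 1 obj value and 80 cls values.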

Now that we have the prediction box information, let's look at how these prediction boxes are associated with the labeled boxes, i.e. the groundtruth, in order to compute the Loss function and update the network parameters.

（2）Anchor-free

Here we introduce anchors. In the industry today there are two main approaches: Anchor Based and Anchor Free.

Yolov3, Yolov4 and Yolov5 generally use the Anchor Based approach to extract target boxes, which are then compared with the labeled groundtruth to measure the gap between the two.

① The Anchor Based approach

For example, the input image passes through the Backbone and Neck layers, and the final feature information is passed to the output Feature Maps. At this point, some Anchor rules are set to associate the prediction boxes with the labeled boxes.

During training, the difference between the two, i.e. the loss function, is computed and the network parameters are updated. For example, in the figure below, each cell of the last three Feature Maps has three anchor boxes of different sizes.

For a more vivid illustration, we use the example from Dabai's Yolov3 video, with an input image size of 416*416.

When the input is 416*416, the last three feature maps of the network have sizes 13*13, 26*26 and 52*52.

The yellow box is the dog's Groundtruth, i.e. the labeled box.

The blue boxes are the anchor boxes of the cell containing the dog's center point; each cell has 3 blue boxes. With the COCO dataset, there are 80 categories.

Each anchor box carries x, y, w, h, obj (foreground/background) and class (80 categories), 85 parameters in total. So there are 3*(13*13+26*26+52*52)*85=904995 predicted values.

If the input changes from 416*416 to 640*640, the last three feature maps have sizes 20*20, 40*40 and 80*80, which gives 3*(20*20+40*40+80*80)*85=2142000 predicted values.

② The Anchor Free approach

Yolox-Darknet53, in contrast, uses the Anchor Free approach.

Let's look at Anchor Free from two aspects.

a. Number of output parameters

Let's calculate how many parameters are needed to obtain all the output information containing the target boxes.

Note that the final yellow 85*8400 is not a Feature Map as in Yolov3 but a feature vector.

As the picture shows, when the input is 640*640, the final output feature vector is 85*8400.

How does the number of predicted values differ from the earlier Anchor Based approach?

The calculation gives 8400*85=714000 predicted values, 2/3 fewer parameters than the Anchor Based approach.
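The three counts quoted in this section can be reproduced with a short calculation. The helper name `num_predictions` is hypothetical; the feature-map sizes and the 85 values per box are taken from the text.

```python
def num_predictions(feature_sizes, anchors_per_cell, values_per_box=85):
    """Total number of predicted values over all feature maps."""
    cells = sum(size * size for size in feature_sizes)
    return anchors_per_cell * cells * values_per_box


anchor_based_416 = num_predictions((13, 26, 52), anchors_per_cell=3)
anchor_based_640 = num_predictions((20, 40, 80), anchors_per_cell=3)
anchor_free_640 = num_predictions((20, 40, 80), anchors_per_cell=1)

# Fraction of predicted values saved by going anchor-free at 640*640.
reduction = 1 - anchor_free_640 / anchor_based_640
```

The 2/3 saving follows directly from predicting one box per cell instead of three.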

b. Anchor box information

In the Anchor Based approach above, every Feature Map cell has 3 anchor boxes of different sizes.

Does Yolox-Darknet53 have none, then?

Not quite. It cleverly brings in the downsampling size information from the Backbone instead.

For example, in the picture above, the top branch is downsampled 5 times, and 2 to the power of 5 is 32. The output of Decoupled Head① is 20*20*85.

So, as shown above, 400 of the final 8400 prediction boxes correspond to anchor boxes of size 32*32.

By the same reasoning, the 1600 prediction boxes of the middle branch correspond to anchor boxes of size 16*16.

The 6400 prediction boxes of the bottom branch correspond to anchor boxes of size 8*8.
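One way to picture these implicit anchor boxes is to enumerate one point per grid cell together with its stride. The sketch below is an illustration only: the function name `make_anchor_points` is hypothetical, and placing each point at the cell center, (col + 0.5) * stride, is an assumption rather than something the text specifies.

```python
def make_anchor_points(input_size=640, strides=(32, 16, 8)):
    """Return one (x_center, y_center, stride) tuple per prediction, in head order."""
    points = []
    for stride in strides:
        cells = input_size // stride  # 20, 40 and 80 cells per side
        for row in range(cells):
            for col in range(cells):
                # Assumed convention: the anchor sits at the center of its cell.
                points.append(((col + 0.5) * stride, (row + 0.5) * stride, stride))
    return points


anchor_points = make_anchor_points()
count_per_stride = {
    s: sum(1 for point in anchor_points if point[2] == s) for s in (32, 16, 8)
}
```

Each stride is both the downsampling factor of its branch and the side length of the anchor box associated with every cell of that branch.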

Once we have the information for the 8400 prediction boxes, each picture also has the information of its labeled target boxes.

The anchor boxes now act as a bridge.

What has to be done is to associate the 8400 anchor boxes with all the target boxes in the picture and select the positive-sample anchor boxes.

Correspondingly, the prediction boxes at the positions of the positive-sample anchor boxes can then be picked out as positive-sample prediction boxes.

The association method used here is label assignment.

（3） Label assignment

Each of the 8400 anchor boxes corresponds to the information of one prediction box in the 85*8400 feature vector. However, only a few of these prediction boxes are positive samples; most are negative samples.

So which ones are the positive samples?

This requires using the relationship between the anchor boxes and the actual target boxes to select a suitable subset of positive-sample anchor boxes.

For example, if anchor boxes 3, 10 and 15 are positive-sample anchor boxes, then among the 8400 prediction boxes output by the network, prediction boxes 3, 10 and 15 are the corresponding positive-sample prediction boxes.

During training, the network keeps predicting on the basis of the anchor boxes and iterating, which updates the network parameters and makes the predictions more and more accurate.

So how does Yolox select the positive-sample anchor boxes?

There are two key points: preliminary screening and SimOTA.

① Preliminary screening

Preliminary screening is done in two main ways: by center point and by target box. This part of the code is in the get_in_boxes_info function of models/yolo_head.py.

a. Judging by the center point:

Rule: find all anchors whose anchor_box center point falls inside the rectangle of a groundtruth box.

In the get_in_boxes_info code, the [x_center,y_center,w,h] of each groundtruth is used to compute the top-left and bottom-right coordinates of every groundtruth in each picture.

To make this easier to understand, Dabai illustrates it with a face detection task:

With the formula above, the top-left corner (gt_l,gt_t) and bottom-right corner (gt_r,gt_b) can be computed for the face picture on the left. Once the extent of the groundtruth rectangle is fixed, suitable anchor boxes are selected according to it. The center point of an anchor box, (x_center,y_center), is drawn here.

The picture on the right looks for the correspondence between the anchor boxes and the groundtruth: it computes the distances from the anchor box center (x_center,y_center) to the top-left corner (gt_l,gt_t) and the bottom-right corner (gt_r,gt_b) of the face annotation box.

See, for example, the first four lines of the following code picture:

On the fifth line the four values are stacked, and the sixth line checks whether they are all greater than 0. In this way, all anchors that fall inside the groundtruth rectangle are extracted.

This works because b_l, b_r, b_t and b_b are all greater than 0 only when the center of the anchor box falls inside the rectangle.
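The center-point rule can be written out as follows. This is a simplified pure-Python sketch for a single groundtruth box, not the batched tensor code of get_in_boxes_info; the function name `centers_in_box` and the sample coordinates are hypothetical, while b_l, b_r, b_t and b_b follow the names used above.

```python
def centers_in_box(anchor_centers, gt_box):
    """Return True for each anchor whose center lies strictly inside the groundtruth box."""
    x_c, y_c, w, h = gt_box
    gt_l, gt_r = x_c - 0.5 * w, x_c + 0.5 * w
    gt_t, gt_b = y_c - 0.5 * h, y_c + 0.5 * h
    mask = []
    for x_center, y_center in anchor_centers:
        b_l = x_center - gt_l  # distance to the left side
        b_r = gt_r - x_center  # distance to the right side
        b_t = y_center - gt_t  # distance to the top side
        b_b = gt_b - y_center  # distance to the bottom side
        # Inside the rectangle only if all four distances are positive.
        mask.append(min(b_l, b_r, b_t, b_b) > 0)
    return mask


# Hypothetical anchor centers and one groundtruth box [x_center, y_center, w, h].
centers = [(16, 16), (48, 16), (48, 48), (80, 80)]
in_box = centers_in_box(centers, gt_box=(40, 40, 40, 40))
```

Taking the minimum of the four distances and comparing it with 0 is equivalent to checking that all four are greater than 0.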

b. Judging by the target box:

Besides judging by the distances between the anchor box center and the sides of the groundtruth, the authors also set up a way of judging by the target box.

Rule: take the groundtruth center point as the reference, set up a square with a side length of 5, and select all anchor boxes inside the square.

Again in the get_in_boxes_info code, the [x_center,y_center,w,h] of the groundtruth is used to draw a square with a side length of 5.

To help you understand, Dabai again illustrates it with a face detection task:

In the face picture on the left, a square with a side length of 5 is drawn around the center point of the face annotation box using the formula above. Its top-left corner is (gt_l,gt_t) and its bottom-right corner is (gt_r,gt_b). Once the extent of the square is fixed, the anchor boxes are selected according to it.

The picture on the right finds all anchor boxes whose center points (x_center,y_center) lie inside the square.

The first four lines in the code picture likewise compute the distances from the anchor box center to the sides of the square.

These are stacked on the fifth line, and the sixth line checks whether c_l, c_r, c_t and c_b are all greater than 0.

All anchors that fall inside the square with a side length of 5 are then extracted, because for them c_l, c_r, c_t and c_b are all greater than 0.
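The square rule differs from the rectangle rule only in how the four sides are obtained. The sketch below is a simplified single-box version with hypothetical names. The text does not say in which unit the side length of 5 is measured, so the `scale` parameter is an assumption: it allows the side to be multiplied by a feature-map stride, and the example uses a stride of 8.

```python
def centers_in_square(anchor_centers, gt_box, side=5.0, scale=1.0):
    """Return True for each anchor whose center lies inside a square around the groundtruth center."""
    x_c, y_c, _w, _h = gt_box  # width and height of the box are not used
    half = 0.5 * side * scale
    gt_l, gt_r = x_c - half, x_c + half
    gt_t, gt_b = y_c - half, y_c + half
    mask = []
    for x_center, y_center in anchor_centers:
        c_l = x_center - gt_l
        c_r = gt_r - x_center
        c_t = y_center - gt_t
        c_b = gt_b - y_center
        mask.append(min(c_l, c_r, c_t, c_b) > 0)
    return mask


# Hypothetical anchor centers; with scale=8 the square spans 20..60 on both axes.
centers = [(36, 36), (44, 44), (68, 36)]
in_square = centers_in_square(centers, gt_box=(40, 40, 100, 100), scale=8)
```

Unlike the rectangle rule, the size of the groundtruth box plays no role here; only its center matters.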

After the two selections above, preliminary screening is complete: some candidate anchors have been picked out and move on to fine screening.

② Fine screening

Fine screening uses the SimOTA mentioned in the paper:

In terms of gain, introducing SimOTA raises the AP by 2.3 percentage points, which is quite effective.

The SimOTA method is mainly derived from a Megvii paper at CVPR in early 2021: "Ota: Optimal transport assignment for object detection".

Let's break down the steps before and after SimOTA and see how fine screening works.

The whole screening process is divided into four stages:

a. Extracting the information of the preliminarily screened positive samples

b. Loss function calculation

c. cost calculation

d. SimOTA solving

To make this easier to follow, assume there are 3 target boxes, i.e. 3 groundtruth boxes. Also assume the current project detects faces and human bodies, so the number of detection categories is 2.

From the previous section we know there are 8400 anchor boxes. Suppose that after preliminary screening 1000 of them are positive-sample anchor boxes.

a. Extracting the information of the preliminarily screened positive samples

The positions of the 1000 preliminarily screened positive-sample anchor boxes are known, and the positions of all anchor boxes correspond one-to-one with the final 85*8400 feature vector output by the network.

So, according to these positions, the network's predictions can be extracted: the candidate box positions bboxes_preds, the foreground/background objectness scores obj_preds, and the category scores cls_preds.

The above code is located in the get_assignments function of yolo_head.py.

With the assumptions above, bboxes_preds_per_image in the code picture holds the candidate box information, so its dimension is [1000,4].

obj_preds holds the objectness scores, so its dimension is [1000,1].

cls_preds holds the category scores, so its dimension is [1000,2].

b. Loss function calculation

The Loss function is computed between the 1000 screened candidate boxes and the 3 groundtruth boxes. This code is also in the get_assignments function of yolo_head.py.

First comes the position loss: pair_wise_ious_loss

The first line of code computes pair_wise_ious, the iou between each of the 3 target boxes and each of the 1000 candidate boxes, so its dimension is [3,1000].

Applying -torch.log then gives the position loss, pair_wise_ious_loss in the code.

Then comes the combined loss of category and objectness information: pair_wise_cls_loss

The first line of code multiplies the conditional probability of the category by the prior probability of the target to get the category score of the target.

The second line of code applies F.binary_cross_entropy to obtain the combined loss between the 3 target boxes and the 1000 candidate boxes, i.e. pair_wise_cls_loss, whose dimension is [3,1000].

c. cost calculation

With reg_loss and cls_loss available, the two loss functions are weighted and added to compute the cost function. This involves a formula from the paper:

The corresponding code in the get_assignments function of yolo_head.py is:

As can be seen, the weighting coefficient in the formula is 3 in the code.
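Stages b and c can be put together in a small pure-Python sketch. It follows the description above (position loss as the negative log of the iou, category score as class probability times objectness, weight 3), but it is not the YOLOX tensor code: the function names, the (x1, y1, x2, y2) box format, the `eps` constant and the sample values are all assumptions made for illustration.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def bce(probs, targets, eps=1e-8):
    """Binary cross-entropy summed over the classes."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    )


def pairwise_cost(gt_boxes, gt_classes, pred_boxes, cls_probs, obj_probs,
                  weight=3.0, eps=1e-8):
    """cost[i][j] = cls_loss + weight * iou_loss for groundtruth i and candidate j."""
    num_classes = len(cls_probs[0])
    cost = []
    for gt_box, gt_cls in zip(gt_boxes, gt_classes):
        one_hot = [1.0 if c == gt_cls else 0.0 for c in range(num_classes)]
        row = []
        for box, cls_p, obj_p in zip(pred_boxes, cls_probs, obj_probs):
            iou_loss = -math.log(iou(gt_box, box) + eps)
            score = [p * obj_p for p in cls_p]  # class probability x objectness
            row.append(bce(score, one_hot) + weight * iou_loss)
        cost.append(row)
    return cost


# One groundtruth of class 0 and two hypothetical candidates.
cost = pairwise_cost(
    gt_boxes=[(0, 0, 10, 10)], gt_classes=[0],
    pred_boxes=[(0, 0, 10, 10), (5, 5, 15, 15)],
    cls_probs=[[0.9, 0.1], [0.2, 0.8]], obj_probs=[0.9, 0.5],
)
```

The first candidate overlaps the groundtruth perfectly and predicts the right class, so it gets a much lower cost than the second one, which is what makes it a better candidate in the matching step.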

d. SimOTA

With all of the information above, the label assignment problem is converted into a standard OTA problem. However, the classic Sinkhorn-Knopp algorithm needs many iterations to find the optimal solution.

The authors also mention that this algorithm leads to 25% extra training time, so they use a simplified method, SimOTA, to find an approximate solution. The corresponding function is self.dynamic_k_matching, called in the get_assignments function:

The process is as follows:

Step 1: set the number of candidate boxes

First, create an all-zero variable matching_matrix with the same size as cost, here [3,1000].

The second line of code above sets the number of candidate boxes to 10. The third line then picks, for each target box, the 10 candidate boxes with the largest iou from the earlier pair_wise_ious.

Since 3 targets are assumed, the dimension of topk_ious here is [3,10].

Step 2: pick the candidate boxes by cost

Next, the candidate boxes are selected dynamically using topk_ious, which is the key step. The code in the dynamic_k_matching function is shown below:

To help you understand, Dabai first illustrates the first line. Here topk_ious holds, for each of the 3 target boxes, the 10 candidate boxes with the largest iou:

After the torch.clamp function, we get the final dynamic_ks values on the right. They show that target boxes 1 and 3 are each assigned 3 candidate boxes, while target box 2 is assigned 4.

So what is the criterion for the assignment?

This is where the cost values computed earlier come in, i.e. the [3,1000] weighted loss information. In the for loop, the candidate boxes with the lowest cost values are selected for each target box.

For example, in the matching_matrix on the right, the positions with the lowest cost values are set to 1 and the rest are 0.

Because dynamic_ks is 3 for target boxes 1 and 3, the first and third rows of matching_matrix each have three 1s. Because dynamic_ks is 4 for target box 2, the second row of matching_matrix has four 1s.
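Steps 1 and 2 can be sketched with plain lists. The variable names follow the text, but the details are assumptions: in particular, dynamic_ks is taken here to be the sum of each target's top-k ious, truncated to an integer and clamped to at least 1, and the sample numbers are made up.

```python
def dynamic_k_matching(cost, pair_wise_ious, n_candidate_k=10):
    """Pick, for every target box, its dynamic_k lowest-cost candidate boxes."""
    num_gt, num_anchors = len(cost), len(cost[0])
    matching_matrix = [[0] * num_anchors for _ in range(num_gt)]
    dynamic_ks = []
    for gt_idx in range(num_gt):
        # Step 1: keep the n_candidate_k largest ious of this target box.
        topk_ious = sorted(pair_wise_ious[gt_idx], reverse=True)[:n_candidate_k]
        # Assumed rule: truncated sum of the top-k ious, clamped to at least 1.
        dynamic_k = max(int(sum(topk_ious)), 1)
        dynamic_ks.append(dynamic_k)
        # Step 2: mark the dynamic_k candidates with the lowest cost.
        lowest_cost = sorted(range(num_anchors), key=lambda j: cost[gt_idx][j])
        for j in lowest_cost[:dynamic_k]:
            matching_matrix[gt_idx][j] = 1
    return matching_matrix, dynamic_ks


# Two target boxes and six candidates (hypothetical values).
pair_wise_ious = [[0.9, 0.8, 0.7, 0.6, 0.1, 0.0],
                  [0.0, 0.1, 0.2, 0.3, 0.2, 0.1]]
cost = [[0.2, 0.3, 0.5, 0.4, 2.0, 3.0],
        [3.0, 2.0, 1.5, 0.9, 1.2, 2.5]]
matching_matrix, dynamic_ks = dynamic_k_matching(cost, pair_wise_ious)
```

A target box whose candidates overlap it well gets more positives, while a poorly covered one still gets at least one. In this example the fourth candidate is selected by both target boxes, which is the shared situation that has to be resolved next.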

Step 3: filter out shared candidate boxes

When analyzing matching_matrix, however, we find that the 5th column contains two 1s. This means the candidate box of the fifth column is associated with both target box 1 and target box 2.

So the cost values at these two positions are compared, and the smaller one is kept as a further screening step.

To make this easier to understand, it is again shown as an illustration:

The first line of code sums each column of matching_matrix.

In the resulting anchor_matching_gt, any value greater than 1 indicates a shared candidate. In the case above, the 5th column is shared.

The third line of code then takes the values of the 5th column from cost, compares them, and finds the minimum value and the row it belongs to.

Suppose the two positions in the 5th column have the values 0.4 and 0.3.

After the third line of code, the smallest value found is 0.3, so cost_min is 0.3 and the corresponding row, cost_argmin, is 2.

The fourth line of code sets the whole 5th column of matching_matrix to 0.

The fifth line of code then sets the position at row 2, column 5 of matching_matrix to 1.

In the end, we obtain the most suitable candidate boxes for the 3 target boxes, namely all positions in matching_matrix that equal 1.
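The shared-candidate step can be reproduced on the example from the text, where the 5th column holds the cost values 0.4 and 0.3. This is a list-based sketch with a hypothetical function name; the remaining cost values are made up, and indices are 0-based, so "row 2, column 5" in the text is index [1][4] here.

```python
def resolve_shared_candidates(matching_matrix, cost):
    """Give every candidate matched to several target boxes to the lowest-cost one."""
    num_gt, num_anchors = len(matching_matrix), len(matching_matrix[0])
    resolved = [row[:] for row in matching_matrix]
    for j in range(num_anchors):
        # Number of target boxes that selected candidate j.
        anchor_matching_gt = sum(resolved[i][j] for i in range(num_gt))
        if anchor_matching_gt > 1:
            # Row with the smallest cost in this column.
            cost_argmin = min(range(num_gt), key=lambda i: cost[i][j])
            for i in range(num_gt):
                resolved[i][j] = 0
            resolved[cost_argmin][j] = 1
    return resolved


# The 5th candidate (index 4) is shared by target boxes 1 and 2.
matching_matrix = [[1, 0, 0, 0, 1, 0],
                   [0, 1, 0, 0, 1, 0],
                   [0, 0, 1, 0, 0, 0]]
cost = [[0.1, 0.9, 0.9, 0.9, 0.4, 0.9],
        [0.9, 0.2, 0.9, 0.9, 0.3, 0.9],
        [0.9, 0.9, 0.1, 0.9, 0.9, 0.9]]
resolved = resolve_shared_candidates(matching_matrix, cost)
```

After this step every column contains at most one 1, so each candidate box is a positive sample for at most one target box.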

(4) Loss calculation

Through the label assignment in part (3), each target box is matched to its positive-sample prediction boxes. The error between the two, i.e. the loss function, can now be computed. The code for this is in the get_losses function of yolo_head.py.

We can see:

For the box-position loss iou_loss, Yolox offers two options to choose from: the traditional iou_loss and giou_loss. Both obj_loss and cls_loss use BCE_loss.
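For readers who have not met the two box losses before, here is a minimal sketch of IoU and GIoU for two boxes in (x1, y1, x2, y2) format. It uses the textbook forms 1 - IoU and 1 - GIoU; the exact expression used in the Yolox code may differ (for example in how the IoU term is scaled), so treat this as an illustration rather than the repository implementation.

```python
def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box enclosing both boxes, used by the GIoU penalty term.
    enclose_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose_area - union) / enclose_area
    return iou, giou


iou, giou = iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3))
iou_loss = 1.0 - iou    # textbook IoU loss
giou_loss = 1.0 - giou  # textbook GIoU loss
```

GIoU is never larger than IoU, so giou_loss still gives a useful signal when the boxes barely overlap or do not overlap at all.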

There are two other points to note:

a. In the earlier fine screening, reg_loss and cls_loss were used to select the prediction boxes corresponding to each target box. So here, iou_loss and cls_loss are computed only between the target boxes and the selected positive-sample prediction boxes, while obj_loss is still computed over all 8400 prediction boxes.

b. In the Decoupled Head, cls_output and obj_output are normalized with the sigmoid function. During training, however, sigmoid is not applied explicitly, because training uses nn.BCEWithLogitsLoss, which already includes the sigmoid operation.

During inference, the sigmoid function is applied.
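Point b can be checked numerically. The sketch below uses only the Python standard library and does not call PyTorch; bce_with_logits is a hand-written stand-in for the fused "sigmoid + BCE" computation, written in its usual numerically stable form, and the logits and targets are made-up values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, target):
    """Binary cross-entropy on a probability."""
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def bce_with_logits(logit, target):
    """Fused sigmoid + BCE on a raw logit: max(x, 0) - x * t + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logits = [-2.0, 0.0, 3.5]
targets = [0.0, 1.0, 1.0]
separate = [bce(sigmoid(x), t) for x, t in zip(logits, targets)]
fused = [bce_with_logits(x, t) for x, t in zip(logits, targets)]
```

The two lists agree up to floating-point error, which is why the raw outputs are fed to the loss in training and sigmoid is only applied at inference.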

PS: comparison of different experimental data

To test how different tricks perform in Yolox, my friend Pan Daqiang and I ran comparative tests of many tricks on our own data and found the following:

① Scheme 1: Yolox-s + data augmentation + (loss function of obj_output: BCELoss)

② Scheme 2: Yolox-s + data augmentation + (loss function of obj_output changed to FocalLoss)

Comparison: when training on our own dataset, changing the BCE_Loss of obj_loss to Focal_Loss gave a clear improvement, with a sizeable gain in accuracy, and iou_loss also converged better. Has anyone else tried this? Feel free to discuss it in the comments section.
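For reference, a binary focal loss on a single logit can be sketched as below. This is the standard formulation, with alpha=0.25 and gamma=2.0 as commonly used defaults; the article does not say which settings were used in Scheme 2, so these values and the function name are assumptions.

```python
import math


def focal_loss_with_logits(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    # p_t is the probability assigned to the true class.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A background box (target 0) predicted confidently as background: easy example.
easy = focal_loss_with_logits(-4.0, 0)
# A background box predicted confidently as an object: hard example.
hard = focal_loss_with_logits(2.0, 0)
```

The (1 - p_t)**gamma factor shrinks the loss of easy examples far more than that of hard ones. Since obj_loss is computed over all 8400 prediction boxes, most of which are easy background, this may be one reason the change helped in the experiment above.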

2.2.3 Yolox-s, l, m, x series

After continuously optimizing the Yolov3 baseline and obtaining good results, the authors applied the same series of tricks to the four network structures of the Yolov5 series: Yolov5s, Yolov5m, Yolov5l and Yolov5x.

Let's first look at what improvements were made.

We mainly compare against Yolov5s. The picture below is the network structure diagram of Yolov5s:

Now let's look at the network structure of Yolox-s:

Comparing the two figures above, the main differences between Yolov5s and Yolox-s are:

(1) Input: Mixup data augmentation is added on top of Mosaic data augmentation;

(2) Backbone: the activation function is SiLU;

(3) Neck: the activation function is SiLU;

(4) Output: the detection head is changed to the Decoupled Head, and anchor free, multi positives and SimOTA are used.
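The SiLU activation mentioned in points (2) and (3) is simply x multiplied by sigmoid(x). A minimal standard-library sketch, for illustration only:

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


samples = [silu(v) for v in (-10.0, -1.0, 0.0, 1.0, 10.0)]
```

For large positive inputs SiLU behaves almost like the identity, for large negative inputs it approaches 0, and it is smooth around 0.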

On top of the earlier Yolov3 baseline, the tricks above brought a very good gain.

What about within the Yolov5 series of frameworks?

Below is a comparison of the improvements on the four networks Yolov5s, Yolov5m, Yolov5l and Yolov5x:

It can be seen that, at the cost of about 1 ms of extra inference time, AP improves by 0.8 to 2.9 points.

The lighter the network, the larger the gain: Yolox-s improves the most, by 2.9 points.

As the depth and width of the network increase, the gain gradually shrinks, down to 0.8 points for Yolox-x.

2.2.4 Lightweight network research

After improving the Yolov3 and Yolov5 series, the authors also designed two lightweight networks and compared them with Yolov4-Tiny and NanoDet.

In this research the authors report two findings, described from two angles: lightweight networks, and the advantages and disadvantages of data augmentation.

2.2.4.1 Lightweight networks

Because of the needs of real-world scenarios, many people want to migrate Yolo to edge devices.

The authors therefore built the Yolox-Tiny network structure as a counterpart to Yolov4-Tiny.

As a counterpart to the FCOS-style NanoDet, they built the Yolox-Nano network structure.

As can be seen from the table above:

(1) Compared with Yolov4-Tiny, Yolox-Tiny has 1M fewer parameters and an AP that is 9 points higher.

(2) Compared with NanoDet, Yolox-Nano has fewer parameters, only 0.91M, and gains 1.8 points.

(3) So the overall design of Yolox still brings good improvements for lightweight models.

Data augmentation is used in many of the comparative tests for Yolox.

But network structures differ, some deep and some shallow, and so does their learning capacity. So is unrestrained data augmentation always better?

The authors also ran comparative tests on this question.

The table above leads to the following findings:

① Mosaic and Mixup hybrid strategy

(1) For a lightweight network such as Yolox-nano, adding Mixup on top of Mosaic lowers AP instead of raising it, from 25.3 to 24.

(2) For a deeper network such as Yolox-L, adding Mixup on top of Mosaic raises AP, from 48.6 to 49.5.

(3) Different network structures therefore call for different data augmentation strategies. For Yolox-s and Yolox-m, or for the Yolov4 and Yolov5 series, it is worth trying different strategies.
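As a reminder of what Mixup does, here is a minimal sketch of a detection-style Mixup: the two images are blended pixel by pixel and the box labels of both images are kept. The 0.5 blend ratio, the flat-list image format and the label tuples are assumptions made for illustration, and the steps a real pipeline also needs (resizing, padding, cropping the second image) are left out.

```python
def mixup(image_a, labels_a, image_b, labels_b, ratio=0.5):
    """Blend two same-sized images and concatenate their box labels."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    mixed_image = [ratio * pa + (1.0 - ratio) * pb for pa, pb in zip(image_a, image_b)]
    mixed_labels = labels_a + labels_b
    return mixed_image, mixed_labels


# Toy data (hypothetical): 4-pixel grayscale "images", labels as (class, x1, y1, x2, y2).
image_a = [0.0, 100.0, 200.0, 50.0]
image_b = [200.0, 100.0, 0.0, 150.0]
labels_a = [(0, 10, 10, 50, 50)]
labels_b = [(1, 20, 20, 60, 60), (2, 5, 5, 30, 30)]
mixed_image, mixed_labels = mixup(image_a, labels_a, image_b, labels_b)
```

The blended image carries the objects of both inputs at reduced contrast. That makes the task harder, which fits the finding above that a small network such as Yolox-nano may not benefit from it.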

② Scale augmentation strategy

In Mosaic data augmentation, the random_perspective function in Yolox/data/data_augment.py generates a random value for the image zoom factor when it builds the affine transformation matrix.

(1) For Yolox-l, the random scale range is set to [0.1, 2], which is the default parameter given in the paper.

(2) For lightweight models such as Yolox-Nano, on the one hand only Mosaic data augmentation is used, and on the other hand the random scale range is set to [0.5, 1.5], which weakens the effect of Mosaic augmentation.
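The effect of the two scale ranges can be seen with a small sampling sketch. The dictionary SCALE_RANGES and the function sample_scale are illustrative names, not part of the Yolox code; only the two ranges come from the text above, and uniform sampling is an assumption.

```python
import random

# Scale ranges quoted in the text for the two kinds of models.
SCALE_RANGES = {
    "yolox-l": (0.1, 2.0),
    "yolox-nano": (0.5, 1.5),
}


def sample_scale(model_name, rng):
    """Draw a random zoom factor from the range configured for the model."""
    low, high = SCALE_RANGES[model_name]
    return rng.uniform(low, high)


rng = random.Random(0)
scales_l = [sample_scale("yolox-l", rng) for _ in range(1000)]
scales_nano = [sample_scale("yolox-nano", rng) for _ in range(1000)]
```

With the narrower range, the lightweight model never sees images shrunk to a tenth of their size or enlarged to double, so the augmentation is milder.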

2.3 Achievements of Yolox

2.3.1 Accuracy and speed comparison

Having covered the reasons and principles behind the various Yolox tricks, let's look at the overall accuracy and speed of the models:

The picture on the left compares the standard network structures, mainly in terms of speed and accuracy. The picture on the right compares the lightweight networks, mainly in terms of parameter count and accuracy.

From the picture on the left we can see:

(1) Compared with Yolov5-l, which is on par with Yolov4-CSP, Yolox-l reaches 50% AP on the COCO dataset, surpassing Yolov5-l by 1.8 percentage points at almost the same speed.

(2) Compared with Yolov5-Darknet53, Yolox-Darknet53 reaches 47.3% AP, 3 percentage points higher at almost the same speed.

And from the picture on the right:

(1) Compared with NanoDet, Yolox-Nano has fewer parameters and lower GFLOPS, with 0.91M parameters and 1.08 GFLOPS, yet its accuracy reaches 25.3%, exceeding NanoDet by 1.8 percentage points.

(2) Compared with Yolov4-Tiny, Yolox-Tiny has fewer parameters and lower GFLOPS, and its accuracy is 9 percentage points higher.

2.3.2 Autonomous Driving competition

In the CVPR 2021 autonomous driving competition, the Streaming Perception Challenge track focuses on real-time 2D object detection on video streams in autonomous driving scenes.

A server sends images and receives detection results to simulate a 30 FPS video stream, and the client runs real-time inference on each image it receives.

In the competition, Kuangshi Technology used Yolox-l as the model, accelerated inference with TensorRT, and finally took first place on both the full-track and the detection-only track.

So the various improvements in Yolox are very good and well worth studying closely.

3 Different deployment methods for putting the model into production

Once the model is trained, it needs to be deployed in a project.

In the code, the authors have also carefully organized deployment methods for various versions:

The 5 ways above are:

(1) MegEngine: deployment with MegEngine, the deep learning framework open-sourced by Kuangshi Technology. It is also a core component of Brain++, and mainly comes in C++ and Python versions.

(2) ONNX and TensorRT: both approaches are supported by NVIDIA, mainly come in C++ and Python versions, and are often used for inference on GPU servers.

(3) NCNN: the open-source mobile inference framework from Tencent Youtu, mainly in C++ and Java versions.

(4) OpenVINO: the open-source deep learning application suite from Intel, mainly in C++ and Python versions.

In general, Yolox-Nano, Yolox-Tiny and Yolox-s can be chosen for mobile deployment.

Yolox-m, Yolox-l and Yolox-x are used for GPU server deployment.

You can also choose a deployment method according to the needs of your own project.

Of course, in addition to the head dataset used in training, the dataset download section of Dabai's website also has hundreds of datasets of different types: