Reviews For Paper
Paper ID 526
Title Structured Attentions For Visual Question Answering

Masked Reviewer ID: Assigned_Reviewer_1
Review:
Question 
Paper Summary. Please summarize in your own words what the paper is about. The submission proposes structured attention for the Visual Question Answering task.
The presented method generalizes standard attention by modeling local pairwise relationships
between the elements of the CNN grid, and thus takes spatial relationships between 'things'
into account. With this model, the authors achieve impressive
results on CLEVR and VQA.
Paper Strengths. Positive aspects of the paper. Be sure to comment on the paper's novelty, technical correctness, clarity and experimental evaluation. Notice that different papers may need different levels of evaluation: e.g., a theoretical paper vs. an application paper * Impressive results on CLEVR and VQA.
* Interesting generalization of the attention mechanism.
* Mean Field and Loopy Belief Propagation as recurrent neural networks.
Paper Weaknesses. Discuss the negative aspects: lack of novelty or clarity, technical errors, insufficient experimental evaluation, etc. Please justify your comments in great detail. If you think the paper is not novel, explain why and provide evidences * How does the structured attention work in the case of non-local interactions?
The proposed model assumes only local interactions between two neighbouring elements.
* It is unclear whether a grid-structured CRF is needed to define eq. 5. Why not use sigmoids?
* Main ideas behind ERF could have been explained in the submission.
* Is eq. 11 correct? E.g. the j in b_j(z_j) is not bound (the sum runs only over i and z_i).
Moreover, shouldn't it be a sum of ln b_i and ln b_j? Should the KL-divergence be
between q(z) and p(z | X,q), i.e. KL(q || p) (lines 376-377)? What is \psi_i (line 398)?
* How does S in lines 267-268 help in reducing covariate shift (change of distribution)?
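For reference on the eq. 11 question above, a sketch of the standard mean-field objective this reviewer has in mind, assuming a fully factorized variational distribution with marginals b_i and keeping the p(z | X,q) notation used in this review (this is the textbook form, not necessarily the submission's eq. 11):

```latex
\mathrm{KL}(q \,\|\, p)
  = \sum_{z} q(z) \ln \frac{q(z)}{p(z \mid X, q)}
  = \sum_{i} \sum_{z_i} b_i(z_i) \ln b_i(z_i)
    - \sum_{z} q(z) \ln p(z \mid X, q),
\qquad q(z) = \prod_{i} b_i(z_i).
```

Under this factorization every index appearing in a belief term is bound by a sum, which is the property the question about b_j(z_j) asks the authors to check.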
Preliminary Rating Borderline
Preliminary Evaluation. Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please summarize the key things you would like the authors to include in their rebuttals to facilitate your decision making. The submission is very promising with impressive results on both CLEVR and VQA. Before taking a final decision, I would like to clarify a few doubts expressed in Paper Weaknesses.
Confidence. Write “Very Confident” to stress you are absolutely sure about your conclusions (e.g., you are an expert working in the area), “Confident” to stress you are mostly sure about your conclusions (e.g., you are not an expert but are knowledgeable). “Not Confident” in all the other cases. Confident

Masked Reviewer ID: Assigned_Reviewer_2
Review:
Question 
Paper Summary. Please summarize in your own words what the paper is about. The paper proposes using structured attention for visual question answering. By imposing structural constraints on the attention map with a Conditional Random Field (CRF), the model can attend appropriately for questions that require reasoning over spatial relations. The proposed method improves performance on the CLEVR and VQA datasets.
Paper Strengths. Positive aspects of the paper. Be sure to comment on the paper's novelty, technical correctness, clarity and experimental evaluation. Notice that different papers may need different levels of evaluation: e.g., a theoretical paper vs. an application paper The use of structured attention for visual question answering is novel. The method is clearly written and the experiments are correctly performed. The many qualitative comparisons with other baselines provide good intuition for how structured attention performs.
Paper Weaknesses. Discuss the negative aspects: lack of novelty or clarity, technical errors, insufficient experimental evaluation, etc. Please justify your comments in great detail. If you think the paper is not novel, explain why and provide evidences Lack of control experiments on the number of inference steps for CRF.
Preliminary Rating Weak Accept
Preliminary Evaluation. Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please summarize the key things you would like the authors to include in their rebuttals to facilitate your decision making. Using structured attention for visual question answering is an important research direction, and the paper suggests a neat idea: using a CRF for the attention map. My one concern is the use of a few ad hoc solutions for the final evaluation, such as
- using features pooled with unary attention (Sec 3.4):
why is adding features pooled from unary attention required?

Another question is the effect of the number of inference steps. Even though its effect on final performance on the VQA dataset is briefly discussed in Section 4.3.3, no control experiments are performed. What is the effect of the number of inference steps on SHAPES and CLEVR?
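To make the requested control experiment concrete, here is a minimal sketch of sweeping the number of inference steps. It assumes a binary grid CRF with a single shared pairwise weight, a 4-neighbourhood, and parallel mean-field updates; the function names and the energy form are illustrative assumptions, not the submission's model or code.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def neighbour_sum(beliefs):
    """Sum of the 4-connected neighbours of each cell, with zero padding."""
    padded = np.pad(beliefs, 1)  # constant zero border
    return (padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:])


def mean_field_attention(unary, coupling, num_steps):
    """Parallel mean-field updates for an assumed model
    p(z) proportional to exp(sum_i u_i z_i + w * sum_{(i,j)} z_i z_j), z_i in {0, 1}.

    unary:     (H, W) array of logits u_i (hypothetical unary scores)
    coupling:  scalar pairwise weight w (hypothetical, shared by all edges)
    num_steps: number of inference steps; 0 gives plain sigmoid attention
    """
    beliefs = sigmoid(unary)
    for _ in range(num_steps):
        beliefs = sigmoid(unary + coupling * neighbour_sum(beliefs))
    return beliefs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unary = rng.normal(size=(4, 4))
    baseline = sigmoid(unary)
    # Report how far the attention map moves from the unstructured baseline.
    for steps in (0, 1, 3, 10):
        attention = mean_field_attention(unary, coupling=0.5, num_steps=steps)
        print(steps, float(np.abs(attention - baseline).max()))
```

With `num_steps=0` the sketch reduces to unstructured sigmoid attention, so a sweep over `num_steps` isolates the contribution of the inference steps themselves.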
Confidence. Write “Very Confident” to stress you are absolutely sure about your conclusions (e.g., you are an expert working in the area), “Confident” to stress you are mostly sure about your conclusions (e.g., you are not an expert but are knowledgeable). “Not Confident” in all the other cases. Confident

Masked Reviewer ID: Assigned_Reviewer_3
Review:
Question 
Paper Summary. Please summarize in your own words what the paper is about. The paper models the task of visual attention for VQA as a multivariate distribution over a grid-structured CRF on image regions. The authors use Mean Field and Loopy Belief Propagation inference algorithms and implement them as parameterless recurrent layers. The proposed approach is evaluated on 3 VQA datasets -- SHAPES, CLEVR and the real-image VQA dataset.
Paper Strengths. Positive aspects of the paper. Be sure to comment on the paper's novelty, technical correctness, clarity and experimental evaluation. Notice that different papers may need different levels of evaluation: e.g., a theoretical paper vs. an application paper -- The paper points out the problem that current visual attention models ignore the spatial structure of images and only use CNN features to compute attention weights, which might not work well for questions requiring attention at multiple regions in the image far from each other.

-- The approach has been described in detail.

-- The experiment section is comprehensive, reporting results on 3 different VQA datasets, and highlighting the improvements clearly.
Paper Weaknesses. Discuss the negative aspects: lack of novelty or clarity, technical errors, insufficient experimental evaluation, etc. Please justify your comments in great detail. If you think the paper is not novel, explain why and provide evidences -- How much extra time does the proposed approach take as compared to softmax or sigmoid unstructured attention?

-- Other than computation time, are there any other limitations of using structured attention as compared to unstructured attention? What do the failure cases look like where unstructured attention focuses on the target and structured attention doesn't? More discussion of the failure cases of structured attention, and of future directions for improvement, would help the research community build on this work.

-- Why does adding unstructured attention in addition to structured attention lead to better performance?

-- Are there any insights about which cases are more suitable for MF and which for LBP? Trying out both and picking the one with the best validation accuracy is probably not the best way.

-- L574: which pooling method has been used in the experiments?

-- Lines 93-95: It is also important to analyze how many questions in the real-image VQA dataset require spatial relations between regions. Have the authors looked into a way of quantifying this?
Preliminary Rating Weak Accept
Preliminary Evaluation. Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please summarize the key things you would like the authors to include in their rebuttals to facilitate your decision making. The paper advances research on modeling visual attention for the task of Visual Question Answering, and shows that structured modeling leads to better attention maps and better VQA accuracy. I think this work will be helpful for the research community and should be accepted.
Confidence. Write “Very Confident” to stress you are absolutely sure about your conclusions (e.g., you are an expert working in the area), “Confident” to stress you are mostly sure about your conclusions (e.g., you are not an expert but are knowledgeable). “Not Confident” in all the other cases. Confident


 