vllm.model_executor.layers.fused_moe.topk_weight_and_reduce ¶
TopKWeightAndReduceContiguous ¶
Bases: TopKWeightAndReduce
TopKWeightAndReduce implementation for a fused_experts output of shape (m, topk, K)
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
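Conceptually, this variant takes the per-(token, expert) outputs of shape (m, topk, K), scales each slice by its routing weight, and sums over the topk dimension to produce (m, K). A minimal NumPy sketch of that semantics, assuming this reduction rule (the real kernel fuses this on-device; `weight_and_reduce_contiguous` is an illustrative name, not the vLLM API):

```python
import numpy as np

def weight_and_reduce_contiguous(fused_out: np.ndarray,
                                 topk_weights: np.ndarray) -> np.ndarray:
    """fused_out: (m, topk, K) per-token, per-selected-expert outputs.
    topk_weights: (m, topk) routing weights.
    Returns the weighted sum over the topk axis: (m, K)."""
    return np.einsum("mtk,mt->mk", fused_out, topk_weights)

m, topk, K = 4, 2, 8
fused_out = np.arange(m * topk * K, dtype=np.float32).reshape(m, topk, K)
topk_weights = np.full((m, topk), 0.5, dtype=np.float32)
reduced = weight_and_reduce_contiguous(fused_out, topk_weights)
print(reduced.shape)  # (4, 8)
```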
TopKWeightAndReduceDelegate ¶
Bases: TopKWeightAndReduce
Useful when a FusedMoEPermuteExpertsUnpermute implementation does not itself perform weight application and reduction, yet must remain compatible with PrepareAndFinalize implementations that have differing needs. For example, BatchedTritonExperts is compatible with both batched PrepareAndFinalize implementations, DeepEPLLPrepareAndFinalize and BatchedPrepareAndFinalize: the former performs the weight application and reduction as part of its combine kernel, while the latter requires an explicit implementation. By using TopKWeightAndReduceDelegate, BatchedTritonExperts lets each PrepareAndFinalize implementation choose how to weight and reduce.
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
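The delegation flow described above can be sketched as a plain-Python pattern. The names and the `finalize` signature below are illustrative assumptions, not the actual vLLM API; the point is only that a sentinel value lets the finalize stage substitute its own strategy:

```python
import numpy as np

class TopKWeightAndReduceDelegate:
    """Sentinel: the experts implementation did NOT apply weights or
    reduce, deferring the choice to the finalize stage."""

def contiguous_weight_and_reduce(fused_out, topk_weights):
    # One possible concrete strategy: weighted sum over the topk axis.
    return np.einsum("mtk,mt->mk", fused_out, topk_weights)

def finalize(fused_out, topk_weights, weight_and_reduce, preferred):
    # Hypothetical finalize stage: if the experts delegated, use this
    # stage's preferred strategy; otherwise honor the experts' choice.
    if isinstance(weight_and_reduce, TopKWeightAndReduceDelegate):
        weight_and_reduce = preferred
    return weight_and_reduce(fused_out, topk_weights)

fused_out = np.ones((2, 3, 4), dtype=np.float32)      # (m, topk, K)
topk_weights = np.ones((2, 3), dtype=np.float32) / 3.0
result = finalize(fused_out, topk_weights,
                  TopKWeightAndReduceDelegate(),
                  preferred=contiguous_weight_and_reduce)
print(result.shape)  # (2, 4)
```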
TopKWeightAndReduceNaiveBatched ¶
Bases: TopKWeightAndReduce
TopKWeightAndReduce implementation for a fused_experts output of shape (num_experts, batch_size, K)
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
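In the batched layout the experts output is expert-major, shape (num_experts, batch_size, K), so reducing back to (m, K) requires knowing which row of which expert's batch holds each (token, slot) pair. A hedged NumPy sketch, where `token_pos` is a hypothetical layout mapping introduced for this example (the real class derives the layout from dispatch metadata):

```python
import numpy as np

def weight_and_reduce_batched(fused_out, topk_ids, token_pos, topk_weights):
    """fused_out: (num_experts, batch_size, K) expert-major outputs.
    topk_ids:  (m, topk) expert chosen for each (token, slot).
    token_pos: (m, topk) row of each (token, slot) inside its expert's
               batch -- a hypothetical mapping for this sketch.
    topk_weights: (m, topk) routing weights. Returns (m, K)."""
    gathered = fused_out[topk_ids, token_pos]          # (m, topk, K)
    return np.einsum("mtk,mt->mk", gathered, topk_weights)

num_experts, batch, K, m, topk = 3, 4, 5, 4, 2
rng = np.random.default_rng(0)
fused_out = rng.random((num_experts, batch, K)).astype(np.float32)
topk_ids = rng.integers(0, num_experts, size=(m, topk))
token_pos = np.tile(np.arange(m)[:, None], (1, topk))  # toy layout: token t -> row t
topk_weights = rng.random((m, topk)).astype(np.float32)
reduced = weight_and_reduce_batched(fused_out, topk_ids, token_pos, topk_weights)
print(reduced.shape)  # (4, 5)
```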
TopKWeightAndReduceNoOP ¶
Bases: TopKWeightAndReduce
The fused_experts output has already had the topk weights applied and been reduced. This implementation is a no-op.