Skip to content
This repository was archived by the owner on Dec 8, 2025. It is now read-only.

fix distributed initialization#285

Closed
ZiyueHuang wants to merge 1 commit into
bytedance:masterfrom
ZiyueHuang:fix_init
Closed

fix distributed initialization#285
ZiyueHuang wants to merge 1 commit into
bytedance:masterfrom
ZiyueHuang:fix_init

Conversation

@ZiyueHuang

@ZiyueHuang ZiyueHuang commented Aug 5, 2020

Copy link
Copy Markdown
Contributor

According to this line https://github.com/bytedance/byteps/blob/master/byteps/server/server.cc#L197, the server will initialize the stored parameter from the last init push which could be sent from an arbitrary worker.
cc @eric-haibin-lin

Comment thread byteps/mxnet/__init__.py
if rank() != self.root_rank:
param_arrays[0].__imul__(0)
byteps_push_pull(param_arrays[0], version=0, priority=0,
name="parameter_" + str(idx), is_average=False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we should use is_average=True here?

@ZiyueHuang ZiyueHuang Aug 5, 2020

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no need to average, since the server just selects one received tensor and copy it to the stored buffer then send it to all workers.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

@ZiyueHuang

ZiyueHuang commented Aug 5, 2020

Copy link
Copy Markdown
Contributor Author

For sync mode, this initialization is also necessary since it make sure that the parameters on each worker are the same at the beginning. The current implementation is problematic since all the workers may actually initialize the parameters to 0 (except when the root node sends the last init push), no matter what initializer the user chooses.

@ZiyueHuang

Copy link
Copy Markdown
Contributor Author

This problem also exists for the distributed gradient accumulation in the first round...

For the push phase in the first round of byteps_push_pull, the server will discard all the messages (except the last one) sent from all the workers. We should separate this step away from byteps_push_pull into a new api, say, byteps_init, and let byteps_push_pull only do the sum operation. Similar to InitImpl https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_dist.h#L193.

@jasperzhong

Copy link
Copy Markdown
Contributor

This problem also exists for the distributed gradient accumulation in the first round...

For the push phase in the first round of byteps_push_pull, the server will discard all the messages (except the last one) sent from all the workers. We should separate this step away from byteps_push_pull into a new api, say, byteps_init, and let byteps_push_pull only do the sum operation. Similar to InitImpl https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_dist.h#L193.

I don't think there is such a problem in the first round of pushpull of gradients. The stored has been allocated and initialized during the pushpull of parameters.

BTW, there is already an API called bps.init().

jasperzhong pushed a commit to jasperzhong/byteps that referenced this pull request Aug 5, 2020
@ZiyueHuang

ZiyueHuang commented Aug 5, 2020

Copy link
Copy Markdown
Contributor Author

@vycezhong The gradient and the weight for the same parameter are two different buffers in the server since they have different keys. See https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L203-L206. The problem is that for a new key, in the first round of byteps_push_pull, the server is not performing the sum operation and this is weird.

bps.init() is not for a particular parameter; see kvstore init api in MXNet.

@jasperzhong

jasperzhong commented Aug 5, 2020

Copy link
Copy Markdown
Contributor

i see. you're right. I am sorry that i didn't notice parameter and gradient do not share the same buffer...

I think maybe we can reuse the parameter buffer? for example,

        for i, param in enumerate(self._params):
            byteps_declare_tensor("tensor_" + str(i)) 
...

    def _init_params(self):
                ....
                byteps_push_pull(param_arrays[0], version=0, priority=0,
                                 name="tensor_" + str(idx), is_average=False)

in this way, there is only one collection of buffers. And byteps_push_pull in _init_params will trigger initialization of the buffer. And then this buffer is used for gradients. I think it may be a solution.

@ZiyueHuang

Copy link
Copy Markdown
Contributor Author

This might be a solution for training the model, but it is still weird that the first round of byteps_push_pull performs a random selection instead of the sum...

Furthermore, it may not be neccesary to maintain two buffers update_buf_ and store_ in https://github.com/bytedance/byteps/blob/master/byteps/server/server.h, since BytePS transmits gradients bidirectionally (both from worker to server and from server to worker). MXNet KVStore maintains these two buffers since MXNet pushes gradient from worker to server, then updates the weight on the server by the optimizer, then pulls weight from server to worker.

@ZiyueHuang ZiyueHuang closed this Aug 5, 2020
@eric-haibin-lin

Copy link
Copy Markdown
Contributor

@bobzhuyb

bobzhuyb commented Aug 5, 2020

Copy link
Copy Markdown
Member

Is it causing any real problems? The initialized value is not used later, so we did not care to make it deterministic.

jasperzhong added a commit to jasperzhong/byteps that referenced this pull request Aug 5, 2020
@jasperzhong

Copy link
Copy Markdown
Contributor

Is it causing any real problems? The initialized value is not used later, so we did not care to make it deterministic.

Sorry for my careless. We found the logic is right.

jasperzhong added a commit to jasperzhong/byteps that referenced this pull request Aug 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants