Skip to content

ARROW-1149: [Plasma] Create Cython client library for Plasma#797

Closed
pcmoritz wants to merge 61 commits into
apache:masterfrom
pcmoritz:plasma-cython
Closed

ARROW-1149: [Plasma] Create Cython client library for Plasma#797
pcmoritz wants to merge 61 commits into
apache:masterfrom
pcmoritz:plasma-cython

Conversation

@pcmoritz

@pcmoritz pcmoritz commented Jun 30, 2017

Copy link
Copy Markdown
Contributor

This PR introduces a Cython API to Plasma, a FindPlasma.cmake to make it easier to integrate Plasma with CMake projects and sets up packaging with pyarrow.

@wesm

wesm commented Jul 1, 2017

Copy link
Copy Markdown
Member

@pcmoritz I'm sorry about the delay, I will have a closer look most likely tomorrow

@pcmoritz

pcmoritz commented Jul 1, 2017

Copy link
Copy Markdown
Contributor Author

Thanks, looking forward to it! The central question is how much mixing between the plasma client and pyarrow we want and also how the directory structure should look like. I kept them pretty separate for now.

@wesm wesm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! As far as the packaging, it's a little more work to add a FindPlasma.cmake file to the pyarrow build system, but for packaging simplicity (for pip and conda) it would be easiest to maintain this as pyarrow/plasma.pyx and have a single build system and package artifact

I left minor comments about proper usage of nogil (you have to have the with nogil: block for it to release the GIL; the nogil annotation in the cdef extern blocks tells Cython that the functions are safe to use in a nogil block, but the actual release must be explicit)

The other thing is the docstrings so that things will render properly when we add these APIs to the Arrow Python documentation (http://arrow.apache.org/docs/python/api.html -- it's in need of some love as you can see)

Comment thread cpp/src/plasma/plasma.h Outdated
#ifdef PyMODINIT_FUNC
#undef PyMODINIT_FUNC
#endif
#define PyMODINIT_FUNC extern "C" __attribute__((visibility ("default"))) PyObject*

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very odd, will have to have a closer look at what is going on; I've never had any issues like this

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated

cdef extern from "plasma/common.h" nogil:

cdef cppclass CUniqueID" plasma::UniqueID":

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixed indentation -- use 4 spaces everywhere?

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated
self.data = CUniqueID.from_binary(object_id)

def __richcmp__(ObjectID self, ObjectID object_id, operation):
assert operation == 2, "only equality implemented so far"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raise a ValueError here instead

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated
for object_id in object_ids:
ids.push_back(object_id.data)
result[0].resize(ids.size())
check_status(self.client.get().Get(ids.data(), ids.size(), timeout_ms, result[0].data()))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add with nogil: here to release the GIL

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated
release_delay (int): The maximum number of objects that the client will
keep and delay releasing (for caching reasons).
"""
check_status(self.client.get().Connect(store_socket_name.encode(), manager_socket_name.encode(), release_delay))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil: here (may require some unboxing of the Python objects outside the with nogil block)

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated

def release(self, ObjectID object_id):
"""Notify Plasma that the object is no longer needed."""
check_status(self.client.get().Release(object_id.data))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated
object_id (str): A string used to identify an object.
"""
cdef c_bool is_contained
check_status(self.client.get().Contains(object_id.data, &is_contained))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated
store, the string will have length zero.
"""
cdef c_vector[uint8_t] digest = c_vector[uint8_t](kDigestSize)
check_status(self.client.get().Hash(object_id.data, digest.data()))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil

Comment thread cpp/src/plasma/plasma/plasma.pyx Outdated
num_bytes (int): The number of bytes to attempt to recover.
"""
cdef int64_t num_bytes_evicted = -1
check_status(self.client.get().Evict(num_bytes, num_bytes_evicted))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with nogil, and the other client calls below

Comment thread cpp/src/plasma/test/test.py Outdated

USE_VALGRIND = False

def random_name():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent with 4 spaces in this file

@pcmoritz

pcmoritz commented Jul 4, 2017

Copy link
Copy Markdown
Contributor Author

Thanks a lot for the review. The packaging simplicity is a good argument, let's make it part of pyarrow. I'm traveling right now but hope to get to it this weekend.

Comment thread ci/travis_script_plasma.sh Outdated
}

# run tests for python 2.7 and 3.6
# python_version_tests 2.7

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably comment this in?

read_multiple_files(mixed_paths)


@parquet

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need an extra newline before @parquet

@pcmoritz pcmoritz force-pushed the plasma-cython branch 7 times, most recently from abc5600 to 48b6f90 Compare July 14, 2017 12:36
@xhochy

xhochy commented Jul 14, 2017

Copy link
Copy Markdown
Member

@pcmoritz please ping me when I should review or you run into problems. I'm ignoring this pull request for now as there's a lot of activity in here with commits.

@pcmoritz

pcmoritz commented Jul 14, 2017

Copy link
Copy Markdown
Contributor Author

@xhochy I'm wrapping it up right now, the last thing I'm really blocked on is the error in https://travis-ci.org/pcmoritz/arrow/jobs/253597521 in the manylinux build, see the error message
"ImportError: libplasma.so.0: cannot open shared object file: No such file or directory"

If you have an idea please let me know! I did add -DARROW_PLASMA=ON to python/manylinux1/build_arrow.sh and FindPlasma.cmake seems to find libplasma.so at least.

Other than that there are some doc fixes I need to do, I'll let you know when it is ready.

Thanks for your help :)

Comment thread .travis.yml Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned about not also testing this on OSX, but I agree it might be too burdensome on Travis.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests for mac os now, if it becomes too burdensome, we can switch it off

Comment thread python/pyarrow/io.pxi Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two newlines before and after this class

Comment thread python/pyarrow/plasma.pyx Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Ray, this used to look more like

if self.plasma_client.alive:
    libplasma.release(self.plasma_client.conn, self.plasma_id)

I think the motivation for this was that Ray (on a single machine) by default kills the plasma store manually at the end of a script. After this, a PlasmaBuffer can still go out of scope and try to send a release message to the store, and the check_status will raise an exception leaving some ugly error or warning messages.

Or is this avoided in some other way?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done in the C++ client; the file descriptor is set to -1 on disconnect, our old shutdown is disconnect and release checks for the file descriptor being -1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that looks good.

Comment thread cpp/src/plasma/CMakeLists.txt Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also move these cmake_modules to the top-level so that they are available to any arrow-cpp component.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's where they belong. Shall we do this as a separate PR to make it a little more visible? It might break other people's scripts.

Comment thread python/manylinux1/build_arrow.sh Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a check in line 67 below that plasma can be imported?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for the pointer!

Comment thread cpp/src/plasma/test/test.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please specify valgrind and plasma options as part of the pytest call, see the --parquet option in pyarrow/tests/conftest.py

@pcmoritz pcmoritz force-pushed the plasma-cython branch 8 times, most recently from 8f7b360 to dc3fecb Compare July 19, 2017 07:38
@pcmoritz pcmoritz force-pushed the plasma-cython branch 4 times, most recently from de14fa9 to 1609f48 Compare July 19, 2017 18:41
@pcmoritz

Copy link
Copy Markdown
Contributor Author

The build is now green and it's ready to merge. One possible reservation is that travis times are quite long now, any thoughts on that?

@wesm

wesm commented Jul 23, 2017

Copy link
Copy Markdown
Member

I will give this a last look and merge later today assuming no issues now that the 0.5.0 release vote has passed. I will also look at the build times to see if there's anything that needs to be done

@pcmoritz

Copy link
Copy Markdown
Contributor Author

Great, thanks! Next steps from our side is a little bit more high level documentation (I'll create a PR once this is merged) and then the blog post.

@wesm

wesm commented Jul 23, 2017

Copy link
Copy Markdown
Member

Sounds good. We can probably publish the blog post right after the IP clearance process is concluded

@wesm

wesm commented Jul 24, 2017

Copy link
Copy Markdown
Member

There's a significant amount of redundant work happening in the build. We should probably only be testing the built-in third-party toolchain (the ExternalProjects) in one entry in the build matrix and using toolchain libraries in all the others. This would probably be simplest to tackle in a follow up patch. I can take a look

@wesm wesm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments -- sorry to delay getting this in, but I think we should remove the MutableBuffer classes and instead have a mutable property that forwards buffer->is_mutable(). I can look into speeding up the builds in a separate patch

Comment thread python/doc/source/api.rst Outdated
ObjectID
PlasmaClient
PlasmaBuffer
MutablePlasmaBufferMutablePlasmaBuffer

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo. Can address when working on documentation generally

Comment thread python/pyarrow/includes/libarrow.pxd Outdated
int64_t GetExtentBytesWritten()

cdef cppclass CFixedSizeBufferWrite" arrow::io::FixedSizeBufferWriter"(WriteableFile):
CFixedSizeBufferWrite(const shared_ptr[CBuffer]& buffer)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo

Comment thread python/pyarrow/io.pxi Outdated
return self.size


cdef class MutableBuffer(Buffer):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since buffers have an is_mutable() function, is this needed (versus adding a mutable property to Buffer)?

Comment thread python/pyarrow/plasma.pyx Outdated
self.wr_file.reset(new CFixedSizeBufferWrite(buffer.buffer))
self.is_readable = 0
self.is_writeable = 1
self.is_open = True

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to move this to io.pxi

Comment thread python/pyarrow/plasma.pyx Outdated
self.client.release(self.object_id)


cdef class MutablePlasmaBuffer(MutableBuffer):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if this can be collapsed per the discussion above with MutableBuffer

Comment thread python/pyarrow/tests/test_plasma.py Outdated
use_valgrind=os.getenv("PLASMA_VALGRIND") == "1")
# Connect to Plasma.
self.plasma_client = plasma.PlasmaClient()
self.plasma_client.connect(plasma_store_name, "", 64)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding a top level function pyarrow.plasma.connect that returns a fully-baked client? This API is a slight rough edge

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right, that seems better!

Comment thread python/setup.py
self.with_plasma = strtobool(
os.environ.get('PYARROW_WITH_PLASMA', '0'))
if self.with_plasma:
if self.with_plasma and "plasma" not in self.CYTHON_MODULE_NAMES:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function initialize_options seems to have been called twice and plasma was added twice to CYTHON_MODULE_NAMES, which broke the "python setup.py develop". If there is a better fix let me know!

Comment thread python/setup.py
def _failure_permitted(self, name):
if name == '_parquet' and not self.with_parquet:
return True
if name == 'plasma' and not self.with_plasma:

@pcmoritz pcmoritz Jul 24, 2017

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this for good measure, not sure where/if it is necessary

Comment thread python/pyarrow/plasma.pyx
c_string manager_socket_name

def __cinit__(self):
def __cinit__(self, store_socket_name, manager_socket_name, int release_delay):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As one rough edge in Cython, if you call the ctor with the wrong parameters you will generally segfault, which is why factory methods are usually safer. We can tackle this in a follow up patch

@wesm wesm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @pcmoritz!!

@asfgit asfgit closed this in a94f471 Jul 24, 2017
@pcmoritz

Copy link
Copy Markdown
Contributor Author

Thanks for merging! If you could look into the ExternalProjects build times that'd be great (I'm not sure what is involved there). I can see if we can make the actual plasma tests a little faster (and can also change the PlasmaClient -> connect API).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants