Our ability to collect data is rapidly outstripping our ability to effectively store and use it. Organizations are therefore facing tough decisions about what data to archive (or dispose of) to effectively meet their business goals. We address this general problem in the context of image data (photos) by proposing which photos to archive in order to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of a photo, the relevance of a photo to a usage pattern, the similarity between different photos, and policy requirements on which photos must be retained. We formalize the photo archival problem, analyze its complexity, and give two approximation algorithms. One algorithm comes with an optimal approximation guarantee; the other, more scalable algorithm comes with both worst-case and data-dependent guarantees. Based on these algorithms, we implemented an end-to-end system, PHOcus, and discuss how to automatically derive the inputs for this system in many settings. An extensive experimental study based on public as well as private datasets demonstrates the effectiveness and efficiency of PHOcus. Furthermore, a user study with business analysts in a real e-commerce application shows that PHOcus can save a tremendous amount of human effort and yield unexpected insights.
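To make the problem statement concrete, the following is a minimal sketch of budgeted photo selection. It is not the PHOcus implementation and not either of the paper's two algorithms: it is a generic cost-benefit greedy heuristic over a facility-location style objective, chosen here only for illustration, and it makes no claim to the approximation guarantees mentioned above. The `Photo` fields, the single `importance` score (standing in for usage, quality, and relevance), the `sim` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    name: str
    size: int               # storage cost (must be positive), e.g. in KB
    importance: float       # hypothetical aggregate of usage, quality, relevance
    required: bool = False  # policy says this photo must stay online


def coverage(selected, photos, sim):
    """Score a selection: each photo counts in proportion to its
    similarity to the most similar retained photo."""
    if not selected:
        return 0.0
    return sum(p.importance * max(sim(p, s) for s in selected) for p in photos)


def greedy_archive(photos, budget, sim):
    """Return (kept, archived) name lists; kept photos fit within the budget."""
    kept = [p for p in photos if p.required]  # policy-mandated photos first
    used = sum(p.size for p in kept)
    if used > budget:
        raise ValueError("required photos alone exceed the budget")
    candidates = [p for p in photos if not p.required]
    while True:
        base = coverage(kept, photos, sim)
        best, best_ratio = None, 0.0
        for p in candidates:
            if used + p.size > budget:
                continue  # does not fit in the remaining budget
            # Marginal gain in coverage per unit of storage.
            ratio = (coverage(kept + [p], photos, sim) - base) / p.size
            if ratio > best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            break  # nothing fits, or nothing adds value
        kept.append(best)
        candidates.remove(best)
        used += best.size
    kept_names = [p.name for p in kept]
    archived = [p.name for p in photos if p.name not in kept_names]
    return kept_names, archived


# Hypothetical data: "a" and "a_dup" are near-duplicates, "c" is required.
photos = [
    Photo("a", 4, 1.0),
    Photo("a_dup", 4, 1.0),
    Photo("b", 3, 2.0),
    Photo("c", 5, 0.5, required=True),
]


def sim(p, q):
    if p.name == q.name:
        return 1.0
    if {p.name, q.name} == {"a", "a_dup"}:
        return 0.9
    return 0.0


budget = 12
kept, archived = greedy_archive(photos, budget, sim)
```

In this toy run the required photo is retained first, and the similarity term makes a near-duplicate contribute little once its twin is kept, so the duplicate is the natural candidate for archival when the budget is tight.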
| Number of pages | 13 |
| Journal | Advances in Database Technology - EDBT |
| State | Published - 20 Mar 2023 |
| Event | 26th International Conference on Extending Database Technology, EDBT 2023 - Ioannina, Greece (28 Mar 2023 → 31 Mar 2023) |