TY - JOUR
T1 - Efficiently Archiving Photos under Storage Constraints
AU - Davidson, Susan B.
AU - Gershtein, Shay
AU - Milo, Tova
AU - Novgorodov, Slava
AU - Shoshan, May
N1 - Publisher Copyright: © 2023 Copyright held by the owner/author(s)
PY - 2023/3/20
Y1 - 2023/3/20
N2 - Our ability to collect data is rapidly outstripping our ability to effectively store and use it. Organizations are therefore facing tough decisions of what data to archive (or dispose of) to effectively meet their business goals. We address this general problem in the context of image data (photos) by proposing which photos to archive to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of a photo, the relevance of a photo for a usage pattern, the similarity between different photos, as well as policy requirements of what photos must be retained. We formalize the photo archival problem, analyze its complexity, and give two approximation algorithms. One algorithm comes with an optimal approximation guarantee; the other, more scalable, algorithm comes with both worst-case and data-dependent guarantees. Based on these algorithms, we implemented an end-to-end system, PHOcus, and discuss how to automatically derive the inputs for this system in many settings. An extensive experimental study based on public as well as private datasets demonstrates the effectiveness and efficiency of PHOcus. Furthermore, a user study using business analysts in a real e-commerce application shows that it can save a tremendous amount of human effort and yield unexpected insights.
UR - http://www.scopus.com/inward/record.url?scp=85165026394&partnerID=8YFLogxK
U2 - 10.48786/edbt.2023.50
DO - 10.48786/edbt.2023.50
M3 - Conference article
AN - SCOPUS:85165026394
SN - 2367-2005
VL - 26
SP - 591
EP - 603
JO - Advances in Database Technology - EDBT
JF - Advances in Database Technology - EDBT
IS - 3
T2 - 26th International Conference on Extending Database Technology, EDBT 2023
Y2 - 28 March 2023 through 31 March 2023
ER -