Retrieval of similar cases is an important approach in the design of automated systems that support the radiologist's decision-making process. The ability to use machine learning for image retrieval is limited by the availability of ground truth on similarity between dataset elements (e.g. between nodules). Consequently, past approaches have focused on manual feature extraction and unsupervised methods. More recent medical retrieval studies have learned similarity from a binary classification task (e.g. malignancy) and have also evaluated performance within a binary classification framework. Such a similarity measure is far from adequate and fails to capture true retrieval performance. The current study explores similarity learning in the context of lung nodule retrieval, using the public LIDC dataset. LIDC offers nodule annotations that include ratings of nine characteristics per nodule. These ratings are used as our gold-standard similarity measure. Four architectures that share the same core network are explored. They correspond to four distinct tasks: binary classification, binary similarity, rating regression and similarity regression. Results show a clear discrepancy between classic performance measures and the correlation with the reference similarity measure: all methods had precision in the range of 0.73-0.75, while rating correlation ranged from 0.22 to 0.51, with the highest correlation achieved by the rating-regression approach. Additionally, a measure of the uniformity of the embedding space (Hubness) is introduced. The importance of Hubness as an independent success criterion is explained, and the measure is evaluated for all architectures. Our rating-regression network reached state-of-the-art results in several tasks.