Deep subspace clustering is an effective approach to clustering high-dimensional data, and it provides state-of-the-art results in clustering hyperspectral images~(HSI). However, methods of this kind typically rely on a so-called self-expression matrix whose size grows quadratically with the size of the image to be clustered. This places significant demands on computing power and storage capacity, making these methods difficult to apply to large-scale data. The recently proposed Efficient Deep Embedded Subspace Clustering instead learns the bases of the different subspaces, which requires far fewer parameters. Here, we extend and generalize this approach to account for local and non-local spatial context. We propose a structured model-aware deep subspace clustering network for hyperspectral images in which the contextual information is captured by appropriately defined loss functions. A self-supervised loss captures the local spatial structure, and the non-local structure is incorporated through a contrastive loss that encourages pixels with small feature distances to have the same prediction and pixels with large feature distances to have distinct predictions. Experiments on real-world hyperspectral datasets demonstrate clear advantages over state-of-the-art subspace clustering methods.
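
To make the quadratic growth of the self-expression matrix concrete, the following minimal sketch estimates the memory needed to store it densely. It assumes one matrix row and column per pixel and 4-byte floating-point entries; the function name `self_expression_bytes` and the image sizes are hypothetical and chosen only for illustration.

```python
def self_expression_bytes(height: int, width: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed for a dense N x N self-expression matrix, N = height * width.

    Assumption: one row and one column per pixel, stored densely.
    """
    n_pixels = height * width
    return n_pixels * n_pixels * bytes_per_entry


# Hypothetical square image sizes, used only to illustrate the growth rate.
for side in (100, 200, 400):
    gib = self_expression_bytes(side, side) / 2**30
    print(f"{side}x{side} image: {gib:.2f} GiB")
```

Under these assumptions, doubling each side of the image quadruples the number of pixels and therefore multiplies the storage for the matrix by 16.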