I used the following tutorial series from Apple on ‘Capturing and Displaying Photos’. It’s great. It just took me a while to piece together how the components of AVFoundation work together.

Working with AVFoundation isn’t the kind of thing where you read the documentation and can then piece things together. It’s complicated. You might understand an individual step, but not when it should be done. There are just too many components, and while it’s not low-level, it’s certainly a different kind of iOS development. This is why I think good tutorials are crucial.

High-Level Anatomy of a Camera Capture Session

  • Input: The device that can give you video, photo or audio. Examples: Rear-facing camera, front-facing camera or built-in mic.
  • Output: The actual output from the device. It can be a photo or a video stored to disk, or something you process against, e.g. detecting faces, scanning barcodes, or applying filters. The output also provides the methods to capture photos or record videos.
  • Preview: A preview layer that is a mirror of the camera input. This is also known as the viewfinder.
  • Stream: The output of the camera isn’t a single image; it’s a stream of frames. You usually capture a single frame as a photo or a series of frames as video, though modern cameras often merge frames to compensate for low light, reduce noise, etc. The stream is also fed to the preview so you can see what the camera sees.
  • Session: The object that gives you an interface to add/remove inputs and outputs, configure the session, and start or stop it. (A minimal setup sketch follows the diagram below.)
         [ Camera Device (Input) ]
                   │
                   ▼
         ┌────────────────────────┐
         │   AVCaptureSession     │ ←─ central controller
         └────────────────────────┘
                   │
           (Stream of Frames)
                   │
        ┌──────────┼───────────┐
        ▼                      ▼
[ AVCaptureVideoPreviewLayer ] [ AVCaptureOutput ]
     (Live Preview UI)            │
                                  ▼
          ┌───────────────────────────────────────────┐
          │          Choose one or more outputs:      │
          │                                           │
          │  ▸ AVCaptureMovieFileOutput   (record)    │
          │  ▸ AVCaptureVideoDataOutput   (process)   │
          │  ▸ AVCapturePhotoOutput       (take photo)│
          └───────────────────────────────────────────┘
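To make this concrete, here’s a minimal sketch (not Apple’s sample code) of wiring these pieces together: a rear camera input, a photo output, and the session that connects them. It assumes camera permission has already been granted and that configuration runs off the main thread; the error type is mine.

```swift
import AVFoundation

enum CameraError: Error { case noCamera, cannotAddOutput } // hypothetical error type

// Minimal sketch: build a photo capture session with one input and one output.
func makePhotoSession() throws -> (AVCaptureSession, AVCapturePhotoOutput) {
    let session = AVCaptureSession()
    session.beginConfiguration()
    session.sessionPreset = .photo

    // Input: the rear-facing wide-angle camera.
    guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                               for: .video,
                                               position: .back),
          let input = try? AVCaptureDeviceInput(device: camera),
          session.canAddInput(input)
    else { throw CameraError.noCamera }
    session.addInput(input)

    // Output: the object you later ask to capture a photo.
    let photoOutput = AVCapturePhotoOutput()
    guard session.canAddOutput(photoOutput) else { throw CameraError.cannotAddOutput }
    session.addOutput(photoOutput)

    session.commitConfiguration()
    session.startRunning()
    return (session, photoOutput)
}

// Preview (viewfinder): a layer that mirrors the session's stream.
// let previewLayer = AVCaptureVideoPreviewLayer(session: session)
// previewLayer.videoGravity = .resizeAspectFill
```

The preview layer attaches to the same session, which is how the stream of frames in the diagram reaches both the preview and the output.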

Some interesting notes from the docs about capturing a photo:

💡 When you take a photo, you want to capture an image with the highest possible resolution. This contrasts with the preview images, which tend to have a lower resolution to facilitate rapidly updating previews.

💡 You might wonder why capturePhoto doesn’t just return the photo. That’s because capturing a photo takes time: the camera may need to focus, or wait for the flash, and then there’s the exposure time. The capturePhoto method is asynchronous, with the captured photo typically arriving a short time after you tap or click the shutter button.
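As a rough sketch of that asynchronous flow (the class name here is mine, not from Apple’s sample), the photo arrives through a delegate callback some time after capturePhoto is called:

```swift
import AVFoundation

// Sketch: capturePhoto(with:delegate:) returns immediately; the photo arrives
// later through the AVCapturePhotoCaptureDelegate callback.
final class PhotoCaptureDelegate: NSObject, AVCapturePhotoCaptureDelegate {
    func photoOutput(_ output: AVCapturePhotoOutput,
                     didFinishProcessingPhoto photo: AVCapturePhoto,
                     error: Error?) {
        if let error {
            print("Capture failed: \(error)")
            return
        }
        // `photo` is the AVCapturePhoto discussed further down in this post.
        print("Captured \(photo.fileDataRepresentation()?.count ?? 0) bytes")
    }
}

// Usage: keep a strong reference to the delegate until the callback fires.
// photoOutput.capturePhoto(with: AVCapturePhotoSettings(), delegate: delegate)
```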

There were two other similar components that took me a bit to distinguish:

| Role | Scope | Examples |
| --- | --- | --- |
| Capture settings | Changes how the photo is captured | Whether to use the flash, photo quality, codec type, etc. |
| Stream settings | Changes how the stream is provided from the input to the output | Video orientation, whether the video should be mirrored (for front-facing cameras), stabilization settings; can also disable the microphone |

In short:

| Component | Role |
| --- | --- |
| AVCaptureSession | Central pipeline that connects inputs, outputs, and preview layers |
| AVCaptureDeviceInput | Camera (or mic) source input |
| AVCaptureVideoPreviewLayer | Visual live feed for the user-facing preview |
| AVCaptureOutput | Output target for the capture session |
| AVCapturePhotoSettings | Changes how the photo is taken |
| AVCaptureConnection | Changes how the stream is provided from the input to the output |
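Here’s a small sketch of how those two kinds of settings are used in practice (the connection tweaks are illustrative and depend on your setup; check supportedFlashModes before relying on flash):

```swift
import AVFoundation

// AVCapturePhotoSettings: per-photo capture settings (how this photo is taken).
let settings = AVCapturePhotoSettings()
settings.flashMode = .auto
// photoOutput.capturePhoto(with: settings, delegate: delegate)

// AVCaptureConnection: stream settings between an input and an output.
// if let connection = photoOutput.connection(with: .video) {
//     connection.automaticallyAdjustsVideoMirroring = false
//     connection.isVideoMirrored = true   // e.g. for a front-facing camera
// }
```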

To take this up a notch:

From the Apple docs: the image nicely highlights how you can combine inputs to create different outputs.

Congratulations. Now you know how a photo is taken. But that’s not enough to be able to show it in Swift. You need to jump through a few more hoops before you can render it as an image. Continue reading.

AVCapturePhoto vs CGImage vs Image vs PHAsset

  • AVCapturePhoto: The raw output from the camera. It’s not directly displayable.
  • PHAsset: A Photos framework object that represents an image or video in the user’s Photos library. It’s just a reference — it doesn’t contain pixel data. You fetch the actual bitmap/video using PHImageManager or PHAssetResourceManager.
  • CGImage: The bitmap. You can draw it in a context or pass it to SwiftUI’s Image.
  • Image (SwiftUI): This is just a view. It’s the UI representation of something visual, like a CGImage, or a named asset. An Image doesn’t store pixel data — it simply renders what you give it in SwiftUI.

The photo (the raw model) is encoded data stored on disk, with an association to the Photos library (managed by PHAsset). The SwiftUI Image is a view in your app; it can’t work with an AVCapturePhoto directly, it needs the photo in the form of a CGImage before it can render it.

Typical flow: AVCapturePhoto → (decode) → CGImage → Image (SwiftUI) → display, or save to Photos → results in a PHAsset for later retrieval.
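A minimal sketch of the decode step, assuming the photo arrived via the delegate callback shown earlier:

```swift
import AVFoundation
import SwiftUI

// Sketch: AVCapturePhoto → CGImage → SwiftUI Image.
func makeSwiftUIImage(from photo: AVCapturePhoto) -> Image? {
    // Decode the captured photo into a bitmap.
    guard let cgImage = photo.cgImageRepresentation() else { return nil }
    // Wrap the bitmap in a view; orientation handling is simplified here.
    return Image(decorative: cgImage, scale: 1.0)
}
```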

Other Notes

Viewfinder

The term ‘viewfinder’ is also used for the part of a physical camera that the photographer looks through to view a scene.

This was interesting to know, because otherwise you might ask yourself: what is there to find? Then I realized it’s a common photography term.

An optical viewfinder is used to help the photographer view a scene.

Can I test the camera using the simulator?

Yes and no.

  • In the iOS Simulator, you can’t.
  • On macOS, you’re building the app straight onto your Mac, which gives you access to the built-in (front-facing) camera.

It’s a nice workaround when you’re away from your iPhone.
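One way to detect the difference at runtime (a sketch, not from the sample): the simulator simply has no capture device, so the default-device lookup returns nil.

```swift
import AVFoundation

// In the iOS simulator there is no capture device, so this returns nil.
// On a Mac build, the built-in camera is returned instead.
if AVCaptureDevice.default(for: .video) == nil {
    print("No camera available (likely running in the simulator)")
} else {
    print("Camera available")
}
```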

Apple’s sample can be a bit overwhelming

While I did say the sample code was great, it often does ‘everything’. This means it may overdo things when you just want to learn the basics, and it may not explain why it does everything. It may also miss a thing or two, or have mistakes in it. The worst are code paths that you can’t hit. You start to wonder why Apple added them. My guess is that it’s their way of documenting things; it doesn’t matter to them if the code path isn’t hit.

Example: this code never gets hit when I capture a photo. I’m guessing that’s because the album I used is smartAlbumUserLibrary, which already includes every image taken. But I’m not sure.

// From Apple's sample: adds the new asset to a custom album, if the album allows it.
if let albumChangeRequest = PHAssetCollectionChangeRequest(for: assetCollection),
   assetCollection.canPerform(.addContent) {
    let fastEnumeration = NSArray(array: [assetPlaceholder])
    albumChangeRequest.addAssets(fastEnumeration)
}
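For context, here’s a rough sketch of the save path that surrounds that snippet (variable names like photoData are hypothetical): creating the asset is what puts the photo in the user’s library, so the album-add branch above only matters for custom albums.

```swift
import Photos

// Hypothetical sketch of saving captured photo data to the Photos library.
let photoData = Data() // e.g. photo.fileDataRepresentation()
PHPhotoLibrary.shared().performChanges({
    let creationRequest = PHAssetCreationRequest.forAsset()
    creationRequest.addResource(with: .photo, data: photoData, options: nil)
    // With smartAlbumUserLibrary, the new asset already shows up in the library,
    // so no explicit addAssets(_:) call is needed.
}, completionHandler: { success, error in
    print("Saved: \(success), error: \(String(describing: error))")
})
```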