A Transcription Factor Atlas of Directed Differentiation

J.J. and F.Z. conceived the study. J.J. designed all experiments. J.J., P.C.K., V.K.V., and A.S. performed experiments. S.M., T.T., K.R.G., and O.K. prepared SHARE-seq libraries under the supervision of A.R. and J.D.B. B.G. and M.A.A. performed and analyzed the electrophysiological recordings under the supervision of Z.F. J.J. analyzed data with assistance from S.M., T.T., and W.E.A. and guidance from A.R. O.O.A. performed ChIP-seq motif analysis. J.J., P.C.K., and J.S.G. quantified immunofluorescence images. F.Z. supervised the research and experimental design with support from R.K.M. J.J., R.K.M., A.R., and F.Z. wrote the paper with input from all authors.

The publisher's final edited version of this article is available free at Cell This article has been corrected. See Cell. 2024 May 01; : .

Associated Data

Sup Data 2: Data S2 ∣ Duration of established TF-driven differentiation protocols. For each protocol, the cell type, TFs, time required for TF overexpression, and PubMed reference number are listed.

GUID: C76FB1E0-47A3-451A-876F-EEC0C3FC34F2

Sup data 1: Data S1 ∣ TF splice isoforms in the MORF library. The MORF library consisted of 1,836 TF genes encoded by 3,548 splice isoforms that overlapped between RefSeq and Gencode annotations, as well as 2 control vectors expressing GFP and mCherry. 593 of the 3,548 isoforms were obtained from the Broad Genomic Perturbation Platform (Broad GPP). Some of the Broad GPP TF ORFs contained V5 epitope tags. The rest of the isoforms were synthesized by Genewiz. All constructs were sequence verified and paired with unique 24-bp barcodes.

GUID: AB4F14A7-CE8E-4F8F-BC1B-698FD35EB800

Sup data 3: Data S3 ∣ TF enrichment for bulk screening in different cell culture media. Fold enrichment of 3,548 TFs after differentiating in 7 different cell culture media (M1-M7) for 7 days. For unsorted cells, enrichment was measured relative to the lentiviral library. For sorted cells, enrichment was measured as the TF barcode count in cells expressing low levels of pluripotency markers relative to those expressing high levels.

GUID: 2209EFA1-7DFA-4FAF-81F0-87C2BCEB7BEE

Sup data 5: Data S5 ∣ Splice isoform annotations and differentiation outcome. Splice isoform annotations for 9 TF genes indicating differences between isoforms and whether domains that may be important for function are missing. The diffusion pseudotime P-values are included for each isoform. AA, amino acid.

GUID: B56C9336-BE0B-45A9-915D-F71FF61EDC40

Sup data 4: Data S4 ∣ Pseudotime analysis of TF overexpression. Diffusion and RNA velocity pseudotime analysis methods were used to order cells from the TF Atlas in a differentiation trajectory. (A) Differentially expressed genes over pseudotime. Linear regression was applied to identify genes that were significantly differentially expressed (FDR < 0.05) over pseudotime. Genes were ranked by the slope of the linear regression fit. The 1,000 most upregulated and downregulated genes are listed. (B) Effects of overexpressing each TF on pseudotime. Pseudotimes for each TF were compared to GFP and mCherry controls to determine the average difference and associated P-value.

GUID: AE25FE16-A07B-43E0-9322-11B57AB5D124

Sup data 6: Data S6 ∣ Grouping of TF ORFs using NMF and pairwise correlation. (A) Hierarchical clustering results for TF grouping methods using non-negative matrix factorization (NMF) or pairwise correlation. The order of TFs and assigned clusters are listed for each method. (B) NMF gene programs. The top 100 genes with the highest weights for each of the 100 NMF gene programs are shown. Programs 1-50 are enriched and programs 51-100 are depleted.

GUID: 7C0E79B5-B8CB-40C9-B634-1670B920427C

Sup data 8: Data S8 ∣ Cluster marker genes for differentiated cells in the TF Atlas. Cells were ordered by diffusion pseudotime relative to Cluster 0. Linear regression was applied to identify genes that were significantly differentially expressed (FDR < 0.05) over pseudotime. The slope of the linear regression fit and associated P-values are listed for each marker gene.

GUID: 12DD8D22-4F99-4149-9940-05847F08CCAA

Sup Data 10: Data S10 ∣ Marker genes used to validate candidate TFs for differentiation of reference cell types. Marker genes commonly used for each cell type were selected based on available literature. For each marker gene, the associated cell type, literature references, and whether the marker gene was identified as a differentially expressed gene (DEG) for the respective cell type by the TF Atlas are provided. EMT, epithelial-mesenchymal transition; PNS, peripheral nervous system.

GUID: 25EFAF82-6F01-43FA-90CA-D4DAD83A41C5

Sup Data 12: Data S12 ∣ Targeted TF screening results. (A) Screens of 90 TFs in hESCs for iNP differentiation. Ranks for each screening method are reported. Ranks were considered “NA” if the results of TF overexpression were not detectable. (B) Screen of 54 TFs in RFX4-iNPs for astrocyte differentiation using flow-FISH. S1 and S2 represent biological replicates. (C) Screen of 80 TFs in EOMES-iCMs for atrial, ventricular, or mature cardiomyocyte differentiation using flow-FISH.

GUID: 9490E47F-A575-49CB-89A2-5F3286808E0A

Sup data 13: Data S13 ∣ Differentially expressed genes in bulk RNA-seq datasets. Genes that were significantly differentially expressed (FDR < 0.05) relative to control are listed with associated fold change and P-values for TF ORF overexpression (A) and DYRK1A perturbations (B).

GUID: E5BCD41B-3DF8-42BD-A656-54FAD55FD2F0

Sup data 15: Data S15 ∣ Cluster marker genes for each scRNA-seq dataset. The top 30 marker genes and associated P-values are listed for each dataset and cluster. (A) Cells that have been spontaneously differentiated from iNPs for 8 weeks. iNPs were derived using RFX4, NFIB, ASCL1, or PAX6 with n = 2 biological replicates. (B) iNPs derived using three iNP differentiation methods: dual SMAD inhibition, embryoid body formation, and RFX4 overexpression combined with dual SMAD inhibition. Data represents n = 2 batch replicates. (C) Cells that have been spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. (D) Cardiomyocytes differentiated using EOMES overexpression or GSK and WNT inhibition for 4 weeks. Data represents n = 2 biological replicates.

GUID: 8B5A7118-5DA9-49C5-855D-0E7B5E2CC92E

Sup data 18: Data S18 ∣ Genes with TF ChIP-seq peaks. Genes with transcriptional start sites that that were within 10kb of the TF ChIP-seq peak region identified by MACS are listed.

GUID: F424383E-2625-4AB0-94ED-1CE96C177F10

Sup data 21: Data S21 ∣ Predicted TF combinations for reference cell types. (A) Hierarchical clustering of TF ORFs for TF Atlas differentiated cells into 365 clusters. (B) Predicted double TFs for reference cell types from the human fetal cell atlas (Cao et al., 2020). Combinations were ranked based on the cell type-specific gene signature score. Only the top 10% of possible combinations are shown. Combinations are presented as clusters from (A). (C) Same as (A), for 151 clusters. (D) Same as (B), for triple combinations using clusters from (C). Only the top 1% of possible combinations are shown.

GUID: 2C132A36-DFF2-48FB-83EB-38D8A3365676

Sup data 22: Data S22 ∣ List of reagents used in this study. Reagents included sgRNAs (A), primers (B), TaqMan qPCR probes (C), Flow-FISH probes (D), and antibodies (E).

GUID: 109D06D2-4939-462A-8648-250777E9F506

sup data 9: Data S9 ∣ Differential gene expression analysis for differentiated cells. Smoothened heatmap showing expression of marker genes for each cluster of differentiated cells from Figure 3A . Cells are sorted by cluster followed by diffusion pseudotime.

GUID: F0ED13B7-71D1-4BD7-B52E-399436C4BA4A

Sup data 7: Data S7 ∣ Unbiased clustering of TFs based on Pearson correlation of gene expression. (A) Heatmap showing pairwise Pearson correlation for mean expression profiles of 3,266 TF ORFs. TFs are ordered by hierarchical clustering. Relations between some groups of TFs are annotated, prioritizing groups that overlap with those from Figure 2 . Numbers in parentheses indicate the number of TF isoforms in the same group. (B,C) Zoomed in subsets of (A).

GUID: 09BD5234-D38F-4DB6-9825-A425F4C22838

sup data 11: Data S11 ∣ Validation of candidate TFs in other stem cell lines for differentiation towards nominated cell types. (A,B) Expression of marker genes for each nominated cell type in H9 hESCs (A) or 11a iPSCs (B) after 7 days of candidate TF or GFP overexpression. Numbers after TF gene names indicate the isoform. n = 4. Values represent mean ± SEM. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns, not significant; ND, not detected. (C) Controls for data in Figure 4D - ​ -K K (immunostaining of marker genes in H1 hESCs). Scale bar, 25 μm. (D) Heatmap showing enrichment of TF Atlas candidate TFs in the bulk TF screen across 7 media conditions (M1-M7).

GUID: 2A96F18D-826D-4BDC-B4D5-9E9CD6C97832

sup data 14: Data S14 ∣ Validation of candidate TFs driving iNP differentiation. (A) Western blot showing expression of candidate TFs measured via the V5 epitope tag after 7 days of differentiation. (B,C) Heatmaps showing correlation between bulk RNA-seq expression profiles of iNPs and human fetal cortex (Nowakowski et al., 2017) (B) or brain organoid (Quadrato et al., 2017) (C) cell types. D07 and D12 indicate number of days of ORF overexpression. RG, radial glia; IPC, intermediate progenitor cell; IN, interneuron; div, dividing; oRG, outer radial glia; tRG, truncated radial glia; vRG, ventricular radial glia; MGE, medial ganglionic eminence; nEN, newborn excitatory neurons, EN, excitatory neurons; PFC, prefrontal cortex; V1, primary visual cortex; nIN, newborn interneurons; CTX, cortex; CGE, cortical ganglionic eminence; STR, striatum; OPC, oligodendrocyte precursor cells; Glyc, cells expressing glycolysis genes; Pro, proliferating progenitors; NE, neuroepithelium; DN, dopaminergic neurons; CLN, callosal neurons; CFN, corticofugal neurons; Meso, mesodermal progenitors. (D) Expression of marker genes for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (NG2) in cells spontaneously differentiated for 1, 2, 4, or 8 weeks from iNPs produced by candidate TFs in HUES66 hESCs. (E,F) Expression of NP marker genes in iNPs generated using 11a iPSCs (E) or H1 hESCs (F) after 7 days of TF overexpression. (G,H) Expression of marker genes for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (NG2 and PDGFRA) in cells spontaneously differentiated from 11a iPSC (G) or H1 hESC (H) iNPs for 8 weeks. Scale bar, 100μm.

GUID: 8AEA67F6-30C9-45B6-897F-1F7B7DC9F7BC

sup data 16: Data S16 ∣ Profiling cells spontaneously differentiated from iNPs using scRNA-seq. (A) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. Horizontal lines distinguish between major cell types. Pro, uncommitted progenitors; RP, retinal progenitors; RPE, retinal pigment epithelium; PR, photoreceptors; RGC, retinal ganglion cells; DNP, dorsal neural progenitors; RG, radial glia; Astro, astrocytes; CN, CNS neurons; EPD, ependyma; EP, epithelial progenitors; BE, bronchial epithelium; CE, cranial epithelium; CNC, cranial neural crest; CNCP, cranial neural crest progenitors; P, proliferative cells. (B-D) UMAP of 4,162 neurons from clusters CN 1-3 of Figure S4O. (B,C) Marker genes for general regions of the central nervous systems (B) and neuronal subtypes (C) are shown. Color indicates gene expression. (D) Neurons spontaneously differentiated from each candidate TF are highlighted. Colors indicate biological replicates, S1 and S2.

GUID: 9BEECC96-58C7-49E7-A5E5-4ED9C07CD9A4

sup data 19: Data S19 ∣ Optimization of RFX4-iNP and EOMES-iCM differentiation for disease modeling. (A-D) Optimization of RFX4-iNP differentiation. (A) Schematic for different media conditions (M1-M8) tested. SMAD inhibitors dorsomorphin (DM) and SB-431542 (SB) were added to the media at the indicated concentrations. mTeSR stem cell media was changed to different NP media (NP, EB, and DS; see Methods) over 7 days of differentiation. (B) Heatmaps showing expression of neuron marker genes TUJ1 and MAP2 relative to GAPDH control in cells from iNPs that have undergone spontaneous neurogenesis for 2 or 4 weeks. iNPs were differentiated for 5 or 7 days using each of the media conditions in (A) and seeded at low or high densities prior to spontaneous neurogenesis. Colors represent mean expression from n = 4 biological replicates. (C) Same as (A), for additional media conditions tested. (D) Same as (B), for the media conditions shown in (C). (E-G) UMAP of scRNA-seq data from 26,111 cells that have been spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. Marker genes for general regions of the central nervous system (E), radial glia subtypes (F), and GABAergic interneuron subtypes (G) are shown. Colors indicate gene expression. (H,I) Optimization of EOMES-iCM differentiation. Data shows percent of cells that stained for cardiomyocyte marker TNNT2 at day 10 pooled from n = 3 biological replicates. (H) Different EOMES induction duration and seeding densities. (I) EOMES induction for 2 days or GSK and Wnt inhibition at different seeding densities. (J-N) Modeling neurodevelopmental disorders using RFX4-iNPs with DYRK1A perturbation. KO, knockout; NT, nontargeting. (J-L) Volcano plots showing the number of genes that were significantly differentially expressed (FDR < 0.05) and had an absolute log2 fold change >1 for DYRK1A KO sgRNA 1 (J), KO sgRNA 2 (K), and ORF (L) conditions. For a full list of genes, see Data S13B. The KO sgRNAs 1 and 2 conditions were compared to both NT sgRNAs. The ORF condition was compared to GFP control. n = 3. (M) Representative images of MAP2 immunostaining during spontaneous differentiation for NT sg1 and DYRK1A KO sg2. Scale bar, 100μm. (N) Action potential (AP) properties measured using electrophysiology for different DYRK1A perturbations from n = 12-36 neurons with evoked APs. Values represent mean ± SEM.

GUID: BADA372C-A332-493F-86DC-B3FE3CA37EBA

sup figure 1: Figure S1 ∣ Bulk TF screening in different cell culture media, related to Figure 1 . (A-D) Comparison of ORF and CRISPRa. HUES66 hESCs were transduced with ORF, ORF with untranslated regions (UTRs), or SAM CRISPRa to upregulate NEUROD1 (A,B) or NEUROG2 (C,D) for 7 days. n = 4. Values represent mean ± SEM. Scale bar, 50 μm. ****P < 0.0001; ***P < 0.001. NT, nontargeting; sg, single guide RNA. (A,C) Expression of NEUROD1 mRNA and protein (A) or NEUROG2 mRNA (C). (B,D) Immunostaining of marker genes for neurons (MAP2) and neural progenitors (PAX6). (E) MORF vector design. WPRE, Woodchuck Hepatitis Virus Posttranscriptional Regulatory Element. (F) Schematic of bulk TF screening in H1 hESCs. MOI, multiplicity of infection. (G) Duration of established TF-driven differentiation protocols. Dashed line indicates the median. (H) Scatterplots comparing the TF barcode distribution for the plasmid library, lentiviral library, and unsorted cells cultured in stem cell media (M7) after 7 days of TF ORF overexpression. CPM, counts per million; BR1 and BR2, biological replicates; skew, ratio of the 90 th and 10 th percentile CPM. (I) Heatmap showing TF barcode counts in each media condition (M1-M7; Methods) relative to the lentivirus library. The 10 most enriched and depleted TFs are labeled. Numbers after the TF gene name indicate the isoform. (J) Heatmap showing pairwise Pearson correlation between each of the conditions in (I), ordered by hierarchical clustering. (K,L) Scatterplots showing the relation between TF barcode counts and ORF length for the lentivirus library (K) and unsorted cells (L), averaged across 7 media conditions. (M) Heatmap showing TF barcode counts in the sorted differentiated cells relative to stem cells. The 50 most enriched TFs are labeled. (N) Subset of data in (M), highlighting TFs with known roles in development or differentiation. (O) Heatmap showing the pairwise Pearson correlation between each of the conditions in (M), ordered by hierarchical clustering. The top 5% of TFs with the highest average fold change were evaluated. (P) Box plots showing enrichment of 67 developmentally critical TFs from (Parekh et al., 2018) and (N). Whiskers indicate the 10 th and 90 th percentiles.

GUID: 4D14E9A2-1F40-4320-BF60-70116248F2FF

sup figure 2: Figure S2 ∣ TF Atlas data quality control and pseudotime analysis, related to Figure 1 . (A) Violin plots showing distribution of genes, unique molecular identifiers (UMIs), and percent mitochondrial counts per cell in the TF Atlas. (B) Comparison of TF ORF representation between the bulk TF screen and the TF Atlas scRNA-seq. CPM, counts per million. (C) Distribution of cells overexpressing each TF isoform. Dashed lines indicate subsampling thresholds. (D) Scatterplot showing the relation between average TF ORF expression per cell and TF ORF length. (E) Density scatterplot showing, for each cell, expression of the exogenous TF ORF measured via TF barcode counts and the corresponding endogenous TF. (F) UMAPs of TF Atlas scRNA-seq data highlighting cells with indicated ORF. Numbers after TF gene names indicate the isoform. (G) Heatmaps showing percentage of cells with the indicated TF ORF that were assigned to each Louvain cluster from Figure 1B . (H,I) Force-directed graph (FDG) representation of TF Atlas scRNA-seq data. Colors indicate Louvain clusters (H) and diffusion pseudotime (I). (J,K) UMAP of TF Atlas scRNA-seq data. (J) Stream plot of RNA velocities. Line weights indicate speed and colors indicate Louvain clusters. (K) Colors indicate RNA velocity pseudotimes. (L,M) FDG representations of (J) and (K), respectively. (N-Q) Density scatterplots showing RNA velocity pseudotime (N), genes (O), UMIs (P), and TF barcode counts (Q) over diffusion pseudotime for each cell. (R) Smoothened heatmap of the top 1,000 upregulated and downregulated genes over RNA velocity pseudotime. Genes are ordered by change over pseudotime. (S) Scatterplot comparing the differentiation results of the scRNA-seq pseudotime analysis to the bulk TF screen. (T) Significance of the pseudotime difference between cells expressing each TF isoform and those expressing controls. Subset of Figure 1H . Dashed line indicates FDR < 0.05. (U-V) Heatmaps showing pairwise Pearson correlation between TF potential vectors for the KLF (U) and LIM homeodomain (V) TF families. TFs are ordered by hierarchical clustering.

GUID: 9A5629F2-5B6B-4D40-86F8-6FDAF54713E0

sup figure 3: Figure S3 ∣ Mapping and validating differentiated cells against reference cell types, related to Figures 3 and ​ and4. 4 . (A-C) Integration of TF Atlas differentiated cells and human fetal expression atlas (Cao et al., 2020) datasets. (A,B) UMAPs of scRNA-seq data with colors indicating reference cell types (A) and source dataset (B). (C) Heatmap showing the percentage of major cell types annotated by Seurat label-transfer that were mapped by dataset integration (D,E) SingleCellNet (Tan and Cahan, 2019) annotation of TF Atlas differentiated cells. (D) Heatmap showing the scores for TF Atlas cells (columns) against reference cell types (rows). (E) Heatmap showing the percentage of major cell types annotated by Seurat label-transfer that were mapped by SingleCellNet. (F) Heatmaps showing percentage of cells from each cluster that mapped to the indicated reference cell type (top) and enrichment of Gene Ontology (GO) terms in differentially expressed genes (bottom). EMT, epithelial-mesenchymal transition; ENS, enteric nervous system. CNS, central nervous system; diff., differentiation; pos., positive; reg., regulation of; dev., development; migr., migration. (G,H) Heatmaps showing correlation between TF potential vectors and expression profiles of reference cell types that matched (G) or did not match (H) by label-transfer. The top 10 TFs with the highest correlation for each cell type are included. Numbers after TF gene names indicate the isoform. (I-K) Heatmaps showing distances between expression profiles of TF Atlas differentiated cells (columns) and reference cell types (rows). Values represent Euclidean distance in the latent space for the integrated datasets (Figure S3A,B). Cell types are ordered by hierarchical clustering. Cell types derived from the circulatory system (I), myocytes (J), and endoderm (K) are shown. (L,M) Expression of marker genes measured by quantitative PCR after 7 days of candidate TF or GFP overexpression. n = 4. (L) Heatmap showing expression relative to GFP control. (M) Expression of marker genes in H1 hESCs. (N-P) Left, expression of marker genes measured by immunostaining in H1 hESCs after 7 days of TF ORF overexpression. Right, intensity of marker gene staining normalized to GFP control from n = 6 images. Scale bar, 25 μm. Marker genes for stromal (N), lung ciliated epithelial (O), and intestinal epithelial (P) cells are shown. Values represent mean ± SEM. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns, not significant.

GUID: 0D343F10-DAEF-4646-90C7-85F2D28B6243

sup figure 4: Figure S4 ∣ Developing a targeted TF ORF screening platform for generating cell types of interest, related to Figure 5 . (A) Prediction of TFs for iNP differentiation based on correlation between TF potential vectors in the TF Atlas and reference iNP expression datasets. The top 20 TFs are listed. Numbers after TF gene names indicate the isoform. (B) FACS histograms showing EGFP expression in reporter cell lines with or without the targeted TF library. High and low bins indicate populations sorted for sequencing of TF barcodes. (C) Scatterplot showing TF enrichment in reporter cell line screens. n = 3. (D,E) Representative FACS plots showing expression of 2 (D) or 10 (E) NP marker genes labeled by pooled FISH probes. High and low bins indicate populations sorted for sequencing of TF barcodes. (F) Scatterplot showing TF enrichment in flow-FISH screens targeting 2 or 10 NP marker genes. n = 3. (G) Comparison of TF enrichment in reporter cell line and flow-FISH screens. (H-J) TF screening using scRNA-seq for 60,997 cells. (H) Violin plots showing distribution of genes, UMIs, and percent mitochondrial counts per cell. (I) UMAP of scRNA-seq data with colors indicating Louvain clusters. (J) Heatmap showing correlation between mean TF-induced expression profiles and human radial glia (RG) from reference datasets (Eze et al., 2021; Nowakowski et al., 2017; Pollen et al., 2015; Quadrato et al., 2017; Zhong et al., 2018). (K) Scatterplot showing TF enrichment in the arrayed screen. Expression was measured by quantitative PCR. N = 3. (L) Top, expression of NP markers VIM and NES measured by immunostaining after 7 days of TF overexpression. Bottom, heatmap showing correlation between bulk RNA-seq expression profiles of iNPs and human fetal cortex cell types (Pollen et al., 2015). Cell culture media is indicated in parentheses. Scale bar, 50μm. D07 and D12 indicate number of days of ORF overexpression. RG, radial glia; IPC, intermediate progenitor cell; N, neuron; IN, interneuron. (M) Schematic of spontaneous differentiation. dox, doxycycline; EGF, epidermal growth factor; FGF, fetal growth factor. (N) Expression of marker genes for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (PDGFRA) in cells spontaneously differentiated for 1, 2, 4, or 8 weeks from iNPs produced by candidate TFs. Scale bar, 100μm. (O-R) ScRNA-seq profiling of 53,113 cells that have been spontaneously differentiated from iNPs for 8 weeks. iNPs were derived using RFX4, NFIB, ASCL1, or PAX6 with n = 2 biological replicates. Pro, uncommitted progenitors; RP, retinal progenitors; RPE, retinal pigment epithelium; PR, photoreceptors; RGC, retinal ganglion cells; DNP, dorsal neural progenitors; RG, radial glia; Astro, astrocytes; CN, CNS neurons; EPD, ependyma; EP, epithelial progenitors; BE, bronchial epithelium; CE, cranial epithelium; CNC, cranial neural crest; CNCP, cranial neural crest progenitors; P, proliferative cells. (O,P) UMAP of scRNA-seq data with colors indicating Louvain clusters (O) and replicates from each TF (P), S1 and S2. (Q) Heatmap showing the percentage of cells from each replicate that were grouped into each cluster. (R) Distribution of general cell types produced by each replicate. (S-V) ScRNA-seq profiling of iNPs differentiated using different methods. EB, embryoid body; DS, dual SMAD; NP, neural progenitors; CN, CNS neurons; CNC, cranial neural crest. Data represents n = 2 batch replicates with 15,211 RFX4-DS, 11,148 EB, and 16,421 DS. (S) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. (T,U) Box plots showing intra- (T) or inter- (U) replicate Wasserstein distances between cells. Whiskers indicate the 5 th and 95 th percentiles. (V) UMAP of scRNA-seq data highlighting genes that differentiate RFX4-DS-iNPs from EB- and DS-iNPs. Color indicates expression.

GUID: F4997524-78BB-492D-B2D0-BDE88513B70B

sup figure 5: Figure S5 ∣ Characterization of RFX4-iNPs and EOMES-iCMs for sequential TF screening and disease modeling, related to Figure 5 . (A-D) ScRNA-seq profiling of 26,111 cells that have been spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. (A,B) UMAP of scRNA-seq data. Colors indicate expression of marker genes for major cell types (A) and replicates from each time point (B), S1 and S2. (C) Heatmap showing the percentage of cells from each biological replicate that were grouped into each cluster. CN, CNS neurons; RG, radial glia; MNG, meninges; P, proliferative cells. (D) UMAP of scRNA-seq data showing marker genes for neuronal subtypes. Colors indicate gene expression. (E-M) Characterization of EOMES-iCMs. (E) Expression of cardiomyocyte markers TNNT2 and NKX2.5 by immunostaining at day 30 after 2 days of EOMES induction or GSK and Wnt inhibition (GW). Scale bar, 100μm. (F) UMAP of scRNA-seq data from 16,698 cells that have been spontaneously differentiated for 4 weeks after EOMES induction or GW. Data represents n = 2 biological replicates. Colors indicate Louvain clusters. VCM, ventricular cardiomyocytes; ACM, atrial cardiomyocytes; SMM, smooth muscle cells; SKM, skeletal muscle cells; EPTH, epithelial cells; P, proliferating cells. (G) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. (H,I) UMAP of scRNA-seq data with colors indicating marker gene expression (H) and replicates from each differentiation method (I), S1 and S2. (J) Heatmap showing the percentage of cells from each replicate that were grouped into each cluster. (K) Distribution of general cell types produced by each replicate. (L,M) Expression of marker genes for mature cardiomyocytes shown on UMAP (L) and violin plots (M). (N,O) Scatterplots showing TF enrichment in sequential TF screens using flow-FISH. N = 2 replicates, S1 and S2. Numbers after TF gene names indicate the isoform. (N) Screen of 54 TFs in RFX4-iNPs for differentiation of astrocytes. After 7 days, cells were sorted for expression of astrocyte marker genes (NCAN, AQP4, FAM107A, and GFAP). (O) Screen of 80 TFs in EOMES-iCMs for production of atrial, ventricular, or mature cardiomyocytes. After 4 weeks, cells were sorted for expression of atrial (MYBPHL, GJA5, MYH6), ventricular (MYL2, GJA1, and PLN), and mature (MYL2, TNNI3, and TNNT2) cardiomyocyte marker genes. (P-X) Modeling effects of DYRK1A perturbation in RFX4-iNPs derived from 11a iPSCs. sg, single guide RNA; KO, knockout; NT, nontargeting. (P) Percent indels in RFX4-iNPs transduced with DYRK1A KO sgRNAs. n = 3. (Q) DYRK1A expression measured using quantitative PCR probes targeting the endogenous sequence or the ORF sequence. n = 4. (R,S) Western blots of DYRK1A at 7 days after transduction with Cas9 and DYRK1A KO sgRNAs (R) or DYRK1A ORF (S). (T,U) Bulk RNA-seq of DYRK1A perturbations to identify significantly differentially expressed genes (DEGs; FDR < 0.05). n = 3. (T) Venn diagram summarizing the DEGs. (U) Heatmap showing DYRK1A dosage-dependent DEGs. (V-X) Characterization of DYRK1A perturbations by electrophysiology. (V) Representative traces for neurons with or without evoked action potentials (AP) and spontaneous excitatory postsynaptic currents (EPSCs). (W) Proportion of neurons with or without AP and EPSCs from n = 31-45 neurons per condition. (X) Intrinsic membrane properties from n = 12-36 neurons with evoked AP per condition. Values represent mean ± SEM. *P < 0.05; ND, not detected.

GUID: 43B73A81-8031-486F-9C80-1E57160432AB

sup figure 7: Figure S7 ∣ Combinatorial TF screening, modeling, and prediction, related to Figure 7 . (A) Heatmap showing pairwise Pearson correlation between mean expression profiles from the combinatorial screen of 10 TF ORFs in combinations, including 44 doubles and 3 triples, as well as 10 singles. TF combinations are ordered by hierarchical clustering. (B-F) Modeling effects of TF combinations with linear regression. (B-D) Heatmaps showing the coefficient weights (B,C) and score (D) for linear regression. (E) Annotated relationships for each TF combination based on the linear regression coefficients. (F) Heatmaps showing average expression profile of double TFs with those of respective single TFs for example combinations with annotated relationships. (G-M) Predicting TF combinations using the TF Atlas. (G,H) Percent accuracy for different approaches to predict TFs for measured double (G) or triple (H) TF expression profiles. For comparison to the combinatorial TF screen dataset, profiles of the 10 corresponding TFs from the TF Atlas were used. (I-M) Cell type prediction results for triple TF profiles. Predicted combinations for hepatoblasts (I), bronchiolar and alveolar epithelial cells (J), metanephric cells (K), vascular endothelial cells (L), and trophoblast giant cells (M) are shown. As gene signature scores were discrete, the percentile ranks were reported as ranges. TFs that are part of known combinations, developmentally critical, or specifically expressed in the target cell types are indicated in blue.

GUID: 21F78C98-6404-43BE-9C8E-DD1DB70635D1

sup figure 6: Figure S6 ∣ Joint profiling of chromatin accessibility and gene expression on a subset of TF ORFs, related to Figure 6 . (A,B) Violin plots showing distribution of UMIs and genes for scRNA-seq (A), as well as UMIs and fraction of reads in the top 500,000 peaks for scATAC-seq (B) from the joint profiling dataset. Data included 69,085 cells overexpressing 198 TF ORFs for 4 or 7 days. (C) Representative fragment histogram for scATAC-seq data using the first two megabases of chromosome 1. (D) Transcriptional start site (TSS) enrichment score for scATAC-seq data. (E) RNA (left) and ATAC (right) UMAP of joint profiling data. Colors indicate small local moving (SLM) clusters. (F) Distribution of cells from each time point across clusters from Figure 6A . Asterisks indicate clusters with >75% cells from either time point. (G) Weighted nearest neighbor (WNN) UMAP of joint profiling data from Figure 6A , colored by diffusion pseudotime. (H) Violin plots comparing diffusion pseudotimes of each time point. (I) Heatmaps showing gene regulatory networks (GRN) for downstream TFs nominated by evaluating motif enrichment in ATAC peaks with significant peak-gene associations. Only the 10 most significantly enriched TF ORFs are shown for each cluster. TF ORFs that were identified as top ORFs and downstream TFs are labeled in blue.

GUID: A32DFF4F-55F1-404A-BACB-0365C800744F 28. GUID: 6C2F3D70-37E9-489A-ABEF-15BB9985B80C

All raw and processed sequencing data generated as part of this study have been deposited to the Gene Expression Omnibus (GSE216481). Code for the analyses described in this paper are available on Github (https://github.com/fengzhanglab/Joung_TFAtlas_Manuscript).

Summary

Transcription factors (TFs) regulate gene programs, thereby controlling diverse cellular processes and cell states. To comprehensively understand TFs and the programs they control, we created a barcoded library of all annotated human TF splice isoforms (>3,500) and applied it to build a TF Atlas charting expression profiles of human embryonic stem cells (hESCs) overexpressing each TF at single cell resolution. We mapped TF-induced expression profiles to reference cell types and validated candidate TFs for generation of diverse cell types, spanning all three germ layers and trophoblasts. Targeted screens with a subset of the library allowed us to create a tailored cellular disease model and integrate mRNA expression and chromatin accessibility data to identify downstream regulators. Finally, we characterized the effects of combinatorial TF overexpression, developing and validating a strategy for predicting combinations of TFs that produce target expression profiles matching reference cell types to accelerate cellular engineering efforts.

Introduction

Achieving a comprehensive understanding of the gene regulatory networks governing cell states is a fundamental goal in molecular biology. Transcription factors (TFs) bind to specific sequences in the genome to alter gene expression and specify cell states (Amit et al., 2009; Lambert et al., 2018; Ravasi et al., 2010). Overexpression of single TFs can drive profound changes in cell fate. For instance, single TFs have been shown to direct differentiation of pluripotent stem cells towards many different cell types (Ng et al., 2021; Parekh et al., 2018), including muscle (Weintraub et al., 1989) and neurons (Zhang et al., 2013). Overexpression of TF combinations can produce even larger changes in gene regulatory networks, (Pang et al., 2011; Perrier et al., 2004; Sugimura et al., 2017), such as overexpression of the four “Yamanaka factors” (Oct4, Sox2, Klf4, c-Myc) to reprogram fibroblasts into stem cells (Takahashi et al., 2007; Takahashi and Yamanaka, 2006; Yu et al., 2007). These findings underscore the power of TFs to drive changes in cell state and highlight the utility of TF overexpression for understanding the gene expression programs that control cell fates (Yao et al., 2013).

The human genome contains >1,800 TF loci encoding >3,500 isoforms, creating a vast landscape of possible regulatory outcomes. Previous studies have explored aspects of this landscape through observational studies, such as quantitative trait locus mapping to associate TFs with phenotype, and perturbation studies that overexpress or inhibit TFs in model systems. However, perturbation studies have typically had to choose between breadth of perturbation and phenotypic content, either conducting large screens with a simple readout, or small focused screens with detailed readout. For example, in the context of TF overexpression screens, a recent study screened a large library of 1,732 TF isoforms for the focused readout of pluripotency marker expression (Ng et al., 2021); conversely, a smaller analysis of 61 TFs assessed the comprehensive impact of each through single cell profiling (Parekh et al., 2018).

Fully deciphering regulatory circuits, however, requires a systematic approach that integrates both the breadth of a comprehensive screen and the depth of a rich readout, especially of the transcriptomic changes induced by each TF. Here, we systematically mapped expression changes driven by all human TF isoforms at single cell resolution and used this data to identify TFs and their combinations that direct differentiation in human embryonic stem cells (hESCs). We created a barcoded ORF library of 3,548 TF splice isoforms and developed a screening platform to build a TF Atlas of >1 million cells charting expression profile changes induced by TF overexpression in hESCs. Our comprehensive TF Atlas enables both a systematic identification of TFs that drive changes in cell state and generalized observations such as classification of orphan TFs by gene program. Furthermore, we applied our TF Atlas to predict and validate TF combinations for target reference cell types. Our TF library and Atlas thus provide a valuable resource for systematic elucidation of TF gene programs towards a complete understanding of gene regulatory networks that govern cell states.

Results

Development of Multiplexed Overexpression of Regulatory Factors (MORF) library

We first established the most effective method for TF upregulation by comparing CRISPR activation (CRISPRa) (Konermann et al., 2015) and ORF overexpression. ORF overexpression of either NEUROD1 or NEUROG2 in hESCs efficiently induced neuronal differentiation (Zhang et al., 2013), but CRISPRa and ORFs with endogenous untranslated regions (UTRs) did not (Figure S1A-D). This may indicate that hESCs have post-transcriptional regulatory mechanisms encoded in UTRs that buffer against protein expression. We therefore proceeded with ORF overexpression.

To enable pooled screening, we created a barcoded human TF ORF library, Multiplexed Overexpression of Regulatory Factors (MORF; Figure 1A , Figure S1E, and Data S1). The MORF library consists of all 3,548 annotated splice isoforms encoded by 1,836 genes, including histone modifiers. Each isoform is associated with a unique barcode that facilitates TF identification and minimizes ORF length-dependent PCR bias (Dabney and Meyer, 2012). As an arrayed library that could be selectively pooled for targeted screening, MORF is a generalizable resource for discovering TFs that induce phenotypes of interest.

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0001.jpg

Building a TF Atlas of directed differentiation.

(A) Schematic of TF Atlas setup. MOI, multiplicity of infection. (B-D) Uniform manifold approximation and projection (UMAP) of scRNA-seq data from 671,453 cells overexpressing 3,266 TF isoforms. Colors indicate Louvain clusters (B), gene expression (C), and diffusion pseudotime (D). (E) Smoothened heatmap of the top 1,000 upregulated and downregulated genes over diffusion pseudotime. Genes are ordered by change over pseudotime. (F,G) Most enriched pathways among the top 100 upregulated (F) and downregulated (G) genes. (H) Heatmap showing the significance of the pseudotime difference between cells expressing each TF isoform and those expressing controls. Only 320 TF genes with multiple isoforms, at least one of which is significant, are included. See also Figures S1 and S2.

Construction of a TF Atlas of directed differentiation

We first applied MORF to comprehensively profile gene programs regulated by each TF. As existing cellular differentiation protocols often use different culture media, we tested 7 culture media to identify an optimal media condition that could capture the broadest range of effects of TFs. To this end, we pooled the MORF library TFs, transduced H1 hESCs with the library, and cultured cells for 7 days in each of 7 media conditions (Figure S1F,G and Data S2; Methods). We sorted cells into two populations (top and bottom 10%) based on the expression level of pluripotency markers (TRA-1-60 and SSEA4), as a proxy for differentiation, and sequenced the TF barcodes. Despite even TF distributions in the initial libraries (skew = 5), the distributions became very skewed across all media conditions (skew = 105-115; Figure S1H). As the distributions were remarkably consistent across replicates and media conditions (Pearson r > 0.94; Figure S1I,J and Data S3), the increase in skew is likely a result of TF-dependent effects, rather than media-derived factors. TFs that promote pluripotency (e.g., IDs 1, 3, and 4; and YAF2) (Hayashi et al., 2016; Zhao et al., 2018) increased cell fitness, whereas TFs involved in DNA damage (e.g., BRCA1 and TP53BP1) (Daley and Sung, 2014) decreased cell fitness (Figure S1I). TF representation negatively correlated with TF length (Figure S1K,L). We found that TF distributions in the sorted populations were relatively consistent across media conditions (Figure S1M-O and Data S3), suggesting that culture media does not strongly influence differentiation potential under TF overexpression. Of the 7 media conditions, we selected STEMdiff APEL because it produced the highest enrichment and most even distribution for known developmentally critical TFs (Parekh et al., 2018) (Figure S1N,P).

To build an expression atlas of all TF overexpression effects, we transduced hESCs with the MORF library, cultured cells in STEMdiff APEL media for 7 days, and profiled the cells by single cell RNA-sequencing (scRNA-seq) using a combinatorial indexing protocol based on SHARE-seq (Ma et al., 2020) ( Figure 1A ; Methods). We obtained >1.1 million cell profiles (3,761 UMIs per cell on average; Figure S2A,B) and down-sampled the data by TF ORF to 671,453 cells covering 3,266 TFs (92% of the MORF library; Figure S2C; Methods) to ensure even representation (3-1,000 cells, with an average of 206 cells per TF ORF). Expression level of TF ORFs did not correlate with TF length or expression of the respective endogenous TF (Figure S2D,E). Cluster analysis showed that clusters 6-8 expressed higher levels of differentiation genes (FBN2 and TTN) and lower levels of pluripotency genes (LIN28A and OCT4 (POU5F1); Figure 1B , ​ ,C). C ). Cells overexpressing developmentally critical TFs such as Brachyury (TBXT) and KLF4 had distinct transcriptomes and clustered together (Figure S2F,G). Together, these results show that our approach for profiling TF effects by scRNA-seq maximizes the range of possible TF-induced cell states and lays the groundwork for detailed characterization of the impact of TF overexpression.

Over a quarter of TFs direct differentiation of hESCs

To study the effects of TF overexpression on hESC differentiation, we computationally inferred differentiation trajectories from the TF Atlas. We ordered TF-overexpressing cells in pseudotime using two approaches, diffusion (Haghverdi et al., 2016) and RNA velocity (La Manno et al., 2018), based on expression profile similarity to cells expressing GFP or mCherry controls ( Figure 1D and Figure S2H-M; Methods). Inferred pseudotimes were comparable between the two methods and did not correlate with quality control variables (Figure S2N-Q).

Examination of the expression profiles across pseudotime showed that expression of genes that drive differentiation (e.g., FBN2, TTN, and SOX5) increased over pseudotime, whereas those that maintain pluripotency (e.g., CD24, LIN28A, and OCT4 (POU5F1)) decreased ( Figure 1E , Figure S2R and Data S4A). Further confirming our pseudotime inference, differentiation pathways such as axonogenesis and heart morphogenesis were enriched in pseudotime-upregulated genes ( Figure 1F ). Pluripotency maintenance pathways such as telomere and stem cell maintenance were enriched in pseudotime-downregulated genes ( Figure 1G ). Remarkably, translation (i.e., ribosomal subunits and translation factors) was by far the top pseudotime-downregulated pathway ( Figure 1G ), suggesting its importance during differentiation (Saba et al., 2021).

Co-functional TF modules annotate uncharacterized TFs

We next leveraged the comprehensive scope of our TF Atlas to group co-functional modules of TFs that impact the same programs and classify unknown, orphan TFs. We first inferred gene programs using non-negative matrix factorization and then clustered the TFs by their effects across the programs (cluster Pearson correlation P-value < 10 −7 ; Figure 2 and Data S6). Clustering TFs using pairwise correlation of their mean expression profiles produced similar groupings (Data S6,7). Our analysis grouped together splice isoforms and functionally equivalent TFs (e.g., NEUROD4 and NEUROG1 (Halluin et al., 2016), PAX2 and PAX5 (Bouchard et al., 2000), ESRRB and NANOG (Festuccia et al., 2012)), as well as TFs from the same family, including 18 LIM homeodomain TFs (LHXs 1-6, 8, 9 and LMXs 1A and 1B), 9 posterior HOX genes (HOXs A7, B8, D8, B9, A10, A11, and C12), and 8 nuclear receptors (NR1H2, NR1H3, NR1I2, NR1I3, PPARD, and ESRRA) ( Figure 2B , ​ ,C C ).

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0002.jpg

Unbiased grouping of TFs based on gene programs.

(A) Heatmaps showing pairwise Pearson correlation (top) and enrichment of 100 gene programs (bottom) identified using non-negative matrix factorization (NMF) on mean expression profiles of 3,266 TF ORFs. TFs are ordered by hierarchical clustering. Relations between some groups of TFs are annotated. Numbers in parentheses indicate the number of TF isoforms in the same group. (B,C) Zoomed in subsets of (A) with top enriched pathway for each gene program. (D) UMAP of TF Atlas scRNA-seq data highlighting enrichment of each gene program.

This analysis helps annotate relatively uncharacterized TFs by their functional association with well-characterized TFs. For instance, KLF17 has little functional information and is distantly related to other KLF TF family members, which can act as activators or repressors (McConnell and Yang, 2010). As KLF17 induces a similar gene program to KLF activators (KLFs 1, 2, 4, and 5), it is likely an activator ( Figure 2C ). As another example, we found that overexpression of HDX, which was previously assigned to the POU homeobox class based on phylogenetic analyses despite the lack of a POU-specific domain (Holland et al., 2007), produces similar gene programs as CDX1 and CDX2, suggesting that HDX likely belongs to the caudal-related homeobox family instead ( Figure 2A ). TFs from different families also group together based on similarities in gene programs, such as DMRTA2 and TBXT in mesoderm development and FERD3L and NEUROD1 in neural development ( Figure 2B ).

To identify differences between TFs that regulate similar gene programs, we defined a “TF potential” vector for each TF that characterizes its differentiation potential using a linear regression model to fit the TF-induced expression changes against pseudotime, defining TF potential as the slope (Methods). We applied this analysis to distinguish between TFs within the same family that share similar DNA-binding domains. For example, in the KLF TF family, TF potential analysis showed that the two KLF5 isoforms are nearly identical and more similar to KLF1, whereas KLF17 is more distinct (Figure S2U). Similarly, for the LIM homeodomain TF family, LHX1 and LHX3 are functionally closer to LMX1A and LMX1B, whereas ISL1 and ISL2 are more like each other than the rest of the TF family (Figure S2V). Together, these results demonstrate that our TF Atlas enables systematic functional annotation of uncharacterized TFs, as well as in depth investigation of TF differentiation potential, providing guidance for further study.

Mapping TF effects to reference cell types

To characterize the ability of TFs to drive differentiation to particular endpoints, we mapped TF-induced expression profiles to those of reference cell types from the human fetal expression atlas (Cao et al., 2020). We subclustered the TF Atlas differentiated cells (clusters 6-8 from Figure 1B ) and annotated cells by label-transfer (Stuart et al., 2019) ( Figure 3A , ​ ,B; B ; Methods). Dataset integration (Korsunsky et al., 2019a) and Random Forest classifiers (Cahan et al., 2014; Tan and Cahan, 2019) produced similar annotations (Figure S3A-E; Methods). The mapping results suggest that we generated cells resembling types from each of the three germ layers, such as (i) squamous epithelial and neurons (ectoderm), (ii) smooth muscle and metanephric (mesoderm), and (iii) intestinal epithelial and bronchiolar and alveolar epithelial (endoderm), as well as from the extraembryonic lineage (syncytio- and villous cyto-trophoblast) ( Figure 3B ). Each cluster is comprised of cells with distinct groups of TF ORFs (adjusted mutual information score of 0.43 for TFs with >5% cells) and differentially expressed genes, indicating the diversity and specificity of TF-induced differentiation states ( Figure 3C and Data S8,9). The biological pathways enriched in each cluster were consistent with cell type annotations (Figure S3F).

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0003.jpg

Mapping TF ORFs in differentiated cells to reference cell types.

(A,B) UMAP of scRNA-seq data from 28,825 differentiated cells (clusters 6-8 in Figure 1B ). Colors indicate Louvain clusters (A) and nominated reference cell types (Cao Science 2020) with score >0.3 (B). (C,D) Heatmaps showing percentage of cells with the indicated TF ORF for each cluster (C) or nominated cell type (D). Numbers after TF gene names indicate the isoform. Percentages are normalized to the total number of cells with the indicated TF ORF in the TF Atlas. For each cluster, only the 5 most enriched TF ORFs >5% are shown. EMT, epithelial-mesenchymal transition; ENS, enteric nervous system. See also Figure S3.

Matching TF ORFs to cell types nominated candidate TFs that could induce differentiation of each cell type ( Figure 3D ). Notably, several candidate TFs are known to be important for specifying the target cell type during development. For instance, FERD3L is important for neurogenesis (Ono et al., 2010), FLI1 for endothelial development (Li et al., 2015), and KLF4 for intestinal epithelial homeostasis (Ghaleb et al., 2011).

For reference cell types with insufficient numbers of cells mapped by label-transfer, we applied our TF potential analysis to identify TFs that drive differentiation towards those cell types (Methods). We first verified that TF potential analysis could recover TFs nominated by label-transfer ( Figure 3D and Figure S3G). For unmapped reference cell types, our analysis nominated several known TFs, such as HNF4A and HNF1A for reprogramming towards hepatoblasts (Du et al., 2014; Huang et al., 2014) (Figure S3H). Other nominated TFs were essential for generating the target cell type during development. Examples included PTF1A for acinar cells (Hoang et al., 2016), KLF2 for erythrocytes (Basu et al., 2005), ATOH7 for retinal ganglion cells (Brown et al., 2001), and BHLHE22 for retinal and peripheral neurons (Feng et al., 2006; Ross et al., 2010).

Furthermore, we could predict cell fate bottlenecks by using the distances between expression profiles of TF Atlas cells and reference cell types as a proxy for differentiation complexity. Reference cell types that were farther from TF Atlas cells overexpressing a single TF may require induction by multiple TFs and have bottlenecks in cell fate specification. Our analysis identified bottlenecks in known developmental trajectories, supporting this approach (Figure S3I-K). For instance, lymphatic endothelial cells were farther from TF Atlas cells than vascular endothelial cells, reflecting the specification of lymphatic vasculature from blood vasculature during development (Oliver, 2004) (Figure S3I). Similarly, microglia were more distant than blood cell types and acinar cells were more distant than islet endocrine cells, following their respective developmental trajectories (Ginhoux et al., 2013; Pan and Brissova, 2014) (Figure S3I,K). In addition, we identified potential bottlenecks for specifying cell types with more specialized functions, such as cardiomyocytes (Figure S3J) and acinar cells (Figure S3K). Our findings demonstrate that the TF Atlas enables discovery of cell fate specification bottlenecks that were not apparent from examining the reference cell atlas (Cao et al., 2020) alone. Together, these results show that our TF Atlas could nominate TFs that drive hESCs toward specific cell fates and predict bottlenecks, underscoring its utility for cellular engineering.

Validation of differentiation-directing TFs

To validate the cell type mapping results, we selected 24 candidate TFs that were predicted to generate 10 distinct cell types, including NEUROD1 as a positive control (Zhang et al., 2013). After 7 days of TF overexpression, most candidates (22 out of 24) induced expression of known marker genes that delineate each predicted cell type ( Figure 4A , Figure S3L,M, and Data S10). For instance, NEUROD1, FERD3L, and LMX1B produced peripheral neuron-like cells; FLI1 produced vascular endothelial-like cells; KLF4, HNF4A, and NR5A2 produced intestinal epithelial-like cells; and NHLH1 and ASCL2 produced lung ciliated epithelial-like cells. Within each cell type, different TFs sometimes generated distinct expression profiles of marker genes, indicating differences in differentiation efficiencies or trajectories. For example, ZBTB7B, DMRTA2, and BHLHA9 induced smooth muscle cells expressing HAND1, suggesting a lateral plate mesoderm origin (Barnes et al., 2010), whereas TBXT and MSGN1 did not, consistent with their roles in paraxial mesoderm specification (Chalamalasetty et al., 2014; Yamaguchi et al., 1999) ( Figure 4A ). GRHL3, which was predicted to induce both trophoblasts and ureteric bud cells, only generated trophoblast-like cells ( Figure 4A and Figure S3M). EOMES and GLIS1 induced expression of general stromal cell markers LUM and COL1A1, but not the subpopulation marker ENG (Figure S3M). PAX2 and PAX5 were the only candidate TFs that did not produce the predicted cell type (Figure S3M). TF effects on expression were remarkably consistent in H9 hESCs and 11a iPSCs (Pearson r = 0.84 and 0.89, respectively), suggesting that the results extend beyond the screening cell line ( Figure 4B , ​ ,C C and Data S11A,B).

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0004.jpg

Validation of candidate TFs for differentiation towards nominated cell types.

(A) Expression of marker genes for each cell type measured by quantitative PCR in H1 hESCs after 7 days of TF ORF or GFP overexpression. Numbers after TF gene names indicate the isoform. N = 4. (B,C) Scatterplot comparing expression of all marker genes (205 from Figure 4A and Figure S3L,M) in H1 hESCs to H9 hESCs (B) or 11a iPSCs (C). Expression is measured relative to GFP control. (D-K) Left, expression of marker genes measured by immunostaining in H1 hESCs after 7 days of TF ORF overexpression. Right, intensity of marker gene staining normalized to GFP control from n = 6 images. Scale bar, 25 μm. Marker genes for neuron (D), EMT smooth muscle (E), endothelial (F), smooth muscle (G), metanephric (H), intestinal epithelial (I), lung ciliated epithelial (J), and trophoblast (K) cells are shown. EMT, epithelial-mesenchymal transition. Immunostaining controls are shown in Data S11C. Values represent mean ± SEM. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05. See also Figure S3.

We then further characterized 8 cell types generated by a subset of 17 candidate TFs by immunostaining. Out of 17 TFs, 15 significantly upregulated protein expression of marker genes and induced morphologies that resembled those of reference cell types ( Figure 4D - ​ -K, K , Figure S3N-P, and Data S11C). For instance, FERD3L cells had neuron-like axonal projections, FLI1 cells established vascular endothelial-like tight junctions, KLF4 cells organized into clumps resembling intestinal crypts, and GRHL3 cells formed syncytia typical of trophoblasts. HNF4A and ASCL2 did not significantly upregulate protein expression of marker genes due to low differentiation efficiencies (Figure S3O,P). These results indicate that our analysis robustly predicts candidates that direct specific cell fates, some of which were missed by the bulk TF screen due to choice of pluripotency markers (Ramirez et al., 2011) (Data S11D), highlighting the importance of unbiased expression profiling.

Targeted TF screening to create tailored cellular disease models

Cellular disease models are a tractable system that can be perturbed, genetically or chemically, to assess effects in a cell type-specific context (Robinton and Daley, 2012; Wu and Hochedlinger, 2011). However, it remains challenging or impossible to generate many cell types. The best differentiation methods are often labor-intensive and can require months to produce even heterogenous or immature cell populations. We sought to address this challenge through targeted TF screening.

As a generalizable approach for constructing targeted TF libraries, we used available expression data to select 90 TF isoforms specifically expressed in a target cell type, induced neural progenitors (iNPs; Data S12A; Methods). Although we could use our TF Atlas to predict TFs that generate iNPs (Figure S4A), we sought to establish a universal approach for producing any cell type of interest. We introduced the pooled, targeted TF library into hESCs and differentiated the cells for 7 days ( Figure 5A ). We explored three methods for selecting iNPs that can simultaneously assay different numbers of marker genes: reporter cell line (1 gene), flow-FISH (2-10 genes), and scRNA-seq (up to ~2,000 genes; Figure 5A ; Methods). We obtained concordant screening results (Spearman correlation P-value e.g., NFIB (Steele-Perkins et al., 2005), PAX6 (Gotz et al., 1998), and ASCL1 (Casarosa et al., 1999)).

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0005.jpg

Targeted TF overexpression screening platform for directed differentiation.

(A) Schematic of targeted TF screening. MOI, multiplicity of infection. (B) Comparison of TF ranks from 5 iNP differentiation screens. (C) Expression of markers for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (PDGFRA) after spontaneous differentiation from RFX4-iNPs. Scale bar, 100 μm. (D-H) ScRNA-seq profiling of iNPs differentiated using different methods. EB, embryoid body; DS, dual SMAD; NP, neural progenitors; CN, CNS neurons; CNC, cranial neural crest. Data represents n = 2 batch replicates with 15,211 RFX4-DS, 11,148 EB, and 16,421 DS. (D,E) UMAP of scRNA-seq data with colors indicating Louvain clusters (D) or batch replicates (E). (F) Heatmap showing the percentage of cells from each replicate in each cluster. (G,H) Box plots showing intra- (G) or inter- (H) batch Euclidean distances between cells. Whiskers indicate the 5 th and 95 th percentiles. (I-K) ScRNA-seq data from 26,111 cells spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. RG, radial glia; CN, CNS neurons; MNG, meninges; P, proliferating cells. (I) UMAP with colors indicating Louvain clusters. (J) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. (K) Distribution of cell types produced by each replicate. (L) RFX4 ChIP-seq reads at NR2F1 and NR2F2 promoter regions. (M) Expression of NR2F1 and NR2F2 measured by bulk RNA-seq after 7 days of RFX4 or GFP overexpression. (N-Q) Modeling effects of DYRK1A perturbation in RFX4-iNPs derived from 11a iPSCs. (N,O) Percentage of EdU labeled cells after spontaneous differentiation for DYRK1A knockout (N) or overexpression (O). N = 3. (P,Q) Intensity of MAP2 staining for neurons after spontaneous differentiation for DYRK1A knockout (P) or overexpression (Q). N = 12 images. Values represent mean ± SEM. KO, knockout; NT, non-targeting; sg, single guide RNA. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns, not significant. See also Figures S4 and S5.

For downstream analysis, we focused on 9 TFs (NFIB, RFX4, NFIC, EOMES, OTX1, LHX2, PAX6, FOS, and ASCL1) with the highest average screen ranking. Although all 9 TFs induced expression of VIM, a screening selection marker, 4 (RFX4, NFIB, PAX6, and ASCL1) produced multipotent iNPs that, like NPs, could spontaneously differentiate into neurons and astrocytes ( Figure 5C , Figure S4L-N, Data S13A,S14, and Note S1). ScRNA-seq profiles of spontaneously differentiated cells revealed a broad range of cell types that was distinct between TFs, with RFX4-iNPs producing more CNS cell types (Figure S4O-R, Data S15A,S16, and Note S2). Chromatin immunoprecipitation with sequencing identified TF motifs, transcriptional co-regulators, and candidate target genes (Data S17,S18, and Note S3).

We further optimized RFX4-iNPs by combining RFX4 overexpression with dual SMAD inhibition (Data S19A-D) and compared our optimized protocol (RFX4-DS) to two previous NP differentiation methods (Schafer et al., 2019; Shi et al., 2012a). ScRNA-seq profiling showed that RFX4-DS-iNPs were most consistent within and between replicates ( Figure 5D - ​ -H, H , Figure S4S-V, and Data S15B). Moreover, spontaneously differentiated cells from RFX4-DS-iNPs were remarkably reproducible and predominantly consisted of radial glia and neurons, with only 2-6% meningeal cells ( Figure 5I - ​ -K, K , Figure S5A-C, and Data S15C,S19E-G). The propensity for RFX4-DS-iNPs to spontaneously differentiate into GABAergic neurons (Figure S5D), rather than glutamatergic neurons like iNPs produced by alternative methods (Schafer et al., 2019; Shi et al., 2012b), may stem from RFX4 target genes, NR2F1 and NR2F2, which mark GABAergic neurons (Delgado et al., 2022; Mayer et al., 2018; Reinchisi et al., 2012) ( Figure 5L , ​ ,M, M , Figure S4V, and Data S13A,S18). Similarly, in human brain spheroids, RFX4 expression was associated with GABAergic neurons (Trevino et al., 2020).

Intriguingly, the candidate TF EOMES generated induced cardiomyocytes (iCMs; Figure S5E-M, Data S15D,S19H,I and Note S4). We capitalized on this serendipitous finding to demonstrate sequential TF screening, which mirrors the successive upregulation of TFs during development towards more mature cell types. We used flow-FISH to screen 54 TFs in RFX4-iNPs for astrocytes and 80 TFs in EOMES-iCMs for atrial, ventricular, and mature iCMs (Methods). Candidate TFs included those known to produce the target cell type (e.g., NOTCH2 for astrocytes (Tchorz et al., 2012) and TBX5 and GATA4 for ventricular iCMs (Ieda et al., 2010; Song et al., 2012)), as well as developmentally critical TFs (e.g., SHOX2 and NR2F2 for atrial iCMs (Espinoza-Lewis et al., 2009; Wu et al., 2013) and GATA6 for ventricular and mature iCMs (Xin et al., 2006)), supporting the screening results (Figure S5N,O and Data S12B,C). Furthermore, the screens nominated TFs for further study, including TRPS1 for astrocytes, FOS for ventricular and mature iCMs, and MITF for mature iCMs.

To explore the utility of RFX4-iNPs for modeling neurological disorders, we evaluated the effects of DYRK1A perturbation in this model. DYRK1A knockout and overexpression have been implicated in autism spectrum disorder (De Rubeis et al., 2014; Iossifov et al., 2014) and Down syndrome (Smith et al., 1997), respectively. Bulk RNA-seq identified 42 genes, including those involved in neuronal migration and synapse formation, that were expressed in a DYRK1A dosage-dependent manner (Figure S5P-U and Data S13B,S19J-L). During spontaneous differentiation, DYRK1A knockout increased, whereas DYRK1A overexpression decreased, the proportion of proliferating iNPs ( Figure 5N , ​ ,O). O ). Interestingly, DYRK1A perturbations ultimately reduced neurogenesis: when knocked out, increased iNP proliferation deterred neurogenesis ( Figure 5P and Data S19M), whereas when overexpressed, there were fewer iNPs due to lower initial proliferation ( Figure 5Q ). Electrophysiological characterization of spontaneously differentiated neurons showed that both DYRK1A knockout and overexpression reduced the proportion of mature neurons (Figure S5V-X and Data S19N). Our findings are consistent with previous DYRK1A studies in other model systems (Fotaki et al., 2002; Hammerle et al., 2011; Park et al., 2010; Yabut et al., 2010) and provide additional insight. By combining cellular and genome engineering, we have outlined a versatile approach using the MORF library to create various cell types for studying development and disease.

Discovery of TF regulatory networks by joint profiling of chromatin accessibility and expression

As TFs often alter chromatin state to regulate gene expression, integrating single cell chromatin accessibility with expression using SHARE-Seq (Ma et al., 2020) provides an additional layer of information for deciphering the complex TF regulatory network. We selected 198 TFs that induced representative gene programs ( Figure 2A ), introduced the TF library into hESCs and, after 4 and 7 days, performed SHARE-seq (average of 3,317 and 2,384 UMIs per cell for scATAC- and scRNA-seq, respectively; Figure S6A-D). We constructed a weighted nearest neighbor graph that integrated the ATAC and RNA profiles into a single representation for joint cluster analysis (Hao et al., 2021) ( Figure 6A , ​ ,B B and Figure S6E-H; Methods).

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0006.jpg

Discovery of TF regulatory networks by joint profiling of chromatin accessibility and expression.

(A) Weighted nearest neighbor (WNN) UMAP of joint chromatin accessibility and expression profiles from 69,085 cells overexpressing 198 TFs for 4 or 7 days. Colors indicate smart local moving (SLM) clusters. (B) Dot plot showing marker genes for each cluster. Color and circle size indicate expression level and chromatin accessibility, respectively, relative to other clusters. (C) Heatmaps showing gene regulatory networks (GRN) containing the top TF ORFs (left) and nominated downstream TFs (right) for each cluster. Left, percentage of cells with the indicated TF ORF. Numbers after TF gene names indicate the isoform. Percentages are normalized to the total number of cells with the TF ORF. Only the 6 most enriched TF ORFs >5% are shown for each cluster. Right, average area under the ROC curve (AUC) of TF motif enrichment and RNA expression is shown for significantly enriched (FDR < 0.05) TFs. TFs that were identified as top ORFs and downstream TFs are labeled in blue. (D) Examples of GRNs identified by matching the top TF ORFs (left) with nominated downstream TFs (right). Color of arrows indicates chromatin accessibility of the downstream TF promoter region relative to cluster 0. (E) Schematic showing the relation between differentially expressed genes (DEGs) induced by upstream and downstream TFs. (F) Heatmaps showing the percentage of downstream TF DEGs included in upstream TF DEGs using TFs in each row as upstream (top) or downstream (bottom) relative to TFs in each column. See also Figure S6.

Based on the joint profiles, we identified gene regulatory networks (GRNs) by nominating TFs downstream of each TF ORF. Specifically, for each cluster, we matched top TF ORFs with putative downstream TFs that had enriched expression and motif accessibility (Methods). GRNs of TF ORFs were consistent with their induced cell types (e.g., GRHL1 and GRHL3 targeted TFAP2C and TEAD family TFs to induce trophoblasts (Krendl et al., 2017), and FLI1 targeted AP-1 family TFs (JUN and FOS) and ETV2 to induce vascular endothelial cells (Dejana et al., 2007)), as well as their roles in development (e.g., CDX1, CDX2, and HOXD11 targeted posterior HOX genes to specify the anterior-posterior axis (Neijts et al., 2017)) ( Figure 6C ). For 18 TF ORFs, such as KLF5, FIGLA, MSGN1, and ATOH7, the endogenous TF itself was nominated as the putative downstream TF, suggesting a positive feedback mechanism that enhances TF expression. The MORF library design allows this distinction, because the ORF sequence is too distant (>1 kb) from the 3’ end of the transcript to be captured by 3’ scRNA-seq (Figure S1E and S2E). A complementary approach to identify downstream TFs based on motif enrichment in functional ATAC peaks yielded similar relationships (Figure S6I; Methods).

We further investigated three GRNs: GRN 8 (CDX1, CDX2, and HOXD11 regulation of CDX4 and posterior HOX genes), GRN 3 (ASCL2-4 and NHLH1 regulation of FOX genes), and GRN 6 (MSGN1, MESP1, and MESP2 regulation of GATA2-6, MEIS2, TWIST1, and SNAI1-3). In the corresponding clusters of each GRN, the promoter regions of the respective downstream TFs were more accessible ( Figure 6D ). To validate the TF relationships, we examined differentially expressed genes (DEGs) of each TF ORF using our TF Atlas. In theory, the set of DEGs of an upstream TF should include those of downstream TFs, but not vice versa ( Figure 6E ). Our DEG analysis confirmed the GRN relationships ( Figure 6F ), highlighting how TF regulatory networks can be deciphered with joint profiling, providing a systems-level understanding of gene regulation.

Combinatorial TF screening and prediction

Finally, as TFs often act in combination, we explored how TF ORFs combine to produce the resulting expression state. To model this, we first generated a scRNA-seq dataset for 10 TF ORFs in combinations, including 44 doubles and 3 triples, as well as 10 singles. Low dimensionality embedding and cluster analysis showed that expression profiles of combinations with similar TFs often grouped together, sometimes with the single TF profile of one member of the respective pair (e.g., CDX1, FLI1, and KLF4; Figure 7A , Figure S7A, and Data S20A,B). In other cases, two TFs generated a continuum (e.g., FERD3L and NR5A2). We quantitively modeled TF interactions by a linear regression model ( a b = c 1 ∗ a + c 2 ∗ b + c 3 ∗ a ∗ b ) that fits the double TF expression profiles ( a b ) as a linear combination of the respective single TF profiles ( a and b ) along with an interaction term ( a ∗ b ) (Capaldi et al., 2008; Dixit et al., 2016) (Figure S7B-D). The model suggested that although most TF combinations were overall additive, some TFs tended to be buffering (PTF1A), synergistic (FLI1), or dominant (CDX1; Figure S7E,F).

An external file that holds a picture, illustration, etc. Object name is nihms-1900584-f0007.jpg

Combinatorial TF screening and prediction.

(A) UMAP of scRNA-seq profiles from the combinatorial screen of 10 TF ORFs in combinations, including 44 doubles and 3 triples, as well as 10 singles. Each circle represents the mean expression profile of cells with the indicated TF ORF(s). (B,C) Percent accuracy for different approaches to predict TFs for measured double (B) or triple (C) TF expression profiles. (D-I) Cell type prediction results for double TF profiles. Known combinations (D) or predicted combinations for hepatoblasts (E), bronchiolar and alveolar epithelial cells (F), metanephric cells (G), vascular endothelial cells (H), and trophoblast giant cells (I) are shown. As gene signature scores were discrete, the percentile ranks were reported as ranges. TFs that are part of known combinations, developmentally critical, or specifically expressed in the target cell types are indicated in blue. (J) Prediction results for known combinations of triple TF profiles. To expand the number of combinations, parts of known combinations with >3 TFs were included. (K,L) Marker gene expression for each cell type measured by quantitative PCR (K; n = 4) or immunostaining (L; n = 6 images) after 7 days of TF ORF or control overexpression. Numbers after TF gene names indicate the isoform. Intensity is normalized to GFP control. Values represent mean ± SEM. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ND, not detected. See also Figure S7.

We then used the combinatorial dataset to nominate TF combinations that could produce a measured combinatorial expression profile. We ranked TF combinations based on how well their respective single profiles combine to fit the measured combinatorial profile (Methods). We tested different approaches for combining, including averaging or using linear and nonlinear regression methods. As the baseline, we randomly selected TF combinations from the same set of possible combinations (i.e., 45 combinations for doubles and 120 combinations for triples). Surprisingly, simply computing the average outperformed regression-based approaches ( Figure 7B , ​ ,C), C ), potentially because most TF combinations were additive (Figure S7E). For double TF profiles, averaging achieved an accuracy of 81% when evaluating only the top TF combination and 91% when evaluating the top 10% of possible combinations (i.e., top 4 out of 45 total combinations; Figure 7B ). For triple TF profiles, averaging correctly predicted all 3 sets of TFs when evaluating the top ~2% of possible combinations ( Figure 7C ). Furthermore, we could still predict TF combinations by using single TF profiles from the TF Atlas, though with lower accuracy (Figure S7G,H and Data S20C-F).

We applied our findings from the combinatorial dataset to develop an approach for nominating TF combinations that could differentiate hESCs into more mature cell types. First, we averaged single TF profiles from the TF Atlas to estimate combinatorial TF profiles. We then scored each combinatorial TF profile for enrichment of reference cell type expression signatures (Cao et al., 2020; Cortal et al., 2021) and ranked potential TF combinations. We confirmed that our approach enriched for experimentally validated double TF combinations for the respective cell types, including hepatoblasts (HNF4A and FOXA1 (Sekiya and Suzuki, 2011)), astrocytes (SOX9 and NFIB (Canals et al., 2018)), and inhibitory neurons (ASCL1 and DLX2 (Yang et al., 2017)) ( Figure 7D ). Moreover, the top predicted combinations included TFs that have been individually validated. For instance, combinations included NHLH1 for bronchiolar and alveolar epithelial cells, CDX1 for metanephric cells, and GRHL3 for trophoblasts, and suggested other TFs that could improve differentiation efficiency and fidelity ( Figure 7E - ​ -I I and Data S21A,B). Furthermore, predicted combinations included TFs that are crucial for establishing the respective cell type during development, supporting the prediction results. Examples include KLF6 for hepatocytes (Zhao et al., 2010) ( Figure 7E ) and ERG for vascular endothelial cells (Birdsey et al., 2015) ( Figure 7G ). Similarly, our approach enriched for triple TF combinations that were experimentally validated and developmentally relevant ( Figure 7J , Figure S7I-M, and Data S21C,D).

To validate our prediction approach, we experimentally tested 12 predicted TF combinations for 3 cell types. Most combinations (11 out of 12) induced expression of known marker genes for the target cell type ( Figure 7K , ​ ,L). L ). Of these, 9 induced higher expression of at least 2 marker genes compared to the single TF, suggesting that the TF combination may produce target cell types with increased fidelity or efficiency. Combining TFs also provides an opportunity to fine-tune induced cell types. Out of 4 TF combinations that included ERG and produced vascular endothelial-like cells expressing canonical marker genes (CDH5, PECAM1, and KDR), the combination with MIXL1 did not express CD34, suggesting that MIXL1 directs differentiation towards non-hemogenic endothelium (Gritz and Hirschi, 2016) ( Figure 7K ). Thus, our results illustrate an approach to reduce the exponentially vast search space of combinatorial TF effects for follow up empirical experimentation, accelerating the pace of cellular engineering.

Discussion

To achieve a comprehensive understanding of gene regulation, we developed a systematic, flexible approach for TF overexpression and characterization. We created a library of all human TF splice isoforms and built a TF Atlas that maps TF overexpression to corresponding expression changes. The TF Atlas enabled systematic identification of TFs that drive profound cell state changes, including production of cell types from all three germ layers and trophoblasts, as well as broad-spectrum findings, such as classification of orphan TFs. We used our library for targeted and sequential TF screens to establish cellular disease models. As only a small number of expressed TFs induced differentiation (e.g., 4 out of 90 for iNPs), TF screening facilitates efficient identification of TFs. We extensively profiled RFX4-iNPs and EOMES-iCMs to show that TF overexpression could lock cells in defined fates. We also integrated expression and chromatin accessibility data to characterize TF regulatory networks. Finally, we modeled the effects of TF ORF combinations using a combinatorial dataset and developed an approach for predicting TF combinations for reference cell types, which can help reduce the combinatorial search space.

The accessibility and flexibility of our screening approach lends itself to scalable extensions of the technology to additional contexts. For example, increasing the screening MOI to simultaneously overexpress multiple TFs could identify combinations of TFs that drive specific phenotypes. Incorporating TF target analysis and combinatorial prediction with TF screening may further facilitate derivation of complex cell types. Furthermore, TF screening can be applied to study trans-differentiation (Pang et al., 2011; Song et al., 2012), aging (Ocampo et al., 2016), and cancer (Darnell, 2002; Suva et al., 2014). Moreover, our ORF barcoding approach allows for a variety of screening selection methods and could be extended to pooled ORF screening of other protein families.

Future applications of our MORF library in other contexts will illuminate factors driving nearly any cellular phenotype of interest. Similarly, as single cell profiling becomes more affordable, we anticipate that the resolution of this TF Atlas will increase. Our MORF library and TF Atlas provide valuable resources that lay the foundation for deciphering TF circuits towards a comprehensive understanding of GRNs that govern cell states.

Limitations of the Study

Some conclusions of our study were limited by technical aspects of our experimental design. As we were constrained by sequencing cost, we chose one starting cell type, time point, and media condition for our TF Atlas. Expression changes induced by TF overexpression may depend on cell type. We chose a time point of 7 days to obtain more differentiated cells for mapping to reference cell types, which precluded identification of immediate TF target genes. Future studies at earlier time points will enable mapping of direct TF target genes. While we showed that cell culture media did not strongly alter TF-induced differentiation outcome at 7 days, exogenous factors and media conditions may influence differentiation of more mature cell types. For the TF Atlas cell types that we mapped and validated, further functional characterization, as we have done for RFX4-iNPs, is necessary. Finally, our approach for predicting TF combinations assumes that TF effects are additive, which is often but not necessarily always the case. More complex integration approaches will increase the precision of TF combination modeling.

STAR Methods

Resource availability

Lead contact

Requests for further information should be directed to and will be fulfilled by the lead contact, Feng Zhang (gro.etutitsnidaorb@gnahz).

Materials availability

The pooled and arrayed versions of the MORF library have been deposited to Addgene.

Data and code availability

All raw and processed sequencing data generated as part of this study have been deposited to the Gene Expression Omnibus (GSE216481). Code for the analyses described in this paper are available on Github (https://github.com/fengzhanglab/Joung_TFAtlas_Manuscript).

Experimental model and subject details

Cell culture and differentiation

HEK293FT cells were maintained in high-glucose DMEM with GlutaMax and pyruvate, 10% fetal bovine serum, and 1% penicillin/streptomycin. Cells were passaged every other day at a ratio of 1:4 or 1:5 using TrypLE Express.

Unless otherwise specified, human embryonic stem cells (hESCs) used in these experiments were from H1 hESCs. HUES66 hESCs were used for the induced neural progenitor (iNP) screens, iNP candidate TF validation, induced cardiomyocyte (iCM) characterization, and iCM scRNA-seq. Sequential TF screens were performed in iNPs and iCMs derived from H1 hESCs. Other stem cell lines used in this study include H9 hESCs and 11a human induced pluripotent stem cell (iPSCs). hESCs and iPSCs were maintained in cell culture dishes coated with 1% Geltrex membrane matrix in mTeSR1 medium. For routine maintenance, stem cells were passaged 1:10-1:20 using ReLeSR. For lentivirus transduction and differentiation, cells were dissociated using Accutase and seeded in mTeSR1 with 10 μM ROCK Inhibitor Y27632. All stem cells were maintained below passage 30 and confirmed to be karyotypically normal and negative for mycoplasma every 5-10 passages. Normocin was used as an antibiotic for stem cell culture and differentiation.

During neuronal differentiation, stem cell media was incrementally shifted towards neuronal media (Neurobasal medium, B-27, and GlutaMAX) in 25% increments starting from day 2. On day 5, media was changed to 100% neuronal media.

During TF-iNP differentiation, stem cell media was gradually shifted towards NP media (DMEM/F-12 with HEPES, B-27, 20 ng/mL EGF, 20 ng/mL bFGF, and 2 μg/mL heparin) in 25% increments as described above for neuronal differentiation. Cells were passaged at day 4. For spontaneous differentiation, 2 μg/mL doxycycline was added to the media starting on day 0 for 7 days to induce TF expression. After 7 days, cells were maintained in NP media for 3 days before media was changed to differentiation media (DMEM/F-12 with HEPES, B-27, and 2 μg/mL heparin). Half of the media was refreshed every other day during spontaneous differentiation.

For RFX4-iNP protocol optimization, base media from the dual SMAD inhibition (DS) (Shi et al., 2012a) and embryoid body (EB) (Schafer et al., 2019) protocols were tested. DS media is a 1:1 mix of N-2 (DMEM/F12 with HEPES, N-2, 5 μg/mL insulin, 100 μM nonessential amino acids, and 100 μM 2-mercaptoethanol) and neuronal media. EB media (DMEM/F12 with HEPES, N-2, and B-27 minus vitamin A) was also tested. SMAD inhibitors dorsomorphin and SB-431542 were added where indicated. To provide the best comparison between RFX4-iNP, DS, and EB methods, the differentiation timelines were aligned such that the iNPs produced by the three methods were dissociated for scRNA-seq at the same time.

For EOMES-iCM differentiation, hESCs were seeded in mTeSR. After 2 days, when cells have reached confluency (day 0), 2 μg/mL doxycycline was added to the media for 2 days unless otherwise indicated. On day 1, media was switched to CM differentiation media (RPMI 1640 with GlutaMax, B-27 minus insulin, and 10 mg/mL Ascorbic acid). Media was refreshed on day 2 and every other day afterwards. On day 7, half of the media was replaced with CM maintenance media (RPMI 1640 with GlutaMax and B-27). On day 8, all of the media was replaced with CM maintenance media. For CM differentiation using GSK and Wnt inhibitors, 10 μM CHIR99021 and 5 mM IWP4 were used as described previously (Lian et al., 2013).

To select the optimal media condition for the TF Atlas, stem cell media was gradually shifted towards 7 medias in 25% increments starting from day 2 as described above. Medias tested include M1 (DMEM/F-12 with HEPES, N-2, B-27, and 100 μM nonessential amino acids), M2 (1:1 mix of neuronal media and DMEM/F-12 with HEPES, N-2, and 100 μM nonessential amino acids), M3 (StemPro-34 SFM and GlutaMAX), M4 (STEMdiff APEL 2), M5 (CM maintenance media), M6 (KnockOut DMEM, KnockOut Serum Replacement, GlutaMAX, and 100 μM nonessential amino acids), and M7 (mTeSR). M4 was selected for the TF Atlas and validation.

Lentivirus production

HEK293FT cells were cultured as described above. 1 day prior to transfection, cells were seeded at ~40% confluency in T25, T75, or T225 flasks. Cells were transfected the next day at ~90-99% confluency. For each T25 flask, 3.4 μg of plasmid containing the vector of interest, 2.6 μg of psPAX2, and 1.7 μg of pMD2.G were transfected using 17.5 μL of Lipofectamine 3000, 15 μL of P3000 Enhancer, and 1.25 mL of Opti-MEM. Transfection parameters were scaled up linearly with flask area for T75 and T225 flasks. Media was changed 5 h after transfection. Virus supernatant was harvested 48 h post-transfection, filtered with a 0.45 μm PVDF filter, aliquoted, and stored at −80 °C.

Lentivirus transduction

For transduction, 3 × 10 6 hESCs or iPSCs were seeded in 10-cm cell culture dishes with an appropriate volume of lentivirus. After 24 h, media was refreshed with the appropriate antibiotic. For 5 days, media with the appropriate antibiotic was refreshed every day, and cells were passaged after 3 days of selection. Concentrations for selection agents were determined using a kill curve: 150 μg/mL Hygromycin, 3 μg/mL Blasticidin, and 1 μg/mL Puromycin. Lentiviral titers were calculated by transducing cells with 5 different volumes of lentivirus and determining viability after a complete selection of 3 days (Joung et al., 2017).

Method details

Sequences and cloning

The plasmids lentiMPHv2 and lentiSAMv2 were used for CRISPR activation. LentiCRISPRv2 was used for CRISPR-Cas9 mediated homology-directed repair (HDR). The Puromycin resistance gene in lentiCRISPRv2 was replaced with the lentiSAMv2 Blasticidin resistance gene for CRISPR-Cas9 knockout of DYRK1A. Single guide RNA (sgRNA) spacer sequences used in this study are listed in Data S22A, and were cloned into the respective vectors as previously described (Joung et al., 2017). For dox-inducible gene expression, the plasmid pUltra-puro-RTTA3 was used for rtTA. The dox-inducible ORF vector was cloned by replacing the EF1a promoter in pLX_TRC209 with the pTight promoter. For DYRK1A overexpression, the codon-optimized DYRK1A sequence (NM_001396) was cloned into pLX_TRC209 for expression under EF1a and the Hygromycin resistance gene was replaced with the lentiSAMv2 Blasticidin resistance gene.

qPCR quantification of transcript expression

Cells were seeded in 96-well plates and grown to 60-90% confluency before RNA was reverse transcribed for qPCR as described previously (Joung et al., 2017). TaqMan qPCR was performed with custom or readymade probes (Data S22C).

Western blot

Protein lysates were harvested with RIPA lysis buffer containing protease inhibitor cocktail. Samples were standardized for protein concentration using the Pierce BCA protein assay and incubated at 70°C for 10 mins under reducing conditions. After denaturation, samples were separated by Bolt 4-12% Bis-Tris Plus Gels (Thermo Fisher Scientific NW04125BOX) and transferred onto a PVDF membrane using iBlot Transfer Stacks (Thermo Fisher Scientific IB401001).

For NEUROD1 and V5, blots were blocked with Odyssey Blocking Buffer (TBS) for 1 h at room temperature. Blots were then probed with different primary antibodies (Data S22E) in Odyssey Blocking Buffer overnight at 4°C. Blots were washed with TBST before incubation with secondary antibodies (Data S22E) in Odyssey Blocking Buffer for 1h at room temperature. Blots were washed with TBST and imaged using the Odyssey CLx (LiCOr).

For DYRK1A, blots were blocked with 5% BLOT-QuickBlocker in TBST for 1 h at room temperature. Blots were then probed with different primary antibodies (Data S22E) in 2.5% BLOT-QuickBlocker in TBST overnight at 4°C. Blots were washed with TBST before incubation with secondary antibodies (Data S22E) in 2.5% BLOT-QuickBlocker in TBST for 1 h at room temperature. Blots were washed with TBST and imaged using the Pierce ECL Western Blotting Substrate on the ChemiDox XRS+ (Bio-Rad).

Immunofluorescence and imaging

Cells were cultured on poly-D-lysine/laminin coated glass coverslips (VWR 354087) in 24-well plates as described above. Prior to staining, cells were washed with 1 mL PBS and fixed with 4% paraformaldehyde in PBS for 30 mins at room temperature. Cells were washed with PBS and blocked in PBS with 2.5% goat serum and 0.1% Triton X-100 for 1 h at room temperature. Cells were then stained with different primary antibodies (Data S22E) in PBS with 1.25% goat serum and 0.1% Triton X-100 overnight at 4°C. Cells were washed in PBS with 0.1% Triton X-100 before staining with the appropriate secondary antibodies (Data S22E) in PBS with 1.25% goat serum and 0.1% Triton X-100 for 1 h at room temperature. Cells were washed in PBS with 0.1% Triton X-100, mounted onto slides using ProLong Gold Antifade Mountant with DAPI, and nail polished. Immunostained coverslips for NPs were imaged on a Zeiss Axio Observer with a Hamatsu Camera using a Plan-Apochromat 20x objective and a 1.6x Optovar. Immunostained coverslips for TF Atlas validation were imaged on a Leica Stellaris 5 confocal microscope using a 20x objective. Images were taken from randomly selected regions using fixed exposure times.

Design and cloning of TF ORF libraries

The MORF barcoded human TF library consisted of 1,836 genes that were selected based on AnimalTFDB (Zhang et al., 2015) and Uniprot (UniProt, 2015) annotations and included histone modifiers (Data S1). The library included all 3,548 splice isoforms that overlapped between RefSeq and Gencode annotations, as well as 2 control vectors expressing GFP and mCherry. Each TF ORF isoform has a unique 24-bp barcode with a Hamming distance of at least 3 compared to all other barcodes. 593 of the 3,548 isoforms were obtained from the Broad Genomic Perturbation Platform. As ORF libraries generated from cDNA libraries often contain missense mutations that can result in screening artifacts, we individually synthesized the rest of the isoforms (Genewiz). TF ORFs were cloned in an arrayed format into pLX_TRC317 for expression under the EF1a promoter. All constructs in the MORF library have been sequence verified.

To create targeted TF ORF libraries, TFs specifically expressed in target cell types were selected using published single cell or bulk RNA-seq datasets. TFs that were identified in 2 or more datasets were included. For NPs, 70 TF genes were selected using 8 datasets from radial glia, neural stem cells, differentiated neural progenitors, and fetal astrocytes (Camp et al., 2015; Johnson et al., 2015; Llorens-Bobadilla et al., 2015; Pollen et al., 2015; Shin et al., 2015; Thomsen et al., 2016; Wu et al., 2010; Zhang et al., 2016). Then, isoforms that comprised >25% of the transcripts for the respective gene were selected using bulk RNA-seq data of human fetal astrocytes (Zhang et al., 2016), resulting in 90 TF isoforms (Data S12A). The targeted NP library was cloned into pLX_TRC209. For astrocytes, 44 TF genes were selected using 6 datasets from purified and differentiated astrocytes (Aibar et al., 2017; Sloan et al., 2017; Zeisel et al., 2015; Zhang et al., 2014; Zhang et al., 2016; Zhong et al., 2018). Isoforms were selected using bulk RNA-seq data of human fetal astrocytes (Zhang et al., 2016) as described for NPs, resulting in 54 TF isoforms (Data S12B). For CM, 49 TF genes were selected using 11 datasets from purified and differentiated cardiomyocytes (Cao et al., 2019; Churko et al., 2018; Cui et al., 2019; DeLaughter et al., 2016; Friedman et al., 2018; Lescroart et al., 2018; Li et al., 2016; Pijuan-Sala et al., 2019; Sahara et al., 2019; Sereti et al., 2018; Wamstad et al., 2012). Isoforms that comprised >5% of the transcripts were selected using bulk RNA-seq of the whole fetal human heart (Pervolaraki et al., 2018), resulting in 80 TF isoforms (Data S12C). A lower threshold was chosen for CMs because isoform prevalence in the whole fetal heart may not be representative of ventricular, atrial, and mature CMs.

To assess TF distribution, TF barcodes were amplified (Data S22B) and sequenced on the Illumina MiSeq or NextSeq platforms as previously described (Joung et al., 2017). For the pooled lentiviral library, lentiviral RNA was harvested using the QIAmp Viral RNA Mini Kit and reverse transcribed using the qScript Flex cDNA Kit with gene-specific priming (Data S22B) before barcode amplification. NGS reads that perfectly matched each barcode were counted and normalized to the total number of perfectly matched NGS reads for each condition. Skew ratio was calculated as the normalized count for the 10 th percentile divided by the 90 th percentile.

Reporter cell line screen

To generate reporter cell lines, EGFP from pLX_TRC209 followed by a T2A (GGCAGTGGAGAGGGCAGAGGAAGTCTGCTAACATGCGGTGACGTCGAGGAGAATCCTGGCCCA) self-cleaving peptide was inserted at the N-terminus of endogenous SLC1A3 and VIM genomic sequences. SLC1A3 and VIM were selected as NP marker genes based on convergence across published RNA-seq datasets and high expression levels (Camp et al., 2015; Johnson et al., 2015; Llorens-Bobadilla et al., 2015; Pollen et al., 2015; Shin et al., 2015; Thomsen et al., 2016; Wu et al., 2010; Zhang et al., 2016). Clonal reporter cell lines were generated using CRISPR-Cas9 mediated HDR. To construct the HDR plasmids for each gene, the HDR templates that consisted of the 850-1,000 bp genomic regions flanking the sgRNA cleavage sites were PCR amplified from HUES66 genomic DNA. Then EGFP-T2A flanked by HDR templates were cloned into pUC19. HUES66 hESCs were nucleofected with 10 μg of LentiCRISPRv2 plasmid and 6 μg of HDR plasmid using the P3 Primary Cell 4D-Nucleofector X Kit according to the manufacturer’s instructions. Cells were then seeded sparsely (2 electroporation reactions per 10-cm cell culture dish) to form single cell clones. After 18 h, cells were selected for Cas9 expression with 0.5 μg/mL Puromycin for 2 days and expanded until colonies can be picked (~1 week).

Cell colonies were detached by replacing the media with PBS and incubating at room temperature for 15 mins. Each cell colony was removed from the Petri dish using a 200 μL pipette tip and transferred a well in a 96-well plate for expansion. Clones with EGFP insertions were identified by 2-round PCR amplification (Data S22B), first with primers amplifying outside of the HDR template (HDR Fwd 1 and HDR Rev, 15 cycles) and then with primers amplifying the region of insertion (HDR Fwd 2 and HDR Rev, 15 cycles) to avoid detecting the HDR template plasmid as a false positive. Products were run on a gel to identify clones with insertions and Sanger sequencing confirmed that EGFP had been inserted at the intended site without mutations. For each reporter cell line, 3 clones with EGFP inserted into one of the two alleles were selected for further expansion and characterization.

Flow-FISH screen

10X single cell RNA sequencing (scRNA-seq) screen

10X scRNA-seq library preparation and sequencing

Cells were dissociated with Accutase for 10 mins (NP) or 50 mins (spontaneously differentiated cells) at 37°C and filtered using a 70 μm cell strainer to obtain single cells. Cells were loaded in the 10x Genomics Chromium Controller with 10,000 cells per channel. For cells from the scRNA-seq pooled screen and spontaneous differentiation of four candidate TFs, scRNA-seq libraries were prepared using the Chromium Single Cell 3’ Library & Gel Bead Kit v2 according to the manufacturer’s instructions. Libraries were sequenced on the NextSeq platform, aiming for a minimum coverage of 20,000 reads per single cell (paired-end; read 1: 26 cycles; i7 index: 8 cycles, i5 index: 0 cycles; read 2: 55 cycles). For cells from the NP method comparison and spontaneous differentiation of RFX4-DS-iNPs, scRNA-seq libraries were prepared using the Chromium Single Cell 3’ Library & Gel Bead Kit v3 and sequenced on the HiSeq X platform (paired-end; read 1: 28 cycles; i7 index: 8 cycles, i5 index: 0 cycles; read 2: 96 cycles).

Arrayed screen

Bulk MORF library screen

Bulk RNA sequencing (RNA-seq)

RNA from cells plated in 24-well plates and grown to 60-90% confluency was harvested using the RNeasy Plus Mini Kit. RNA-seq libraries were prepared using NEBNext Ultra RNA Library Prep Kit for Illumina and sequenced on the Illumina NextSeq platform (>9 million reads per biological replicate).

Chromatin immunoprecipitation with sequencing (ChIP-seq)

Cells were plated in 10-cm cell culture dishes and grown to 60-80% confluency. For each condition, two biological replicates were harvested for ChIP-seq. Formaldehyde was added directly to the growth media for a final concentration of 1% and cells were incubated at 37°C for 10 mins to initiate chromatin fixation. Fixation was quenched by adding 2.5 M glycine in PBS for a final concentration of 125 mM glycine and incubated at room temperature for 5 mins. Cells were then washed with ice-cold PBS, scraped, and pelleted at 1,000×g for 5 mins.

Cell pellets were prepared for ChIP-seq using the Epigenomics Alternative Mag Bead ChIP Protocol v2.0 (Consortium, 2004). Briefly, cell pellets were resuspended in 100 μL of lysis buffer (1% SDS, 10 mM EDTA, 50 mM Tris-HCL pH 8.1) containing protease inhibitor cocktail and incubated for 10 mins at 4°C. Then 400 μL of dilution buffer (0.01% SDS, 1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris-HCl pH 8.1, and 167 mM NaCl) containing protease inhibitor cocktail was added. Samples were pulse sonicated with 2 rounds of 10 mins (30s on-off cycles, high frequency) in a rotating water bath sonicator (Diagenode Bioruptor) with 5 mins on ice between each round. 10 μL of sonicated sample was set aside as input control. Then 500 μL of dilution buffer (0.01% SDS, 1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris-HCl pH 8.1, and 167 mM NaCl) containing protease inhibitor cocktail and 1 μL of anti-V5 (Data S22E) was added to the sonicated sample. ChIP samples were rotated end over end overnight at 4°C.

For each ChIP, 50 μL of Protein A/G Magnetic Beads was washed with 1 mL of blocking buffer (0.5% TWEEN and 0.5% BSA in PBS) containing protease inhibitor cocktail twice before resuspending in 100 μL of blocking buffer. ChIP samples were transferred to the beads and rotated end over end for 1 h at 4°C. ChIP supernatant was then removed and the beads were washed twice with 200 μL of RIPA low salt buffer (0.1% SDS, 1% Triton x-100, 1 mM EDTA, 20 mM Tris-HCl pH 8.1, 140 mM NaCl, 0.1% DOC), twice with 200 μL of RIPA high salt buffer (0.1% SDS, 1% Triton x-100, 1 mM EDTA, 20 mM Tris-HCl pH 8.1, 500 mM NaCl, 0.1% DOC), twice with 200 μL of LiCl wash buffer (250 mM LiCl, 1% NP40, 1% DOC, 1 mM EDTA,10 mM Tris-HCl pH 8.1), and twice with 200 μL of TE (10 mM Tris-HCl pH8.0, 1 mM EDTA pH 8.0). ChIP samples were eluted with 50 μL of elution buffer (10 mM Tris-HCl pH 8.0, 5 mM EDTA, 300 mM NaCl, 0.1% SDS). 40 μL of water was added to the input control samples. 8 μL of reverse cross-linking buffer (250 mM Tris-HCl pH 6.5, 62.5 mM EDTA pH 8.0, 1.25 M NaCl, 5 mg/ml Proteinase K, 62.5 μg/ml RNAse A) was added to the ChIP and input control samples and then incubated at 65°C for 5h. After reverse crosslinking, samples were purified using 116 μL of SPRIselect Reagent. ChIP-seq libraries were prepared with NEBNext Ultra II DNA Library Prep Kit for Illumina and sequenced on the Illumina NextSeq platform (>60 million reads per condition).

Indel sequencing

Cells plated in 96-well plates were grown to 60-80% confluency and assessed for indel rates as previously described (Joung et al., 2017). Genomic DNA was harvested from cells using QuickExtract DNA Extraction Solution. The genomic region flanking the site of interest was amplified, first with region-specific primers (Data S22B) for 15 cycles and then with barcoded primers for 15 cycles. PCR products were sequenced on the Illumina MiSeq platform (>10,000 reads per condition).

Flow cytometry assays

TNNT2 immunostaining was performed using TNNT2 antibodies (Data S22E) as described previously (Palpant et al., 2017). For the EdU assay, cells plated in 24-well plates were differentiated and EdU incorporation was measured using the Click-iT EdU Alexa Fluor 488 Flow Cytometry Assay Kit according to a modified version of the manufacturer’s instructions. EdU was added to the culture medium to a final concentration of 10 μM for 2 h before cells were dissociated with Accutase for 15-45 mins at 37°C. Cells were transferred to a 96-well plate, pelleted at 200×g for 5 mins, and washed once with 200 μL of 1% BSA in PBS. Cells were resuspended in 100 μL of Click-iT fixative and incubated for 15 mins at room temperature in the dark. After fixing, cells were washed with 200 μL of 1% BSA in PBS twice, resuspended in 100 μL of Click-iT saponin-based permeabilization and wash reagent, and incubated for 15 mins in the dark. To each sample, 500 μL of Click-iT reaction cocktail was added, and the reaction mixture was incubated for 30 mins at room temperature in the dark. Cells were washed with 200 μL of Click-iT saponin-based permeabilization and wash reagent twice and resuspended in 200 μL of 1% BSA in PBS. For each sample, 10,000 cells were analyzed on a CytoFLEX Flow Cytometer (Beckman Coulter) and quantified with FlowJo.

Electrophysiology

Whole-cell patch-clamp recordings were performed as previously described (Nehme et al., 2018). Recording pipettes were pulled from thin-walled borosilicate glass capillary tubing (King Precision Glass KG33) on a P-97 puller (Sutter Instrument) and had resistances of 3-5 MΩ when filled with internal solution (128 mM K-gluconate, 10 mM HEPES, 10 mM phosphocreatine sodium salt, 1.1 mM EGTA, 5 mM ATP magnesium salt and 0.4 mM GTP sodium salt, pH 7.3, 300-305 mOsm). The cultured cells were constantly perfused at a speed of 3 mL/min with the extracellular solution (119 mM NaCl, 2.3 mM KCl, 2 mM CaCl2, 1 mM MgCl2, 15 mM HEPES, 5 mM glucose, pH 7.3-7.4; Osmolarity was adjusted to 325 mOsm with sucrose). All the experiments were performed at room temperature unless otherwise specified.

Cells were visualized with a 40X water-immersion objective on an upright microscope (Olympus) equipped with IR-DIC. Recordings were made using a Multiclamp 700B amplifier (Molecular Devices) and Clampex 10.7 software. In current clamp mode, membrane potential was held at −65 mV with a Multiclamp 700B amplifier, and step currents were then injected to elicit action potentials. The spontaneous AMPA receptor mediated excitatory postsynaptic currents (sEPSCs) were recorded after entering whole-cell patch clamp recording mode for at least 3 min.

TF Atlas SHARE-seq library preparation

SHARE-seq libraries were prepared as previously described (Ma et al., 2020). Briefly, cells were fixed and permeabilized. For joint measurements of single cell chromatin accessibility and expression (scATAC- and scRNA-seq), cells were first transposed by Tn5 transposase to mark regions of open chromatin. The mRNA was reverse transcribed using a poly(T) primer containing a unique molecular identifier (UMI) and a biotin tag. Permeabilized cells were distributed in a 96-well plate to hybridize well-specific barcoded oligonucleotides to transposed chromatin fragments and poly(T) cDNA. Hybridization was repeated three times to expand the barcoding space and ligate cell barcodes to cDNA and chromatin fragments. Reverse crosslinking was performed to release barcoded molecules. cDNA was separated from chromatin using streptavidin beads, and each library was prepared separately for sequencing. Libraries were sequenced on the Illumina NovaSeq platform, aiming for a minimum coverage of 20,000 reads per single cell (for scRNA-seq only, read 1: 100 cycles, read 2: 10 cycles, index 1: 99 cycles, index 2: 8 cycles; for scATAC- and RNA-seq, read 1: 50 cycles, read 2: 50 cycles, index 1: 99 cycles, index 2: 8 cycles). To pair TF barcodes with cell barcodes, TF and cell barcodes were PCR amplified from cDNA retained following the whole transcriptome amplification step and before tagmentation (Data S22B). The resulting amplicon was sequenced on the Illumina NovaSeq platform, aiming for a minimum coverage of 10,000 reads per single cell (read 1: 65 cycles; index 1: 99 cycles).

Quantification and statistical analysis

Image quantification

For quantification of MAP2 staining ( Figure 5P , ​ ,Q), Q ), the MeasureImageIntensity module in CellProfiler 3.1.8 (Carpenter et al., 2006) was used to measure mean intensity on grayscale MAP2 420 μm × 420 μm images. The IdentifyPrimaryObjects module in CellProfiler was used to identify and count nuclei in grayscale DAPI images with the following settings modified from default: Typical diameter of objects, in pixel units (Min, Max) = 25, 70; Threshold strategy = Adaptive; Threshold smoothing scale = 1.5; Lower and upper bounds on threshold = 0.06, 1.0. For quantification of marker gene staining ( Figure 4D - ​ -K, K , Figure S3N-P, and Data S11C), the MeasureImageIntensity module in CellProfiler 4.2.1 (Carpenter et al., 2006) was used to measure mean intensity on grayscale 580 μm × 580 μm images. The IdentifyPrimaryObjects module in CellProfiler was used to identify and count nuclei in grayscale DAPI images with the following settings modified from default: Typical diameter of objects, in pixel units (Min, Max): 25, 100; Threshold method = Otsu; Three-class thresholding; Assign pixels in the middle intensity class to the foreground; Threshold smoothing scale = 5; Threshold correction factor = 0.9; Lower and upper bounds on threshold = 0.02, 1.0; Size of smoothing filter = 10; Suppress local maxima that are closer than this minimum allowed distance = 15; speed up by using lower-resolution image to find local maxima = no.

10X scRNA-seq analysis

Sequencing data were aligned and quantified using the Cell Ranger Single cell Software Suite v3.1.0 (Zheng et al., 2017) against the GRCh38 human reference genome provided by Cell Ranger. Scanpy v1.7.2 (Wolf et al., 2018) was used to cluster and visualize cells. Cells with 400-7,000 detected genes and less than 10% total mitochondrial gene expression were retained for analysis. Genes that were detected in fewer than 3 cells were removed. Scanpy was used to log normalize, scale, and center the data and unwanted variation was removed by regressing out the number of UMIs and percent mitochondrial reads. Next, highly variable genes were identified and used as input for dimensionality reduction via principal component analysis (PCA). The resulting principal components were then used to cluster the cells, which were visualized using Uniform manifold approximation and projection (UMAP). Clusters were identified using Louvain by fitting the top 50 principal components to compute a neighborhood graph of observations with local neighborhood number of 20 using the scanpy.pp.neighbors function. Cells were then clustered into subgroups using the Louvain algorithm implemented as the scanpy.tl.louvain function. Cluster marker genes and associated p-values were identified using the scanpy.tl.rank_gene_groups function.

To map TF perturbations to expression profiles, for each cell, the TF whose corresponding barcode had the highest number of perfectly matching NGS reads was paired with the cell if the TF barcode had at least 2 reads and >25% more reads than the second highest TF. Otherwise, the cell was excluded from the scRNA-seq analysis. To identify TFs that produced similar expression profiles to radial glia, TF scRNA-seq signatures were correlated to available human fetal cortex or brain organoid scRNA-seq datasets (Eze et al., 2021; Nowakowski et al., 2017; Pollen et al., 2015; Quadrato et al., 2017; Zhong et al., 2018). The 1,121 most variable genes identified using the scanpy.pp.highly_variable_genes function with the parameters “min_mean=0.0125, max_mean=3 and min_disp=0.5” were used. Candidate TFs were ranked based on Pearson correlations between mean expression profiles of each TF ORF and radial glia from reference datasets.

To compare iNP differentiation methods, the cluster of spontaneously differentiated neurons was excluded. Intra- and inter-batch Euclidean distances were calculated on the 2,305 variable genes using the spatial.distance.pdist and spatial.distance.cdist functions, respectively, from SciPy. Wasserstein distances were determined using the wasserstein_distance function from SciPy.

Bulk RNA-seq analysis

Bowtie (Langmead et al., 2009) index was created based on the hg38 genome and RefSeq transcriptome. Next, RSEM v1.3.1 (Li and Dewey, 2011) was run with command line options “--estimate-rspd --bowtie-chunkmbs 512 --paired-end” to align paired-end reads directly to this index using Bowtie and estimate expression levels in transcripts per million (TPM) based on the alignments. TFs with similar RNA-seq signatures to reference cell types from human fetal cortex or brain organoid (Nowakowski et al., 2017; Pollen et al., 2015; Quadrato et al., 2017) were identified using Pearson correlation between expression profiles. For each TF ORF, the expression signature was defined as the top 2,000 genes with the highest fold change relative to the GFP control condition. For each reference cell type, the average expression profile in TPM was used. To identify genes that were differentially expressed, TPM values were log-transformed (log2(TPM+1)) and filtered for genes that were detectable (above or equal to 1) in either condition. TF overexpression conditions were compared to control conditions using the Student’s t-test. Only genes that were significant (FDR < 0.05) were reported.

ChIP-seq analysis

Bowtie (Langmead et al., 2009) was used to align paired-end reads to the hg38 genome with command line options q -X 300 --sam --chunkmbs 512”. Next, biological replicates were merged and Model-based Analysis of ChIP-seq (MACS) (Zhang et al., 2008) was run with command line options “-g hs -B -S --mfold 6,30” to identify TF peaks. HOMER (Heinz et al., 2010) was used to discover motifs in the TF peak regions identified by MACS. The findMotifsGenome.pl program from HOMER was run with the command line options “-size 200 -mask” and the top 3 known and de novo motifs were presented. TFs were considered potential regulators of a candidate gene if the TF peak region identified by MACS overlapped with the 20kb region centered around the transcriptional start site of the candidate gene RefSeq annotations.

Indel analysis

Indel analysis was performed using a custom Python script as previously described (Joung et al., 2017).

Electrophysiology analysis

Analysis was performed using Clampfit 10.7. Cells in which the series resistance (Rs) changed by >20% were excluded from data analysis. In addition, cells with Rs more than 20 MΩ at any time during the recordings were discarded.

SHARE-seq data preprocessing

SHARE-seq libraries were aligned as previously described (Ma et al., 2020). Briefly, SHARE-ATAC-seq reads were trimmed and aligned to the hg38 genome using bowtie2. Reads were demultiplexed using four sets of 8-bp barcodes in the index reads, tolerating one mismatched base per barcode. Reads mapping to the mitochondria and chrY were discarded. Duplicates were removed using Picard tools (http://broadinstitute.github.io/picard/). Open chromatin region peaks were called on individual samples using MACS2 peak caller (Zhang et al., 2008). Peaks from all samples were merged and peaks overlapping with ENCODE blacklisted regions (https://sites.google.com/site/anshulkundaje/projects/blacklists) were filtered out. Peak summits were extended by 150bp on each side and defined as accessible regions. The fragment counts in peaks and TF scores were calculated using chromVAR (Schep et al., 2017).

SHARE-RNA-seq reads were trimmed and filtered for reads that contain TTTTTT at the 11-16 bases of read 2 allowing for one mismatch. Reads were aligned to hg38 genome using STAR (Dobin et al., 2013). Reads were demultiplexed as described above for SHARE-ATAC-seq. Aligned reads were annotated to both exons and introns using featurecounts (Liao et al., 2014). UMI-Tools (Smith et al., 2017) was used to collapse UMIs that were within one mismatch of another UMI. UMIs with only one read were removed as potential ambient RNA contamination. A matrix of gene counts by cell was created with UMI-Tools. Cells that expressed 700 genes; joint scRNA- and ATAC-seq, >400 genes; combinatorial TF screen, >500 genes) Scanpy v1.7.2 (Wolf et al., 2018) was used to preprocess the count matrix and cluster cells as described above for 10X scRNA-seq libraries. Harmony (Korsunsky et al., 2019a) as implemented by Scanpy (max_iter_harmony = 30, max_iter_kmeans = 50) was used for batch correction. To map TF ORFs to single cells, TF barcodes were ranked by number of perfectly matching NGS reads and filtered for >10 reads. For single TF overexpression, the TF barcode with the highest number of NGS reads and >50% more reads than the second highest TF were mapped. For combinatorial TF overexpression, the top two or three TF barcodes were mapped. Cells without mapped TF ORFs were excluded from downstream analyses. Scanpy’s sc.pp.subsample was used to subsample datasets by TF ORF.

Pseudotime analysis

Two approaches were used to order cells along pseudotime: diffusion (Haghverdi et al., 2016) and RNA velocity (La Manno et al., 2018). Diffusion pseudotime was determined using Scanpy’s sc.tl.diffmap (n_comps = 15) and sc.tl.dpt functions. RNA velocity pseudotime was determined using scVelo (Bergen et al., 2020). The top 5,000 most dispersed genes were used to estimate velocity (mode = ‘stochastic’) and the velocity_pseudotime function was used to determine pseudotime. For each approach, pseudotimes were computed using each cell expressing GFP or mCherry controls as the root cells and averaged. Genes that were differentially expressed over pseudotime were identified by fitting a linear regression on the raw counts against pseudotime using scipy.stats.linregress. Genes with slopes that were significantly different than 0 (FDR < 0.05) were considered differentially expressed.

Non-negative matrix factorization

To identify gene programs in the scRNA-seq data, non-negative matrix factorization (NMF) was performed using scikit-learn (Pedregosa et al., 2011) (tol = 1e-5, max_iter = 10000). The analysis was performed on log-normalized, centered expression data for the set of variable genes. Negative values were converted to zero to identify enriched gene programs. Positive values were converted to zero and the data multiplied by −1 to identify depleted gene programs. The optimal number of NMF programs for enriched and depleted gene programs was determined by performing NMF analysis over a range of K values (20, 30, 50, 100, 200). The average NMF program weights for each TF ORF were ordered by hierarchical clustering using 1 – correlation coefficient as the distance and Ward’s linkage. Clustering results were examined for groups of TFs with known similarities to select the best value of K. We chose 50 NMF programs each for enriched and depleted gene programs.

Pathway enrichment analysis

Pathway enrichment analysis was performed using g:Profiler (Raudvere et al., 2019). The top 100 differentially expressed genes over diffusion pseudotime or genes with the highest NMF gene program weights were provided as a ranked list for input. GO:BP pathways with between 5 and 500 genes that were significantly enriched (FDR < 0.05) were included. To identify non-overlapping pathways, the enriched pathways were sorted by FDR and any pathway that had more than 50% genes overlapping was excluded. For Figure S3F, Enrichr (Kuleshov et al., 2016) implemented by GSEApy was used to evaluate the enrichment of each pathway in the set of differentially expressed genes for each cluster.

TF potential analysis

We defined a “TF potential” vector for each TF by the relation of TF-induced expression changes to the differentiation trajectory. Specifically, we subclustered cells overexpressing TFs of interest or controls and reordered cells in diffusion pseudotime to attain higher resolution of TF-induced expression changes. For each TF, we used a linear regression model to fit the corresponding expression profiles against pseudotime, defining TF potential as the slope of the linear regression.

Cell type mapping

Expression profiles of differentiated cells were mapped to those of reference cell types from the human fetal cell transcriptome atlas (Cao et al., 2020). The set of common variable genes between the differentiated cells and reference dataset were used for mapping. For mapping by label-transfer, the FindTransferAnchors function from Seurat (Stuart et al., 2019) was run with parameters: dims = 1:50, k.anchor = 15, k.filter = 100, k.score = 50, and max.features = 300. Cells with a maximum prediction score > 0.2 were mapped to the respective cell type. For mapping by dataset integration, the two datasets were integrated using Harmony (Korsunsky et al., 2019a) as described above. To annotate each differentiated cell, the cell type labels of the 10 nearest reference cells, as measured by Euclidean distsance in the latent space, were evaluated. Differentiated cells were assigned the most common cell type if 8 (80%) or more of the reference cells shared the same cell type label. For mapping by Random Forest classification, SingleCellNet (Tan and Cahan, 2019) was used to classify differentiated cells. The classifier was trained using the scn_train function from SingleCellNet with parameters: nRand = 200, nTrees = 2000, nTopGenePairs = 200. Cells with a maximum score > 0.1 were mapped to the respective cell type.

Joint chromatin accessibility and gene expression analysis

Seurat v4 (Hao et al., 2021) was used for the joint chromatin accessibility and gene expression (scATAC- and scRNA-seq) multimodal analysis. Dimensionality reduction was performed on each dataset separately. The scRNA-seq data was normalized and variable features were retained for scaling and PCA. The scATAC-seq data was normalized using term-frequency inverse-document-frequency and the top 250,000 most accessible regions were retained for latent semantic indexing (LSI). Weighted nearest neighbor analysis from Seurat (dims.list = list(1:50, 2:50), prune.SNN = 1/40) was performed using the scRNA-seq PCA and scATAC-seq LSI to simultaneously cluster scATAC- and scRNA-seq data. Marker genes for each cluster were identified using Presto (Korsunsky et al., 2019b).

Combinatorial TF prediction

The average expression profiles of TF combinations were used for prediction. All possible combinations of single TF expression profiles were fitted against each measured double or triple TF profile to select the TF combination with the best fit. Linear regression (fit_intercept = False, positive = True), kernel ridge regression (alpha = 1), and random forest regression (max_depth = 4, n_estimators = 200) from scikit-learn (Pedregosa et al., 2011) were evaluated and scored based on the coefficient of determination. Average expression profiles were scored based on Pearson correlation. To predict double and triple TF profiles using single TF profiles from the TF Atlas, the two datasets were integrated using Harmony (Korsunsky et al., 2019a). The average expression profiles from the TF Atlas differentiated cells were fitted against each double or triple TF profile. To reduce the number of possible combinations, average expression profiles from the TF Atlas were grouped by hierarchical clustering using 1 – correlation coefficient as the distance and Ward’s linkage where indicated.

To predict TF combinations for reference cell types, expression profiles of double or triple TFs were estimated using the mean expression profiles from the TF Atlas differentiated cells. Individual TF profiles were grouped by hierarchical clustering as described above into 365 clusters for double TF profiles and 151 clusters for triple TF profiles to reduce the number of combinations. Group gene signatures for reference cell types from the human fetal cell atlas (Cao et al., 2020) were extracted using CelliD (Cortal et al., 2021) with default parameters. Cell type-specific gene signature scores were computed on all possible estimated expression profiles for multiple TFs. Predicted TF combinations were ranked by cell type-specific gene signature scores. Combinations that did not include any cell type-specific TFs (approximately 10-50% of combinations depending on the cell type) were eliminated. For each cell type, up to 100 TFs that were significantly enriched (FDR < 0.05) based on the human fetal cell atlas (Cao et al., 2020) analysis were considered specific for that cell type. In cases with more than 100 significantly enriched TFs, the top 100 TFs with the highest expression relative to other cell types were included.

Statistics

Statistical tests were applied with the sample size listed in the text and figure legends. Sample size represents the number of independent biological replicates. Data supporting main conclusions represents results from at least two independent experiments. All graphs with error bars report mean ± s.e.m. values. Two-tailed t-tests were performed unless otherwise indicated. PRISM was used for basic statistical analysis and plotting (http://www.graphpad.com), and the R language and programming environment (https://www.r-project.org) was used for the remainder of the statistical analysis. Multiple hypothesis testing correction was applied where indicated.

Key resources table

REAGENT or
RESOURCE
SOURCEIDENTIFIER
Antibodies
See Data S22E for antibodies used in this studyThis paperNA
Bacterial and virus strains
Stbl3Thermo Fisher ScientificC737303
Chemicals, peptides, and recombinant proteins
high-glucose DMEM with GlutaMax and pyruvateThermo Fisher Scientific10569010
fetal bovine serumVWR97068-085
penicillin/streptomycinThermo Fisher Scientific15140122
TrypLE ExpressThermo Fisher Scientific12604021
GeltrexThermo Fisher ScientificA1413202
mTeSR1STEMCELL Technologies85850
ReLeSRSTEMCELL Technologies05873
AccutaseSTEMCELL Technologies07920
ROCK Inhibitor Y27632Enzo Life SciencesALX-270-333-M025
NormocinInvivogenant-nr-1
NeurobasalThermo Fisher Scientific21103049
B-27Thermo Fisher Scientific17504044
GlutaMAXThermo Fisher Scientific35050061
DMEM/F-12 with HEPESThermo Fisher Scientific11330057
EGFMilliporeSigmaE9644
bFGFSTEMCELL Technologies78003
heparinSTEMCELL Technologies07980
doxycyclineMilliporeSigmaD9891
N-2Thermo Fisher Scientific17502048
insulinMillipore Sigma19278
nonessential amino acidsThermo Fisher Scientific11140050
2-mercaptoethanolMillipore SigmaM6250
B-27 minus vitamin AThermo Fisher Scientific12587010
dorsomorphinMillipore SigmaP5499
SB-431542R&D Systems1614
RPMI 1640 with GlutaMaxThermo Fisher ScientificA1895601
B-27 minus insulinThermo Fisher Scientific17504044
Ascorbic acidMillipore SigmaA4403
CHIR99021SelleckchemS1263
IWP4Stemgent04-0036
StemPro-34 SFMThermo Fisher Scientific10639011
STEMdiff APEL 2STEMCELL Technologies05275
KnockOut DMEMThermo Fisher Scientific10829018
KnockOut Serum ReplacementThermo Fisher Scientific10828010
Lipofectamine 3000Thermo Fisher ScientificL3000150
P3000 EnhancerThermo Fisher ScientificL3000150
Opti-MEMThermo Fisher Scientific31985070
HygromycinThermo Fisher Scientific10687010
BlasticidinThermo Fisher ScientificA1113903
PuromycinThermo Fisher ScientificA1113803
RIPA lysis bufferCell Signaling Technologies9806S
protease inhibitor cocktailMilliporeSigma05892791001
Odyssey Blocking Buffer (TBS)LiCOr927-50000
BLOT-QuickBlockerG Biosciences786-011
Pierce ECL Western Blotting SubstrateThermo Fisher Scientific32209
paraformaldehydeVWR15710
goat serumCell Signaling Technologies5425S
Triton X-100MilliporeSigma93443
ProLong Gold Antifade Mountant with DAPIThermo Fisher ScientificP36941
FormaldehydeMilliporeSigma252549
glycineMilliporeSigmaG7126
Protein A/G Magnetic BeadsThermo Fisher Scientific88802
Proteinase KQiagen19133
RNAse AQiagen19101
SPRIselect ReagentBeckman CoulterB23318
QuickExtract DNA Extraction SolutionLucigenQE09050
BSAMilliporeSigmaA9418
Critical commercial assays
Pierce BCA protein assayVWR23227
QIAmp Viral RNA Mini KitQiagen52906
qScript Flex cDNA KitVWR95049-100
P3 Primary Cell 4D-Nucleofector X KitLonzaV4XP-3024
PrimeFlow RNA assay kitThermo Fisher Scientific88-18005-204
Chromium Single Cell 3’ Library & Gel Bead Kit v210x Genomics120237
Chromium Single Cell 3’ Library & Gel Bead Kit v310x Genomics1000075
RNeasy Plus Mini KitQiagen74134
NEBNext Ultra RNA Library Prep Kit for IlluminaNEBE7530
NEBNext Ultra II DNA Library Prep Kit for IlluminaNEBE7645
Click-iT EdU Alexa Fluor 488 Flow Cytometry Assay KitThermo Fisher ScientificC10420
Deposited data
Raw sequencing dataThis paperGSE216481
Processed dataThis paperGSE216481
Human fetal expression atlas(Cao et al., 2020)GSE156793
Experimental models: cell lines
HEK293FTThermo Fisher ScientificR70007
H1 hESCsWiCellWA01
HUES66 hESCsHarvard Stem Cell Institute iPS Core FacilityNA
H9 hESCsWiCellWA09
11a iPSCsPaola Arlotta laboratory, Harvard UniversityNA
Oligonucleotides
See Data S22B-D for oligonucleotides used in this studyThis paperNA
Recombinant DNA
lentiMPHv2Addgene89308
lentiSAMv2Addgene75112
LentiCRISPRv2Addgene52961
pUltra-puro-RTTA3Addgene58750
pLX_TRC209Broad Genetic Perturbation PlatformpLX_TRC209
pTight promoterAddgene31877
psPAX2Addgene12260
pMD2.GAddgene12259
pLX_TRC317Broad Genetic Perturbation PlatformpLX_TRC317
pUC19Addgene50005
Software and Algorithms
CellProfiler (v3.1.8, v4.2.1)(Carpenter et al., 2006) https://cellprofiler.org/
Cell Ranger Single cell Software Suite (v3.1.0)10x Genomics https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome
Scanpy (v1.7.2)(Wolf et al., 2018) https://github.com/scverse/scanpy
Bowtie (v1.2.3)(Langmead et al., 2009) https://github.com/BenLangmead/bowtie
RSEM (v1.3.1)(Li and Dewey, 2011) https://github.com/deweylab/RSEM
MACS (v1.4.2)(Zhang et al., 2008) https://github.com/macs3-project/MACS
HOMER (v4.10.3)(Heinz et al., 2010) http://homer.ucsd.edu/homer/
FlowJo (v10.8.1)BD Biosciences https://www.flowjo.com/
Clampex (v10.7)Molecular Devices https://www.moleculardevices.com/products/axon-patch-clamp-system/acquisition-and-analysis-software/pclamp-software-suite
SHARE-seq preprocessing(Ma et al., 2020) https://github.com/masai1116/SHARE-seq-alignment
SHARE-seq TF alignmentThis paper https://github.com/fengzhanglab/Joung_TFAtlas_manuscript
Harmony (Scanpy)(Korsunsky et al., 2019a) https://github.com/immunogenomics/harmony
scVelo (v0.2.3)(Bergen et al., 2020) https://github.com/theislab/scvelo
g:Profiler(Raudvere et al., 2019) https://biit.cs.ut.ee/gprofiler/gost
scikit-learn (v0.24.2)(Pedregosa et al., 2011) http://scikit-learn.sourceforge.net
GSEApy Enrichr (v0.10.4)(Kuleshov et al., 2016) https://github.com/zqfang/GSEApy
Seurat (v4)(Hao et al., 2021) https://github.com/satijalab/seurat/
SingleCellNet(Tan and Cahan, 2019) https://github.com/pcahan1/singleCellNet
Presto(Korsunsky et al., 2019b) https://github.com/immunogenomics/presto
chromVAR(Schep et al., 2017) https://github.com/GreenleafLab/chromVAR
CelliD(Cortal et al., 2021) https://github.com/RausellLab/CelliD
PRISM (v9.1.0)GraphPad https://www.graphpad.com/scientific-software/prism/

Supplementary Material

Sup Data 2

Data S2 ∣ Duration of established TF-driven differentiation protocols. For each protocol, the cell type, TFs, time required for TF overexpression, and PubMed reference number are listed.

Sup data 1

Data S1 ∣ TF splice isoforms in the MORF library. The MORF library consisted of 1,836 TF genes encoded by 3,548 splice isoforms that overlapped between RefSeq and Gencode annotations, as well as 2 control vectors expressing GFP and mCherry. 593 of the 3,548 isoforms were obtained from the Broad Genomic Perturbation Platform (Broad GPP). Some of the Broad GPP TF ORFs contained V5 epitope tags. The rest of the isoforms were synthesized by Genewiz. All constructs were sequence verified and paired with unique 24-bp barcodes.

Sup data 3

Data S3 ∣ TF enrichment for bulk screening in different cell culture media. Fold enrichment of 3,548 TFs after differentiating in 7 different cell culture media (M1-M7) for 7 days. For unsorted cells, enrichment was measured relative to the lentiviral library. For sorted cells, enrichment was measured as the TF barcode count in cells expressing low levels of pluripotency markers relative to those expressing high levels.

Sup data 5

Data S5 ∣ Splice isoform annotations and differentiation outcome. Splice isoform annotations for 9 TF genes indicating differences between isoforms and whether domains that may be important for function are missing. The diffusion pseudotime P-values are included for each isoform. AA, amino acid.

Sup data 4

Data S4 ∣ Pseudotime analysis of TF overexpression. Diffusion and RNA velocity pseudotime analysis methods were used to order cells from the TF Atlas in a differentiation trajectory. (A) Differentially expressed genes over pseudotime. Linear regression was applied to identify genes that were significantly differentially expressed (FDR < 0.05) over pseudotime. Genes were ranked by the slope of the linear regression fit. The 1,000 most upregulated and downregulated genes are listed. (B) Effects of overexpressing each TF on pseudotime. Pseudotimes for each TF were compared to GFP and mCherry controls to determine the average difference and associated P-value.

Sup data 6

Data S6 ∣ Grouping of TF ORFs using NMF and pairwise correlation. (A) Hierarchical clustering results for TF grouping methods using non-negative matrix factorization (NMF) or pairwise correlation. The order of TFs and assigned clusters are listed for each method. (B) NMF gene programs. The top 100 genes with the highest weights for each of the 100 NMF gene programs are shown. Programs 1-50 are enriched and programs 51-100 are depleted.

Sup data 8

Data S8 ∣ Cluster marker genes for differentiated cells in the TF Atlas. Cells were ordered by diffusion pseudotime relative to Cluster 0. Linear regression was applied to identify genes that were significantly differentially expressed (FDR < 0.05) over pseudotime. The slope of the linear regression fit and associated P-values are listed for each marker gene.

Sup Data 10

Data S10 ∣ Marker genes used to validate candidate TFs for differentiation of reference cell types. Marker genes commonly used for each cell type were selected based on available literature. For each marker gene, the associated cell type, literature references, and whether the marker gene was identified as a differentially expressed gene (DEG) for the respective cell type by the TF Atlas are provided. EMT, epithelial-mesenchymal transition; PNS, peripheral nervous system.

Sup Data 12

Data S12 ∣ Targeted TF screening results. (A) Screens of 90 TFs in hESCs for iNP differentiation. Ranks for each screening method are reported. Ranks were considered “NA” if the results of TF overexpression were not detectable. (B) Screen of 54 TFs in RFX4-iNPs for astrocyte differentiation using flow-FISH. S1 and S2 represent biological replicates. (C) Screen of 80 TFs in EOMES-iCMs for atrial, ventricular, or mature cardiomyocyte differentiation using flow-FISH.

Sup data 13

Data S13 ∣ Differentially expressed genes in bulk RNA-seq datasets. Genes that were significantly differentially expressed (FDR < 0.05) relative to control are listed with associated fold change and P-values for TF ORF overexpression (A) and DYRK1A perturbations (B).

Sup data 15

Data S15 ∣ Cluster marker genes for each scRNA-seq dataset. The top 30 marker genes and associated P-values are listed for each dataset and cluster. (A) Cells that have been spontaneously differentiated from iNPs for 8 weeks. iNPs were derived using RFX4, NFIB, ASCL1, or PAX6 with n = 2 biological replicates. (B) iNPs derived using three iNP differentiation methods: dual SMAD inhibition, embryoid body formation, and RFX4 overexpression combined with dual SMAD inhibition. Data represents n = 2 batch replicates. (C) Cells that have been spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. (D) Cardiomyocytes differentiated using EOMES overexpression or GSK and WNT inhibition for 4 weeks. Data represents n = 2 biological replicates.

Sup data 18

Data S18 ∣ Genes with TF ChIP-seq peaks. Genes with transcriptional start sites that that were within 10kb of the TF ChIP-seq peak region identified by MACS are listed.

Sup data 21

Data S21 ∣ Predicted TF combinations for reference cell types. (A) Hierarchical clustering of TF ORFs for TF Atlas differentiated cells into 365 clusters. (B) Predicted double TFs for reference cell types from the human fetal cell atlas (Cao et al., 2020). Combinations were ranked based on the cell type-specific gene signature score. Only the top 10% of possible combinations are shown. Combinations are presented as clusters from (A). (C) Same as (A), for 151 clusters. (D) Same as (B), for triple combinations using clusters from (C). Only the top 1% of possible combinations are shown.

Sup data 22

Data S22 ∣ List of reagents used in this study. Reagents included sgRNAs (A), primers (B), TaqMan qPCR probes (C), Flow-FISH probes (D), and antibodies (E).

sup data 9

Data S9 ∣ Differential gene expression analysis for differentiated cells. Smoothened heatmap showing expression of marker genes for each cluster of differentiated cells from Figure 3A . Cells are sorted by cluster followed by diffusion pseudotime.

Sup data 7

Data S7 ∣ Unbiased clustering of TFs based on Pearson correlation of gene expression. (A) Heatmap showing pairwise Pearson correlation for mean expression profiles of 3,266 TF ORFs. TFs are ordered by hierarchical clustering. Relations between some groups of TFs are annotated, prioritizing groups that overlap with those from Figure 2 . Numbers in parentheses indicate the number of TF isoforms in the same group. (B,C) Zoomed in subsets of (A).

sup data 11

Data S11 ∣ Validation of candidate TFs in other stem cell lines for differentiation towards nominated cell types. (A,B) Expression of marker genes for each nominated cell type in H9 hESCs (A) or 11a iPSCs (B) after 7 days of candidate TF or GFP overexpression. Numbers after TF gene names indicate the isoform. n = 4. Values represent mean ± SEM. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns, not significant; ND, not detected. (C) Controls for data in Figure 4D - ​ -K K (immunostaining of marker genes in H1 hESCs). Scale bar, 25 μm. (D) Heatmap showing enrichment of TF Atlas candidate TFs in the bulk TF screen across 7 media conditions (M1-M7).

sup data 14

Data S14 ∣ Validation of candidate TFs driving iNP differentiation. (A) Western blot showing expression of candidate TFs measured via the V5 epitope tag after 7 days of differentiation. (B,C) Heatmaps showing correlation between bulk RNA-seq expression profiles of iNPs and human fetal cortex (Nowakowski et al., 2017) (B) or brain organoid (Quadrato et al., 2017) (C) cell types. D07 and D12 indicate number of days of ORF overexpression. RG, radial glia; IPC, intermediate progenitor cell; IN, interneuron; div, dividing; oRG, outer radial glia; tRG, truncated radial glia; vRG, ventricular radial glia; MGE, medial ganglionic eminence; nEN, newborn excitatory neurons, EN, excitatory neurons; PFC, prefrontal cortex; V1, primary visual cortex; nIN, newborn interneurons; CTX, cortex; CGE, cortical ganglionic eminence; STR, striatum; OPC, oligodendrocyte precursor cells; Glyc, cells expressing glycolysis genes; Pro, proliferating progenitors; NE, neuroepithelium; DN, dopaminergic neurons; CLN, callosal neurons; CFN, corticofugal neurons; Meso, mesodermal progenitors. (D) Expression of marker genes for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (NG2) in cells spontaneously differentiated for 1, 2, 4, or 8 weeks from iNPs produced by candidate TFs in HUES66 hESCs. (E,F) Expression of NP marker genes in iNPs generated using 11a iPSCs (E) or H1 hESCs (F) after 7 days of TF overexpression. (G,H) Expression of marker genes for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (NG2 and PDGFRA) in cells spontaneously differentiated from 11a iPSC (G) or H1 hESC (H) iNPs for 8 weeks. Scale bar, 100μm.

sup data 16

Data S16 ∣ Profiling cells spontaneously differentiated from iNPs using scRNA-seq. (A) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. Horizontal lines distinguish between major cell types. Pro, uncommitted progenitors; RP, retinal progenitors; RPE, retinal pigment epithelium; PR, photoreceptors; RGC, retinal ganglion cells; DNP, dorsal neural progenitors; RG, radial glia; Astro, astrocytes; CN, CNS neurons; EPD, ependyma; EP, epithelial progenitors; BE, bronchial epithelium; CE, cranial epithelium; CNC, cranial neural crest; CNCP, cranial neural crest progenitors; P, proliferative cells. (B-D) UMAP of 4,162 neurons from clusters CN 1-3 of Figure S4O. (B,C) Marker genes for general regions of the central nervous systems (B) and neuronal subtypes (C) are shown. Color indicates gene expression. (D) Neurons spontaneously differentiated from each candidate TF are highlighted. Colors indicate biological replicates, S1 and S2.

sup data 19

Data S19 ∣ Optimization of RFX4-iNP and EOMES-iCM differentiation for disease modeling. (A-D) Optimization of RFX4-iNP differentiation. (A) Schematic for different media conditions (M1-M8) tested. SMAD inhibitors dorsomorphin (DM) and SB-431542 (SB) were added to the media at the indicated concentrations. mTeSR stem cell media was changed to different NP media (NP, EB, and DS; see Methods) over 7 days of differentiation. (B) Heatmaps showing expression of neuron marker genes TUJ1 and MAP2 relative to GAPDH control in cells from iNPs that have undergone spontaneous neurogenesis for 2 or 4 weeks. iNPs were differentiated for 5 or 7 days using each of the media conditions in (A) and seeded at low or high densities prior to spontaneous neurogenesis. Colors represent mean expression from n = 4 biological replicates. (C) Same as (A), for additional media conditions tested. (D) Same as (B), for the media conditions shown in (C). (E-G) UMAP of scRNA-seq data from 26,111 cells that have been spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. Marker genes for general regions of the central nervous system (E), radial glia subtypes (F), and GABAergic interneuron subtypes (G) are shown. Colors indicate gene expression. (H,I) Optimization of EOMES-iCM differentiation. Data shows percent of cells that stained for cardiomyocyte marker TNNT2 at day 10 pooled from n = 3 biological replicates. (H) Different EOMES induction duration and seeding densities. (I) EOMES induction for 2 days or GSK and Wnt inhibition at different seeding densities. (J-N) Modeling neurodevelopmental disorders using RFX4-iNPs with DYRK1A perturbation. KO, knockout; NT, nontargeting. (J-L) Volcano plots showing the number of genes that were significantly differentially expressed (FDR < 0.05) and had an absolute log2 fold change >1 for DYRK1A KO sgRNA 1 (J), KO sgRNA 2 (K), and ORF (L) conditions. For a full list of genes, see Data S13B. The KO sgRNAs 1 and 2 conditions were compared to both NT sgRNAs. The ORF condition was compared to GFP control. n = 3. (M) Representative images of MAP2 immunostaining during spontaneous differentiation for NT sg1 and DYRK1A KO sg2. Scale bar, 100μm. (N) Action potential (AP) properties measured using electrophysiology for different DYRK1A perturbations from n = 12-36 neurons with evoked APs. Values represent mean ± SEM.

sup figure 1

Figure S1 ∣ Bulk TF screening in different cell culture media, related to Figure 1 . (A-D) Comparison of ORF and CRISPRa. HUES66 hESCs were transduced with ORF, ORF with untranslated regions (UTRs), or SAM CRISPRa to upregulate NEUROD1 (A,B) or NEUROG2 (C,D) for 7 days. n = 4. Values represent mean ± SEM. Scale bar, 50 μm. ****P < 0.0001; ***P < 0.001. NT, nontargeting; sg, single guide RNA. (A,C) Expression of NEUROD1 mRNA and protein (A) or NEUROG2 mRNA (C). (B,D) Immunostaining of marker genes for neurons (MAP2) and neural progenitors (PAX6). (E) MORF vector design. WPRE, Woodchuck Hepatitis Virus Posttranscriptional Regulatory Element. (F) Schematic of bulk TF screening in H1 hESCs. MOI, multiplicity of infection. (G) Duration of established TF-driven differentiation protocols. Dashed line indicates the median. (H) Scatterplots comparing the TF barcode distribution for the plasmid library, lentiviral library, and unsorted cells cultured in stem cell media (M7) after 7 days of TF ORF overexpression. CPM, counts per million; BR1 and BR2, biological replicates; skew, ratio of the 90 th and 10 th percentile CPM. (I) Heatmap showing TF barcode counts in each media condition (M1-M7; Methods) relative to the lentivirus library. The 10 most enriched and depleted TFs are labeled. Numbers after the TF gene name indicate the isoform. (J) Heatmap showing pairwise Pearson correlation between each of the conditions in (I), ordered by hierarchical clustering. (K,L) Scatterplots showing the relation between TF barcode counts and ORF length for the lentivirus library (K) and unsorted cells (L), averaged across 7 media conditions. (M) Heatmap showing TF barcode counts in the sorted differentiated cells relative to stem cells. The 50 most enriched TFs are labeled. (N) Subset of data in (M), highlighting TFs with known roles in development or differentiation. (O) Heatmap showing the pairwise Pearson correlation between each of the conditions in (M), ordered by hierarchical clustering. The top 5% of TFs with the highest average fold change were evaluated. (P) Box plots showing enrichment of 67 developmentally critical TFs from (Parekh et al., 2018) and (N). Whiskers indicate the 10 th and 90 th percentiles.

sup figure 2

Figure S2 ∣ TF Atlas data quality control and pseudotime analysis, related to Figure 1 . (A) Violin plots showing distribution of genes, unique molecular identifiers (UMIs), and percent mitochondrial counts per cell in the TF Atlas. (B) Comparison of TF ORF representation between the bulk TF screen and the TF Atlas scRNA-seq. CPM, counts per million. (C) Distribution of cells overexpressing each TF isoform. Dashed lines indicate subsampling thresholds. (D) Scatterplot showing the relation between average TF ORF expression per cell and TF ORF length. (E) Density scatterplot showing, for each cell, expression of the exogenous TF ORF measured via TF barcode counts and the corresponding endogenous TF. (F) UMAPs of TF Atlas scRNA-seq data highlighting cells with indicated ORF. Numbers after TF gene names indicate the isoform. (G) Heatmaps showing percentage of cells with the indicated TF ORF that were assigned to each Louvain cluster from Figure 1B . (H,I) Force-directed graph (FDG) representation of TF Atlas scRNA-seq data. Colors indicate Louvain clusters (H) and diffusion pseudotime (I). (J,K) UMAP of TF Atlas scRNA-seq data. (J) Stream plot of RNA velocities. Line weights indicate speed and colors indicate Louvain clusters. (K) Colors indicate RNA velocity pseudotimes. (L,M) FDG representations of (J) and (K), respectively. (N-Q) Density scatterplots showing RNA velocity pseudotime (N), genes (O), UMIs (P), and TF barcode counts (Q) over diffusion pseudotime for each cell. (R) Smoothened heatmap of the top 1,000 upregulated and downregulated genes over RNA velocity pseudotime. Genes are ordered by change over pseudotime. (S) Scatterplot comparing the differentiation results of the scRNA-seq pseudotime analysis to the bulk TF screen. (T) Significance of the pseudotime difference between cells expressing each TF isoform and those expressing controls. Subset of Figure 1H . Dashed line indicates FDR < 0.05. (U-V) Heatmaps showing pairwise Pearson correlation between TF potential vectors for the KLF (U) and LIM homeodomain (V) TF families. TFs are ordered by hierarchical clustering.

sup figure 3

Figure S3 ∣ Mapping and validating differentiated cells against reference cell types, related to Figures 3 and ​ and4. 4 . (A-C) Integration of TF Atlas differentiated cells and human fetal expression atlas (Cao et al., 2020) datasets. (A,B) UMAPs of scRNA-seq data with colors indicating reference cell types (A) and source dataset (B). (C) Heatmap showing the percentage of major cell types annotated by Seurat label-transfer that were mapped by dataset integration (D,E) SingleCellNet (Tan and Cahan, 2019) annotation of TF Atlas differentiated cells. (D) Heatmap showing the scores for TF Atlas cells (columns) against reference cell types (rows). (E) Heatmap showing the percentage of major cell types annotated by Seurat label-transfer that were mapped by SingleCellNet. (F) Heatmaps showing percentage of cells from each cluster that mapped to the indicated reference cell type (top) and enrichment of Gene Ontology (GO) terms in differentially expressed genes (bottom). EMT, epithelial-mesenchymal transition; ENS, enteric nervous system. CNS, central nervous system; diff., differentiation; pos., positive; reg., regulation of; dev., development; migr., migration. (G,H) Heatmaps showing correlation between TF potential vectors and expression profiles of reference cell types that matched (G) or did not match (H) by label-transfer. The top 10 TFs with the highest correlation for each cell type are included. Numbers after TF gene names indicate the isoform. (I-K) Heatmaps showing distances between expression profiles of TF Atlas differentiated cells (columns) and reference cell types (rows). Values represent Euclidean distance in the latent space for the integrated datasets (Figure S3A,B). Cell types are ordered by hierarchical clustering. Cell types derived from the circulatory system (I), myocytes (J), and endoderm (K) are shown. (L,M) Expression of marker genes measured by quantitative PCR after 7 days of candidate TF or GFP overexpression. n = 4. (L) Heatmap showing expression relative to GFP control. (M) Expression of marker genes in H1 hESCs. (N-P) Left, expression of marker genes measured by immunostaining in H1 hESCs after 7 days of TF ORF overexpression. Right, intensity of marker gene staining normalized to GFP control from n = 6 images. Scale bar, 25 μm. Marker genes for stromal (N), lung ciliated epithelial (O), and intestinal epithelial (P) cells are shown. Values represent mean ± SEM. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns, not significant.

sup figure 4

Figure S4 ∣ Developing a targeted TF ORF screening platform for generating cell types of interest, related to Figure 5 . (A) Prediction of TFs for iNP differentiation based on correlation between TF potential vectors in the TF Atlas and reference iNP expression datasets. The top 20 TFs are listed. Numbers after TF gene names indicate the isoform. (B) FACS histograms showing EGFP expression in reporter cell lines with or without the targeted TF library. High and low bins indicate populations sorted for sequencing of TF barcodes. (C) Scatterplot showing TF enrichment in reporter cell line screens. n = 3. (D,E) Representative FACS plots showing expression of 2 (D) or 10 (E) NP marker genes labeled by pooled FISH probes. High and low bins indicate populations sorted for sequencing of TF barcodes. (F) Scatterplot showing TF enrichment in flow-FISH screens targeting 2 or 10 NP marker genes. n = 3. (G) Comparison of TF enrichment in reporter cell line and flow-FISH screens. (H-J) TF screening using scRNA-seq for 60,997 cells. (H) Violin plots showing distribution of genes, UMIs, and percent mitochondrial counts per cell. (I) UMAP of scRNA-seq data with colors indicating Louvain clusters. (J) Heatmap showing correlation between mean TF-induced expression profiles and human radial glia (RG) from reference datasets (Eze et al., 2021; Nowakowski et al., 2017; Pollen et al., 2015; Quadrato et al., 2017; Zhong et al., 2018). (K) Scatterplot showing TF enrichment in the arrayed screen. Expression was measured by quantitative PCR. N = 3. (L) Top, expression of NP markers VIM and NES measured by immunostaining after 7 days of TF overexpression. Bottom, heatmap showing correlation between bulk RNA-seq expression profiles of iNPs and human fetal cortex cell types (Pollen et al., 2015). Cell culture media is indicated in parentheses. Scale bar, 50μm. D07 and D12 indicate number of days of ORF overexpression. RG, radial glia; IPC, intermediate progenitor cell; N, neuron; IN, interneuron. (M) Schematic of spontaneous differentiation. dox, doxycycline; EGF, epidermal growth factor; FGF, fetal growth factor. (N) Expression of marker genes for neurons (MAP2), astrocytes (GFAP), and oligodendrocyte precursor cells (PDGFRA) in cells spontaneously differentiated for 1, 2, 4, or 8 weeks from iNPs produced by candidate TFs. Scale bar, 100μm. (O-R) ScRNA-seq profiling of 53,113 cells that have been spontaneously differentiated from iNPs for 8 weeks. iNPs were derived using RFX4, NFIB, ASCL1, or PAX6 with n = 2 biological replicates. Pro, uncommitted progenitors; RP, retinal progenitors; RPE, retinal pigment epithelium; PR, photoreceptors; RGC, retinal ganglion cells; DNP, dorsal neural progenitors; RG, radial glia; Astro, astrocytes; CN, CNS neurons; EPD, ependyma; EP, epithelial progenitors; BE, bronchial epithelium; CE, cranial epithelium; CNC, cranial neural crest; CNCP, cranial neural crest progenitors; P, proliferative cells. (O,P) UMAP of scRNA-seq data with colors indicating Louvain clusters (O) and replicates from each TF (P), S1 and S2. (Q) Heatmap showing the percentage of cells from each replicate that were grouped into each cluster. (R) Distribution of general cell types produced by each replicate. (S-V) ScRNA-seq profiling of iNPs differentiated using different methods. EB, embryoid body; DS, dual SMAD; NP, neural progenitors; CN, CNS neurons; CNC, cranial neural crest. Data represents n = 2 batch replicates with 15,211 RFX4-DS, 11,148 EB, and 16,421 DS. (S) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. (T,U) Box plots showing intra- (T) or inter- (U) replicate Wasserstein distances between cells. Whiskers indicate the 5 th and 95 th percentiles. (V) UMAP of scRNA-seq data highlighting genes that differentiate RFX4-DS-iNPs from EB- and DS-iNPs. Color indicates expression.

sup figure 5

Figure S5 ∣ Characterization of RFX4-iNPs and EOMES-iCMs for sequential TF screening and disease modeling, related to Figure 5 . (A-D) ScRNA-seq profiling of 26,111 cells that have been spontaneously differentiated from RFX4-DS-iNPs for 4 or 8 weeks. Data represents n = 2 biological replicates. (A,B) UMAP of scRNA-seq data. Colors indicate expression of marker genes for major cell types (A) and replicates from each time point (B), S1 and S2. (C) Heatmap showing the percentage of cells from each biological replicate that were grouped into each cluster. CN, CNS neurons; RG, radial glia; MNG, meninges; P, proliferative cells. (D) UMAP of scRNA-seq data showing marker genes for neuronal subtypes. Colors indicate gene expression. (E-M) Characterization of EOMES-iCMs. (E) Expression of cardiomyocyte markers TNNT2 and NKX2.5 by immunostaining at day 30 after 2 days of EOMES induction or GSK and Wnt inhibition (GW). Scale bar, 100μm. (F) UMAP of scRNA-seq data from 16,698 cells that have been spontaneously differentiated for 4 weeks after EOMES induction or GW. Data represents n = 2 biological replicates. Colors indicate Louvain clusters. VCM, ventricular cardiomyocytes; ACM, atrial cardiomyocytes; SMM, smooth muscle cells; SKM, skeletal muscle cells; EPTH, epithelial cells; P, proliferating cells. (G) Dot plot showing marker genes for each cluster. Circle size and color indicate percentage and expression level, respectively. (H,I) UMAP of scRNA-seq data with colors indicating marker gene expression (H) and replicates from each differentiation method (I), S1 and S2. (J) Heatmap showing the percentage of cells from each replicate that were grouped into each cluster. (K) Distribution of general cell types produced by each replicate. (L,M) Expression of marker genes for mature cardiomyocytes shown on UMAP (L) and violin plots (M). (N,O) Scatterplots showing TF enrichment in sequential TF screens using flow-FISH. N = 2 replicates, S1 and S2. Numbers after TF gene names indicate the isoform. (N) Screen of 54 TFs in RFX4-iNPs for differentiation of astrocytes. After 7 days, cells were sorted for expression of astrocyte marker genes (NCAN, AQP4, FAM107A, and GFAP). (O) Screen of 80 TFs in EOMES-iCMs for production of atrial, ventricular, or mature cardiomyocytes. After 4 weeks, cells were sorted for expression of atrial (MYBPHL, GJA5, MYH6), ventricular (MYL2, GJA1, and PLN), and mature (MYL2, TNNI3, and TNNT2) cardiomyocyte marker genes. (P-X) Modeling effects of DYRK1A perturbation in RFX4-iNPs derived from 11a iPSCs. sg, single guide RNA; KO, knockout; NT, nontargeting. (P) Percent indels in RFX4-iNPs transduced with DYRK1A KO sgRNAs. n = 3. (Q) DYRK1A expression measured using quantitative PCR probes targeting the endogenous sequence or the ORF sequence. n = 4. (R,S) Western blots of DYRK1A at 7 days after transduction with Cas9 and DYRK1A KO sgRNAs (R) or DYRK1A ORF (S). (T,U) Bulk RNA-seq of DYRK1A perturbations to identify significantly differentially expressed genes (DEGs; FDR < 0.05). n = 3. (T) Venn diagram summarizing the DEGs. (U) Heatmap showing DYRK1A dosage-dependent DEGs. (V-X) Characterization of DYRK1A perturbations by electrophysiology. (V) Representative traces for neurons with or without evoked action potentials (AP) and spontaneous excitatory postsynaptic currents (EPSCs). (W) Proportion of neurons with or without AP and EPSCs from n = 31-45 neurons per condition. (X) Intrinsic membrane properties from n = 12-36 neurons with evoked AP per condition. Values represent mean ± SEM. *P < 0.05; ND, not detected.

sup figure 7

Figure S7 ∣ Combinatorial TF screening, modeling, and prediction, related to Figure 7 . (A) Heatmap showing pairwise Pearson correlation between mean expression profiles from the combinatorial screen of 10 TF ORFs in combinations, including 44 doubles and 3 triples, as well as 10 singles. TF combinations are ordered by hierarchical clustering. (B-F) Modeling effects of TF combinations with linear regression. (B-D) Heatmaps showing the coefficient weights (B,C) and score (D) for linear regression. (E) Annotated relationships for each TF combination based on the linear regression coefficients. (F) Heatmaps showing average expression profile of double TFs with those of respective single TFs for example combinations with annotated relationships. (G-M) Predicting TF combinations using the TF Atlas. (G,H) Percent accuracy for different approaches to predict TFs for measured double (G) or triple (H) TF expression profiles. For comparison to the combinatorial TF screen dataset, profiles of the 10 corresponding TFs from the TF Atlas were used. (I-M) Cell type prediction results for triple TF profiles. Predicted combinations for hepatoblasts (I), bronchiolar and alveolar epithelial cells (J), metanephric cells (K), vascular endothelial cells (L), and trophoblast giant cells (M) are shown. As gene signature scores were discrete, the percentile ranks were reported as ranges. TFs that are part of known combinations, developmentally critical, or specifically expressed in the target cell types are indicated in blue.

sup figure 6

Figure S6 ∣ Joint profiling of chromatin accessibility and gene expression on a subset of TF ORFs, related to Figure 6 . (A,B) Violin plots showing distribution of UMIs and genes for scRNA-seq (A), as well as UMIs and fraction of reads in the top 500,000 peaks for scATAC-seq (B) from the joint profiling dataset. Data included 69,085 cells overexpressing 198 TF ORFs for 4 or 7 days. (C) Representative fragment histogram for scATAC-seq data using the first two megabases of chromosome 1. (D) Transcriptional start site (TSS) enrichment score for scATAC-seq data. (E) RNA (left) and ATAC (right) UMAP of joint profiling data. Colors indicate small local moving (SLM) clusters. (F) Distribution of cells from each time point across clusters from Figure 6A . Asterisks indicate clusters with >75% cells from either time point. (G) Weighted nearest neighbor (WNN) UMAP of joint profiling data from Figure 6A , colored by diffusion pseudotime. (H) Violin plots comparing diffusion pseudotimes of each time point. (I) Heatmaps showing gene regulatory networks (GRN) for downstream TFs nominated by evaluating motif enrichment in ATAC peaks with significant peak-gene associations. Only the 10 most significantly enriched TF ORFs are shown for each cluster. TF ORFs that were identified as top ORFs and downstream TFs are labeled in blue.

28

Acknowledgements

We thank E. Habibi, T. Kamath, E.Z. Macosko, and F. Chen for scRNA-seq analysis advice; K. Waddell, L. Amir-Zilberstein, L. Nguyen, M.S. Cuoco, and D. Dionne for scRNA-seq library generation; O. Rozenblatt-Rosen for help with large-scale scRNA-seq design; J.G. Doench and D.E. Root for discussions on MORF library design; J.M. Engreitz and C.P. Fulco for flow-FISH advice; M.C. Marchetto and F.H. Gage for EB protocols; R. Belliveau for overall research support; A. Tang for illustrations; and the entire Zhang laboratory for support and advice. J.J. is supported by fellowships from the McGovern Institute and NIH F31-MH117886. F.Z. is supported by NIH grants R01-HG009761 and DP1-HL141201, the Howard Hughes Medical Institute (HHMI), the Poitras Center for Psychiatric Disorders Research at MIT, the Hock E. Tan and K. Lisa Yang Center for Autism Research at MIT, the Yang-Tan Center for Molecular Therapeutics at MIT, and the Phillips family and J. and P. Poitras. A.R. was an HHMI Investigator while conducting this work and work was supported by the Klarman Cell Observatory (KCO), NHGRI Center of Excellence for Genome Science, and HHMI. J.D.B. is supported by an NIH grant DP2-HL151353, Gene Regulation Observatory, and Harvard Stem Cell Institute (BA-0002-21-00). Z.F. is supported by the Stanley Center for Psychiatric Research and an NIH grant R01-MH127085. The MORF library synthesis was funded by the Stanley Center for Psychiatric Research. TF Atlas scRNA-seq was funded by the KCO. Reagents are available through Addgene; support forums and computational tools are available via the Zhang lab website (https://zlab.bio/).

Inclusion and diversity

One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in their field of research or within their geographical location. One or more of the authors of this paper self-identifies as a gender minority in their field of research.

Footnotes

Declaration of Interests

J.J. and F.Z. are inventors listed on an International PCT Application related to this work. F.Z. is a scientific advisor and cofounder of Editas Medicine, Beam Therapeutics, Pairwise Plants, Arbor Biotechnologies, and Proof Diagnostics. F.Z. is a scientific advisor for Octant. A.R. is a co-founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas, and was an SAB member of ThermoFisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics and Asimov until 31 July 2020. Since 1 August 2020, A.R. has been an employee of Genentech and has equity in Roche. A.R. is an inventor on patents and patent applications filed at the Broad related to single cell genomics. J.D.B. holds patents related to ATAC-seq and scATAC-seq and serves on the scientific advisory boards of CAMP4 Therapeutics, seqWell, and CelSee. J.S.G. and O.O.A. are cofounders of Sherlock Biosciences, Proof Diagnostics, Moment Biosciences, and Tome Biosciences. Since 16 November 2020, K.R.G. has been an employee of Genentech.

Data S17 ∣ Profiling of candidate TFs by ChIP-seq. (A) Top 3 de novo or known motifs identified using HOMER motif analysis. The names of the TFs with the closest matching motifs, indicating potential cofactors of candidate TFs, and the associated P-values of enrichment are listed. (B) Heatmap showing percentage of NP-specific TFs or genes that had candidate TF ChIP-seq peaks within 10 kb of the annotated transcriptional start site (TSS). (C) Overlap of NP-specific genes that had candidate TF ChIP-seq peaks within 10 kb of the TSS and were differentially expressed (FDR < 0.05) upon candidate TF overexpression. Genes that were shared between candidate TFs are shown, with blue regions indicating overlap.

Data S20 ∣ Combinatorial TF screening and prediction of TF combinations. (A) UMAP of scRNA-seq profiles from the combinatorial screen of 10 TF ORFs in combinations, including 44 doubles and 3 triples, as well as 10 singles. Colors indicate Louvain clusters. (B) Heatmap showing percentage of cells with the indicated TF combination for each cluster. Percentages are normalized to the total number of cells with the TF ORF in the combinatorial dataset. (C-F) Percent accuracy for different approaches to predict TFs for double (C,D) or triple (E,F) TF combinations using single TF profiles from the TF Atlas. To reduce the number of possible combinations, TFs were grouped into 30 (C,E) or 51 (D,F) clusters based on expression profile similarity.

References